Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
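Note that y includes random noise, so the numbers below will differ from run to run. A minimal sketch, if reproducible output is wanted: seed NumPy's random number generator before the call to np.random.normal above (the seed value is arbitrary).

np.random.seed(12345)  # arbitrary seed; place this before the noise draw above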

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.976
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     612.7
Date:                Mon, 07 Dec 2020   Prob (F-statistic):           4.47e-37
Time:                        08:38:02   Log-Likelihood:                -8.8880
No. Observations:                  50   AIC:                             25.78
Df Residuals:                      46   BIC:                             33.42
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9228      0.103     47.928      0.000       4.716       5.130
x1             0.5097      0.016     32.175      0.000       0.478       0.542
x2             0.5268      0.062      8.459      0.000       0.401       0.652
x3            -0.0216      0.001    -15.505      0.000      -0.024      -0.019
==============================================================================
Omnibus:                        0.232   Durbin-Watson:                   2.629
Prob(Omnibus):                  0.891   Jarque-Bera (JB):                0.408
Skew:                          -0.119   Prob(JB):                        0.816
Kurtosis:                       2.626   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
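The quantities in the summary table are also available programmatically on the results object, so there is no need to parse the printed output. A minimal sketch using standard results attributes:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors of the coefficients
print(olsres.conf_int())  # 95% confidence intervals (alpha=0.05 by default)
print(olsres.rsquared)    # R-squared of the fit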

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.38371873  4.88526271  5.34526944  5.73503073  6.03619906  6.24380195
  6.36705898  6.42786701  6.45720262  6.49003269  6.55956949  6.69181399
  6.90128396  7.18862873  7.54052256  7.93185395  8.32985145  8.69946788
  9.00913848  9.23596696  9.36948722  9.41338296  9.38488244  9.31192793
  9.22858484  9.16944493  9.16394293  9.23152053  9.37843214  9.59671732
  9.86550868 10.15446029 10.42873345 10.65472061 10.80556696 10.86558261
 10.83282043 10.71939591 10.54949662 10.35540832 10.17221043 10.03201261
  9.95867924  9.96390858 10.04531074 10.18680035 10.36123938 10.53489577
 10.67298557 10.74538781]
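Calling predict without arguments evaluates the model on the original design matrix, so it is equivalent to the call above and to the stored fitted values. A quick check, assuming the fit from above:

# In-sample predictions equal the fitted values stored on the results object
assert np.allclose(olsres.predict(), olsres.fittedvalues)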

Create a new sample of explanatory variables Xnew, predict, and plot.

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.71555946 10.5463876  10.25852992  9.89906269  9.52995483  9.21289577
  8.99419173  8.89342834  8.89867537  8.96940771]
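For out-of-sample points it is often useful to have interval estimates as well as point predictions. In recent statsmodels versions this is available via get_prediction; a minimal sketch:

# get_prediction returns point predictions together with standard errors
# and confidence/prediction intervals for the new observations.
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_ci_*, obs_ci_* columns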

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula expansion magic applied to **2.
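To see what that expansion magic would do, note that in a formula ** operates on terms rather than numbers: (a + b)**2 expands to the main effects plus their interaction. A small sketch using patsy directly (the names a and b are just placeholders here):

from patsy import dmatrix

# Without I(), ** expands terms into main effects and interactions:
print(dmatrix("(a + b)**2", {"a": x1, "b": np.sin(x1)}).design_info.column_names)
# ['Intercept', 'a', 'b', 'a:b'] -- not the numeric square of anything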

In [8]:
res.params
Out[8]:
Intercept           4.922828
x1                  0.509682
np.sin(x1)          0.526764
I((x1 - 5) ** 2)   -0.021564
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are computed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.715559
1    10.546388
2    10.258530
3     9.899063
4     9.529955
5     9.212896
6     8.994192
7     8.893428
8     8.898675
9     8.969408
dtype: float64
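As a sanity check, these formula-based predictions agree with the array-based out-of-sample predictions computed earlier:

# The formula interface fits the same model, so the predictions must match
assert np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)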