Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
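
The cell above draws noise from NumPy's legacy global random state, so its numbers change on every run. A minimal self-contained variant that seeds a modern Generator for reproducibility (the seed value 12345 is an arbitrary choice, not part of the original notebook):

```python
import numpy as np

# Same design matrix as above, but with a seeded Generator so the
# noise draw is reproducible (seed 12345 is arbitrary).
rng = np.random.default_rng(12345)

nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
# Constant column added by hand instead of sm.add_constant
X = np.column_stack((np.ones(nsample), x1, np.sin(x1), (x1 - 5) ** 2))
beta = np.array([5.0, 0.5, 0.5, -0.02])
y_true = X @ beta
y = y_true + sig * rng.standard_normal(nsample)
```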

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.980
Model:                            OLS   Adj. R-squared:                  0.978
Method:                 Least Squares   F-statistic:                     737.0
Date:                Wed, 11 Jan 2023   Prob (F-statistic):           7.05e-39
Time:                        23:51:04   Log-Likelihood:                -3.5084
No. Observations:                  50   AIC:                             15.02
Df Residuals:                      46   BIC:                             22.66
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1451      0.092     55.782      0.000       4.959       5.331
x1             0.4809      0.014     33.804      0.000       0.452       0.509
x2             0.4444      0.056      7.947      0.000       0.332       0.557
x3            -0.0187      0.001    -14.941      0.000      -0.021      -0.016
==============================================================================
Omnibus:                        1.839   Durbin-Watson:                   2.390
Prob(Omnibus):                  0.399   Jarque-Bera (JB):                0.995
Skew:                          -0.238   Prob(JB):                        0.608
Kurtosis:                       3.501   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.67853681  5.12426498  5.5347932   5.88590093  6.16210869  6.35922129
  6.48501713  6.55797017  6.60421478  6.65325186  6.73310199  6.8657017
  7.06329935  7.32644279  7.64388932  7.99445284  8.3504847   8.68241624
  8.96361687  9.17476941  9.3070438   9.36354793  9.35881694  9.3164252
  9.26511308  9.23406499  9.24811447  9.3236641   9.46599062  9.66837796
  9.91322067 10.17491647 10.4240726  10.63233499 10.77704674 10.8449706
 10.83446378 10.75574792 10.62922961 10.48214775 10.34409817 10.24217123
 10.19650083 10.21695658 10.30152218 10.43662671 10.59937431 10.76130587
 10.89307482 10.96926868]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.96239986 10.83773582 10.61270506 10.32702496 10.03297753  9.782609
  9.61498706  9.54663574  9.56748973  9.64335875]

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f8326ca7710>
[Figure: scatter of the data, the true curve, and the OLS prediction over both the sample range and the new x values]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2.

[9]:
res.params
[9]:
Intercept           5.145058
x1                  0.480859
np.sin(x1)          0.444421
I((x1 - 5) ** 2)   -0.018661
dtype: float64

Now we only have to pass the single variable x1, and we get the transformed right-hand-side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.962400
1    10.837736
2    10.612705
3    10.327025
4    10.032978
5     9.782609
6     9.614987
7     9.546636
8     9.567490
9     9.643359
dtype: float64