Load in the packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import statsmodels.api as sm
import statsmodels.formula.api as smf
Read in the Galapagos data and check the first few lines:
gala = pd.read_csv("data/gala.csv", index_col=0)
gala.head()
Drop the endemics variable permanently from the data frame:
gala.drop('Endemics', axis=1, inplace=True)
gala.head()
Fit a linear model
lmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz + Adjacent', data=gala).fit()
lmod.summary()
The warnings can be ignored for now. The first is always assumed. The second can be a problem.
Compute LS estimates using the formula:
X = gala.iloc[:,1:]
X.insert(0,'intercept',1)
XtXi = np.linalg.inv(X.T @ X)
np.dot(XtXi @ X.T, gala['Species'])
A somewhat more efficient way to do the calculation
np.linalg.solve(X.T @ X, np.dot(X.T,gala['Species']))
Computation using the QR decomposition
q, r = np.linalg.qr(X)
f = np.dot(np.transpose(q), gala['Species'])
f
This function from scipy uses the fact that r is upper triangular. The np.linalg.solve does not.
sp.linalg.solve_triangular(r, f)
Add a predictor which is a linear combination of other predictors. In contrast to R, the parameter is estimated. The second warning indicates that the design matrix is singular (as indeed it is).
gala['Adiff'] = gala['Area'] - gala['Adjacent']
lmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiff', data=gala).fit()
lmod.summary()
Orthogonality example
odor = pd.read_csv("data/odor.csv")
odor.head()
Covariance of the predictors is diagonal
odor.iloc[:,1:].cov()
LS estimates with all three predictors
lmod = smf.ols(formula='odor ~ temp + gas + pack', data=odor).fit()
lmod.params
Covariance of the parameter estimates
np.round(lmod.cov_params(),2)
see that estimates do not change when a predictor is dropped from the model
lmod = smf.ols(formula='odor ~ gas + pack', data=odor).fit()
lmod.params
%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels