Chapter 2 - Estimation¶

Load in the packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import statsmodels.api as sm
import statsmodels.formula.api as smf

2.6 Example¶

Read in the Galapagos data and check the first few lines:

gala = pd.read_csv("data/gala.csv", index_col=0)
gala.head()

Drop the endemics variable permanently from the data frame:

gala.drop('Endemics', axis=1, inplace=True)
gala.head()

Fit a linear model

lmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz  + Adjacent', data=gala).fit()
lmod.summary()

The warnings can be ignored for now. The first is always assumed. The second can be a problem.

Compute LS estimates using the formula:

X = gala.iloc[:,1:]
X.insert(0,'intercept',1)
XtXi = np.linalg.inv(X.T @ X)
np.dot(XtXi @ X.T, gala['Species'])

array([ 7.06822071, -0.02393834,  0.31946476,  0.00914396, -0.24052423,
       -0.07480483])

A somewhat more efficient way to do the calculation

np.linalg.solve(X.T @ X, np.dot(X.T,gala['Species']))

array([ 7.06822071, -0.02393834,  0.31946476,  0.00914396, -0.24052423,
       -0.07480483])

2.9 QR decomposition¶

Computation using the QR decomposition

q, r = np.linalg.qr(X)
f = np.dot(np.transpose(q), gala['Species'])
f

array([-466.84219318,  381.40557435,  256.25047255,    5.40764552,
       -119.49834019,  257.69436853])

This function from scipy uses the fact that r is upper triangular. The np.linalg.solve does not.

sp.linalg.solve_triangular(r, f)

array([ 7.06822071, -0.02393834,  0.31946476,  0.00914396, -0.24052423,
       -0.07480483])

2.10 Identifiability¶

Add a predictor which is a linear combination of other predictors. In contrast to R, the parameter is estimated. The second warning indicates that the design matrix is singular (as indeed it is).

gala['Adiff'] = gala['Area'] - gala['Adjacent']
lmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz  + Adjacent + Adiff', data=gala).fit()
lmod.summary()

2.11 Orthogonality¶

Orthogonality example

odor = pd.read_csv("data/odor.csv")
odor.head()

Covariance of the predictors is diagonal

odor.iloc[:,1:].cov()

LS estimates with all three predictors

lmod = smf.ols(formula='odor ~ temp + gas + pack', data=odor).fit()
lmod.params

Intercept    15.200
temp        -12.125
gas         -17.000
pack        -21.375
dtype: float64

Covariance of the parameter estimates

np.round(lmod.cov_params(),2)

see that estimates do not change when a predictor is dropped from the model

lmod = smf.ols(formula='odor ~ gas + pack', data=odor).fit()
lmod.params

Intercept    15.200
gas         -17.000
pack        -21.375
dtype: float64

%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels

	Species	Endemics	Area	Elevation	Nearest	Scruz	Adjacent
Baltra	58	23	25.09	346	0.6	0.6	1.84
Bartolome	31	21	1.24	109	0.6	26.3	572.33
Caldwell	3	3	0.21	114	2.8	58.7	0.78
Champion	25	9	0.10	46	1.9	47.4	0.18
Coamano	2	1	0.05	77	1.9	1.9	903.82

	Species	Area	Elevation	Nearest	Scruz	Adjacent
Baltra	58	25.09	346	0.6	0.6	1.84
Bartolome	31	1.24	109	0.6	26.3	572.33
Caldwell	3	0.21	114	2.8	58.7	0.78
Champion	25	0.10	46	1.9	47.4	0.18
Coamano	2	0.05	77	1.9	1.9	903.82

Dep. Variable:	Species	R-squared:	0.766
Model:	OLS	Adj. R-squared:	0.717
Method:	Least Squares	F-statistic:	15.70
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	6.84e-07
Time:	11:13:58	Log-Likelihood:	-162.54
No. Observations:	30	AIC:	337.1
Df Residuals:	24	BIC:	345.5
Df Model:	5
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	7.0682	19.154	0.369	0.715	-32.464	46.601
Area	-0.0239	0.022	-1.068	0.296	-0.070	0.022
Elevation	0.3195	0.054	5.953	0.000	0.209	0.430
Nearest	0.0091	1.054	0.009	0.993	-2.166	2.185
Scruz	-0.2405	0.215	-1.117	0.275	-0.685	0.204
Adjacent	-0.0748	0.018	-4.226	0.000	-0.111	-0.038

Omnibus:	12.683	Durbin-Watson:	2.476
Prob(Omnibus):	0.002	Jarque-Bera (JB):	13.498
Skew:	1.136	Prob(JB):	0.00117
Kurtosis:	5.374	Cond. No.	1.90e+03

	temp	gas	pack
temp	0.571429	0.000000	0.000000
gas	0.000000	0.571429	0.000000
pack	0.000000	0.000000	0.571429

	Intercept	temp	gas	pack
Intercept	86.46	-0.0	-0.0	-0.0
temp	-0.00	162.1	-0.0	0.0
gas	-0.00	-0.0	162.1	0.0
pack	-0.00	0.0	0.0	162.1

Software	Version
Python	3.7.0 64bit [Clang 4.0.1 (tags/RELEASE_401/final)]
IPython	6.5.0
OS	Darwin 17.7.0 x86_64 i386 64bit
pandas	0.23.4
numpy	1.15.1
matplotlib	2.2.3
seaborn	0.9.0
scipy	1.1.0
patsy	0.5.0
statsmodels	0.9.0
Wed Sep 26 11:13:58 2018 BST