3. Inference¶

Read in the packages, data and exclude an unused variable.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import statsmodels.api as sm
import statsmodels.formula.api as smf

3.2 Testing examples¶

Read in the Galapagos data

gala = pd.read_csv("data/gala.csv", index_col=0)
gala.drop('Endemics', axis=1, inplace=True)
gala.head()

Fit the model:

lmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz  + Adjacent', data=gala).fit()
lmod.summary()

Total sum of squares and residual sum of squares

lmod.centered_tss, lmod.ssr

(381081.36666666664, 89231.36633005117)

Degrees of freedom for the numerator and denominator of the F-test

lmod.df_model, lmod.df_resid

(5.0, 24.0)

Mean square of numerator and denominator

lmod.mse_model, lmod.mse_resid

(58370.0000673231, 3717.9735970854654)

F-statistic reproduced

lmod.mse_model/ lmod.mse_resid

15.699412204831035

Compute p-value

1-sp.stats.f.cdf(lmod.fvalue, lmod.df_model, lmod.df_resid)

6.837892995159578e-07

F-test for comparing two models. (Turn off the warnings about NaNs in the output - we don't care).

%%capture --no-display
lmods = smf.ols(formula='Species ~ Elevation + Nearest + Scruz  + Adjacent', data=gala).fit()
sm.stats.anova_lm(lmods,lmod)

%%capture --no-display
lmods = smf.ols(formula='Species ~ Elevation + Nearest + Scruz', data=gala).fit()
sm.stats.anova_lm(lmods,lmod)

Use I() notation to test replacing by the sum of two variables

%%capture --no-display
lmods = smf.ols(formula='Species ~ I(Area+Adjacent) + Elevation + Nearest + Scruz', data=gala).fit()
sm.stats.anova_lm(lmods,lmod)

Need to use glm function to get offset functionality

lmod = smf.glm(formula='Species ~ Area + Elevation + Nearest + Scruz  + Adjacent', data=gala).fit()
lmods = smf.glm(formula='Species ~ Area + Nearest + Scruz + Adjacent', offset=(0.5*gala['Elevation']), data=gala).fit()
fstat = (lmods.deviance-lmod.deviance)/(lmod.deviance/lmod.df_resid)
fstat

11.318196837955563

1-sp.stats.f.cdf(fstat, 1, lmod.df_resid)

0.002573836486092773

Compute t-statistic and corresponding p-value

lmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz  + Adjacent', data=gala).fit()
tstat=(lmod.params['Elevation']-0.5)/lmod.bse['Elevation']
tstat, 2*sp.stats.t.cdf(tstat, lmod.df_resid)

(-3.3642527904358723, 0.0025738364860927276)

tstat**2

11.318196837955554

3.3 Permutation tests¶

Permutation of the response

lmod = smf.ols(formula='Species ~ Nearest + Scruz', data=gala).fit()
np.random.permutation(gala.Species)

array([  2,   2,  25,   5,  12,  93,  31,  21,  10, 444,  58,  18,  51,
       285,  70,  24, 104,  62, 347,  16, 237, 108,  97,   3, 280,  44,
        40,   2,  58,   8])

Compute the f-statistic for a sample of permutations

fstats = np.zeros(4000)
for i in range(0,4000):
    gala['ysamp'] = np.random.permutation(gala.Species)
    lmodi = smf.ols(formula='ysamp ~ Nearest + Scruz', data=gala).fit()
    fstats[i] = lmodi.fvalue

Proportion of permuted f-statistics that exceed the observed value

np.mean(fstats > lmod.fvalue)

0.5575

Which is close to the observed p-value

lmod.f_pvalue

0.5549254563908441

Do t-test with permutation

tstats = np.zeros(4000)
for i in range(0, 4000):
    gala['ssamp'] = np.random.permutation(gala.Scruz)
    lmodi = smf.ols(formula='Species ~ Nearest + ssamp', data=gala).fit()
    tstats[i] = lmodi.tvalues[2]

Proportion of permuted t-statistcs which exceed the observed t-statistic in absolute value

np.mean(np.fabs(tstats) > np.fabs(lmod.tvalues[2]))

0.25475

Very close to observed p-value:

lmod.pvalues[2]

0.2833295186486556

3.5 Confidence intervals for beta¶

lmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz  + Adjacent', data=gala).fit()
qt = np.array(sp.stats.t.interval(0.95,24))
lmod.params[1] + lmod.bse[1]*qt

array([-0.12720729,  0.09405181])

lmod.conf_int()

Drawing the confidence ellipses appears to require some effort in coding

3.6 Bootstrapped Confidence Intervals¶

Bootstrap the residuals

breps = 4000
coefmat = np.empty((breps,6))
resids = lmod.resid
preds = lmod.predict()
for i in range(0,breps):
    gala['ysamp'] = preds + np.random.choice(resids,30)
    lmodi = smf.ols(formula='ysamp ~ Area + Elevation + Nearest + Scruz  + Adjacent', data=gala).fit()
    coefmat[i,:] = lmodi.params

Turn into pandas dataframe and add columns

coefmat = pd.DataFrame(coefmat, columns=("intercept", "area","elevation","nearest","Scruz","adjacent"))

Compute the quantiles by column for bootstrap confidence intervals

coefmat.quantile((0.025,0.975))

Plot the kernel density estimate

coefmat.area.plot.density()
xci = coefmat.area.quantile((0.025,0.975)).ravel()
plt.axvline(x=xci[0], linestyle='--')
plt.axvline(x=xci[1], linestyle='--')            
plt.show()

%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels

	Species	Area	Elevation	Nearest	Scruz	Adjacent
Baltra	58	25.09	346	0.6	0.6	1.84
Bartolome	31	1.24	109	0.6	26.3	572.33
Caldwell	3	0.21	114	2.8	58.7	0.78
Champion	25	0.10	46	1.9	47.4	0.18
Coamano	2	0.05	77	1.9	1.9	903.82

Dep. Variable:	Species	R-squared:	0.766
Model:	OLS	Adj. R-squared:	0.717
Method:	Least Squares	F-statistic:	15.70
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	6.84e-07
Time:	11:14:01	Log-Likelihood:	-162.54
No. Observations:	30	AIC:	337.1
Df Residuals:	24	BIC:	345.5
Df Model:	5
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	7.0682	19.154	0.369	0.715	-32.464	46.601
Area	-0.0239	0.022	-1.068	0.296	-0.070	0.022
Elevation	0.3195	0.054	5.953	0.000	0.209	0.430
Nearest	0.0091	1.054	0.009	0.993	-2.166	2.185
Scruz	-0.2405	0.215	-1.117	0.275	-0.685	0.204
Adjacent	-0.0748	0.018	-4.226	0.000	-0.111	-0.038

Omnibus:	12.683	Durbin-Watson:	2.476
Prob(Omnibus):	0.002	Jarque-Bera (JB):	13.498
Skew:	1.136	Prob(JB):	0.00117
Kurtosis:	5.374	Cond. No.	1.90e+03

	df_resid	ssr	df_diff	ss_diff	F	Pr(>F)
0	25.0	93469.08399	0.0	NaN	NaN	NaN
1	24.0	89231.36633	1.0	4237.71766	1.139792	0.296318

	df_resid	ssr	df_diff	ss_diff	F	Pr(>F)
0	26.0	158291.628568	0.0	NaN	NaN	NaN
1	24.0	89231.366330	2.0	69060.262238	9.287352	0.00103

	df_resid	ssr	df_diff	ss_diff	F	Pr(>F)
0	25.0	109591.120801	0.0	NaN	NaN	NaN
1	24.0	89231.366330	1.0	20359.754471	5.476035	0.027926

	0	1
Intercept	3.279124	176.591809
Area	-0.127207	0.094052
Elevation	-0.193113	0.260883
Nearest	-3.878662	3.157125
Scruz	-1.022820	0.851209
Adjacent	-0.089997	0.056935

	intercept	area	elevation	nearest	Scruz	adjacent
0.025	19.728984	-0.111007	-0.146154	-3.038894	-0.733632	-0.074728
0.975	167.173424	0.083704	0.242310	3.003985	0.912724	0.055717

Software	Version
Python	3.7.0 64bit [Clang 4.0.1 (tags/RELEASE_401/final)]
IPython	6.5.0
OS	Darwin 17.7.0 x86_64 i386 64bit
pandas	0.23.4
numpy	1.15.1
matplotlib	2.2.3
seaborn	0.9.0
scipy	1.1.0
patsy	0.5.0
statsmodels	0.9.0
Wed Sep 26 11:14:52 2018 BST