Chapter 6: Diagnostics¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
%matplotlib inline
%precision 5

'%.5f'

Check Error Assumptions¶

Read in the data for the example:

savings = pd.read_csv("data/savings.csv",index_col=0)
savings.head()

lmod = smf.ols(formula='sr ~ pop15 + pop75 + dpi + ddpi', data=savings).fit()
lmod.summary()

The simple residual-fitted plot:

plt.scatter(lmod.fittedvalues, lmod.resid)
plt.ylabel("Residuals")
plt.xlabel("Fitted values")
plt.axhline(0)

<matplotlib.lines.Line2D at 0x1c1d48ffd0>

This plot is slightly better for checking for constant variance:

plt.scatter(lmod.fittedvalues, np.sqrt(abs(lmod.resid)))
plt.ylabel("sqrt |Residuals|")
plt.xlabel("Fitted values")

Text(0.5,0,'Fitted values')

Check that there is no significant relationship:

ddf = pd.DataFrame({'x':lmod.fittedvalues,'y':np.sqrt(abs(lmod.resid))})
dmod = smf.ols('y ~ x',data=ddf).fit()
dmod.summary()

Simulate some residual-fitted plots:

n = 50
x = np.random.sample(n)
y = np.random.normal(size=n)
plt.scatter(x,y)
plt.title("No problem plot")
plt.axhline(0)

<matplotlib.lines.Line2D at 0x1c1d661a20>

plt.scatter(x,y*x)
plt.title("Increasing variance")
plt.axhline(0)

<matplotlib.lines.Line2D at 0x1c1d784748>

plt.scatter(x,y*np.sqrt(x))
plt.title("Mildly increasing variance")
plt.axhline(0)

<matplotlib.lines.Line2D at 0x1c1d7dde10>

y = np.cos(2*x*np.pi) + np.random.normal(size=n)
plt.scatter(x,y)
plt.title("Lack of fit plot")
sx = np.sort(x)
plt.plot(sx,np.cos(2*sx*np.pi),c='red')
plt.axhline(0)

<matplotlib.lines.Line2D at 0x1c1d907320>

Plots of residuals against predictors:

plt.scatter(savings.pop15, lmod.resid)
plt.xlabel("%pop under 15")
plt.ylabel("Residuals")
plt.axhline(0)

<matplotlib.lines.Line2D at 0x1c1d948ef0>

plt.scatter(savings.pop75, lmod.resid)
plt.xlabel("%pop over 75")
plt.ylabel("Residuals")
plt.axhline(0)

<matplotlib.lines.Line2D at 0x1c1d9f5e48>

Check for equality of the variances in the two groups seen in pop15

numres = lmod.resid[savings.pop15 > 35]
denres = lmod.resid[savings.pop15 < 35]
fstat = np.var(numres,ddof=1)/np.var(denres,ddof=1)
2*(1-sp.stats.f.cdf(fstat,len(numres)-1,len(denres)-1))

0.013575950424159933

Do the residual fitted plot for the Galapagos data:

gala = pd.read_csv("data/gala.csv")
gmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz  + Adjacent', data=gala).fit()
plt.scatter(gmod.fittedvalues, gmod.resid)
plt.ylabel("Residuals")
plt.xlabel("Fitted values")
plt.axhline(0)

<matplotlib.lines.Line2D at 0x1c1dae9f98>

See how transformation improves the plot:

gmod = smf.ols(formula='np.sqrt(Species) ~ Area + Elevation + Nearest + Scruz  + Adjacent', data=gala).fit()
plt.scatter(gmod.fittedvalues, gmod.resid)
plt.ylabel("Residuals")
plt.xlabel("Fitted values")
plt.axhline(0)

<matplotlib.lines.Line2D at 0x1c1da02908>

Check the QQ plot of the residuals. Plot is assigned to foo so it doesn't print twice.

foo = sm.qqplot(lmod.resid, line="q")

/anaconda/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

Histogram is not a good way to check:

plt.hist(lmod.resid)
plt.xlabel("Residuals")

Text(0.5,0,'Residuals')

Generate some random QQ plots from known distributions:

foo = sm.qqplot(np.random.normal(size=50),line="q")

/anaconda/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

foo = sm.qqplot(np.exp(np.random.normal(size=50)),line="q")

/anaconda/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

foo = sm.qqplot(np.random.standard_t(1,size=50),line="q")

/anaconda/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

foo = sm.qqplot(np.random.sample(size=50),line="q")

/anaconda/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

Do the Shapiro test - second value is the p-value

sp.stats.shapiro(lmod.resid)

(0.98698, 0.85240)

Correlated Errors¶

globwarm = pd.read_csv("data/globwarm.csv")
globwarm.head()

lmod = smf.ols(formula='nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask + urals + mongolia + tasman',
               data=globwarm).fit()
lmod.summary()

plt.scatter(globwarm.year[lmod.resid.keys()], lmod.resid)
plt.axhline(0)

<matplotlib.lines.Line2D at 0x1c1e1dd128>

plt.scatter(lmod.resid.iloc[:-1],lmod.resid.iloc[1:])
plt.axhline(0,alpha=0.5)
plt.axvline(0,alpha=0.5)

<matplotlib.lines.Line2D at 0x109907860>

This is the statistic. P-values do not seem to be available. 2 is considered null but less than 1 is a sign of a problem.

sm.stats.stattools.durbin_watson(lmod.resid)

0.8166063726402724

Finding Unusual Observations¶

lmod = smf.ols(formula='sr ~ pop15 + pop75 + dpi + ddpi', data=savings).fit()
diagv = lmod.get_influence()
hatv = pd.Series(diagv.hat_matrix_diag, savings.index)
hatv.head()

Australia    0.067713
Austria      0.120384
Belgium      0.087482
Bolivia      0.089471
Brazil       0.069559
dtype: float64

Make the half-normal plot:

n=50
ix = np.arange(1,n+1)
halfq = sp.stats.norm.ppf((n+ix)/(2*n+1)),
plt.scatter(halfq, np.sort(hatv))

<matplotlib.collections.PathCollection at 0x10a6d70f0>

Should tag these on the plot

hatv.sort_values().iloc[-5:]

South Rhodesia    0.160809
Ireland           0.212236
Japan             0.223310
United States     0.333688
Libya             0.531457
dtype: float64

One of the built-in plots. Not sure about residuals^2 as a diagnostic.

foo = sm.graphics.plot_leverage_resid2(lmod)

Would prefer R style plot here.

foo = sm.graphics.influence_plot(lmod)

Make a QQ plot of the Pearson residuals. These are not the same as the standardised residuals in R.

foo = sm.qqplot(lmod.resid_pearson)

Make the example plots with outliers. (Seems like this could be done more neatly)

np.random.seed(123)
testdata = pd.DataFrame({'x' : np.arange(1,11), 'y' : np.arange(1,11) + np.random.normal(size=10)})
p1 = pd.DataFrame({'x': [5.5], 'y':[12]})
alldata = testdata.append(p1,ignore_index=True)

marksize = np.ones(11)
marksize[10] = 3
plt.scatter(alldata.x, alldata.y, s= marksize*5)
slope, intercept = np.polyfit(testdata.x, testdata.y,1)
plt.plot(testdata.x, intercept + slope * testdata.x)
slope, intercept = np.polyfit(alldata.x, alldata.y,1)
plt.plot(alldata.x, intercept + slope * alldata.x, color='red')

[<matplotlib.lines.Line2D at 0x1c1eefa160>]

p1 = pd.DataFrame({'x': [15], 'y':[15.1]})
alldata = testdata.append(p1,ignore_index=True)
plt.scatter(alldata.x, alldata.y, s= marksize*5)
slope, intercept = np.polyfit(testdata.x, testdata.y,1)
plt.plot(testdata.x, intercept + slope * testdata.x)
slope, intercept = np.polyfit(alldata.x, alldata.y,1)
plt.plot(alldata.x, intercept + slope * alldata.x, color='red')

[<matplotlib.lines.Line2D at 0x1c1ed47358>]

p1 = pd.DataFrame({'x': [15], 'y':[5.1]})
alldata = testdata.append(p1,ignore_index=True)
plt.scatter(alldata.x, alldata.y, s= marksize*5)
slope, intercept = np.polyfit(testdata.x, testdata.y,1)
plt.plot(testdata.x, intercept + slope * testdata.x)
slope, intercept = np.polyfit(alldata.x, alldata.y,1)
plt.plot(alldata.x, intercept + slope * alldata.x, color='red')

[<matplotlib.lines.Line2D at 0x1c1e1aa470>]

Create the studentized residuals

stud = pd.Series(diagv.resid_studentized_external, savings.index)
(pd.Series.idxmax(abs(stud)), np.max(abs(stud)))

('Zambia', 2.8535583382284377)

Find the Bonferroni-adjusted critical value:

sp.stats.t.ppf(0.05/(2*50),44)

-3.525801306486005

Star example that has multiple outliers:

star = pd.read_csv("data/star.csv")
star.head()

Put the line on the plot:

lmod = smf.ols(formula='light ~ temp',data=star).fit()
plt.scatter(star.temp, star.light)
plt.plot([3.4, 4.7], [lmod.params[0] + lmod.params[1]*3.4, lmod.params[0] + lmod.params[1]*4.7])

[<matplotlib.lines.Line2D at 0x1c1ef590f0>]

Look for outliers:

stud = lmod.get_influence().resid_studentized_external
np.min(stud), np.max(stud)

(-2.049392729378057, 1.905847201766315)

Fit the line without the four points:

lmodr = smf.ols(formula='light ~ temp',data=star[star.temp > 3.6]).fit()
plt.scatter(star.temp, star.light)
plt.plot([3.4, 4.7], [lmod.params[0] + lmod.params[1]*3.4, lmod.params[0] + lmod.params[1]*4.7])
plt.plot([3.4, 4.7], [lmodr.params[0] + lmodr.params[1]*3.4, lmodr.params[0] + lmodr.params[1]*4.7],color='red')

[<matplotlib.lines.Line2D at 0x1c1df331d0>]

Influential observations¶

lmod = smf.ols(formula='sr ~ pop15 + pop75 + dpi + ddpi', data=savings).fit()
diagv = lmod.get_influence()
cooks = pd.Series(diagv.cooks_distance[0], savings.index)
n=50
ix = np.arange(1,n+1)
halfq = sp.stats.norm.ppf((n+ix)/(2*n+1)),
plt.scatter(halfq, np.sort(cooks))

<matplotlib.collections.PathCollection at 0x1c1e3d76a0>

Would be best placed on the plot.

cooks.sort_values().iloc[-5:]

Philippines    0.045221
Ireland        0.054396
Zambia         0.096633
Japan          0.142816
Libya          0.268070
dtype: float64

First column is the all data estimates and second is one without largest cook stat.

lmodi = smf.ols(formula='sr ~ pop15 + pop75 + dpi + ddpi', data=savings[cooks < 0.2]).fit()
pd.DataFrame({'with':lmod.params,'without':lmodi.params})

Would be better to autodetect extreme cases and label them. Note that dfbetas is the change in standard errors and not the change in coefficient as in R.

p15d = diagv.dfbetas[:,1]
plt.scatter(np.arange(1,51),p15d)
plt.axhline(0)
ix = 22
plt.annotate(savings.index[ix],(ix, p15d[ix]))
ix = 48
plt.annotate(savings.index[ix],(ix, p15d[ix]))

Text(48,-0.483244,'Libya')

Checking the structure of the model¶

Partial regression plot

d = smf.ols(formula='sr ~ pop75 + dpi + ddpi', data=savings).fit().resid
m = smf.ols(formula='pop15 ~ pop75 + dpi + ddpi', data=savings).fit().resid
plt.scatter(m,d)
plt.xlabel("pop15 residuals")
plt.ylabel("sr residuals")
plt.plot([-10,8], [-10*lmod.params[1], 8*lmod.params[1]])

[<matplotlib.lines.Line2D at 0x1c1dd06e10>]

Partial residual plot

pr = lmod.resid + savings.pop15*lmod.params[1]
plt.scatter(savings.pop15, pr)
plt.xlabel("pop15")
plt.ylabel("partial residuals")
plt.plot([20,50], [20*lmod.params[1], 50*lmod.params[1]])

[<matplotlib.lines.Line2D at 0x1c1df27a58>]

smf.ols(formula='sr ~ pop15 + pop75 + dpi + ddpi', data=savings[savings.pop15 > 35]).fit().summary()

smf.ols(formula='sr ~ pop15 + pop75 + dpi + ddpi', data=savings[savings.pop15 < 35]).fit().summary()

Plot the young and old groups with different colors:

savings['age'] = np.where(savings.pop15 > 35, 'red', 'blue')
plt.scatter(savings.ddpi, savings.sr, color=savings.age)
plt.xlabel("ddpi")
plt.ylabel("sr")

Text(0,0.5,'sr')

Would be nice to have a legend but matplotlib does not make this easy. Fortunately, seaborn makes this more convenient:

savings['age'] = np.where(savings.pop15 > 35, 'young', 'old')
sns.lmplot('ddpi','sr',data=savings, hue='age')

/anaconda/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

<seaborn.axisgrid.FacetGrid at 0x1c1f0ffcc0>

Can facet this also:

sns.lmplot('ddpi','sr',data=savings, col='age')

/anaconda/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

<seaborn.axisgrid.FacetGrid at 0x1c1f1cf588>

%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels

	sr	pop15	pop75	dpi	ddpi
Australia	11.43	29.35	2.87	2329.68	2.87
Austria	12.07	23.32	4.41	1507.99	3.93
Belgium	13.17	23.80	4.43	2108.47	3.82
Bolivia	5.75	41.89	1.67	189.13	0.22
Brazil	12.88	42.19	0.83	728.47	4.56

Dep. Variable:	sr	R-squared:	0.338
Model:	OLS	Adj. R-squared:	0.280
Method:	Least Squares	F-statistic:	5.756
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	0.000790
Time:	11:15:04	Log-Likelihood:	-135.10
No. Observations:	50	AIC:	280.2
Df Residuals:	45	BIC:	289.8
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	28.5661	7.355	3.884	0.000	13.753	43.379
pop15	-0.4612	0.145	-3.189	0.003	-0.753	-0.170
pop75	-1.6915	1.084	-1.561	0.126	-3.874	0.491
dpi	-0.0003	0.001	-0.362	0.719	-0.002	0.002
ddpi	0.4097	0.196	2.088	0.042	0.015	0.805

Omnibus:	0.866	Durbin-Watson:	1.934
Prob(Omnibus):	0.649	Jarque-Bera (JB):	0.493
Skew:	0.241	Prob(JB):	0.782
Kurtosis:	3.064	Cond. No.	2.04e+04

Dep. Variable:	y	R-squared:	0.061
Model:	OLS	Adj. R-squared:	0.041
Method:	Least Squares	F-statistic:	3.117
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	0.0838
Time:	11:15:04	Log-Likelihood:	-47.153
No. Observations:	50	AIC:	98.31
Df Residuals:	48	BIC:	102.1
Df Model:	1
Covariance Type:	nonrobust

Omnibus:	1.170	Durbin-Watson:	2.050
Prob(Omnibus):	0.557	Jarque-Bera (JB):	1.184
Skew:	0.273	Prob(JB):	0.553
Kurtosis:	2.479	Cond. No.	39.2

	nhtemp	wusa	jasper	westgreen	chesapeake	tornetrask	urals	mongolia	tasman	year
0	NaN	-0.66	-0.03	0.03	-0.66	0.33	-1.49	0.83	-0.12	1000
1	NaN	-0.63	-0.07	0.09	-0.67	0.21	-1.44	0.96	-0.17	1001
2	NaN	-0.60	-0.11	0.18	-0.67	0.13	-1.39	0.99	-0.22	1002
3	NaN	-0.55	-0.14	0.30	-0.68	0.08	-1.34	0.95	-0.26	1003
4	NaN	-0.51	-0.15	0.41	-0.68	0.06	-1.30	0.87	-0.31	1004

Dep. Variable:	nhtemp	R-squared:	0.476
Model:	OLS	Adj. R-squared:	0.446
Method:	Least Squares	F-statistic:	15.47
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	5.03e-16
Time:	11:15:05	Log-Likelihood:	50.995
No. Observations:	145	AIC:	-83.99
Df Residuals:	136	BIC:	-57.20
Df Model:	8
Covariance Type:	nonrobust

Omnibus:	12.501	Durbin-Watson:	0.817
Prob(Omnibus):	0.002	Jarque-Bera (JB):	17.388
Skew:	0.490	Prob(JB):	0.000168
Kurtosis:	4.384	Cond. No.	12.9

	with	without
Intercept	28.566087	24.524046
pop15	-0.461193	-0.391440
pop75	-1.691498	-1.280867
dpi	-0.000337	-0.000319
ddpi	0.409695	0.610279

Dep. Variable:	sr	R-squared:	0.156
Model:	OLS	Adj. R-squared:	-0.032
Method:	Least Squares	F-statistic:	0.8302
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	0.523
Time:	11:15:07	Log-Likelihood:	-64.176
No. Observations:	23	AIC:	138.4
Df Residuals:	18	BIC:	144.0
Df Model:	4
Covariance Type:	nonrobust

Omnibus:	1.851	Durbin-Watson:	1.904
Prob(Omnibus):	0.396	Jarque-Bera (JB):	1.540
Skew:	0.497	Prob(JB):	0.463
Kurtosis:	2.213	Cond. No.	8.87e+03

Dep. Variable:	sr	R-squared:	0.507
Model:	OLS	Adj. R-squared:	0.418
Method:	Least Squares	F-statistic:	5.663
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	0.00273
Time:	11:15:07	Log-Likelihood:	-63.072
No. Observations:	27	AIC:	136.1
Df Residuals:	22	BIC:	142.6
Df Model:	4
Covariance Type:	nonrobust

Omnibus:	0.363	Durbin-Watson:	2.189
Prob(Omnibus):	0.834	Jarque-Bera (JB):	0.396
Skew:	-0.241	Prob(JB):	0.820
Kurtosis:	2.654	Cond. No.	3.01e+04

Software	Version
Python	3.7.0 64bit [Clang 4.0.1 (tags/RELEASE_401/final)]
IPython	6.5.0
OS	Darwin 17.7.0 x86_64 i386 64bit
pandas	0.23.4
numpy	1.15.1
matplotlib	2.2.3
seaborn	0.9.0
scipy	1.1.0
patsy	0.5.0
statsmodels	0.9.0
Wed Sep 26 11:15:08 2018 BST

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	2.1622	0.348	6.215	0.000	1.463	2.862
x	-0.0614	0.035	-1.766	0.084	-0.131	0.009

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-0.2426	0.027	-8.980	0.000	-0.296	-0.189
wusa	0.0774	0.043	1.803	0.074	-0.008	0.162
jasper	-0.2288	0.078	-2.929	0.004	-0.383	-0.074
westgreen	0.0096	0.042	0.229	0.819	-0.073	0.092
chesapeake	-0.0321	0.034	-0.943	0.347	-0.099	0.035
tornetrask	0.0927	0.045	2.057	0.042	0.004	0.182
urals	0.1854	0.091	2.027	0.045	0.005	0.366
mongolia	0.0420	0.046	0.917	0.361	-0.049	0.133
tasman	0.1155	0.030	3.834	0.000	0.056	0.175

	temp	light
0	4.37	5.23
1	4.56	5.74
2	4.26	4.93
3	4.56	5.74
4	4.30	5.19

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-2.4340	21.155	-0.115	0.910	-46.879	42.011
pop15	0.2739	0.439	0.624	0.541	-0.649	1.197
pop75	-3.5485	3.033	-1.170	0.257	-9.921	2.824
dpi	0.0004	0.005	0.084	0.934	-0.010	0.011
ddpi	0.3955	0.290	1.363	0.190	-0.214	1.005