Chapter 14: Categorical Predictors

Import the packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
from scipy import stats

Two-level factor

Read in the data

sexab = pd.read_csv("data/sexab.csv", index_col=0)
sexab.head()

Construct a summary similar to the one produced by the by function in LMR.

lfuncs = ['min','median','max']
sexab.groupby('csa').agg({'cpa': lfuncs,'ptsd': lfuncs}).round(1)

sexab.boxplot('ptsd',by='csa')
plt.suptitle("")
plt.show()

sns.pairplot(x_vars="cpa", y_vars="ptsd", data=sexab, hue="csa", height=5)
plt.show()

stats.ttest_ind(sexab.ptsd[sexab.csa == 'Abused'], sexab.ptsd[sexab.csa == 'NotAbused'])

Ttest_indResult(statistic=8.938657095668173, pvalue=2.1719334389283135e-13)

Defining dummy variables for both levels and also including an intercept does fit a model, but the meaning of the coefficients and standard errors is problematic. Look out for the warning at the end of the summary that the design matrix is singular.

df1 = (sexab.csa == 'Abused').astype(int)
df2 = (sexab.csa == 'NotAbused').astype(int)
X = np.column_stack((df1,df2))
lmod = sm.OLS(sexab.ptsd,sm.add_constant(X)).fit()
lmod.summary()
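The singularity can be checked directly from the rank of the design matrix. A minimal sketch on a made-up five-row factor (not the sexab data): the two dummies sum to the intercept column, so the three-column matrix only has rank 2.

```python
import numpy as np

# Toy factor with two levels (hypothetical data, not the sexab file).
group = np.array(["Abused", "Abused", "NotAbused", "NotAbused", "NotAbused"])
d1 = (group == "Abused").astype(float)
d2 = (group == "NotAbused").astype(float)

# Intercept plus both dummies: the two dummies sum to the intercept column.
X_full = np.column_stack((np.ones(len(group)), d1, d2))
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one dummy restores full column rank.
X_one = np.column_stack((np.ones(len(group)), d2))
rank_one = np.linalg.matrix_rank(X_one)

print(X_full.shape[1], rank_full)  # number of columns, then rank
print(X_one.shape[1], rank_one)
```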

Use only one of the dummy variables:

lmod = sm.OLS(sexab.ptsd,sm.add_constant(df2)).fit()
lmod.summary()

Or we can just not have an intercept term:

lmod = sm.OLS(sexab.ptsd,X).fit()
lmod.summary()
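With no intercept, the two coefficients are simply the group means, while the intercept-plus-one-dummy version gives the reference mean and the difference in means. A small sketch with synthetic responses (the means, spreads and sample sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y_a = rng.normal(12.0, 3.0, size=30)   # synthetic "Abused" responses
y_n = rng.normal(5.0, 3.0, size=40)    # synthetic "NotAbused" responses
y = np.concatenate((y_a, y_n))
d_a = np.concatenate((np.ones(30), np.zeros(40)))
d_n = 1.0 - d_a

# No intercept, both dummies: coefficients are the two group means.
X_noint = np.column_stack((d_a, d_n))
beta_noint, *_ = np.linalg.lstsq(X_noint, y, rcond=None)

# Intercept plus the NotAbused dummy: the intercept is the Abused mean and
# the dummy coefficient is the NotAbused mean minus the Abused mean.
X_int = np.column_stack((np.ones_like(y), d_n))
beta_int, *_ = np.linalg.lstsq(X_int, y, rcond=None)

print(beta_noint, y_a.mean(), y_n.mean())
print(beta_int, y_n.mean() - y_a.mean())
```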

The formula method produces R-like results:

lmod = smf.ols('ptsd ~ csa', sexab).fit()
lmod.summary()
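The t statistic for the dummy coefficient in these two-level models agrees with the earlier two-sample t-test up to sign, because stats.ttest_ind assumes equal variances by default. A sketch of why, computing both by hand on synthetic data (all numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
y0 = rng.normal(12.0, 3.0, size=25)   # synthetic first group
y1 = rng.normal(5.0, 3.0, size=35)    # synthetic second group
n0, n1 = len(y0), len(y1)

# Pooled-variance two-sample t statistic.
sp2 = ((n0 - 1) * y0.var(ddof=1) + (n1 - 1) * y1.var(ddof=1)) / (n0 + n1 - 2)
t_pooled = (y0.mean() - y1.mean()) / np.sqrt(sp2 * (1 / n0 + 1 / n1))

# OLS of y on an intercept and a dummy for the second group.
y = np.concatenate((y0, y1))
dummy = np.concatenate((np.zeros(n0), np.ones(n1)))
X = np.column_stack((np.ones(n0 + n1), dummy))
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n0 + n1 - 2)
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_ols = beta[1] / se

print(t_pooled, t_ols)  # equal in magnitude, opposite in sign
```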

We can view the design matrix (picking out the first two rows of each level of csa):

import patsy
selcols = [0,1,45,46]
patsy.dmatrix('~ csa', sexab)[selcols,:]

array([[1., 0.],
       [1., 0.],
       [1., 1.],
       [1., 1.]])

Can check the types of variables:

sexab.csa.dtype, sexab.ptsd.dtype

(dtype('O'), dtype('float64'))

pandas has a way to create the dummy variables:

sac = pd.concat([sexab,pd.get_dummies(sexab.csa)],axis=1)
lmod = smf.ols('ptsd ~ Abused', sac).fit()
lmod.summary()

Alternatively, we can set the reference level to NotAbused:

lmod = smf.ols('ptsd ~ C(csa,Treatment(reference="NotAbused"))', sexab).fit()
lmod.summary()

Factors and Quantitative predictors

lmod4 = smf.ols('ptsd ~ csa*cpa', sexab).fit()
lmod4.summary()

Take a look at a chunk from the design matrix:

patsy.dmatrix('~ csa*cpa', sexab)[40:50,]

array([[ 1.     ,  0.     ,  3.0775 ,  0.     ],
       [ 1.     ,  0.     ,  5.26785,  0.     ],
       [ 1.     ,  0.     ,  3.41136,  0.     ],
       [ 1.     ,  0.     ,  1.35316,  0.     ],
       [ 1.     ,  0.     ,  5.11921,  0.     ],
       [ 1.     ,  1.     ,  1.49181,  1.49181],
       [ 1.     ,  1.     ,  0.60961,  0.60961],
       [ 1.     ,  1.     ,  1.43335,  1.43335],
       [ 1.     ,  1.     , -0.33664, -0.33664],
       [ 1.     ,  1.     , -3.12036, -3.12036]])

Plot the fit by group - seems like a lot of work!

abused = (sexab.csa == "Abused")
plt.scatter(sexab.cpa[abused], sexab.ptsd[abused], marker='x',label="abused")
xl,xu = [-3, 9]
a, b = (lmod4.params[0], lmod4.params[2])
plt.plot([xl,xu], [a+xl*b,a+xu*b])
plt.scatter(sexab.cpa[~abused], sexab.ptsd[~abused], marker='o',label="not abused")
a, b = (lmod4.params[0]+lmod4.params[1], lmod4.params[2]+lmod4.params[3])
plt.plot([xl,xu], [a+xl*b,a+xu*b])
plt.legend()
plt.show()
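The bookkeeping in the plot above is that the reference group uses the intercept and slope as they are, while the other group adds its offset and the interaction term. A sketch of that step as a hypothetical helper, group_lines, which is not part of statsmodels and assumes the coefficient order intercept, group offset, slope, interaction:

```python
def group_lines(params):
    """Return (intercept, slope) for each group from treatment-coded coefficients.

    params is assumed to be ordered as
    [intercept, group offset, slope, group-by-slope interaction].
    """
    b0, b_group, b_x, b_inter = params
    reference = (b0, b_x)
    other = (b0 + b_group, b_x + b_inter)
    return reference, other

# Made-up coefficients, for illustration only.
ref, oth = group_lines([10.0, -6.0, 0.5, 0.25])
print(ref)  # (10.0, 0.5)
print(oth)  # (4.0, 0.75)
```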

seaborn can produce essentially the same plot, but the regression lines are fit independently to each group. In this case, the fits are identical because the model includes the interaction term.

sns.lmplot(x="cpa", y="ptsd", hue="csa", data=sexab, ci=None)
plt.show()
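The claim that separate fits match the interaction model can be checked numerically. A minimal sketch on synthetic data (the group coding, coefficients and noise level are all made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
g = np.repeat([0.0, 1.0], n // 2)     # 0 = reference group, 1 = other group
x = rng.normal(2.0, 2.0, size=n)
y = 10 + 0.5 * x - 6 * g + 0.3 * g * x + rng.normal(0, 1, size=n)

# Full interaction model: intercept, group, x, group:x.
X = np.column_stack((np.ones(n), g, x, g * x))
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Separate straight-line fits within each group.
fit0 = np.polyfit(x[g == 0], y[g == 0], 1)   # returns [slope, intercept]
fit1 = np.polyfit(x[g == 1], y[g == 1], 1)

print(b[0], b[2], fit0[::-1])                # reference intercept and slope
print(b[0] + b[1], b[2] + b[3], fit1[::-1])  # other-group intercept and slope
```

This equivalence relies on the interaction term; without it the two groups are forced to share a slope.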

lmod3 = smf.ols('ptsd ~ csa+cpa', sexab).fit()
lmod3.summary()

No shortcut to producing the plot this time.

abused = (sexab.csa == "Abused")
plt.scatter(sexab.cpa[abused], sexab.ptsd[abused], marker='x',label="abused")
xl,xu = [-3, 9]
a, b = (lmod3.params[0], lmod3.params[2])
plt.plot([xl,xu], [a+xl*b,a+xu*b])
plt.scatter(sexab.cpa[~abused], sexab.ptsd[~abused], marker='o',label="not abused")
a, b = (lmod3.params[0]+lmod3.params[1], lmod3.params[2])
plt.plot([xl,xu], [a+xl*b,a+xu*b])
plt.legend()
plt.show()

Get the confidence intervals:

lmod3.conf_int()

sns.residplot(x=lmod3.fittedvalues, y=lmod3.resid, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

lmod1 = smf.ols('ptsd ~ cpa', sexab).fit()
lmod1.summary()

Interpretation with interaction terms

ws = pd.read_csv("data/whiteside.csv", index_col=0)
ws.head()

sns.lmplot(x="Temp", y="Gas", col="Insul", data=ws)
plt.show()

lmod = smf.ols('Gas ~ Temp*Insul', ws).fit()
lmod.summary()

ws.Temp.mean()

4.875

ws['cTemp'] = ws.Temp - ws.Temp.mean()
lmod = smf.ols('Gas ~ cTemp*Insul', ws).fit()
lmod.summary()
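Centering the temperature leaves the slopes alone but changes what the factor coefficient means: it becomes the gap between the groups at the mean temperature rather than at zero. A sketch on synthetic data (the coefficients and noise below are made up, not the whiteside estimates):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
g = np.repeat([0.0, 1.0], n // 2)     # 0 = reference level, 1 = other level
temp = rng.uniform(-1, 10, size=n)
gas = 6.8 - 0.4 * temp - 2.1 * g + 0.1 * g * temp + rng.normal(0, 0.3, size=n)

def fit(t):
    # Columns: intercept, group, temperature, group:temperature.
    X = np.column_stack((np.ones(n), g, t, g * t))
    return np.linalg.lstsq(X, gas, rcond=None)[0]

b_raw = fit(temp)
b_cen = fit(temp - temp.mean())

# Slopes are unchanged by centering.
print(b_raw[2:], b_cen[2:])
# The centered group coefficient equals the raw one shifted to the mean temperature.
print(b_raw[1] + b_raw[3] * temp.mean(), b_cen[1])
```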

Factors with more than two levels

ff = pd.read_csv("data/fruitfly.csv", index_col=0)
ff.head()

sns.pairplot(x_vars="thorax", y_vars="longevity", data=ff, hue="activity", height=5)
plt.show()

sns.lmplot(x="thorax", y="longevity", data=ff, col="activity", height=5)
plt.show()

Fit the model

lmod = smf.ols('longevity ~ thorax*activity', ff).fit()
lmod.summary()

Construct and show selected rows of the design matrix (one for each level of activity). DataFrame is just for pretty printing.

mm = patsy.dmatrix('~ thorax*activity', ff)
ii = (1, 25, 49, 75, 99)
pd.DataFrame(mm[ii,:],index=ii,columns=lmod.params.index)

sns.residplot(x=lmod.fittedvalues, y=lmod.resid, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

sm.qqplot(lmod.resid, line="q")

sm.stats.anova_lm(lmod)

lmods = smf.ols('longevity ~ thorax+activity', ff).fit()
sm.stats.anova_lm(lmods,lmod)
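The F statistic in this comparison comes from the drop in residual sum of squares relative to the residual variance of the larger model. A short check by hand, using the ssr and df_resid values that anova_lm reports for these two fits:

```python
# Residual sums of squares and degrees of freedom as reported by anova_lm.
rss_small, df_small = 13107.297500, 118   # longevity ~ thorax + activity
rss_big, df_big = 13082.983907, 114       # longevity ~ thorax * activity

# Change in RSS per extra parameter, divided by the larger model's residual variance.
f_stat = ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
print(round(f_stat, 6))
```

This should agree with the reported F up to rounding. Such a small value gives no evidence that the interaction terms are needed.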

lmod = smf.ols('np.log(longevity) ~ thorax+activity', ff).fit()
lmod.summary()

sns.residplot(x=lmod.fittedvalues, y=lmod.resid, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

np.exp(lmod.params[1:5])

activity[T.isolated]    1.520824
activity[T.low]         1.343643
activity[T.many]        1.660571
activity[T.one]         1.601589
dtype: float64
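Because the response is on the log scale, the exponentiated coefficients act as multiplicative effects on longevity relative to the reference level of activity, with thorax held fixed. A small sketch converting the ratios printed above into percentage changes:

```python
# Ratios copied from the exponentiated coefficients above.
ratios = {"isolated": 1.520824, "low": 1.343643, "many": 1.660571, "one": 1.601589}

# A ratio r corresponds to a (r - 1) * 100 percent change from the reference level.
pct = {level: (ratio - 1) * 100 for level, ratio in ratios.items()}

for level, change in pct.items():
    print(f"{level}: {change:+.1f}% relative to the reference level")
```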

lmod = smf.ols('longevity ~ activity', ff).fit()
lmod.summary()

sm.stats.anova_lm(lmod)

lmod = smf.ols('np.log(longevity) ~ activity', ff).fit()
lmod.summary()

Factor coding

from patsy.contrasts import Treatment
levels = [1,2,3,4]
contrast = Treatment(reference=0).code_without_intercept(levels)
print(contrast.matrix)

[[0. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

from patsy.contrasts import Sum
contrast = Sum().code_without_intercept(levels)
print(contrast.matrix)

[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [-1. -1. -1.]]

from patsy.contrasts import Helmert
contrast = Helmert().code_without_intercept(levels)
print(contrast.matrix)

[[-1. -1. -1.]
 [ 1. -1. -1.]
 [ 0.  2. -1.]
 [ 0.  0.  3.]]

lmod = smf.ols('ptsd ~ C(csa,Sum)', sexab).fit()
lmod.summary()
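Under sum coding with two levels, the intercept is the average of the two group means and the single coefficient is the deviation of the first level from that average. A sketch that solves for both from the group means reported by the no-intercept fit in this chapter. It assumes, in line with the Sum matrix printed above, that the first level (Abused) is coded +1 and the last level (NotAbused) is coded -1:

```python
import numpy as np

mean_abused, mean_notabused = 11.9411, 4.6959   # group means from the no-intercept fit

# Rows: one per level; columns: intercept and the single sum-coded column.
C = np.array([[1.0, 1.0],     # Abused (assumed coded +1)
              [1.0, -1.0]])   # NotAbused (assumed coded -1)
intercept, effect = np.linalg.solve(C, [mean_abused, mean_notabused])

print(intercept, effect)
```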

%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels

	cpa	ptsd	csa
1	2.04786	9.71365	Abused
2	0.83895	6.16933	Abused
3	-0.24139	15.15926	Abused
4	-1.11461	11.31277	Abused
5	2.01468	9.95384	Abused

Dep. Variable:	ptsd	R-squared:	0.519
Model:	OLS	Adj. R-squared:	0.513
Method:	Least Squares	F-statistic:	79.90
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	2.17e-13
Time:	11:16:12	Log-Likelihood:	-201.44
No. Observations:	76	AIC:	406.9
Df Residuals:	74	BIC:	411.5
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
const	5.5457	0.270	20.526	0.000	5.007	6.084
x1	6.3954	0.403	15.874	0.000	5.593	7.198
x2	-0.8498	0.450	-1.888	0.063	-1.747	0.047

Omnibus:	1.200	Durbin-Watson:	1.830
Prob(Omnibus):	0.549	Jarque-Bera (JB):	1.206
Skew:	-0.198	Prob(JB):	0.547
Kurtosis:	2.527	Cond. No.	9.34e+15

Dep. Variable:	ptsd	R-squared:	0.519
Model:	OLS	Adj. R-squared:	0.513
Method:	Least Squares	F-statistic:	79.90
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	2.17e-13
Time:	11:16:12	Log-Likelihood:	-201.44
No. Observations:	76	AIC:	406.9
Df Residuals:	74	BIC:	411.5
Df Model:	1
Covariance Type:	nonrobust

	cpa			ptsd
	min	median	max	min	median	max
csa
Abused	-1.1	2.6	8.6	6.0	11.3	19.0
NotAbused	-3.1	1.3	5.0	-3.3	5.8	10.9

Omnibus:	1.273	Durbin-Watson:	1.837
Prob(Omnibus):	0.529	Jarque-Bera (JB):	1.083
Skew:	-0.069	Prob(JB):	0.582
Kurtosis:	2.432	Cond. No.	12.0

Omnibus:	0.879	Durbin-Watson:	1.860
Prob(Omnibus):	0.644	Jarque-Bera (JB):	0.878
Skew:	-0.070	Prob(JB):	0.645
Kurtosis:	2.492	Cond. No.	9.14

	0	1
Intercept	8.815712	11.680361
csa[T.NotAbused]	-7.910809	-4.634696
cpa	0.208584	0.892520

Omnibus:	0.484	Durbin-Watson:	1.181
Prob(Omnibus):	0.785	Jarque-Bera (JB):	0.176
Skew:	0.103	Prob(JB):	0.916
Kurtosis:	3.114	Cond. No.	4.93

	Insul	Temp	Gas
1	Before	-0.8	7.2
2	Before	-0.7	6.9
3	Before	0.4	6.4
4	Before	2.5	6.0
5	Before	2.9	5.8

Dep. Variable:	Gas	R-squared:	0.928
Model:	OLS	Adj. R-squared:	0.924
Method:	Least Squares	F-statistic:	222.3
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	1.23e-29
Time:	11:16:13	Log-Likelihood:	-14.100
No. Observations:	56	AIC:	36.20
Df Residuals:	52	BIC:	44.30
Df Model:	3
Covariance Type:	nonrobust

Omnibus:	6.016	Durbin-Watson:	1.854
Prob(Omnibus):	0.049	Jarque-Bera (JB):	4.998
Skew:	-0.626	Prob(JB):	0.0822
Kurtosis:	3.757	Cond. No.	30.9

Dep. Variable:	longevity	R-squared:	0.653
Model:	OLS	Adj. R-squared:	0.626
Method:	Least Squares	F-statistic:	23.88
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	1.89e-22
Time:	11:16:14	Log-Likelihood:	-464.79
No. Observations:	124	AIC:	949.6
Df Residuals:	114	BIC:	977.8
Df Model:	9
Covariance Type:	nonrobust

Omnibus:	2.091	Durbin-Watson:	1.957
Prob(Omnibus):	0.352	Jarque-Bera (JB):	1.710
Skew:	0.281	Prob(JB):	0.425
Kurtosis:	3.126	Cond. No.	127.

	coef	std err	t	P>|t|	[0.025	0.975]
const	11.9411	0.518	23.067	0.000	10.910	12.973
csa	-7.2452	0.811	-8.939	0.000	-8.860	-5.630

	coef	std err	t	P>|t|	[0.025	0.975]
x1	11.9411	0.518	23.067	0.000	10.910	12.973
x2	4.6959	0.624	7.529	0.000	3.453	5.939

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	11.9411	0.518	23.067	0.000	10.910	12.973
csa[T.NotAbused]	-7.2452	0.811	-8.939	0.000	-8.860	-5.630

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	4.6959	0.624	7.529	0.000	3.453	5.939
Abused	7.2452	0.811	8.939	0.000	5.630	8.860

	Intercept	activity[T.isolated]	activity[T.low]	activity[T.many]	activity[T.one]	thorax	thorax:activity[T.isolated]	thorax:activity[T.low]	thorax:activity[T.many]	thorax:activity[T.one]
1	1.0	0.0	0.0	1.0	0.0	0.68	0.0	0.00	0.68	0.00
25	1.0	1.0	0.0	0.0	0.0	0.70	0.7	0.00	0.00	0.00
49	1.0	0.0	0.0	0.0	1.0	0.64	0.0	0.00	0.00	0.64
75	1.0	0.0	1.0	0.0	0.0	0.68	0.0	0.68	0.00	0.00
99	1.0	0.0	0.0	0.0	0.0	0.64	0.0	0.00	0.00	0.00

	df	sum_sq	mean_sq	F	PR(>F)
activity	4.0	12269.467151	3067.366788	26.727833	1.200318e-15
thorax	1.0	12368.420833	12368.420833	107.773577	3.565139e-18
thorax:activity	4.0	24.313592	6.078398	0.052965	9.946914e-01
Residual	114.0	13082.983907	114.763017	NaN	NaN

	df_resid	ssr	df_diff	ss_diff	F	Pr(>F)
0	118.0	13107.297500	0.0	NaN	NaN	NaN
1	114.0	13082.983907	4.0	24.313592	0.052965	0.994691

Dep. Variable:	np.log(longevity)	R-squared:	0.702
Model:	OLS	Adj. R-squared:	0.690
Method:	Least Squares	F-statistic:	55.72
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	1.81e-29
Time:	11:16:14	Log-Likelihood:	31.033
No. Observations:	124	AIC:	-50.07
Df Residuals:	118	BIC:	-33.15
Df Model:	5
Covariance Type:	nonrobust

Omnibus:	2.504	Durbin-Watson:	1.883
Prob(Omnibus):	0.286	Jarque-Bera (JB):	2.448
Skew:	-0.283	Prob(JB):	0.294
Kurtosis:	2.609	Cond. No.	23.5

Omnibus:	1.296	Durbin-Watson:	1.096
Prob(Omnibus):	0.523	Jarque-Bera (JB):	1.141
Skew:	0.034	Prob(JB):	0.565
Kurtosis:	2.535	Cond. No.	5.81

Omnibus:	12.042	Durbin-Watson:	1.036
Prob(Omnibus):	0.002	Jarque-Bera (JB):	12.531
Skew:	-0.719	Prob(JB):	0.00190
Kurtosis:	3.600	Cond. No.	5.81

Software	Version
Python	3.7.0 64bit [Clang 4.0.1 (tags/RELEASE_401/final)]
IPython	6.5.0
OS	Darwin 17.7.0 x86_64 i386 64bit
pandas	0.23.4
numpy	1.15.1
matplotlib	2.2.3
seaborn	0.9.0
scipy	1.1.0
patsy	0.5.0
statsmodels	0.9.0
Wed Sep 26 11:16:15 2018 BST