Chapter 16: Models with several factors¶

Load the packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
from scipy import stats

Two factors with no replication¶

Load in the data

composite = pd.read_csv("data/composite.csv", index_col=0)
composite

Create the interaction plots:

sns.catplot(x="laser", y="strength", hue="tape", data=composite, kind="point")

<seaborn.axisgrid.FacetGrid at 0x1c218f7a58>

sns.catplot(x="tape", y="strength", hue="laser", data=composite, kind="point")

<seaborn.axisgrid.FacetGrid at 0x1c21c0f6d8>

Fit a saturated model. Don't show the error messages (get complaints about small datasets)

%%capture --no-display
lmod = smf.ols("strength ~ tape * laser", composite).fit()
lmod.summary()

Fit main effects only model:

lmod = smf.ols("strength ~ tape + laser", composite).fit()
lmod.params

Intercept         22.437778
tape[T.medium]     5.516667
tape[T.slow]       1.480000
laser[T.50W]       6.573333
laser[T.60W]      12.213333
dtype: float64

lasercoefs = np.tile([0, 6.57, 12.21],3)
lasercoefs

array([ 0.  ,  6.57, 12.21,  0.  ,  6.57, 12.21,  0.  ,  6.57, 12.21])

tapecoefs = np.repeat([1.48, 5.52, 0], 3)
tapecoefs

array([1.48, 1.48, 1.48, 5.52, 5.52, 5.52, 0.  , 0.  , 0.  ])

%%capture --no-display
composite['crossp'] = tapecoefs * lasercoefs
tmod = smf.ols("strength ~ tape + laser + crossp", composite).fit()
tmod.summary()

sm.stats.anova_lm(tmod)

Fix the order of the categories

cat_type = pd.api.types.CategoricalDtype(categories=['slow','medium','fast'],ordered=True)
composite['tape'] = composite.tape.astype(cat_type)

%%capture --no-display
from patsy.contrasts import Poly
lmod = smf.ols("strength ~ C(tape,Poly) + C(laser,Poly)", composite).fit()
lmod.summary()

Check out the design matrix

import patsy
dm = patsy.dmatrix('~ C(tape,Poly) + C(laser,Poly)', composite)
np.asarray(dm).round(2)

array([[ 1.  , -0.71,  0.41, -0.71,  0.41],
       [ 1.  , -0.71,  0.41, -0.  , -0.82],
       [ 1.  , -0.71,  0.41,  0.71,  0.41],
       [ 1.  , -0.  , -0.82, -0.71,  0.41],
       [ 1.  , -0.  , -0.82, -0.  , -0.82],
       [ 1.  , -0.  , -0.82,  0.71,  0.41],
       [ 1.  ,  0.71,  0.41, -0.71,  0.41],
       [ 1.  ,  0.71,  0.41, -0.  , -0.82],
       [ 1.  ,  0.71,  0.41,  0.71,  0.41]])

composite['Nlaser'] = np.tile([40,50,60],3)
composite['Ntape'] = np.repeat([6.42,13,27], 3)

%%capture --no-display
lmodn = smf.ols("strength ~ np.log(Ntape) + I(np.log(Ntape)**2) + Nlaser", composite).fit()
lmodn.summary()

Two factors with replication¶

Read in the data while making sure the predictors are treated as categorical variables:

pvc = pd.read_csv("data/pvc.csv", index_col=0, dtype={'psize':'float','operator':'category','resin':'category'})
pvc.head()

Compute the group means

pvc.groupby('operator')[['psize']].mean()

sns.catplot(x='resin',y='psize',hue='operator', data=pvc, kind="point",ci=None)

<seaborn.axisgrid.FacetGrid at 0x1c22535fd0>

ax = sns.catplot(x='operator',y='psize',hue='resin', data=pvc, ci=None,scale=0.5,kind="point")
ax = sns.swarmplot(x='operator',y='psize',hue='resin', data=pvc)
ax.legend_.remove()

Create the ANOVA table

lmod = smf.ols('psize ~ operator*resin', pvc).fit()
sm.stats.anova_lm(lmod).round(3)

Now remove the interaction term:

lmod = smf.ols('psize ~ operator+resin', pvc).fit()
sm.stats.anova_lm(lmod).round(3)

Make the diagnostic plots:

fig = sm.qqplot(lmod.resid, line="q")

/anaconda/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

sns.residplot(lmod.fittedvalues, lmod.resid)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

sns.swarmplot(pvc['operator'], lmod.resid)
plt.axhline(0)
plt.xlabel("Operator")
plt.ylabel("Residuals")
plt.show()

Test for equality of variance:

lmod = smf.ols('psize ~ operator*resin', pvc).fit()
ii = np.arange(0,48,2)
pvce = pvc.iloc[ii,:].copy()
pvce['res'] = np.sqrt(abs(lmod.resid.loc[ii+1]))
vmod = smf.ols('res ~ operator+resin', pvce).fit()
sm.stats.anova_lm(vmod,typ=3).round(3)

lmod = smf.ols('psize ~ operator+resin', pvc).fit()
lmod.summary()

Get the Tukey critical value:

from statsmodels.sandbox.stats.multicomp import get_tukeyQcrit
get_tukeyQcrit(3,38)

array(3.45)

Get the first pairwise difference:

lmod.params[1] + np.array([-1,1]) * get_tukeyQcrit(3,38) * lmod.bse[1] / np.sqrt(2)

array([-1.25002748,  0.72502748])

Two factors with an interaction¶

Read in the data:

warpbreaks = pd.read_csv("data/warpbreaks.csv", index_col=0, dtype={'breaks':'int','wool':'category','tension':'category'})
warpbreaks.head()

sns.boxplot(x="wool", y="breaks", data=warpbreaks)
plt.show()

sns.catplot(x='wool',y='breaks',hue='tension', data=warpbreaks, ci=None,kind="point")

<seaborn.axisgrid.FacetGrid at 0x1c22eaa160>

warpbreaks['nwool'] = warpbreaks['wool'].cat.codes

ax = sns.catplot(x='wool',y='breaks',hue='tension', data=warpbreaks, ci=None,kind="point")
ax = sns.swarmplot(x='wool',y='breaks',hue='tension', data=warpbreaks)
ax.legend_.remove()

ax = sns.catplot(x='tension',y='breaks',hue='wool', data=warpbreaks, ci=None, kind="point")
ax = sns.swarmplot(x='tension',y='breaks',hue='wool', data=warpbreaks)
ax.legend_.remove()

lmod = smf.ols('breaks ~ wool * tension', warpbreaks).fit()
sns.regplot(lmod.fittedvalues, lmod.resid,scatter_kws={'alpha':0.3}, ci=None)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

Text(0,0.5,'Residuals')

lmod = smf.ols('np.sqrt(breaks) ~ wool * tension', warpbreaks).fit()
sns.regplot(lmod.fittedvalues, lmod.resid,scatter_kws={'alpha':0.3}, ci=None)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

Text(0,0.5,'Residuals')

sm.stats.anova_lm(lmod)

lmod.summary()

lmod = smf.ols('np.sqrt(breaks) ~ wool:tension-1', warpbreaks).fit()
lmod.summary()

Width of Tukey HSD bands is ( qt/sqrt(2) se sqrt(2))

get_tukeyQcrit(6,48) * lmod.bse[0]

1.3725048838137048

Compute all the pairwise differences:

import itertools
dp = set(itertools.combinations(range(0,6),2))
dcoef = []
namdiff = []
for cp in dp:
    dcoef.append(lmod.params[cp[0]] - lmod.params[cp[1]])
    namdiff.append(lmod.params.index[cp[0]] + '-' + lmod.params.index[cp[1]])
thsd = pd.DataFrame({'Difference':dcoef},index=namdiff)
thsd["lb"] = thsd.Difference - get_tukeyQcrit(6,48) * lmod.bse[0]
thsd["ub"] = thsd.Difference + get_tukeyQcrit(6,48) * lmod.bse[0]
thsd

lmod.params.index

Index(['wool[A]:tension[H]', 'wool[B]:tension[H]', 'wool[A]:tension[L]',
       'wool[B]:tension[L]', 'wool[A]:tension[M]', 'wool[B]:tension[M]'],
      dtype='object')

Larger factorial experiments¶

Read in the data

speedo = pd.read_csv("data/speedo.csv", index_col=0)
speedo

%%capture --no-display
lmod = smf.ols('y ~ h+d+l+b+j+f+n+a+i+e+m+c+k+g+o', speedo).fit()
lmod.summary()

Look at the design matrix:

dm = patsy.dmatrix('~ h+d+l+b+j+f+n+a+i+e+m+c+k+g+o', speedo)
np.asarray(dm)

array([[1., 1., 1., 0., 1., 0., 0., 1., 1., 0., 0., 1., 0., 1., 1., 0.],
       [1., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 1., 1., 0., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1.],
       [1., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
       [1., 1., 1., 0., 0., 1., 1., 0., 1., 0., 0., 1., 1., 0., 0., 1.],
       [1., 0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0., 0.],
       [1., 1., 0., 1., 0., 1., 0., 1., 1., 0., 1., 0., 1., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 1.],
       [1., 0., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0.],
       [1., 1., 0., 1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 1., 0.],
       [1., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1., 1., 1.],
       [1., 1., 1., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 1., 1., 0.],
       [1., 0., 1., 1., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 1., 1.],
       [1., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

Make a QQ plot of the coefficients:

ii = np.argsort(lmod.params[1:])
scoef = np.array(lmod.params[1:])[ii]
lcoef = (speedo.columns[:-1])[ii]
n = len(scoef)
qq = stats.norm.ppf(np.arange(1,n+1)/(n+1))

fig, ax = plt.subplots()
ax.scatter(qq, scoef,s=1)

for i in range(len(qq)):
    ax.annotate(lcoef[i], (qq[i],scoef[i]))
    
plt.xlabel("Normal Quantiles")
plt.ylabel("Sorted coefficients")
plt.show()

n = len(scoef)
qq = stats.norm.ppf((n + np.arange(1,n+1))/(2*n+1))
acoef = np.abs(lmod.params[1:])
ii = np.argsort(acoef)
acoef = acoef[ii]
lcoef = (speedo.columns[:-1])[ii]
                    
fig, ax = plt.subplots()
ax.scatter(qq, acoef,s=1)

for i in range(len(qq)):
    ax.annotate(lcoef[i], (qq[i],acoef[i]))
    
plt.xlabel("Normal Half Quantiles")
plt.ylabel("Sorted absolute coefficients")
plt.show()

%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels

Dep. Variable:	strength	R-squared:	1.000
Model:	OLS	Adj. R-squared:	nan
Method:	Least Squares	F-statistic:	0.000
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	nan
Time:	11:16:23	Log-Likelihood:	274.42
No. Observations:	9	AIC:	-530.8
Df Residuals:	0	BIC:	-529.1
Df Model:	8
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	20.6500	inf	0	nan	nan	nan
tape[T.medium]	7.3500	inf	0	nan	nan	nan
tape[T.slow]	5.0100	inf	0	nan	nan	nan
laser[T.50W]	9.1400	inf	0	nan	nan	nan
laser[T.60W]	15.0100	inf	0	nan	nan	nan
tape[T.medium]:laser[T.50W]	-2.0500	inf	-0	nan	nan	nan
tape[T.slow]:laser[T.50W]	-5.6500	inf	-0	nan	nan	nan
tape[T.medium]:laser[T.60W]	-3.4500	inf	-0	nan	nan	nan
tape[T.slow]:laser[T.60W]	-4.9400	inf	-0	nan	nan	nan

Omnibus:	4.500	Durbin-Watson:	0.628
Prob(Omnibus):	0.105	Jarque-Bera (JB):	2.010
Skew:	1.157	Prob(JB):	0.366
Kurtosis:	2.925	Cond. No.	13.9

Dep. Variable:	strength	R-squared:	0.968
Model:	OLS	Adj. R-squared:	0.914
Method:	Least Squares	F-statistic:	18.03
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	0.0191
Time:	11:16:24	Log-Likelihood:	-12.837
No. Observations:	9	AIC:	37.67
Df Residuals:	3	BIC:	38.86
Df Model:	5
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	21.9483	1.491	14.717	0.001	17.202	26.694
tape[T.medium]	6.6746	2.239	2.982	0.059	-0.449	13.799
tape[T.slow]	1.7905	1.498	1.195	0.318	-2.977	6.558
laser[T.50W]	7.0870	1.618	4.381	0.022	1.939	12.235
laser[T.60W]	13.1680	2.014	6.538	0.007	6.759	19.578
crossp	-0.0335	0.050	-0.671	0.550	-0.193	0.126

	strength	laser	tape
1	25.66	40W	slow
2	29.15	50W	slow
3	35.73	60W	slow
4	28.00	40W	medium
5	35.09	50W	medium
6	39.56	60W	medium
7	20.65	40W	fast
8	29.79	50W	fast
9	35.66	60W	fast

Omnibus:	0.288	Durbin-Watson:	2.095
Prob(Omnibus):	0.866	Jarque-Bera (JB):	0.368
Skew:	0.306	Prob(JB):	0.832
Kurtosis:	2.220	Cond. No.	146.

	df	sum_sq	mean_sq	F	PR(>F)
tape	2.0	48.918689	24.459344	8.033725	0.062408
laser	2.0	224.183822	112.091911	36.816831	0.007746
crossp	1.0	1.369294	1.369294	0.449748	0.550466
Residual	3.0	9.133750	3.044583	NaN	NaN

Omnibus:	0.284	Durbin-Watson:	1.929
Prob(Omnibus):	0.868	Jarque-Bera (JB):	0.411
Skew:	-0.128	Prob(JB):	0.814
Kurtosis:	1.985	Cond. No.	1.73

Omnibus:	0.659	Durbin-Watson:	1.972
Prob(Omnibus):	0.719	Jarque-Bera (JB):	0.553
Skew:	-0.205	Prob(JB):	0.758
Kurtosis:	1.857	Cond. No.	1.76e+03

	df	sum_sq	mean_sq	F	PR(>F)
operator	2.0	20.718	10.359	7.007	0.004
resin	7.0	283.946	40.564	27.439	0.000
operator:resin	14.0	14.335	1.024	0.693	0.760
Residual	24.0	35.480	1.478	NaN	NaN

	sum_sq	df	F	PR(>F)
Intercept	0.263	1.0	5.326	0.037
operator	1.490	2.0	15.117	0.000
resin	0.638	7.0	1.848	0.155
Residual	0.690	14.0	NaN	NaN

Dep. Variable:	psize	R-squared:	0.859
Model:	OLS	Adj. R-squared:	0.826
Method:	Least Squares	F-statistic:	25.82
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	1.47e-13
Time:	11:16:24	Log-Likelihood:	-69.000
No. Observations:	48	AIC:	158.0
Df Residuals:	38	BIC:	176.7
Df Model:	9
Covariance Type:	nonrobust

Omnibus:	13.065	Durbin-Watson:	2.517
Prob(Omnibus):	0.001	Jarque-Bera (JB):	17.339
Skew:	0.885	Prob(JB):	0.000172
Kurtosis:	5.353	Cond. No.	9.82

	df	sum_sq	mean_sq	F	PR(>F)
wool	1.0	2.901924	2.901924	3.022232	0.088542
tension	2.0	15.891615	7.945807	8.275225	0.000817
wool:tension	2.0	7.201396	3.600698	3.749976	0.030674
Residual	48.0	46.089229	0.960192	NaN	NaN

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	31.0322	0.540	57.452	0.000	29.533	32.532
C(tape, Poly).Linear	-1.0465	0.936	-1.119	0.326	-3.644	1.551
C(tape, Poly).Quadratic	-3.9001	0.936	-4.169	0.014	-6.498	-1.303
C(laser, Poly).Linear	8.6361	0.936	9.231	0.001	6.039	11.234
C(laser, Poly).Quadratic	-0.3810	0.936	-0.407	0.705	-2.979	2.216

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-55.0499	13.338	-4.127	0.009	-89.336	-20.763
np.log(Ntape)	46.5929	10.499	4.438	0.007	19.605	73.581
I(np.log(Ntape) ** 2)	-9.2378	2.028	-4.554	0.006	-14.452	-4.024
Nlaser	0.6107	0.060	10.113	0.000	0.455	0.766

	psize	operator	resin
1	36.2	1	1
2	36.3	1	1
3	35.3	1	2
4	35.0	1	2
5	30.8	1	3

	psize
operator
1	32.94375
2	32.68125
3	31.43750

Dep. Variable:	np.sqrt(breaks)	R-squared:	0.361
Model:	OLS	Adj. R-squared:	0.294
Method:	Least Squares	F-statistic:	5.415
Date:	Wed, 26 Sep 2018	Prob (F-statistic):	0.000500
Time:	11:16:25	Log-Likelihood:	-72.346
No. Observations:	54	AIC:	156.7
Df Residuals:	48	BIC:	168.6
Df Model:	5
Covariance Type:	nonrobust

Omnibus:	4.119	Durbin-Watson:	2.147
Prob(Omnibus):	0.127	Jarque-Bera (JB):	1.882
Skew:	0.034	Prob(JB):	0.390
Kurtosis:	2.088	Cond. No.	9.77

	Difference	lb	ub
wool[A]:tension[H]-wool[B]:tension[H]	0.554178	-0.818327	1.926683
wool[B]:tension[H]-wool[A]:tension[L]	-2.245378	-3.617883	-0.872873
wool[B]:tension[H]-wool[B]:tension[L]	-0.935944	-2.308449	0.436561
wool[A]:tension[M]-wool[B]:tension[M]	-0.472707	-1.845212	0.899798
wool[B]:tension[H]-wool[A]:tension[M]	-0.523746	-1.896251	0.848759
wool[B]:tension[H]-wool[B]:tension[M]	-0.996453	-2.368958	0.376051
wool[A]:tension[L]-wool[A]:tension[M]	1.721631	0.349127	3.094136
wool[A]:tension[H]-wool[B]:tension[M]	-0.442276	-1.814781	0.930229
wool[A]:tension[L]-wool[B]:tension[L]	1.309434	-0.063071	2.681939
wool[A]:tension[L]-wool[B]:tension[M]	1.248924	-0.123580	2.621429
wool[A]:tension[H]-wool[A]:tension[M]	0.030431	-1.342073	1.402936
wool[A]:tension[H]-wool[B]:tension[L]	-0.381766	-1.754271	0.990739
wool[B]:tension[L]-wool[A]:tension[M]	0.412198	-0.960307	1.784702
wool[A]:tension[H]-wool[A]:tension[L]	-1.691200	-3.063705	-0.318695
wool[B]:tension[L]-wool[B]:tension[M]	-0.060510	-1.433014	1.311995

	h	d	l	b	j	f	n	a	i	e	m	c	k	g	o	y
1	-	-	+	-	+	+	-	-	+	+	-	+	-	-	+	0.4850
2	+	-	-	-	-	+	+	-	-	+	+	+	+	-	-	0.5750
3	-	+	-	-	+	-	+	-	+	-	+	+	-	+	-	0.0875
4	+	+	+	-	-	-	-	-	-	-	-	+	+	+	+	0.1750
5	-	-	+	+	-	-	+	-	+	+	-	-	+	+	-	0.1950
6	+	-	-	+	+	-	-	-	-	+	+	-	-	+	+	0.1450
7	-	+	-	+	-	+	-	-	+	-	+	-	+	-	+	0.2250
8	+	+	+	+	+	+	+	-	-	-	-	-	-	-	-	0.1750
9	-	-	+	-	+	+	-	+	-	-	+	-	+	+	-	0.1250
10	+	-	-	-	-	+	+	+	+	-	-	-	-	+	+	0.1200
11	-	+	-	-	+	-	+	+	-	+	-	-	+	-	+	0.4550
12	+	+	+	-	-	-	-	+	+	+	+	-	-	-	-	0.5350
13	-	-	+	+	-	-	+	+	-	-	+	+	-	-	+	0.1700
14	+	-	-	+	+	-	-	+	+	-	-	+	+	-	-	0.2750
15	-	+	-	+	-	+	-	+	-	+	-	+	-	+	-	0.3425
16	+	+	+	+	+	+	+	+	+	+	+	+	+	+	+	0.5825

Omnibus:	17.995	Durbin-Watson:	0.602
Prob(Omnibus):	0.000	Jarque-Bera (JB):	18.033
Skew:	-1.744	Prob(JB):	0.000121
Kurtosis:	6.859	Cond. No.	9.90

Software	Version
Python	3.7.0 64bit [Clang 4.0.1 (tags/RELEASE_401/final)]
IPython	6.5.0
OS	Darwin 17.7.0 x86_64 i386 64bit
pandas	0.23.4
numpy	1.15.1
matplotlib	2.2.3
seaborn	0.9.0
scipy	1.1.0
patsy	0.5.0
statsmodels	0.9.0
Wed Sep 26 11:16:26 2018 BST