Load the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
from scipy import stats
coagulation = pd.read_csv("data/coagulation.csv", index_col=0)
coagulation.head()
pandas
version of boxplot
coagulation.boxplot('coag',by='diet')
plt.suptitle("")
plt.show()
seaborn
version of boxplot
sns.boxplot(x="diet", y="coag", data=coagulation)
plt.show()
Maybe seaborn plot is preferable (although without the colors). Pandas version is annoying over the title and failure to label the y-axis.
swarmplot
is good when there are replicated points:
sns.swarmplot(x="diet", y="coag", data=coagulation)
plt.show()
Fit the model:
lmod = smf.ols("coag ~ diet", coagulation).fit()
lmod.summary()
Examine the coefficients:
lmod.params.round(1)
Look at the design matrix:
import patsy
patsy.dmatrix('~ diet', coagulation)
Test the significance of diet.
sm.stats.anova_lm(lmod)
Fit without an intercept term:
lmodi = smf.ols("coag ~ diet-1", coagulation).fit()
lmodi.summary()
Do the ANOVA explicitly:
%%capture --no-display
lmodnull = smf.ols("coag ~ 1", coagulation).fit()
sm.stats.anova_lm(lmodnull, lmod)
Do it with sum contrasts:
from patsy.contrasts import Sum
lmods = smf.ols("coag ~ C(diet,Sum)", coagulation).fit()
lmods.summary()
sns.residplot(lmod.fittedvalues, lmod.resid)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
fig = sm.qqplot(lmod.resid, line="q")
plt.show()
Do the Levene test:
coagulation['meds'] = coagulation.groupby('diet').transform(np.median)
coagulation['mads'] = abs(coagulation.coag - coagulation.meds)
coagulation.head()
lmodb = smf.ols('mads ~ diet', coagulation).fit()
sm.stats.anova_lm(lmodb)
coagulation.coag[coagulation.diet == "A"]
Do the Bartlett test:
stats.bartlett(coagulation.coag[coagulation.diet == "A"],
coagulation.coag[coagulation.diet == "B"],
coagulation.coag[coagulation.diet == "C"],
coagulation.coag[coagulation.diet == "D"])
Assemble the pieces necessary for pairwise comparisons:
lmod.bse
lmod.params
stats.t.ppf(0.975,20)
lmod.params[1] + np.array([-1, 1]) * stats.t.ppf(0.975,20) * lmod.bse[1]
Can do the Tukey pairwise:
thsd = sm.stats.multicomp.pairwise_tukeyhsd(coagulation.coag, coagulation.diet)
thsd.summary()
fig = thsd.plot_simultaneous()
plt.show()
Read in the data:
jsp = pd.read_csv("data/jsp.csv", index_col=0)
jsp.head()
jsp['mathcent'] = jsp.math - np.mean(jsp.math)
Try two different plotting methods
jsp.boxplot('mathcent',by='school')
plt.suptitle("")
plt.xticks(fontsize=8, rotation=90)
plt.show()
sns.boxplot(x="school", y="mathcent", data=jsp)
plt.xticks(fontsize=8, rotation=90)
plt.show()
lmod = smf.ols("mathcent ~ C(school) - 1", jsp).fit()
lmod.summary()
sm.stats.anova_lm(smf.ols("mathcent ~ C(school)", jsp).fit())
lmod.pvalues
from statsmodels.sandbox.stats.multicomp import multipletests
reject, padj, _, _ = multipletests(lmod.pvalues, method="bonferroni")
These schools are rejected by Bonferroni.
lmod.params[reject]
selsch = np.argsort(lmod.pvalues)[np.sort(lmod.pvalues) < np.arange(1,50)*0.05/49]
lmod.params.index[selsch]
Schools identified using false discovery rate.
reject, padj, _, _ = multipletests(lmod.pvalues, method="fdr_bh")
lmod.params[reject]
%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels