Load python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import seaborn as sns
Load in the Pima Indians data
pima = pd.read_csv("data/pima.csv")
pima.head()
Construct numerical summary. In contrast to R, only numerical variables are summarized (but that's all of them here). Round improves the presentation but is not well controlled
pima.describe().round(1)
Sort diastolic and show the first few values
pima['diastolic'].sort_values().head()
Replace all the instances of zero in five variables with the missing value code. Python code neater than R here.
pima.replace({'diastolic' : 0, 'triceps' : 0, 'insulin' : 0, 'glucose' : 0, 'bmi' : 0}, np.nan, inplace=True)
pima.head()
Turn test into a categorical variable. Describe is not like R table.
pima['test'] = pima['test'].astype('category')
pima['test'] = pima['test'].cat.rename_categories(['Negative','Positive'])
pima['test'].describe()
This produces output like R table.
pima['test'].value_counts()
Produce histogram using seaborn
. Necessary to drop the missing values. The kernel density estimate is also plotted. (Warning message can be ignored)
sns.distplot(pima.diastolic.dropna())
Sort data, remove missing values and make a line plot using an index created over the range of the observations.
pimad = pima.diastolic.dropna().sort_values()
sns.lineplot(range(0, len(pimad)), pimad)
Scatter plot. Need to use smaller plotting character size with s=20.
sns.scatterplot(x='diastolic',y='diabetes',data=pima, s=20)
Produces the boxplot using the seaborn
package.
sns.boxplot(x="test", y="diabetes", data=pima)
We use some alpha transparency in the following plot to distinguish the different levels of test on the scatter plot of the two quantitative variables:
sns.scatterplot(x="diastolic", y="diabetes", data=pima, hue="test", alpha=0.3)
Can do faceting.
sns.relplot(x="diastolic", y="diabetes", data=pima, col="test")
Read in manilius data:
manilius = pd.read_csv("data/manilius.csv")
manilius.head()
Do ancient calculation
moon3 = manilius.groupby('group')[['arc','sinang', 'cosang']].sum()
moon3
Add an intercept column
moon3['intercept'] = [9]*3
moon3
Solve the linear equation:
np.linalg.solve(moon3[['intercept','sinang','cosang']],moon3['arc'])
Can do linear models with R style formulae.
import statsmodels.formula.api as smf
mod = smf.ols(formula='arc ~ sinang + cosang', data=manilius)
res = mod.fit()
res.params
Read in family height data:
families = pd.read_csv("data/families.csv")
families.head()
Scatter plot - need to reduce point size with s=20
sns.scatterplot(x='midparentHeight', y='childHeight',data=families,s=20)
Calculate lm fit:
mod = smf.ols(formula='childHeight ~ midparentHeight', data=families)
res = mod.fit()
res.params
Add the regression line to the plot. Seaborn's lmplot function does this automatically. We don't want the confidence intervals.
sns.lmplot('midparentHeight', 'childHeight', families, ci=None, scatter_kws={'s':2})
Calculate using the direct formulae.
cor = sp.stats.pearsonr(families['childHeight'],families['midparentHeight'])[0]
sdy = np.std(families['childHeight'])
sdx = np.std(families['midparentHeight'])
beta = cor*sdy/sdx
alpha = np.mean(families['childHeight']) - beta*np.mean(families['midparentHeight'])
np.round([alpha,beta],2)
Calculate the appropriate y=x line to show the regression effect. Python appears to lack a convenient abline() function as in R so we have to manually calculate the endpoints of the added line.
beta1 = sdy/sdx
alpha1 = np.mean(families['childHeight']) - beta1*np.mean(families['midparentHeight'])
sns.lmplot('midparentHeight', 'childHeight', families, ci=None, scatter_kws={'s':2})
xl,xu = [64, 76]
plt.plot([xl,xu], [alpha1+xl*beta1,alpha1+xu*beta1],c='Red')
%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels