Load the packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import statsmodels.api as sm
import statsmodels.formula.api as smf
Load the data:
globwarm = pd.read_csv("data/globwarm.csv")
globwarm.head()
Fit the model:
lmod = smf.ols(formula='nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask + urals + mongolia + tasman',
data=globwarm).fit()
lmod.summary()
Plot successive residuals:
plt.scatter(lmod.resid.iloc[:-1],lmod.resid.iloc[1:])
plt.axhline(0,alpha=0.5)
plt.axvline(0,alpha=0.5)
plt.show()
Compute the correlation:
np.corrcoef(lmod.resid.iloc[:-1],lmod.resid.iloc[1:])
Fit the uncorrelated model and then iterate the fit. Produces result very similar to the correlation of the residuals. Different from the R correlation of 0.71
globwarm = globwarm.dropna()
X = sm.add_constant(globwarm.iloc[:,1:9])
gmod = sm.GLSAR(globwarm.nhtemp, X, rho=1)
res=gmod.iterative_fit(maxiter=6)
gmod.rho
Alternative method that shows the iterative nature of the solution:
gmod = sm.GLSAR(globwarm.nhtemp, X, rho=1)
for i in range(6):
results = gmod.fit()
print("AR coefficients: {0}".format(gmod.rho))
rho, sigma = sm.regression.yule_walker(results.resid,order=gmod.order)
gmod = sm.GLSAR(globwarm.nhtemp, X, rho)
Read in the data for the blocked example. Need to make variety categorical. For reasons unclear to me, the mixed effect modeling function throws an error if the response is called yield. We create the same variable but with a different name: grams.
oatvar = pd.read_csv("data/oatvar.csv", index_col=0)
oatvar['variety'] = oatvar['variety'].astype('category')
oatvar['grams'] = oatvar['yield']
oatvar.head()
Can do mixed effect model but would need to calculate correlation from this.
mmod = smf.mixedlm("grams ~ variety",oatvar,groups=oatvar['block']).fit()
mmod.summary()
GEE model is also possible
ind = sm.cov_struct.Exchangeable()
gmod = smf.gee("grams ~ variety", "block", oatvar, cov_struct = ind ).fit()
gmod.summary()
ind.summary()
Read in the French Presidential Election data:
fpe = pd.read_csv("data/fpe.csv",index_col=0)
fpe.head()
Fit the model with weights:
wmod = smf.wls("A2 ~ A + B + C + D + E + F + G + H + J + K +N -1", data=fpe, weights = 1/fpe.EI ).fit()
wmod.params
Fit the model without weights:
lmod = smf.wls("A2 ~ A + B + C + D + E + F + G + H + J + K +N -1", data=fpe).fit()
lmod.params
Multiplying the weights by a constant makes no difference:
wmod = smf.wls("A2 ~ A + B + C + D + E + F + G + H + J + K +N -1", data=fpe, weights = 53/fpe.EI ).fit()
wmod.params
Fit a reduced model. Doesn't seem to be an offset option so need to modify the response:
y = fpe.A2 - fpe.A - fpe.G - fpe.K
X = fpe.loc[:,["C","D","E","F","N"]]
wmod = sm.WLS(y, X, weights = 53/fpe.EI ).fit()
wmod.params
Contrained least squares - need some trickery to get the weights right.
y = fpe.A2
X = fpe.loc[:,["A","B","C","D","E","F","G","H","J","K","N"]]
weights = 1/fpe.EI
Xw = (X.T * np.sqrt(weights)).T
yw = y * np.sqrt(weights)
res = sp.optimize.lsq_linear(Xw, yw, bounds=(0, 1))
pd.Series(np.round(res.x,3),index=lmod.params.index)
IRWLS example
cars = pd.read_csv("data/cars.csv")
cars.head()
Show that heteroscedascity is present
lmod = smf.ols(formula='dist ~ speed', data=cars).fit()
plt.scatter(lmod.fittedvalues,lmod.resid)
plt.axhline(0)
plt.show()
lmod.params
Iterate once using absolute residuals as response for linear relationship in the SD
gamma = np.mean(np.abs(lmod.resid) - cars.speed)
lmod = smf.wls(formula='dist ~ speed', data=cars, weights=np.sqrt(1/(gamma+cars.speed))).fit()
gamma, lmod.params
Plot the weights
weights=np.sqrt(1/(gamma+cars.speed))
plt.scatter(cars.speed, weights)
plt.show()
Read in the data:
corrosion = pd.read_csv("data/corrosion.csv")
corrosion.head()
Fit the model:
lmod = smf.ols(formula='loss ~ Fe', data=corrosion).fit()
lmod.summary()
Set up the plot:
fig, ax = plt.subplots()
ax.scatter(corrosion.Fe, corrosion.loss)
plt.xlabel("Fe")
plt.ylabel("loss")
xr = np.array(ax.get_xlim())
ax.plot(xr, lmod.params[0] + lmod.params[1] * xr)
Fit the ANOVA-style model to get the replicate fits:
corrosion['Fefac'] = corrosion['Fe'].astype('category')
amod = smf.ols(formula='loss ~ Fefac', data=corrosion).fit()
ax.scatter(corrosion.Fe, amod.fittedvalues, marker='x')
plt.show()
Last line contains the answer. Can ignore annoying warnings.
sm.stats.anova_lm(lmod, amod)
Polynomial fit
pc = np.polyfit(corrosion.Fe, corrosion.loss, 6)
pc
fig, ax = plt.subplots()
ax.scatter(corrosion.Fe, corrosion.loss)
plt.xlabel("Fe")
plt.ylabel("loss")
ax.scatter(corrosion.Fe, amod.fittedvalues, marker='x')
grid = np.linspace(0,2,50)
ax.plot(grid,np.poly1d(pc)(grid))
plt.show()
Manually calculate R-squared.
pc, rss, _, _, _ = np.polyfit(corrosion.Fe, corrosion.loss, 6, full=True)
tss = np.sum((corrosion.loss-np.mean(corrosion.loss))**2)
1-rss/tss
Load Galapagos data
gala = pd.read_csv("data/gala.csv",index_col=0)
gala.drop('Endemics', axis=1, inplace=True)
gala.head()
Least squares fit:
lsmod = smf.ols(formula='Species ~ Area + Elevation + Nearest + Scruz + Adjacent', data=gala).fit()
lsmod.summary()
Fit the robust regression using default which is Huber T:
exog = gala.drop('Species',axis=1)
exog = sm.add_constant(exog)
rlmod = sm.RLM(gala.Species,exog).fit()
rlmod.summary()
Check the small weights:
wts = rlmod.weights
wts[wts < 1]
L1 or LAD fit:
l1mod = sm.QuantReg(gala.Species, exog).fit()
l1mod.summary()
This is not the same as least trimmed squares but as close as we can come with sm.RLM
:
ltsmod = sm.RLM(gala.Species, exog, M=sm.robust.norms.TrimmedMean()).fit()
ltsmod.summary()
Results come with standard errors so no need to do bootstrapping.
Fit the Galapagos model without Isabela:
wts = gala.index != 'Isabela'
limod = smf.wls(formula='Species ~ Area + Elevation + Nearest + Scruz + Adjacent', data=gala, weights=wts).fit()
limod.summary()
Show example with multiple outliers:
star = pd.read_csv("data/star.csv")
star.head()
gs1 = smf.ols(formula='light ~ temp', data=star).fit()
gs2 = smf.RLM(star.light, sm.add_constant(star.temp), data=star).fit()
gs3 = smf.RLM(star.light, sm.add_constant(star.temp), data=star,M=sm.robust.norms.TrimmedMean(c=0.1) ).fit()
plt.scatter(star.temp, star.light)
plt.plot([3.4, 4.7], [gs1.params[0] + gs1.params[1]*3.4, gs1.params[0] + gs1.params[1]*4.7])
plt.plot([3.4, 4.7], [gs2.params[0] + gs2.params[1]*3.4, gs2.params[0] + gs2.params[1]*4.7])
plt.plot([3.4, 4.7], [gs3.params[0] + gs3.params[1]*3.4, gs3.params[0] + gs3.params[1]*4.7])
plt.show()
Does not work. This class of estimators cannot do the job for this problem.
%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels