Load the packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%precision 7
Read in the data:
fat = pd.read_csv("data/fat.csv")
fat.head()
Some plots of the data:
plt.scatter(fat.knee, fat.neck)
plt.xlabel("Knee")
plt.ylabel("Neck")
plt.show()
plt.scatter(fat.thigh, fat.chest)
plt.xlabel("Thigh")
plt.ylabel("Chest")
plt.show()
plt.scatter(fat.wrist, fat.hip)
plt.xlabel("Wrist")
plt.ylabel("Hip")
plt.show()
Consider only the circumference measurements:
cfat = fat.iloc[:,8:]
cfat.head()
Get PCA from sklearn
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(cfat)
SDs of the principal components
np.sqrt(pca.explained_variance_)
Proportion of explained variance
pca.explained_variance_ratio_
np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
Get the rotation matrix
rot = pca.components_
rot.shape
Rows are the loadings. Here is the first:
rot[0,:]
Redo with scaled data
from sklearn.preprocessing import scale
scalfat = pd.DataFrame(scale(cfat))
scalfat.head()
pcac = PCA()
pcac.fit(scalfat)
np.sqrt(pcac.explained_variance_)
pcac.explained_variance_ratio_
np.cumsum(np.round(pcac.explained_variance_ratio_, decimals=4)*100)
PC1 is mostly a weighted average of all the circumference measurements:
rotc = pcac.components_
rotc[0,:]
rotc[1,:]
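A quick check (a sketch): if PC1 is essentially a weighted average, all of its loadings should share the same sign.
# check whether every PC1 loading has the same sign as the first one
(np.sign(rotc[0,:]) == np.sign(rotc[0,0])).all()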
Do robust mean and covariance estimation. Not the same as R but similar.
from sklearn.covariance import EllipticEnvelope
ee = EllipticEnvelope()
ee.fit(cfat)
ee.location_
ee.covariance_.shape
Not exactly the same result as R but close.
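Another robust option is the minimum covariance determinant (MCD) estimator, also available in sklearn; a minimal sketch (not necessarily the estimator used in R):
from sklearn.covariance import MinCovDet
# robust location and covariance via the MCD
mcd = MinCovDet().fit(cfat)
mcd.location_
mcd.covariance_.shape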
Compute and plot the Mahalanobis distances:
from scipy.spatial.distance import mahalanobis
Vi = np.linalg.inv(ee.covariance_)
md = cfat.apply(lambda x: (mahalanobis(x, ee.location_, Vi)), axis=1)
md.head()
from scipy import stats
n=len(md)
ix = np.arange(1,n+1)
halfq = stats.norm.ppf((n+ix)/(2*n+1))
plt.scatter(halfq, np.sort(md))
plt.show()
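Since squared Mahalanobis distances of multivariate normal data follow a chi-squared distribution with p degrees of freedom, we could also compare against square-root chi-squared quantiles; a minimal sketch:
from scipy import stats
p = cfat.shape[1]
# theoretical quantiles of sqrt(chi-squared_p) at the usual plotting positions
chiq = np.sqrt(stats.chi2.ppf((ix - 0.5)/n, p))
plt.scatter(chiq, np.sort(md))
plt.xlabel("Square-root chi-squared quantiles")
plt.ylabel("Sorted Mahalanobis distances")
plt.show()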
Do the principal components regression. First fit an OLS regression of brozek on all the circumference predictors:
import statsmodels.api as sm
import statsmodels.formula.api as smf
xmat = sm.add_constant(cfat)
lmod = sm.OLS(fat.brozek, xmat).fit()
lmod.summary()
Now use only the first 2 principal components:
pcscores = pca.fit_transform(scale(cfat))
pcscores[:6,:2]
xmat = sm.add_constant(pcscores[:,:2])
lmod = sm.OLS(fat.brozek, xmat).fit()
lmod.summary()
cfat.columns
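As a sketch, the implied coefficients on the scaled circumference variables can be recovered by combining the loadings with the two PC regression coefficients (the same reconstruction used later for the spectroscopy data):
# map the 2 PC coefficients back to the 10 scaled predictors
pd.Series(np.dot(pca.components_[:2,:].T, lmod.params[1:]), index=cfat.columns)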
Use a simplified model with the scaled abdomen circumference and the difference between the scaled ankle and abdomen circumferences:
xmat = pd.concat([scalfat.iloc[:,2], scalfat.iloc[:,6] - scalfat.iloc[:,2]],axis=1)
xmat = sm.add_constant(xmat)
lmod = sm.OLS(fat.brozek, xmat).fit()
lmod.summary()
Read in the meat spectrometer data:
meatspec = pd.read_csv("data/meatspec.csv",index_col=0)
meatspec.head()
Regression on the training set produces a very good fit. Differs slightly from R.
trainmeat = meatspec.iloc[:172,]
testmeat = meatspec.iloc[172:,]
modlm = sm.OLS(trainmeat.fat, sm.add_constant(trainmeat.iloc[:,:-1])).fit()
modlm.rsquared
Set up an RMSE function
def rmse(x,y):
    return np.sqrt(np.mean((x-y)**2))
RMSE on the training data:
rmse(modlm.fittedvalues, trainmeat.fat)
RMSE on the test data:
testpv = modlm.predict(sm.add_constant(testmeat.iloc[:,:-1]))
rmse(testpv, testmeat.fat)
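The spectral channels are highly correlated, which helps explain why the full OLS fit generalizes poorly; as a rough check, we can look at the condition number of the training predictors:
# condition number of the 100-channel predictor matrix
np.linalg.cond(trainmeat.drop('fat', axis=1))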
Do some feature selection first. There is no need to scale since the variables are commensurate. This is not the same as AIC-based selection with step() in R: RFECV recursively eliminates the features with the smallest coefficients and uses cross-validation to choose how many to keep.
from sklearn.feature_selection import RFECV
from sklearn import linear_model
reg = linear_model.LinearRegression(fit_intercept=True)
X = trainmeat.drop('fat',axis=1)
reg.fit(X, trainmeat.fat)
selector = RFECV(reg, step=1, cv=10)
selector = selector.fit(X, trainmeat.fat)
selector.ranking_
Only a small number of predictors are selected:
Xmat = X.iloc[:,selector.support_]
Xmat.shape
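We can list which frequency channels were retained:
list(Xmat.columns)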
modsteplm = sm.OLS(trainmeat.fat, sm.add_constant(Xmat)).fit()
rmse(modsteplm.fittedvalues, trainmeat.fat)
Result on the test set is much better than before.
Xmatest = testmeat.drop('fat',axis=1)
testpv = modsteplm.predict(sm.add_constant(Xmatest.iloc[:,selector.support_]))
rmse(testpv, testmeat.fat)
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X)
sd10 = np.sqrt(pca.explained_variance_).round(2)[:10]
sd10
pcscores = pca.fit_transform(X)
pcscores.shape
We need to invert PC2 because its direction differs from R (not a problem: the sign of a principal component is arbitrary).
freq = np.arange(0,100)
rotc = pca.components_
plt.plot(freq, rotc[0,:])
plt.plot(freq, -rotc[1,:])
plt.plot(freq, rotc[2,:])
plt.xlabel("Frequency")
plt.ylabel("Coefficient")
plt.show()
pcrmod = sm.OLS(trainmeat.fat, sm.add_constant(pcscores[:,:4])).fit()
rmse(pcrmod.fittedvalues, trainmeat.fat)
Plot the coefficients from the full linear model against frequency:
plt.plot(freq,modlm.params[1:])
plt.xlabel("Frequency")
plt.ylabel("Coefficient")
plt.show()
pcrmod.params
We can duplicate this type of plot for the PCR fit by combining the loadings with the PC regression coefficients:
pceff = np.dot(pca.components_[:4,].T, pcrmod.params[1:])
plt.plot(freq, pceff)
plt.xlabel("Frequency")
plt.ylabel("Coefficient")
plt.show()
Can do the same thing using the @ operator but the vector needs to be a numpy array.
pceff = pca.components_[:4,].T @ np.array(pcrmod.params[1:])
plt.plot(freq, pceff)
plt.xlabel("Frequency")
plt.ylabel("Coefficient")
plt.show()
plt.plot(np.arange(1,11), sd10)
plt.xlabel("PC number")
plt.ylabel("SD of PC")
plt.show()
Demonstrate the method by which the design matrix in the PCR is constructed from the original X matrix and the loadings. Note that X needs to be centered.
m = X.mean(axis=0)
Xc = X - m
rotX = Xc @ pca.components_[:4,].T
rotX.iloc[0,:]
pcscores[0,:4]
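As a check, every row should reproduce the sklearn scores (up to floating point):
np.allclose(rotX, pcscores[:,:4])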
Now do it for the test data.
Xtest = testmeat.drop('fat',axis=1)
Xtestc = Xtest - m
rotX = Xtestc @ pca.components_[:4,].T
testpv = pcrmod.params[0] + np.dot(rotX,pcrmod.params[1:])
rmse(testpv, testmeat.fat)
ncomp = np.arange(1,51)
rmsep = np.empty(50)
for icomp in ncomp:
    Xtest = testmeat.drop('fat',axis=1)
    Xtestc = Xtest - m
    rotX = Xtestc @ pca.components_[:icomp,].T
    pcrmod = sm.OLS(trainmeat.fat, sm.add_constant(pcscores[:,:icomp])).fit()
    testpv = pcrmod.params[0] + np.dot(rotX,pcrmod.params[1:])
    rmsep[icomp-1] = rmse(testpv, testmeat.fat)
plt.plot(ncomp, rmsep)
plt.ylabel("RMSEP")
plt.xlabel("Number of components")
plt.show()
np.argmin(rmsep)+1
Use k-fold cross-validation. (Note that KFold does not shuffle by default, so the folds are deterministic contiguous blocks; set shuffle=True for randomized folds.)
from sklearn.model_selection import KFold
nsp = 10
kf = KFold(n_splits=nsp)
rmsep = np.empty(nsp)
y = np.asarray(trainmeat.fat)
ncomp = 50
irmsep = np.empty(ncomp)
for icomp in range(1,ncomp+1):
    Xtrain = sm.add_constant(pcscores[:,:icomp])
    for k, (train, test) in enumerate(kf.split(Xtrain, y)):
        pmod = sm.OLS(y[train], Xtrain[train,:]).fit()
        testpv = pmod.predict(Xtrain[test,:])
        rmsep[k] = rmse(testpv, y[test])
    irmsep[icomp-1] = rmsep.mean()
plt.plot(range(1,ncomp+1), irmsep)
plt.ylabel("RMSEP")
plt.xlabel("Number of components")
plt.show()
np.argmin(irmsep)+1
This chooses a different number of components than R and gives slightly worse performance.
icomp = 20
pcrmod = sm.OLS(trainmeat.fat, sm.add_constant(pcscores[:,:icomp])).fit()
Xtrain = trainmeat.drop('fat',axis=1)
m = Xtrain.mean(axis=0)
Xtest = testmeat.drop('fat',axis=1) - m
rotX = Xtest @ pca.components_[:icomp,].T
testpv = pcrmod.params[0] + np.dot(rotX,pcrmod.params[1:])
rmse(testpv, testmeat.fat)
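As an alternative (a minimal sketch, using the Xtrain and y defined above), the PCA can be refit inside each fold with a sklearn pipeline, so the held-out fold contributes nothing to the scores:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
cvrmse = np.empty(50)
for icomp in range(1, 51):
    # PCA and the regression are both refit on each training fold
    pipe = make_pipeline(PCA(n_components=icomp), LinearRegression())
    scores = cross_val_score(pipe, Xtrain, y, cv=10, scoring='neg_root_mean_squared_error')
    cvrmse[icomp-1] = -scores.mean()
np.argmin(cvrmse) + 1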
The sklearn Python package and the pls R package are not the same, so we cannot achieve identical results. This section needs additional work.
Do PLS with 4 components:
Ytrain = trainmeat.fat
ncomp = 4
from sklearn.cross_decomposition import PLSRegression
plsreg = PLSRegression(scale=False,n_components=ncomp)
plsmod = plsreg.fit(Xtrain, Ytrain)
Xtrainr, Ytrainr = plsreg.transform(Xtrain, Ytrain)
Figure out how Ytrainr relates to Ytrain:
plt.scatter(Ytrain, Ytrainr)
plt.show()
np.mean(Ytrain - Ytrainr)
np.mean(Ytrain)
We can see that Ytrainr is just a mean-centered version of Ytrain.
We can also check that the transformed X has only 4 components:
Xtrain.shape
Xtrainr.shape
Compare the predicted values from PLS with the observed values:
pv = plsmod.predict(Xtrain)
plt.scatter(Ytrain, pv)
plt.show()
Not that good a fit:
rmse(Ytrain,np.squeeze(pv))
Now do the comparison for the test set:
Xtest = testmeat.drop('fat',axis=1)
Ytest = testmeat.fat
pvt = plsmod.predict(Xtest)
plt.scatter(Ytest, pvt)
plt.show()
rmse(np.asarray(Ytest),np.squeeze(pvt))
plsmod.x_loadings_.shape
Plot the coefficients mapped back to the original frequencies via the loadings:
regmod = sm.OLS(trainmeat.fat, sm.add_constant(plsmod.x_scores_)).fit()
pceff = plsmod.x_loadings_ @ np.array(regmod.params[1:])
plt.plot(freq, pceff)
plt.xlabel("Frequency")
plt.ylabel("Coefficient")
plt.show()
Compare the PLS predictions with the fitted values from the regression on the scores:
pvt = plsmod.predict(Xtrain)
fv = regmod.fittedvalues
plt.scatter(fv, pvt)
plt.show()
Do some CV to select the number of components:
from sklearn.model_selection import KFold
nsp = 10
kf = KFold(n_splits=nsp)
rmsep = np.empty(nsp)
y = np.asarray(trainmeat.fat)
Xtrain = trainmeat.drop('fat',axis=1)
ncomp = 50
irmsep = np.empty(ncomp)
The scores are computed from the full training set before the folds are split off, so it is not clear that exactly the right thing is being cross-validated here:
for icomp in range(1,ncomp+1):
    plsreg = PLSRegression(scale=False,n_components=icomp)
    plsmod = plsreg.fit(Xtrain, y)
    Xtraini = sm.add_constant(plsmod.x_scores_)
    for k, (train, test) in enumerate(kf.split(Xtraini, y)):
        pmod = sm.OLS(y[train], Xtraini[train,:]).fit()
        testpv = pmod.predict(Xtraini[test,:])
        rmsep[k] = rmse(testpv, y[test])
    irmsep[icomp-1] = rmsep.mean()
plt.plot(range(1,ncomp+1), irmsep)
plt.ylabel("RMSEP")
plt.xlabel("Number of components")
plt.show()
Would pick a large number of components.
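A cleaner alternative (a minimal sketch, using the Xtrain and y defined above) is to cross-validate the entire PLS fit, refitting it within each fold:
from sklearn.model_selection import cross_val_score
plscv = np.empty(50)
for icomp in range(1, 51):
    # PLS is refit on each training fold, so no scores leak across folds
    scores = cross_val_score(PLSRegression(scale=False, n_components=icomp), Xtrain, y,
                             cv=10, scoring='neg_root_mean_squared_error')
    plscv[icomp-1] = -scores.mean()
np.argmin(plscv) + 1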
Can reproduce the ridge regression plot.
from sklearn import linear_model
n_alphas = 200
alphas = np.logspace(-10, -5, n_alphas)
coefs = []
for a in alphas:
    ridge = linear_model.Ridge(alpha=a, fit_intercept=False)
    ridge.fit(Xtrain, y)
    coefs.append(ridge.coef_)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
More work would be needed to turn the ridge fit into predictions; this section needs further development.
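One possible route (a minimal sketch, not the approach used in R) is RidgeCV, which picks alpha by cross-validation and can then predict on the test set, reusing the alphas grid, Xtrain and y from above:
from sklearn.linear_model import RidgeCV
# leave-one-out style CV over the alphas grid defined above
ridgecv = RidgeCV(alphas=alphas).fit(Xtrain, y)
ridgecv.alpha_
rmse(ridgecv.predict(testmeat.drop('fat', axis=1)), testmeat.fat)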
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
Use CV to compute R^2 on a grid of alphas.
X = trainmeat.drop('fat',axis=1)
y = np.asarray(trainmeat.fat)
lasso = Lasso(random_state=0, max_iter=10000, tol=0.01)
alphas = np.logspace(-3, -1, 30)
tuned_parameters = [{'alpha': alphas}]
n_folds = 3
clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']
plt.semilogx(alphas, scores)
# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)
plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')
# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)
plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])
plt.show()
The plot demonstrates a preference for smaller values of alpha. We choose alpha = 0.01 because it lies within the 2SD band of the optimum; this favors fewer non-zero coefficients in the lasso.
lasmod = Lasso(alpha=0.01, max_iter=100000)
lasmod.fit(X, y)
plt.bar(freq,lasmod.coef_)
plt.axhline(0)
plt.xlabel("Frequency")
plt.ylabel("Coefficient")
plt.show()
We see that relatively few coefficients are active. Compute the RMSE on the test set:
Xtest = testmeat.drop('fat',axis=1)
rmse(np.dot(Xtest, lasmod.coef_) + lasmod.intercept_, testmeat.fat)
Not a particularly good result. Use a much smaller alpha:
lasmod = Lasso(alpha=0.0001, max_iter=100000)
lasmod.fit(X, y)
Xtest = testmeat.drop('fat',axis=1)
rmse(np.dot(Xtest, lasmod.coef_) + lasmod.intercept_, testmeat.fat)
This results in a better fit although not as good as previously. Now compute the coefficient traces as alpha changes.
coefs = []
for a in alphas:
    lasmod = Lasso(alpha=a, max_iter=100000)
    lasmod.fit(X, y)
    coefs.append(lasmod.coef_)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Lasso coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
Result looks a bit dodgy for large values of alpha. Maybe something is wrong?
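As a cross-check, LassoCV (imported above but not used) selects alpha by cross-validation directly; a minimal sketch over the same alphas grid, X, y and Xtest as above:
lcv = LassoCV(alphas=alphas, cv=n_folds, max_iter=100000)
lcv.fit(X, y)
lcv.alpha_
rmse(lcv.predict(Xtest), testmeat.fat)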
%load_ext version_information
%version_information pandas, numpy, matplotlib, seaborn, scipy, patsy, statsmodels