Linear Models with R translated to Python

I have translated the R code in Linear Models with R into Python. The code is available as Jupyter Notebooks.

I was able to translate most of the content into Python. Sometimes the output is similar but not identical. Python has far less statistics functionality than R, but most of what is in base R can be found somewhere in Python. R now has over 10,000 packages; Python has about ten times as many, but most of these are unrelated to Statistics. My book does not depend heavily on additional packages, so this was not much of an obstacle for me. In a few cases, I rely on R packages that do not exist in Python. Doubtless a Python equivalent could be created, but that would take some effort.

After this experience, I can say that R is a better choice for Statistics than Python. Nevertheless, there are good reasons why one might choose to do Statistics with Python. One obvious reason is that if you already know Python, you will be reluctant to also learn R. In the UK, Python is now being taught in schools, and we will soon have a wave of students who come to university knowing Python. Python usage is also more common in several areas such as Computer Science and Engineering. Another reason to use Python is the huge range of packages, from text, image and signal processing to machine learning and optimisation. These go far beyond what can be found in R. The Python userbase is much larger than R's, and this has translated into greater functionality as a programming language.

I started using S in 1984 and moved on to R when it was first released. It is hard to move from 34 years of experience with S and R to a language in which I had no prior experience. Here are a few impressions that may help other R users who start learning Python:

1. Base R is quite functional without loading any packages. In Python, you will always need to load some packages, even for basic computations. You will probably need to load numpy, scipy, pandas, statsmodels and matplotlib just to get something similar to the base R environment (a minimal import block is sketched after this list).
2. Python is very fussy about namespaces. You will find yourself having to prefix every loaded function. For example, you cannot write log(x); you need to write np.log(x) to indicate that log comes from the numpy package. I understand the reason for this, but it alone makes Python code longer than R code.
3. Python array indices start from zero. Again, I know why this is, but it is something the R user has to continually adjust to.
4. matplotlib is the Python equivalent of R's base plotting functionality. It does more than R, but the range of options is daunting for the new user. I had a better time with seaborn, which is more like ggplot in producing attractive plots. (There is a partial translation of ggplot into Python, but I preferred seaborn.)
5. statsmodels provides the linear modelling functionality found in R, but you will find some differences that will trip you up. In particular, no intercept term is included by default, and the handling of saturated models is different: the Moore-Penrose inverse is used rather than dropping some offending columns as R does. The output from the linear model is far too verbose for my tastes. Of course, you can work around all these issues (see the statsmodels sketch after this list).
6. Python makes heavy use of method chaining, which works much like pipes. It helps if you have already started using pipes in R via the %>% operator to get into that frame of mind.
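To make points 1-3 concrete, here is a minimal sketch of the imports I found myself needing before doing anything, together with the namespace prefixes and the zero-based indexing the R user has to get used to. The array values are made up purely for illustration.

```python
# A rough Python counterpart of the "base R" environment: these imports
# are needed before doing even simple statistical computations.
import numpy as np
import scipy.stats as stats
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 4.0, 8.0])

# Every function needs its package prefix: log(x) alone fails,
# you must write np.log(x).
logx = np.log(x)

# Indexing starts at zero: x[0] is the first element (x[1] in R)
# and x[-1] is the last (like tail(x, 1) in R).
first, last = x[0], x[-1]
```

And here is a sketch of the statsmodels differences from point 5, fitting the same simple regression with the formula interface (which does add an intercept) and the array interface (which does not), on a small invented data frame; the last lines show the pandas method chaining mentioned in point 6. The data frame and variable names are my own invention, not from the book.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# A small invented data frame, purely for illustration.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": np.arange(10.0)})
df["y"] = 2 + 3 * df["x"] + rng.normal(size=10)

# The formula interface includes an intercept, like lm(y ~ x) in R.
fit1 = smf.ols("y ~ x", data=df).fit()

# The array interface does not: the constant column must be added by hand.
fit2 = sm.OLS(df["y"], sm.add_constant(df[["x"]])).fit()

# fit1.summary() is far more verbose than R's summary(); params and bse
# give just the coefficients and their standard errors.
print(fit1.params)
print(fit1.bse)

# pandas encourages chaining methods, much as %>% chains calls in R.
print(df.query("x > 2")
        .assign(x2=lambda d: d["x"] ** 2)
        .head())
```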

New users inevitably encounter some frustrations, but I did find Python enjoyable. I have nightmares about having to do Statistics in Excel, but the Python world is a pleasant land, even if it is still a bit unfamiliar to me.

Julian Faraway
Professor of Statistics at the University of Bath