Introduction
Much has been said about Simpson's paradox, the famous statistical paradox in which an association reverses when a covariate is added or removed.
My intention in this post is not to explain the paradox, as many have done so before me. Instead, I simply provide an illustration to set the context, and then focus on how to forge data that exhibits the paradox.
This post is named after a very inspiring article by Jacob Cohen [1], which explains how null hypothesis significance testing is misunderstood and misused, even though it is the bread and butter of tens of thousands of statisticians around the world.
The post is also inspired by David Louapre's excellent blog Science Étonnante (in French).
An example of Simpson’s paradox
Let's look at a real study performed in the seventies in a UK village [2]. The researchers surveyed women, both smokers and non-smokers, and then reported the percentage of women still alive 20 years later.
The result was the following:
The data shows that mortality was lower amongst the smokers. How surprising! Something must have gone wrong.
Indeed, if we refine the data and add a covariate, i.e. we now distinguish between age groups, we see this:
(smokers are in red, non smokers in pink)
What we now observe is that mortality has generally been higher for the smokers than for the non-smokers. Sounds more in line with what we expected, doesn't it?
So what happened? Well, that's an illustration of Simpson's paradox. The paradox may appear when an important covariate (a factor that has an influence on the result) is missing, and generally under specific circumstances involving unbalanced group sizes (non-homogeneity).
There are multiple examples of this paradox appearing in real-life situations, including the famous UC Berkeley students' admission discrimination case. At first glance, it appeared that the admission rate for male applicants was higher than that for female applicants (44% versus 35%). However, when splitting the data by department, the results no longer showed any discrimination, as the admission rates were approximately similar for both males and females. See this excellent article for more.
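To see how such a reversal can come about arithmetically, here is a minimal sketch with made-up numbers (two hypothetical departments, not the actual Berkeley figures): the group that applies mostly to the more selective department ends up with a lower overall admission rate, even though it does better within every single department.
# Hypothetical two-department illustration (not the real Berkeley data)
# Each tuple is (admitted, applicants)
men   = {"Dept A (easy)": (80, 100), "Dept B (hard)": (20, 100)}
women = {"Dept A (easy)": (17, 20),  "Dept B (hard)": (54, 180)}
for label, group in [("Men", men), ("Women", women)]:
    admitted = sum(a for a, n in group.values())
    applicants = sum(n for a, n in group.values())
    print(label, "overall:", round(100 * admitted / applicants, 1), "%")
    for dept, (a, n) in group.items():
        print("  ", dept, ":", round(100 * a / n, 1), "%")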
The original data
I became interested in forging data whilst studying linear regression. I was working on a small group project and wanted to come up with a striking example in which, although the coefficients would be statistically significant, the regression would lead to the wrong conclusion.
I started with the first study mentioned above (smoking/non-smoking). The raw data is presented below.
# Initialize environment
from __future__ import division  # Python 2 compatibility; must come before other imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the data from the original study
data = pd.read_csv("./smoking.data",delim_whitespace=True)
data
| | id | age | smoker | alive | dead |
|---|---|---|---|---|---|
| 0 | 1 | 18-24 | 1 | 53 | 2 |
| 1 | 2 | 18-24 | 0 | 61 | 1 |
| 2 | 3 | 25-34 | 1 | 121 | 3 |
| 3 | 4 | 25-34 | 0 | 152 | 5 |
| 4 | 5 | 35-44 | 1 | 95 | 14 |
| 5 | 6 | 35-44 | 0 | 114 | 7 |
| 6 | 7 | 45-54 | 1 | 103 | 27 |
| 7 | 8 | 45-54 | 0 | 66 | 12 |
| 8 | 9 | 55-64 | 1 | 64 | 51 |
| 9 | 10 | 55-64 | 0 | 81 | 40 |
| 10 | 11 | 65-74 | 1 | 7 | 29 |
| 11 | 12 | 65-74 | 0 | 28 | 101 |
| 12 | 13 | 75+ | 1 | 0 | 13 |
| 13 | 14 | 75+ | 0 | 0 | 64 |
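With the counts loaded, the paradox can already be reproduced numerically. Here is a quick sketch (using the data frame loaded above) that computes the percentage of women still alive, first per smoking status overall and then per age group:
# Reproduce the paradox from the raw counts: % alive, overall and per age group
data["total"] = data["alive"] + data["dead"]
print(data.groupby("smoker").apply(lambda g: 100*g["alive"].sum()/g["total"].sum()).round(1))
print(data.groupby(["age","smoker"]).apply(lambda g: 100*g["alive"].sum()/g["total"].sum()).round(1))
# Smokers have the higher survival rate overall, but the lower one in most age groups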
Since these are aggregated counts, the data cannot be visualised as a scatter plot, nor can a regression line be fitted to it directly. So I decided to forge my own data.
Forging a simple example
Let’s “study” the hypothetical effect of smoking on libido.
The idea is to create several separate clouds of points. Each cloud, regressed independently, will have a negative coefficient.
But the clouds are positioned in such a way that, when the regression is run over all the points together, the fitted line slopes in the opposite direction.
# Forge the data
import random
# Male data cloud
# xm: abscissa data, em = randomness for the ordinate, ym = ordinate, sexm = label
xm = [random.randint(18,35) for _ in range(40)]
em = [np.random.normal(0,7)+5 for _ in range(40)]
ym = [10-.5*xm[i]+em[i] for i in range(40)]
sexm = ["Male" for _ in range(40)]
# Female data cloud
# xf: abscissa data, ef = randomness for the ordinate, yf = ordinate, sexf = label
xf = [random.randint(0,22) for _ in range(40)]
ef = [np.random.normal(0,5) for _ in range(40)]
yf = [-5-.75*xf[i]+ef[i] for i in range(40)]
sexf = ["Female" for _ in range(40)]
# Concatenate the male and female data to be able to display aggregated clouds
xt = xm + xf
yt = ym + yf
sext = sexm + sexf
Let’s display the whole data cloud, without differentiating by sex.
import seaborn as sns
x=xt
y=yt
sex=sext
sns.set(style="ticks", context="talk")
data = {"x":xt,"y":yt,"sex":sext}
df = pd.DataFrame(data)
g = sns.lmplot(x="x",y="y",data=df)
g.set_axis_labels("Number of cigarettes per day", "% variation of libido")
plt.show()
The line seems to indicate a positive correlation between smoking and libido: the more you smoke, the more libido you have.
Now let’s add our age covariate.
f = sns.lmplot(x="x",y="y",hue="sex",data=df)
f.set_axis_labels("Number of cigarettes per day", "% variation of libido")
plt.show()
That now looks more like what we expected… smoking has a negative effect on both groups when taken separately!
Note that in the two graphs, the clouds of points are exactly the same!
Final check… in the first graph, was our coefficient statistically significant?
import statsmodels.api as sm
x_fit = sm.add_constant(x) # Without this, the intercept would be excluded from the model
model = sm.OLS(y, x_fit).fit()
predictions = model.predict(x_fit)
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.111
Model: OLS Adj. R-squared: 0.099
Method: Least Squares F-statistic: 9.718
Date: Sat, 19 Nov 2016 Prob (F-statistic): 0.00255
Time: 22:41:26 Log-Likelihood: -300.52
No. Observations: 80 AIC: 605.0
Df Residuals: 78 BIC: 609.8
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -12.6647 2.636 -4.804 0.000 -17.913 -7.417
x1 0.3917 0.126 3.117 0.003 0.142 0.642
==============================================================================
Omnibus: 2.345 Durbin-Watson: 1.591
Prob(Omnibus): 0.310 Jarque-Bera (JB): 2.253
Skew: -0.340 Prob(JB): 0.324
Kurtosis: 2.538 Cond. No. 47.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The x1 coefficient has a p-value of 0.003, which is significant if we use the usual p < 0.05 threshold to reject the null hypothesis!
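As a sanity check on the construction, we can also regress each cloud separately (a short sketch reusing the male and female arrays defined above); each group on its own should show a negative slope:
# Regress each group separately to confirm the within-group slopes are negative
for label, xs, ys in [("Male", xm, ym), ("Female", xf, yf)]:
    res = sm.OLS(ys, sm.add_constant(xs)).fit()
    print(label, "slope:", round(res.params[1], 3), "p-value:", round(res.pvalues[1], 4))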
Forging a multi-class example
This second example uses the same technique as before, but splits the data into three age classes.
# Forge the data
import random
# Age group 15-34
x1 = [random.randint(12,40) for _ in range(40)]
e1 = [np.random.normal(0,6) for _ in range(40)]
y1 = [82-.15*x1[i]+e1[i] for i in range(40)]
age1 = ["15-34" for _ in range(40)]
# Age group 35-54
x2 = [random.randint(5,30) for _ in range(40)]
e2 = [np.random.normal(0,5) for _ in range(40)]
y2 = [75-.3*x2[i]+e2[i] for i in range(40)]
age2 = ["35-54" for _ in range(40)]
# Age group 55+
x3 = [random.randint(0,30) for _ in range(40)]
e3 = [np.random.normal(0,4) for _ in range(40)]
y3 = [68-.45*x3[i]+e3[i] for i in range(40)]
age3 = ["55+" for _ in range(40)]
# Concatenate all the groups data to be able to display aggregated clouds
x4 = x1 + x2 + x3
y4 = y1 + y2 + y3
age4 = age1 + age2 + age3
Again, let's look at the aggregated data (unexpected result), at the data split by the covariate (expected result), and at the significance of the coefficient.
# Display the data without the confounding factor
x=x4
y=y4
age=age4
sns.set(style="ticks", context="talk")
data = {"x":x,"y":y,"age":age}
df = pd.DataFrame(data)
g = sns.lmplot(x="x",y="y",data=df)
g.set_axis_labels("Number of cigarettes per day", "Life expectancy")
plt.show()
# Show the same data with confounding factor
f = sns.lmplot(x="x",y="y",hue="age",data=df)
f.set_axis_labels("Number of cigarettes per day", "Life expectancy")
plt.show()
# Show the significance of the coefficient
import statsmodels.api as sm
x_fit = sm.add_constant(x) # Without this, the intercept would be excluded from the model
model = sm.OLS(y, x_fit).fit()
predictions = model.predict(x_fit)
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.067
Model: OLS Adj. R-squared: 0.059
Method: Least Squares F-statistic: 8.491
Date: Sat, 19 Nov 2016 Prob (F-statistic): 0.00427
Time: 22:43:39 Log-Likelihood: -425.24
No. Observations: 120 AIC: 854.5
Df Residuals: 118 BIC: 860.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 65.4403 1.695 38.598 0.000 62.083 68.798
x1 0.2350 0.081 2.914 0.004 0.075 0.395
==============================================================================
Omnibus: 0.616 Durbin-Watson: 1.230
Prob(Omnibus): 0.735 Jarque-Bera (JB): 0.716
Skew: 0.014 Prob(JB): 0.699
Kurtosis: 2.623 Cond. No. 46.3
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Forging an example with a double paradox
This time we forge four clouds of points, crossing what people smoke (tobacco or e-cigarettes) with sex (male or female), so that each covariate we add reverses the apparent conclusion.
# Forge the data
import random
# (x centre, base y) for each of the four clouds
ar = (8,10)   # E-cigarettes / Male
nr = (10,8)   # E-cigarettes / Female
ap = (15,5)   # Tobacco / Male
nP = (17,10)  # Tobacco / Female
std = 2
# Male/Smoker
xap = [random.randint(ap[0]-2,ap[0]+2) for _ in range(40)]
e = [np.random.normal(0,1) for _ in range(40)]
yap = [ap[1]+e[i]-1*(xap[i]-np.min(xap)) for i in range(40)]
sap = ["Tobacco" for _ in range(40)]
aap = ["Male" for _ in range(40)]
lap = ["Tobacco + Male" for _ in range(40)]
# Female/Smoker
xnp = [random.randint(nP[0]-2,nP[0]+2) for _ in range(40)]
e = [np.random.normal(0,1) for _ in range(40)]
ynp = [nP[1]+e[i]-1*(xnp[i]-np.min(xnp)) for i in range(40)]
snp = ["Tobacco" for _ in range(40)]
anp = ["Female" for _ in range(40)]
lnp = ["Tobacco + Female" for _ in range(40)]
# Male/E-cigarette
xar = [random.randint(ar[0]-2,ar[0]+2) for _ in range(40)]
e = [np.random.normal(0,1) for _ in range(40)]
yar = [ar[1]+e[i]+.2*(xar[i]-np.min(xar)) for i in range(40)]
sar = ["E cigarettes" for _ in range(40)]
aar = ["Male" for _ in range(40)]
lar = ["E cigarettes + Male" for _ in range(40)]
# Female/E-cigarette
xnr = [random.randint(nr[0]-2,nr[0]+2) for _ in range(40)]
e = [np.random.normal(0,1) for _ in range(40)]
ynr = [nr[1]+e[i]+.2*(xnr[i]-np.min(xnr)) for i in range(40)]
snr = ["E cigarettes" for _ in range(40)]
anr = ["Female" for _ in range(40)]
lnr = ["E cigarettes + Female" for _ in range(40)]
x = xap + xnp + xar + xnr
y = yap + ynp + yar + ynr
s = sap + snp + sar + snr
a = aap + anp + aar + anr
l = lap + lnp + lar + lnr
Let’s look at our overall data. Does smoking tobacco or e-cigarettes impact your health?
# Display without confounding factor
data = {"x":x,"y":y,"smoke":s,"sex":a,"label":l}
df = pd.DataFrame(data)
f = sns.lmplot(x="x",y="y",data=df)
f.set_axis_labels("Number of cigarettes per day", "% very healthy people")
plt.show()
OK, so far so good: it looks like smoking damages your health, as expected.
Let’s add a covariate and split by what people smoke.
# Display with one confounding factor
f = sns.lmplot(x="x",y="y",hue="smoke",data=df)
f.set_axis_labels("Number of cigarettes per day", "% very healthy people")
plt.show()
Uh-oh! Now it appears that tobacco has a positive effect on health, whereas e-cigarettes are slightly harmful. Strange, isn't it?
Let's add one more covariate!
# Display with two confounding factors
f = sns.lmplot(x="x",y="y",hue="label",data=df)
f.set_axis_labels("Number of cigarettes per day", "% very healthy people")
plt.show()
Now we see a double reversal due to Simpson's paradox! By successively adding new covariates, we obtained very different results each time!
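A simple way to make this kind of reversal jump out programmatically (just a sketch, reusing the df from the example above) is to compare the sign of the aggregated slope with the slopes fitted within each group:
# Compare the aggregated slope with the slope inside each group
def slope(xs, ys):
    return np.polyfit(xs, ys, 1)[0]
print("Overall slope:", round(slope(df["x"], df["y"]), 2))
print(df.groupby("label").apply(lambda g: slope(g["x"], g["y"])).round(2))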
Conclusion
As we just saw, you can get a lot out of data; you can pretty much make it tell anything you want!
So as a data scientist, you need to be aware of Simpson's paradox so that you can spot it when it occurs; that's a no-brainer.
But most of all, it shows the importance of being a subject matter expert in the area whose data you are processing, or of working with one.
References
[1] Cohen, Jacob. "The earth is round (p < .05)." American Psychologist 49.12 (1994): 997-1003.
[2] Appleton, David R., Joyce M. French, and Mark PJ Vanderpump. “Ignoring a covariate: An example of Simpson’s paradox.” The American Statistician 50.4 (1996): 340-341.