Working with Nikoleta we recently needed to carry out an Analysis of Variance (ANOVA) on a data set where the sample size of each category is not constant. This blog post shows very briefly how to carry this out in Python (when using Pandas).

We are going to be using SciPy’s stats.f_oneway function which can handle different sized samples out of the box. Here’s a quick example:

>>> import random
>>> random.seed(0)
>>> data = [[random.randint(1, 20) for _ in range(size)] for size in range(1, 5)]
>>> for sample in data:
...     print(sample)
[13]
[14, 2]
[9, 17, 16]
[13, 10, 16, 12]

We see that we have 4 samples of different size. Here let’s carry out the ANOVA:

>>> from scipy import stats
>>> f_val, p_val = stats.f_oneway(*data)
>>> p_val
0.57172146848075944

We see that our \(p\) value is approximately \(0.57\).

Where you need to be careful is when the data you have is in a pandas data frame. This was the particular case we were dealing with.

For the purposes of our experiment let us build up the dataframe in question. First let us transform our samples in to their own dataframe. This is an artificial exercise to get data of the form required as I’m just going to show how to use dropna to get back to the situation above.

>>> import pandas as pd
>>> dfs = [pd.DataFrame({k: sample}) for k, sample in enumerate(data)]
>>> df = pd.concat(dfs,  ignore_index=True, axis=1)
>>> df
      0     1     2   3
0  13.0  14.0   9.0  13
1   NaN   2.0  17.0  10
2   NaN   NaN  16.0  16
3   NaN   NaN   NaN  12

If we were to carry out the anova as before (pulling out the columns) we would get an error:

>>> data = [df[col] for col in df]
>>> f_val, p_val = stats.f_oneway(*[df[col] for col in df])
>>> p_val
nan

The missing step is to ignore the missing numbers in the pandas dataframe:

>>> data = [df[col].dropna() for col in df]
>>> f_val, p_val = stats.f_oneway(*data)
>>> p_val
0.57172146848075944

We can finish things of by drawing a nice violin plot with a regression line fitted to the median of the data.

import numpy as np  # For the median
import matplotlib.pyplot as plt  # For the plot

# Fit line to median of distributions
x = range(1, len(data) + 1)
y = [np.median(sample) for sample in data]
slope, intercept, r_val, p_val, slope_std_error = stats.linregress(x, y)

def line(x):
    """The regression line"""
    return slope * x + intercept

plt.figure()
plt.violinplot(data);
x1, x2 = plt.xlim()
plt.plot((x1, x2), (line(x1), line(x2)), '--',
         label="$y = {0:.2f}x + {1:.2f}$ ($p={2:.2f}$)".format(slope,
                                                               intercept,
                                                               p_val),
         ),
plt.legend(loc=4);

which gives:

Edit Here is a jupyter notebook with all of the above.