Anscombe’s quartet is a great example of the importance of fully understanding variability in a data set: it is a set of 4 data sets with the same summary measures (mean, std, etc…), the same correlation and the same regression line but with very different distributions. In this post I’ll show how easy it is to play with Anscombe’s quartet using Python and then talk about another mathematical area where variability must be fully understood: the study of queues. I’ll do this with the Ciw python library.

(If you would rather read this as jupyter notebook you can find it here: ipynb)

Here are all the imports that I’m going to need for this post:

>>> import pandas as pd  # Data manipulation
>>> import ciw  # The discrete event simulation library we will use to study queues
>>> import matplotlib.pyplot as plt  # Plots
>>> import seaborn as sns  # Powerful plots
>>> from scipy import stats  # Linear regression
>>> import numpy as np  # Quick summary statistics
>>> import tqdm  # A progress bar


Anscombe’s quartet

First of all, let us get the data set, that’s immediate to do as it comes ready loadable within the seaborn library:

>>> anscombe = sns.load_dataset("anscombe")


We can see that there are 4 data sets each with different values of $$x$$ and $$y$$:

>>> anscombe
dataset     x      y
0        I  10.0   8.04
1        I   8.0   6.95
2        I  13.0   7.58
3        I   9.0   8.81
4        I  11.0   8.33
5        I  14.0   9.96
6        I   6.0   7.24
7        I   4.0   4.26
8        I  12.0  10.84
9        I   7.0   4.82
10       I   5.0   5.68
11      II  10.0   9.14
12      II   8.0   8.14
13      II  13.0   8.74
14      II   9.0   8.77
15      II  11.0   9.26
16      II  14.0   8.10
17      II   6.0   6.13
18      II   4.0   3.10
19      II  12.0   9.13
20      II   7.0   7.26
21      II   5.0   4.74
22     III  10.0   7.46
23     III   8.0   6.77
24     III  13.0  12.74
25     III   9.0   7.11
26     III  11.0   7.81
27     III  14.0   8.84
28     III   6.0   6.08
29     III   4.0   5.39
30     III  12.0   8.15
31     III   7.0   6.42
32     III   5.0   5.73
33      IV   8.0   6.58
34      IV   8.0   5.76
35      IV   8.0   7.71
36      IV   8.0   8.84
37      IV   8.0   8.47
38      IV   8.0   7.04
39      IV   8.0   5.25
40      IV  19.0  12.50
41      IV   8.0   5.56
42      IV   8.0   7.91
43      IV   8.0   6.89


Let us take a quick look at the summary measures:

>>> anscombe.groupby("dataset").describe()
x          y
dataset
I       count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.031568
min     4.000000   4.260000
25%     6.500000   6.315000
50%     9.000000   7.580000
75%    11.500000   8.570000
max    14.000000  10.840000
II      count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.031657
min     4.000000   3.100000
25%     6.500000   6.695000
50%     9.000000   8.140000
75%    11.500000   8.950000
max    14.000000   9.260000
III     count  11.000000  11.000000
mean    9.000000   7.500000
std     3.316625   2.030424
min     4.000000   5.390000
25%     6.500000   6.250000
50%     9.000000   7.110000
75%    11.500000   7.980000
max    14.000000  12.740000
IV      count  11.000000  11.000000
mean    9.000000   7.500909
std     3.316625   2.030579
min     8.000000   5.250000
25%     8.000000   6.170000
50%     8.000000   7.040000
75%     8.000000   8.190000
max    19.000000  12.500000


We see that the summary statistics are, if not the same (a mean in $$x$$ of 9, in $$y$$ of 7.5) very similar. We can also take a look at the relationship between $$x$$ and $$y$$ in each data set by computing the regression line and visualising it:

>>> for data_set in anscombe.dataset.unique():
...     df = anscombe.query("dataset == '{}'".format(data_set))
...     slope, intercept, r_val, p_val, slope_std_error = stats.linregress(x=df.x, y=df.y)
...     sns.lmplot(x="x", y="y", data=df);
...     plt.title("Data set {}: y={:.2f}x+{:.2f} (p: {:.2f}, R^2: {:.2f})".format(data_set, slope, intercept, p_val, r_val))


While the plots show that each data set is evidently different, they also all have the same regression line (with the same $$p, R^2$$ values!):

Anscombe’s quartet is often used as an example justifying that a summary of a data set will inherently lose information and so should be accompanied by further study/understanding (such as in this case: viewing the plots of the data). (There even a paper out there showing how to generate similar data sets (behind a paywall): “Generating Data with Identical Statistics but Dissimilar Graphics: A Follow up to the Anscombe Dataset”)

Variability in queues

Another area of mathematics where it is imperative to have a good understanding of variability is queueing theory which is (suprise suprise) the study of queues. I’ve written about queueing theory before: here is a post about calculating the expected wait in a tandem queue, the main idea however is to assume some (random) distribution for the time between arrivals of individuals at a queueing system and another (random) distribution for the time it takes to “service” each individual.

When studied analytically these are often studied as Markov chains. The wikipedia entry for a single server queue wikipedia.org/wiki/Queueing_theory#Example_of_M.2FM.2F1 shows a picture for the corresponding chain.

These Markov models of queues assume that the distributions (of inter arrival times and service times) are Markovian: ie correspond to exponential distributions. This is often a good distribution that models realistic situations, however it is just as often not realistic. For example in healthcare systems such as hospital wards (which are just queues) service time distributions often follow a lognormal distribution.

When studying queues, as well as building analytical models it is possible to simulate the process using pseudo random number generators. One piece of software (the main developer being Geraint Palmer who is studying for his PhD with Paul Harper and I) that allows for this is ciw (the name is Welsh for queue! :)).

What I’ll do in the rest of this post is build a simulation model of a single server queue with ciw where individuals:

• Arrive on average according to a rate of .5 per time unit (so 2 time units between arrivals);
• Are served on average at a rate of 1 per time unit (so 1 time unit per service).

To do this in ciw we need to set up a parameters dictionary. Ciw is written so as to handle most queueing systems but with queueing networks in mind in particular and also the ability to handle multiple classes of individuals. In our system we will just have one class of individuals Class 0 and the transition matrix is a bit redundant because we’re dealing with a single queue. Here is the dictionary for our system when assuming exponential distributions:

>>> parameters = {'Arrival_distributions': {'Class 0': [['Exponential', 0.5]]},
...               'Service_distributions': {'Class 0': [['Exponential', 1]]},
...               'Transition_matrices': {'Class 0': [[0.0]]},
...               'Number_of_servers': [1]}



Here is a function to run a single simulation:

>>> def iteration(parameters, maxtime=250, warmup=50):
...     """
...     Run a single iteration of the simulation and
...     return the times spent waiting for service
...     as well as the service times
...     """
...     N = ciw.create_network(parameters)
...     Q = ciw.Simulation(N)
...     Q.simulate_until_max_time(maxtime)
...     records = [r for r in  Q.get_all_records() if r.arrival_date > warmup]
...     waits = [r.waiting_time for r in records]
...     service_times = [r.service_time for r in records]
...     n = len(waits)
...     return waits, service_times



Here is a function to run a trial consisting of multiple runs of the simulation. This is important to help smooth the stochasticity of the random process.

>>> def trials(parameters, repetitions=30, maxtime=2000, warmup=250):
...     """Repeat out simulation over a number of trials"""
...     waits = []
...     service_times = []
...     for seed in tqdm.trange(repetitions):  # tqdm gives us a nice progress bar
...         ciw.seed(seed)
...         wait, service_time = iteration(parameters, maxtime=maxtime, warmup=warmup)
...         waits.extend(wait)
...         service_times.extend(service_time)
...     return waits, service_times



Let’s run a trial of the default parameters:

>>> waits, service_times = trials(parameters)



First, let us verify that we are getting the expected service time (recall it should be 1 time units):

>>> np.mean(service_times)
1.00249...



We can also check that mean wait agrees with the theoretical value of $$\frac{\rho}{\mu(1-\rho)}$$ where $$\rho=\lambda/\mu$$, with $$\lambda$$, $$\mu$$) being the interarrival and service rates.

>>> lmbda = parameters['Arrival_distributions']['Class 0'][0][1]
>>> mu = parameters['Service_distributions']['Class 0'][0][1]
>>> rho = lmbda / mu
>>> np.mean(waits), rho / (mu * (1 - rho))
(1.0299..., 1.0)


Now that we have done this for a simple Markovian system let us create a number of distributions that all have the same mean service time and compare these:

>>> distributions = [
...     ['Uniform', 0, 2],  # A uniform distribution with mean 1
...     ['Deterministic', 1],  # A deterministic distribution with mean 1
...     ['Triangular', 0, 2, 1],  # A triangular distribution with mean 1
...     ['Exponential', 1],  # The Markovian distribution with mean 1
...     ['Gamma', 0.91, 1.1], # A Gamma distribution with mean 1
...     ['Lognormal', 0, .1], # A lognormal distribution with mean 1
...     ['Weibull', 1.1, 3.9],  # A Weibull distribuion with mean 1
...     ['Empirical', [0] * 19 + [20]]  # An empirical distribution with mean 1 (95% of the time: 0, 5% of the time: 20)
...    ]



Let us know simulate the queueing process for all the above distribution:

>>> columns = ["distribution", "waits", "service_times"]
>>> df = pd.DataFrame(columns=columns)  # Create a dataframe that will keep all the data
>>> for distribution in distributions:
...     parameters['Service_distributions']['Class 0'] = [distribution]
...     waits, service_times = trials(parameters)
...     n = len(waits)
...     df = df.append(pd.DataFrame(list(zip([distribution[0]] * n, waits, service_times)), columns=columns))


If we look at the summary statistics we see that all distributions have the same mean service time but drastically different mean waiting times:

>>> bydistribution = df.groupby("distribution")  # Grouping the data
>>> for name, df in sorted(bydistribution, key= lambda dist: dist[1].waits.max()):
...     print("{}:\n\t Mean service time: {:.02f}\n\t Mean waiting time: {:.02f}\n\t 95% waiting time: {:.02f} \n\t Max waiting time: {:.02f}\n".format(
...                                                         name,
...                                                         df.service_times.mean(),
...                                                         df.waits.mean(),
...                                                         df.waits.quantile(0.95),
...                                                         df.waits.max()))
Deterministic:
Mean service time: 1.00
Mean waiting time: 0.51
95% waiting time: 2.11
Max waiting time: 6.73

Weibull:
Mean service time: 1.00
Mean waiting time: 0.54
95% waiting time: 2.24
Max waiting time: 8.16

Lognormal:
Mean service time: 1.01
Mean waiting time: 0.51
95% waiting time: 2.09
Max waiting time: 8.53

Triangular:
Mean service time: 1.00
Mean waiting time: 0.59
95% waiting time: 2.48
Max waiting time: 9.21

Uniform:
Mean service time: 1.00
Mean waiting time: 0.67
95% waiting time: 2.84
Max waiting time: 10.42

Exponential:
Mean service time: 1.00
Mean waiting time: 1.03
95% waiting time: 4.76
Max waiting time: 16.41

Gamma:
Mean service time: 1.00
Mean waiting time: 1.02
95% waiting time: 4.72
Max waiting time: 20.65

Empirical:
Mean service time: 0.99
Mean waiting time: 9.30
95% waiting time: 37.60
Max waiting time: 100.60



For example, the empirical distribution sees at least one individual waiting for a total of 100 time units! (Remember, individuals are arriving on average once every 2 time units and being served, on average every time unit!). Also 5% of individuals waited more than 37 time units in the empiral distribution but still for more conservative distributions 5% of individuals had waits larger than: 4.72 times units (Weibull), 4.76 time units (Exponential), 2.84 time units (Uniform) etc…

But just as we mentioned when looking at Anscombe’s quartet it’s important to not just look at summary statistics so here is some visualisation:

>>> for name, df in sorted(bydistribution, key= lambda dist: dist[1].waits.max())[:-1]:
...     plt.hist(df.waits, normed=True, bins=20, cumulative=True,
...				 histtype = 'step', label=name, linewidth=1.5)
>>> plt.title("Cdf (excluding the empirical distribution)")
>>> plt.xlabel("Waiting time")
>>> plt.ylabel("Probability")
>>> plt.ylim(0, 1)
>>> plt.legend(loc=5);



>>> for name, df in sorted(bydistribution, key= lambda dist: dist[1].waits.max()):
...     plt.hist(df.waits, normed=True, bins=20, cumulative=True,
...              histtype = 'step', label=name, linewidth=1.5)
>>> plt.title("Cdf")
>>> plt.xlabel("Waiting time")
>>> plt.ylabel("Probability")
>>> plt.ylim(0, 1)
>>> plt.legend(loc=5);



We see that the empirical distribution has by far the longest tail with some individuals having more than 80 time units spent waiting. Whilst this distribution is pretty artificial, it is fairly realistic to assume that there is a particular procedure or service that often takes no time at all but rarely does take a lot of more time. Examples of these will include healthcare operations where non routine procedures sometimes are needed.

This is all done with a pretty simple queueing system (just one server) but the Ciw python library makes it pretty easy to study more complex systems.

Side note

Ciw is already pretty well developed and has full 100% test coverage but is always looking for contributions (of any level) so please take a look at: