Tutorial

To demonstrate the use of Matplotlib we will consider a particular collection of data sets called Anscombe’s quartet.

Problem

Consider the following 4 data sets:

Table 3: Set 1

   x       y
----   -----
10.0    8.04
 8.0    6.95
13.0    7.58
 9.0    8.81
11.0    8.33
14.0    9.96
 6.0    7.24
 4.0    4.26
12.0   10.84
 7.0    4.82
 5.0    5.68

Table 4: Set 2

   x       y
----   -----
10.0    9.14
 8.0    8.14
13.0    8.74
 9.0    8.77
11.0    9.26
14.0    8.1
 6.0    6.13
 4.0    3.1
12.0    9.13
 7.0    7.26
 5.0    4.74

Table 5: Set 3

   x       y
----   -----
10.0    7.46
 8.0    6.77
13.0   12.74
 9.0    7.11
11.0    7.81
14.0    8.84
 6.0    6.08
 4.0    5.39
12.0    8.15
 7.0    6.42
 5.0    5.73

Table 6: Set 4

   x       y
----   -----
 8.0    6.58
 8.0    5.76
 8.0    7.71
 8.0    8.84
 8.0    8.47
 8.0    7.04
 8.0    5.25
19.0   12.5
 8.0    5.56
 8.0    7.91
 8.0    6.89

  1. For every data set obtain:

    1. The mean and standard deviation of \(x\);

    2. The mean and standard deviation of \(y\).

  2. Plot a scatter plot of \(y\) against \(x\) for each of the 4 data sets.

  3. Find a regression line for \(y\) against \(x\) and add a plot of it to each scatter plot.

We start this problem by creating tuples with values corresponding to each column of each data set:

set_1_x = (10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0)
set_1_y = (8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
set_2_x = (10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0)
set_2_y = (9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)
set_3_x = (10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0)
set_3_y = (7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)
set_4_x = (8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0)
set_4_y = (6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89)

Now to compute the mean and standard deviation we will use numpy:

import numpy as np

for x in (set_1_x, set_2_x, set_3_x, set_4_x):
    print(np.mean(x), np.std(x))
9.0 3.1622776601683795
9.0 3.1622776601683795
9.0 3.1622776601683795
9.0 3.1622776601683795

We see that all the data sets have the same mean and standard deviation for \(x\).

for y in (set_1_y, set_2_y, set_3_y, set_4_y):
    print(np.mean(y), np.std(y))
7.500909090909093 1.937024215108669
7.50090909090909 1.93710869148962
7.5 1.9359329439927313
7.500909090909091 1.9360806451340837

Similarly for \(y\): all the data sets have approximately the same mean and standard deviation.
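
Note that `np.std` divides by \(n\) by default, which gives the population standard deviation. If the sample standard deviation (dividing by \(n - 1\)) is wanted instead, the `ddof` argument can be used. A minimal sketch, reusing the \(x\) values of data set 1 (the names `population_std` and `sample_std` are chosen here for illustration):

```python
import numpy as np

# x values of data set 1, copied from the tuples defined earlier.
set_1_x = (10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0)

# Default behaviour: population standard deviation (divides by n).
population_std = np.std(set_1_x)

# ddof=1: sample standard deviation (divides by n - 1).
sample_std = np.std(set_1_x, ddof=1)

print(population_std, sample_std)
```

The two values differ (roughly 3.16 and 3.32 here), but whichever convention is used, it gives the same value for sets 1, 2 and 3, which share the same \(x\) values, so the comparison between data sets is unaffected.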

We will now use matplotlib to plot a scatter plot of all the data sets:

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(set_1_x, set_1_y)
plt.title("Data set I")
plt.xlabel("$x$")
plt.ylabel("$y$");

[Figure: scatter plot of data set I]

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(set_2_x, set_2_y)
plt.title("Data set II")
plt.xlabel("$x$")
plt.ylabel("$y$");

[Figure: scatter plot of data set II]

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(set_3_x, set_3_y)
plt.title("Data set III")
plt.xlabel("$x$")
plt.ylabel("$y$");

[Figure: scatter plot of data set III]

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(set_4_x, set_4_y)
plt.title("Data set IV")
plt.xlabel("$x$")
plt.ylabel("$y$");

[Figure: scatter plot of data set IV]

It is clear that despite having (approximately) the same means and standard deviations, the data sets are very different.
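
Drawing the four scatter plots on a single grid with shared axes can make the comparison easier. The following is a sketch of one way to do this with `plt.subplots`; the names `data_sets`, `fig` and `axes` are choices made for this illustration and do not appear elsewhere in the tutorial.

```python
import matplotlib.pyplot as plt

# The four data sets, copied from the tuples defined earlier.
set_1_x = (10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0)
set_1_y = (8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
set_2_x = (10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0)
set_2_y = (9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)
set_3_x = (10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0)
set_3_y = (7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)
set_4_x = (8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0)
set_4_y = (6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89)

# Map each title to its (x, y) pair so that one loop can draw every panel.
data_sets = {
    "Data set I": (set_1_x, set_1_y),
    "Data set II": (set_2_x, set_2_y),
    "Data set III": (set_3_x, set_3_y),
    "Data set IV": (set_4_x, set_4_y),
}

# A 2 by 2 grid of axes that all use the same x and y ranges.
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True, figsize=(8, 6))

# axes.flat iterates over the four axes row by row.
for ax, (title, (x, y)) in zip(axes.flat, data_sets.items()):
    ax.scatter(x, y)
    ax.set_title(title)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$y$")

fig.tight_layout()
```

Sharing the axes is a design choice: it puts all four panels on the same scale, so differences in shape are not hidden by differences in axis limits.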

To find a line of best fit we will use numpy.polyfit, which fits a polynomial to the data. We specify that we want a line (so a polynomial of degree 1):

coefficients = np.polyfit(set_1_x, set_1_y, 1)
coefficients
array([0.50009091, 3.00009091])

Here are the coefficients of the line of best fit for each data set:

for x, y in (
    (set_1_x, set_1_y),
    (set_2_x, set_2_y),
    (set_3_x, set_3_y),
    (set_4_x, set_4_y),
):
    a, b = np.polyfit(x, y, 1)
    print(a, b)
0.5000909090909094 3.000090909090908
0.5000000000000006 3.0009090909090883
0.49972727272727313 3.0024545454545453
0.49990909090909097 3.0017272727272717
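
`np.polyfit` returns the coefficients with the highest degree first, so for a line \(y = ax + b\) the first value is the slope \(a\) and the second is the intercept \(b\). As a check, the same values can be obtained from the standard least squares formulas \(a = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(b = \bar{y} - a\bar{x}\). A minimal sketch for data set 1 (the names `slope` and `intercept` are chosen here for illustration):

```python
import numpy as np

# Data set 1, copied from the tuples defined earlier.
set_1_x = (10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0)
set_1_y = (8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)

# Convert to arrays so that arithmetic applies element by element.
x = np.array(set_1_x)
y = np.array(set_1_y)

# Least squares formulas for the straight line y = a * x + b.
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# np.polyfit gives the highest degree coefficient first: (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)

# Compare with a tolerance, since the two computations round differently.
print(np.isclose(a, slope), np.isclose(b, intercept))
```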

The coefficients are approximately the same for all 4 data sets: a slope of about 0.5 and an intercept of about 3. We will now add each line of best fit to the corresponding scatter plot:

import matplotlib.pyplot as plt

x = set_1_x
y = set_1_y
title = "Data set I"

# Unpack the slope and intercept of the line fitted to this data set
a, b = np.polyfit(x, y, 1)
line_y = [a * x_value + b for x_value in x]

plt.figure()
plt.scatter(x, y)
plt.plot(x, line_y, color="red")
plt.title(title)
plt.xlabel("$x$")
plt.ylabel("$y$");

[Figure: scatter plot of data set I with its line of best fit]

import matplotlib.pyplot as plt

x = set_2_x
y = set_2_y
title = "Data set II"

# Unpack the slope and intercept of the line fitted to this data set
a, b = np.polyfit(x, y, 1)
line_y = [a * x_value + b for x_value in x]

plt.figure()
plt.scatter(x, y)
plt.plot(x, line_y, color="red")
plt.title(title)
plt.xlabel("$x$")
plt.ylabel("$y$");

[Figure: scatter plot of data set II with its line of best fit]

import matplotlib.pyplot as plt

x = set_3_x
y = set_3_y
title = "Data set III"

# Unpack the slope and intercept of the line fitted to this data set
a, b = np.polyfit(x, y, 1)
line_y = [a * x_value + b for x_value in x]

plt.figure()
plt.scatter(x, y)
plt.plot(x, line_y, color="red")
plt.title(title)
plt.xlabel("$x$")
plt.ylabel("$y$");

[Figure: scatter plot of data set III with its line of best fit]

import matplotlib.pyplot as plt

x = set_4_x
y = set_4_y
title = "Data set IV"

# Unpack the slope and intercept of the line fitted to this data set
a, b = np.polyfit(x, y, 1)
line_y = [a * x_value + b for x_value in x]

plt.figure()
plt.scatter(x, y)
plt.plot(x, line_y, color="red")
plt.title(title)
plt.xlabel("$x$")
plt.ylabel("$y$");

[Figure: scatter plot of data set IV with its line of best fit]
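
The four plotting cells differ only in the data and the title, so the repeated steps could be collected into a function. The following is a sketch under that assumption: `plot_with_fit` is a hypothetical helper name that does not appear in the tutorial, and it uses `np.polyval` to evaluate the fitted line rather than a list comprehension.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_with_fit(x, y, title):
    """Scatter plot of y against x with the line of best fit drawn in red.

    Hypothetical helper written for this illustration.
    """
    # Degree 1 polynomial fit: coefficients are (slope, intercept).
    coefficients = np.polyfit(x, y, 1)

    # A straight line only needs its two end points to be drawn.
    line_x = np.array([min(x), max(x)])
    line_y = np.polyval(coefficients, line_x)

    fig, ax = plt.subplots()
    ax.scatter(x, y)
    ax.plot(line_x, line_y, color="red")
    ax.set_title(title)
    ax.set_xlabel("$x$")
    ax.set_ylabel("$y$")
    return fig, ax


# Data set 4, copied from the tuples defined earlier.
set_4_x = (8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0)
set_4_y = (6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89)

fig, ax = plot_with_fit(set_4_x, set_4_y, "Data set IV")
```

Because the fit is computed inside the function from the arguments it receives, each plot is guaranteed to show the line belonging to its own data set.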

Anscombe’s quartet is often used to demonstrate the importance of visualising data. In this particular exercise we have seen that 4 data sets have approximately the same mean, standard deviation and line of best fit, yet are clearly different once visualised.

Important

In this chapter we have:

  • Plotted a scatter plot.

  • Added a plot of a line to our scatter plot.