Tutorial
Tutorial#
We will solve the following problem using a computer to do some of the more tedious calculations.
Problem
Anna is investigating the relationship between exercise and resting heart rate. She takes a random sample of 19 people in her year group and records for each person
their resting heart rate, \(h\) beats per minute.
the number of minutes, \(m\), spent exercising each week.
A table with the data is here:
h |
m |
---|---|
76.0 |
5 |
72.0 |
5 |
71.0 |
21 |
74.0 |
30 |
71.0 |
42 |
69.0 |
20 |
68.0 |
20 |
68.0 |
35 |
66 |
80.0 |
64 |
120.0 |
65 |
140.0 |
63 |
180.0 |
63 |
205.0 |
62 |
225.0 |
65 |
237.0 |
63 |
280.0 |
65 |
300.0 |
64 |
356.0 |
64 |
360.0 |
You can see a scatter plot below.
For all collected values of \(h\) and \(m\) obtain:
The mean
The median
The quartiles
The standard deviation
The variation
The maximum
The minimum
Obtain the Pearson Coefficient of correlation for the variables \(h\) and \(m\).
Obtain the line of best fit for variables \(x\) and \(y\) as defined by:
\[x=\ln(m)\qquad y=\ln(h)\]Using the above obtain a relationship between \(m\) and \(h\) of the form:
\[h=cm^k\]
Let us start by inputting all the data:
h = (
76.0,
72.0,
71.0,
74.0,
71.0,
69.0,
68.0,
68.0,
66.0,
64.0,
65.0,
63.0,
63.0,
62.0,
65.0,
63.0,
65.0,
64.0,
64.0,
)
m = (
5,
5,
21,
30,
42,
20,
20,
35,
80,
120,
140,
180,
205,
225,
237,
280,
300,
356,
360,
)
The main tool we are going to use for this is statistics
.
import statistics as st
To calculate the mean:
st.mean(h)
67.0
st.mean(m)
140.05263157894737
To calculate the median:
st.median(h)
65.0
st.median(m)
120
To calculate the quartiles, we use statistics.quantiles
and specify that we
want to separate the date in to \(n=4\) quarters.
st.quantiles(h, n=4)
[64.0, 65.0, 71.0]
st.quantiles(m, n=4)
[21.0, 120.0, 237.0]
Note that this calculation confirms the median which corresponds to the 50% quartile.
To calculate the sample standard deviation:
st.stdev(h)
4.123105625617661
st.stdev(m)
124.46662813970593
To calculate the sample variance:
st.variance(h)
17.0
st.variance(m)
15491.941520467837
To compute that maximum:
max(h)
76.0
max(m)
360
To compute the minimum:
min(h)
62.0
min(m)
5
In order to compute the Pearson Coefficient of correlation we use
statistics.correlation
:
st.correlation(h, m)
-0.7686142969026402
This negative value indicates a negative correlation between \(h\) and \(m\), indicating that the more you exercise the lower your heart rate is likely to be.
To calculate the line of best fit for the transformed variables we need to first
create them. We will do this using a list comprehension. As we are doing
everything numerically, we will use math.log
which by default computes the
natural logarithm:
import math
x = [math.log(value) for value in m]
y = [math.log(value) for value in h]
Now to compute the line of best fit we will use statistics.linear_regression
:
slope, intercept = st.linear_regression(x, y)
The slope is:
slope
-0.03854770754231997
The intercept is:
intercept
4.368415819445762
If we recall the transformation of the variables:
We now have the relationship:
Where \(a\) corresponds to the slope
and \(b\) corresponds to the intercept
.
The question was asking for a relationship between \(m\) and \(h\) of the form:
We can use sympy
to manipulate the expressions:
import sympy as sym
h = sym.Symbol("h")
m = sym.Symbol("m")
a = sym.Symbol("a")
b = sym.Symbol("b")
x = sym.ln(m)
y = sym.ln(h)
A general line of best fit for \(x\) and \(y\) can be expressed in terms of \(m\) and \(h\):
line = sym.Eq(lhs=y, rhs=a * x + b)
line
Taking the exponential of both sides gives the required relationship:
sym.exp(line.lhs)
sym.expand(sym.exp(line.rhs))
Which can be rewritten as:
Substituting our values for the slope
and intercept
in to these expressions
gives the required relationship:
sym.exp(line.rhs).subs({a: slope, b: intercept})
Below is a plot that shows this relationship:
Important
In this tutorial we have
Calulated values of central tendency and spread
Calculated some bivariate coefficients
Fitted a line of best fit