Solutions

Question 1

1. For each of the following sets of data, calculate:
 - The mean,
 - The median,
 - The max,
 - The min,
 - The population standard deviation,
 - The sample standard deviation,
 - The population variance,
 - The sample variance,
 - The quartiles (the set of \(n=4\) quantiles),
 - The deciles (the set of \(n=10\) quantiles),

1. `data_set_1 = (…)`

import statistics as st

data_set_1 = (
    74,
    -7,
    58,
    82,
    60,
    3,
    49,
    85,
    24,
    99,
    73,
    76,
    11,
    -4,
    61,
    87,
    93,
    13,
    1,
    28,
)
  • The mean,

st.mean(data_set_1)
48.3
  • The median,

st.median(data_set_1)
59.0
  • The max,

max(data_set_1)
99
  • The min,

min(data_set_1)
-7
  • The population standard deviation,

st.pstdev(data_set_1)
35.1441318003447
  • The sample standard deviation,

st.stdev(data_set_1)
36.05711842998112
  • The population variance,

st.pvariance(data_set_1)
1235.11
  • The sample variance,

st.variance(data_set_1)
1300.1157894736841
  • The quartiles (the set of \(n=4\) quantiles),

st.quantiles(data_set_1, n=4)
[11.5, 59.0, 80.5]
  • The deciles (the set of \(n=10\) quantiles),

st.quantiles(data_set_1, n=10)
[-3.5, 4.6, 16.3, 36.4, 59.0, 68.2, 75.4, 84.4, 92.4]

2. `data_set_2 = (…)`

import statistics as st

data_set_2 = (
    65,
    59,
    81,
    81,
    76,
    93,
    91,
    88,
    55,
    97,
    86,
    94,
    79,
    54,
    63,
    56,
    58,
    77,
    85,
    88,
)
  • The mean,

st.mean(data_set_2)
76.3
  • The median,

st.median(data_set_2)
80.0
  • The max,

max(data_set_2)
97
  • The min,

min(data_set_2)
54
  • The population standard deviation,

st.pstdev(data_set_2)
14.202464574854606
  • The sample standard deviation,

st.stdev(data_set_2)
14.571421200057106
  • The population variance,

st.pvariance(data_set_2)
201.71
  • The sample variance,

st.variance(data_set_2)
212.32631578947368
  • The quartiles (the set of \(n=4\) quantiles),

st.quantiles(data_set_2, n=4)
[60.0, 80.0, 88.0]
  • The deciles (the set of \(n=10\) quantiles),

st.quantiles(data_set_2, n=10)
[55.1, 58.2, 63.6, 76.4, 80.0, 83.4, 87.4, 90.4, 93.9]

3. `data_set_3 = (…)`

import statistics as st

data_set_3 = (
    0.31,
    -0.13,
    0.19,
    0.46,
    -0.27,
    -0.06,
    0.20,
    0.42,
    -0.07,
    0.11,
    -0.11,
    -0.43,
    -0.36,
    0.45,
    -0.42,
    0.11,
    0.08,
    0.31,
    0.48,
    0.17,
)
  • The mean,

st.mean(data_set_3)
0.07200000000000001
  • The median,

st.median(data_set_3)
0.11
  • The max,

max(data_set_3)
0.48
  • The min,

min(data_set_3)
-0.43
  • The population standard deviation,

st.pstdev(data_set_3)
0.28690765064738166
  • The sample standard deviation,

st.stdev(data_set_3)
0.2943610386118237
  • The population variance,

st.pvariance(data_set_3)
0.082316
  • The sample variance,

st.variance(data_set_3)
0.08664842105263158
  • The quartiles (the set of \(n=4\) quantiles),

st.quantiles(data_set_3, n=4)
[-0.125, 0.11, 0.31]
  • The deciles (the set of \(n=10\) quantiles),

st.quantiles(data_set_3, n=10)
[-0.414,
 -0.242,
 -0.098,
 -0.003999999999999998,
 0.11000000000000001,
 0.18200000000000002,
 0.277,
 0.398,
 0.4590000000000001]

4. `data_set_4 = (…)`

import statistics as st

data_set_4 = (
    2,
    4,
    2,
    2,
    2,
    2,
    2,
    3,
    2,
    2,
    2,
    4,
    2,
    4,
    2,
    2,
    3,
    4,
    3,
    4,
)
  • The mean,

st.mean(data_set_4)
2.65
  • The median,

st.median(data_set_4)
2.0
  • The max,

max(data_set_4)
4
  • The min,

min(data_set_4)
2
  • The population standard deviation,

st.pstdev(data_set_4)
0.852936105461599
  • The sample standard deviation,

st.stdev(data_set_4)
0.8750939799154206
  • The population variance,

st.pvariance(data_set_4)
0.7275
  • The sample variance,

st.variance(data_set_4)
0.7657894736842106
  • The quartiles (the set of \(n=4\) quantiles),

st.quantiles(data_set_4, n=4)
[2.0, 2.0, 3.75]
  • The deciles (the set of \(n=10\) quantiles),

st.quantiles(data_set_4, n=10)
[2.0, 2.0, 2.0, 2.0, 2.0, 2.6, 3.0, 4.0, 4.0]
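
The ten calculations are the same for every data set, so they can be gathered into one helper. The following is a minimal sketch: the function name `summarise` and the dictionary keys are illustrative choices, not part of the original solutions. It is shown on `data_set_4` only.

```python
import statistics as st


def summarise(data):
    """Return the summary statistics asked for in question 1 as a dictionary.

    `summarise` is a hypothetical helper introduced for illustration.
    """
    return {
        "mean": st.mean(data),
        "median": st.median(data),
        "max": max(data),
        "min": min(data),
        "population standard deviation": st.pstdev(data),
        "sample standard deviation": st.stdev(data),
        "population variance": st.pvariance(data),
        "sample variance": st.variance(data),
        "quartiles": st.quantiles(data, n=4),
        "deciles": st.quantiles(data, n=10),
    }


data_set_4 = (2, 4, 2, 2, 2, 2, 2, 3, 2, 2, 2, 4, 2, 4, 2, 2, 3, 4, 3, 4)

summary = summarise(data_set_4)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Calling `summarise` on the other three data sets should reproduce the values listed above.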

Question 2

2. Calculate the sample covariance and the correlation coefficient for the following pairs of data sets from question 1:

1. data_set_1 and data_set_4

st.covariance(data_set_1, data_set_4)
-12.468421052631578
st.correlation(data_set_1, data_set_4)
-0.39515342199380205

2. data_set_3 and data_set_4

st.covariance(data_set_3, data_set_4)
0.04126315789473684
st.correlation(data_set_3, data_set_4)
0.1601870630717755

3. data_set_2 and data_set_3

st.covariance(data_set_2, data_set_3)
0.057263157894736905
st.correlation(data_set_2, data_set_3)
0.013350362425512118

4. data_set_1 and data_set_2

st.covariance(data_set_1, data_set_2)
77.16842105263159
st.correlation(data_set_1, data_set_2)
0.1468745962708178
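
As a check on what these functions compute, the sketch below rebuilds the first pair's results from the definitions: the sample covariance divides the sum of products of deviations by \(n - 1\), and the correlation coefficient divides the covariance by the product of the two sample standard deviations. The helper `sample_covariance` is a hypothetical name used only for this illustration.

```python
import statistics as st

data_set_1 = (74, -7, 58, 82, 60, 3, 49, 85, 24, 99, 73, 76, 11, -4, 61, 87, 93, 13, 1, 28)
data_set_4 = (2, 4, 2, 2, 2, 2, 2, 3, 2, 2, 2, 4, 2, 4, 2, 2, 3, 4, 3, 4)


def sample_covariance(x, y):
    """Sample covariance from its definition (hypothetical helper, divides by n - 1)."""
    mean_x, mean_y = st.mean(x), st.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


covariance = sample_covariance(data_set_1, data_set_4)

# Correlation coefficient: covariance scaled by both sample standard deviations.
correlation = covariance / (st.stdev(data_set_1) * st.stdev(data_set_4))

print(covariance)
print(correlation)
```

Up to floating-point rounding, the two printed values should agree with `st.covariance` and `st.correlation` for the same pair.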

Question 3

3. For each of the data sets from question 1, obtain the covariance and correlation coefficient for the data set with itself.

1. `data_set_1 = (…)`

st.covariance(data_set_1, data_set_1)
1300.1157894736843
st.correlation(data_set_1, data_set_1)
1.0

2. `data_set_2 = (…)`

st.covariance(data_set_2, data_set_2)
212.32631578947368
st.correlation(data_set_2, data_set_2)
1.0

3. `data_set_3 = (…)`

st.covariance(data_set_3, data_set_3)
0.08664842105263158
st.correlation(data_set_3, data_set_3)
1.0

4. `data_set_4 = (…)`

st.covariance(data_set_4, data_set_4)
0.7657894736842106
st.correlation(data_set_4, data_set_4)
1.0
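
These results are what the definitions predict. Writing \(\bar{x}\) for the sample mean and \(s_x\) for the sample standard deviation (notation assumed here for illustration), the covariance of a data set with itself reduces to its sample variance, and the correlation coefficient reduces to 1:

\[
\operatorname{cov}(x, x) = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x}) = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = s_x^2,
\qquad
r_{xx} = \frac{\operatorname{cov}(x, x)}{s_x \, s_x} = \frac{s_x^2}{s_x^2} = 1.
\]

This matches the sample variances found in question 1. For data_set_1 the two printed values (1300.1157894736841 and 1300.1157894736843) differ only in the final digit, which is floating-point rounding.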

Question 4

4. Obtain a line of best fit for the pairs of data sets from question 2.

1. data_set_1 and data_set_4

st.linear_regression(data_set_1, data_set_4)
LinearRegression(slope=-0.009590238926087555, intercept=3.113208540130029)

2. data_set_3 and data_set_4

st.linear_regression(data_set_3, data_set_4)
LinearRegression(slope=0.47621361582195443, intercept=2.6157126196608194)

3. data_set_2 and data_set_3

st.linear_regression(data_set_2, data_set_3)
LinearRegression(slope=0.00026969411531406506, intercept=0.05142233900153683)

4. data_set_1 and data_set_2

st.linear_regression(data_set_1, data_set_2)
LinearRegression(slope=0.05935503720316409, intercept=73.43315170308718)

Question 5

5. Consider a collection of 250 individuals whose heights are normally distributed with mean 165 and standard deviation 5. What is the expected number of individuals with height between 150 and 160?

We start by creating the distribution:

distribution = st.NormalDist(165, 5)
distribution
NormalDist(mu=165.0, sigma=5.0)

Now let us find the probability of the random variable being between 150 and 160:

probability = distribution.cdf(160) - distribution.cdf(150)
probability
0.15730535589982697

The expected number of individuals is thus given by:

probability * 250
39.32633897495674
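
The same probability can be cross-checked by standardising. Heights of 150 and 160 lie 3 and 1 standard deviations below the mean, so the probability is the area under the standard normal curve between \(z=-3\) and \(z=-1\). This is a sketch of that check; the variable names are illustrative.

```python
import statistics as st

mu, sigma = 165, 5

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = st.NormalDist()

# Convert the two heights to z-scores.
z_lower = (150 - mu) / sigma
z_upper = (160 - mu) / sigma

probability = standard_normal.cdf(z_upper) - standard_normal.cdf(z_lower)
expected_number = 250 * probability

print(z_lower, z_upper)
print(probability)
print(expected_number)
```

Up to floating-point rounding, this should agree with the probability and expected number computed directly from `NormalDist(165, 5)` above.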

Question 6

6. Consider a class test where the scores are normally distributed with mean 65 and standard deviation 5.

1. What is the probability of failing the class test (a score less than 40)?

We start by creating the distribution:

distribution = st.NormalDist(65, 5)
distribution
NormalDist(mu=65.0, sigma=5.0)

The probability is given by:

distribution.cdf(40)
2.8665157186802404e-07

2. What proportion of the class gets a first class mark (a score above 70)?

The probability is given by:

1 - distribution.cdf(70)
0.15865525393145707

3. What mark would only 5% of the class be expected to exceed?

For this, we use the inverse cdf. A mark that only 5% of the class exceeds is equivalent to a mark that 95% of the class scores below, so we need the inverse cdf at \(0.95\):

distribution.inv_cdf(.95)
73.22426813475735
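
One way to confirm this answer is to feed the mark back into the cdf: the proportion of the class scoring above it should come out as 5%. This is a minimal sketch of that round trip; the variable names are illustrative.

```python
import statistics as st

distribution = st.NormalDist(65, 5)

# Mark below which 95% of the class scores.
mark = distribution.inv_cdf(0.95)

# Proportion of the class scoring above that mark.
proportion_above = 1 - distribution.cdf(mark)

print(mark)
print(proportion_above)
```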