MTH220 – Statistical Methods and Inference

Question 1

To settle a debate over whether Coca Cola or Pepsi Cola is more popular, a random sample of 79 participants was selected and asked to state their preference for one of the drinks. 43 of them preferred Coca Cola. Let 𝑝 denote the proportion of the population who prefer Coca Cola.

(a) Compute the probability that a participant selected at random from the sample will prefer Coca Cola.

(b) What is the probability that two participants selected (without replacement) at random from the sample will both prefer Coca Cola?

(c) Construct a 95% confidence interval for 𝑝.

(d) Construct a one-sided 95% confidence interval for 𝑝 of the form (𝑟, 1).

(e) What conclusion about 𝑝 can be drawn from these confidence intervals? In particular, can we claim that Coca Cola is the more popular drink? That is, is 𝑝 > 0.50?

Suppose the sample size is increased 100-fold to 7,900 participants, and the sample proportion stays the same at 𝑝̂ = 4,300/7,900.

(f) What impact does the bigger sample have, if any, on the standard deviation of 𝑝̂?

(g) Will the enlarged sample size affect your answer in Question 1(e)? If so, in what manner?
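
The quantities in Question 1 can be explored numerically. The following is a minimal sketch, assuming the normal-approximation (Wald) interval for a proportion; the study guide may prescribe a different construction. The function name wald_summary is illustrative and not part of the course material.

```python
from math import sqrt
from statistics import NormalDist


def wald_summary(successes, n, level=0.95):
    """Return (p_hat, standard error, two-sided CI, one-sided lower bound).

    Assumes the normal approximation to the sampling distribution of p_hat.
    """
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    z_two = NormalDist().inv_cdf(1 - (1 - level) / 2)  # two-sided quantile
    z_one = NormalDist().inv_cdf(level)                # one-sided quantile
    two_sided = (p_hat - z_two * se, p_hat + z_two * se)
    lower_bound = p_hat - z_one * se                   # interval (lower_bound, 1)
    return p_hat, se, two_sided, lower_bound


small = wald_summary(43, 79)       # original sample
large = wald_summary(4300, 7900)   # enlarged sample, same proportion

print("n = 79:  ", small)
print("n = 7900:", large)
print("ratio of standard errors:", small[1] / large[1])
```

Comparing the two printed intervals with 0.50 shows how the sample size affects what can be claimed about 𝑝.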

Question 2

The two coins 𝐴 and 𝐵 look identical. Coin 𝐴 is a fair coin. For coin 𝐵, the probability of obtaining a head is 40% and the probability of obtaining a tail is 60% in a toss. Your friend Amina picked one of the coins at random and wanted to find out whether it was coin 𝐴 or coin 𝐵.

She tossed the selected coin 39 times and obtained 18 heads. She asked you to help her analyse the data. You decide to use hypothesis testing at the 5% level of significance.

Your null hypothesis is that coin 𝐴 was picked. That is, the probability 𝑝 of getting a head is 50%. Null hypothesis, 𝐻0: coin 𝐴 was picked, 𝑝 = 0.50.

(a) Write down the alternative hypothesis.

(b) Let 𝑋 denote the number of heads obtained in 39 tosses. Under the null hypothesis, what is the distribution of 𝑋?

(c) Under the null hypothesis, what are the expected value and variance of the sample proportion 𝑝̂ = 𝑋/39? That is, compute 𝐸(𝑝̂) and var(𝑝̂).

(d) What test statistic would you use for the test?

(e) Is the null hypothesis rejected at the 5% level of significance?

(f) Which is typically considered to be the more serious error between Type I and Type II errors?
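
The calculations behind Question 2 can be sketched as follows. This is only an illustration under stated assumptions: it takes the alternative to be one-sided (𝑝 < 0.50, since coin 𝐵 has a lower probability of heads), and it reports both a normal-approximation 𝑧 statistic and an exact binomial lower-tail probability. The variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, heads, p0 = 39, 18, 0.50

# Moments of the sample proportion under the null hypothesis
mean_p_hat = p0
var_p_hat = p0 * (1 - p0) / n

# Normal-approximation test statistic
z = (heads / n - p0) / sqrt(var_p_hat)

# Lower-tail p-values (assumption: the alternative is p < 0.50)
p_normal = NormalDist().cdf(z)
p_exact = sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(heads + 1))

print("E(p_hat) =", mean_p_hat, " var(p_hat) =", var_p_hat)
print("z =", z)
print("normal p-value =", p_normal, " exact p-value =", p_exact)
```

Either 𝑝-value can then be compared with the 5% level of significance.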

Question 3

A professor of marketing wants to investigate whether a customer’s rating of a restaurant’s food and the amount the customer spent at the restaurant are related. There are four possible ratings for a restaurant’s food, namely excellent, good, average and poor. A customer’s bill is classified into one of three categories, namely expensive, average and cheap. The information obtained from 450 customers is displayed in Table Q3 below.

As you have taken the MTH220 course, the professor engages you to assist in the analysis of the data.

(a) The professor asks you whether an ANOVA test is appropriate for this project. Provide your response and justify your answer.

You decide to apply the chi-square test for independence in the study.

(b) Formulate the null and alternative hypotheses.

(c) State the level of significance you would use for the study.

(d) Use R to perform a chi-square test. Write down the commands you use. Include the output from R.

(e) What is the critical value of the test?

(f) What is the 𝑝-value of the test?

(g) What can we conclude about a customer’s rating of a restaurant’s food and the amount spent?

(h) Which cell in the table has the biggest discrepancy between the observed frequency and the expected frequency? Recall that the expected frequency is computed under the assumption that a customer’s rating and the amount spent are independent.
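
The expected frequencies mentioned in part (h) follow the rule (row total × column total) / grand total. The sketch below illustrates that rule and the resulting chi-square statistic. The 2 × 2 table in it is a made-up placeholder, not Table Q3, and the function names are illustrative; in R, the object returned by chisq.test carries the same expected counts in its expected component.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Pearson chi-square statistic (no continuity correction)."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical placeholder counts, NOT the data from Table Q3
placeholder = [[10, 20], [30, 40]]

expected = expected_counts(placeholder)
chi2 = chi_square_statistic(placeholder)
df = (len(placeholder) - 1) * (len(placeholder[0]) - 1)

print("expected counts:", expected)
print("chi-square statistic:", chi2, "with", df, "degree(s) of freedom")
```

Replacing the placeholder with the observed counts makes it possible to compare each observed frequency with its expected frequency cell by cell.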

Question 4

A new medication that was designed to lower the level of a certain enzyme was tested on 11 subjects. Table Q4 shows the level of the enzyme before and after the application of the medication.

Assume that the difference in the level of enzyme before and after medication in a subject may be modelled as a normal random variable. Apply a suitable test of hypothesis at the 5% level of significance to determine whether there is sufficient evidence to conclude that the medication is effective in lowering the level of the enzyme.

(a) State the null and alternative hypotheses.

(b) Calculate the sample mean, sample standard deviation and the value of the test statistic.

(c) What is the critical value of the test?

(d) What is the 𝑝-value of the test?

(e) What is the conclusion of the test?

(f) A graduate student wanted to analyse the data. In his haste, he recorded the numerical values without linking the values to the subject. So he ended up with two samples of data as follows:

Before medication:
9.32, 10.59, 14.71, 18.15, 19.92, 20.96, 23.36, 23.85, 24.65, 26.23, 31.86.
After medication:
8.03, 9.56, 10.16, 17.41, 18.39, 18.79, 19.93, 20.85, 22.14, 25.98, 29.34.

He treated the data as two independent samples and assumed that the data came from normal distributions. Since both sample standard deviations were around 6.7, he assumed that the variances of the two populations were the same. He performed a test of hypothesis at the 5% level of significance.

Without doing tedious computation, comment on the likely outcome of the graduate student’s test. Do you think the null hypothesis will be rejected? Justify your answer.

The purpose of this part is not for you to perform another routine test using the recipes. The main point here is to develop your intuition and insight into what goes on in such tests. Whether you are correct about the outcome of the test is not that important. The reasoning and justification you use to arrive at your conclusion are more important.
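
Once an intuitive answer to part (f) has been formed, it can be checked against a direct calculation. The sketch below is only such a check, assuming the pooled-variance two-sample 𝑡 statistic that the graduate student's assumptions imply; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Unpaired values as recorded by the graduate student
before = [9.32, 10.59, 14.71, 18.15, 19.92, 20.96, 23.36, 23.85, 24.65, 26.23, 31.86]
after = [8.03, 9.56, 10.16, 17.41, 18.39, 18.79, 19.93, 20.85, 22.14, 25.98, 29.34]

n1, n2 = len(before), len(after)
s1, s2 = stdev(before), stdev(after)  # sample standard deviations

# Pooled variance under the equal-variance assumption
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = (mean(before) - mean(after)) / se_diff

print("difference in means:", mean(before) - mean(after))
print("standard error of the difference:", se_diff)
print("t statistic:", t_stat, "on", n1 + n2 - 2, "degrees of freedom")
```

The size of the difference in means relative to its standard error is the quantity the intuition in part (f) turns on.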

Question 5

To view and use software packages, such as R and Excel, as a fast calculator or a convenient implementer of statistical procedures and algorithms is to miss out on the vast potential of these tools. We should use them for exploration and investigation. The study guide covers the approximation of the binomial and Poisson distributions by normal distributions under suitable conditions. Have you wondered how good these approximations are?

Using the notation in the study guide, page SU1-28, let 𝑋 ~ Poisson(30) and 𝑍 ~ 𝑁(30, 30) be random variables having the Poisson and normal distributions respectively. Note that the two variables have the same mean and variance. Recall the distribution functions 𝐹𝑋(𝑡) = 𝑃(𝑋 ≤ 𝑡) and 𝐹𝑍(𝑡) = 𝑃(𝑍 ≤ 𝑡). We wish to see how close the two distribution functions are.

You can generate the values of these distribution functions using your favourite software.

Instructions for Excel are provided here, in case you need them. The Excel commands for the distribution functions are 𝐹𝑋(𝑡) = POISSON(𝑡, 30, 1) and 𝐹𝑍(𝑡) = NORM.DIST(𝑡, 30, √30, 1). In Excel, you can drag a formula to generate the values for a list of inputs 𝑡. Generate the values of the two distribution functions for integer values 𝑡 = 0, 1, 2, …, 65. The values of the distribution functions for 𝑡 = 18, 19, ⋯, 26 are shown in Figure Q5 below together with their differences.

We see that for 𝑡 = 25, the absolute difference |𝐹𝑋(25) − 𝐹𝑍(25)| = 0.0277 = 2.77%.
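
For readers who prefer a script to a spreadsheet, the same table can be generated as follows. This is a minimal sketch using only the Python standard library; the helper names poisson_cdf and abs_diff are illustrative. It prints the values for 𝑡 = 18 to 26 so that the figure of 0.0277 at 𝑡 = 25 can be reproduced.

```python
from math import exp, sqrt
from statistics import NormalDist

LAM = 30
Z = NormalDist(mu=LAM, sigma=sqrt(LAM))  # normal with the same mean and variance


def poisson_cdf(t, lam=LAM):
    """P(X <= t) for X ~ Poisson(lam), summing the probabilities term by term."""
    term = exp(-lam)          # P(X = 0)
    total = term
    for k in range(1, t + 1):
        term *= lam / k       # P(X = k) from P(X = k - 1)
        total += term
    return total


def abs_diff(t):
    """Absolute difference between the Poisson and normal distribution functions at t."""
    return abs(poisson_cdf(t) - Z.cdf(t))


for t in range(18, 27):
    print(t, round(poisson_cdf(t), 4), round(Z.cdf(t), 4), round(abs_diff(t), 4))
```

Extending the loop to range(0, 66) gives the full list of values for 𝑡 = 0 to 65.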

(a) For what integer value 𝑡 is the absolute difference |𝐹𝑋(𝑡) − 𝐹𝑍(𝑡)| the largest?

(b) State the largest value of |𝐹𝑋(𝑡) − 𝐹𝑍(𝑡)| for an integer 𝑡.

(c) Do you consider the approximation using a normal distribution to be reasonably good? Provide some justification for your answer.

Recall that a continuity correction can be used. Does the continuity correction improve the approximation? Let 𝐹𝑍𝐶(𝑡) = 𝑃(𝑍 ≤ 𝑡 + 0.5) = 𝐹𝑍(𝑡 + 0.5). Note that 𝐹𝑍𝐶(𝑡) incorporates the continuity correction. Use Excel or any software to compute the values of 𝐹𝑍𝐶(𝑡) for 𝑡 = 0, 1, 2, ⋯, 65.

(d) For what integer value 𝑡 is the absolute difference |𝐹𝑋(𝑡) − 𝐹𝑍𝐶(𝑡)| the largest?

(e) State the largest value of |𝐹𝑋(𝑡) − 𝐹𝑍𝐶(𝑡)| for an integer value 𝑡.

(f) Does 𝐹𝑍𝐶(𝑡) give a better approximation than 𝐹𝑍(𝑡)?

For a given random variable 𝑈, the tail of its distribution may be of interest. The tail of the distribution concerns the values of 𝑃(𝑈 > 𝑢) = 1 − 𝐹𝑈(𝑢) for large values of 𝑢. It is clear that the value of 𝑃(𝑈 > 𝑢) decreases as 𝑢 increases. The question is how quickly 𝑃(𝑈 > 𝑢) decreases with increasing 𝑢. For two random variables, 𝑈 and 𝑉, we say that 𝑈 has a fatter tail than 𝑉 if 𝑃(𝑈 > 𝑡) > 𝑃(𝑉 > 𝑡) for all sufficiently large values of 𝑡.

(g) By looking at the values of 𝐹𝑋(𝑡) and 𝐹𝑍(𝑡) for large 𝑡, say 𝑡 > 50, state which of the distributions, Poisson(30) or 𝑁(30, 30), has the fatter tail.