Quantitative Methods

STAM4000
Quantitative Methods
Week 11
Chi-square tests
https://unsplash.com/@lastly?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan
Business School pursuant to Part VB of the
Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further
reproduction or communication of this material by you may be the subject of copyright
protection under the Act.
Do not remove this notice.

Week 11
Chi-square tests
Learning Outcomes
#1 Introduction to Chi-square tests
#2 Chi-square test of independence
#3 Standardized Chi-square residuals

Why does this matter?
If we have categorical variables, and our data are counts, we can still examine whether variables are independent.
https://www.reddit.com/r/mathmemes/comments/b2dub1/poor_souls/
#1 Introduction to Chi-square tests
https://unsplash.com/@senseiminimal?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText
#1 What are we testing here?
Chi-square tests are about one or more categorical variables.
We will follow the familiar process of hypothesis testing:
Check conditions, but now we will have conditions for Chi-square.
Follow the steps of hypothesis testing:
o Write hypotheses
o Find the Chi-square calculated test statistic
o Find the Chi-square critical value
o Sketch a Chi-square curve
o Decision
o Conclusion
Chi-square is read as “ki square”.

#1 Three different types of Chi-square tests
o Goodness-of-fit test: Compares the observed distribution of one categorical variable to an expected distribution of that categorical variable.
o Test of homogeneity: Compares the distribution of several groups for the same categorical variable.
o Test of independence: Examines the difference between observed and expected counts of two categorical variables, to determine if there is an association between the two variables.
We will cover the test of independence and standardized residuals.

#1 Chi-square tests: Theory
We assume:
o The outcome of each of the identical trials would fall into one of two categories.
o The probability of these outcomes is constant throughout the experiment.
o If p is the probability of success, the Expected frequency of an event X with success rate p is E[X] = np.
o The expected frequencies are calculated assuming the null hypothesis, Ho, is TRUE.
#1 Theory continued …
Our test compares the observed frequencies, from the sample, with the expected frequencies, from the hypothesised model in Ho.
We ask:
“Is the difference between what we expected and what we observed due to sampling variability, or is the difference large enough to be due to a change from the hypothesised model in Ho?”
We square the difference between the observed and expected frequencies, to make them positive, AND then we divide this by the expected frequency, to get an idea of the relative size of the difference.
#1 Chi-square calculated test statistic
$$\chi^2_{calc} = \sum \frac{(f_o - f_e)^2}{f_e}$$
$$\chi^2_{calc} = \sum \frac{(\text{Observed frequency} - \text{Expected frequency})^2}{\text{Expected frequency}}$$
$$\chi^2_{calc} = \sum \frac{(Obs - Exp)^2}{Exp} = \sum \text{Baby Chis}$$
$$\text{Baby Chi} = \frac{(Obs - Exp)^2}{Exp}$$
χ²calc is read as “ki squared calculated”.
$f_o$ = observed frequency from the sample
$f_e$ = expected frequency under the hypothesised model in Ho
#1 Chi-square curve
o χ² (the Greek letter chi, squared) is the symbol we use to represent a family of sampling distribution models and the statistic measuring the relative difference between observed and expected frequencies.
o χ² curves are positively skewed (skewed to the right).
o χ² tests are always right-tailed tests.
o The shape of the χ² curve depends on the degrees of freedom.
o Here, the degrees of freedom formula will include the number of row categories and the number of column categories.

#1 Chi-square curve continued …
χ² is read as “ki square”.
χ²crit is read as “ki square critical”.
Use χ² tables to find χ²crit based on:
o α, the level of significance
o degrees of freedom for χ²
df = (r – 1) × (c – 1)
Note:
r = number of row categories
c = number of column categories.
These do NOT include the totals.
[Figure: χ² curve with χ²crit marked on the horizontal axis. The "Do not reject Ho" region lies to the left of χ²crit; the "Reject Ho" region is the right tail.]

#1 Sample of χ² table used to find χ²crit
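When a printed table is not at hand, the critical value can also be computed. The following is a minimal sketch in plain Python, not part of the course materials: the function names `degrees_of_freedom`, `chi2_right_tail` and `chi2_critical` are illustrative assumptions. It uses the standard finite-series form of the χ² right-tail area for whole-number degrees of freedom and a simple bisection search, and it should agree with the table values used in this deck.

```python
import math


def degrees_of_freedom(n_rows: int, n_cols: int) -> int:
    """df = (r - 1) x (c - 1); r and c do NOT include the totals."""
    return (n_rows - 1) * (n_cols - 1)


def chi2_right_tail(x: float, df: int) -> float:
    """Area under the chi-square curve to the right of x (whole-number df >= 1)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^k / k! for k = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-tailed standard normal area plus a finite series
    root = math.sqrt(x)
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)  # normal density at root
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(root / math.sqrt(2)) + 2 * density * total


def chi2_critical(alpha: float, df: int) -> float:
    """Value with right-tail area alpha, found by bisection."""
    low, high = 0.0, 1.0
    while chi2_right_tail(high, df) > alpha:
        high *= 2  # widen the bracket until the tail area drops below alpha
    for _ in range(100):
        mid = (low + high) / 2
        if chi2_right_tail(mid, df) > alpha:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# A 2 x 2 table at alpha = 5%
print(round(chi2_critical(0.05, degrees_of_freedom(2, 2)), 3))  # 3.841
# A 2 x 3 table at alpha = 5%
print(round(chi2_critical(0.05, degrees_of_freedom(2, 3)), 3))  # 5.991
```

The bisection works because the right-tail area only decreases as the value on the horizontal axis increases, so there is exactly one point where the area equals α.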
#1 Conditions to check for Chi-square tests
o Randomization Condition: The individuals or items in the sample should be randomly selected from the population.
o Counted Data Condition: The data must be counts (or frequencies) for the categories of the categorical variables.
o Expected Cell Frequency Condition: Expected frequencies for each cell must be ≥ 5.
https://unsplash.com/@zmachacek?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText
#2 Chi-square test of independence
https://www.pinterest.com.au/pin/390476230194036577/
#2 Chi-square test of independence, hypotheses
The null hypothesis is written as:
Ho: The two named categorical variables are independent
Ho is stating that there is NO association between the categorical variables.
The alternative hypothesis is written as:
Ha: The two named categorical variables are NOT independent
Ha is stating that there is an association between the categorical variables, such that they are NOT independent.

#2 Hypotheses for independence continued …
In Ho, use the names of the variables.
In Ha, use the names of the variables.
If we RETAIN Ho, we conclude that the difference in proportions is not statistically significant at α and can reasonably be attributed to chance.
If we REJECT Ho, a statistically significant chi-square statistic does not in itself tell us about the nature of the dependence. We need to investigate the individual contributions together with their signs.

#2 Example of building a Chi-square test of independence
A marketing manager for a school stationery supplier was interested in whether there is evidence of a dependence between gender and preferred writing hand.
A random sample of 300 individuals was taken, where participants were asked their writing hand preference and their gender.
The observed values are summarized in Table i).

Gender    Hand Preference     Total
          Left      Right
Female    12        108       120
Male      24        156       180
Total     36        264       300

Table i)
https://www.buzzfeed.com/lanesainty/paddles-the-cat-hello
#2 Example continued
Are gender and writing hand preference independent?
Check the conditions, then test at α of 5%.
Check conditions:
Told random sample; data in table are counts; to check that the expected cell frequencies are ≥ 5, we must first learn how to find these. We will check this condition later.
Step 1: Write the hypotheses
Ho: gender and writing hand preference are independent
Ha: gender and writing hand preference are NOT independent
#2 Example continued
Step 2: Find the calculated test statistic
In hypothesis testing, we assume Ho is true and then gather evidence to try to show Ho is false.
In a Chi-square test of independence, the null hypothesis, Ho, is claiming that the two variables are independent.
If two events, A and B, are independent, then
P(A and B) = P(A) × P(B)
#2 Example continued
In hypothesis testing, we must ASSUME Ho is TRUE. Here, by assuming Ho is true, we are assuming that the categorical variables are independent.
We can use the rule of independent events to calculate the expected number of individuals, using Table i).
For example, if gender and hand preference were independent, then the joint probability of female and left-handed would be as follows:
$$P(\text{female and left-handed}) = P(\text{female}) \times P(\text{left-handed}) = \frac{120}{300} \times \frac{36}{300} = 0.048$$
Now, in a sample of 300 individuals, the corresponding Expected number of individuals who are female and left-handed may be calculated as follows:
$$E[\text{female and left-handed}] = np = 300 \times 0.048 = 14.4 \text{ individuals}$$
On the next slide, we describe an easier method to find the expected cell frequencies.
#2 Expected cell frequency formula
Expected cell frequencies are the frequencies we would expect if the null hypothesis, of independence, was TRUE. When we have two categorical variables, and our data are counts, in a contingency table, then for each cell in the body of the table, we can use the following formula to calculate the expected cell frequencies:
$$\text{Expected frequency} = \frac{\text{row total} \times \text{column total}}{\text{overall total}}$$
Example: find the Expected cell frequency for female and left-handed, with this formula:
$$\text{Expected cell frequency} = \frac{\text{row total for Female} \times \text{column total for Left-handed}}{\text{overall total in the sample}} = \frac{120 \times 36}{300} = 14.4$$
That is, 14.4 individuals, if gender and hand-preference were independent.
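The same formula can be applied to every cell of the table at once. The following is a minimal sketch in plain Python, using the observed counts from Table i); the function name `expected_frequencies` and the list-of-rows layout are illustrative assumptions, not part of the course materials.

```python
def expected_frequencies(observed):
    """Expected count per cell = row total x column total / overall total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    overall = sum(row_totals)
    return [[r * c / overall for c in col_totals] for r in row_totals]


# Table i): rows are Female, Male; columns are Left, Right (totals excluded)
observed = [[12, 108], [24, 156]]
expected = expected_frequencies(observed)

for label, row in zip(["Female", "Male"], expected):
    print(label, [round(e, 1) for e in row])
# Female [14.4, 105.6]
# Male [21.6, 158.4]

# Expected Cell Frequency Condition: every expected count must be at least 5
condition_met = all(e >= 5 for row in expected for e in row)
print(condition_met)  # True
```

Note that the expected counts in each row and column add up to the same totals as the observed counts.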
#2 Example continued

Gender    Hand Preference                              Total
          Left                  Right
          Observed  Expected    Observed  Expected
Female    12        14.4        108       105.6        120
Male      24        21.6        156       158.4        180
Total     36        36          264       264          300

Table ii) Summary of Observed and Expected cell frequencies
Condition to check: all the expected cell frequencies are ≥ 5.
As all the conditions are satisfied, we can use the Chi-square distribution.
#2 Example continued
How close are the Observed frequencies and the Expected frequencies?
We use the Chi-square calculated test statistic formula to measure this.
$$\chi^2_{calc} = \sum \frac{(Obs - Exp)^2}{Exp} = \sum \text{Baby Chi}$$
$$\chi^2_{calc} = \frac{(12 - 14.4)^2}{14.4} + \frac{(24 - 21.6)^2}{21.6} + \frac{(108 - 105.6)^2}{105.6} + \frac{(156 - 158.4)^2}{158.4}$$
$$\chi^2_{calc} = 0.758$$
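To see how much each cell contributes, the calculation can be sketched in plain Python. This is an illustrative sketch using the observed and expected counts from Table ii); the function name `baby_chis` is an assumption borrowed from the slide terminology.

```python
def baby_chis(observed, expected):
    """Contribution of each cell: (Obs - Exp)^2 / Exp."""
    return [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]


# Table ii): rows are Female, Male; columns are Left, Right
observed = [[12, 108], [24, 156]]
expected = [[14.4, 105.6], [21.6, 158.4]]

contributions = baby_chis(observed, expected)
chi2_calc = sum(sum(row) for row in contributions)

print([[round(b, 3) for b in row] for row in contributions])
# [[0.4, 0.055], [0.267, 0.036]]
print(round(chi2_calc, 3))  # 0.758
```

The left-handed cells contribute most of the total, because the same difference of 2.4 is divided by a smaller expected count.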
#2 Example continued
Step 3: Find the critical value
Use χ² tables to find χ²crit based on:
o α. Earlier told α = 5% = 0.05
o degrees of freedom for χ²
df = (r – 1) × (c – 1)
df = (2 – 1) × (2 – 1)
df = 1
χ²crit = 3.841
Note:
r = number of row categories
c = number of column categories.
These do NOT include the totals.
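#2 Example continued
Step 4: Sketch a curve
[Figure: χ² curve with χ²crit = 3.841 marked on the horizontal axis. χ²calc = 0.758 falls in the "Do not reject Ho" region to the left of χ²crit; the "Reject Ho" region is the right tail.]
Step 5: Decision
Here, we cannot reject Ho as χ²calc of 0.758 is less than χ²crit of 3.841.
We must retain Ho at α of 5%.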
#2 Example continued
Step 6: Conclusion
There is no significant evidence that gender and hand-preference are not
independent.
Note: a slightly higher proportion of males in the sample preferred to write
with their left hand.
However, the difference in proportions is nowhere near statistically significant
at the 0.05 level and can reasonably be attributed to chance.
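The sample proportions behind this note can be checked directly from Table i). A small illustrative sketch in Python (the variable names are assumptions):

```python
# Counts from Table i)
left_handed = {"Female": 12, "Male": 24}
group_total = {"Female": 120, "Male": 180}

# Proportion of each gender in the sample who prefer the left hand
proportions = {g: left_handed[g] / group_total[g] for g in left_handed}

for gender, p in proportions.items():
    print(f"{gender}: {p:.1%}")
# Female: 10.0%
# Male: 13.3%
```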

#2 Exercise
The Titanic was a ship which tragically sank in 1912, after colliding with an iceberg.
This table cross classifies survival by the person’s status on the ship.
(Source: Sharpe, De Veaux & Velleman “Business Statistics” 2nd Edition, 2014 Pearson Education International, England, Chapter 14 and page 451)
Were survival and status on board the Titanic independent?

           STATUS
SURVIVAL   Crew   First class   Second class   Third class   Total
Alive      212    202           118            178           710
Dead       673    123           167            528           1491
Total      885    325           285            706           2201

In 1912, on its maiden journey, the famous ship the Titanic hit an iceberg and sank.
#2 Exercise continued
a) Write the hypotheses for this test.
b) If a person’s chance of survival was the same, irrespective of their status on the ship, how many crew members would you expect to have survived?
That is, calculate and comment on the expected cell frequency for “alive” and “crew”, to complete the following table of expected cell frequencies.

STATUS     Crew                 First class          Second class         Third class
SURVIVAL   Observed  Expected   Observed  Expected   Observed  Expected   Observed  Expected   Total
Alive      212       ____       202       104.8387   118       91.93548   178       227.7419   710
Dead       673       599.5161   123       220.1613   167       193.0645   528       478.2581   1491
Total      885       ____       325       325        285       285        706       706        2201
#2 Exercise continued
c) Find the baby chi for “alive” and “crew”.
d) Find the critical value, if testing at α = 1%.
e) You are now told that the calculated test statistic is 187.8. Give your decision,
briefly explaining your reasoning.
f) Give your conclusion.
#3 Standardized Chi-square residuals
https://www.pinterest.com.au/sonrisamas/psychology/
#3 Chi-square standardized residuals
$$\text{Standardized residual} = \frac{(Obs - Exp)}{\sqrt{Exp}}$$
Chi-square tests are always right-tailed tests. When we reject a null hypothesis of
independence, we are stating that there is significant evidence that the fit to the
model of independence between the two categorical variables is not good.
We do not know the direction of the relationship from this conclusion.
When we reject independence, we can examine the Chi-square standardized
residuals, to gather more information:

#3 Chi-square standardized residuals
$$\text{Standardized residual} = Z \text{ score} = \frac{(Obs - Exp)}{\sqrt{Exp}}$$
The standardized residual is a Z score.
The mean of the standardized residual is 0 and the standard deviation is 1.
If the null hypothesis were true, we could use the Empirical rule (the 68-95-99.7% rule) to interpret the standardized residuals.
For example:
o A standardized residual greater than 3 tells us that what we observed was unusually larger than what we expected.
o A standardized residual less than -3 tells us that what we observed was unusually less than what we expected.

#3 Example
A marketing manager for a school stationery supplier was interested in whether gender and writing hand preference were independent. A random sample of 300 individuals was taken, where participants were asked their writing hand preference and their gender.
The observed and expected frequencies are given in Table ii).
a) Calculate the standardized residuals for each cell.
b) Briefly comment on these.
#3 Example continued
a) Calculate the standardized residuals for each cell.

STANDARDIZED RESIDUALS
Gender    Hand Preference
          Left      Right
Female    -0.632    0.234
Male      0.516     -0.191

b) Briefly comment on these.
None of these standardized residuals are noteworthy, as they are all quite small, which is what occurs when the null hypothesis is not rejected, as in this example.
The largest negative value of -0.632 suggests that we observed fewer females who are left-handed than we expected, but as a Z score, this value is not that important.
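The standardized residuals for this example can be reproduced with a few lines of plain Python. This is an illustrative sketch using the counts from Table ii); the function name `standardized_residuals` is an assumption.

```python
import math


def standardized_residuals(observed, expected):
    """Standardized residual per cell: (Obs - Exp) / sqrt(Exp)."""
    return [
        [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]


# Table ii): rows are Female, Male; columns are Left, Right
observed = [[12, 108], [24, 156]]
expected = [[14.4, 105.6], [21.6, 158.4]]

residuals = standardized_residuals(observed, expected)
for label, row in zip(["Female", "Male"], residuals):
    print(label, [round(z, 3) for z in row])
# Female [-0.632, 0.234]
# Male [0.516, -0.191]

# Squaring each residual gives that cell's baby chi, so the squares sum to the test statistic
print(round(sum(z ** 2 for row in residuals for z in row), 3))  # 0.758
```

The sign of each residual shows the direction of the difference (negative means fewer observed than expected), which is the information the χ² statistic alone does not give.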
#3 Exercise
For the Titanic exercise, here is the complete table of observed and expected cell frequencies:

STATUS     Crew                 First class          Second class         Third class
SURVIVAL   Observed  Expected   Observed  Expected   Observed  Expected   Observed  Expected   Total
Alive      212       285.4839   202       104.8387   118       91.93548   178       227.7419   710
Dead       673       599.5161   123       220.1613   167       193.0645   528       478.2581   1491
Total      885       885        325       325        285       285        706       706        2201

a) Complete the following table of standardized residuals.
b) Compare the standardized residual for crew members and first-class passengers, who survived.

STANDARDIZED RESIDUALS
           STATUS
SURVIVAL   Crew     First class   Second class   Third class
Alive      ____     ____          2.718          -3.296
Dead       3.001    -6.548        -1.876         2.275
Supplementary Exercises
Students are advised that Supplementary Exercises to this topic may be found on the
subject portal under “Weekly materials”.
Solutions to the Supplementary Exercises may be available on the portal under “Weekly materials” at the end of each week.
Time permitting, the lecturer may ask students to work through some of these exercises
in class.
Otherwise, it is expected that all students work through all Supplementary Exercises
outside of class time.

Extension
The following slides are an extension to this week’s topic.
The work covered in the extension:
o Is not covered in class by the lecturer.
o May be assessed.
Exercise
An entrepreneur is considering opening a takeaway store selling tea and would
like you to research whether there is an association between age and tea
preference. You randomly sampled 260 individuals asking their age group and
their tea preference. The observed frequencies are summarized in the table
below. Test for independence at α = 5%. Check the conditions.

Tea preference
Age Black Herbal Cold Total
Under 30 years 20 30 100 150
30 years and over 70 25 15 110
Total 90 55 115 260

Exercise solution
Check conditions:
Told random sample
Data are counts
The expected cell frequencies, in the table below, are each greater than 5.
As the conditions are satisfied, we can use the Chi-square distribution.
Ho: Age and tea preference are independent
Ha: Age and tea preference are NOT independent

                     Tea preference
                     Black                Herbal               Cold
Age                  Observed  Expected   Observed  Expected   Observed  Expected   Total
Under 30 years       20        51.9231    30        31.7308    100       66.3462    150
30 years and over    70        38.0769    25        23.2692    15        48.6538    110
Total                90        90         55        55         115       115        260

Exercise solution
$$\chi^2_{calc} = \sum \frac{(Obs - Exp)^2}{Exp}$$
$$\chi^2_{calc} = \frac{(20 - 51.9231)^2}{51.9231} + \frac{(70 - 38.0769)^2}{38.0769} + \frac{(30 - 31.7308)^2}{31.7308} + \frac{(25 - 23.2692)^2}{23.2692} + \frac{(100 - 66.3462)^2}{66.3462} + \frac{(15 - 48.6538)^2}{48.6538}$$
$$\chi^2_{calc} \approx 86.96$$
Degrees of freedom: df = (2 – 1) × (3 – 1) = 2.
χ²crit = 5.991 at the α = 0.05 level.
As χ²calc > χ²crit, that is, 86.96 > 5.991, we reject Ho and accept Ha at the α = 0.05 level.
We can advise the entrepreneur that there is significant evidence that age and tea preference are not independent, and are related.
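The whole solution can be checked end to end with a short script. The following is a minimal sketch in plain Python under stated assumptions: the function name `independence_test` and the returned dictionary are illustrative, and the critical value 5.991 is taken from the χ² table as in the solution rather than computed.

```python
def independence_test(observed, chi2_crit):
    """Chi-square test of independence on a table of counts (totals excluded)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    overall = sum(row_totals)
    # Expected count per cell = row total x column total / overall total
    expected = [[r * c / overall for c in col_totals] for r in row_totals]
    # Test statistic = sum of (Obs - Exp)^2 / Exp over all cells
    chi2_calc = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return {
        "expected": expected,
        "condition_met": all(e >= 5 for row in expected for e in row),
        "df": (len(row_totals) - 1) * (len(col_totals) - 1),
        "chi2_calc": chi2_calc,
        "decision": "reject Ho" if chi2_calc > chi2_crit else "do not reject Ho",
    }


# Rows: Under 30 years, 30 years and over; columns: Black, Herbal, Cold
tea = [[20, 30, 100], [70, 25, 15]]
result = independence_test(tea, chi2_crit=5.991)  # table value for alpha = 0.05, df = 2

print(result["condition_met"])         # True
print(result["df"])                    # 2
print(round(result["chi2_calc"], 2))   # 86.96
print(result["decision"])              # reject Ho
```

Carrying the expected frequencies at full precision gives about 86.96; rounding them to two decimal places before squaring shifts the result slightly in the second decimal place, which does not affect the decision.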