Quantitative Methods

1
1
STAM4000
Quantitative Methods
Week 9
Correlation and simple linear
regression
https://www.google.com/search?q=causation+versus+correlation+comic+cartoon+funny&tbm=isch&ved=2ahUKEwjA1YH7vsfuAhXBn0sFHWGhDYwQ2-
cCegQIABAA&oq=causation+versus+correlation+comic+cartoon+funny&gs_lcp=CgNpbWcQAzoHCCMQ6gIQJzoECCMQJzoECAAQQzoICAAQsQMQgwE6BQgAELEDOgcIABCxAxBDOgoIABCxAxCDARBDOgIIADoECAAQGDoGCAAQChAYOgQIABAeUO2AAljA3AJg094C
aAJwAHgAgAGrAogB9kGSAQYwLjQ3LjKYAQCgAQGqAQtnd3Mtd2l6LWltZ7ABCsABAQ&sclient=img&ei=mVMXYIClHMG_rtoP4cK24Ag&bih=470&biw=1013&rlz=1C1CHBF_enAU841AU846&hl=en#imgrc=I3tF8lFRL1EZiM

2
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan
Business School pursuant to Part VB of the
Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further
reproduction or communication of this material by you may be the subject of copyright
protection under the Act.
Do not remove this notice.
2

3
d
#1
#2
#3
Examine the relationship between two
quantitative variables
Differentiate between correlation and
causation
Model with simple linear regression
#4 Create and assess reliability of forecasts
Week 9
Correlation an
simple linear
regression
Learning
Outcomes

4
Why does this matter?
If there is an
association
between two
quantitative
variables, we can
model the
relationship to
predict future
values.
https://www.google.com/search?q=regression+cartoon&rlz=1C1CHBF_enAU841AU846&sxsrf=ALeKk02pME45VK0MPH4OfnjIO_5qRS3Jag:1611892844986&source=lnms&tbm=isch&sa=X&ved=2ahUKEwibuMPjoMDuAhV9zTgGHRK
EAxwQ_AUoAXoECBIQAw&biw=1024&bih=444#imgrc=kC7IUqSGWMulMM

5
#1 Examine the relationship
between two quantitative variables
https://www.google.com/search?q=regression+cartoon&rlz=1C1CHBF_enAU841AU846&sxsrf=ALeKk02pME45VK0MPH4OfnjIO_5qRS3Jag:1611892844986&source=lnms&tbm=isch&sa=X&ved=2ahUKEwibuMPjoMDuAhV9zTgGHRKEAxwQ_AUoAXoECBIQAw&b
iw=1024&bih=444#imgrc=HDY2nP8j3G4I5M

6
#1 Make a picture
Scatterplots are a tool for representing the relationship
between two quantitative variables.
X is the “independent” or “explanatory” or “predictor” variable
Y is the “dependent” or “response” variable
Before creating a scatterplot, it is best to decide which
variable is responding to the other.
Note: in business, we usually have positive numbers and deal
with the top right quadrant of the X, Y axes.
-10

-5
-5
5
X
5
10

0
15
-10 0 10
Scatterplot of Y against X
5 0
10
15
0 2 4 6
Y (units)
X (units)
Scatterplot of of Y against
X
Y
7
#1
Here are two examples of scatterplots representing the
relationship between two quantitative variables.
Linear model Non-linear model
0
50
100
0 5 10
Exam score (%)
Cups of coffee before a test
Scatterplot of exam score against
cups of coffee before a test
0
10
20
30
0 50 100
Exam score (%)
Hours of study
Scatterplot of exam score against
hours of study
This Photo by Unknown Author is licensed under CC BY-NC-ND
Illustration
8
#1
This Photo by Unknown Author is
licensed under CC BY
Example
A random sample of thirty rental properties was collected and values for the following
variables were recorded: weekly
rent, ($/wk), distance from the city centre (km), number of
bedrooms, number of bathrooms and age of the property (year).
EXCEL was used to create the following scatterplots.
How could we describe the relationship between rent and distance?
How could we describe the relationship between weekly rent and number of bedrooms?
0
200
400
600
800
1000
1200
1400
0 10 20 30 40
Rent ($/wk)
Distance (km)
Scatterplot of rent against distance
0
200
400
600
800
1000
1200
1400
0 1 2 3 4 5
Rent ($/wk)
Bedrooms
Scatterplot of rent against bedrooms
9
#1 Example
10
#2 Differentiate between correlation and causation
This Photo by Unknown Author is licensed under CC BY-SA
11
Correlation ≠ causation
12
#2 The correlation coefficient
𝑟 =
σ
𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ (𝑦𝑖 – 𝑦തሻ
σ
𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ 2 𝑦𝑖 – 𝑦ത 2
13
STRONG: tightly
clustered points
WEAK: loosely
clustered points
Negative
relationships
Y
X
Y
X
Y Y
X X
Positive
relationships
#2 Direction: + or -? Strength: strong or weak?
14
#2 Summary table of direction and size of r

-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1
negative positive

 

very
strong
strong moderate weak
very
strong
weak moderate strong

15
15
#2 How do we use EXCEL to find correlation coefficients?
16
This Photo by Unknown Author is licensed under CC BY
16
#2 Exercise
A random sample of thirty rental properties was collected and values for
the following variables were recorded: weekly
rent, ($/wk), distance from
the city centre (km), number of
bedrooms, number of bathrooms and age
of the property (year). Using the EXCEL file named “STAM Week 9 Excel
file.xls” and the sheet named “Rent”, create a correlation table of these
variables.
Use the correlation table to answer the following:
a) Describe the relationship between rent and distance?
b) Describe the relationship between weekly rent and number of bedrooms?

18
#3 Model with simple linear regression
https://www.google.com/search?q=regression%20analysis%20cartoon&tbm=isch&hl=en&rlz=1C1CHBF_enAU841AU846&sa=X&ved=0CMoBEKzcAigAahcKEwiIuZO7tMPuAhUAAAAAHQAAAAAQAg&biw=1013&bih=433#imgrc=mWMAV2iiX–veM
19
Deterministic Model: an equation or set of
equations that allow us to
fully determine
the value of the dependent variable from the
values of the independent variable(s).
Probabilistic Model: a method used to
capture the
randomness that is part of a reallife process.
E.g.: Do all houses of the same size, sell for
the same price?
Copyright © 2013 Pearson Australia (a division of Pearson
Australia Group Pty Ltd) – 9781442549272/Berenson/Business
Statistics /2e
#3 Two basic types of regression models
https://www.google.com/search?q=house+for+sale+comic&rlz=1C1CHBF_enAU841AU846&sxsrf=ALeKk022NxOgtvF_3V91LvfSbS_JsBzlPw
:1612158142201&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjYy4yL_cfuAhVwIbcAHR8QAwAQ_AUoAXoECAMQAw&biw=1024&bih=44
4#imgrc=N9jlWL7JyumnoM

20
A model of the relationship between house size (independent
variable) and house price (dependent variable) would be:
Y axis intercept is 200,000
Equation for this deterministic model: House price = 200 000 + 800(Size)
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
#3 Illustration of a model
X, Size (sqm)
Y
House
price
($)
In this model,
the price of
the house is
completely
determined by
the size.

21
x
200,000
Equation for this probabilistic model:
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) 9781442549272/Berenson/Business Statistics /2e – House price = 200,000 + 800 Size +
#3 Illustration of continued
X, Size (sqm)
Y
House
price
($)
In real-life, the
price of a house
will be
influenced by
many variables.
E.g.: location,
type of building
materials,
quality of
fittings etc.
Realistic scatterplot of
house price against size
Lower
variability
versus
higher
variability
The probabilistic model
includes the “error” term

22
We now represent the price of a house as a function of
its size in this
probabilistic model:
House price = 200,000 + 800 House size +

where (Greek letter read as “epsilon”):
is the random term in the population
is also called the error variable or the residual
It is the difference between the actual (or observed) selling price and the
estimated (or predicted) selling price of the same sized house.
varies from house sale to house sale, for the same value of x, the size of the
house, due to other factors influencing the price.
In the sample equation, the random term is represented by e
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
#3 Random term, 𝜺
23
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
Random error for this X
i value
= RESIDUAL
=
𝑌
𝑖 – 𝑌෠𝑖
Y
X
Observed value of Y for X
i
𝑌෠ = Predicted value of Y for X
i
X
i
Slope
= β
1
=
𝑟𝑖𝑠𝑒
Intercept = β0 𝑟𝑢𝑛

εi

𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖
#3 Simple linear regression model (SLR)
24
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty
Ltd) – 9781442549272/Berenson/Business Statistics /2e
Yi = β0 + β1Xi + εi
Linear
component
Population
Y intercept
Population
slope
coefficient
Dependent
variable
Independent
variable
Random error
component
Only one independent variable, X
Relationship between X and Y is described by a linear function
Changes in Y are associated with changes in X
#3 Simple linear regression model (SLR)
25
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) –
9781442549272/Berenson/Business Statistics /2e
The simple linear regression equation (derived from a sample) provides an
estimate of the population regression line.
Estimate of the
regression
intercept Estimate of the regression slope
Estimated (or predicted)
Y value for observation i
Value of X for observation i
The individual random error terms e
i have a mean of zero
𝒀෡𝒊 = 𝒃𝟎 + 𝒃𝟏 𝑿𝒊
#3 Simple linear regression equation (prediction line)
26
The Process of estimating regression coefficients: Least
Squares Method
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
b
0 is the estimated average value of Y when the value of X is zero
b
1 is the estimated change in the average value of Y as a result of a one-unit
change in X
Most statistical software packages will find the values of b0 and b1 that minimise
the sum of the squared differences between actual values (Y) and predicted
values (
𝑌෠ ), i.e.
𝒎𝒊𝒏 ෍ 𝒀𝒊 – 𝒀෢𝒊 𝟐 = 𝒎𝒊𝒏 ෍ 𝒀𝒊 – (𝒃𝟎 + 𝒃𝟏 𝑿𝒊𝟐
#3
27
Y
X
#3 Least squares method
The question is:
Which straight line
fits best?
The least squares
line minimises the
sum of the squared
differences between
the points and
the line.

28
Formula for calculating the slope,
where,
and
Formula for calculating the Y intercept,

where and 𝑋ത = 𝑌ത = σ𝑖 𝑛=1 𝑌𝑖
𝑛

Least Squares Estimation formulae
1 1
1 1
( )( )
( )( )
n n
n n i i
i i
i i i i
i i
X Y
SSXY X X Y Y X Y
n
= =
= =
= – – = –
 
 
2
2 2 1
1 1
( )
( )
n
n n i
i
i i
i i
X
SSX X X X
n
=
= =
= – = –

 
σ𝑖 𝑛=1 𝑋𝑖
𝑛
#3
𝑏
𝑜 = 𝑌ത – 𝑏1 𝑋ത
𝑏
1 = 𝑆𝑆𝑋𝑌
𝑆𝑆𝑋
These manual
calculations are
tedious and will
NOT be assessed.
It is more efficient to
use a software
package, such as
EXCEL, to find the
values of b
0 and b1.
You WILL BE
ASSESSED on how
you use EXCEL
regression output

29
29
How do we use EXCEL to create a simple linear
regression?
#3
In the menu bar:
Click on “Data”,
then select
“Data Analysis”

30
#3 Example
The manager of a computer games store wishes to:
Examine the relationship between weekly sales ($000) and the number
of customers making purchases.
Use the results of that examination to predict future weekly sales.

Copyright © 2013 Pearson Australia (a division of Pearson Australia
Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e

 

Weekly sales
($000)
No.
customers
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700

0
200
400
600
0 500 1000 1500 2000 2500 3000
Weekly sales ($000)
Number of customers
Scatterplot of Weekly sales versus
number of customers
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) –
9781442549272/Berenson/Business Statistics /2e
This Photo by Unknown Author is licensed under CC BY-SA
31

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.762
R Square 0.581
Adjusted R Square 0.528
Standard Error 41.330
Observations 10
ANOVA
df SS MS F Significance F
Regression 1 18934.93478 18934.93 11.08476 0.010
Residual 8 13665.56522 1708.196
Total 9 32600.5
Coefficients Standard
Error
t Stat P-value Lower 95% Upper
95%
Intercept 98.248 58.033 1.693 0.129 -35.577 232.074
No. customers 0.110 0.033 3.329 0.010 0.034 0.186

#3 Example Excel regression output:

= Correlation coefficien t, r. EXCE L “forgets” t he sign o
Check the si gn of the slope,b, f or the sign f r.
= sample size , n
Name of X variable
b
1 =
0

1b
=
This Photo by Unknown
Author is licensed under
CC BY-SA

32
#3 Example
a) How do we write the regression model/equation?
Estimated weekly sales = 98.248 + 0.110 Number of customers
b) How do we interpret the correlation coefficient, r? r = 0.762
There is a strong, positive, linear relationship between weekly customer sales and the number of customers
making purchases.
c) How do we interpret the intercept? b0 = 98.248
We estimate, that for zero customers, this game store makes 98.248 ($000s) or $98,248 per week in sales,
on average. This makes no sense. However, the intercept simply indicates that over the sample size
selected, the portion of weekly sales not explained by number of customers is estimated to be
$98,248.30,
on average.
Also note that X = 0 is outside the range of observed values, and this makes the intercept
estimate an extrapolation (see later).
d) How do we interpret the slope? b1 = 0.110
We estimate, for each one extra paying customer in this games store, weekly sales increase by an average
0.110 ($000) or by $110.
This Photo by Unknown
Author is licensed under
CC BY-SA

33
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
Intercept
= 98.248
#3 Example
0
50
100
150
200
250
300
350
400
450
0 500 1000 1500 2000 2500 3000
Weekly sales ($000)
Number of customers
Scatterplot and predicted line for Weekly sales versus number of customers
Weekly sales ($000)
Predicted Weekly sales
($000)
Linear (Weekly sales
($000))
𝑊𝑒𝑒𝑘𝑙𝑦 ෣𝑠𝑎𝑙𝑒𝑠 = 98.248 + 0.110 Number of customers
This Photo by Unknown Author is licensed
under CC BY-SA

34
#3 Exercise
A random sample of thirty rental properties was collected and values for the following
variables were recorded: weekly
rent, ($) and distance from the city centre (km).
a) Use Excel and the file named STAM4000 Week 9 Excel file.xls to create a simple linear
regression between these two variables. The Excel output is provided in the next slide.
b) Write the regression equation.
c) Recall, earlier we created a correlation table for this data set. For the variables of rent and
distance, is the correlation coefficient in the regression output the same as in the earlier
correlation table? Briefly explain.
d) Interpret the intercept?
e) Interpret the slope?
This Photo by Unknown Author is licensed under CC BY
35

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.863
R Square 0.745
Adjusted R Square 0.736
Standard Error 119.530
Observations 30
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 820.148 40.124 20.441 2.289E-18 737.958 902.338
Distance (km) -23.495 2.597 -9.048 8.348E-10 -28.814 -18.176

#3 Exercise solution

= Correlation coefficient , r. EXCEL forgets” the s ign of r.
Check the s gn of the s lope,b1, fo r the sign of r .
= sample size , n
Name of X vari able
b
0
=
b
1
=

a) This Photo by Unknown Author is licensed under CC BY
37
#4 Create and assess reliability of forecasts
https://www.google.com/search?q=statisyics+estimation+cartoon&tbm=isch&ved=2ahUKEwif5KnRwcPuAhVFRSsKHdHPA2oQ2-cCegQIABAA&oq=statisyics+estimation+cartoon&gs_lcp=CgNpbWcQA1CWkiFY8qshYJmuIWgAcAB4AIABgGIAbwQkgEFMC44LjOYAQCgAQGqAQtnd3Mtd2l6LWltZ8ABAQ&sclient=img&ei=iD0VYJ-AGMWKrQHRn4_QBg&bih=444&biw=1024&rlz=1C1CHBF_enAU841AU846#imgrc=H1tby55I92IsuM
38
Create and assess reliability of forecasts
We can use the SLR model to create forecasts/predictions/estimates of the Y variable for a
given value of X – be careful of units and reliability.
Reliability of forecasts/predictions/estimates?
Depends on two factors:
i) Coefficient of determination, r
2 :
High r2 value (closer to 1), may suggest high reliability of the forecast.
Low r2 value (closer to 0), may suggest low reliability of the forecast.
ii) Value of X:
If the value of X is inside the sample data range, the estimate is called an
interpolation and may be reliable if the r2 value is reasonably high.
If the value of X is outside the sample data range, the estimate is called
an
extrapolation and is NOT reliable, irrespective of the r2 value.
#4
39
39
Coefficient of determination, r2
How good is the “fit” of the regression model with the dataset?
To answer this, we can use the coefficient of determination, r
2 .
r2 is the proportion (or percentage) of the total variation in the Y variable
explained by the regression with the X variable.
0 ≤ r2 ≤ 1, where, in general:
o The closer r2 is to 1, the better the fit of the linear regression
model.
o If 0.4 < r2 < 0.7, there is a relatively moderate fit
o If r2 ≤ 0.4, there is a relatively weak fit
o We will use EXCEL output instead of this formula.
This Photo by Unknown Author is licensed under CC BY-NC-ND
In EXCEL, the
coefficient of
determination
is called
R square.
2 regression sum of squares
total sum of squares
SSR
r
SST
= =
#4
40

SUMMARY OUTPUT
Regression
Statistics
Multiple R 0.762
R Square 0.581
Adjusted R Square 0.528
Standard Error 41.330
Observations 10

Example
How do we interpret the coefficient of
determination coefficient for the computer
games store example?
Coefficient of determination = r2 = 0.581
Interpretation:
0.581 x 100% = 58.1%.
We can say that 58.1% of the variation in weekly sales is explained by the variation in number
of customers making a purchase. As this is just above 50%, we can say that this regression
model is a moderate fit for the dataset.
Note: This also tells us that 41.9% of the variation in weekly sales is explained by other
variables, e.g. quality of computer games sold, demographic of area where store is located,
vicinity of nearest competitor etc.
This Photo by Unknown Author is licensed under CC BY-SA
This Photo by Unknown Author is licensed under CC BY-NC-ND
#4
41
0
200
400
600
0 1000 2000 3000
Weekly sales ($000)
Number of customers
Scatterplot of Weekly sales versus
number of customers
0
200
400
600
0 1000 2000 3000
Weekly sales ($000)
Number of customers
Scatterplot of Weekly sales versus number
of customers
#4 Extrapolation and interpolation

Interpolation: predictions (forecasts,
estimates) within the sample data range
Extrapolation: predictions (forecasts,
estimates) outside the sample data range

What is the predicted weekly sales when the
computer games store has 2000 customers who have
made a purchase? Discuss reliability here.
𝑊𝑒𝑒𝑘𝑙𝑦 ෣𝑠𝑎𝑙𝑒𝑠 = 98.248 + 0.110(2000) = $318,248
Reliable as 2000 customers is within the sample data
range and moderate r
2 of 0.581
What is the predicted weekly sales when the computer
games store has 3000 customers who have made a
purchase? Discuss reliability here.
𝑊𝑒𝑒𝑘𝑙𝑦 ෣𝑠𝑎𝑙𝑒𝑠 = 98.248 + 0.110(3000) = $428,248
NOT reliable as 3000 customers is OUTSIDE the sample
data range, making this estimate an EXTRAPOLATION.
This Photo by Unknown Author is
licensed under CC BY-SA

42
#4 Exercise
This Photo by Unknown Author is licensed
under
CC BY
0
1000
2000
0 5 10 15 20 25 30 35
Rent ($/wk)
Distance (km)
Scatterplot of rent against distance
A random sample of thirty rental properties was collected and
values for the following variables were recorded: weekly rent,
($) and distance from the city centre (km).
a) Interpret the coefficient of determination.
b) Predict the weekly rent for a property 10 km from the city centre?
Comment on the reliability of this estimate.
c) Predict the weekly rent for a property 40 km from the city centre.
Comment on the reliability of this estimate.

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.863
R Square 0.745
Adjusted R
Square
0.736
Standard Error 119.530
Observations 30
Coefficients
Intercept 820.148
Distance (km) -23.495

44
Supplementary Exercises
Students are advised that Supplementary Exercises to this topic may be found on the
subject portal under “Weekly materials”.
Solutions to the Supplementary Exercises may be available on the portal under “Weekly
materials “at the end of each week.
Time permitting, the lecturer may ask students to work through some of these exercises
in class.
Otherwise, it is expected that all students work through all Supplementary Exercises
outside of class time.

45
Extension
The following slides are an extension to this week’s topic.
The work covered in the extension:
o Is not covered in class by the lecturer.
o May be assessed.
46
Standard Error of the Estimate
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
The standard deviation of the variation of observations around the
regression line is estimated by the standard error of the estimate,
SXY
𝑆𝑋𝑌 =
𝑆𝑆𝐸
𝑛 – 2 =
σ
𝑖 𝑛=1(𝑌𝑖 – 𝑌෠𝑖2
𝑛 – 2
SSE = error sum of squares
n = sample size
Interpretation: This is the
typical error that occurs when
the least squares regression equation is used to estimate
the value of Y for a given value of X.
How can we use this in our
forecasts/predictions/estimates?
Incorporate the typical error to enhance our estimate:
𝑌෠ ± S
XY

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.762
R Square 0.581
Adjusted R Square 0.528
Standard Error 41.330
Observations 10

This Photo by Unknown Author is licensed under CC BY-SA
47

Copyright © 2013 Pearson Au
9781442549
stralia (a division of Pears
272/Berenson/Business S
on Australia Group Pty Ltd) –
tatistics /2e
This Photo
CC BY-SA
mputer ga
nhance the p redicted w eekly sales when the co
s tore has 200 customer who have made a purc hase.
nhance this stimate:
± S
XY = $31
,248 ± 41 .330 = ($31 8206.67, $31 8289.33)

 

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.762
R Square 0.581
Adjusted R Square 0.528
Standard Error 41.330
Observations 10
ANOVA
df SS MS F Significance F
Regression 1 18934.93478 18934.93 11.08476 0.010
Residual 8 13665.56522 1708.196
Total 9 32600.5
Coefficients Standard
Error
t Stat P-value Lower 95% Upper
95%
Intercept 98.248 58.033 1.693 0.129 -35.577 232.074
No. customers 0.110 0.033 3.329 0.010 0.034 0.186

by Unknown Author is licensed under
Example

E mes
0 s

Recall, 𝑊𝑒𝑒𝑘𝑙𝑦 ෣𝑠𝑎𝑙𝑒𝑠 = 98.248 + 0.110(2000) = $318,248
𝑌෠
48
Comparing Standard Errors
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) –
9781442549272/Berenson/Business Statistics /2e
Y Y
X X
S
XY is a measure of the variation of observed Y values from the regression line
The magnitude of SYX can be judged relative to the size of the Y values in the sample data.
For the computer games store example, SYX = $41.33K is moderately small
relative to weekly sales in the $200 – $300K range
Small S
YX Large SYX
49
Exercise
A random sample of forty employees in a large, multinational
company was collected and values for the following variables
were recorded: annual
wage ($000s), work experience
(years), absenteeism (days per year) and years of education.
Use EXCEL to create scatterplots and answer the following:
a) Describe the relationship between wage and education.
b) Describe the relationship between wage and absenteeism.
This Photo by Unknown Author is licensed under CC BY-NC-ND
This Photo by Unknown Author is licensed under CC BY-NC-ND
Open the EXCEL
file named
“STAM4000 Week 9
Excel.xls” and
use the
sheet named “Wage”.

50
Exercise solution
a) This Photo by Unknown Author is licensed under CC BY-SA
0
20
40
60
80
100
120
140
160
180
200
0 5 10 15 20 25
Wage ($000)
Education (years)
Scatterplot of wage against
education
This Photo by Unknown Author is licensed under CC BY-NCND
51
Exercise solution
This Photo by Unknown Author is licensed under CC BY-ND
0
50
100
150
200
0 5 10 15 20
Wage ($000)
Absenteeism (days)
Scatterplot of wage against
absenteeism
This Photo by Unknown Author is licensed under CC BY-NCb) ND
52
Example: using the correlation coefficient formula
𝑟 =
σ
𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ (𝑦𝑖 – 𝑦തሻ
σ
𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ 2 𝑦𝑖 – 𝑦ത 2
Technology can calculate this for us, but if you wanted to try using the formula.
Example:
You are interested in the relationship between the number of people in the sales
team and the sales generated in the chemical industry. The data below shows
gross sales, measured in millions of dollars, and the number of people in the
sales team. Calculate and interpret the coefficient of correlation.

53
Example solution
𝑥ҧ = 43.9 𝑦ത = 46
𝑟 =
σ
𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ (𝑦𝑖 – 𝑦തሻ
σ
𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ 2 𝑦𝑖 – 𝑦ത 2
r =
26 -22 45 -46 + …+(24 -22ሻ(47 -46ሻ
(26 -22ሻ
2(45 -46ሻ2+ …+ (24 -22ሻ2(47 -46ሻ2
r = 0.816
Based on correlation coefficient, it is concluded that there is a fairly strong
positive linear relationship between number of people in a sales team and the
sales generated.

54

A
C
E
Correlation r = 0
Correlation r = 0.5
Correlation r = 0.9
B
Correlation r = – 0.3
D
Correlation r = – 0.7
F
Correlation r = -0.99

Examples of scatterplots and correlation coefficients, r
This Photo by Unknown Author is licensed under
CC BY-NC
55
Formula for calculating the slope, 𝑏1
𝑏1 = 𝑆𝑆𝑋𝑌
𝑆𝑆𝑋
where
and
Formula for calculating the Y intercept, 𝑏𝑜
where and
Least Squares Estimation formulae
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics
/2e
1 1
1 1
( )( )
( )( )
n n
n n i i
i i
i i i i
i i
X Y
SSXY X X Y Y X Y
n
= =
= =
= – – = –
 
 
2
2 2 1
1 1
( )
( )
n
n n i
i
i i
i i
X
SSX X X X
n
=
= =
= – = –

 
𝑋ത = σ𝑖 𝑛=1 𝑋𝑖
𝑛
𝑌ത = σ
𝑖 𝑛=1 𝑌𝑖
𝑛
56
Illustration
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e

Weekly sales
($000)
Number of
Customers
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700

A manager of a local computer
games store wishes to:
Manually, find the regression
equation to estimate the
relationship between weekly sales
($000) and the number of
customers making purchases over
a 10-week period.
This Photo by Unknown
Author is licensed under
CC BY-SA
Using the formulas to
calculate b
0 and b1
manually – not assessed
57
Solution
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business
Statistics /2e

No. of
Customers
(
X)
Weekly
sales in
$1000s
(
Y)
1400 245 1960000 60025 343000
1600 312 2560000 97344 499200
1700 279 2890000 77841 474300
1875 308 3515625 94864 577500
1100 199 1210000 39601 218900
1550 219 2402500 47961 339450
2350 405 5522500 164025 951750
2450 324 6002500 104976 793800
1425 319 2030625 101761 454575
1700 255 2890000 65025 433500
Σ 17150 2865 30983750 853423 5085975

X 2 Y 2 XY
𝑏1 = 𝑆𝑆𝑋𝑌
𝑆𝑆𝑋
1 1
1
( )( )
n n
n i i
i i
i i
i
X Y
SSXY X Y
n
= =
=
= –
 

2
2 1
1
( )
n
n i
i
i
i
X
SSX X
n
=
=
= –


SSXY = 5085975 – 2865(17150ሻ
10
= 172500
SSX=30983750-
171502
10
= 1571500
𝒃𝟏 = 172500
1571500
= 0.10977
𝑌ത = 2865
10
= 286.5 and 𝑋ത = 17150
10
= 1715
b
0 = 286.5 – (0.10977)(1715) = 98.24
b Y b X 0 1 = –
Estimated Sales = 98.24 + 0.1098 Customers
This Photo by Unknown Author is licensed under CC BY-SA
58
Measuring the strength of the linear relationship: r2 values
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) –
9781442549272/Berenson/Business Statistics /2e
Y r2 = 1
X
Y
X
r2 = 1 Perfect linear relationship
between X and Y.
100% of the variation in Y is
explained by variation in X
r2 = 0
No linear relationship between X and Y
The value of Y does NOT depend on X.
(None of the variation in Y is explained by variation in X)
Y
X

59
The least squares method will produce a regression line whether there is/is not a
linear relationship between X and Y.
Consequently, it is important to assess how well the linear model fits the data.
Several methods are used to assess the model:
oUsing descriptive measurements such as the sum of squares for errors (SSE).
oTesting and/or estimating the regression model coefficients – see next class.
Copyright © 2013 Pearson Australia (a division of Pearson
Australia Group Pty Ltd) – 9781442549272/Berenson/Business
Statistics /2e
Assessing the model
60
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
Total Sum of Squares
𝑺𝑺𝑻 = ෍(𝒀𝒊 – 𝒀ഥሻ𝟐
Measures the
variation of the Y
i
values around their
mean
𝒀ഥ.
Total variation is made up of two parts
SST = SSR + SSE
Regression Sum of Squares
𝑺𝑺𝑹 = ෍(𝒀෡𝒊 – 𝒀ഥሻ𝟐
Explained variation
attributable to the
relationship between
X and Y.
Error Sum of Squares
𝑺𝑺𝑬 = ෍(𝒀𝒊 – 𝒀෡𝒊𝟐
Variation attributable
to factors other than
the relationship
between X and Y.

61
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
X
i X
Y
Y
i ෡𝒀
Visualization of total variation
𝑺𝑺𝑻 = ෍(𝒀𝒊 – 𝒀ഥሻ𝟐
𝒀ഥ 𝒀ഥ

𝑺𝑺𝑹 = ෍(𝒀෡𝒊 – 𝒀ഥሻ𝟐
𝑺𝑺𝑬 = ෍(𝒀𝒊 – 𝒀෡𝒊𝟐

𝒀෡