1
1
STAM4000
Quantitative Methods
Week 9
Correlation and simple linear
regression
https://www.google.com/search?q=causation+versus+correlation+comic+cartoon+funny&tbm=isch&ved=2ahUKEwjA1YH7vsfuAhXBn0sFHWGhDYwQ2-
cCegQIABAA&oq=causation+versus+correlation+comic+cartoon+funny&gs_lcp=CgNpbWcQAzoHCCMQ6gIQJzoECCMQJzoECAAQQzoICAAQsQMQgwE6BQgAELEDOgcIABCxAxBDOgoIABCxAxCDARBDOgIIADoECAAQGDoGCAAQChAYOgQIABAeUO2AAljA3AJg094C
aAJwAHgAgAGrAogB9kGSAQYwLjQ3LjKYAQCgAQGqAQtnd3Mtd2l6LWltZ7ABCsABAQ&sclient=img&ei=mVMXYIClHMG_rtoP4cK24Ag&bih=470&biw=1013&rlz=1C1CHBF_enAU841AU846&hl=en#imgrc=I3tF8lFRL1EZiM
2
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan
Business School pursuant to Part VB of the Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further
reproduction or communication of this material by you may be the subject of copyright
protection under the Act.
Do not remove this notice.
2
3 d #1 #2 #3 Examine the relationship between two quantitative variables Differentiate between correlation and causation Model with simple linear regression #4 Create and assess reliability of forecasts |
Week 9 Correlation an simple linear regression Learning Outcomes |
4
Why does this matter?
If there is an
association
between two
quantitative
variables, we can
model the
relationship to
predict future
values.
https://www.google.com/search?q=regression+cartoon&rlz=1C1CHBF_enAU841AU846&sxsrf=ALeKk02pME45VK0MPH4OfnjIO_5qRS3Jag:1611892844986&source=lnms&tbm=isch&sa=X&ved=2ahUKEwibuMPjoMDuAhV9zTgGHRK
EAxwQ_AUoAXoECBIQAw&biw=1024&bih=444#imgrc=kC7IUqSGWMulMM
5
#1 Examine the relationship
between two quantitative variables
https://www.google.com/search?q=regression+cartoon&rlz=1C1CHBF_enAU841AU846&sxsrf=ALeKk02pME45VK0MPH4OfnjIO_5qRS3Jag:1611892844986&source=lnms&tbm=isch&sa=X&ved=2ahUKEwibuMPjoMDuAhV9zTgGHRKEAxwQ_AUoAXoECBIQAw&b
iw=1024&bih=444#imgrc=HDY2nP8j3G4I5M
6
#1 Make a picture
Scatterplots are a tool for representing the relationship
between two quantitative variables.
X is the “independent” or “explanatory” or “predictor” variable
Y is the “dependent” or “response” variable
Before creating a scatterplot, it is best to decide which
variable is responding to the other.
Note: in business, we usually have positive numbers and deal
with the top right quadrant of the X, Y axes.
-10
-5 -5 |
5 X |
5 10 |
0
15
-10 0 10
Scatterplot of Y against X
5 0
10
15
0 2 4 6
Y (units)
X (units)
Scatterplot of of Y against
X
Y
7
#1
Here are two examples of scatterplots representing the
relationship between two quantitative variables.
Linear model Non-linear model
0
50
100
0 5 10
Exam score (%)
Cups of coffee before a test
Scatterplot of exam score against
cups of coffee before a test
0
10
20
30
0 50 100
Exam score (%)
Hours of study
Scatterplot of exam score against
hours of study
This Photo by Unknown Author is licensed under CC BY-NC-ND
Illustration
8
#1
This Photo by Unknown Author is
licensed under CC BY
Example
A random sample of thirty rental properties was collected and values for the following
variables were recorded: weekly rent, ($/wk), distance from the city centre (km), number of
bedrooms, number of bathrooms and age of the property (year).
EXCEL was used to create the following scatterplots.
How could we describe the relationship between rent and distance?
How could we describe the relationship between weekly rent and number of bedrooms?
0
200
400
600
800
1000
1200
1400
0 10 20 30 40
Rent ($/wk)
Distance (km)
Scatterplot of rent against distance
0
200
400
600
800
1000
1200
1400
0 1 2 3 4 5
Rent ($/wk)
Bedrooms
Scatterplot of rent against bedrooms
9
#1 Example
10
#2 Differentiate between correlation and causation
This Photo by Unknown Author is licensed under CC BY-SA
11
Correlation ≠ causation
12
#2 The correlation coefficient
𝑟 =
σ𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ (𝑦𝑖 – 𝑦തሻ
σ𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ 2 𝑦𝑖 – 𝑦ത 2
13
STRONG: tightly
clustered points
WEAK: loosely
clustered points
Negative
relationships
Y
X
Y
X
Y Y
X X
Positive
relationships
#2 Direction: + or -? Strength: strong or weak?
14
#2 Summary table of direction and size of r
-1 | -0.8 | -0.6 | -0.4 | -0.2 | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 |
negative | positive |
very strong strong moderate weak |
very strong weak moderate strong |
15
15
#2 How do we use EXCEL to find correlation coefficients?
16
This Photo by Unknown Author is licensed under CC BY
16
#2 Exercise
A random sample of thirty rental properties was collected and values for
the following variables were recorded: weekly rent, ($/wk), distance from
the city centre (km), number of bedrooms, number of bathrooms and age
of the property (year). Using the EXCEL file named “STAM Week 9 Excel
file.xls” and the sheet named “Rent”, create a correlation table of these
variables.
Use the correlation table to answer the following:
a) Describe the relationship between rent and distance?
b) Describe the relationship between weekly rent and number of bedrooms?
18
#3 Model with simple linear regression
https://www.google.com/search?q=regression%20analysis%20cartoon&tbm=isch&hl=en&rlz=1C1CHBF_enAU841AU846&sa=X&ved=0CMoBEKzcAigAahcKEwiIuZO7tMPuAhUAAAAAHQAAAAAQAg&biw=1013&bih=433#imgrc=mWMAV2iiX–veM
19
Deterministic Model: an equation or set of
equations that allow us to fully determine
the value of the dependent variable from the
values of the independent variable(s).
Probabilistic Model: a method used to
capture the randomness that is part of a reallife process.
E.g.: Do all houses of the same size, sell for
the same price?
Copyright © 2013 Pearson Australia (a division of Pearson
Australia Group Pty Ltd) – 9781442549272/Berenson/Business
Statistics /2e
#3 Two basic types of regression models
https://www.google.com/search?q=house+for+sale+comic&rlz=1C1CHBF_enAU841AU846&sxsrf=ALeKk022NxOgtvF_3V91LvfSbS_JsBzlPw
:1612158142201&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjYy4yL_cfuAhVwIbcAHR8QAwAQ_AUoAXoECAMQAw&biw=1024&bih=44
4#imgrc=N9jlWL7JyumnoM
20
A model of the relationship between house size (independent
variable) and house price (dependent variable) would be:
Y axis intercept is 200,000
Equation for this deterministic model: House price = 200 000 + 800(Size)
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
#3 Illustration of a model
X, Size (sqm)
Y
House
price
($)
In this model,
the price of
the house is
completely
determined by
the size.
21
x
200,000
Equation for this probabilistic model:
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) 9781442549272/Berenson/Business Statistics /2e – House price = 200,000 + 800 Size +
#3 Illustration of continued
X, Size (sqm)
Y
House
price
($)
In real-life, the
price of a house
will be
influenced by
many variables.
E.g.: location,
type of building
materials,
quality of
fittings etc.
Realistic scatterplot of
house price against size
Lower
variability
versus
higher
variability
The probabilistic model
includes the “error” term
22
We now represent the price of a house as a function of
its size in this probabilistic model:
House price = 200,000 + 800 House size +
where (Greek letter read as “epsilon”):
• is the random term in the population
• is also called the error variable or the residual
• It is the difference between the actual (or observed) selling price and the
estimated (or predicted) selling price of the same sized house.
• varies from house sale to house sale, for the same value of x, the size of the
house, due to other factors influencing the price.
• In the sample equation, the random term is represented by e
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
#3 Random term, 𝜺
23
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
Random error for this X
i value
= RESIDUAL
= 𝑌
𝑖 – 𝑌𝑖
Y
X
Observed value of Y for X
i
𝑌 = Predicted value of Y for X
i
X
i
Slope
= β1
=
𝑟𝑖𝑠𝑒
Intercept = β0 𝑟𝑢𝑛
εi |
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖
#3 Simple linear regression model (SLR)
24
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty
Ltd) – 9781442549272/Berenson/Business Statistics /2e
Yi = β0 + β1Xi + εi
Linear
component
Population
Y intercept
Population
slope
coefficient
Dependent
variable
Independent
variable
Random error
component
•Only one independent variable, X
•Relationship between X and Y is described by a linear function
•Changes in Y are associated with changes in X
#3 Simple linear regression model (SLR)
25
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) –
9781442549272/Berenson/Business Statistics /2e
The simple linear regression equation (derived from a sample) provides an
estimate of the population regression line.
Estimate of the
regression
intercept Estimate of the regression slope
Estimated (or predicted)
Y value for observation i
Value of X for observation i
The individual random error terms e
i have a mean of zero
𝒀𝒊 = 𝒃𝟎 + 𝒃𝟏 𝑿𝒊
#3 Simple linear regression equation (prediction line)
26
The Process of estimating regression coefficients: Least
Squares Method
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
b
0 is the estimated average value of Y when the value of X is zero
b
1 is the estimated change in the average value of Y as a result of a one-unit
change in X
Most statistical software packages will find the values of b0 and b1 that minimise
the sum of the squared differences between actual values (Y) and predicted
values ( 𝑌 ), i.e.
𝒎𝒊𝒏 𝒀𝒊 – 𝒀𝒊 𝟐 = 𝒎𝒊𝒏 𝒀𝒊 – (𝒃𝟎 + 𝒃𝟏 𝑿𝒊ሻ 𝟐
#3
27
Y
X
#3 Least squares method
The question is:
Which straight line
fits best?
The least squares
line minimises the
sum of the squared
differences between
the points and
the line.
28
Formula for calculating the slope,
where,
and
Formula for calculating the Y intercept,
where | and | 𝑋ത = | 𝑌ത = | σ𝑖 𝑛=1 𝑌𝑖 𝑛 |
Least Squares Estimation formulae
1 1
1 1
( )( )
( )( )
n n
n n i i
i i
i i i i
i i
X Y
SSXY X X Y Y X Y
n
= =
= =
= – – = –
2
2 2 1
1 1
( )
( )
n
n n i
i
i i
i i
X
SSX X X X
n
=
= =
= – = –
σ𝑖 𝑛=1 𝑋𝑖
𝑛
#3
𝑏
𝑜 = 𝑌ത – 𝑏1 𝑋ത
𝑏1 = 𝑆𝑆𝑋𝑌
𝑆𝑆𝑋
These manual
calculations are
tedious and will
NOT be assessed.
It is more efficient to
use a software
package, such as
EXCEL, to find the
values of b
0 and b1.
You WILL BE
ASSESSED on how
you use EXCEL
regression output
29
29
How do we use EXCEL to create a simple linear
regression?
#3
In the menu bar:
Click on “Data”,
then select
“Data Analysis”
30
#3 Example
The manager of a computer games store wishes to:
• Examine the relationship between weekly sales ($000) and the number
of customers making purchases.
• Use the results of that examination to predict future weekly sales.
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e |
Weekly sales ($000) |
No. customers |
245 | 1400 |
312 | 1600 |
279 | 1700 |
308 | 1875 |
199 | 1100 |
219 | 1550 |
405 | 2350 |
324 | 2450 |
319 | 1425 |
255 | 1700 |
0
200
400
600
0 500 1000 1500 2000 2500 3000
Weekly sales ($000)
Number of customers
Scatterplot of Weekly sales versus
number of customers
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) –
9781442549272/Berenson/Business Statistics /2e
This Photo by Unknown Author is licensed under CC BY-SA
31
SUMMARY OUTPUT | ||||||
Regression Statistics | ||||||
Multiple R | 0.762 | |||||
R Square | 0.581 | |||||
Adjusted R Square | 0.528 | |||||
Standard Error | 41.330 | |||||
Observations | 10 | |||||
ANOVA | ||||||
df | SS | MS | F | Significance F | ||
Regression | 1 | 18934.93478 | 18934.93 | 11.08476 | 0.010 | |
Residual | 8 | 13665.56522 | 1708.196 | |||
Total | 9 | 32600.5 | ||||
Coefficients | Standard Error |
t Stat | P-value | Lower 95% | Upper 95% |
|
Intercept | 98.248 | 58.033 | 1.693 | 0.129 | -35.577 | 232.074 |
No. customers | 0.110 | 0.033 | 3.329 | 0.010 | 0.034 | 0.186 |
#3 Example Excel regression output:
= | Correlation | coefficien | t, r. EXCE | L “forgets” t | he sign o |
Check the si | gn of the | slope,b, f | or the sign | f r. | |
= | sample size | , n | |||
Name of X variable | |||||
b 1 = |
|||||
0 |
1b
=
This Photo by Unknown
Author is licensed under
CC BY-SA
32
#3 Example
a) How do we write the regression model/equation?
Estimated weekly sales = 98.248 + 0.110 Number of customers
b) How do we interpret the correlation coefficient, r? r = 0.762
There is a strong, positive, linear relationship between weekly customer sales and the number of customers
making purchases.
c) How do we interpret the intercept? b0 = 98.248
We estimate, that for zero customers, this game store makes 98.248 ($000s) or $98,248 per week in sales,
on average. This makes no sense. However, the intercept simply indicates that over the sample size
selected, the portion of weekly sales not explained by number of customers is estimated to be $98,248.30,
on average. Also note that X = 0 is outside the range of observed values, and this makes the intercept
estimate an extrapolation (see later).
d) How do we interpret the slope? b1 = 0.110
We estimate, for each one extra paying customer in this games store, weekly sales increase by an average
0.110 ($000) or by $110.
This Photo by Unknown
Author is licensed under
CC BY-SA
33
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
Intercept
= 98.248
#3 Example
0
50
100
150
200
250
300
350
400
450
0 500 1000 1500 2000 2500 3000
Weekly sales ($000)
Number of customers
Scatterplot and predicted line for Weekly sales versus number of customers
Weekly sales ($000)
Predicted Weekly sales
($000)
Linear (Weekly sales
($000))
𝑊𝑒𝑒𝑘𝑙𝑦 𝑠𝑎𝑙𝑒𝑠 = 98.248 + 0.110 Number of customers
This Photo by Unknown Author is licensed
under CC BY-SA
34
#3 Exercise
A random sample of thirty rental properties was collected and values for the following
variables were recorded: weekly rent, ($) and distance from the city centre (km).
a) Use Excel and the file named STAM4000 Week 9 Excel file.xls to create a simple linear
regression between these two variables. The Excel output is provided in the next slide.
b) Write the regression equation.
c) Recall, earlier we created a correlation table for this data set. For the variables of rent and
distance, is the correlation coefficient in the regression output the same as in the earlier
correlation table? Briefly explain.
d) Interpret the intercept?
e) Interpret the slope?
This Photo by Unknown Author is licensed under CC BY
35
SUMMARY OUTPUT | ||||||
Regression Statistics | ||||||
Multiple R | 0.863 | |||||
R Square | 0.745 | |||||
Adjusted R Square | 0.736 | |||||
Standard Error | 119.530 | |||||
Observations | 30 | |||||
Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | |
Intercept | 820.148 | 40.124 | 20.441 | 2.289E-18 | 737.958 | 902.338 |
Distance (km) | -23.495 | 2.597 | -9.048 | 8.348E-10 | -28.814 | -18.176 |
#3 Exercise solution
= Correlation | coefficient | , r. EXCEL | forgets” the s | ign of r. |
Check the s | gn of the s | lope,b1, fo | r the sign of r | . |
= sample size | , n | |||
Name of X vari | able | |||
b 0 |
= | |||
b 1 |
= |
a) This Photo by Unknown Author is licensed under CC BY
37
#4 Create and assess reliability of forecasts
https://www.google.com/search?q=statisyics+estimation+cartoon&tbm=isch&ved=2ahUKEwif5KnRwcPuAhVFRSsKHdHPA2oQ2-cCegQIABAA&oq=statisyics+estimation+cartoon&gs_lcp=CgNpbWcQA1CWkiFY8qshYJmuIWgAcAB4AIABgGIAbwQkgEFMC44LjOYAQCgAQGqAQtnd3Mtd2l6LWltZ8ABAQ&sclient=img&ei=iD0VYJ-AGMWKrQHRn4_QBg&bih=444&biw=1024&rlz=1C1CHBF_enAU841AU846#imgrc=H1tby55I92IsuM
38
Create and assess reliability of forecasts
We can use the SLR model to create forecasts/predictions/estimates of the Y variable for a
given value of X – be careful of units and reliability.
Reliability of forecasts/predictions/estimates?
Depends on two factors:
i) Coefficient of determination, r2 :
• High r2 value (closer to 1), may suggest high reliability of the forecast.
• Low r2 value (closer to 0), may suggest low reliability of the forecast.
ii) Value of X:
• If the value of X is inside the sample data range, the estimate is called an
interpolation and may be reliable if the r2 value is reasonably high.
• If the value of X is outside the sample data range, the estimate is called
an extrapolation and is NOT reliable, irrespective of the r2 value.
#4
39
39
Coefficient of determination, r2
How good is the “fit” of the regression model with the dataset?
To answer this, we can use the coefficient of determination, r2 .
• r2 is the proportion (or percentage) of the total variation in the Y variable
explained by the regression with the X variable.
• 0 ≤ r2 ≤ 1, where, in general:
o The closer r2 is to 1, the better the fit of the linear regression
model.
o If 0.4 < r2 < 0.7, there is a relatively moderate fit
o If r2 ≤ 0.4, there is a relatively weak fit
o We will use EXCEL output instead of this formula.
This Photo by Unknown Author is licensed under CC BY-NC-ND
In EXCEL, the
coefficient of
determination
is called
R square.
2 regression sum of squares
total sum of squares
SSR
r
SST
= =
#4
40
SUMMARY OUTPUT | |
Regression Statistics |
|
Multiple R | 0.762 |
R Square | 0.581 |
Adjusted R Square | 0.528 |
Standard Error | 41.330 |
Observations | 10 |
Example
How do we interpret the coefficient of
determination coefficient for the computer
games store example?
Coefficient of determination = r2 = 0.581
Interpretation:
0.581 x 100% = 58.1%.
We can say that 58.1% of the variation in weekly sales is explained by the variation in number
of customers making a purchase. As this is just above 50%, we can say that this regression
model is a moderate fit for the dataset.
Note: This also tells us that 41.9% of the variation in weekly sales is explained by other
variables, e.g. quality of computer games sold, demographic of area where store is located,
vicinity of nearest competitor etc.
This Photo by Unknown Author is licensed under CC BY-SA
This Photo by Unknown Author is licensed under CC BY-NC-ND
#4
41
0
200
400
600
0 1000 2000 3000
Weekly sales ($000)
Number of customers
Scatterplot of Weekly sales versus
number of customers
0
200
400
600
0 1000 2000 3000
Weekly sales ($000)
Number of customers
Scatterplot of Weekly sales versus number
of customers
#4 Extrapolation and interpolation
Interpolation: predictions (forecasts, estimates) within the sample data range |
Extrapolation: predictions (forecasts, estimates) outside the sample data range |
What is the predicted weekly sales when the
computer games store has 2000 customers who have
made a purchase? Discuss reliability here.
𝑊𝑒𝑒𝑘𝑙𝑦 𝑠𝑎𝑙𝑒𝑠 = 98.248 + 0.110(2000) = $318,248
Reliable as 2000 customers is within the sample data
range and moderate r2 of 0.581
What is the predicted weekly sales when the computer
games store has 3000 customers who have made a
purchase? Discuss reliability here.
𝑊𝑒𝑒𝑘𝑙𝑦 𝑠𝑎𝑙𝑒𝑠 = 98.248 + 0.110(3000) = $428,248
NOT reliable as 3000 customers is OUTSIDE the sample
data range, making this estimate an EXTRAPOLATION.
This Photo by Unknown Author is
licensed under CC BY-SA
42
#4 Exercise
This Photo by Unknown Author is licensed
under CC BY
0
1000
2000
0 5 10 15 20 25 30 35
Rent ($/wk)
Distance (km)
Scatterplot of rent against distance
A random sample of thirty rental properties was collected and
values for the following variables were recorded: weekly rent,
($) and distance from the city centre (km).
a) Interpret the coefficient of determination.
b) Predict the weekly rent for a property 10 km from the city centre?
Comment on the reliability of this estimate.
c) Predict the weekly rent for a property 40 km from the city centre.
Comment on the reliability of this estimate.
SUMMARY OUTPUT | |
Regression Statistics | |
Multiple R | 0.863 |
R Square | 0.745 |
Adjusted R Square |
0.736 |
Standard Error | 119.530 |
Observations | 30 |
Coefficients | |
Intercept | 820.148 |
Distance (km) | -23.495 |
44
Supplementary Exercises
• Students are advised that Supplementary Exercises to this topic may be found on the
subject portal under “Weekly materials”.
• Solutions to the Supplementary Exercises may be available on the portal under “Weekly
materials “at the end of each week.
• Time permitting, the lecturer may ask students to work through some of these exercises
in class.
• Otherwise, it is expected that all students work through all Supplementary Exercises
outside of class time.
45
Extension
• The following slides are an extension to this week’s topic.
• The work covered in the extension:
o Is not covered in class by the lecturer.
o May be assessed.
46
Standard Error of the Estimate
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
The standard deviation of the variation of observations around the
regression line is estimated by the standard error of the estimate, SXY
𝑆𝑋𝑌 =
𝑆𝑆𝐸
𝑛 – 2 =
σ𝑖 𝑛=1(𝑌𝑖 – 𝑌𝑖ሻ2
𝑛 – 2
SSE = error sum of squares
n = sample size
Interpretation: This is the typical error that occurs when
the least squares regression equation is used to estimate
the value of Y for a given value of X.
How can we use this in our
forecasts/predictions/estimates?
Incorporate the typical error to enhance our estimate:
𝑌 ± S
XY
SUMMARY OUTPUT | |
Regression Statistics | |
Multiple R | 0.762 |
R Square | 0.581 |
Adjusted R Square | 0.528 |
Standard Error | 41.330 |
Observations | 10 |
This Photo by Unknown Author is licensed under CC BY-SA
47
Copyright © 2013 Pearson Au 9781442549 |
stralia (a division of Pears 272/Berenson/Business S |
on Australia Group Pty Ltd) – tatistics /2e |
|||
This Photo CC BY-SA mputer ga |
nhance the p | redicted w | eekly sales | when the co | |
s | tore has 200 | customer | who have | made a purc | hase. |
nhance this | stimate: | ||||
± S XY = $31 |
,248 ± 41 | .330 = ($31 | 8206.67, $31 | 8289.33) |
SUMMARY OUTPUT | ||||||
Regression Statistics | ||||||
Multiple R | 0.762 | |||||
R Square | 0.581 | |||||
Adjusted R Square | 0.528 | |||||
Standard Error | 41.330 | |||||
Observations | 10 | |||||
ANOVA | ||||||
df | SS | MS | F | Significance F | ||
Regression | 1 | 18934.93478 | 18934.93 | 11.08476 | 0.010 | |
Residual | 8 | 13665.56522 | 1708.196 | |||
Total | 9 | 32600.5 | ||||
Coefficients | Standard Error |
t Stat | P-value | Lower 95% | Upper 95% |
|
Intercept | 98.248 | 58.033 | 1.693 | 0.129 | -35.577 | 232.074 |
No. customers | 0.110 | 0.033 | 3.329 | 0.010 | 0.034 | 0.186 |
by Unknown Author is licensed under
Example
E | mes |
0 | s |
Recall, 𝑊𝑒𝑒𝑘𝑙𝑦 𝑠𝑎𝑙𝑒𝑠 = 98.248 + 0.110(2000) = $318,248
𝑌
48
Comparing Standard Errors
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) –
9781442549272/Berenson/Business Statistics /2e
Y Y
X X
S
XY is a measure of the variation of observed Y values from the regression line
The magnitude of SYX can be judged relative to the size of the Y values in the sample data.
For the computer games store example, SYX = $41.33K is moderately small
relative to weekly sales in the $200 – $300K range
Small SYX Large SYX
49
Exercise
A random sample of forty employees in a large, multinational
company was collected and values for the following variables
were recorded: annual wage ($000s), work experience
(years), absenteeism (days per year) and years of education.
Use EXCEL to create scatterplots and answer the following:
a) Describe the relationship between wage and education.
b) Describe the relationship between wage and absenteeism.
This Photo by Unknown Author is licensed under CC BY-NC-ND
This Photo by Unknown Author is licensed under CC BY-NC-ND
Open the EXCEL
file named
“STAM4000 Week 9
Excel.xls” and use the
sheet named “Wage”.
50
Exercise solution
a) This Photo by Unknown Author is licensed under CC BY-SA
0
20
40
60
80
100
120
140
160
180
200
0 5 10 15 20 25
Wage ($000)
Education (years)
Scatterplot of wage against
education
This Photo by Unknown Author is licensed under CC BY-NCND
51
Exercise solution
This Photo by Unknown Author is licensed under CC BY-ND
0
50
100
150
200
0 5 10 15 20
Wage ($000)
Absenteeism (days)
Scatterplot of wage against
absenteeism
This Photo by Unknown Author is licensed under CC BY-NCb) ND
52
Example: using the correlation coefficient formula
𝑟 =
σ𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ (𝑦𝑖 – 𝑦തሻ
σ𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ 2 𝑦𝑖 – 𝑦ത 2
Technology can calculate this for us, but if you wanted to try using the formula.
Example:
You are interested in the relationship between the number of people in the sales
team and the sales generated in the chemical industry. The data below shows
gross sales, measured in millions of dollars, and the number of people in the
sales team. Calculate and interpret the coefficient of correlation.
53
Example solution
𝑥ҧ = 43.9 𝑦ത = 46
𝑟 =
σ𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ (𝑦𝑖 – 𝑦തሻ
σ𝑖 𝑛=1 𝑥𝑖 – 𝑥ҧ 2 𝑦𝑖 – 𝑦ത 2
r =
26 -22 45 -46 + …+(24 -22ሻ(47 -46ሻ
(26 -22ሻ2(45 -46ሻ2+ …+ (24 -22ሻ2(47 -46ሻ2
r = 0.816
Based on correlation coefficient, it is concluded that there is a fairly strong
positive linear relationship between number of people in a sales team and the
sales generated.
54
A C E Correlation r = 0 Correlation r = 0.5 Correlation r = 0.9 |
B Correlation r = – 0.3 |
D Correlation r = – 0.7 |
|
F Correlation r = -0.99 |
Examples of scatterplots and correlation coefficients, r
This Photo by Unknown Author is licensed under
CC BY-NC
55
Formula for calculating the slope, 𝑏1
𝑏1 = 𝑆𝑆𝑋𝑌
𝑆𝑆𝑋
where
and
Formula for calculating the Y intercept, 𝑏𝑜
where and
Least Squares Estimation formulae
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics
/2e
1 1
1 1
( )( )
( )( )
n n
n n i i
i i
i i i i
i i
X Y
SSXY X X Y Y X Y
n
= =
= =
= – – = –
2
2 2 1
1 1
( )
( )
n
n n i
i
i i
i i
X
SSX X X X
n
=
= =
= – = –
𝑋ത = σ𝑖 𝑛=1 𝑋𝑖
𝑛
𝑌ത = σ𝑖 𝑛=1 𝑌𝑖
𝑛
56
Illustration
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
Weekly sales ($000) |
Number of Customers |
245 | 1400 |
312 | 1600 |
279 | 1700 |
308 | 1875 |
199 | 1100 |
219 | 1550 |
405 | 2350 |
324 | 2450 |
319 | 1425 |
255 | 1700 |
A manager of a local computer
games store wishes to:
• Manually, find the regression
equation to estimate the
relationship between weekly sales
($000) and the number of
customers making purchases over
a 10-week period.
This Photo by Unknown
Author is licensed under
CC BY-SA
Using the formulas to
calculate b
0 and b1
manually – not assessed
57
Solution
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business
Statistics /2e
No. of Customers (X) |
Weekly sales in $1000s (Y) |
|||
1400 | 245 | 1960000 | 60025 | 343000 |
1600 | 312 | 2560000 | 97344 | 499200 |
1700 | 279 | 2890000 | 77841 | 474300 |
1875 | 308 | 3515625 | 94864 | 577500 |
1100 | 199 | 1210000 | 39601 | 218900 |
1550 | 219 | 2402500 | 47961 | 339450 |
2350 | 405 | 5522500 | 164025 | 951750 |
2450 | 324 | 6002500 | 104976 | 793800 |
1425 | 319 | 2030625 | 101761 | 454575 |
1700 | 255 | 2890000 | 65025 | 433500 |
Σ 17150 | 2865 | 30983750 | 853423 | 5085975 |
X 2 Y 2 XY
𝑏1 = 𝑆𝑆𝑋𝑌
𝑆𝑆𝑋 1 1
1
( )( )
n n
n i i
i i
i i
i
X Y
SSXY X Y
n
= =
=
= –
2
2 1
1
( )
n
n i
i
i
i
X
SSX X
n
=
=
= –
SSXY = 5085975 – 2865(17150ሻ
10
= 172500
SSX=30983750- 171502
10
= 1571500
𝒃𝟏 = 172500
1571500
= 0.10977
𝑌ത = 2865
10
= 286.5 and 𝑋ത = 17150
10
= 1715
b
0 = 286.5 – (0.10977)(1715) = 98.24
b Y b X 0 1 = –
Estimated Sales = 98.24 + 0.1098 Customers
This Photo by Unknown Author is licensed under CC BY-SA
58
Measuring the strength of the linear relationship: r2 values
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) –
9781442549272/Berenson/Business Statistics /2e
Y r2 = 1
X
Y
X
r2 = 1 Perfect linear relationship
between X and Y.
100% of the variation in Y is
explained by variation in X
r2 = 0
No linear relationship between X and Y
The value of Y does NOT depend on X.
(None of the variation in Y is explained by variation in X)
Y
X
59
• The least squares method will produce a regression line whether there is/is not a
linear relationship between X and Y.
• Consequently, it is important to assess how well the linear model fits the data.
• Several methods are used to assess the model:
oUsing descriptive measurements such as the sum of squares for errors (SSE).
oTesting and/or estimating the regression model coefficients – see next class.
Copyright © 2013 Pearson Australia (a division of Pearson
Australia Group Pty Ltd) – 9781442549272/Berenson/Business
Statistics /2e
Assessing the model
60
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
Total Sum of Squares
𝑺𝑺𝑻 = (𝒀𝒊 – 𝒀ഥሻ𝟐
Measures the
variation of the Y
i
values around their
mean 𝒀ഥ.
Total variation is made up of two parts
SST = SSR + SSE
Regression Sum of Squares
𝑺𝑺𝑹 = (𝒀𝒊 – 𝒀ഥሻ𝟐
Explained variation
attributable to the
relationship between
X and Y.
Error Sum of Squares
𝑺𝑺𝑬 = (𝒀𝒊 – 𝒀𝒊ሻ𝟐
Variation attributable
to factors other than
the relationship
between X and Y.
61
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e
X
i X
Y
Y i 𝒀
Visualization of total variation
𝑺𝑺𝑻 = (𝒀𝒊 – 𝒀ഥሻ𝟐
𝒀ഥ 𝒀ഥ
𝑺𝑺𝑹 = (𝒀𝒊 – 𝒀ഥሻ𝟐 𝑺𝑺𝑬 = (𝒀𝒊 – 𝒀𝒊ሻ𝟐 |
𝒀