STAM4000 Quantitative Methods
Week 10: Multiple linear regression and inference
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan
Business School pursuant to Part VB of the
Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further
reproduction or communication of this material by you may be the subject of copyright
protection under the Act.
Do not remove this notice.

Week 10: Multiple linear regression and inference
Learning Outcomes
#1 Assumptions in linear regression
#2 Multiple linear regression
#3 Inference in regression

Why does this matter?
We can create linear models with more than one X variable.
We can also estimate and test regression statistics.

#1 Assumptions in linear regression
Use the acronym LINE:
Linearity: the underlying relationship between X and Y is linear
Independence of errors: error values are statistically independent
Normality of errors: error values (ε) are normally distributed for any given value of X
Equal variance (homoscedasticity): the probability distribution of the errors has constant variance
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group
Pty Ltd) – 9781442549272/Berenson/Business Statistics /2e

#1 Residual analysis
The residual is the difference between an observed data value and the predicted data value from the fitted line (the fitted value).
Error = observed Y – expected value of Y from model
Residual = observed Y – fitted value = Y – Ŷ
The residual is the estimate of the error.
Zero residual: the fitted value equals the observed value
Positive residual: the fitted value is less than the observed value
Negative residual: the fitted value is greater than the observed value
#1 More on residual analysis
A consequence of the least squares fitting algorithm is that the sum of the residuals, Σ(Y – Ŷ) = 0, and hence their mean is 0.
The variance of the errors is estimated by s² = Σ(Y – Ŷ)² / (n – k – 1)
The estimated standard deviation of the errors is s = √[ Σ(Y – Ŷ)² / (n – k – 1) ]
The estimated standard deviation of the errors is referred to as the Standard Error in Excel output.
The Standard Error of the estimate is the typical error that occurs when the least squares regression equation is used to estimate the value of Y for given values of the X variables.
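As a quick sketch of these formulas (assuming Python, and using the weekly sales data from the example later in this topic): the residuals from a least squares fit sum to zero, and the Standard Error divides the sum of squared residuals by n – k – 1.

```python
from math import sqrt

# Weekly sales example from this topic: Y = weekly sales ($000), X = number of customers
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n, k = len(x), 1  # k = number of X variables (here, simple linear regression)

# Least squares slope and intercept
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]                 # Y-hat values
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # Y minus Y-hat

print(round(fitted[0], 2))            # 251.92, matching the Excel residual output
print(round(abs(sum(residuals)), 6))  # 0.0 — the residuals sum to zero
s = sqrt(sum(e ** 2 for e in residuals) / (n - k - 1))  # Standard Error of the estimate
print(round(s, 2))
```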

#1 Residual analysis
Excel calculates residuals as part of its regression analysis, for example:
We can use these residuals to determine whether the error variable is non-normal, whether the error variance is constant, and whether the errors are independent…

RESIDUAL OUTPUT
Observation   Predicted Weekly Sales   Residuals (= Observed Y – Predicted Y)
1             251.92316                -6.923162
2             273.87671                38.12329
3             284.85348                -5.853484
4             304.06284                3.937162

#1 Residual analysis for linearity
[Plots of Y versus X and residuals versus X, for a non-linear case and a linear case: a pattern in the residuals indicates non-linearity; a random scatter indicates linearity.]

#1 Residual analysis for independence
[Plots of residuals versus X for a non-independent case and an independent case: a pattern in the residuals indicates dependence; a random scatter indicates independence.]

#1 Residual analysis for normality
A normal probability plot of the residuals can be used to check for normality.
[Normal probability plot: percent versus residual, from -3 to 3; an approximately straight line indicates normality.]
#1 Residual analysis for equal variance (homoscedasticity)
[Plots of Y versus X and residuals versus X for a non-constant variance case and a constant variance case: an even spread of residuals indicates constant variance.]
#1 Example
The manager of a computer games store wishes to examine the residuals between weekly sales ($000) and the number of customers making purchases.

Weekly sales ($000)   No. customers
245                   1400
312                   1600
279                   1700
308                   1875
199                   1100
219                   1550
405                   2350
324                   2450
319                   1425
255                   1700

[Scatterplot of weekly sales ($000) versus number of customers.]

#1 Example
The data do not appear to violate any regression assumptions as:
i) the residual plot is well scattered, so linearity, independence and homoscedasticity are satisfied;
ii) the normal probability plot follows an upward-sloping diagonal line.

[Weekly sales residual plot: residuals (observed weekly sales – predicted weekly sales) versus number of customers.]

Number of customers   Weekly sales ($000)   Predicted weekly sales ($000)   Residuals
1400                  245                   251.92                          -6.92
1600                  312                   273.88                          38.12
1700                  279                   284.85                          -5.85
1875                  308                   304.06                          3.94
1100                  199                   218.99                          -19.99
1550                  219                   268.39                          -49.39
2350                  405                   356.20                          48.80
2450                  324                   367.18                          -43.18
1425                  319                   254.67                          64.33
1700                  255                   284.85                          -29.85

[Normal probability plot: weekly sales ($000) versus sample percentile.]
#2 Multiple linear regression
#2 Model with multiple linear regression (MLR)
The population multiple regression equation with k independent variables:
yi = β0 + β1 x1i + β2 x2i + ⋯ + βk xki + εi
where β0 is the Y-intercept, β1, …, βk are the population slopes (population coefficients) and εi is the random error.
The sample (fitted) multiple regression model (equation) with k independent variables:
ŷ = b0 + b1 x1 + b2 x2 + ⋯ + bk xk
where b0 is the estimated Y-intercept and b1, …, bk are the estimated slopes (coefficients).
Note: with b1, b2 etc. we may, instead, use the names of the X variables in the subscripts to the coefficients.
#2 Interpretation of coefficients or slopes in multiple linear regression (MLR)
In MLR, we have multiple X variables:
• There is a separate coefficient or slope for each X variable.
In MLR, to interpret the coefficient or slope of an X variable, we try to describe the effect of one X variable on the Y variable whilst:
o holding all other X variables constant, or
o after allowing for the effect of all other X variables.
#2 Multiple linear regression: correlation coefficients and two coefficients of determination
In MLR, we can use EXCEL to create a correlation table of correlation coefficients, r:
• This table summarises the direction and strength of a linear relationship between the Y variable separately with each individual X variable.
In MLR EXCEL regression output, r (sample correlation coefficient notation: r, R, Multiple R):
• This value of r measures the correlation between observations and fitted values.
In MLR, we will have two coefficients of determination:
• r²: in MLR, as in SLR, we can multiply this by 100% to get the percentage of variation in the dependent variable, Y, explained by the variation in the independent variables, Xi.
• Adjusted r²: amended to take into consideration the number of independent variables, Xi.
#2 Adjusted r²
Recall, the coefficient of determination, r², measures the proportion of variation in the Y variable explained by the regression on the X variable.
However, with multiple linear regression we now have more than one X variable – this affects r².
r² never decreases when a new X variable is added to the model – this can be a disadvantage when comparing models.
So, in MLR, we can also use the Adjusted r², which:
• shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used
• penalises excessive use of unimportant explanatory variables
• will be smaller than r²
• is useful for comparing models.
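As a sketch (assuming Python, and taking the SS values from the wage example's ANOVA output later in this topic), Adjusted r² can be computed from the sums of squares, which shows why it is always smaller than r²:

```python
# ANOVA values from the wage regression output later in this topic
sse, sst = 26566.239, 62619.375   # residual (error) and total sums of squares
n, k = 40, 3                      # observations and number of X variables

r2 = 1 - sse / sst                                   # R Square
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))   # Adjusted R Square

print(round(r2, 3))       # 0.576, matching the Excel output
print(round(adj_r2, 3))   # 0.540 — smaller than r², as expected
```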
#2 Example
A random sample of forty employees in a large, multinational company was collected and values for the following variables were recorded: annual wage ($000s), work experience (years), absenteeism (days per year absent) and years of education. A snip of the data is below. EXCEL was used to create the scatterplots, and the regression output is in the next slide.
a) Write down the regression equation.
b) Interpret the adjusted coefficient of determination.
c) Interpret the coefficient of education.
d) Interpret the coefficient of absenteeism.
e) Forecast the wage of an employee with 15 years of experience, 0 days absenteeism and 18 years of education. Is this estimate reliable? Explain.
f) Calculate and comment on the residual for your estimate in part e), given the actual data value, here, is 180, in $000 (in the data table).

Experience (years)   Absenteeism (days/year)   Education (years)   Wage ($000)
0                    12                        10                  25
2                    9                         15                  75
10                   2                         17                  120
15                   0                         18                  180

The complete data set may be found in the EXCEL file named “STAM4000 Data for Week 10.xls” on the subject portal, Weekly materials, Week 10.
#2 Example

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.759
R Square             0.576
Adjusted R Square    0.540
Standard Error       27.165
Observations         40

                          Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept                 -64.832        35.778           -1.812   0.0783     -137.392    7.728
Experience (years)        -0.030         1.020            -0.030   0.9764     -2.100      2.039
Absenteeism (days/year)   -0.463         1.567            -0.295   0.7693     -3.641      2.715
Education (years)         10.534         2.008            5.246    7.07E-06   6.462       14.606

Notes on the output:
• Multiple R = correlation coefficient, r = correlation between observations and fitted values.
• Adjusted R Square = adjusted r², an adjusted coefficient of determination. In multiple linear regression, it is best to use the adjusted r², as it adds precision and reliability by considering the impact of additional X variables on the fit of the model.
• The rows of the coefficients table give b0 (Intercept), b1 (Experience), b2 (Absenteeism) and b3 (Education) – the names of the independent (explanatory or predictor) X variables.
#2 Example
a) Write down the regression equation.
Theoretically, for our example: Ŷ = b0 + b1X1 + b2X2 + b3X3
which translates to,
Predicted Wage = b0 + b1 Experience + b2 Absenteeism + b3 Education
which translates to,
Predicted Wage = b0 + bExperience Experience + bAbsenteeism Absenteeism + bEducation Education
The actual answer:
Predicted Wage = -64.832 – 0.030 Experience – 0.463 Absenteeism + 10.534 Education
b) Interpret the adjusted coefficient of determination.
Adjusted r² = 0.540, which tells us that 54% of the variation in wage is explained by this regression on experience, absenteeism and education – a moderately strong linear relationship.

#2 Example
c) Interpret the coefficient of education.
Holding experience and absenteeism constant, we estimate, for each extra year of education, that wages increase by 10.534 in $000s, i.e., by $10,534 per annum, on average.
d) Interpret the coefficient of absenteeism.
After allowing for the effects of experience and education, we estimate, for each extra day of absenteeism, that wages decrease by 0.463 in $000, i.e., by $463 per annum, on average.

#2 Example
e) Forecast the wage of an employee with 15 years of experience, 0 days absenteeism and 18 years of education. Is this estimate reliable? Explain.
From the snip of data, we understand that these data values are in the sample data range, so our estimate will not be an extrapolation. (In fact, these data values exist exactly in the sample data.)
Predicted Wage = -64.832 – 0.030 Experience – 0.463 Absenteeism + 10.534 Education
Predicted Wage = -64.832 – 0.030(15) – 0.463(0) + 10.534(18)
Predicted Wage = 124.33 in $000
Predicted Wage = $124,330 per annum, estimated, on average, by the model.
As adjusted r² = 0.540, about 54% of the variation in wage is explained by this model, so the estimate is moderately reliable.
f) Calculate and comment on the residual for your estimate in part e), given the actual data value, here, is 180 in $000.
Residual = data – model
= 180,000 – 124,330
= $55,670 > 0, so this model has underestimated the annual wage for this employee.
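A minimal sketch (assuming Python) of the arithmetic in parts e) and f), using the fitted coefficients from the regression output:

```python
# Fitted coefficients from the wage regression output
b0, b_exp, b_abs, b_edu = -64.832, -0.030, -0.463, 10.534

def predict_wage(experience, absenteeism, education):
    """Predicted annual wage in $000, from the fitted model."""
    return b0 + b_exp * experience + b_abs * absenteeism + b_edu * education

wage_hat = predict_wage(15, 0, 18)
print(round(wage_hat, 2))   # 124.33, i.e. $124,330 per annum

residual = 180 - wage_hat   # actual value (180, in $000) minus model estimate
print(round(residual, 2))   # 55.67 — positive, so the model underestimated this wage
```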

#2 Exercise
A random sample of thirty rental properties was collected and values recorded for: weekly rent ($/wk), distance from the city centre (km), number of bedrooms, number of bathrooms and age of the property (years).
EXCEL was used to create the regression output on the next slide. Use this to answer the following:
a) Write down the regression equation.
b) Interpret the coefficient of distance.
c) Forecast the weekly rent for a property that is 5 km from the city centre, with 3 bedrooms, 1 bathroom, and is 1 year old. Is this estimate reliable?
d) You are now told that the actual rent is $600 per week for a property that is 5 km from the city centre, with 3 bedrooms, 1 bathroom, and is 1 year old. Calculate and comment on the residual.

Distance (km)   Bedrooms   Bathrooms   Age (years)   Rent ($/wk)
23              3          2           35            350
30              2          1           40            280
15              2          1           25            450
14              1          1           10            375
2               2          2           5             750

The complete data set may be found in the EXCEL file named “STAM4000 Data for Week 10.xls” on the subject portal, Weekly materials, Week 10.

#2 Exercise

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.907
R Square             0.823
Adjusted R Square    0.795
Standard Error       105.323
Observations         30

                Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept       612.985        86.397           7.095    1.95E-07   435.046     790.923
Distance (km)   -22.704        3.287            -6.907   3.07E-07   -29.473     -15.934
Bedrooms        7.417          21.853           0.339    0.737      -37.590     52.423
Bathrooms       101.921        35.376           2.881    0.008      29.063      174.779
Age (years)     2.799          2.277            1.229    0.230      -1.891      7.489

#3 Inference in regression

#3 Inference about the population regression relationship
Confidence intervals:
• for the intercept, β0
• for the slopes (coefficients), β1, β2 etc.
EXCEL automatically gives 95% confidence intervals for the population intercept and the population slope(s).
Hypothesis tests we will cover:
• for the intercept, β0
• for the slopes (coefficients), β1, β2 etc.
• for the overall model
EXCEL automatically gives the calculated test statistics and corresponding p-values to test:
i) two-tailed tests about ZERO for the population intercept and the population slope(s)
ii) the significance of the overall model.

#3 Confidence intervals in regression
Confidence intervals for the:
intercept, β0:  b0 ± t_crit × (SE of b0)
slope (or coefficient) β1:  b1 ± t_crit × (SE of b1)
slope (or coefficient) β2:  b2 ± t_crit × (SE of b2)
etc. for k coefficients.
Note: t critical here depends on:
o the CI percentage
o degrees of freedom = df = n – k – 1, where k = number of X variables in the regression model.
Note: df = n – k – 1 = residual degrees of freedom.
EXCEL automatically provides 95% confidence intervals (CI) in the last two columns of regression output.
To create a CI for another percentage of confidence, we use the relevant values from the regression output and find t critical from the t table.
Note: SE of b0 is the standard error of b0, SE of b1 is the standard error of b1, SE of b2 is the standard error of b2, etc.
#3 Example
For the 95% CI for the population slope (coefficient) of education, we read this directly from the regression output: (6.462, 14.606).
E.g., say we wanted a 90% CI for the population slope (coefficient) of education. For the t critical, in the t tables, we use residual df = 36, but we can use the 35 row, and the t0.05 column, to find t_crit = 1.69. Then, using the CI formula,
bEducation ± t_crit × (SE of bEducation)
= 10.534 ± 1.69 × 2.008
= (7.140, 13.928)

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.759
R Square             0.576
Adjusted R Square    0.540
Standard Error       27.165
Observations         40

ANOVA
            df   SS          MS          F        Significance F
Regression  3    36053.136   12017.712   16.285   7.49E-07
Residual    36   26566.239   737.951
Total       39   62619.375

                          Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept                 -64.832        35.778           -1.812   0.0783     -137.392    7.728
Experience (years)        -0.030         1.020            -0.030   0.9764     -2.100      2.039
Absenteeism (days/year)   -0.463         1.567            -0.295   0.7693     -3.641      2.715
Education (years)         10.534         2.008            5.246    7.07E-06   6.462       14.606
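The 90% interval for the education coefficient can be reproduced with a short sketch (assuming Python), using the coefficient, its standard error, and t critical from the t tables:

```python
# From the wage regression output: coefficient and standard error for Education
b_edu, se_edu = 10.534, 2.008
t_crit = 1.69   # t tables, 35 row (residual df = 36), t0.05 column

margin = t_crit * se_edu
ci = (round(b_edu - margin, 3), round(b_edu + margin, 3))
print(ci)   # (7.14, 13.928) — the 90% CI for the education coefficient
```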
#3 Hypothesis tests in regression
Hypothesis tests we will cover:
• t-test for the population intercept, β0
• t-tests for the population slopes (coefficients), β1, β2 etc.
• F-test for the overall population model
Steps in hypothesis testing, with regression:
Write the hypotheses.
Use either of these methods:
o Critical value method: use the EXCEL-provided t-calculated test statistic if testing about zero, OR create a calculated test statistic if testing for a non-zero value, and then use t-tables to find the t-critical.
o p-value method: only for tests about zero; use the appropriate p-value and compare to the significance level, α – QUICKER.
Decision.
Conclusion.
#3 Formula for t-tests of individual coefficients or slopes
Formula for t-tests for the population intercept, β0, or for the population slopes (coefficients), β1, β2 etc.:
t_stat = t_calc = (bi – βi) / (SE of bi)
where
βi = population coefficient of Xi
bi = sample coefficient of Xi
SE of bi = standard error of sample coefficient of Xi
As we are testing values of the population coefficients, we have βi in the hypotheses.
To find t critical, we can use degrees of freedom = df = n – k – 1 = residual df.
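As a sketch (assuming Python), the same formula reproduces both the Excel t Stat for Education (a test about zero) and the t-calc used later to test whether the education coefficient exceeds 10:

```python
def t_calc(b_i, beta_i, se_i):
    """t statistic for testing a population coefficient: (b_i - beta_i) / SE(b_i)."""
    return (b_i - beta_i) / se_i

# Education coefficient and standard error from the wage regression output
b_edu, se_edu = 10.534, 2.008

print(round(t_calc(b_edu, 0, se_edu), 3))   # 5.246 — matches the Excel t Stat column
print(round(t_calc(b_edu, 10, se_edu), 3))  # 0.266 — t-calc for Ho: beta = 10
```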
#3 Quick t-tests for individual coefficients – created by EXCEL
t-tests about zero for individual coefficients are called tests of significance. Excel automatically provides the p-value for these:
i) Two-tailed test about zero: if p-value ≤ α, reject Ho, accept Ha etc.
ii) One-tailed test about zero: if p-value/2 ≤ α, reject Ho, accept Ha etc.
Note: for a one-tailed test about zero, you must halve the p-value, but keep the α as is.
E.g., test the significance of the population intercept, β0:
Ho: β0 = 0
Ha: β0 ≠ 0
Use the p-value from the regression output, for the intercept, and compare to α etc.
E.g., left-tailed test for β0:
Ho: β0 = 0
Ha: β0 < 0; use p-value/2 and α
E.g., right-tailed test for β0:
Ho: β0 = 0
Ha: β0 > 0; use p-value/2 and α
#3 Example
For the wages example, use the copy of the regression output in the next slide to answer the following:
a) Test the significance of the coefficient of experience, at α of 5%.
b) Test the significance of the coefficient of absenteeism, at α of 5%.
c) Test the significance of the coefficient of education, at α of 5%.
d) Test that the coefficient of education is greater than 10, at α of 5%.
#3 Example
Use the P-values to do separate t-tests of significance for individual population slopes (also called coefficients).

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.759
R Square             0.576
Adjusted R Square    0.540
Standard Error       27.165
Observations         40

ANOVA
            df   SS          MS          F        Significance F
Regression  3    36053.136   12017.712   16.285   7.49E-07
Residual    36   26566.239   737.951
Total       39   62619.375

                          Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept                 -64.832        35.778           -1.812   0.0783     -137.392    7.728
Experience (years)        -0.030         1.020            -0.030   0.9764     -2.100      2.039
Absenteeism (days/year)   -0.463         1.567            -0.295   0.7693     -3.641      2.715
Education (years)         10.534         2.008            5.246    7.07E-06   6.462       14.606
#3 Example solution
a) Test the significance of the coefficient of experience, at α of 5%.
Ho: βExperience = 0
Ha: βExperience ≠ 0
Decision: p-value of 0.9764 > α of 0.05, retain Ho.
Conclusion: There is no significant linear relationship between wage and experience at α of 5%.
b) Test the significance of the coefficient of absenteeism, at α of 5%.
Ho: βAbsenteeism = 0
Ha: βAbsenteeism ≠ 0
Decision: p-value of 0.7693 > α of 0.05, we retain Ho.
Conclusion: There is no significant evidence of a linear relationship between wage and absenteeism at α of 5%.
c) Test the significance of the coefficient of education, at α of 5%.
Ho: βEducation = 0
Ha: βEducation ≠ 0
Decision: p-value of 7.07E-06 = 7.07 × 10⁻⁶ ≈ 0 < α of 0.05, we reject Ho, accept Ha.
Conclusion: The coefficient of education is statistically significant at α of 5%.
#3 Example solution continued
d) Test that the coefficient of education is greater than 10, at α of 5%.
Ho: βEducation = 10
Ha: βEducation > 10
Use t_stat = t_calc = (bEducation – βEducation) / (SE of bEducation) = (10.534 – 10) / 2.008 = 0.266
To find t critical, we can use degrees of freedom = df = n – k – 1 = residual df = 36; we use the df = 35 row of the t tables. We have α of 5% = 0.05 in the right tail, so use the t0.05 column in the t tables: t critical = 1.69.
Decision: As t calc of 0.266 < t critical of 1.69, t calc does NOT lie in the rejection region. We retain Ho at α of 5%.
Conclusion: There is no significant evidence that the population coefficient of education is greater than 10.
#3 Example: one direction test of significance
If we wanted to test for a negative linear relationship between wage and absenteeism:
Ho: βAbsenteeism = 0
Ha: βAbsenteeism < 0
Decision: As p-value/2 = 0.7693/2 ≈ 0.3847 > α of 0.05, retain Ho.
Conclusion: There is no significant negative linear relationship between wage and absenteeism.
If we wanted to test for a positive linear relationship between wage and education:
Ho: βEducation = 0
Ha: βEducation > 0
Decision: Use p-value/2 = 7.07E-06 / 2 = 3.54 × 10⁻⁶ ≈ 0 < α of 0.05, we reject Ho and accept Ha.
Conclusion: There is a significant positive linear relationship between wage and education.
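A minimal sketch (assuming Python) of the one-tailed decision rule used above: halve the two-tailed p-value from the Excel output, keep α as is, and compare.

```python
def one_tailed_decision(p_two_tailed, alpha=0.05):
    """Halve Excel's two-tailed p-value and compare to alpha (test about zero)."""
    p_one = p_two_tailed / 2
    return "reject Ho" if p_one <= alpha else "retain Ho"

# Two-tailed P-values from the wage regression output
print(one_tailed_decision(0.7693))     # retain Ho  (absenteeism: 0.3847 > 0.05)
print(one_tailed_decision(7.07e-06))   # reject Ho  (education: ~0 < 0.05)
```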
#3 Example: test the significance of the overall model
Also called a test of goodness of fit, or an F-test.
Here, we can use the Significance F value as the p-value to test the significance of the overall model. Test at α of 5%.
Ho: βExperience = βAbsenteeism = βEducation = 0
Ha: βExperience ≠ 0 and/or βAbsenteeism ≠ 0 and/or βEducation ≠ 0
Decision: The p-value of 7.49E-07 = 7.49 × 10⁻⁷ ≈ 0 < α of 0.05, so we can reject Ho, accept Ha at α of 5%.
Conclusion: There is significant evidence of a linear relationship between wage and experience, absenteeism and education, i.e., at least one of the population coefficients (slopes) is non-zero in the population.
(Note: see the notes for different versions of these hypotheses.)
#3 Example continued
Use Significance F as the p-value to test the significance of the overall model.

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.759
R Square             0.576
Adjusted R Square    0.540
Standard Error       27.165
Observations         40

ANOVA
            df   SS          MS          F        Significance F
Regression  3    36053.136   12017.712   16.285   7.49E-07
Residual    36   26566.239   737.951
Total       39   62619.375

                          Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept                 -64.832        35.778           -1.812   0.0783     -137.392    7.728
Experience (years)        -0.030         1.020            -0.030   0.9764     -2.100      2.039
Absenteeism (days/year)   -0.463         1.567            -0.295   0.7693     -3.641      2.715
Education (years)         10.534         2.008            5.246    7.07E-06   6.462       14.606
#3 Exercise
A random sample of thirty rental properties was collected and values for the following variables were recorded: weekly rent ($/wk), distance from the city centre (km), number of bedrooms, number of bathrooms and age of the property (years).
EXCEL was used to create the regression output on the next slide. Use this to answer the following:
a) Write down and interpret the 95% confidence interval for the coefficient of distance.
b) Test the significance of a linear relationship between rent and distance, at α of 5%.
c) Test for a negative linear relationship between rent and distance, at α of 5%.
d) Test the significance of the overall model at α of 5%.
#3 Exercise

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.907
R Square             0.823
Adjusted R Square    0.795
Standard Error       105.323
Observations         30

ANOVA
            df   SS            MS            F            Significance F
Regression  4    1292394.73    323098.6825   29.12659158  4.39E-09
Residual    25   277322.7701   11092.9108
Total       29   1569717.5

                Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept       612.985        86.397           7.095    1.95E-07   435.05      790.92
Distance (km)   -22.704        3.287            -6.907   3.07E-07   -29.47      -15.93
Bedrooms        7.417          21.853           0.339    0.7372     -37.59      52.42
Bathrooms       101.921        35.376           2.881    0.0080     29.06       174.78
Age (years)     2.799          2.277            1.229    0.2304     -1.89       7.49
Supplementary Exercises
Students are advised that Supplementary Exercises to this topic may be found on the subject portal under “Weekly materials”.
Solutions to the Supplementary Exercises may be available on the portal under “Weekly materials” at the end of each week.
Time permitting, the lecturer may ask students to work through some of these exercises in class.
Otherwise, it is expected that all students work through all Supplementary Exercises outside of class time.

Extension
The following slides are an extension to this week’s topic.
The work covered in the extension:
o Is not covered in class by the lecturer.
o May be assessed.
Example
A random sample of forty employees in a large, multinational company was collected and values for the following variables were recorded: annual wage ($000s), work experience (years), absenteeism (days per year) and years of education.
Open the EXCEL file named “STAM4000 Week 10 Excel.xls” and use the sheet named “Wage”.
a) Use EXCEL to create separate scatterplots of wage against each of the explanatory variables.
b) Briefly describe each scatterplot.
Example solution
a)
[Three scatterplots of wage ($000): i) against education (years), ii) against experience (years), iii) against absenteeism (days).]
i) Scatterplot of wage against education: strong (as most points are tightly clustered), positive (upward sloping), linear relationship, with one unusual point: an employee with education of 10 years but a very high wage of $140,000.
ii) Scatterplot of wage against experience: possibly curved, due to unusual points of employees with around 20 years' experience but relatively low wages.
iii) Scatterplot of wage against absenteeism: negative (downward sloping), moderately strong (as points are not tightly clustered), possibly linear relationship; one employee had absenteeism of 15 days – a relatively large number.
Example
A distributor of frozen dessert pies wants to evaluate factors thought to influence demand. Data are collected for 15 weeks.
Dependent (response) variable, Y: pie sales (units per week), the number of pies sold per week.
Independent (explanatory or predictor) variables, X1, X2: Price (in $), Advertising ($100s).

Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7
Example continued

Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

ANOVA
            df   SS          MS         F         Significance F
Regression  2    29460.027   14730.01   6.53861   0.01201
Residual    12   27033.306   2252.776
Total       14   56493.333

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389        2.68285    0.01993   57.58835    555.46404
Price         -24.97509      10.83213         -2.30565   0.03979   -48.57626   -1.37392
Advertising   74.13096       25.96732         2.85478    0.01449   17.55303    130.70888

Predicted Sales = b0 + b1 Price + b2 Advertising
Predicted Sales = 306.526 – 24.975 Price + 74.131 Advertising
Example continued
Write the regression equation. Interpret the coefficients.
Estimated Sales = 306.526 – 24.975(Price) + 74.131(Advertising)
where Sales is the number of pies sold per week, Price is in $, and Advertising is in $100s, i.e., hundreds of dollars.
b1 = -24.975: Holding advertising constant, the number of pies sold will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising.
b2 = 74.131: Holding price constant, the number of pies sold will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price.

Example continued
Predict sales for a week in which the selling price is $5.50 and advertising is $350. Explain.
Estimated Sales = 306.526 – 24.975(Price) + 74.131(Advertising)
= 306.526 – 24.975(5.50) + 74.131(3.5)
= 428.62
Note that Advertising is in $100s, so $350 means that X2 = 3.5.
Predicted sales is 428.62 pies.
The values of selling price ($5.50) and advertising ($350) are in the sample data range, so the estimate is not an extrapolation.
Adjusted R Square = 0.4417, which tells us that 44.17% of the variation in pie sales is explained by this regression on price and advertising, so the estimate is moderately reliable.
If you are now told that in a week in which the selling price is $5.50 and advertising is $350, the sales were actually 400 pies, find the residual and comment.
Residual = data – model = 400 – 428.62 = -28.62 pies. As the residual is negative, this model has overestimated the number of pies sold here.
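The prediction and residual above can be checked with a short sketch (assuming Python), using the fitted coefficients from the pie sales output:

```python
# Fitted coefficients from the pie sales regression output
b0, b_price, b_adv = 306.526, -24.975, 74.131

def predict_sales(price, advertising_100s):
    """Predicted weekly pie sales; advertising is entered in $100s."""
    return b0 + b_price * price + b_adv * advertising_100s

sales_hat = predict_sales(5.50, 3.5)   # $350 of advertising -> 3.5 (in $100s)
print(round(sales_hat, 2))             # 428.62 pies

residual = 400 - sales_hat             # actual sales minus model estimate
print(round(residual, 2))              # -28.62 — the model overestimated sales
```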

Example continued
Test the following, at α of 5%:
a) Test the significance of the coefficient of price.
Ho: βPrice = 0
Ha: βPrice ≠ 0
p-value = 0.0398 < α of 0.05, so we reject Ho and accept Ha, concluding there is a significant linear relationship between the number of pies sold and the price of the pie.
b) Test the significance of a positive linear relationship between sales and advertising.
Ho: βAdvertising = 0
Ha: βAdvertising > 0
Use p-value/2 = 0.0145/2 = 0.0073 < α of 0.05, so we reject Ho and accept Ha, concluding there is a significant positive linear relationship between the number of pies sold and the amount spent on advertising.
c) Test the significance of the overall model, i.e., a goodness of fit test.
Ho: βPrice = βAdvertising = 0
Ha: βPrice ≠ 0 and/or βAdvertising ≠ 0
Use Significance F = p-value = 0.0120 < α of 0.05, so we reject Ho and accept Ha, concluding there is an overall significant linear relationship, i.e., at least one population coefficient is non-zero.