### Quantitative Methods

1
STAM4000
Quantitative Methods
Week 10
Multiple linear regression and
inference
http://claudiaflowers.net/rsch8140/Lec1.html
2
COMMONWEALTH OF AUSTRALIA
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan
Business School pursuant to Part VB of the Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further
reproduction or communication of this material by you may be the subject of copyright
protection under the Act.
Do not remove this notice.

3
Week 10: Multiple linear regression and inference
Learning Outcomes
#1 Assumptions in linear regression
#2 Multiple linear regression
#3 Inference in regression

4
Why does this matter?
We can create linear models with more than one X variable.
We can also estimate and test regression statistics.

5
#1 Assumptions in linear regression
https://line.17qq.com/articles/ncpkdmmlv_p3.html
6
#1 Assumptions in linear regression
Use the acronym LINE
Linearity: the underlying relationship between X and Y is linear
Independence of errors: error values are statistically independent
Normality of error: error values (ε) are normally distributed for any given
value of X
Equal variance (homoscedasticity): the probability distribution of the errors
has constant variance
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty Ltd) – 9781442549272/Berenson/Business Statistics/2e

7
#1 Residual analysis
The residual is the difference between an observed data value and the predicted
data value from the fitted line (the fitted value).
Error = observed Y – expected value of Y from model
Residual = observed Y – fitted value = Y – Ŷ
The residual is the estimate of the error
Zero residual: the fitted value equals the observed value
Positive residual: the fitted value is less than the observed value
Negative residual: the fitted value is greater than the observed value
8
#1 More on residual analysis
A consequence of the least squares fitting algorithm is that the sum of the residuals,
Σ(Y – Ŷ) = 0, and hence their mean is 0.
The variance of the errors is estimated by s² = Σ(Y – Ŷ)² / (n – k – 1)
The estimated standard deviation of the errors is s = √[ Σ(Y – Ŷ)² / (n – k – 1) ]
The estimated standard deviation of the errors is referred to as the Standard Error, in Excel
output.
The Standard Error of the estimate is the typical error that occurs when the least squares
regression equation is used to estimate the value of Y for given values of the X variables.
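As a sketch of this calculation, the few observed and fitted values below are illustrative figures (taken from the residual output shown on a later slide), not a full data set; in practice the fitted values would come from the least squares equation:

```python
import numpy as np

# Illustrative sketch: residuals and the standard error of the estimate.
y_observed = np.array([245.0, 312.0, 279.0, 308.0])
y_fitted = np.array([251.92, 273.88, 284.85, 304.06])   # from a fitted model

residuals = y_observed - y_fitted      # residual = observed Y - fitted Y
n = len(y_observed)                    # number of observations
k = 1                                  # number of X variables in the model

sse = float(np.sum(residuals ** 2))    # sum of squared residuals
s_squared = sse / (n - k - 1)          # estimated variance of the errors
s = s_squared ** 0.5                   # standard error of the estimate
print(round(s, 3))
```

Note that only the full sample of residuals sums to zero; a subset like this one generally does not.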

9
#1 Residual Analysis
Excel calculates residuals as part of its regression analysis, for example:
We can use these residuals to determine whether the error variable is non-normal, whether the error variance is constant, and whether the errors are
independent…

RESIDUAL OUTPUT
Observation | Predicted Weekly Sales | Residuals = Observed Y – Predicted Y
1 | 251.92316 | -6.923162
2 | 273.87671 | 38.12329
3 | 284.85348 | -5.853484
4 | 304.06284 | 3.937162
… | … | …

10
Residual analysis for linearity
(Figure: Y versus x and residuals versus x, for a "Not Linear" and a "Linear" relationship.)
#1

11
Residual analysis for independence
(Figure: residuals versus X for a "Not independent" (patterned) and an "Independent" (randomly scattered) case.)
#1

12
Residual analysis for normality
A normal probability plot of the residuals can be used to check for normality.
(Figure: normal probability plot, Percent (0–100) versus Residual (−3 to 3).)
#1
13
Residual analysis for equal variance (homoscedasticity)
(Figure: Y versus x and residuals versus x, for a "Non-constant variance" and a "Constant variance" case.)
#1
14
#1 Example
The manager of a computer games store wishes to:
Examine the residuals between weekly sales (\$000) and the number of
customers making purchases.

Weekly sales (\$000) | No. customers
245 | 1400
312 | 1600
279 | 1700
308 | 1875
199 | 1100
219 | 1550
405 | 2350
324 | 2450
319 | 1425
255 | 1700

(Figure: scatterplot of weekly sales (\$000, 0–500) versus number of customers (0–3000).)

15
Example
Does not appear to violate any regression assumptions as:
i) the residual plot is well scattered, so linearity, independence and homoscedasticity are satisfied;
ii) the normal probability plot follows an upward-sloping diagonal line.
#1
(Figure: weekly sales residual plot – residuals (observed weekly sales – predicted weekly sales, −100 to 100) versus number of customers (0–3000).)
Number of customers | Weekly sales (\$000)
1400 | 245
1600 | 312
1700 | 279
1875 | 308
1100 | 199
1550 | 219
2350 | 405
2450 | 324
1425 | 319
1700 | 255

Predicted Weekly sales (\$000) | Residuals
251.92 | -6.92
273.88 | 38.12
284.85 | -5.85
304.06 | 3.94
218.99 | -19.99
268.39 | -49.39
356.20 | 48.80
367.18 | -43.18
254.67 | 64.33
284.85 | -29.85

(Figure: normal probability plot of weekly sales (\$000) versus sample percentile.)
16
#2 Multiple linear regression
https://www.pinterest.com.au/pin/515662226059772035/
17
𝑦𝑖 = 𝛽0+ 𝛽1 𝑥1𝑖 + 𝛽2𝑥2𝑖 + ⋯ + 𝛽k𝑥k𝑖 + 𝜀𝑖
The Population Multiple Regression equation with k independent variables
Y-intercept Population slopes
(population coefficients)
Random Error
ŷ = b0 + b1x1 + b2x2 + ⋯ + bkxk
The Sample (fitted) Multiple Regression Model (equation) with k independent variables
Estimated
y-intercept
Estimated slopes, coefficients.
Note: with b1, b2 etc. we may, instead, use the names of the X
variables in the subscripts to the coefficients.
This Photo by Unknown Author is licensed
under
CC BY-SA
#2 Model with multiple linear regression (MLR)
18
Interpretation of coefficients or slopes in multiple linear
regression (MLR)

In MLR, we have multiple X variables:
• There is a separate coefficient or slope for each X variable.
In MLR, to interpret the coefficient or slope of an X variable:
• We try to describe the effect of one X variable on the Y variable whilst:
o holding all other X variables constant, or
o after allowing for the effect of all other X variables.
#2
19
Multiple linear regression: correlation coefficients
and two coefficients of determination

In MLR, we can use EXCEL to create a correlation table of correlation coefficients, r:
• This table summarises the direction and strength of a linear relationship between the Y variable separately with each individual X variable.
In MLR EXCEL regression output, r:
• This value of r measures the correlation between observations and fitted values.
In MLR, we will have two coefficients of determination:
• r²: in MLR, as in SLR, we can multiply this by 100% to get the percentage of variation in the dependent variable, Y, explained by the variation in the independent variables, Xi.
• Adjusted r²: amended to take into consideration the number of independent variables, Xi.
#2
Sample correlation coefficient notation: r, R, Multiple R
20
Copyright © 2013 Pearson Australia (a division of Pearson Australia Group Pty
#2
Recall, the coefficient of determination, r², measures the proportion of variation in the Y
variable explained by the regression on the X variable.
However, with Multiple Linear Regression, we now have more than one X variable – this
affects r²:
r² never decreases when a new X variable is added to the model – this can be misleading.
So, in MLR, we can also use the Adjusted r², which:
Shows the proportion of variation in Y explained by all X variables adjusted for the number
of X variables used
Penalises excessive use of unimportant explanatory variables
Adjusted r2 will be smaller than r2
Adjusted r2 is useful for comparing models
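As a sketch of the adjustment, the standard formula is adjusted r² = 1 − (1 − r²)(n − 1)/(n − k − 1); the values 0.576, 40 and 3 below are the r², observation count and number of X variables from the wages regression output on the following slides:

```python
# Sketch of the adjusted r-squared formula using the wages example's values.
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted r^2 = 1 - (1 - r^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

adj = adjusted_r_squared(0.576, 40, 3)
print(round(adj, 2))  # close to the 0.540 Excel reports (r^2 here is rounded)
```

Because the denominator n − k − 1 shrinks as k grows, adding an unimportant X variable can lower the adjusted r² even though r² itself never decreases.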
21
Example
A random sample of forty employees in a large, multinational company was collected and
values for the following variables were recorded: annual
wage (\$000s), work experience
(years), absenteeism (days per year absent) and years of education. A snip of the data is
below. EXCEL was used to create the scatterplots and regression output is in the next slide.
a) Write down the regression equation.
b) Interpret the adjusted coefficient of determination.
c) Interpret the coefficient of education.
d) Interpret the coefficient of absenteeism.
e) Forecast the wage of an employee with 15 years of
experience, 0 days absenteeism and 18 years of
education. Is this estimate reliable? Explain.
f) Calculate and comment on the residual for your
estimate in part e), given the actual data value, here, is 180, in \$000, (in table above).
#2
CC BY-NC-ND

Experience (years) | Absenteeism (days/year) | Education (years) | Wage (\$000)
0 | 12 | 10 | 25
2 | 9 | 15 | 75
10 | 2 | 17 | 120
15 | 0 | 18 | 180
… | … | … | …

The complete data set may be found in the EXCEL file named
“STAM4000 Data for Week 10.xls” on the subject portal, Weekly materials, Week 10.

22
#2 Example
CC BY-NC-ND

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.759
R Square | 0.576
Adjusted R Square | 0.540
Standard Error | 27.165
Observations | 40

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | -64.832 | 35.778 | -1.812 | 0.0783 | -137.392 | 7.728
Experience (years) | -0.030 | 1.020 | -0.030 | 0.9764 | -2.100 | 2.039
Absenteeism (days/year) | -0.463 | 1.567 | -0.295 | 0.7693 | -3.641 | 2.715
Education (years) | 10.534 | 2.008 | 5.246 | 7.07E-06 | 6.462 | 14.606

Names of the independent or explanatory or predictor, (X) variables

Adjusted r² is an adjusted coefficient of determination. In multiple linear regression, it is best to use the adjusted r², as it adds precision and reliability by considering the impact of additional X variables on the fit of the model.
Correlation coefficient, r = correlation between observations and fitted values.
b0, b1, b2, b3 label the estimated intercept and coefficients in the Coefficients column of the output.
23
#2 Example

a) Write down the regression equation.
Theoretically, for our example: Ŷ = b0 + b1X1 + b2X2 + b3X3
CC BY-NC-ND
which translates to,
Ŵage = b0 + b1Experience + b2Absenteeism + b3Education
which translates to,
Ŵage = b0 + bExperience Experience + bAbsenteeism Absenteeism + bEducation Education
Ŵage = -64.832 – 0.030 Experience – 0.463 Absenteeism + 10.534 Education
b) Interpret the adjusted coefficient of determination.
Adjusted r2 = 0.540 which tells us that 54% of the variance in wage is explained by this
regression on experience, absenteeism and education – a moderately strong linear
relationship.

24
#2 Example
CC BY-NC-ND
c) Interpret the coefficient of education.
Holding experience and absenteeism constant, we estimate, for each extra year of
education, that wages increase by 10.534 in \$000s, i.e., by \$10,534 per annum, on
average.
d) Interpret the coefficient of absenteeism.
After allowing for the effects of experience and education, we estimate for each
extra day of absenteeism, that wages decrease by 0.463 in \$000, i.e., by \$463 per
annum, on average.

25
#2 Example
e) Forecast the wage of an employee with 15 years of experience, 0 days absenteeism and 18 years
of education. Is this estimate reliable? Explain.
CC BY-NC-ND
From the snip of data, we understand that these data values are in the sample data range, so our
estimate will not be an extrapolation. (In fact, these data values exist exactly in the sample data).
Ŵage = -64.832 – 0.030 Experience – 0.463 Absenteeism + 10.534 Education
Ŵage = -64.832 – 0.030(15) – 0.463(0) + 10.534(18)
Ŵage = 124.33 in \$000
Ŵage = \$124,330 per annum, estimated, on average, by the model
As adjusted r² = 0.540, about 54% of the variation in wage is explained by this model, so moderately
reliable.
f) Calculate and comment on the residual for your estimate in part e), given the actual data value,
here, is 180 in \$000.
Residual = data – model
= 180,000 – 124,330
= \$55,670 > 0, this model has underestimated the annual wage for this employee.
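The arithmetic in parts e) and f) can be sketched as follows; the coefficients are taken from the regression output and wage units are \$000s:

```python
# Sketch: forecast and residual for the wages example (parts e and f).
b0, b_exp, b_abs, b_edu = -64.832, -0.030, -0.463, 10.534  # fitted coefficients

def predict_wage(experience: float, absenteeism: float, education: float) -> float:
    """Predicted wage in $000s from the fitted MLR equation."""
    return b0 + b_exp * experience + b_abs * absenteeism + b_edu * education

fitted = predict_wage(15, 0, 18)   # employee from part e)
residual = 180 - fitted            # observed - fitted, in $000s
print(round(fitted, 2), round(residual, 2))   # 124.33 55.67
```

A positive residual confirms the model underestimated this employee's wage.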

26
#2 Exercise
A random sample of thirty rental properties was collected and values recorded for: weekly rent (\$/wk), distance from the city centre (km), number of bedrooms, number of bathrooms and age of the property (years).
This Photo by Unknown Author is licensed under CC BY
The complete data set may be found in the EXCEL file named "STAM4000 Data for Week 10.xls" on the subject portal, Weekly materials, Week 10.
EXCEL was used to create the regression output on the next slide. Use this to answer the following:
a) Write down the regression equation.
b) Interpret the coefficient of distance.
c) Forecast the weekly rent for a property that is 5 km from the city centre, with 3 bedrooms, 1 bathroom, and is 1 year old. Is this estimate reliable?
d) You are now told that the actual rent is \$600 per week for a property that is 5 km from the city centre, with 3 bedrooms, 1 bathroom, and is 1 year old. Calculate and comment on the residual.

Distance (km) | Bedrooms | Bathrooms | Age (years) | Rent (\$/wk)
23 | 3 | 2 | 35 | 350
30 | 2 | 1 | 40 | 280
15 | 2 | 1 | 25 | 450
14 | 1 | 1 | 10 | 375
2 | 2 | 2 | 5 | 750
… | … | … | … | …

27
#2 Exercise

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.907
R Square | 0.823
Adjusted R Square | 0.795
Standard Error | 105.323
Observations | 30

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 612.985 | 86.397 | 7.095 | 1.95E-07 | 435.046 | 790.923
Distance (km) | -22.704 | 3.287 | -6.907 | 3.07E-07 | -29.473 | -15.934
Bedrooms | 7.417 | 21.853 | 0.339 | 0.737 | -37.590 | 52.423
Bathrooms | 101.921 | 35.376 | 2.881 | 0.008 | 29.063 | 174.779
Age (years) | 2.799 | 2.277 | 1.229 | 0.230 | -1.891 | 7.489

31
#3 Inference in regression
https://www.cartoonstock.com/directory/s/statistics.asp
32
Confidence intervals
for the intercept, β0
for the slopes (coefficients) β1 , β2 etc.
Hypothesis Tests we will cover:
for the intercept, β0
for the slopes (coefficients), β1 , β2 etc.
for the overall model
#3 Inference about the population regression relationship
EXCEL automatically gives 95% confidence
intervals for the population intercept
and the population slope(s)
EXCEL automatically gives the calculated test
statistics and corresponding p-values to test
the following:

i) Two-tailed tests about ZERO for the population intercept and the population slope(s)
ii) Test the significance of the overall model

33
Confidence intervals for the:
intercept, β0: b0 ± t_crit × (SE of b0)
slope (or coefficient) β1: b1 ± t_crit × (SE of b1)
slope (or coefficient) β2: b2 ± t_crit × (SE of b2)
etc. for k coefficients
Note: t critical here depends on:
o CI percentage
o degrees of freedom = df = n – k – 1
where k = number of X variables in the regression
model.
Note df = n – k – 1 = residual degrees of freedom
Confidence intervals in regression
EXCEL automatically provides
95% confidence intervals (CI) in the last
two columns of regression output.
To create CI for other % of confidence,
we use relevant values from the
regression output and
find t critical from the t table.
#3 Note:
SE of b0 is the standard error of b0
SE of b1 is the standard error of b1
SE of b2 is the standard error of b2, etc.
34
Example

This Photo by Unknown Author is licensed under CC BY-NC-ND
E.g. Say we wanted a 90% CI for the population slope (coefficient) of education.
For the 95% CI for the population slope (coefficient) of education, we read this directly from the regression output: (6.462, 14.606).
For the t critical, in the t tables, we use residual df = 36, but we can use the 35 row, and the t 0.05 column, to find t_crit = 1.69.
Then using the CI formula, b_Education ± t_crit × (SE of b_Education) = 10.534 ± 1.69 × 2.008 = (7.140, 13.928)

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.759
R Square | 0.576
Adjusted R Square | 0.540
Standard Error | 27.165
Observations | 40

ANOVA
 | df | SS | MS | F | Significance F
Regression | 3 | 36053.136 | 12017.712 | 16.285 | 7.49E-07
Residual | 36 | 26566.239 | 737.951 | |
Total | 39 | 62619.375 | | |

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | -64.832 | 35.778 | -1.812 | 0.0783 | -137.392 | 7.728
Experience (years) | -0.030 | 1.020 | -0.030 | 0.9764 | -2.100 | 2.039
Absenteeism (days/year) | -0.463 | 1.567 | -0.295 | 0.7693 | -3.641 | 2.715
Education (years) | 10.534 | 2.008 | 5.246 | 7.07E-06 | 6.462 | 14.606
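The 90% CI for the education coefficient can be sketched as follows, with the coefficient and standard error from the regression output and the t critical of 1.69 read from the printed t tables (df = 35 row):

```python
# Sketch: 90% confidence interval for the education coefficient.
b_edu = 10.534    # coefficient of education from the output
se_edu = 2.008    # its standard error
t_crit = 1.69     # t_0.05 with residual df = 36 (df = 35 row of printed tables)

margin = t_crit * se_edu
lower, upper = b_edu - margin, b_edu + margin
print(f"({lower:.3f}, {upper:.3f})")   # → (7.140, 13.928)
```

For the 95% CI, no calculation is needed: Excel prints it directly in the Lower 95% and Upper 95% columns.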

#3
35
Steps in hypothesis testing, with regression:
Write the hypotheses
Use either of these methods:
o Critical value method: use the EXCEL-provided t calculated test
statistic if testing about zero, OR create a calculated test statistic if
testing for a non-zero value, and then use t-tables to find the t critical.
o p-value method: only for tests about zero, use the appropriate p-value, compare to the significance level, α – QUICKER
Decision
Conclusion
Hypothesis tests in regression
Hypothesis Tests we will cover:
t-test for the population intercept, β0
t-tests for the population slopes (coefficients), β1 , β2 etc.
F-test for the overall population model
#3
36
Formula for t-tests of individual coefficients or slopes
Formula for t-tests for the population intercept, β0, or for t-tests for the population
slopes (coefficients), β1, β2 etc.:
t_stat = t_calc = (b_i – β_i) / (SE of b_i)
𝛽𝑖 = population coefficient of 𝑋𝑖
𝑏𝑖 = sample coefficient of 𝑋𝑖
SE of 𝑏𝑖 = standard error of sample coefficient of 𝑋𝑖
As we are testing values of the population coefficients, we have 𝜷𝒊 in the
hypotheses.
To find t critical, we can use degrees of freedom = df = n – k – 1 =
residual df
#3
37
Quick t-tests for individual coefficients – created by EXCEL
t-tests about zero for individual coefficients are called tests of significance.
Excel automatically provides the p-value for these:
i) Two-tailed test about zero: if p-value ≤ α, reject Ho, accept Ha etc.
ii) One-tailed test about zero: if (p-value)/2 ≤ α, reject Ho, accept Ha etc.
Note: for a one-tailed test about zero, you must halve the p-value, but keep the α as is.
E.g., Test the significance of the population intercept, β0
Ho: β0 = 0
Ha: β0 ≠ 0
Use p-value from regression, for the intercept, compare to α etc.
#3
E.g., Left-tailed test for β0
Ho: β0 = 0
Ha: β0 < 0, use (p-value)/2 & α
E.g., Right-tailed test for β0
Ho: β0 = 0
Ha: β0 > 0, use (p-value)/2 & α
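The decision rules above can be sketched with a small helper (the function name `reject_h0` is illustrative, not from the slides; for one-tailed tests it assumes the sample coefficient lies in the direction of Ha):

```python
# Sketch: Excel gives two-tailed p-values; halve them for one-tailed tests
# about zero, keeping alpha as is.
def reject_h0(p_two_tailed: float, alpha: float = 0.05,
              one_tailed: bool = False) -> bool:
    p = p_two_tailed / 2 if one_tailed else p_two_tailed
    return p <= alpha

print(reject_h0(0.7693))                     # absenteeism, two-tailed: False
print(reject_h0(7.07e-06, one_tailed=True))  # education, right-tailed: True
```

The p-values 0.7693 and 7.07E-06 are the absenteeism and education entries from the wages regression output.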
38
Example
CC BY-NC-ND
For the wages example, use the copy of the regression output in the next slide to answer the
following:
a) Test the significance of the coefficient of experience, at α of 5%.
b) Test the significance of the coefficient of absenteeism, at α of 5%.
c) Test the significance of the coefficient of education, at α of 5%.
d) Test that the coefficient of education is greater than 10, at α of 5%.
#3
39
Example

CC BY-NC-ND
Use the P-values to do separate t-tests of significance for

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.759
R Square | 0.576
Adjusted R Square | 0.540
Standard Error | 27.165
Observations | 40

ANOVA
 | df | SS | MS | F | Significance F
Regression | 3 | 36053.136 | 12017.712 | 16.285 | 7.49E-07
Residual | 36 | 26566.239 | 737.951 | |
Total | 39 | 62619.375 | | |

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | -64.832 | 35.778 | -1.812 | 0.0783 | -137.392 | 7.728
Experience (years) | -0.030 | 1.020 | -0.030 | 0.9764 | -2.100 | 2.039
Absenteeism (days/year) | -0.463 | 1.567 | -0.295 | 0.7693 | -3.641 | 2.715
Education (years) | 10.534 | 2.008 | 5.246 | 7.07E-06 | 6.462 | 14.606

#3
individual population slopes (also called coefficients).
40
a) Test the significance of
the coefficient of
experience, at α of 5%.
Ho: βExperience = 0
Ha: βExperience ≠ 0
Decision:
p-value of 0.9764 > α of
0.05, retain Ho.
Conclusion:
There is no significant
linear relationship
between wage and
experience at α of 5%.
b) Test the significance of
the coefficient of
absenteeism, at α of 5%.
Ho: βAbsenteeism = 0
Ha: βAbsenteeism ≠ 0
Decision:
p-value of 0.7693 > α of 0.05
we retain Ho.
Conclusion: There is no
significant evidence of a
linear relationship between
wage and absenteeism at α
of 5%.
c) Test the significance of the
coefficient of education, at α
of 5%.
Ho: βEducation= 0
Ha: βEducation ≠ 0
Decision:
p-value of 7.07E-06 = 7.07 × 10⁻⁶ ≈ 0 < α of 0.05, we reject Ho,
accept Ha.
Conclusion: The coefficient of
education is statistically
significant at α of 5%.
Example solution
CC BY-NC-ND
#3
41
d) Test that the coefficient of education is greater than 10, at α of 5%.
Ho: βEducation = 10
Ha: βEducation > 10
Use t_stat = t_calc = (b_Education – β_Education) / (SE of b_Education) = (10.534 – 10) / 2.008 = 0.266
To find t critical, we can use degrees of freedom = df = n – k – 1 = residual df = 36; we use the
df = 35 row of the t tables. We have α of 5% = 0.05 in the right tail, so use the t 0.05 column: t critical = 1.69.
Decision: As t calc of 0.266 < t critical of 1.69, t calc does NOT lie in the rejection region. We retain Ho at
α of 5%.
Conclusion:
There is no significant evidence that the population coefficient of education is greater than 10.
Example solution continued
CC BY-NC-ND
#3
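The test in part d) can be sketched as:

```python
# Sketch: right-tailed t test of H0: beta_Education = 10 vs Ha: beta_Education > 10.
b_edu, se_edu = 10.534, 2.008   # from the regression output
beta_0 = 10.0                   # hypothesised value (non-zero, so Excel's
                                # printed t Stat and P-value do not apply)
t_calc = (b_edu - beta_0) / se_edu
t_crit = 1.69                   # df = 36, read from the df = 35 row of t tables
print(round(t_calc, 3), t_calc > t_crit)   # 0.266 False -> retain H0
```

Because the hypothesised value is 10 rather than 0, the test statistic must be built by hand; Excel's printed t Stat of 5.246 tests only against zero.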
42
If we wanted to test for a negative linear
relationship between wage and
absenteeism:
Ho: βAbsenteeism = 0
Ha: βAbsenteeism < 0
Decision:
As (p-value)/2 = 0.7693/2 ≈ 0.3847 > α of 0.05, we
retain Ho.
Conclusion:
There is no significant negative linear
relationship between wage and absenteeism.
If we wanted to test for a positive linear
relationship between wage and education:
Ho: βEducation = 0
Ha: βEducation > 0
Decision:

Use (p-value)/2 = (7.07 × 10⁻⁶)/2 ≈ 0 < α
of 0.05 we reject Ho and accept Ha.
Conclusion:
There is a significant positive linear relationship
between wage and education.
#3 Example: one direction test of significance
CC BY-NC-ND
43
Here, we can use the Significance F value, as the p-value to test the significance of the overall
model. Test at α of 5%.
Ho: βExperience = βAbsenteeism = βEducation = 0
Ha: βExperience ≠ 0 and/or βAbsenteeism ≠ 0 and/or βEducation ≠ 0
Decision: The p-value of 7.49E-07 = 7.49 × 10-7 ≈ 0 < α of 0.05, we can reject Ho, accept
Ha at α of 5%.
Conclusion: There is significant evidence of a linear relationship between wage and
experience, absenteeism and education i.e., at least one of the population coefficients
(slopes) is non-zero in the population.
(Note: see the notes for different versions of these hypotheses.)
Example: test the significance of the overall model
Also called test of goodness of fit or an F-test
#3
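The F statistic itself comes from the ANOVA table; as a sketch with the wages values (Regression SS = 36053.136 on df = 3, Residual SS = 26566.239 on df = 36):

```python
# Sketch: overall F statistic = MS(Regression) / MS(Residual) from the ANOVA table.
ss_reg, df_reg = 36053.136, 3
ss_res, df_res = 26566.239, 36

ms_reg = ss_reg / df_reg    # mean square regression, 12017.712
ms_res = ss_res / df_res    # mean square error, 737.951
f_stat = ms_reg / ms_res
print(round(f_stat, 3))     # matches the F of 16.285 in the output
```

In practice we simply compare Significance F (the p-value for this F statistic) to α, as the slide does.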
44
Example continued
CC BY-NC-ND

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.759
R Square | 0.576
Adjusted R Square | 0.540
Standard Error | 27.165
Observations | 40

ANOVA
 | df | SS | MS | F | Significance F
Regression | 3 | 36053.136 | 12017.712 | 16.285 | 7.49E-07
Residual | 36 | 26566.239 | 737.951 | |
Total | 39 | 62619.375 | | |

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | -64.832 | 35.778 | -1.812 | 0.0783 | -137.392 | 7.728
Experience (years) | -0.030 | 1.020 | -0.030 | 0.9764 | -2.100 | 2.039
Absenteeism (days/year) | -0.463 | 1.567 | -0.295 | 0.7693 | -3.641 | 2.715
Education (years) | 10.534 | 2.008 | 5.246 | 7.07E-06 | 6.462 | 14.606

#3

Use Significance F as the p-value to test the significance of the overall model.

45
A random sample of thirty rental properties was collected and values for the following variables
were recorded: weekly
rent, (\$/wk), distance from the city centre (km), number of bedrooms,
number of
bathrooms and age of the property (year).
EXCEL was used to create the regression output on the next slide. Use this to answer the
following:
a) Write down and interpret the 95% confidence interval for the coefficient of distance.
b) Test the significance of a linear relationship between rent and distance, at α of 5%.
c) Test for a negative linear relationship between rent and distance at α of 5%.
d) Test the significance of the overall model at α of 5%.
#3
46
Exercise

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.907
R Square | 0.823
Adjusted R Square | 0.795
Standard Error | 105.323
Observations | 30

ANOVA
 | df | SS | MS | F | Significance F
Regression | 4 | 1292394.73 | 323098.6825 | 29.12659158 | 4.39E-09
Residual | 25 | 277322.7701 | 11092.9108 | |
Total | 29 | 1569717.5 | | |

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 612.985 | 86.397 | 7.095 | 1.95E-07 | 435.05 | 790.92
Distance (km) | -22.704 | 3.287 | -6.907 | 3.07E-07 | -29.47 | -15.93
Bedrooms | 7.417 | 21.853 | 0.339 | 0.7372 | -37.59 | 52.42
Bathrooms | 101.921 | 35.376 | 2.881 | 0.0080 | 29.06 | 174.78
Age (years) | 2.799 | 2.277 | 1.229 | 0.2304 | -1.89 | 7.49

#3
50
Supplementary Exercises
Students are advised that Supplementary Exercises to this topic may be found on the
subject portal under “Weekly materials”.
Solutions to the Supplementary Exercises may be available on the portal under "Weekly
materials" at the end of each week.
Time permitting, the lecturer may ask students to work through some of these exercises
in class.
Otherwise, it is expected that all students work through all Supplementary Exercises
outside of class time.

51
Extension
The following slides are an extension to this week’s topic.
The work covered in the extension:
o Is not covered in class by the lecturer.
o May be assessed.
52
Example
A random sample of forty employees in a large, multinational company was
collected and values for the following variables were recorded: annual
wage
(\$000s), work experience (years), absenteeism (days per year) and years of
education.
Open the EXCEL file named “STAM4000 Week 10 Excel.xls” and
use the sheet
named “Wage”.
a)Use EXCEL to create separate scatterplots of wage against each of the
explanatory variables.
b) Briefly describe each scatterplot.
CC BY-NC-ND
53
Example solution
(Figures: scatterplots of wage (\$000, 0–200) against education (years), experience (years) and absenteeism (days).)
a)
i) Scatterplot of wage against education: strong (as most points are tightly clustered), positive (upward sloping),
linear relationship, with one unusual point: an employee with education of 10 years but a very high wage of
\$140,000.
ii) Scatterplot of wage against experience: possibly curved, due to unusual points of employees with around 20
years' experience but relatively low wages.
iii) Scatterplot of wage against absenteeism: negative (downward sloping), moderately strong (as points are not
tightly clustered), possibly linear relationship, with one employee having absenteeism of 15 days – a
relatively large number.
CC BY-NC-ND
54
Example

Week | Pie Sales | Price (\$) | Advertising (\$100s)
1 | 350 | 5.50 | 3.3
2 | 460 | 7.50 | 3.3
3 | 350 | 8.00 | 3.0
4 | 430 | 8.00 | 4.5
5 | 350 | 6.80 | 3.0
6 | 380 | 7.50 | 4.0
7 | 430 | 4.50 | 3.0
8 | 470 | 6.40 | 3.7
9 | 450 | 7.00 | 3.5
10 | 490 | 5.00 | 4.0
11 | 340 | 7.20 | 3.5
12 | 300 | 7.90 | 3.2
13 | 440 | 5.90 | 4.0
14 | 450 | 5.00 | 3.5
15 | 300 | 7.00 | 2.7

A distributor of frozen dessert pies wants to evaluate factors
thought to influence demand. Data are collected for 15 weeks.
Dependent (response) variable: Y
Pie sales (units per week), number of pies sold per week.
Independent (explanatory or predictor) variables: X1 = Price (\$), X2 = Advertising (\$100s)
55

Regression Statistics
Multiple R | 0.72213
R Square | 0.52148
Adjusted R Square | 0.44172
Standard Error | 47.46341
Observations | 15

ANOVA
 | df | SS | MS | F | Significance F
Regression | 2 | 29460.027 | 14730.01 | 6.53861 | 0.01201
Residual | 12 | 27033.306 | 2252.776 | |
Total | 14 | 56493.333 | | |

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 306.52619 | 114.25389 | 2.68285 | 0.01993 | 57.58835 | 555.46404
Price | -24.97509 | 10.83213 | -2.30565 | 0.03979 | -48.57626 | -1.37392
Advertising | 74.13096 | 25.96732 | 2.85478 | 0.01449 | 17.55303 | 130.70888

Example continued

Ŝales = b0 + b1Price + b2Advertising
Ŝales = 306.526 – 24.975Price + 74.131Advertising
56
Estimated Sales = 306.526 – 24.975(Price) + 74.131(Advertising)
b1 = -24.975:
Holding advertising constant, number of pies sold will decrease, on average, by 24.975 pies per week for
each \$1 increase in selling price, net of the effects of changes due to advertising.
b2 = 74.131:
Holding price constant, number of pies sold will increase, on average, by 74.131 pies per week for each \$100
increase in advertising, net of the effects of changes due to price.
Where,
Sales is in number of pies sold per week
Price is in \$
Advertising is in \$100s, i.e. hundreds of dollars.
Example continued
Write the regression equation.
Interpret the coefficients.

57
Predict sales for a week in which the selling price is \$5.50 and advertising is \$350. Explain.
Predicted sales is 428.62 pies.
The values of selling price, \$5.50, and advertising, \$350, are in the sample data range, so the estimate
is not an extrapolation.
Adjusted R-square = 0.4417, which tells us that 44.17% of the variation in pie sales is explained by this
regression on price and advertising, so the estimate is moderately reliable.
Estimated Sales = 306.526 – 24.975(Price) + 74.131(Advertising)
= 306.526 – 24.975 (5.50) + 74.131 (3.5)
= 428.62
Note that Advertising is in \$100s, so \$350 means that X2 = 3.5
Example continued
If you are now told that a week in which the selling price is \$5.50 and advertising is \$350, the sales were
actually 400 pies, find the residual and comment.
Residual = data – model = 400 – 428.62 = -28.62 pies. As the residual is negative, this model has
overestimated the number of pies sold here.
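The prediction and residual above can be sketched as follows (coefficients from the pie-sales regression output; advertising is entered in \$100s):

```python
# Sketch: pie-sales forecast and residual; advertising of $350 enters as 3.5.
b0, b_price, b_adv = 306.526, -24.975, 74.131   # fitted coefficients

sales_hat = b0 + b_price * 5.50 + b_adv * 3.5   # predicted weekly pie sales
residual = 400 - sales_hat                      # observed - predicted
print(round(sales_hat, 2), round(residual, 2))  # 428.62 -28.62
```

The negative residual confirms the model overestimated sales for this week.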

58
Test the following, at α of 5%:
a) Test the significance of the coefficient of price.
Ho: 𝛽𝑃𝑟𝑖𝑐𝑒 = 0
Ha: 𝛽𝑃𝑟𝑖𝑐𝑒 ≠ 0
p-value = 0.0398 < α of 0.05, we reject Ho and accept Ha, concluding there is a significant linear relationship
between the number of pies sold and the price of the pie.
b) Test the significance of a positive linear relationship between sales and advertising.
Ho: 𝛽𝐴𝑑𝑣𝑒𝑟𝑡𝑖𝑠𝑖𝑛𝑔 = 0
Ha: 𝛽𝐴𝑑𝑣𝑒𝑟𝑡𝑖𝑠𝑖𝑛𝑔 > 0
Use (p-value)/2 = 0.0145/2 = 0.0073 < α of 0.05, we reject Ho and accept Ha, concluding there is a significant positive
linear relationship between the number of pies sold and amount spent on advertising
c) Test the significance of the overall model i.e. a goodness of fit test.
Ho: 𝛽𝑃𝑟𝑖𝑐𝑒 = 𝛽𝐴𝑑𝑣𝑒𝑟𝑡𝑖𝑠𝑖𝑛𝑔 = 0
Ha: 𝛽𝑃𝑟𝑖𝑐𝑒 ≠ 0 and/or 𝛽𝐴𝑑𝑣𝑒𝑟𝑡𝑖𝑠𝑖𝑛𝑔 ≠ 0
Use Significance F = p-value = 0.0120 < α of 0.05, we reject Ho and accept Ha, concluding there is an overall
significant linear relationship i.e., at least one population coefficient is non-zero.
Example continued