### Correlation and simple linear regression

STAM4000
Quantitative Methods
Week 9
Correlation and simple linear regression
STAM4000 students are expected to know how to INTERPRET EXCEL output,
but are NOT expected to know how to create EXCEL output.
In this class, we will learn about the following:
Correlation measures the strength and direction of a linear association between
two quantitative variables.
Simple linear regression is used to model the relationship between two quantitative
variables, and the model may be used to predict future values.
Kaplan Business School (KBS), Australia 1
COMMONWEALTH OF AUSTRALIA
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan
Business School pursuant to Part VB of the
The material in this communication may be subject to copyright under the Act. Any further
reproduction or communication of this material by you may be the subject of copyright
protection under the Act.
Do not remove this notice.

Week 9 Correlation and simple linear regression: Learning Outcomes
#1 Examine the relationship between two quantitative variables
#2 Differentiate between correlation and causation
#3 Model with simple linear regression
#4 Create and assess reliability of forecasts

By the end of this class, students will be able to:
Create and describe a scatterplot between two quantitative variables.
Understand the difference between correlation and causation.
Create and interpret sections of simple linear regression Excel output.
Use a regression equation to predict values and assess the reliability of those predictions.
Why does this matter?
If there is an association between two quantitative variables, we can model the
relationship to predict future values.
Our problem objective is to analyse the relationship between numerical variables; regression analysis is the tool we
will study in this class.
Regression analysis is used to predict the value of one variable (the dependent variable)
on the basis of one or more other variables (the independent variables).
Dependent variable: denoted Y
Independent variables: denoted X1, X2, …, Xk
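As a rough illustration of predicting Y from a single X, the sketch below fits a least-squares straight line to a handful of made-up (x, y) values and then uses that line to predict Y for a new X. The data and the name `fit_line` are assumptions for illustration only; they are not from the class materials, and in this class the equivalent numbers are read from Excel output.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: X is the independent variable, Y the dependent variable
x_values = [1, 2, 3, 4, 5]
y_values = [52, 55, 61, 64, 68]

b0, b1 = fit_line(x_values, y_values)
predicted_y = b0 + b1 * 6  # predicted Y when X = 6
print(f"Y = {b0:.1f} + {b1:.1f} X; predicted Y at X = 6 is {predicted_y:.1f}")
```

With only one independent variable this is simple linear regression; with X1, X2, …, Xk the same idea extends to more than one predictor.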
#1 Examine the relationship between two quantitative variables
When we first examine the relationship between two quantitative or numerical variables (numbers with
units), we should create a scatterplot.
The scatterplot will show us how the variables relate to each other:
direction
shape or form
strength
unusual features
Describe the relationship between two quantitative variables
We have two quantitative variables and we want to understand the relationship, so we draw a picture.
We use cartesian axes, also called X, Y axes, to create a scatterplot of two quantitative variables: the Y variable against, or
versus, the X variable.
Scatterplots are a tool for representing the relationship between two variables. They are useful when thinking about
constructing a mathematical model of a data set, since they provide an insight to the type of model we may need.
X is the “independent” or “explanatory” variable or “predictor” variable.
Y is the “dependent” or “response” variable.
Before creating a scatterplot, it is best to decide which variable is responding to the other.
The cartesian axes have four quadrants:
The X and Y axes intersect at the point (0, 0), called the origin.
The X axis is the horizontal axis, negative x values to the left of origin and positive x values to the right of the origin.
The Y axis is the vertical axis; negative y values are beneath the origin and positive y values are above the origin.
Each point on the scatterplot is a coordinate (x, y).
Note:
The scatterplot should be titled using the names of the variables.
A scatterplot title is usually of the form: “Scatterplot of the named Y variable versus, or against, the named X variable”.
In business, we mainly have positive numbers and deal with the top right quadrant of the cartesian axes.
Axes should also be labelled with the name of the variable and the corresponding units of the variable in brackets.
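The titling and labelling conventions above can be sketched in code. This class uses Excel, so the Python below is only an illustration: the helper names `scatter_labels` and `draw_scatter` are hypothetical, and the plotting part assumes the third-party matplotlib library is installed.

```python
def scatter_labels(y_name, y_units, x_name, x_units):
    """Build the title and axis labels in the form described in the notes."""
    title = f"Scatterplot of {y_name} against {x_name}"
    x_label = f"{x_name} ({x_units})"  # variable name, units in brackets
    y_label = f"{y_name} ({y_units})"
    return title, x_label, y_label


def draw_scatter(xs, ys, y_name, y_units, x_name, x_units, path):
    """Save a labelled scatterplot of ys against xs to an image file."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no screen needed
    import matplotlib.pyplot as plt

    title, x_label, y_label = scatter_labels(y_name, y_units, x_name, x_units)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)  # X on the horizontal axis, Y on the vertical axis
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    fig.savefig(path)
    plt.close(fig)
```

Calling `draw_scatter` with a list of X values, a list of Y values, the variable names and units, and a file name writes the chart to that file.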
#1 Make a picture
Scatterplots are a tool for representing the relationship between two quantitative variables (numbers with units).
X is the “independent” or “explanatory” or “predictor” variable
Y is the “dependent” or “response” variable
Before creating a scatterplot, it is best to decide which variable is responding to the other.
Note: in business, we usually have positive numbers and deal with the top right quadrant of the X, Y axes.
Figure: two charts titled “Scatterplot of Y against X”: one drawn on X, Y axes that include negative values, and one with axes labelled X (units) and Y (units).
We say that the Y variable is RESPONDING to values of the X
variable.

Here are two examples of scatterplots representing the
relationship between two quantitative variables.
Figure (illustration): two scatterplots. Linear model: scatterplot of exam score (%), Y, against hours of study, X. Non-linear model: scatterplot of exam score (%), Y, against cups of coffee before a test, X.
The first scatterplot:
X = Hours of study
Y = Exam score (%)
This scatterplot has points that seem to be following a straight line – the general form or shape of the
scatterplot is linear.
The points do not need to be exactly all on the same line – the general shape of a line is enough to
suggest that there is a linear relationship between these two quantitative variables of exam score and
hours of study.
Here, as the hours of study increase, the exam score tends to increase, following an upward
direction. This suggests the exam score responds positively to hours of study.
The second scatterplot:
X = Cups of coffee before a test
Y = Exam score (%)
This scatterplot has points that seem to be following a curve.
Patterns that are not straight lines are called non-linear.
At the lower end of cups of coffee, the exam scores are increasing, and peak at 5 cups of coffee.
However, after 5 cups of coffee, the exam scores start to decrease – suggesting that more than 5 cups
of coffee will have a negative (downward) effect on a test score.
Here, exam scores respond both positively and negatively to the number of cups of coffee.

The 1st scatterplot displays how exam scores, Y, RESPOND to hours of study, X, by the student.
LINEAR MODEL: where we can “visualise” a “linear pattern” between the Y and the X variable.
In a linear model, we have a CONSTANT “SLOPE” or CONSTANT “RATE OF CHANGE”.
The 2nd scatterplot displays how exam scores, Y, RESPOND to the number of cups of coffee before the exam, X.
A curved relationship is also called a NON-LINEAR relationship.
In a NON-LINEAR relationship, the slope or rate of change varies. In this chart, the slope is:
positive or upward
zero
negative or downward
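To make “constant slope” versus “varying slope” concrete, the sketch below computes the rate of change between neighbouring points for two small data sets. Both data sets are made up for illustration (they only loosely mimic the two charts), and the function name `successive_slopes` is hypothetical.

```python
def successive_slopes(points):
    """Rate of change between each pair of neighbouring (x, y) points."""
    return [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]


# Hypothetical (x, y) values, not the data behind the slides
linear = [(0, 40), (2, 50), (4, 60), (6, 70)]
curved = [(0, 50), (2, 70), (4, 80), (6, 80), (8, 70), (10, 50)]

print(successive_slopes(linear))  # the same slope every time
print(successive_slopes(curved))  # positive, then zero, then negative
```

Real data will not sit exactly on a line, so in practice the slopes of a linear pattern are only roughly constant.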
How do we create a scatterplot in EXCEL?
For the data used in this example, go to the EXCEL file named STAM4000 Week 10
Excel.xls and the sheet named “Rent”.
Note: If using EXCEL to create the scatterplot, check that the X and Y variables are
displayed as you would like.
When highlighting the columns, it is best to highlight the X variable first and the Y
variable second.
For the first scatterplot:
Highlight the columns labelled “Distance” and “Rent”.
Go to the INSERT tab.
Select the Scatterplot icon.
Highlight the chart.
Use the “+” button to add/remove items of the scatterplot.
For the second scatterplot:
Highlight the columns labelled “Bedrooms” and “Rent”.
Go to the INSERT tab.
Select the Scatterplot icon.
Highlight the chart.
Use the “+” button to add/remove items of the scatterplot.
#1 Example
A random sample of thirty rental properties was collected and values for the following
variables were recorded: weekly rent (\$/wk), distance from the city centre (km), number of
bedrooms, number of bathrooms and age of the property (years).
EXCEL was used to create the following scatterplots.
How could we describe the relationship between rent and distance?
How could we describe the relationship between weekly rent and number of bedrooms?

Figure: two scatterplots. Scatterplot of rent (\$/wk), Y, against distance (km), X. Scatterplot of rent (\$/wk), Y, against bedrooms, X.

Before creating a scatterplot, it is best to decide which variable is responding to
the other.
If we want to create a scatterplot of rent against distance:
which variable do we believe is RESPONDING to the other?
In the 1st scatterplot here, rent is RESPONDING to distance from the city centre.
Describing a scatterplot
We assess the scatterplot, discussing the following:
Direction: what is the general direction of the scatterplot?
Positive also called upward sloping
Negative also called downward sloping
Form: what is the general shape of the scatterplot?
Do the points form a linear shape?
Do the points form a curved shape?
Strength: how tightly clustered are the points giving that form?
Strong: the points are tightly clustered to create that form
Moderate: the points are reasonably clustered to create that form
Weak: the points are very loosely clustered to create that form
Unusual features: is there anything unusual about the scatterplot?
Are there any outliers (unusual points)?
Are there any irregularities with the shape?
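When the form looks linear, direction and strength can also be summarised with a number. The sketch below computes the correlation coefficient r and turns it into a direction and a strength label. The cut-offs of 0.7 and 0.4 are assumptions chosen only for illustration, not rules from this class, and the function names are hypothetical. Form and unusual features still need to be judged from the scatterplot itself.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient r between two lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def describe(xs, ys, strong=0.7, moderate=0.4):
    """Return (direction, strength, r); the cut-offs are illustrative assumptions."""
    r = pearson_r(xs, ys)
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    size = abs(r)
    if size >= strong:
        strength = "strong"
    elif size >= moderate:
        strength = "moderate"
    else:
        strength = "weak"
    return direction, strength, r


# Hypothetical data: Y falls steadily as X rises
print(describe([1, 2, 3, 4], [8, 6, 4, 2]))
```

A single outlier can change r noticeably, which is one reason to look at the scatterplot before relying on the number.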
#1 Example
#2 Differentiate between correlation and causation