Data visualization and descriptive statistics

STAM4000
Quantitative Methods
Week 2
Data visualization and
descriptive statistics
https://twitter.com/thesmartjokes/status/681927905073606656
Kaplan Business School (KBS), Australia 1
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan
Business School pursuant to Part VB of the
Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further
reproduction or communication of this material by you may be the subject of copyright
protection under the Act.
Do not remove this notice.
Week 2
Data visualisation and descriptive statistics
Learning Outcomes
#1 Create data visualisations
#2 Distinguish between measures of central tendency
#3 Distinguish between measures of dispersion
#1 Create data visualisations
Are you more of a VISUAL thinker or a VERBAL thinker?
Count how many times you reply “yes” to the following quiz questions:
1. Do you like to draw diagrams when explaining something?
2. When using Google Maps for directions, do you prefer to watch the map and mute the audio?
3. When meeting new people, do you find it easier to remember faces, instead of names?
4. Do you use a mind map diagram with links and words to organize and remember things?
Why does this matter? A picture tells a thousand words …
Are you more of a visual thinker or a verbal thinker?
Visual thinkers are those who learn better through pictures and diagrams.
Verbal thinkers are those who learn better through words.
People are usually a combination of both, but some lean towards one style of thinking.
This short quiz on visual versus verbal learning is a brief illustration of the difference between the two types of thinking.
How many times did you reply “yes” in this quiz?
4 or 3: For these questions, you are a strong visual thinker and prefer pictures to words.
2: For these questions, you think visually and verbally in equal measure.
1 or 0: For these questions, you are a strong verbal thinker and prefer words to pictures.
This quiz was adapted from the following online sources:
1. https://mindmappingsoftwareblog.com/how-much-of-a-visual-thinker-are-you/.
2. https://www.quotev.com/quiz/7206012/Are-you-A-Visual-Thinker
#1 Examples of visualisations
Charts for categorical data:
o Pie chart
o Bar chart and Pareto chart
Charts for quantitative data:
o Pie chart
o Histogram
o Frequency polygon
o Frequency curve
o Stem and leaf plot
o Ogive
o Box plot
Categorical (qualitative) data: names or labels for NOMINAL data; rankings for ORDINAL data.
Quantitative data: numbers with units or “units of measure”, e.g., $, $000, cm, kg, grams, people per household.
#1 Charts for categorical data
Pie chart:
Label segments or use a legend.
Check segment sizes.
Check segment values.
Check categories are mutually exclusive and collectively exhaustive.
Check total value of pie chart:
o If frequencies, check the total equals the sample size.
o If relative frequencies, check the total equals 1 or 100%.
o Note: If the pie chart is for quantitative data and displays numerical values, check the total equals the sum of the values.
Relative frequency: a count RELATIVE TO THE TOTAL COUNT, expressed as a %, fraction or decimal.
Pie chart for top 10 ASX companies in Australia (%)
Category                   Relative frequency
Biotechnology              10%
Capital Markets            10%
Diversified Banks          40%
Grocery Stores             10%
Home Improvement Retail    10%
Metals & Mining            20%
Data for this example may be found in the file named STAM4000 Data for Week 2.xls
Create a pie chart respecting these rules:
Place categories in segments of the pie chart or use a legend, and check that the pie chart covers all categories.
Check the size of each segment: the segments of the pie chart must be in proportion to the values for that segment relative to the total value.
If the pie chart displays relative frequencies, check that the sum is 100% or 1.
If the pie chart displays frequencies, check that the sum is the sample size (the total of all frequencies).
Note: If the pie chart is for quantitative data and displays numerical values, check that the total equals the sum of the values.
Advantage of using a pie chart: Common and easy to create manually or in EXCEL.
Disadvantages of using a pie chart:
Cannot be used to compare two or more variables.
Can easily mislead or be difficult to read when pie chart effects are used, e.g. 3-dimensional pie charts.
https://chartio.com/learn/charts/pie-chart-complete-guide/
Note: To create a sorted pie chart in Excel, you must sort the data within the Excel spreadsheet first.
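The pie chart checks above can also be done in a few lines of code. The following is a minimal Python sketch, not part of the EXCEL workflow in these slides. The percentages are read from the pie chart example; the counts per category are an assumption (10 companies, so each 10% corresponds to one company), and the function names are made up for illustration.

```python
# Percentages read from the pie chart example (relative frequencies in %).
shares = {
    "Biotechnology": 10,
    "Capital Markets": 10,
    "Diversified Banks": 40,
    "Grocery Stores": 10,
    "Home Improvement Retail": 10,
    "Metals & Mining": 20,
}


def relative_frequencies(counts):
    """Convert counts per category into percentages of the total count."""
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}


def is_valid_pie(percentages, tolerance=1e-9):
    """A pie chart of relative frequencies must total 100%."""
    return abs(sum(percentages.values()) - 100) <= tolerance


# ASSUMPTION: 10 companies in total, so each 10% slice is one company.
assumed_counts = {category: pct // 10 for category, pct in shares.items()}

print(is_valid_pie(shares))
print(relative_frequencies(assumed_counts))
```

If the frequencies are used instead of the percentages, the equivalent check is that the counts add up to the sample size.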
#1 More charts for categorical data
Bar chart:
o One bar per category
o Bar height reflects frequency or RELATIVE FREQUENCY (%)
o Equal bar width
o Gaps between bars BECAUSE the data are CATEGORICAL
Pareto bar chart:
o Sorted bars in ascending or descending order
[Figure: Bar chart of top 10 companies from Australia (%); vertical axis from 0% to 45%]
Data for this example may be found in the file named STAM4000 Data for Week 2.xls
Bar chart for one categorical variable:
One bar (rectangle) per category
Height of bar is the frequency or relative frequency per category
Bar width should be equal
Gaps between bars (necessary) indicate category order is not fixed
Pareto chart: special type of bar chart with sorted bars, usually in descending height order. Gaps are still needed.
Bar charts for two or more categorical variables
Side by side bar chart
Stacked bar chart
To create a bar chart in Excel:
Highlight the column of categories together with the frequency or relative frequency column.
Go to the insert tab.
Select the bar chart icon.
To create a Pareto bar chart in Excel:
Using an existing bar chart, copy the bar chart, click on any bar.
Right click, select Change Series Chart Type.
Scroll to Histogram, select Pareto.
The line in a Pareto chart created by Excel is cumulative (a running total).
This can be removed if necessary, by clicking on the cumulative line and changing the outline colour to white.
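To see what the Pareto chart is doing behind the scenes, here is a short Python sketch that sorts the categories in descending order and builds the cumulative (running total) line. It assumes the bar chart uses the same percentages as the pie chart example; the function name pareto_table is made up for illustration.

```python
def pareto_table(percentages):
    """Sort categories by descending percentage and add a running total."""
    ordered = sorted(percentages.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running_total = 0
    for category, pct in ordered:
        running_total += pct
        rows.append((category, pct, running_total))
    return rows


# ASSUMPTION: same percentages as the pie chart example.
shares = {
    "Biotechnology": 10,
    "Capital Markets": 10,
    "Diversified Banks": 40,
    "Grocery Stores": 10,
    "Home Improvement Retail": 10,
    "Metals & Mining": 20,
}

rows = pareto_table(shares)
for category, pct, cumulative in rows:
    print(f"{category:<25}{pct:>4}%{cumulative:>6}%")
```

The last cumulative value should always be 100% when relative frequencies are used, which is a quick check that no category has been left out.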
#1 Histogram: common chart for quantitative data (numbers with units)
o One bar per class
o Bar height may reflect frequency or relative frequency
o Equal class widths
o NO gaps between bars. Why? The bars follow a number scale.
Description of histogram:
o General shape: symmetric (evenly balanced) OR skewed (tail on either end)
o Peaks: number and position = MODE
o Unusual features: gaps, multiple peaks, no peak etc.
Example: The call centre of an electricity provider has received a number of complaints from customers that the call wait time is too long. The manager of the call centre claims that most wait times are 15 minutes or less. To investigate the complaints, a consumer group telephoned the electricity provider 25 times and recorded the call wait times. This histogram displays the data collected by the consumer group.
Histogram of call wait times (created in EXCEL)
Class (minutes)    Frequency
(0, 5]             1
(5, 10]            2
(10, 15]           0
(15, 20]           3
(20, 25]           7
(25, 30]           12
Data for this example may be found in the file named STAM4000 Data for Week 2.xls
Example, call waiting times raw data
25.5 23.5 24.3 26.5 28.2
19.7 28.5 25.5 28.5 26.5
7.9 3.2 27.9 26.0 23.8
23.9 24.6 23.3 28.2 17.6
15.5 26.6 22.5 6.5 28.3
Charts for quantitative data
Create a histogram
Histogram for one quantitative variable:
One bar (rectangle) per class
Height of bar is the frequency or relative frequency per class
Width of classes should be equal
NO gaps between bars, unless there is a gap in the data. Why? Numerical data should follow a number scale.
Are the customers’ complaints justified?
Description of histogram
General shape: There is a tail on the left of the histogram, with most of the call wait times on the right; most callers in this sample waited longer than 15 minutes.
Number and position of peaks: This histogram has one peak in the (25, 30] class with a frequency of 12.
Unusual features: Gap in the histogram at the (10, 15] class.
Based on this sample, the customers may be correct: the call wait times seem excessive, with waits of up to almost 30 minutes.
Most of the call wait times (22 of 25, or 88%) were greater than 15 minutes, which contradicts the manager’s claim that most wait times are 15 minutes or less.
How to create this histogram in EXCEL:
When creating the frequency table, you could have selected Chart output.
Otherwise, highlight the first two columns of the frequency table, go to the INSERT tab and select the 2-D column chart.
Note, remove gap width from histogram:
When you get your histogram, click on the rectangles, right click for the drop-down menu, select Format Data Series, go to Gap Width, and decrease this value to 0%.
We must do this as there should be no gaps between the rectangles in a histogram (unless there are gaps in the data), as we are plotting numerical data, following a number line.
Click on the histogram, then select the CHART DESIGN tab etc., similarly to the charts covered earlier in these slides.
Make changes to the chart: Click on the histogram and use the + or brush symbol. Go to the menu bar for Chart Design or Format.
Right click on the chart for other changes, e.g., format chart area to outline the bars.
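As a cross-check on the EXCEL output, the frequency table behind the histogram can be rebuilt from the raw call wait times. This is a minimal Python sketch, offered as an alternative illustration rather than the method used in these slides. The classes are right-closed, written (low, high], as in the histogram; the function name frequency_table is made up for illustration.

```python
# Raw call wait times (minutes) from the example.
wait_times = [
    25.5, 23.5, 24.3, 26.5, 28.2,
    19.7, 28.5, 25.5, 28.5, 26.5,
    7.9, 3.2, 27.9, 26.0, 23.8,
    23.9, 24.6, 23.3, 28.2, 17.6,
    15.5, 26.6, 22.5, 6.5, 28.3,
]


def frequency_table(values, class_limits):
    """Count how many values fall in each right-closed class (low, high]."""
    table = {}
    for low, high in zip(class_limits, class_limits[1:]):
        table[(low, high)] = sum(1 for v in values if low < v <= high)
    return table


class_limits = [0, 5, 10, 15, 20, 25, 30]
table = frequency_table(wait_times, class_limits)

# Text version of the histogram: one row per class, no gaps in the number scale.
for (low, high), freq in table.items():
    print(f"({low}, {high}]  {freq:>2}  {'#' * freq}")

# Share of calls that waited longer than the manager's 15 minutes.
over_15 = sum(1 for v in wait_times if v > 15)
share_over_15 = over_15 / len(wait_times)
print(f"{over_15} of {len(wait_times)} calls waited more than 15 minutes")
```

The empty (10, 15] class shows up as a row with no bar, which is the gap described under unusual features.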
Example: This histogram is a visualisation of the frequencies for the table in the last learning outcome example of the Week 1 lectures.
GENERAL SHAPE: The distribution of call wait times is SKEWED (later we will see that this is skewed to the LEFT, as the tail is on the LEFT).
PEAK: UNIMODAL, at the (25, 30] class with the highest frequency of 12.
UNUSUAL FEATURES: Gap at the (10, 15] class; majority of call wait times greater than 15 minutes.
#1 Understanding the importance of shape
[Figure: example histogram shapes, with each mode marked Mo]
o Unimodal and symmetric: one mode
o Positively skewed (or skewed to the right): one mode
o Negatively skewed (or skewed to the left): one mode
o Bimodal: two modes
o Multimodal: three or more modes
o Uniform: no clear mode
How would you describe the shape of the histogram for the call waiting example?
Understanding the importance of shape.
We learned how to create different visualisations for categorical and quantitative data.
In the visualisations for categorical data, we used bar charts. The general shape of the bar chart can vary depending on the ordering of the categories.
Note that the general shape of a bar chart cannot be analysed, as any shape can be created by rearranging the order of the categories.
In the visualisations for quantitative data, we used histograms, as a chart of a frequency table. The classes are numerical, and as such must follow a number line. Hence the order of classes in a histogram is fixed, so the general shape of a histogram can be analysed. The classes themselves can be changed by changing the class width, but the histogram must still follow a number line.
For quantitative data of one variable, we can use EXCEL to create a histogram.
We describe the SHAPE of a histogram by discussing the number of peaks (modes), the symmetry or skew of the histogram and any unusual features such
as extreme values or gaps in the histogram.
There are three general shapes of histograms:
i) Symmetric: each half of the histogram is a mirror image of the other half.
If you can fold the histogram along an imaginary vertical line through the middle, and each half is the same, then the histogram is symmetric.
A unimodal and symmetric histogram is a special type of symmetric histogram that may be described as bell-shaped or mound shaped.
ii) Negatively skewed or skewed to the left: a few values on the lower end of the scale drag a tail to the left of the histogram.
iii) Positively skewed or skewed to the right: a few values on the higher end of the scale pull a tail to the right of the histogram.
How about the number of modes?
Unimodal: one mode.
Bimodal: two modes; may indicate a lurking variable. E.g., a bimodal histogram of annual salaries can suggest a lurking variable such as gender or years of education.
Unimodal: one peak, or one highest bar. The PEAK has the HIGHEST FREQUENCY or relative frequency.
Symmetric: each half is a mirror image of the other half.
#1 How do we describe a data set? We use descriptive statistics.
Shape
o For a histogram or frequency curve: is there a single peak or several peaks?
o Is it symmetrical or skewed?
o A frequency curve is a SMOOTH histogram.
Centre
o If you had to pick a single number to describe all the data, what would you choose?
Spread
o Since statistics is about variation, how dispersed is our data?
Unusual features
o Are there any gaps in the data set?
o Is there more than one mode? If so, is there a lurking variable? E.g., for annual income, a lurking or hidden variable of gender.
#1 Compare datasets with visualizations
Example: These histograms compare the daily volume (number) of shares traded on the New York Stock Exchange (NYSE) in one year, divided into January to June and July to December.
Histograms are OK for comparing two groups; box and whisker plots (or boxplots) are better when comparing several groups. See the next slide.
If we want to compare each month’s trading, we should use boxplots.
Source: Sharpe, De Veaux & Velleman, “Business Statistics”, Pearson International Edition, 2010, page 125.
© 2010 Pearson Education
#1 Example
This chart of box and whisker plots compares the daily volume (number) of shares traded by month on the New York Stock Exchange (NYSE) in one year. The months follow a calendar year and are denoted by numbers, e.g., 1 = January.
Source: Sharpe, De Veaux & Velleman, “Business Statistics”, Pearson International Edition, 2010, page 126.
Box plot:
The BOX starts at the first quartile (25th percentile) and ends at the third quartile (75th percentile).
The LINE inside the box IS THE MEDIAN or the MIDDLE VALUE.
The lines either side of the box are called WHISKERS.
The ENDPOINTS of the WHISKERS are called the FENCES.
Any dots or stars are EXTREME values or OUTLIERS.
#1 Example continued
From this visualization, we can ascertain the following:
o March had the least variation overall; June and December had the greatest variation overall.
o May and November had the highest median volume traded; August had the lowest median.
o March had the smallest interquartile range; December had the largest interquartile range.
o March, May, June, July, September and November each had trading days with extreme values.
o All months had skewed distributions.
The middle value is the 50th percentile = MEDIAN. This tells us that 50% of the values are LESS THAN THE MEDIAN and 50% of the values are GREATER THAN THE MEDIAN.
Interquartile range = width of the box = Q3 – Q1 = range of the middle half of the (SORTED) data set.
When comparing boxplots, we can comment on the following:
o The range, from minimum to maximum.
o The median, the vertical bar inside the box of each boxplot.
o The interquartile range, displayed by the width of the box.
o Unusual features such as extreme values; here a circle marks a value more than 1.5 IQRs from the relevant quartile and a star marks a value more than 3 IQRs from the relevant quartile.
o The overall shape of each distribution.
#1 Box and whisker plot (boxplot)
Displays a five-number summary:
o minimum
o Q1 = 1st quartile = 25th percentile
o median, Q2 = 2nd quartile = 50th percentile
o Q3 = 3rd quartile = 75th percentile
o maximum
Median shown as a line inside the box.
Length of box displays interquartile range = IQR = Q3 – Q1.
Whiskers show data values considered usual.
Shapes, e.g., dot or asterisk, represent unusual data values (outliers):
o dot to represent values more than 1.5 IQR from the nearest quartile
o asterisk to represent values more than 3 IQR from the nearest quartile
https://twitter.com/statsols/status/929006600664354816
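To make the five-number summary concrete, here is a short Python sketch that applies it to the call wait times from the histogram example (not to the NYSE data, which is not reproduced in these slides). It uses the "inclusive" quartile method of Python's statistics module; this is an assumption, as software packages use slightly different quartile conventions, so values from EXCEL or other tools may differ a little. The function names are made up for illustration.

```python
import statistics

# Raw call wait times (minutes) from the histogram example.
wait_times = [
    25.5, 23.5, 24.3, 26.5, 28.2,
    19.7, 28.5, 25.5, 28.5, 26.5,
    7.9, 3.2, 27.9, 26.0, 23.8,
    23.9, 24.6, 23.3, 28.2, 17.6,
    15.5, 26.6, 22.5, 6.5, 28.3,
]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # ASSUMPTION: 'inclusive' is one of several quartile conventions.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def unusual_values(values, multiplier=1.5):
    """Values more than `multiplier` IQRs below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_limit = q1 - multiplier * iqr
    high_limit = q3 + multiplier * iqr
    return sorted(v for v in values if v < low_limit or v > high_limit)


minimum, q1, median, q3, maximum = five_number_summary(wait_times)
print("min, Q1, median, Q3, max:", minimum, q1, median, q3, maximum)
print("IQR = Q3 - Q1:", round(q3 - q1, 2))
print("beyond 1.5 IQR:", unusual_values(wait_times, 1.5))
print("beyond 3 IQR:", unusual_values(wait_times, 3))
```

All the unusual values sit on the low side, which agrees with the left tail seen in the histogram of the same data.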
#1 Boxplot
[Figure: boxplot running from MIN to MAX, spanning 100% of the data; the median splits the data into 50% below and 50% above]
https://lsc.deployopex.com/box-plot-with-jmp/
#1 General shapes of frequency curves and boxplots
o Negatively skewed
o Unimodal and symmetric
o Positively skewed
#2 Distinguish between measures of central tendency
http://methods.sagepub.com/book/testing-and-measurement/n4.xml
Population parameters and sample statistics
Population parameters
o Measurements based on the entire data set.
Sample statistics
o Measurements based on a sample of data.
Notation
o Greek letters for population parameters.
o English letters for sample statistics.
https://www.causeweb.org/cause/resources/fun/cartoons/parameter-notation
Population parameters and sample statistics
Measurements based on the entire population are called population parameters. These tend to be fixed, unless a major change alters the population of data.
E.g., the number of visitors to Australia decreased drastically due to COVID.
Measurements based on a sample of data are called sample statistics. These vary from sample to sample.
We usually cannot access an entire population of data, so we rely on sampling methods to generate a random sample to
represent the population.
Descriptive statistics are then calculated for the sample of data to do just that – describe that sample only.
To differentiate between population parameters and sample statistics, statisticians use Greek letters to represent
population parameters and English letters to represent sample statistics.
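The difference between a fixed population parameter and a varying sample statistic can be seen in a small simulation. This Python sketch uses a made-up population of 1,000 values purely for illustration; the population, the sample size of 30 and the number of samples are all assumptions, not data from this subject. The names mu and x-bar follow the Greek versus English letter convention above.

```python
import random
import statistics

# HYPOTHETICAL population: 1,000 made-up values, for illustration only.
rng = random.Random(2)
population = [rng.uniform(0, 100) for _ in range(1000)]

# Population parameter: uses every value, so it is one fixed number.
mu = statistics.mean(population)

# Sample statistics: each random sample of n = 30 gives its own sample mean.
sample_means = [statistics.mean(rng.sample(population, 30)) for _ in range(5)]

print(f"population mean (mu): {mu:.2f}")
for i, x_bar in enumerate(sample_means, start=1):
    print(f"sample {i} mean (x-bar): {x_bar:.2f}")
```

Running this shows one value for the population mean, while the five sample means differ from each other and from the population mean.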
#2 What is the typical value for a data set?
https://nebusresearch.wordpress.com/tag/statistics/
Distinguish between measures of central tendency.
What is a typical value of the data set?
When we think of a typical value, we look for the centre of the distribution.
Measures of central tendency (or central location) yield information about the centre of a dataset.
Three common measures of central tendency are the mean, median and the mode.
The mode is the most common value of the data set.
The median is the middle value of a sorted data set.
The mean is the average value of the data set.
#2 Mode, Mo (same symbol for a parameter and a statistic)
Modal value: the most frequently occurring value.
Modal class: the class(es) with the highest frequency, or tallest peak(s), in a bar chart or histogram.
Advantage of the mode:
o It can be found for both categorical and quantitative data.
Disadvantages of the mode:
o Its use is limited to descriptive statistics.
o It does not use all the values in a data set.
What is the mode? The mode is usually the most frequently occurring value. Mo is the symbol for the mode, whether we
have a sample of data or a population of data.
In a bar chart or a histogram, we can also have a modal class which is that class that has the rectangle(s) with the highest
peak(s).
How do we find the mode manually?
Sort the data first. Find the value or category that has the highest frequency. If two or more values/categories share the same highest frequency, there are two or more modes.
How do we find the mode in EXCEL?
Go to the DATA tab, then select DATA ANALYSIS
Select Descriptive Statistics
Input Range: select the data values with the title
Labels in first row: tick this
Output range: click in here and select a cell in the spreadsheet that you want the output
Summary statistics: tick this
Click OK
Note: If there is no mode, the EXCEL Summary statistics output will show Mode: #N/A (no value available).
If there is more than one mode, EXCEL will display the first mode only.
Real world application of the mode:
Moda in Italian translates to fashion: the most popular style of the day for expressing oneself in clothing, shoes etc.
The fashion industry uses the mode when it manufactures the most common sizes of clothing (small, medium, large and extra large): the modal sizes.
#2 Number of modes
A dataset with one mode is unimodal.
E.g., a sample of latte prices ($): 5, 3, 6, 5, 4, 6, 5
Mo = $5
A dataset with two modes is bimodal.
E.g., a sample of espresso prices ($): 4, 5, 6, 3, 6, 5, 6, 4, 5
Mo = $5 and $6
A dataset with three or more modes is multimodal.
E.g., a sample of ice-coffee prices ($): 5, 8, 7, 6, 5, 9, 7, 6
Mo = $5, $6 and $7
A dataset with no mode is uniform.
E.g., a sample of cappuccino prices ($): 5, 3, 4, 6
No mode
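The four coffee price examples can be checked with a small Python sketch of the manual method: count how often each value occurs and keep the value(s) with the highest frequency. The rule that a data set has "no mode" when every value occurs only once follows the cappuccino example; treating only that case as "no mode" is an assumption of this sketch, and the function name modes is made up for illustration.

```python
from collections import Counter


def modes(values):
    """Return all modes in ascending order; an empty list means 'no mode'."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # ASSUMPTION: every value occurs once, so we report no mode.
        return []
    return sorted(value for value, count in counts.items() if count == highest)


samples = {
    "latte": [5, 3, 6, 5, 4, 6, 5],
    "espresso": [4, 5, 6, 3, 6, 5, 6, 4, 5],
    "ice-coffee": [5, 8, 7, 6, 5, 9, 7, 6],
    "cappuccino": [5, 3, 4, 6],
}

for drink, prices in samples.items():
    found = modes(prices)
    label = ", ".join(f"${m}" for m in found) if found else "no mode"
    print(f"{drink}: {label}")
```

Unlike the EXCEL Summary statistics output, which displays only the first mode, this sketch lists every mode.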
#2 Median, Me (same symbol for a parameter and a statistic)
Median: the middle value (midpoint) in an ordered set of numbers. SORT FIRST.
Advantage of the median:
o It is not influenced by extreme high or by extreme low values. Hence, when we have a skewed data set, the median is usually the best measure of central tendency.
Disadvantages of the median:
o It does not use all the values in a data set.
o Only used in descriptive statistics.
o It is tedious to calculate manually.
o Cannot find the median for categorical data.
What is the median?
A set of numbers arranged from the smallest to the largest is called an ordered array. The median is the middle value (midpoint) in an ordered set of numbers, or the middle value in an ordered array. As the median is the middle value in a SORTED data set, it can only be found for quantitative data. Me is the symbol for the median, whether we have a sample of data or a population of data, where the units of the median are the same as the units of the data. (Later, we will look at quartiles, where the median is the second quartile or the 50th percentile.)
The median is a measure of central tendency by virtue of position. Half the values are above the median and half the values are below the median.
How to find the median manually:
Sort the data in ascending order from the minimum to the maximum.
Count in from the ends of the sorted data until the middle value is reached.
i) If there is an odd number of values, the median is the middle value.
ii) If there is an even number of values, the median is the average of the 2 middle values.
Include units with the median.
How do we find the median in EXCEL?
Go to the DATA tab, then select DATA ANALYSIS
Select Descriptive Statistics
Input Range: select the data values with the title
Labels in first row: tick this
Output range: click in here and select a cell in the spreadsheet that you want the output
Summary statistics: tick this
Click OK
Real world application of the median:
The word median, as used in statistics, is thought to have originated in 1883 to mean “middle number of a series”.
The French statistician Antoine Cournot, in 1843, is recorded as the first to use the term median (valeur médiane) for the value that separates a distribution into two equal halves (https://en.wikipedia.org/wiki/Median). Nowadays, the median is used extensively as a measure of central tendency for datasets that have extreme values.
For example, in the business of real estate, the median house selling price is quoted as the typical value, reflective of house selling prices for a suburb.
Another example is census data collated by the Australian Bureau of Statistics (ABS), which quotes the median for variables that would have skewed histograms, such as annual household income.
#2 Median, Me
HINT: Sort data first. n = sample size; N = population size.
If n is ODD, the median is the middle value in a sorted dataset.
E.g., sample of customer sales ($): 8, 12, 4, 10, 7
Sorted: 4, 7, 8, 10, 12
n = 5, odd
Median = $8, which IS an actual data value.
If n is EVEN, the median is the average of the two middle values in a sorted dataset.
E.g., sample of customer sales ($): 8, 12, 4, 10, 7, 13
Sorted: 4, 7, 8, 10, 12, 13
n = 6, even
Median = (8 + 10)/2 = $9, which is NOT an actual data value.
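The odd and even rules can be written as a short Python sketch of the manual method, using the two customer sales samples from this slide. It assumes a non-empty list of numbers; the function name median is made up for illustration (Python's statistics module offers a built-in equivalent).

```python
def median(values):
    """Median by the manual method: sort first, then take the middle."""
    ordered = sorted(values)          # SORT FIRST
    n = len(ordered)                  # n = sample size (assumed to be > 0)
    middle = n // 2
    if n % 2 == 1:
        # n is odd: the median is the single middle value.
        return ordered[middle]
    # n is even: the median is the average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_sample = [8, 12, 4, 10, 7]        # customer sales ($), n = 5
even_sample = [8, 12, 4, 10, 7, 13]   # customer sales ($), n = 6

print("odd n: median = $", median(odd_sample))
print("even n: median = $", median(even_sample))
```

Note that the function sorts a copy of the data, so the original order of the sample is left unchanged.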
#2 Mean, μ or