SIT741 Assignment 1

Due: 23 August 2019
Assignment 1 contributes 25% of your final SIT741 mark. The full mark is 25. It must be completed
individually and submitted to CloudDeakin before the due date: 11 pm, 23/08/2019 (Week 6 Friday).
Learning goals
In this assignment, you will work on a real-world problem to consolidate your learning from the first five
weeks, including organising your data as tidy data and performing simple statistical analyses. This activity
also serves as scaffolding for the upcoming Assignment 2.
Please start early so that you can identify any skill or knowledge gaps and seek support from the teaching
staff and other students.
Background
In Australia, we have experienced extreme heat in the year 2019. With the inevitable rise of extreme
weather events, it is crucial that we better understand their potential impact on our everyday life.
In November 2016, a storm in Victoria triggered an unexpected surge of emergency department visits
at the local public hospitals. Some consequences of this weather event were captured in this news
article:
http://bit.ly/2gC8j6U
Apart from such storms, various weather events may affect the demand for care at our emergency
departments (EDs). In SIT741, you will use publicly available data to understand the relationship
between weather patterns and ED demands. Your analysis could provide crucial knowledge for
resource planning at our health care systems.
Assignment 1 will focus on the analysis of ED demand data.
Task 1: Obtaining ED demand data (4 points)
First, let’s find data measuring ED demands. We will use the emergency department admissions and
attendances data set provided by the Department of Health of Western Australia:
http://data.gov.au/dataset/emergency-department-admissisons-and-attendances
Task 1.1 Download the data set using the link below.
http://bit.ly/2nkCUEh
Task 1.2 Answer the following questions:
How many rows and columns are in the data?
How many hospitals are in the data?
What data types are in the data?
What time period does the data cover?
What’s the difference between “Attendance” and “Admissions”?
What do the variables Tri_1, Tri_2, … represent?
Hint: You may need to consult the relevant background document, for example, the government
webpage here: https://ww2.health.wa.gov.au/About-us/Policy-frameworks/InformationManagement/Mandatory-requirements/Emergency-Department-and-Emergency-Services-PatientLevel-Data-Collection-and-Reporting .
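The questions above can mostly be answered with a few inspection calls once the data is loaded. The following is a minimal sketch on a toy data frame, not on the real file; the column names Date, hospital and Attendance and all values are hypothetical stand-ins.

```r
library(dplyr)

# Toy stand-in for the real data; names and values are hypothetical.
toy <- data.frame(
  Date       = as.Date(c("2013-07-01", "2013-07-02", "2013-07-03")),
  hospital   = c("A", "B", "A"),
  Attendance = c(120L, 95L, 130L),
  stringsAsFactors = FALSE
)

dim(toy)                   # number of rows and columns
sapply(toy, class)         # data type of each column
n_distinct(toy$hospital)   # number of distinct hospitals
range(toy$Date)            # first and last date covered
```

Note that in the raw file the hospitals may not be stored in a single column, so counting them may require looking at the heading rows first.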
Task 2: Tidy data (5 points)
Task 2.1 Cleaning up columns
You may notice that the ED csv file has two heading rows. This is quite common in data generated by
BI reporting tools. Let’s clean up the column names.
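library(tidyverse)  # provides read_csv(), %>% and str_c()

ed_data_link <- 'govhack3.csv'
# The file has two heading rows: read each one separately.
top_row <- read_csv(ed_data_link, col_names = FALSE, n_max = 1)
second_row <- read_csv(ed_data_link, n_max = 1)
# Build unique column names from the second heading row.
column_names <- second_row %>%
  unlist(., use.names = FALSE) %>%
  make.unique(., sep = "__") # double underscore
# Columns 2 to 8 are first occurrences, which make.unique() leaves
# unsuffixed; give them the suffix 0.
column_names[2:8] <- str_c(column_names[2:8], '0', sep = '__')
daily_attendance <-
  read_csv(ed_data_link, skip = 2, col_names = column_names)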
Now print out a list of healthcare facilities (hospitals) in the data set.
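To see why the code above appends a 0 to the first block of names, it may help to run the same renaming on a small example. This is a sketch on hypothetical heading rows with two hospitals and two measures each; the assumption that hospital names appear in the top row with empty cells in between is a guess about the file layout, not something confirmed here.

```r
library(stringr)

# Hypothetical heading rows: hospital names on top, repeated measures below.
top_row    <- c(NA, "Hospital A", NA, "Hospital B", NA)
second_row <- c("Date", "Attendance", "Admissions", "Attendance", "Admissions")

# make.unique() leaves the first occurrence alone and numbers later ones from 1.
column_names <- make.unique(second_row, sep = "__")

# Tag the first block with 0 so every measure column carries a numeric suffix.
column_names[2:3] <- str_c(column_names[2:3], "0", sep = "__")
column_names

# Non-empty cells of the top row give the facility names (assumed layout).
hospitals <- top_row[!is.na(top_row)]
hospitals
```

With a uniform suffix on every measure column, the hospital index can later be split off the column name in one step.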
Task 2.2 Tidying data
1. Now we have a data frame. Answer the following questions for this data frame.
Does each variable have its own column?
Does each observation have its own row?
Does each value have its own cell?
2. Use spreading and/or gathering to transform the data frame into tidy data. The key is to put data
from the same measurement source in a column and to put each observation in a row. Please
answer the following questions.
How many spreading operations do you need?
How many gathering operations do you need?
Explain the steps.
3. Do the variables have the expected types in R? Clean up the data types.
4. Are there any missing values? Fix the missing data. Justify your actions.
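One possible shape of the gather and spread steps is sketched below on a toy wide table. The column names follow the measure__index pattern produced in Task 2.1, but the data, the name hospital_id and the choice of one gather followed by one spread are illustrative assumptions rather than the required answer.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one column per measure and hospital index (hypothetical values).
wide <- data.frame(
  Date          = as.Date(c("2013-07-01", "2013-07-02")),
  Attendance__0 = c(120, 125),
  Admissions__0 = c(30, 28),
  Attendance__1 = c(95, NA),
  Admissions__1 = c(20, 22)
)

tidy <- wide %>%
  # Stack all measure columns into key/value pairs.
  gather(key = "measure_hospital", value = "value", -Date) %>%
  # Split "Attendance__0" into the measure and the hospital index.
  separate(measure_hospital, into = c("measure", "hospital_id"),
           sep = "__", convert = TRUE) %>%
  # Put each measure back into its own column.
  spread(key = measure, value = value)

tidy
colSums(is.na(tidy))   # count of missing values per column
```

Each row of the result is one hospital on one day. How the missing value should be handled is left open here, since that choice needs to be justified from the data.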
Task 3: Exploratory Data Analysis (5 points)
It is often a good idea to eyeball your data before fitting a model. The purpose is to understand the
distribution of different measurements and their relations.
Task 3.1 Select a hospital
Select a hospital and create a data set for only that hospital. Print out the hospital’s name, the total
number of ED attendances and the total number of admissions.
Task 3.2 For the hospital selected, if we want to compare the volume of ED
demands across the year, which plot can we use? Show your plot and explain
what the plot shows. (Hint: Which variables measure the ED demands?)
Task 3.3 How do the ED demands change during a week? Show it visually.
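One way to look at a weekly pattern is to derive the day of the week from the date and compare the distributions side by side. The sketch below uses simulated counts, so the column names Date and Attendance and the numbers themselves are assumptions; a box plot is only one of several reasonable plot choices.

```r
library(dplyr)
library(ggplot2)

set.seed(741)
# Simulated daily counts for four full weeks (hypothetical data).
daily <- data.frame(
  Date       = seq(as.Date("2013-07-01"), by = "day", length.out = 28),
  Attendance = rpois(28, lambda = 100)
)

# 2013-07-01 is a Monday, so this gives weekday levels in calendar order
# regardless of the locale's day names.
weekday_levels <- weekdays(as.Date("2013-07-01") + 0:6)
daily$weekday <- factor(weekdays(daily$Date), levels = weekday_levels)

# Numeric summary per weekday.
weekday_means <- daily %>%
  group_by(weekday) %>%
  summarise(mean_attendance = mean(Attendance))

# Visual summary: one box per weekday.
p <- ggplot(daily, aes(x = weekday, y = Attendance)) +
  geom_boxplot()
```

Printing p draws the plot; weekday_means gives the same comparison as a table.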
Task 3.4 Which distributions are appropriate for modelling the ED demand?
Which variables meet the assumptions for the Poisson distribution? To reduce
the dependence between consecutive days, we randomly sample 200
records out of the whole dataset (all records for the selected hospital) for
modelling.
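A Poisson distribution has its variance equal to its mean, so comparing the two on the sampled records is a quick first check. The sketch below draws the 200 records from simulated counts; the simulated series and its parameters are assumptions standing in for one hospital's daily data.

```r
set.seed(741)

# Simulated overdispersed daily counts standing in for one hospital's series.
counts <- rnbinom(365, size = 5, mu = 20)

# Randomly sample 200 records without replacement.
sampled <- sample(counts, size = 200)

# For Poisson data these two numbers should be close to each other.
c(mean = mean(sampled), variance = var(sampled))
```

A variance well above the mean suggests overdispersion, in which case a plain Poisson model is likely to fit poorly.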
Task 4: Fitting distributions (5 points)
As you may see in the previous step, although we are dealing with count data, a Poisson distribution
may not provide a good fit. In fact, an unconditional Poisson distribution is too restrictive for most
real-world applications. In this task, we will fit a couple of distributions to the Triage 2 attendance using the
same sample as in Task 3.4.
Task 4.1: Fitting distributions
Fit a Poisson distribution and a negative binomial distribution to Tri_2. You may use functions
provided by the package fitdistrplus.
Task 4.2: Compare distributions
Compare the log-likelihoods of the two fitted distributions.
Which distribution fits the data better? Why?
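As a rough illustration of Tasks 4.1 and 4.2, the sketch below fits both distributions with fitdist() from fitdistrplus and reads the log-likelihood and AIC from each fit. The vector x is simulated and merely stands in for the sampled Tri_2 values.

```r
library(fitdistrplus)

set.seed(741)
# Simulated counts standing in for the 200 sampled Tri_2 values.
x <- rnbinom(200, size = 5, mu = 20)

# Maximum likelihood fits (the default method of fitdist()).
fit_pois <- fitdist(x, "pois")
fit_nb   <- fitdist(x, "nbinom")

# A higher log-likelihood means the fitted model assigns more probability to the data.
c(poisson = fit_pois$loglik, negbin = fit_nb$loglik)

# AIC penalises the extra parameter of the negative binomial; lower is better.
c(poisson = fit_pois$aic, negbin = fit_nb$aic)
```

Because the negative binomial has one more parameter than the Poisson, its log-likelihood will rarely be lower, which is why a penalised criterion such as AIC is a useful companion when comparing the two.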
Task 5: Research question (3 points)
There is more than one way to fit a distribution to a set of numbers. Produce a short literature review
of different distribution fitting methods, showing the pros and cons of each method.
Task 6: Ethics question (2 points)
During your work, have you identified any issues that have ethical implications? Do they concern security
or privacy? How would you mitigate the risk?
Task 7: Reflection (1 point)
Answer the following questions:
1. What help did you receive from other students? What did you learn from them?
2. Please estimate the mark that you will receive for Assignment 1. Please provide both a point
estimate and an interval estimate (a confidence interval). You don’t need to provide a mathematical
model, but please explain how you used conditional information to reach the estimates. Based on
the conditional information, explain what you would have done differently to improve that mark.
What to submit
By the due date, you are required to submit the following files to the assignment Dropbox in
CloudDeakin.
1. An MS Word or PDF file containing your answers to all the assignment questions.
2. An R Notebook file Assignment1_submission.Rmd filled in with the script for your calculations. The
file should be able to run. Include sufficient comments so that the script can be understood by the
marker. Indicate all the packages that need to be installed separately.
Marking criteria
Your submission will be marked using the following criteria.
Showing good effort through completed tasks.
Applying statistical thinking to understand the problems and to identify solutions.
Applying statistical programming skills to obtain data and to process them for data analysis.
Applying visualisation techniques to discover distribution patterns and relationships among
variables.
Demonstrating creativity and resourcefulness in solutions.
Showing attention to detail through a good quality assignment report.