Data Science Practice Assignment 3

Data Science Practice
Task 3
Semester 2, 2019
Assessment and Submission Details
Marks: 40% of the Total Assessment for the Course
Due Date: 11:59pm Friday, Exam Week 1
The assignment will be marked out of a total of 100 marks and forms 40% of the total
assessment for the course.
Assignment Task
This assignment consists of two deliverables, being:
One code implementation (50%). The code file, in Jupyter Notebook format, and the relevant
data set files should be contained within a folder named: Task 3-Your NameStudent_Number. The folder is then to be zipped and uploaded to Blackboard.
A report (50%). The report must be uploaded as a separate file.
Part I – PySpark source code (50%)
Important Note: For code reproduction, your code must be self-contained. That is, it should
not require any libraries beyond the PySpark environment we have used in the workshops, and
the data files must be packaged with your code file.
In this component, you need to use Python 3 and PySpark to complete the following data
analysis tasks:
1. Exploratory data analysis
2. Recommendation engine
3. Classification
4. Clustering
You need to choose a dataset from Kaggle to complete
these tasks. Remember to include the data set file in your source code submission.
Note: In your notebook, please use a Heading 1 Markdown cell to separate each subtask.
Task I.1: Exploratory data analysis
This subtask requires you to explore your dataset by:
reporting its number of rows and columns,
cleaning the data (handling missing values or duplicated records) if necessary,
selecting 3 columns and drawing 1 plot (e.g. bar chart, histogram, boxplot, etc.) for each to
summarise it.
Task I.2: Recommendation engine
This subtask requires you to implement a recommender system using collaborative filtering
with the Alternating Least Squares (ALS) algorithm. You need to include:
Model training and predictions
Model evaluation using MSE
Task I.3: Classification
This subtask requires you to implement a classification system using logistic regression with
the LogisticRegressionWithLBFGS class. You need to include:
Logistic Regression model training
Model evaluation
Task I.4: Clustering
This subtask requires you to implement a clustering system with K-means. You need to include:
Model training
Model evaluation
Part II – Report (50%)
You are required to write a report to explain your design and implementation of the machine
learning parts in your code, including the following topics:
An introduction to, and explanation of, the ML algorithms and concepts
The learning settings, such as how you prepared the training/testing sets, what the key
parameters are and how you set them
Comments on and evaluation of the models learnt
Your report should use the following template:
Table of Contents
1.0 Introduction
Explain the data set you’ve chosen, including its source URL. Demonstrate your
exploratory data analysis in this section.
2.0 Machine learning implementation
2.1 Collaborative filtering
2.2 Logistic regression
2.3 K-Means
3.0 Conclusion
The marking rubrics are viewable on Blackboard.
Report Format
Your report should be about 1000 words, but no more than 1500 words.
The report MUST be formatted using the following guidelines:
Title Page – Must not contain headers, footers, or page numbering. Include your name
as the report’s author.
Header – Report title
Footer – your name and the page number
Paragraph text – 12 point Calibri single line spacing
Headings – Arial in an appropriate type size
Margins – 2.5cm on all margins
Page numbering
Introduction and onwards to use conventional numerals (1, 2, 3, 4) starting at
page 1 from the introduction.
The report is to be created as a single Microsoft Word document (version 2007 or
later). No other format is acceptable, and submitting one will result in the deduction of marks.
Please follow the conventions detailed in:
Summers, J. & Smith, B., 2014, Communication Skills Handbook, 4th Ed, Wiley, Australia.
The report is to include at least 5 appropriate references, and these references should follow
the Harvard method of referencing. Note that ALL references should be from journal articles,
conference papers, technical papers or a recognised expert in the field. DO NOT use
Wikipedia as a reference. The use of unqualified references will result in the deduction of marks.
This assignment will take several weeks to complete and will require a good understanding of
machine learning and PySpark for successful completion. It is imperative that students take
heed of the following points in relation to doing this assignment:
1. Ensure that you clearly understand the requirements for the assignment – what must be
done and what are the deliverables.
2. If you do not understand any of the assignment requirements, please ASK your tutor.
3. Each time you work on any aspect of the assignment reread the assignment requirements to
ensure that what is required is clearly understood.
4. We have practiced nearly all coding tasks in DataCamp before. If you have any difficulty,
redoing the practices in DataCamp is recommended.

5. Prior to submitting your code, you should ensure not only that it executes as required, but
also that it looks professional. It is expected that you adhere to Python standards for naming and
indenting. All methods should be adequately documented such that another programmer
examining your code will readily know what the code is doing.
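As an illustration of point 5, the short sketch below shows one common reading of "Python standards": snake_case names and four-space indentation as in PEP 8, plus a docstring that states what the function takes, returns and raises. The function itself is a hypothetical helper, not part of the assignment.

```python
def mean_squared_error(pairs):
    """Return the mean squared error of (actual, predicted) pairs.

    Args:
        pairs: A non-empty list of (actual, predicted) numeric tuples.

    Returns:
        The average of (actual - predicted) ** 2 as a float.

    Raises:
        ValueError: If pairs is empty.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return sum(squared_errors) / len(squared_errors)


# Squared errors are 1.0, 0.0 and 4.0, so the mean is 5/3.
example_mse = mean_squared_error([(3.0, 2.0), (1.0, 1.0), (4.0, 6.0)])
print(example_mse)
```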
