Predictive Analytics

MIS772 Predictive Analytics (2019 T2) Assignment A2 / Workshops M1-M2-M3
Assignment A2 / Workshops M1-M3: RM
This assignment covers all workshops in modules M1-M3. By completing the workshops
and assignment students will understand how to use RapidMiner (RM) to explore data,
gain insights into the problem domain, create and validate estimation and clustering
models, perform segmentation analysis and text mining. The workshop will rely on
students’ knowledge of methods and techniques introduced in a series of classes. The
assignment will have two deliverables in the form of learning portfolios LP3 and LP4.
During the workshop (on-campus and on-cloud) students will work in teams but submit
their individual reports based on their tasks as related to the data set. The work is
expected to use RapidMiner Studio. Demonstrations and lab exercises will assist skill
development.
Before attending RM workshops, students are required to become familiar with class
notes and all textbook readings (see the topic schedule with chapter references).

Activities and topics (no late arrivals for the on-campus sessions!)
1. Learn to use RapidMiner Studio.
   Topic: Preparation
2. The workshop facilitator will explain the case in the focus of this assignment.
   Work in groups of up to 4 (also 1-2-3).
   Start by formulating a business problem (it may change later).
3. Revise classification models (such as k-NN and decision trees), cross-validation,
   clustering and simple model optimisation.
   Learn about the problem area and the assignment data.
   Download your data as a CSV (or JSON if brave) file, explore your data.
   Select attribute types, nominate them as labels and predictors.
   Do not modify these ‘raw’ data files outside of the RM environment.
   Topic (activities 2-3): M1, M2T1: Classification, Cross-Validation, Optimisation, Data Prep
4. Learn to parse and represent text data, reduce data dimensionality, perform
   segmentation analysis, create and evaluate predictive models with attributes
   derived from text, visualise results.
   Topic: M2T2: Text Mining & Sentiment
5. Use RM to clean and transform data, deal with missing values, produce
   simple statistics and charts, build estimation models using multiple
   regression and neural networks. Learn how to create model ensembles,
   such as random forests, boosting, stacking and bootstrapping ensembles.
   Topic: M2T3 & M2T4: Estimation, Neural Nets, Ensembles
6. Study the techniques associated with the deployment of analytic processes.
   Extend your work on neural networks with deep learning systems.
   Topic: M3T1 & M3T2: Deployment, Deep Learning
7. As a team member, prepare an individual report using the provided
   template. The report should be in PDF format. Also, include all RM
   processes in RMP format. If you have altered the data, attach the modified
   data to your submission.
   Topic: Report and Executive Summary
8. By the specified deadline, individually submit two components of your
   learning portfolio, i.e. the LP3 and later the LP4 parts of the assignment, via the
   CloudDeakin dropbox. With each submission, include your report in PDF,
   formatted using the provided template, plus a ZIP archive of all models, i.e.
   your RapidMiner scripts (.RMP files). Do not use other file formats!
   Topic: Submission / Learning Portfolio


This mini case study will be used in all workshops of module 1, i.e. M1T1-M1T4. All
amendments, extensions and assumptions should be recorded in the final submission.
Australian Wine Importers (AWI) asked you to develop a
method of estimating rating (points) of imported wines based
on their text and structured attributes.
AWI provided you with a sample of 130,000 wine tasting
results, which include:
Wine “title” (name + vintage);
Country, Province and Region;
Variety and Winery;
Description and Designation;
Price (US$).
However, Taster name and Points are to be excluded.
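The attribute list above can be pictured as a set of roles: one label and several predictors. The following is a minimal Python sketch of that idea only; the assignment itself must be done in RapidMiner. The rows are made up, the lowercase column names are assumptions that may not match the headers in the real file, and the sketch assumes that "excluded" means excluded from the predictor set, with points kept as the value to estimate.

```python
import csv
import io

# Made-up rows for illustration only; the column names are assumptions and may
# differ from the headers in the real Wine-Reviews file.
SAMPLE_CSV = """title,country,province,region,variety,winery,description,designation,price,taster_name,points
Example Red 2015,Australia,South Australia,Barossa,Shiraz,Winery A,Ripe plum and spice.,Reserve,35,Taster X,91
Example White 2017,France,Alsace,Alsace,Riesling,Winery B,Crisp citrus and mineral notes.,,22,Taster Y,88
"""

LABEL = "points"            # assumed role: the value to estimate
EXCLUDED = {"taster_name"}  # assumed role: never used as a predictor


def split_roles(row):
    """Return (predictors, label) for one CSV row given as a dict."""
    predictors = {k: v for k, v in row.items() if k != LABEL and k not in EXCLUDED}
    return predictors, float(row[LABEL])


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
examples = [split_roles(r) for r in rows]

for predictors, label in examples:
    print(label, sorted(predictors))
```

Note that the second made-up row has an empty designation, which is the kind of missing value that needs handling before modelling.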
In the future, AWI would like to gain preliminary insight into wine quality based
on social media reviews. The following questions are of interest to AWI:
A) Which group of wines is the new wine most similar to, and why / how?
B) What is the estimated rating of a wine newly introduced to the Australian market
(fractional ratings permitted)?
AWI wants you to clean up and explore the wine tasting data, develop and evaluate a wine
rating estimator, and minimise the estimation error in the process.
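To make "estimation error" concrete, here is a small Python sketch of two common error measures for numeric estimates, mean absolute error and root mean squared error. It is an illustration only: the brief does not prescribe a particular error measure, the ratings below are made up, and the function names are hypothetical.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average size of the estimation errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; penalises large misses more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [90.0, 85.0, 88.0, 92.0]     # made-up ratings
predicted = [89.0, 87.0, 88.0, 90.0]  # made-up estimates

print(mean_absolute_error(actual, predicted))      # 1.25
print(root_mean_squared_error(actual, predicted))  # 1.5
```

The errors here are 1, 2, 0 and 2 points, so the mean absolute error is 5 / 4 = 1.25 and the root mean squared error is the square root of 9 / 4, i.e. 1.5.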
In technical terms:
Your project objectives form a learning portfolio. The first objective (LP3) is to acquire
and explore the available data using clustering and segmentation analysis, to visualise and
report relationships in text and structured data, and to prepare data for further processing.
The second objective (LP4) is to create an estimation system able to answer management
questions using all available data. Text mining will feature strongly in assignment
A2. Reports in PDF format and the models developed in LP3 and LP4, in ZIP archives, are to
be submitted via CloudDeakin by their respective deadlines.
Data: http://www.deakin.edu.au/~jlcybuls/pred/data/Wine-Reviews.zip
Original data source: https://www.kaggle.com/zynicide/wine-reviews
Hints on the process:
Formulate a business problem using plain English statements, but cross-reference
them with the technical aspects described in the subsequent sections. When describing the
problem and its solution, keep in mind what can be achieved using the available data.
Note that what you have been asked for and what can be delivered are two different
things. For example, to solve the problem you may need to narrow or slightly change the
problem scope, or the model may provide quality answers only within a specific range of
data characteristics. If so, this is what you need to report or recommend to AWI
management.

Explore your text and non-text attributes in terms of their clustering and segmentation.
Use appropriate visualisations, analyse and interpret them. As the report template
provides very limited space, be selective about what you include in the report – each
chart and table must have a purpose and a description to advance your argument, use
them as evidence!
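As a conceptual aid for the clustering step, the sketch below shows plain k-means in Python on made-up two-dimensional points. It is not the RapidMiner process you are expected to submit. The points are assumed to be numeric features already scaled to the range 0 to 1, and the centroids are seeded with the first k points, a naive choice made here only to keep the example deterministic; results of k-means depend on the starting centroids.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points (naive seeding)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Made-up points forming two visibly separate groups.
points = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9), (0.15, 0.15), (0.85, 0.85)]
assignment, centroids = k_means(points, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
```

Each cluster can then be described by its centroid, which is one way to explain why a new wine is similar to a given group.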
Depending on the model, some attributes may need to be transformed before using them
in modelling tasks. You may also have to deal with incorrect or missing values. Look at
your modelling options, optimise their parameters and compare evaluation results.
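Two transformations of the kind mentioned above are filling in missing values and rescaling a numeric attribute. The Python sketch below illustrates both on made-up prices. Mean imputation and min-max scaling are chosen here only as simple examples, not as the recommended treatment for this data, and the function names are hypothetical.

```python
def impute_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: nothing to spread out
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


prices = [10.0, None, 30.0, 50.0]  # made-up prices with one missing value
filled = impute_with_mean(prices)  # [10.0, 30.0, 30.0, 50.0]
scaled = min_max_scale(filled)     # [0.0, 0.5, 0.5, 1.0]
print(filled, scaled)
```

When a model is validated, statistics such as the mean, minimum and maximum should be computed from the training data only and then reused on the test data, so that information from the test data does not leak into the model.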
Check the assessment criteria on the next page to see how you are going to be assessed.
Stick to the recommended process. Complete the basics first before moving to the more
advanced tasks or any extensions and research tasks.
You will submit your work in two learning portfolio parts LP3 and LP4.
Each part needs to be lodged via CloudDeakin dropbox before the deadline.
You will be allowed to submit your work once only!
It is essential that your reports use LP3 and LP4 templates.
Follow instructions embedded in the templates!
Both reports must fit into a strict page limit imposed by the template.
Only pages within the template limit will be reviewed and assessed!
Make sure that the problem statement and the executive summary are aimed at
non-technical readers, while the remaining parts of the reports are aimed at a data /
business analyst (and not highly technical programmers).
Your submission must include the report in PDF format and a ZIP archive of .RMP
script files (these can be found in the RM project folder; simply ZIP these files).
Submissions not in PDF and ZIP format will not be opened or assessed!
There is a strict deadline for each submission. In cases of documented illness,
special consideration may be granted but must be applied for well ahead of the deadline.
In general, requests for special consideration received less than three days before the
deadline will not be considered!
An automatic late penalty of 5% of the available marks per day (up to 5 days) will be
applied to all late assignment submissions.
Late penalties apply immediately past the deadline – even 1 second!
Both parts LP3 and LP4 will be marked together after part LP4 is submitted.
Feedback will be provided on both parts together.
Teamwork and collaboration are encouraged but plagiarism will be penalised.
Team members can share ideas and help each other in solving technical problems. Seek
your team’s feedback on all aspects of your assignment, especially before its submission.
However, your assignment needs to be completed individually.
Ensure that your assignment is unique, otherwise plagiarism will be assumed!

The work will be assessed based on the following criteria. Use RapidMiner for both
assignment tasks LP3 and LP4. Other tools can be used only for the tasks associated with
the research section. Do not start the advanced tasks before meeting the expectations
(or no points will be given). Use the submission template for both LP3 and LP4.

LP3 assessment criteria
Ranges: Exceptional 80–90–100%; Meets Expectations 50–65–79%; Unacceptable 0–25–49%

Problem (marks: 5 to 0; one page limit)
  Exceptional: Identify what decisions need to be drawn and what actions need to be
  supported.
  Meets Expectations: Succinctly state a business problem (or question) and specify what
  insights need to be generated from data.
  Unacceptable: Not provided or incomprehensible.

Data Prep (marks: 25 to 0; one page limit)
  Exceptional: Deal with errors and missing values. Reduce data dimensionality. Provide
  comprehensive analysis, tabulate your results. Answer the management question (A).
  Meets Expectations: Parse text attributes. Then, conduct clustering and segmentation
  analysis of both structured and text data. In the process, identify relationships in
  data. Visualise and interpret the obtained results. Annotate all charts (with text and
  arrows) to highlight important insights.
  Unacceptable: Not meeting expectations. Missing RM process files. Over the page limit.

Include: Report (use template, in PDF) and RMP files (in ZIP), with an explanation of how to reproduce all results.
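The "parse text attributes" expectation in the LP3 criteria amounts to turning each description into a vector of term weights. The Python sketch below illustrates one such representation, TF-IDF, on made-up reviews. It is only a conceptual illustration: the exact weighting computed by RapidMiner's text operators may differ, and the formula used here (relative term frequency times the natural log of inverse document frequency, with no stop-word removal or stemming) is an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


reviews = [  # made-up descriptions
    "Ripe cherry and soft tannins.",
    "Crisp citrus with a mineral finish.",
    "Ripe plum, cherry and spice.",
]
vectors = tf_idf(reviews)
print(sorted(vectors[0].items(), key=lambda item: -item[1]))
```

Terms that occur in only one review, such as "tannins", receive a higher weight than terms shared across reviews, such as "ripe". The resulting vectors can feed clustering or estimation models once their dimensionality has been reduced.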

LP4 assessment criteria
Ranges: Exceptional 80–90–100%; Meets Expectations 50–65–79%; Unacceptable 0–25–49%

Exec Report (marks: 5 to 0; one page limit)
  Exceptional: Narrow down the business problem. Identify decisions and actions that will
  be supported by the analytic solution. Include a list of the academic references used.
  Meets Expectations: Restate / redefine the business problem from the previous stage LP3.
  Succinctly describe the solution and justify it. Provide references to the analytic
  results and supporting evidence, e.g. charts and plots.
  Unacceptable: Not provided or incomprehensible. Solution not justified. No
  cross-references to the rest of the report.

Models (marks: 15 to 0; two pages limit)
  Exceptional: Analyse and eliminate anomalies. Rely on your data clustering. Use PCA to
  visualise clusters and anomalies. Create and use a composite model ensemble. Answer the
  management question (B).
  Meets Expectations: Create at least these two models, i.e. (M1) decision trees (or GBTs /
  Random Forests) and (M2) neural nets. Ensure you consider three types of solutions,
  which are based on (A1) structured data only, (A2) text data only, (A3) a mix of
  structured and text data. Describe and justify the operators’ parameters.
  Unacceptable: Not meeting expectations. Missing RM process files. Over the page limit.

Evaluation (marks: 20 to 0; two pages limit)
  Exceptional: Use systematic grid optimisation of the models’ hyper-parameters. Visualise
  results of grid optimisation.
  Meets Expectations: Use cross-validation, e.g. 5-fold. Optimise the models’ performance
  to minimise overall error in ratings. Tabulate performance of all models (including
  ensembles if used), using R², correlation and others. Identify the best model and
  justify its selection.
  Unacceptable: Not meeting expectations. Missing RM process files. Over the page limit.

Deployment (marks: 15 to 0; one page limit)
  Exceptional: Ensure all preprocessing and predictive models, as well as word lists,
  weights and PCA models, are saved during optimisation and then retrieved and applied in
  deployment.
  Meets Expectations: Create a quality deployment process using your best model with the
  optimum parameters. Score new data with the model and discuss results. Explain how to
  prepare data and use the results in practice.
  Unacceptable: Not meeting expectations. Missing RM process files. Over the page limit.

Research (marks: 15 to 0; one page limit)
  Exceptional: Surprise us with your insights. Make sure to use a few academic references
  in this section, and include the list of references in the executive section (above).
  Meets Expectations: Extend your work with RM features beyond what was covered in class,
  to improve the model, use novel visualisations, or analyse results in the most
  effective way. Alternatively, conduct independent research to assess and contextualise
  your results.
  Unacceptable: Not meeting expectations. Missing RM process files. Over the page limit.

Include: Report (use template, in PDF) and RMP files (in ZIP), with an explanation of how to reproduce all results.
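The Evaluation criterion combines three ideas: k-fold cross-validation, grid optimisation of a hyper-parameter, and scoring with R². The Python sketch below shows how they fit together, using a tiny k-NN estimator on made-up data with a single numeric feature. It is a conceptual illustration, not a substitute for the RapidMiner processes required for LP4; the function names, the candidate grid and the interleaved fold assignment are all assumptions made for the example.

```python
def k_fold_indices(n, folds):
    """Yield (train, test) index lists; fold f tests every folds-th item."""
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        yield train, test


def knn_predict(train_x, train_y, x, k):
    """Average the labels of the k nearest training points (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus residual over total variation."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def cross_validated_r2(xs, ys, k, folds=5):
    """Pool the out-of-fold predictions, then score them once with R^2."""
    predictions = [0.0] * len(xs)
    for train, test in k_fold_indices(len(xs), folds):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for i in test:
            predictions[i] = knn_predict(tx, ty, xs[i], k)
    return r_squared(ys, predictions)


# Made-up data: the rating rises with one numeric feature, plus a small wiggle.
xs = [float(i) for i in range(20)]
ys = [80.0 + 0.5 * x + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]

# Grid optimisation: try every candidate k and keep the best scorer.
grid = [1, 3, 5, 7]
scores = {k: cross_validated_r2(xs, ys, k) for k in grid}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Every example is predicted exactly once by a model that never saw it during training, so the pooled R² estimates performance on unseen data. Tabulating `scores` for each grid point is the kind of evidence the Evaluation criterion asks for.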