BUAN 573 Individual Assignment, Week 2
June 12, 2022
(a) The ToyotaCorrolla.csv dataset (that must be downloaded and placed in the same folder where
your R markdown file is saved) contains data on used cars for sale during the late summer of 2004 in
the Netherlands. Load the ToyotaCorrolla.csv dataset into RStudio and name the loaded data frame as
(b) How many rows and columns does toyota.df have? Use the dim() function to answer this question.
(c) Display the top few rows of toyota.df.
(d) Describe what rows (data entities) represent.
(e) Describe what columns (data features) represent.
(f) Display the column names.
(g) Suppose you wanted to explore/browse the entire imported dataset (toyota.df) as you would normally
do in Microsoft Excel. How would you go about doing it?
(h) Some of the columns of toyota.df are quantitative and some of them are categorical variables. Which 2
variables are categorical and coded as text (rather than numerically)? Give an example of
2 variables that are numerical and an example of 2 variables that are categorical but coded numerically.
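The inspection steps in parts (b), (c), (f), and (g) can be sketched in base R. Because the sketch cannot assume the CSV is on disk, it first writes a tiny stand-in file; the three rows and their values are invented for illustration and are not taken from the real dataset. Only the column names Price, Fuel_Type, and Color come from the assignment text.

```r
# Hypothetical stand-in for ToyotaCorrolla.csv so the sketch runs on its own;
# the values below are invented for illustration only.
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Price,Fuel_Type,Color",
  "13500,Diesel,Blue",
  "13750,Diesel,Silver",
  "16900,Petrol,Grey"
), demo_path)

# With the real file in the working directory, this call would be
# read.csv("ToyotaCorrolla.csv") instead.
toyota.df <- read.csv(demo_path)

dims <- dim(toyota.df)          # c(number of rows, number of columns)
top_rows <- head(toyota.df)     # first six rows by default (fewer if shorter)
col_names <- names(toyota.df)   # column names as a character vector
str(toyota.df)                  # storage type of each column (text vs numeric)

# In an interactive RStudio session, View(toyota.df) opens a spreadsheet-like
# viewer; it is commented out here because it needs a GUI.
# View(toyota.df)
```

The output of str() is one way to see which columns are stored as text and which as numbers, which helps with part (h).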
(a) Two of the textual categorical variables of toyota.df are Fuel_Type and Color. Convert these variables
into binary numerical variables and substitute Fuel_Type and Color in toyota.df with the newly
created dummy variables. Note: Upon substitution, be sure to drop one of the dummy variables for
each original categorical variable.
(b) One of the numerical variables that an analyst may be very interested in is Price. First standardize
then rescale this variable. What effect do standardization and rescaling have on the variable?
(c) Apply the equal frequency method to discretize the Price variable, label the discrete bins of the new
variable as low, average, and high, and append the new variable to toyota.df.
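The three preprocessing steps above can be sketched in base R as follows. This is one possible approach, not necessarily the one the course expects. The six rows are invented; the helper names price_z, price_01, and Price_Bin are assumptions introduced for the example, and "rescale" is read here as min-max scaling to the range [0, 1].

```r
# Invented example rows; the column names Price, Fuel_Type, Color follow the text.
toyota.df <- data.frame(
  Price = c(9000, 12000, 15000, 18000, 21000, 24000),
  Fuel_Type = c("Diesel", "Petrol", "Petrol", "CNG", "Diesel", "Petrol"),
  Color = c("Blue", "Red", "Blue", "Grey", "Red", "Grey"),
  stringsAsFactors = TRUE
)

# (a) model.matrix() builds 0/1 indicator columns. With an intercept it omits
# the first level of each factor, so dropping the intercept column leaves
# k - 1 dummies per categorical variable.
dummies <- model.matrix(~ Fuel_Type + Color, data = toyota.df)[, -1]
keep <- !(names(toyota.df) %in% c("Fuel_Type", "Color"))
toyota.df <- cbind(toyota.df[, keep, drop = FALSE], dummies)

# (b) Standardize (z-score), then rescale the result to [0, 1].
price_z <- (toyota.df$Price - mean(toyota.df$Price)) / sd(toyota.df$Price)
price_01 <- (price_z - min(price_z)) / (max(price_z) - min(price_z))

# (c) Equal-frequency binning: use the terciles of Price as break points.
# With heavily tied prices the breaks may not be unique and would need care.
breaks <- quantile(toyota.df$Price, probs = c(0, 1/3, 2/3, 1))
toyota.df$Price_Bin <- cut(toyota.df$Price, breaks = breaks,
                           labels = c("low", "average", "high"),
                           include.lowest = TRUE)
```

Both transformations in (b) are linear, so they change the location and scale of Price but not the shape of its distribution or the ordering of the cars.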
(a) Using the concept of overfitting, explain why zero error on the training data is not necessarily
good when a model is fit to those data.
(b) Describe the difference in roles assumed by the validation and test data partitions.
(c) Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate
than model B on the training data, but slightly less accurate than model B on the validation data.
Which model are you more likely to consider for final deployment?
(d) Prepare toyota.df for data mining techniques of supervised learning by creating partitions in RStudio.
Select all the variables of toyota.df and use default values for random seed and partitioning percentages
for training (50%), validation (30%), and test (20%) sets. Describe the roles that these partitions will
play in modeling.
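A minimal sketch of the 50/30/20 partition in part (d), using base R sampling. The 100-row data frame is invented so the example runs on its own, and set.seed(1) is an assumed seed value, since the text does not say which seed counts as the default. The names train.df, valid.df, and test.df are also assumptions.

```r
# Invented stand-in data; replace with the real toyota.df.
toyota.df <- data.frame(id = 1:100,
                        Price = seq(5000, by = 100, length.out = 100))

set.seed(1)  # assumed seed, chosen only to make the split reproducible
n <- nrow(toyota.df)

# Draw 50% of the row indices for training.
train_rows <- sample(seq_len(n), size = round(0.5 * n))

# From the rows not used for training, draw 30% of n for validation;
# whatever is left (20% of n) becomes the test set.
remaining <- setdiff(seq_len(n), train_rows)
valid_rows <- sample(remaining, size = round(0.3 * n))
test_rows <- setdiff(remaining, valid_rows)

train.df <- toyota.df[train_rows, ]
valid.df <- toyota.df[valid_rows, ]
test.df <- toyota.df[test_rows, ]
```

Sampling row indices rather than rows keeps the three partitions disjoint by construction, which is what lets the test set serve as an untouched final check.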
This question uses the mutate(), select(), slice(), filter(), arrange(), summarize(), and group_by()
functions of the dplyr package.
The ggplot2 library contains several data sets, one of which is the diamonds data with the prices of over
50,000 round cut diamonds. Each record of the diamonds data represents the price (in US dollars), carat
weight, cut (quality), color (from D=best to J=worst), clarity (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1,
IF (best)), x (length in mm), y (width in mm), z (depth in mm), depth (total depth, percentage), and table
(width of top of diamond relative to widest point) of a sold diamond.
(a) Load the diamonds data by activating the ggplot2 library and executing the following command:
(b) Diamonds are measured in carats, where 1 carat=0.200 grams. Using the mutate() function write
code that will create another table (call it aug_diamonds) that will have all the columns of the original
diamonds table plus two additional columns, one for price per carat and one for weight in grams. Name
the added columns price_carat and weight respectively.
(c) Using the select() function, select columns carat, cut, price_carat, and weight, order the resulting
selected data by price_carat in descending order (using the arrange() function), and display the top
5 rows (using the slice() function).
(d) Using the filter() function, write code that pulls the subset of the aug_diamonds table that contains
only observations with ideal cut. Display the top 3 rows.
(e) Using the select() function, pull columns price and cut out of the aug_diamonds table. Further,
using the summarise() and group_by() functions, obtain the median price of diamonds grouped by
cut. Is the result intuitive?
(f) Repeat the steps in part (e) above but instead of price use price_carat. Does the result in this part
offer any clues as to why the results in part (e) might have been counter-intuitive?
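Parts (b) through (f) can be sketched as one dplyr workflow. This is a sketch rather than a model answer: it assumes the ggplot2 and dplyr packages are installed, it relies on diamonds being available once ggplot2 is attached, and the intermediate names top5, ideal_top3, median_price, and median_price_carat are introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# (b) Add price per carat and weight in grams (1 carat = 0.200 g).
aug_diamonds <- diamonds %>%
  mutate(price_carat = price / carat,
         weight = carat * 0.2)

# (c) Keep four columns, sort by price per carat (descending), take 5 rows.
top5 <- aug_diamonds %>%
  select(carat, cut, price_carat, weight) %>%
  arrange(desc(price_carat)) %>%
  slice(1:5)

# (d) Only diamonds whose cut level is "Ideal"; first 3 rows.
ideal_top3 <- aug_diamonds %>%
  filter(cut == "Ideal") %>%
  slice(1:3)

# (e) Median price within each cut level.
median_price <- aug_diamonds %>%
  select(price, cut) %>%
  group_by(cut) %>%
  summarise(median_price = median(price))

# (f) Same grouping, but on price per carat.
median_price_carat <- aug_diamonds %>%
  select(price_carat, cut) %>%
  group_by(cut) %>%
  summarise(median_price_carat = median(price_carat))
```

Printing median_price and median_price_carat side by side is a convenient way to compare the two groupings when answering parts (e) and (f).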