Individual Assignment

BUAN 573 Individual Assignment, Week 2
Your Name
June 12, 2022
Part 1
(a) The ToyotaCorrolla.csv dataset (which must be downloaded and placed in the same folder as your
R Markdown file) contains data on used cars for sale during the late summer of 2004 in the
Netherlands. Load the ToyotaCorrolla.csv dataset into RStudio and name the loaded data frame toyota.df.
(b) How many rows and columns does toyota.df have? Use the dim() function to answer this question.
(c) Display the top few rows of toyota.df.
(d) Describe what rows (data entities) represent.
(e) Describe what columns (data features) represent.
(f) Display the column names.
(g) Suppose you wanted to explore/browse the entire imported dataset (toyota.df) as you would normally
do in Microsoft Excel. How would you go about doing it?
(h) Some of the columns of toyota.df are quantitative variables and some are categorical variables.
Which two variables are categorical and coded as text (rather than numerically)? Give an example of
two variables that are numerical and an example of two variables that are categorical but coded numerically.
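The base R calls involved in Part 1 can be sketched as follows. This is a minimal illustration, not a prescribed solution: it writes a tiny stand-in CSV to a temporary file so that it runs without the real dataset, and the columns Age, KM, and Automatic as well as all values are hypothetical (only Price, Fuel_Type, and Color are named in this assignment).

```r
# Hypothetical stand-in for ToyotaCorrolla.csv, written to a temp file so the
# sketch runs without the real dataset. Column set and values are illustrative.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Price,Age,KM,Fuel_Type,Color,Automatic",
  "13500,23,46986,Diesel,Blue,0",
  "13750,23,72937,Diesel,Silver,0",
  "16900,27,94612,Petrol,Grey,1",
  "18600,30,75889,Petrol,Black,0"
), csv_path)

# With the real file in the working directory: read.csv("ToyotaCorrolla.csv")
toyota.df <- read.csv(csv_path)

dim(toyota.df)    # number of rows, then number of columns
head(toyota.df)   # first six rows by default
names(toyota.df)  # column names
str(toyota.df)    # storage type of each column (chr, int, num, ...)

# Spreadsheet-style browsing in RStudio; guarded so the script also runs
# non-interactively.
if (interactive()) View(toyota.df)
```

The str() output is one way to see which columns are stored as text and which as numbers; deciding whether a numeric column is really categorical still requires reading the data description.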
Part 2
(a) Two of the textual categorical variables of toyota.df are Fuel_Type and Color. Convert these variables
into binary numerical variables and substitute Fuel_Type and Color in toyota.df with the newly
created dummy variables. Note: Upon substitution, be sure to drop one of the dummy variables for
each original categorical variable.
(b) One of the numerical variables that an analyst may be very interested in is Price. First standardize
and then rescale this variable. What effect do normalization and rescaling have on the variable?
(c) Apply the equal frequency method to discretize the Price variable, label the discrete bins of the new
variable as low, average, and high, and append the new variable to toyota.df.
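The three transformations in Part 2 can be sketched on a toy data frame as shown here. This is one possible approach under stated assumptions, not the required method: toy.df and its values are made up, Price_Bin is a hypothetical column name, "rescale" is taken to mean min-max scaling to [0, 1], and model.matrix() is just one of several ways to build dummy variables.

```r
# Toy stand-in for toyota.df; values are illustrative only.
toy.df <- data.frame(
  Price     = c(13500, 13750, 16900, 18600, 21500, 12950),
  Fuel_Type = c("Diesel", "Diesel", "Petrol", "Petrol", "CNG", "Petrol"),
  Color     = c("Blue", "Silver", "Grey", "Black", "Blue", "Grey")
)

# (a) Dummy coding. With an intercept in the formula, model.matrix() keeps
# k - 1 dummies per categorical variable (the first level is the dropped
# reference); [, -1] then removes the intercept column itself.
dummies <- model.matrix(~ Fuel_Type + Color, data = toy.df)[, -1]
keep    <- !(names(toy.df) %in% c("Fuel_Type", "Color"))
toy.df  <- cbind(toy.df[, keep, drop = FALSE], dummies)

# (b) Standardize (z-score), then rescale to [0, 1] (assumed meaning of
# "rescale"). Min-max scaling gives the same result on raw or z-scored values.
price_z  <- as.numeric(scale(toy.df$Price))
price_01 <- (price_z - min(price_z)) / (max(price_z) - min(price_z))

# (c) Equal-frequency binning: cut points at the 1/3 and 2/3 quantiles.
breaks <- quantile(toy.df$Price, probs = c(0, 1/3, 2/3, 1))
toy.df$Price_Bin <- cut(toy.df$Price, breaks = breaks,
                        labels = c("low", "average", "high"),
                        include.lowest = TRUE)

table(toy.df$Price_Bin)  # bin counts should be (roughly) equal
```

Note that include.lowest = TRUE matters here: without it, the observation equal to the minimum price would fall outside the first bin and become NA.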
Part 3
(a) Using the concept of overfitting, explain why zero error on the training data is not necessarily
good when a model is fit to those data.
(b) Describe the difference in roles assumed by the validation and test data partitions.
(c) Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate
than model B on the training data, but slightly less accurate than model B on the validation data.
Which model are you more likely to consider for final deployment?
(d) Prepare toyota.df for data mining techniques of supervised learning by creating partitions in RStudio.
Select all the variables of toyota.df and use default values for random seed and partitioning percentages
for training (50%), validation (30%), and test (20%) sets. Describe the roles that these partitions will
play in modeling.
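A 50/30/20 random partition can be sketched in base R as follows. This is a minimal illustration under assumptions: the seed value of 1 is a guess (the assignment does not state its "default" seed), and toy.df is a made-up 100-row stand-in for toyota.df.

```r
set.seed(1)  # assumed seed; the assignment's default value is not stated

# Toy stand-in for toyota.df with 100 rows; values are illustrative only.
toy.df <- data.frame(id = 1:100, Price = round(runif(100, 5000, 30000)))

n <- nrow(toy.df)

# Draw 50% of the row indices for training, then 30% of all rows for
# validation from what is left; the remaining 20% form the test set.
train.rows <- sample(seq_len(n), size = floor(0.5 * n))
valid.rows <- sample(setdiff(seq_len(n), train.rows), size = floor(0.3 * n))
test.rows  <- setdiff(seq_len(n), c(train.rows, valid.rows))

train.df <- toy.df[train.rows, ]
valid.df <- toy.df[valid.rows, ]
test.df  <- toy.df[test.rows, ]

# Partition sizes
c(train = nrow(train.df), valid = nrow(valid.df), test = nrow(test.df))
```

Sampling row indices without replacement and removing them with setdiff() guarantees that the three partitions do not overlap and together cover every row.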

Part 4
This question uses the mutate(), select(), slice(), filter(), arrange(), summarize(), and group_by()
functions of the dplyr package.
The ggplot2 library contains several datasets, one of which is the diamonds data, with the prices of over
50,000 round-cut diamonds. Each record of the diamonds data describes one sold diamond: its price (in US
dollars), carat weight, cut (quality), color (from D=best to J=worst), clarity (I1 (worst), SI2, SI1, VS2,
VS1, VVS2, VVS1, IF (best)), x (length in mm), y (width in mm), z (depth in mm), depth (total depth,
percentage), and table (width of top of diamond relative to widest point).
(a) Load the diamonds data by activating the ggplot2 library and executing the following command:
(b) Diamonds are measured in carats, where 1 carat = 0.200 grams. Using the mutate() function, write
code that will create another table (call it aug_diamonds) that will have all the columns of the original
diamonds table plus two additional columns, one for price per carat and one for weight in grams. Name
the added columns price_carat and weight, respectively.
(c) Using the select() function, select columns carat, cut, price_carat, and weight, order the resulting
selected data by price_carat in descending order (using the arrange() function), and display the top
5 rows (using the slice() function).
(d) Write code that uses the filter() function to pull a subset of the aug_diamonds table that contains
only observations with an ideal cut. Display the top 3 rows.
(e) Using the select() function, pull columns price and cut out of the aug_diamonds table. Further,
using the summarise() and group_by() functions, obtain the median price of diamonds grouped by
cut. Is the result intuitive?
(f) Repeat the steps in part (e) above, but instead of price use price_carat. Does the result in this part
offer any clues as to why the results in part (e) might have been counter-intuitive?
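The dplyr verbs used in Part 4 can be chained as in the sketch below. It assumes the ggplot2 and dplyr packages are installed. The names median_price and median_price_carat are hypothetical labels for the summary columns, and "Ideal" is the capitalized level name used in the diamonds data. Interpreting the output is left to the reader.

```r
library(ggplot2)  # attaching ggplot2 makes the diamonds data available
library(dplyr)

# (b) Add price per carat and weight in grams (1 carat = 0.200 grams).
aug_diamonds <- diamonds %>%
  mutate(price_carat = price / carat,
         weight      = carat * 0.2)

# (c) Four selected columns, sorted by price per carat (descending), top 5 rows.
aug_diamonds %>%
  select(carat, cut, price_carat, weight) %>%
  arrange(desc(price_carat)) %>%
  slice(1:5)

# (d) Only diamonds with an Ideal cut, top 3 rows.
aug_diamonds %>%
  filter(cut == "Ideal") %>%
  slice(1:3)

# (e) Median price by cut.
aug_diamonds %>%
  select(price, cut) %>%
  group_by(cut) %>%
  summarise(median_price = median(price))

# (f) Median price per carat by cut.
aug_diamonds %>%
  select(price_carat, cut) %>%
  group_by(cut) %>%
  summarise(median_price_carat = median(price_carat))
```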