
Part 1

1. Create a repository in RapidMiner called Homework, then create two subfolders, Data and Processes, under it.

2. Import the data from the file Titanic.xls attached on Blackboard and store it as a repository entry called Titanic in the Data subfolder you created above, as follows:

a. In the Design view, manually add the Read Excel operator.

b. Manually add the Store operator and specify the repository location.

c. Run the process to import the data and save it to your local repository.

3. Filter the data to keep only the passengers who survived the Titanic disaster.

a. In the Design view, add the Filter Examples operator and define the filter on the “Survived” column by clicking the Add Filters button in the Parameters section.

b. If you prefer, you can filter the data in Turbo Prep instead and then add the resulting process.

c. Run the process to check that the filtering worked.

4. Replace “Male” with “Man” in the Sex column.

a. In the Design view, add the Replace operator, set the attribute filter type to single, and fill in the remaining parameters so that “Male” is replaced with “Man”.

b. Run the process to check that the replacement worked.

c. Go to File > Export Process and save the process you created above as a .rmp file.
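If you want to cross-check your RapidMiner results for steps 3 and 4 outside the tool, the same logic can be sketched in Python with pandas. This is only an illustration under stated assumptions: the rows below are made up, and the code "Yes" for a survivor is a guess, so check the actual values in Titanic.xls before comparing.

```python
import pandas as pd

# Hypothetical toy rows; the real Titanic.xls columns and value codes may differ.
titanic = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["Male", "Female", "Male", "Female"],
    "Survived": ["Yes", "No", "No", "Yes"],
})

# Step 3: keep only the passengers who survived (assumes "Yes" marks a survivor).
survivors = titanic[titanic["Survived"] == "Yes"].copy()

# Step 4: replace the exact value "Male" with "Man" in the Sex column only.
survivors["Sex"] = survivors["Sex"].replace("Male", "Man")

print(survivors)
```

The filter runs before the replacement here to mirror the order of the steps; because the two operations touch different columns, the order does not change the result.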

Part 2

1. Download bats_data.csv and Geo_data.csv, which are attached on Blackboard, and save them on your computer.

2. In RapidMiner, do the following:

Step 1 – Open a new process file, read the two data files, and store them in your RapidMiner repository (you can call them Bats Data and Geo Data). Hint: Use the Read CSV and Store operators.

Step 2 – Now that you have read and stored the data in a local RapidMiner repository, you can retrieve it from there. Open another new process file and retrieve the two datasets stored above. Hint: Use the Retrieve operator. You can use two Retrieve operators, one for Bats Data and one for Geo Data.

Step 3 – Replace the missing values in the Foraging column in the Bats Data dataset with 0 (zero). Hint: Use the Replace Missing Values operator as shown in the slides.
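To see what Step 3 should produce, here is a minimal pandas sketch of the same idea. The rows are invented for illustration; only the column names Site and Foraging come from this assignment.

```python
import pandas as pd

# Hypothetical toy rows standing in for Bats Data; one Foraging value is missing.
bats = pd.DataFrame({
    "Site": ["S1", "S2", "S3"],
    "Foraging": [4.0, None, 7.0],
})

# Replace missing values in the Foraging column with 0, leaving other columns alone.
bats["Foraging"] = bats["Foraging"].fillna(0)

print(bats)
```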

Step 4 – Merge the Bats Data dataset with the Geo Data dataset so that every record in Bats Data is retained and gains the Latitude and Longitude columns from Geo Data. Hint: Choose a left or right join depending on whether you connected Bats Data as the left or the right input.
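The join in Step 4 can be sketched in pandas as a left join with Bats Data on the left. The key column is an assumption: this sketch joins on Site, but you should use whichever column the two files actually share. The rows are made up.

```python
import pandas as pd

# Hypothetical toy rows; the join key "Site" is an assumption.
bats = pd.DataFrame({
    "Site": ["S1", "S1", "S2", "S3"],
    "Foraging": [4, 0, 7, 2],
})
geo = pd.DataFrame({
    "Site": ["S1", "S2"],
    "Latitude": [10.0, 20.0],
    "Longitude": [30.0, 40.0],
})

# Left join: every Bats Data row is kept; rows with no match in Geo Data
# get missing values for Latitude and Longitude.
merged = bats.merge(geo[["Site", "Latitude", "Longitude"]], on="Site", how="left")

print(merged)
```

A quick sanity check is that the merged table has the same number of rows as Bats Data. If it has more, the key is duplicated in Geo Data.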

Step 5 – Now find the sum of Foraging for each Site. Hint: Use the Aggregate operator. Click the group by attributes button and choose Site as the group-by column, then click aggregation attributes and choose Foraging as the column and sum as the function.

Step 6 – Go to File > Export Process and save the process you created above as a .rmp file.
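For comparison, the aggregation in Step 5 corresponds to a group-by sum. The pandas sketch below uses invented rows and is meant only to show the shape of the expected result: one row per Site with the total Foraging.

```python
import pandas as pd

# Hypothetical toy rows standing in for the merged data.
bats = pd.DataFrame({
    "Site": ["S1", "S1", "S2"],
    "Foraging": [4, 0, 7],
})

# Group by Site and sum Foraging within each group.
totals = bats.groupby("Site", as_index=False)["Foraging"].sum()

print(totals)
```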