Learning objectives: Gain experience processing text files, building indexes using Python dictionaries, and using regular expressions.
Project specification: For this project, you will build an index of words in a collection of short stories. Specifically, we will index the Project Gutenberg text of Grimm’s' Fairy Tales, by The Brothers Grimm. I have posted a single text file called grimms.txt on the course Sakai site that contains the text of all the stories. Use this file for the project. You do not need to index text from any other Project Gutenberg text. When downloading text files from Sakai, you should use your browser's “Save as” feature to save the file exactly as it is posted rather than trying to cut and paste the text into a new file The grimms.txt file contains short stories (fairy tales) in a single text file. When processing the file, you should programmatically skip over the introductory lines of text that has information about Project Gutenberg and the table of contents. The short stories start after line 124. Each story starts with a blank line followed by the title of the story in all capital letters on a line by itself, followed by another blank line. After this the text of the story begins. One of the stories, “The Adventures of Chanticleer and the Part let,” has two parts. Treat both parts as belonging to one story. I strongly recommend detecting the story titles from the lines after line 124 rather than using the “CONTENTS” on lines 47-115. The story title lines will be in all capital letters, and some titles may have characters such as a dash
Checkpoint #1: Building the Index-Due 5:00pm, Feb 23
Your program should read the grimms.txt file and create an index of all the words that appear in the fairy tales except for those in a provided stop word list. For this assignment, when building the index, you should REMOVE all characters that are not a letter, number, or space character (e.g., [a-zA-20-9 1) In addition, when building the index; you should convert all words to lower-case. You will need to make the same conversions to queries that are entered so that they will match words as you have indexed them. I have posted a file named stopwords.txt on Sakai that contains a list of words (one per line) that should NOT be included in your index. As you are building your index, if you encounter a word in the list of the stop words, you should skip it. Do NOT index words that appear outside the text of the fairy tales (e.g., in the Project Guttenberg sections that appear at the beginning and end of the file, and the table of contents). For each word that you include in your index, you should keep track of the stories and line numbers that contain the word. For example, the word “raven” appears in three stories, on the lines show below. Note that the line numbers refer to line numbers in the grimms.txt file, starting from line 1 at the top of the file. Hint: one method would be to build a dictionary of dictionaries of lists. The outer dictionary keys are the words, the inner dictionary keys are the story titles, and the list is a list of the line numbers where that word appears in that story
THE LITTLE PEASANT: [3894, 3924, 3933, 3936, 3939]
THE RAVEN: [6765, 6767, 6772, 6773, 6785, 6802, 6807, 6820, 6823, 6839]
Stop Words Text:
Thanks for your help! Let me know if you need clarification. I would download each text from the Google docs as a plain text (.txt) file and use that in your code