Learn the basics of large-scale data analysis

Aims: Learn the basics of large-scale data analysis and data mining using MapReduce and Hadoop.

Implement Map and Reduce programs for several useful data analytics problems, and analyse their performance.

Prerequisites: Students attempting this project should be good at programming and comfortable with basic data structures and algorithms. Basic knowledge of statistics and probability will be helpful. Students should also be comfortable with Linux command-line tools and shell scripting.

Background: MapReduce, and its open-source implementation Hadoop, have become a primary tool for analysing massive data sets on large-scale clusters of commodity machines (such as those found in today’s cloud infrastructures). The key to its popularity is a simple programming model that allows developers to implement their analytics logic by providing code for just two primitives: Map and Reduce. The resulting MapReduce program is then passed to the runtime environment, which automatically parallelises its execution on a cluster of machines. Developers can tune the performance of their MapReduce programs by tweaking a number of parameters affecting the degree of parallelism, data locality, and other factors.
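As a rough illustration of this programming model, the Map/Reduce data flow can be simulated on a single machine in a few lines of Python. Note that this is a sketch for intuition only: the actual Hadoop API is Java-based, and `run_mapreduce` and its sorting-based shuffle are illustrative names, not part of any framework.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-machine simulation of the MapReduce data flow:
    apply map_fn to every input record to obtain (key, value) pairs,
    group the pairs by key (the 'shuffle' phase), then apply
    reduce_fn to each key together with its list of values."""
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))   # Map phase
    intermediate.sort(key=itemgetter(0))      # shuffle: bring equal keys together
    return [reduce_fn(key, [value for _, value in group])   # Reduce phase
            for key, group in groupby(intermediate, key=itemgetter(0))]
```

For example, a map function emitting a `(word, 1)` pair per token and a reduce function summing the values turns this into a tiny word counter.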

In this project, you are asked to code MapReduce programs for a number of typical analytics problems of varying degrees of difficulty, deploy them on a Hadoop cluster, analyse their performance, and propose an optimal parameter setting.

To test your programs, you will need to generate sample data sets and/or download them from online sources. Pointers to several such sources will be provided by the advisor. The problems you are asked to solve are as follows:

  1. Word Count: Given a large collection of web pages, find the number of occurrences of each word. This is the “Hello World!” of MapReduce.
  2. Distributed Grep: Given a large collection of web pages, and a regular expression, find all strings matching the regular expression. The output must be a list of tuples each of which consists of a matching string and its location in the input (e.g., file name, and offset or line number).
  3. Basic statistical analysis of a simple data set: Given a large collection of numbers (e.g., representing times to failure of a software component, collected over a large number of experiments), generate a frequency distribution of the numbers at fixed percentiles (e.g., every 10%), find the values of various order statistics, such as the minimum, maximum, and median, and visualise the shape of the distribution.
  4. Implement an advanced data mining algorithm (e.g., K-Means, or a classifier) from [4], apply it to discover interesting features in a realistic data set, and visualise the results. The problem, algorithms, and data sets will be chosen in coordination with the advisor.
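To make the required output of Problem 2 concrete, the map phase of a distributed grep might look as follows in Python; the reduce phase is simply the identity. The function name and exact tuple layout below are illustrative, not prescribed:

```python
import re

def grep_map(filename, line_number, line, pattern):
    """Emit one (location, match) tuple per regular-expression match
    found in a single input line. The location is (file name, line
    number, character offset), matching the output format suggested
    in the problem statement."""
    return [((filename, line_number, m.start()), m.group(0))
            for m in re.finditer(pattern, line)]
```

In a real job, each map task would be handed a split of the input files along with the pattern, and the framework would collect the emitted tuples into the final output.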

Early Deliverables

  1. Solution of Problems 1 and 2, their performance analysis, recommendation for cluster configuration and parameter settings, and result visualisation.
  2. Reports: description of the Map and Reduce algorithms; experience with running your code on a Hadoop cluster; practical problems encountered, and their solutions; experience with performance analysis and parameter tuning; recommendations for optimal parameter settings for each of the analysed problems.

Final Deliverables

  1. Solution for Problems 3 and 4, their performance analysis, recommendation for cluster configuration and parameter settings, and visualisation of the results.
  2. The report will provide a reflective overview of the MapReduce paradigm, and its Hadoop implementation.
  3. The report will describe implementation of the Map and Reduce primitives for all four problems.
  4. The report will discuss and motivate the choice of the data sets for each of the four problems.
  5. If synthetic data sets were used, the report will outline the procedures used to generate the data sets.
  6. The report will reflect on the experience of running, analysing and tuning your implementation on a Hadoop cluster.
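If you opt for synthetic data for Problem 3, one simple generation procedure is to draw times to failure from an exponential distribution, a common (and here merely assumed) failure-time model. The function below is a minimal sketch; the name and parameters are illustrative:

```python
import random

def generate_failure_times(n, mean_time=100.0, seed=42):
    """Draw n synthetic times-to-failure from an exponential
    distribution with the given mean. A fixed seed keeps the data
    set reproducible; writing one value per line then yields a
    Hadoop-friendly plain-text input file."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_time) for _ in range(n)]
```

Whatever distribution you choose, the report should state it explicitly, since the shape of the generated data directly affects the percentile and order-statistic results of Problem 3.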

Suggested Extensions

  • Implement advanced data mining techniques. Apply them to interesting data sets.
  • Talk to machine learning experts in the department to identify open massive-scale analytics problems they are interested in solving, and implement and run them in the Hadoop framework. One such problem is identifying trends in epidemic spread using real-time data supplied by mobile devices. If you are interested in looking at this problem, you should talk to Chris Watkins and Gregory Chockler.


References

  1. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI 2004, pages 137-150, 2004.
  2. Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72-77, 2010.
  3. Jeffrey Dean. Experiences with MapReduce, an abstraction for large-scale computation. In PACT ’06: Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, page 1, New York, NY, USA, 2006. ACM.
  4. Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press. Available online at http://infolab.stanford.edu/~ullman/mmds.html.
  5. MapReduce bibliography by A. Kamil, 2010: http://www.columbia.edu/~ak2834/mapreduce.html.
  6. Hadoop: http://hadoop.apache.org.