
Aims: To implement and compare various machine learning algorithms on benchmark data sets.

Background: Machine learning allows us to write computer programs to solve many complex problems: instead of solving a problem directly, the program can learn to solve a class of problems given a training set produced by a teacher. This project will involve implementing a range of machine learning algorithms, from the simplest to the most sophisticated, and studying their empirical performance using methods such as cross-validation and ROC analysis. In this project you will learn valuable data mining skills prized by employers.

Early Deliverables

  1. You will implement simple machine learning algorithms such as nearest neighbours and decision trees.
  2. The algorithms will be tested using simple artificial data sets.
  3. Report: a description of the 1-nearest neighbour and k-nearest neighbours algorithms, with different strategies for breaking ties.
  4. Report: a description of decision trees using different measures of uniformity.
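As a starting point for the nearest-neighbours deliverable, the following is a minimal sketch of a k-nearest neighbours classifier with one possible tie-breaking strategy (falling back to the single nearest neighbour). The class and method names are illustrative, not part of any required design:

```java
import java.util.*;

// Sketch of a k-nearest neighbours classifier (illustrative names).
// Ties in the majority vote are broken by deferring to the label of
// the single nearest neighbour, i.e. the 1-NN prediction.
public class KNearestNeighbours {
    private final double[][] trainX;
    private final int[] trainY;
    private final int k;

    public KNearestNeighbours(double[][] trainX, int[] trainY, int k) {
        this.trainX = trainX;
        this.trainY = trainY;
        this.k = k;
    }

    private double squaredDistance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    public int predict(double[] x) {
        // Sort training indices by distance to the query point x.
        Integer[] idx = new Integer[trainX.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> squaredDistance(trainX[i], x)));

        // Vote among the k nearest neighbours.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, idx.length); i++) {
            votes.merge(trainY[idx[i]], 1, Integer::sum);
        }
        int best = -1, bestCount = -1;
        boolean tie = false;
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
                tie = false;
            } else if (e.getValue() == bestCount) {
                tie = true;
            }
        }
        // Tie-breaking strategy: defer to the 1-nearest neighbour.
        return tie ? trainY[idx[0]] : best;
    }
}
```

Other tie-breaking strategies worth comparing in the report include decreasing k until the tie disappears, or weighting votes by inverse distance.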
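For the decision-tree deliverable, two standard measures of uniformity (impurity) at a node are entropy and the Gini index; a sketch of both, with illustrative names, might look like:

```java
// Sketch of two node-impurity ("uniformity") measures used when growing
// decision trees. Each takes a vector of per-class example counts at a node.
public class Impurity {
    // Shannon entropy of the class distribution, in bits.
    public static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Gini index: probability that two independent draws from the node
    // belong to different classes.
    public static double gini(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double sumSq = 0;
        for (int c : counts) {
            double p = (double) c / total;
            sumSq += p * p;
        }
        return 1 - sumSq;
    }
}
```

Both measures are zero for a pure node and maximal for a uniform class distribution, but they can rank candidate splits differently; the report can compare the trees each produces.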

Final Deliverables

  1. The implemented algorithms may include nearest neighbours using kernels and multi-class support vector machines.
  2. The algorithms will be studied empirically on benchmark data sets such as those available from the UCI data repository and the Delve repository.
  3. For many of these data sets, judicious preprocessing (including normalisation of attributes or examples) will be essential.
  4. The performance of all algorithms will be explored using a hold-out test set and cross-validation.
  5. The overall program will have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
  6. Ideally, it will have a graphical user interface.
  7. The report will describe: the theory behind the algorithms.
  8. The report will describe: the implementation issues necessary to apply the theory.
  9. The report will describe: the software engineering process involved in generating your software.
  10. The report will describe: computational experiments with different data sets and parameters.
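For the evaluation deliverable, one common way to organise cross-validation is to assign each example a fold index and then hold out one fold at a time. A minimal sketch (illustrative names; a hold-out test set is just the special case of a single split):

```java
import java.util.Random;

// Sketch of k-fold cross-validation fold assignment (illustrative names).
// foldOf[i] gives the fold (0..k-1) of example i; training on all folds
// but one and testing on the held-out fold, repeated k times, yields the
// cross-validation estimate of performance.
public class CrossValidation {
    public static int[] assignFolds(int nExamples, int k, long seed) {
        int[] foldOf = new int[nExamples];
        // Deal examples to folds round-robin so fold sizes differ by at most 1.
        for (int i = 0; i < nExamples; i++) foldOf[i] = i % k;
        // Shuffle the assignments so folds are random rather than systematic.
        Random rng = new Random(seed);
        for (int i = nExamples - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = foldOf[i];
            foldOf[i] = foldOf[j];
            foldOf[j] = tmp;
        }
        return foldOf;
    }
}
```

Setting k equal to the number of examples gives the leave-one-out procedure mentioned under the suggested extensions.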

Suggested Extensions

  • Modifications of known algorithms.
  • Dealing with machine learning problems with asymmetrical errors (such as spam detection) and ROC analysis.
  • Nontrivial adaptations of known algorithms to applications in a specific area, such as medical diagnosis or option pricing.
  • Exploring the cross-validation procedure, in particular the leave-one-out procedure. What is the optimal number of folds?
  • Comparative study of different strategies of reducing multi-class classifiers to binary classifiers (such as one-against-one, one-against-the-rest, coding-based).
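For the ROC-analysis extension, a useful summary statistic is the area under the ROC curve (AUC), which equals the fraction of (positive, negative) pairs that the classifier's scores rank correctly, counting ties as half. A sketch with illustrative names:

```java
// Sketch of ROC AUC computed via its pairwise-ranking interpretation:
// the fraction of (positive, negative) example pairs where the positive
// example receives the higher score, with ties counted as 1/2.
public class RocAnalysis {
    public static double auc(double[] scores, boolean[] isPositive) {
        double pairs = 0, correct = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!isPositive[i]) continue;
            for (int j = 0; j < scores.length; j++) {
                if (isPositive[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) correct++;
                else if (scores[i] == scores[j]) correct += 0.5;
            }
        }
        return pairs == 0 ? Double.NaN : correct / pairs;
    }
}
```

The quadratic pairwise loop keeps the sketch short; sorting the scores first gives an equivalent O(n log n) computation for large data sets.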

Reading

  • Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Second edition. Springer, New York, 2009.
  • Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
  • Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Second edition. Springer, New York, 2000.
  • Vladimir N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
  • Seth Hettich and Steven D. Bay. The UCI KDD Archive. University of California, Department of Information and Computer Science, Irvine, CA, 1999, http://kdd.ics.uci.edu.
  • Delve archive. http://www.cs.toronto.edu/~delve.

Prerequisites: Java or C++ programming skills. In particular, students should be comfortable programming data structures such as trees. CS3920 is highly recommended. This project would suit the specialised AI degree.