implement a popular machine learning technique

Aims: The aim of the project is to implement a popular machine learning technique called ridge regression, apply it to real-world data, analyse its performance, and compare it against other methods.

Background:

Regression algorithms play an important role in machine learning. They allow us to learn dependencies between multidimensional attributes and continuous outcomes. Regression algorithms can be applied to different domains.

An example is the Boston housing problem. The Boston housing database consists of records describing houses in Boston (to be precise, average houses in certain neighbourhoods). For each house, the values of attributes such as the number of rooms, distance from the city centre, quality of schools etc are known, and the price, as evaluated by an expert estate agent, is given. The goal of the learner is to determine the dependency between attributes and prices and to predict the price of a house from the values of its attributes. The program is first shown a number of training examples so that it can learn the dependency, and is then tested on test examples. Ridge regression is known to perform well on this dataset.

Another example is time series data. A significant amount of real-world data is temporal in nature: the values of the variables are sampled at multiple points over some time period, so the data stored for each entity has an additional ‘time’ dimension. Such data are called time series. Examples of time series are the daily closing value of the FTSE 100 index and network traffic measurements on a router. Time series forecasting is the use of a model to predict future values from previously observed values, and regression algorithms can be used for this.
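One common way to apply a regression algorithm to a time series is to treat the most recent values as attributes and the next value as the outcome. The sketch below illustrates this under assumptions: the function name `make_lagged_examples`, the parameter `n_lags` and the toy series are hypothetical and not part of the project specification.

```python
def make_lagged_examples(series, n_lags):
    """Turn a 1-D series into (attributes, outcome) pairs.

    Each example uses the previous n_lags values as attributes
    and the value that follows them as the outcome.
    """
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(list(series[t - n_lags:t]))  # the n_lags values before time t
        y.append(series[t])                   # the value to be predicted
    return X, y


# Hypothetical toy series, not real index or traffic data.
toy_series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X, y = make_lagged_examples(toy_series, n_lags=3)
```

The resulting pairs can be passed to any regression algorithm. The number of lags is a parameter that would itself need to be chosen.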

Applying a learning algorithm to data is not a straightforward task. One needs to preprocess the data and choose parameters to optimise the performance. The algorithms need to be implemented so that they run fast and do not suffer from numerical error. Making sense of the results and comparing different algorithms in a fair and meaningful way can be tricky. Does one algorithm outperform another consistently, or is it a mere coincidence?
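As an illustration of the preprocessing step, the sketch below standardises each attribute to zero mean and unit variance. It is one possible choice of preprocessing, not a method prescribed by the project. The scaling constants are computed from the training examples only and then reused on the test examples, so that no information about the test set reaches the learner. The function names and the toy data are hypothetical.

```python
import math


def fit_standardiser(train_rows):
    """Compute per-attribute means and standard deviations on training data."""
    n = len(train_rows)
    n_cols = len(train_rows[0])
    means = [sum(row[j] for row in train_rows) / n for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((row[j] - means[j]) ** 2 for row in train_rows) / n
        # A constant attribute has zero variance; use 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def apply_standardiser(rows, means, stds):
    """Scale rows using constants that were fitted on the training set."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy data: two attributes, the second one constant.
train = [[1.0, 10.0], [3.0, 10.0]]
test = [[2.0, 10.0]]

means, stds = fit_standardiser(train)
train_scaled = apply_standardiser(train, means, stds)
test_scaled = apply_standardiser(test, means, stds)
```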

Early Deliverables

  1. Report: An overview of ridge regression describing the concepts ‘training set’ and ‘test set’, giving the formulas for a regression algorithm and defining all terms in them.
  2. Proof of concept program: a regression algorithm applied to a small artificial dataset.
  3. Report: Examples of applications of Ridge Regression worked out on paper and checked using the prototype program.
  4. Loading a real-world dataset, simple pre-processing and visualisation.
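A proof-of-concept program of the kind listed above might look like the following sketch. It assumes the standard closed-form ridge regression solution w = (X'X + aI)^(-1) X'y and solves the linear system by Gaussian elimination in plain Python. The function names, the artificial dataset and the values of the ridge parameter `a` are illustrative assumptions. The sketch also penalises the intercept weight for simplicity, which a real implementation may prefer to avoid, and a numerical library would normally replace the hand-written solver.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy so the inputs are left untouched.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, y, a):
    """Return w minimising a*||w||^2 + sum_i (y_i - w.x_i)^2."""
    d = len(X[0])
    # Build X'X + aI and X'y.
    A = [[sum(row[j] * row[k] for row in X) + (a if j == k else 0.0)
          for k in range(d)] for j in range(d)]
    b = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(d)]
    return solve_linear_system(A, b)


def ridge_predict(w, x):
    """Predict the outcome for one example x."""
    return sum(wj * xj for wj, xj in zip(w, x))


# Artificial data generated from y = 2*x + 1. The constant 1.0 in each
# row acts as an intercept attribute (an assumption of this sketch).
X_train = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y_train = [1.0, 3.0, 5.0, 7.0]

w_small = ridge_fit(X_train, y_train, a=1e-9)   # almost ordinary least squares
w_large = ridge_fit(X_train, y_train, a=100.0)  # heavily shrunk weights
```

With a very small `a` the fitted weights are close to the generating values 2 and 1, while a large `a` shrinks them towards zero. Small cases like this one can also be worked out on paper and checked against the program.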

Final Deliverables

  1. The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
  2. The program will work with a real-world dataset, read the data from a file, preprocess the data and apply regression using parameters entered by the user.
  3. The program will have a graphical user interface.
  4. The program will automatically perform tests such as comparison of different kernels, parameters, etc and visualise the results.
  5. The program will implement another learning algorithm such as nearest neighbours or neural networks, and compare regression against it.
  6. The report will describe the theory of regression and derive the formulas.
  7. The report will describe the implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
  8. The report will describe the software engineering process involved in generating your software.
  9. The report will describe computational experiments with different kernels and parameters and draw conclusions.
  10. The report will describe the competitor algorithm(s), compare the performance and draw conclusions.
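For orientation, the formulas that the report is expected to derive usually take the following standard form. The notation is an assumption of this sketch and is not fixed by the project description: X is the matrix whose rows are the training examples x_1, ..., x_n, y is the vector of outcomes, a > 0 is the ridge parameter and \mathcal{K} is a kernel.

```latex
% Primal form: minimise the penalised sum of squared errors
\min_{w}\; a\,\|w\|^2 + \sum_{i=1}^{n} \bigl(y_i - w^\top x_i\bigr)^2
\quad\Longrightarrow\quad
w = \bigl(X^\top X + aI\bigr)^{-1} X^\top y

% Dual (kernel) form: prediction for a new example x
\hat{y}(x) = y^\top \bigl(K + aI\bigr)^{-1} k(x),
\qquad
K_{ij} = \mathcal{K}(x_i, x_j),
\qquad
k(x)_i = \mathcal{K}(x_i, x)
```

With the linear kernel \mathcal{K}(u, v) = u^\top v the two forms give the same predictions. Other kernels change only K and k(x).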

Suggested Extensions

  • Use of regression in the on-line mode with growing and sliding window.
  • Application of regression to different datasets (including those found by the student).
  • Implementation of regression-type algorithms such as Kernel Aggregating Algorithm Regression, Kernel Aggregating Algorithm Regression with Changing Dependencies etc and computational experiments with them.
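The difference between the growing and sliding window in the on-line mode can be sketched as follows. At each step the learner predicts the next example using only earlier ones: all of them with a growing window, or only the most recent few with a sliding window. The function name `online_windows` and its parameters are hypothetical, and the sketch only produces example indices. Fitting and prediction are left to whichever regression algorithm is being studied.

```python
def online_windows(n_examples, mode, window=None, min_train=1):
    """Yield (train_indices, test_index) pairs for on-line prediction.

    At step t the learner predicts example t using only earlier examples:
    all of them in 'growing' mode, the last `window` in 'sliding' mode.
    """
    if mode not in ("growing", "sliding"):
        raise ValueError("mode must be 'growing' or 'sliding'")
    if mode == "sliding" and (window is None or window < 1):
        raise ValueError("sliding mode needs a positive window size")
    for t in range(min_train, n_examples):
        start = 0 if mode == "growing" else max(0, t - window)
        yield list(range(start, t)), t


growing = list(online_windows(5, "growing"))
sliding = list(online_windows(5, "sliding", window=2))
```

A sliding window discards old examples, which may help when the dependency between attributes and outcomes changes over time. A growing window keeps all the evidence seen so far.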

Prerequisites:

  • Taking CS3920 (Computer Learning) in the 3rd year is recommended.
  • Interest in numerical methods.
  • Good command of Java, C++, or Python.

Suitable for CS with AI students.