associated preprocessing technique

Compression and the Burrows-Wheeler transform

Aims: To implement lossless compression techniques along with preprocessing by the Burrows-Wheeler transform.

Background: This project is concerned with data compression and associated preprocessing techniques. Data compression, or source coding, involves encoding information using fewer bits than the original file or representation. Compression can be either lossy or lossless, and is applied in order to reduce resources such as storage or transmission capacity.

The Burrows-Wheeler Transform is a text transformation scheme that has found applications in various aspects of the data explosion problem, from data compression to indexing. Specifically, the transform is a permutation of an input string which often clusters occurrences of the same letter in the input into `runs’ of the letter. Hence it is useful as a preprocessing step in compression methods like run-length encoding.

A string of letters over an alphabet is a Lyndon word if it is the least in lexicographic (dictionary) order amongst the set of its rotations. For example, “data” is not a Lyndon word since it is not lexicographically least in the set {data, atad, tada, adat} whereas “adat” is a Lyndon word. Similarly, “computer” is a Lyndon word while “science” is not. An important technique for handling string computations is to split an input string into substrings, each with a common pattern. One such method is the Lyndon factorisation, where each substring is a Lyndon word. Efficiencies are achieved with the Burrows-Wheeler transform, by first factoring the input string into Lyndon words, and then applying the transform. It is also known that representing a string as a suffix tree can improve the speed of the Burrows-Wheeler transform.

In this project you will study lossy and lossless compression techniques. You will study relevant Shannon information theory such as entropy encoding, and some of the classic compression algorithms such as Huffman coding including their complexities. You will implement lossless compression algorithms applied to textual data, and preprocessing techniques.

Benchmarking will be performed in order to compare compression ratios: between different encoding methods, plus with and without preprocessing . You will develop a GUI suitable for importing text files; the GUI will incorporate HCI (human computer interaction) concepts.

Early Deliverables

  1. Report on classic lossless algorithms: Implement: run-length encoding; move-to-front encoding; Huffman encoding.
  2. Program: GUI built with buttons etc.
  3. Program: Efficient Suffix Tree implementation
  4. Program: Naive implementation of Burrows-Wheeler
  5. Report: How does Burrows-Wheeler work with an emphasis on writing and understanding a proof that the reverse transform works.
  6. Report: Shannon information theory and lossy v lossless compression; complexity theory; pseudo-code of algorithms implemented.
  7. Reports: Complexity, NP hardness and the big O notation
  8. Report: The preprocessing phase for compression.
  9. Program: Implement an efficient Burrows-Wheeler Transform using Suffix Trees.
  10. Program: Apply the Lyndon factorisation to an input string, and apply the Burrows-Wheeler transform to each of the Lyndon factors.

Final Deliverables

  1. The program will have a Graphical User Interface with which to input text files and display compression ratios for compressed files.
  2. The Lyndon factorisation and Burrows-Wheeler transformation of a string will be displayed.
  3. The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
  4. The report will describe the software engineering process involved in generating your software, interesting programming techniques and data structures used (or able to be used) on the project.
  5. The report will describe the complexity analysis of your computations.
  6. The report will include some results of benchmarking compression algorithms with a discussion of their meanings.
  7. The report will include discussion of applications of the Burrows-Wheeler transform and Lyndon factorisation.
  8. The report will include a User Manual.

Suggested Extensions

  • Build a mobile Burrow-Wheeler application: this will require use of Worker Threads.
  • Show the output using an effective JTree with callbacks to enable user interaction.
  • Animate the Burrows-Wheeler in order to help explain its working to a student unfamiliar with the algorithm.
  • Adding Digital Signatures to compressed files.

Prerequisites: Mathematical aptitude, in particular combinatorics, and competent programming skills to include handling a variety of data structures.

Reading

  • D. Adjeroh, T. Bell, A. Mukherjee, The Burrows-Wheeler transform: data compression, suffix arrays, and pattern matching, Springer (2008) 352 pp.
  • M. Burrows and D. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation (1994).
  • M. Salson, T. Lecroq, M. Leonard and L. Mouchard, A four-stage algorithm for updating a Burrows-Wheeler transform, Theoret. Comput. Sci. 410 (43) (2009) 4350-4359 doi:10.1016/j.tcs.2009.07.016.
  • http://en.wikipedia.org/wiki/Burrows-Wheeler transform
  • M. Lothaire, Combinatorics onWords, Addison-Wesley, Reading, MA, 1983; 2nd Edition, Cambridge University Press, Cambridge, 1997.
  • J. P. Duval, Factorising words over an ordered alphabet, J. Algorithms 4 (1983) 363-381.
  • http://en.wikipedia.org/wiki/Data_compression