Midterm Project

Kolmogorov Complexity Midterm Project

Updates

February 6
- Modified the readable weather data (removed some empty lines). A new packed file was created from this modified readable file.
- Corrected file sizes of future weather forecasts. Added a fourth (very regular) weather prediction for february.
February 5
- Fixed broken link for unkown gene file
- Made some small changes to future weather forecasts
- Changed due date: February 24th

Please let me know about any other broken links or file discrepancies.

The Rules

You will find below various files containing different types of data. This project takes the form of a contest: working in groups you will try to maximally compress each data file. The group which compresses the most wins, and receives the highest grade. All members of a group receive the same grade.

You are allowed to use any available compression algorithm (from the internet or other sources), or you can write your own. Even if you do not win the contest you can still receive a high grade through a nice presentation and defense of your results. The contest begins now, and you will present your results in class on February 24th. No late projects will be accepted!

The Presentation

Presentation of your results should include the following:

Initial guesses of the expected size of the compressed files, and brief defenses (theoretical or intuitive) of these guesses.
A short description of the algorithm behind the compression program which you found to give the best results for each type of file. If you use the algorithm which writes Romeo and Juliet on empty input then we will count the size of the algorithm plus program.
Finally, the actual compressed files, a nicely formatted presentation of the sizes of these compressed files compared to the originals, and some evidence that your compression program reconstructs the original file from the compressed file.

We suggest, if possible, to create a website to present your results. The groups will also give a short (15 minute) presentation in class on monday the 24th discussing which compression algorithms they tried on the data sets and how these algorithms fared.

A Word About Encoding

Most of the files below are given in two forms, one which is nice to read and one which is "packed". In the readable file, for example with the gene sequences, you will see on one line a sequence consisting of letters from the alphabet {a,g,t,c}, representing the actual sequence of nucleic acids in the gene. This file is very easy to compress by a factor of 4, because only 2 bits (4 possibilities) are used for each byte=8 bits. This compression, however, does not capture anything interesting about possible patterns present in the gene sequence, and might dominate the compression which could be observed from these more subtle relationships. Therefore we also make available a "packed" version of the sequences, using the full ascii code-- ie each possible block of 4 letters on the alphabet {a,g,t,c} is mapped to an ascii character. Experiment with compressing both files, but trying to compress the packed file should be more interesting and telling about the nature of the file.

Gene Sequences

Human

readable file
- file size 16569 bytes.
packed file
- file size 4143 bytes.

Mouse

readable file
- file size 16295 bytes.
packed file
- file size 4074 bytes.

Bonus: Do you think the animal belonging to this unknown sequence is more related to a mouse or a man?

readable file
- file size 16364 bytes.
packed file
- file size 4091 bytes.

Shakespeare

First three scenes of Romeo and Juliet.

Formatted Version If you want some nice reading to decompress before compression.
Unformatted Version Here the first three scenes of Romeo and Juliet are given on one line. End of line markers, stage directions, and character labels have been removed. The size of this file is 18352 bytes.

Weather Data

Almost 20 years of mean daily temperatures in Christmas Valley, Oregon, US, from April 25, 1985 to January 30, 2003.

Readable File
- file size 19319 bytes
Packed File
- file size 6442 bytes

Bonus Question: What weather should the residents of Christmas Valley expect in the month of February 2003? Choose the February forecast you think is most likely.

Forecast 1

Readable File
- file size 90 bytes
Packed File
- file size 30 bytes

Forecast 2

Readable File
- file size 90 bytes
Packed File
- file size 30 bytes

Forecast 3

Readable File
- file size 93 bytes
Packed File
- file size 31 bytes

Forecast 4

Readable File
- file size 90 bytes
Packed File
- file size 30 bytes

An Executable

We know that the shortest program x* to compute x is itself not compressible by more than a constant. What about the executables of programs in every day use? The following is found in almost everyone's /bin

executable
- file size 63488 bytes

A Random Sequence

A file of 17000 bytes created from linux's /dev/random.