Kolmogorov Complexity Midterm Project


Updates

Please let me know about any other broken links or file discrepancies.

The Rules

You will find below various files containing different types of data. This project takes the form of a contest: working in groups you will try to maximally compress each data file. The group which compresses the most wins, and receives the highest grade. All members of a group receive the same grade.

You are allowed to use any available compression algorithm (from the internet or other sources), or you can write your own. Even if you do not win the contest you can still receive a high grade through a nice presentation and defense of your results. The contest begins now, and you will present your results in class on February 24th. No late projects will be accepted!

The Presentation

Presentation of your results should include the following: We suggest, if possible, to create a website to present your results. The groups will also give a short (15 minute) presentation in class on monday the 24th discussing which compression algorithms they tried on the data sets and how these algorithms fared.

A Word About Encoding

Most of the files below are given in two forms, one which is nice to read and one which is "packed". In the readable file, for example with the gene sequences, you will see on one line a sequence consisting of letters from the alphabet {a,g,t,c}, representing the actual sequence of nucleic acids in the gene. This file is very easy to compress by a factor of 4, because only 2 bits (4 possibilities) are used for each byte=8 bits. This compression, however, does not capture anything interesting about possible patterns present in the gene sequence, and might dominate the compression which could be observed from these more subtle relationships. Therefore we also make available a "packed" version of the sequences, using the full ascii code-- ie each possible block of 4 letters on the alphabet {a,g,t,c} is mapped to an ascii character. Experiment with compressing both files, but trying to compress the packed file should be more interesting and telling about the nature of the file.

Gene Sequences

Human

Mouse

Bonus: Do you think the animal belonging to this unknown sequence is more related to a mouse or a man?


Shakespeare

First three scenes of Romeo and Juliet.

Weather Data

Almost 20 years of mean daily temperatures in Christmas Valley, Oregon, US, from April 25, 1985 to January 30, 2003. Bonus Question: What weather should the residents of Christmas Valley expect in the month of February 2003? Choose the February forecast you think is most likely.

Forecast 1

Forecast 2

Forecast 3

Forecast 4


An Executable

We know that the shortest program x* to compute x is itself not compressible by more than a constant. What about the executables of programs in every day use? The following is found in almost everyone's /bin

A Random Sequence

A file of 17000 bytes created from linux's /dev/random.