Kolmogorov Complexity Midterm Project
Updates
- February 6
- Modified the readable weather data (removed some empty lines). A
new packed file was created from this modified readable file.
- Corrected file sizes of future weather forecasts. Added a fourth
(very regular) weather prediction for february.
- February 5
- Fixed broken link for unkown gene file
- Made some small changes to future weather forecasts
- Changed due date: February 24th
Please let me know about any other broken links or file discrepancies.
The Rules
You will find below various files containing different types of data.
This project takes the form of a contest: working in groups you will try
to maximally compress each data file. The group which compresses
the most wins, and receives the highest grade. All members of a group
receive the same grade.
You are allowed to use any available compression algorithm (from the
internet or other sources), or you can write your own. Even if you do not
win the contest you can still receive a high grade through a nice presentation
and defense of your results. The contest begins now, and you will
present your results in class on February 24th. No late projects will be
accepted!
The Presentation
Presentation of your results should include the following:
- Initial guesses of the expected size of the compressed files,
and brief defenses (theoretical or intuitive) of these guesses.
- A short description of the algorithm behind the compression program
which you found to give the best results for each type of file. If you
use the algorithm which writes Romeo and Juliet on empty input then we will
count the size of the algorithm plus program.
- Finally, the actual compressed files, a nicely formatted presentation
of the sizes of these compressed files compared to the originals, and
some evidence that your compression program reconstructs the original file
from the compressed file.
We suggest, if possible, to create a website to present your results. The
groups will also give a short (15 minute) presentation in class on monday
the 24th discussing which compression algorithms they tried on the data sets
and how these algorithms fared.
A Word About Encoding
Most of the files below are given in two forms, one which is nice to read and one
which is "packed". In the readable file, for example with the gene sequences, you will
see on one line a sequence consisting of letters from the alphabet {a,g,t,c},
representing the actual sequence of nucleic acids in the gene. This file is very
easy to compress by a factor of 4, because only 2 bits (4 possibilities) are used
for each byte=8 bits. This compression, however, does not capture anything interesting
about possible patterns present in the gene sequence, and might dominate the
compression which could be observed from these more subtle relationships. Therefore
we also make available a "packed" version of the sequences, using the full ascii code--
ie each possible block of 4 letters on the alphabet {a,g,t,c} is mapped to an ascii
character.
Experiment with compressing both files, but trying to compress the packed file should
be more interesting and telling about the nature of the file.
Gene Sequences
Human
Mouse
Bonus: Do you think the animal belonging to this unknown sequence is more
related to a mouse or a man?
Shakespeare
First three scenes of Romeo and Juliet.
- Formatted Version If you want some nice
reading to decompress before compression.
- Unformatted Version Here the first three
scenes of Romeo and Juliet are given on one line. End of line markers, stage
directions, and character labels have been removed. The size of this file is
18352 bytes.
Weather Data
Almost 20 years of mean daily temperatures in Christmas Valley, Oregon, US,
from April 25, 1985 to January 30, 2003.
Bonus Question: What weather should the residents of Christmas Valley expect
in the month of February 2003? Choose the February forecast you think is
most likely.
Forecast 1
Forecast 2
Forecast 3
Forecast 4
An Executable
We know that the shortest program x* to compute x is itself not compressible
by more than a constant. What about the executables of programs in every
day use? The following is found in almost everyone's /bin
A Random Sequence
A file of 17000 bytes created from linux's
/dev/random.