COMS 6998 Search Engine Technology ===================================== HW 0 Introduction to UNIX, Perl, and Clairlib Instructor: Dragomir Radev TAs: Archiman Dutta (ad2839@columbia.edu) & Abhishek Srivastava (aas2234@columbia.edu) Contact: TBD Due: January 26th, 2011 The purpose of this assignment is to get you acquainted with some concepts that are important for the course. UNIX Files and Directories =========================== All data in UNIX are stored in files. Files in UNIX are organized into directories. A "directory" is a location where files are kept in a list. When you login to a Unix system, you are in your home directory. Your home directory is the directory assigned to you to store all your files. UNIX has commands that can be used to handle files and directories. Some of them are listed below: - The "ls" command The Unix command ls lists the contents of the current working directory. - The "mkdir" command The Unix command mkdir creates new directories. - The "cd" command The Unix command cd moves around from one directory to another. - The "pwd" command The Unix command pwd displays the current working directory. - Wildcards You can use the wildcard character "*" to list all of the files matching a certain filename pattern. For example "ls *.html" lists all files ending with ".html". A list of UNIX tutorials: http://www2.ocean.washington.edu/unix.tutorial.html http://www.doc.ic.ac.uk/~wjk/UnixIntro/ http://www.fsid.cvut.cz/cz/U201/linux.html http://hven.swarthmore.edu/~burns/unix.html Perl ==== Perl is a high-level, general-purpose, interpreted programming language. Perl is best known for its text processing capabilities like dealing with files, strings, and regular expressions. A list of Perl tutorials: http://cslibrary.stanford.edu/108/EssentialPerl.html http://www.perl.com/pub/a/2000/10/begperl1.html http://www.comp.leeds.ac.uk/Perl/start.html http://www.cbkihong.com/download/perltut.pdf http://www.tizag.com/perlT/ Assignment ========== In this assignment, you are required to write a few Perl scripts that compute the term frequency and document frequency of each word in a set of documents. The sample files are in /home/cs6998/tfnew (NOTE: you will need a CRF account to access these files. Kindly logon to your CRF account either remotely: clic.cs.columbia.edu or by accessing the machines in the CLIC lab and use the above path to get the files More information at http://www.cs.columbia.edu/crf/) 1- Term Frequency The term frequency of a given word in a given document is the number of times the term (word) occurs in the document. Write a Perl script (tf) that calculates the term frequencies of all terms in a given document. Your script should contain the following steps: - Read the text file. - Split the file into words. - Change all words to lower case. - Count the number of times each word occurs. 2- Document Frequency The document frequency of a given term is the number of document that contain that term. Write a Perl script (build-df.pl) that calculates the term frequencies of all terms in a given set of document. Your script should contain the following steps: - Initialize the document frequency of each term to 0. - For each text file: - Read the text file. - Split the file into words. - Change all words to lower case. - Increment the document frequency of each word. - Writes all document frequencies to a dbm file 3- Inverse Document Frequency The inverse document frequency is a measure of the general importance of the term. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the log. Write a Perl script (idf.pl) that calculates the inverse document frequencies of all terms in a given set of documents. Your script should contain the following steps: - Read the dbm file containing the document frequencies - Compute and display the idf of each term You may find the following Perl functions and structures useful: - The function "split" Splits up a string and places it into an array. The function is used as follows: $str = "This is an example"; @words = split(/ /, $str); which has the same overall effect as @words = ("This", "is", "an", "example"); - The function "lc" Takes a string, convert it to lowercase, and then returns the new string. - Perl Hashes Hashes are an advanced form of arrays. Arrays have the limitation that they can only be accessed by integer indices. For example, imagine that you have a list of students and their grades. The hash can represent this data very neatly by allowing us to access that grades not by an index, but by a scalar key. For example, we may use the student name as the key to get their grades. Read this Howto to find information about creating, modifying, and accessing hashes: http://www.cs.mcgill.ca/~abatko/computers/programming/perl/howto/hash/ - Perl DBM Files DBM files provide an implementation for a light database, with an easy API, using simple key-value pairs to store and manipulate a relatively small number of records. To write/read data to/from a dbm file: - Bind a DBM file to a hash object using dbmopen(). - Wrie/Read data to/from the hash object. - close the dbm file using dbmclose(). For more information about perl and dbm files, read the following tutorial: http://www.herongyang.com/Perl/dbmopen.html Other Guidelines: - Use the 100 text files at /home/cs6998/tfnew. - You should not use any external modules (even clairlib) for this assignment. - It might be helpful to put the common functions (split words, convert to lowercase, etc.) in a single Perl module (Util.pm). - There are different variants of the formula to use for calculating IDF. - The specs are underdefined with respect to tokenization and a number of other issues. - Create a directory called "tfidf" under your home direcotry and put all your code and data under this direcotry. - You can also use simple UNIX commands to find the tf/df of any term. To find the tf of the word "year" in document 002.txt: > cat 002.txt | tr -cs '[A-Za-z]' '\n' | sort | uniq -c | grep -w "year" To find the df of the word "year": > grep -iwo -m1 "year" *.txt | wc -l - Your output should match the following sample output: % perl tf.pl 001.txt department 2 stemming 1 repeating 1 date 1 which 1 come 1 para 1 judgment 1 two 4 if 2 ... % perl build-idf.pl [no output but it should write all frequencies to a dbm file] % perl idf.pl a 96 0.04 aa 3 3.51 ababa 1 4.61 abbas 1 4.61 abbott 2 3.91 abbott~s 2 3.91 abbreviated 3 3.51 abc 2 3.91 abdel 1 4.61 abdelaziz 1 4.61 ... Clairlib ======== Clairlib is a Perl library developed by the clair group at the University of Michigan that handles a number of NLP and IR tasks. It will significantly simplify your future projects if you call the functions in clairlib instead of implementing everything on your own from scratch! (For this pre-assignment, don't use clairlib). To use clairlib on your CS account: - Set the following environment variables. (You can add these lines to your ~/.profile file so you don't have to run them every time.) export PERL5LIB=/home/cs6998/clairlib/clairlib-fall07/clairlib-core/lib:/home/cs6998/clairlib/perl-fall07:/home/cs6998/clairlib/perl-fall07/lib/perl5/site_perl/5.8.5 export PATH=$PATH:~cs6998/clairlib/clairlib-fall07/clairlib-core/util - Run the scripts from the tutorial (at http://www.clairlib.org/index.php/Utilities_Tutorial). Other useful Clairlib links: http://www.clairlib.org/index.php/Documentation http://www.clairlib.org/clair/clairlib/clTut.pdf http://www.clairlib.org/clair/clairlib/pdoc Assignment Solution ========= # Util.pm package Util; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(split_words uniq count_hash lc_words); sub lc_words { my @res = (); foreach my $w (@_) { push @res, lc($w); } return @res; } sub split_words { my $text = shift; $text =~ s/[^A-Za-z\']/ /g; $text =~ s/(\w)\'(\w)/$1~$2/g; $text =~ s/\'/ /g; my @words = split /\s+/, $text; return @words; } sub count_hash { my @words = @_; my %hash = (); foreach my $w (@words) { if (exists $hash{$w}) { $hash{$w}++; } else { $hash{$w} = 1; } } return %hash; } sub uniq { my @in = sort @_; my @out = (); my $last = ""; foreach my $w (@in) { next if $w eq $last; push @out, $w; $last = $w; } return @out; } ---------------------------------------- #tf.pl #!/usr/local/bin/perl use Util; $file = shift; $text = `cat $file`; my @words = split_words ($text); my @words = lc_words (@words); my %count = count_hash(@words); while (my ($w, $c) = each %count) { print "$w\t$c\n"; $df{$w} = 1; } ---------------------------------------- #build-df.pl #!/usr/local/bin/perl use Util; my %df; dbmopen %df, "~/df", 0666; %df = (); @tfiles = `ls *.txt`; $num_docs = $#tfiles + 1; print "ND = $num_docs\n"; $df{NB_DOCS} = $num_docs; foreach $file (@tfiles) { print "FILE = $file"; $text = `cat $file`; my @words = split_words ($text); my @words = lc_words (@words); my %count = count_hash(@words); while (my ($w, $c) = each %count) { $df{$w}++; } } ---------------------------------------- #idf.pl #!/usr/local/bin/perl use Util; my %df; dbmopen %df, "df", 0666; $num_docs = $df{NB_DOCS}; my @words = sort keys %df; foreach $w (@words) { unless ($w eq "NB_DOCS") { $idf = -log($df{$w}/$num_docs); printf "$w\t$df{$w}\t%6.2f\n",$idf; } }