Instructions for Threader: Email Thread Summarizer
Contents:
Procedures
- Notes: This documentation assumes that
you have copied ~swan1/projects to your own directory. In this
document, the top-level directory of the Threader code is referred to
as ~user/projects/Threader, where ~user
is your home directory. See "The Threader Makefile Environment" under
Documentation below for more details. A link to
a tarball of Threader will be added shortly.
- Initiating 3rd party servers: KDD and MySQL
- Assuming you have already installed the KDD Server and
MySQL, make sure your MySQL
server is running in the background. For more documentation on MySQL,
see the CRF website (http://www.cs.columbia.edu/crf/mysql). For
documentation on the use of KDD and
MySQL in Threader, see Dependencies below.
- Running Threader from the prompt
(ADD THIS TO
JAVADOC)
- Summarizer.java: the old test executable, which I used for testing
various issue detection strategies.
- TestSummarizer.java: the main executable to run from the prompt;
it iterates through the test corpus.
- SummarizerObject.java: called from the web servlet to
construct a summary.
- ThreadSummaryObject.java: called by TestSummarizer and SummarizerObject
to do the work of constructing a nicely formatted summary. Note that
you may not want to use this class when running tests, as it contains
an "oracle" issue detector that applies various issue detection
strategies and picks the "best" one.
- Utilities:
- PrintSentences.java: prints sentences from corpus
- CollectCorpusStats.java: collects term frequency descriptive stats
- Test.java: tests that the connection to the MySQL database housing
the corpus works
- <ADD MORE TEXT>
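A connection check of the kind Test.java performs could look roughly like the sketch below. This is not the actual Test.java; the class name, host, port, database name, and credentials are placeholders chosen for illustration, and a MySQL JDBC driver is assumed to be on the classpath at run time.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical sketch of a database connection check (not Threader's Test.java).
final class MySqlConnectionCheck {

    // Builds a JDBC URL for a MySQL database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    // Returns true if a connection can be opened and validated within 5 seconds.
    // Returns false if no driver is available or the server cannot be reached.
    static boolean canConnect(String url, String user, String password) {
        try (Connection conn = DriverManager.getConnection(url, user, password)) {
            return conn.isValid(5);
        } catch (SQLException e) {
            System.err.println("Connection failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder values: substitute your own host, database, and account.
        String url = jdbcUrl("localhost", 3306, "corpus");
        System.out.println(canConnect(url, "user", "password") ? "connected" : "not connected");
    }
}
```

If the check reports a failure, first confirm that the MySQL server is running in the background, as described above.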
- Starting the Threader webserver
- Assuming you have Apache Tomcat 5.x installed and have set
up the web application (see Apache Tomcat under Dependencies below),
simply execute the script startup.sh in
the bin directory of your Tomcat installation.
- Note: You will need to have certain environment variables set,
as described in the Tomcat documentation. These include the JAVA_HOME
variable.
- Updating the Threader Javadoc
- To update the Threader Javadoc found at
http://www.cs.columbia.edu/~swan1/javadoc/api:
- In ~user/projects/Threader/src, type make javadoc.
- Change directory to ~user/projects/Threader/documentation/
and recursively make all files in the javadoc directory readable and
executable (chmod -R 755 javadoc/). The webpage has a soft link
to this directory.
Dependencies
- KDD and MySQL
- Lokesh Shrestha maintains the KDD package, which uses MySQL to
store the ACM Email Corpus. The package provides a useful Java
interface to the MySQL database.
- You must first set up MySQL for your own account. See
the CRF webpage (http://www.cs.columbia.edu/crf/mysql) for details.
- See Lokesh for details of how to perform the following steps:
- Preparing the corpus for storage in the MySQL database
(requires splitting up threads into separate emails according to a
naming convention)
- Starting the KDD Server
- Testing the KDD Server
- Populating the MySQL database with the ACM corpus via a
request to the KDD Server
- From this point onward, you no longer require the KDD server
to be running.
- Headliner
- The Headliner package contains code to provide a single-sentence
summary of a single document. Threader uses some of the
basic supporting classes within Headliner for tasks such as
maintaining a vocabulary and creating tf.idf vectors.
Threader currently requires the Headliner JAR file to be kept in
~user/projects/Threader/lib. A link to the official Headliner page
and its documentation will be added soon. In the
meantime, you can get the JAR file at
~swan1/projects/Threader/lib/headliner.jar.
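To illustrate what maintaining a vocabulary and creating tf.idf vectors involves, here is a minimal, self-contained sketch. It does not use or reproduce the Headliner classes; the class and method names are hypothetical, and the weighting shown (raw term frequency times ln(N/df)) is one common tf.idf variant that may differ from the one Headliner implements.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative only: NOT the Headliner API. All names here are hypothetical.
final class TfIdfSketch {

    // Maps each distinct term to a vector index, in order of first appearance.
    static Map<String, Integer> buildVocabulary(List<List<String>> docs) {
        Map<String, Integer> vocab = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                vocab.putIfAbsent(term, vocab.size());
            }
        }
        return vocab;
    }

    // Returns one vector per document with entries tf(t, d) * ln(N / df(t)).
    static double[][] tfIdf(List<List<String>> docs, Map<String, Integer> vocab) {
        int n = docs.size();
        int[] df = new int[vocab.size()];
        for (List<String> doc : docs) {
            // Count each term at most once per document.
            for (String term : new HashSet<>(doc)) {
                df[vocab.get(term)]++;
            }
        }
        double[][] vectors = new double[n][vocab.size()];
        for (int d = 0; d < n; d++) {
            for (String term : docs.get(d)) {
                vectors[d][vocab.get(term)] += 1.0; // raw term frequency
            }
            for (int t = 0; t < vocab.size(); t++) {
                if (df[t] > 0) {
                    vectors[d][t] *= Math.log((double) n / df[t]);
                }
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Two toy "emails", already tokenised.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("meeting", "on", "friday"),
            Arrays.asList("meeting", "moved", "to", "monday"));
        Map<String, Integer> vocab = buildVocabulary(docs);
        for (double[] vector : tfIdf(docs, vocab)) {
            System.out.println(Arrays.toString(vector));
        }
    }
}
```

A term that occurs in every document, such as "meeting" in the toy data, receives weight zero under this scheme, so the vectors emphasise the terms that distinguish one email from another.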
- Jama Linear Algebra Package
- Threader uses the JAMA package for performing operations on
matrices, such as Singular Value Decomposition. The JAMA website
can be found at http://math.nist.gov/javanumerics/jama. This
website also contains a useful Javadoc API at
http://math.nist.gov/javanumerics/jama/doc. Simply download the
JAR file from http://math.nist.gov/javanumerics/jama/Jama-1.0.1.jar and
place the file in the lib directory: ~user/projects/Threader/lib.
- Apache Tomcat (http://jakarta.apache.org/tomcat)
- The web application for Threader is implemented using the
Apache Tomcat web server, which provides an implementation of Java
Servlets. The version of Tomcat used is 5.x. Installation
is straightforward; however, ensure that you use GNU tar to untar
the installation file.
- To set up the Threader web application
- Get the application code:
- This is stored in
~user/projects/Threader/doc/webapp/summarizer
- Copy this to the webapps directory of your Apache Tomcat
installation
- Run Apache Tomcat (execute the script startup.sh
in the bin directory of your Tomcat installation)
Documentation
- The Threader Makefile Environment
- The Threader Makefile environment is used for compiling the code
and for executing Threader at the prompt. As all the code is in Java,
Threader should work on any platform, provided MySQL is
installed. At last check, the KDD package also appeared to be platform
independent. The Makefile environment itself is provided for Linux or
Unix machines.
- Customising Makefiles
- ~user/projects/Threader/src contains one file which you need
to customise in order for GNU make to work on Threader. Makefile.local
contains two variables, $??? and $???, which specify, respectively, the
full path of the Threader code and the full path of the directory
which will house the compiled object files. The first variable should
be set to ~user/projects/Threader. The second is up to you to specify,
but I usually set that to ~user/projects/classes.
- ~user/projects/Threader/Makefile also contains a variable,
$CLASSPATH, which specifies the classpath parameter to give to the Java
compiler at the prompt.
- The Threader Config File
- The config file in ~user/projects/Threader/data/config.xml
contains information that Threader requires, depending on what you are
asking Threader to do (see Running Threader from the prompt above).
You may have to modify this file to point to resources in your own
workspace if you decide to make personal copies of any of the resources.
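As a rough illustration of how a setting might be read from an XML config file, the sketch below parses a small XML string with the standard Java XML parser. The schema of config.xml is not described in this document, so the element name corpusDir, the path, and the class name are hypothetical, and this is not how Threader itself necessarily reads the file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch: element names are invented, not taken from config.xml.
final class ConfigSketch {

    // Returns the trimmed text of the first element named tagName, or null if absent.
    static String readSetting(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse config XML", e);
        }
    }

    public static void main(String[] args) {
        // Invented example content standing in for a real config file.
        String xml = "<config><corpusDir>/home/user/projects/corpus</corpusDir></config>";
        System.out.println(readSetting(xml, "corpusDir"));
    }
}
```

When you point the config file at your own copies of resources, using full paths avoids depending on the directory from which Threader is started.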
- Further Documentation
- The Javadoc for Threader can be viewed at
http://www.cs.columbia.edu/~swan1/javadoc/api.
- The main project webpage for this can be found at
http://www.cs.columbia.edu/~swan1, which contains further documents
relating to Threader.
If you would like to recommend changes to this documentation, or have
comments about it, send mail to swan@cs.columbia.edu.