Using Bins for Text Categorization

Under the supervision of Ken Church at AT&T Labs Research in Florham Park, I have implemented a text categorization system called BINS, which uses bins to estimate term weights empirically. The approach groups words that share statistical features into a common bin and computes a single term weight for all words in that bin. Using bins in this manner can be thought of as a smoothing technique added to Naive Bayes: it avoids inaccurate term weights for words with scarce evidence, and it also allows the user to examine which features of words are most important for indicating categories.
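To make the idea of sharing one weight per bin concrete, here is a minimal Python sketch. It is not the BINS implementation: the binning feature (log-scaled counts inside and outside a category), the smoothing constant, and the names bin_key and bin_term_weights are all assumptions chosen for illustration. The features that BINS actually uses are described in the thesis.

```python
import math
from collections import defaultdict


def bin_key(in_count, out_count):
    # Hypothetical binning feature: words whose log2-scaled counts match
    # (inside and outside the category) land in the same bin.
    return (int(math.log2(1 + in_count)), int(math.log2(1 + out_count)))


def bin_term_weights(in_counts, out_counts, smoothing=0.5):
    """Return (word -> bin, bin -> weight) for one category.

    in_counts / out_counts map each word to its number of occurrences in
    training documents inside / outside the category. The weight of a bin
    is a smoothed log ratio of the pooled relative frequencies of all the
    words in that bin (an assumed formula, for illustration only).
    """
    total_in = sum(in_counts.values())
    total_out = sum(out_counts.values())
    pooled = defaultdict(lambda: [0, 0])
    word_bin = {}
    for word in set(in_counts) | set(out_counts):
        c_in = in_counts.get(word, 0)
        c_out = out_counts.get(word, 0)
        key = bin_key(c_in, c_out)
        word_bin[word] = key
        pooled[key][0] += c_in
        pooled[key][1] += c_out
    weights = {}
    for key, (c_in, c_out) in pooled.items():
        p_in = (c_in + smoothing) / (total_in + smoothing)
        p_out = (c_out + smoothing) / (total_out + smoothing)
        weights[key] = math.log(p_in / p_out)
    return word_bin, weights


# Toy counts for a hypothetical "sports" category.
in_counts = {"goal": 8, "match": 9, "the": 50}
out_counts = {"goal": 1, "match": 1, "the": 50, "stock": 7}
word_bin, weights = bin_term_weights(in_counts, out_counts)
```

In this toy example, "goal" and "match" fall into the same bin and therefore receive the same weight, estimated from their pooled counts rather than from either word alone.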

BINS is a user-friendly system, and it will soon be publicly available for researchers. When that happens, I will post a link from here to a page with instructions on how to obtain the system. BINS also allows bin-based term weights to be combined with standard Naive Bayes term weights, using the weight for an individual word when there is enough evidence and falling back to the word's bin otherwise. I have seen that this can significantly boost performance. BINS also provides an interesting method for incorporating unlabeled data; although I have not yet seen this boost performance, I have reasons to believe the approach is promising.
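The fallback between word-level and bin-level weights can be sketched as follows. This is only an illustration under stated assumptions: the function name combined_weight, the min_evidence threshold of 5, the dictionary layout, and all numeric values are hypothetical, and the way BINS actually decides that there is "enough evidence" may differ.

```python
def combined_weight(word, word_weights, evidence, word_bin, bin_weights,
                    min_evidence=5):
    """Pick a term weight for `word`.

    Uses the word's own (Naive Bayes style) weight when it has been seen at
    least `min_evidence` times in training; otherwise falls back to the
    weight of the word's bin. Words with no bin get a neutral weight of 0.0.
    The threshold and the neutral default are assumptions for illustration.
    """
    if evidence.get(word, 0) >= min_evidence and word in word_weights:
        return word_weights[word]
    return bin_weights.get(word_bin.get(word), 0.0)


# Toy data: "goal" is well attested, "offside" was seen only once.
word_weights = {"goal": 2.1, "offside": 3.0}
evidence = {"goal": 40, "offside": 1}
word_bin = {"goal": "frequent", "offside": "rare"}
bin_weights = {"frequent": 1.5, "rare": 0.8}

w_goal = combined_weight("goal", word_weights, evidence, word_bin, bin_weights)
w_offside = combined_weight("offside", word_weights, evidence, word_bin, bin_weights)
```

Here the well-attested word keeps its individual weight, while the rare word's unreliable individual estimate is replaced by the weight of its bin.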

The latest version of BINS has not yet been described in published literature, although it is described in detail in Chapter 5 (and several appendices) of my thesis (one-sided version, two-sided version). Below are some links to other relevant information, including a published paper describing an earlier version of the system.

I will update this page periodically, and will soon add links to pages from which the BINS system, and also a new text categorization corpus that I have created, can be downloaded.


Click here to learn more about my research.

Send questions or comments to sable@cs.columbia.edu.