Domain-independent Single Document Summarization through Focus Analysis

The domain-independent, focus-based summarization system (FociSum) is designed in a modular fashion, so that the processing pipeline decomposes into distinct steps. The core of the summarization system is highlighted in pink in the diagram below.

The process starts with a single document, which passes through a layout recognition procedure that finds and removes unwanted sections of text (such as tables and lists). Next, a length filter diverts short articles away from the more laborious focus-based summarization; shorter articles seem more amenable to a simpler, lead-based summarization strategy.
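The routing decision made by the length filter can be sketched roughly as follows. This is only an illustration: the threshold of 10 sentences, the naive sentence splitter, and the names `split_sentences`, `lead_summary`, and `route` are assumptions made for the example, not details of FociSum.

```python
import re

# Hypothetical threshold: the description does not say how short "short" is.
MIN_SENTENCES = 10


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def lead_summary(text, n=3):
    """Lead-based baseline: return the first n sentences of the document."""
    return " ".join(split_sentences(text)[:n])


def route(text, min_sentences=MIN_SENTENCES):
    """Return the name of the strategy a document would be sent to."""
    if len(split_sentences(text)) < min_sentences:
        return "lead"
    return "focus"
```

A document for which `route` returns "lead" would simply be summarized with `lead_summary`; everything else would continue to the focus-based modules.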

All remaining articles and texts are put through the focus-based summarization system. The first step is to identify foci using an information extraction system. We use Talent, the named entity extraction system created by IBM's research group on advanced text analysis. The first module, the Foci Finder, uses Talent's output on the text to help determine the foci of a target article.
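One plausible way to turn extracted entities into foci is to rank them by how often they are mentioned. The sketch below assumes this frequency heuristic and a simplified input format, a list of (canonical name, entity type) pairs in document order. Neither the heuristic nor the format is taken from Talent or from the Foci Finder itself; `find_foci` is a hypothetical name.

```python
from collections import Counter


def find_foci(mentions, top_k=3):
    """Rank entities by mention count; ties are broken by first appearance.

    `mentions` is a hypothetical stand-in for extractor output: a list of
    (canonical_name, entity_type) pairs in document order.
    """
    counts = Counter(mentions)
    first_seen = {}
    for pos, mention in enumerate(mentions):
        first_seen.setdefault(mention, pos)
    ranked = sorted(counts, key=lambda m: (-counts[m], first_seen[m]))
    return ranked[:top_k]
```

Keeping the entity type alongside the name matters for the later modules, since the kinds of questions worth asking depend on what kind of entity a focus is.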

Once the foci of the article have been determined, the Questioner module suggests types of relationships between them, as well as questions about the foci themselves, that may be answered in the text.
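As a rough illustration of how such questions might be enumerated, the sketch below fills question templates keyed by entity type, once for each focus and once for each pair of foci. The templates, the type labels, and the name `suggest_questions` are invented for the example; the actual inventory of relationships used by the Questioner is not described here.

```python
from itertools import combinations

# Hypothetical templates keyed by entity type; not taken from FociSum.
SINGLE_TEMPLATES = {
    "PERSON": ["Who is {a}?"],
    "ORG": ["What is {a}?"],
    "PLACE": ["Where is {a}?"],
}
PAIR_TEMPLATES = {
    ("PERSON", "ORG"): ["What is the role of {a} at {b}?"],
    ("ORG", "PLACE"): ["What does {a} do in {b}?"],
}


def suggest_questions(foci):
    """Enumerate candidate questions for a list of (name, type) foci."""
    questions = []
    # Questions about each focus on its own.
    for name, etype in foci:
        for template in SINGLE_TEMPLATES.get(etype, []):
            questions.append(template.format(a=name))
    # Questions about relationships between pairs of foci, in either order.
    for (n1, t1), (n2, t2) in combinations(foci, 2):
        for template in PAIR_TEMPLATES.get((t1, t2), []):
            questions.append(template.format(a=n1, b=n2))
        for template in PAIR_TEMPLATES.get((t2, t1), []):
            questions.append(template.format(a=n2, b=n1))
    return questions
```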

Once the possible relationships that the article might explain are enumerated, the answers must be found in the text. The Answerer takes the original document and passes it through IBM's English Slot Grammar Parser, creating a parse tree. The Answerer then searches this tree for patterns, involving the foci or their variant forms, that mark an answer to one of the questions proposed by the previous module.
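The idea of searching a parse tree for patterns that involve a focus can be illustrated with a toy example. The `Node` structure, the role labels "subj" and "obj", and the single subject-verb-object pattern below are assumptions made for the sketch; they do not reflect the actual output format of the English Slot Grammar Parser or the patterns the Answerer uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy parse-tree node (hypothetical; not the ESG output format)."""
    word: str
    role: str  # e.g. "top", "subj", "obj"
    children: list = field(default_factory=list)


def find_subject_clauses(root, focus_variants):
    """Return (subject, verb, object) triples whose subject is a focus variant."""
    variants = {v.lower() for v in focus_variants}
    hits = []
    stack = [root]
    while stack:
        node = stack.pop()
        subjects = [c for c in node.children if c.role == "subj"]
        objects = [c for c in node.children if c.role == "obj"]
        for subj in subjects:
            if subj.word.lower() in variants:
                obj_word = objects[0].word if objects else None
                hits.append((subj.word, node.word, obj_word))
        stack.extend(node.children)
    return hits
```

Passing all known variant forms of a focus (for instance an abbreviation together with the full name) lets a single pattern match however the text happens to refer to it.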

The final module, the Content Orderer, takes the snippets of sentences and clauses from the original text and reorganizes them into summary form, aiming to produce a more coherent and informative summary than is normally possible using traditional approaches.
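The ordering strategy itself is not specified here. As one simple possibility, the sketch below groups snippets by the focus they concern, orders the groups by where each focus first appears, keeps the original text order within a group, and drops exact duplicates. The (position, focus, text) tuple format and the name `order_snippets` are assumptions made for the example.

```python
def order_snippets(snippets):
    """Order (position, focus, text) snippets into a summary sequence.

    Hypothetical strategy: group by focus (groups sorted by the earliest
    position of that focus), then by position; drop repeated text.
    """
    first_pos = {}
    for pos, focus, _ in snippets:
        first_pos[focus] = min(pos, first_pos.get(focus, pos))
    seen = set()
    ordered = []
    for pos, focus, text in sorted(snippets, key=lambda s: (first_pos[s[1]], s[0])):
        if text not in seen:
            seen.add(text)
            ordered.append(text)
    return ordered
```

Grouping by focus keeps everything said about one entity together, which is one way a reorganized summary could read more coherently than snippets left in extraction order.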