KAREN SPARCK JONES

Opening talk, WORKSHOP on INTELLIGENT SCALABLE TEXT SUMMARISATION
11 July 1997, Madrid
in conjunction with ACL/EACL 1997

Abstract

Summarising: Where are we now? Where should we go?
-------------------------------------------------
Karen Sparck Jones
Computer Laboratory, University of Cambridge

Summarising covers a range from text extraction to content condensation. Its essential features are picking important concepts from, and reducing, source text or information, to deliver summary information or text. General strategies for doing this are clearly preferable to application-specific ones.

So far, we have found that statistically-based sentence extraction and concatenation does not produce effective summaries. But we have not yet found general methods of content analysis and condensation. We can only identify key source content and present it in summary with heavy domain and goal guidance.

The most pressing need is to develop `sufficient to the day' techniques that do more than surface sentence extraction without depending, MUC-like, on prior specifications for sought content. These needed intermediate techniques include passage extraction and linking; deep phrase selection and ordering; entity identification and relating. Such strategies benefit from, or require, shallow text analysis and do or can exploit statistical data. They may be enhanced by modern display resources. They are applicable to individual source texts or to data sets as wholes.

Most importantly, we can tackle this level of summarising because current robust parsing technology may succeed, given source redundancy, in getting enough of value from sources to help users, and because current text production methods can deliver usable summary texts. We should push this line hard, seeking to minimise application-specific domain knowledge, to take advantage of discourse structure, and to address summary function for the user.
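
For concreteness, the kind of statistically-based sentence extraction and concatenation the abstract refers to might look roughly like the sketch below. It is an illustrative baseline only, not a method taken from the talk: the function names, the tiny stopword list and the mean word-frequency score are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "it", "on", "for"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased alphabetic tokens that are not in the stopword list."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def extract_summary(text, n=2):
    """Concatenate the n sentences with the highest mean content-word frequency,
    keeping them in their original source order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, then restore source order for the output.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n])
    return " ".join(sentences[i] for i in chosen)
```

The output is simply the selected source sentences joined together, which is why such summaries inherit dangling references and gaps from the surrounding text they were lifted out of.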
Presentation slides
-------------------------------------------
Karen Sparck Jones

SUMMARISING : WHERE ARE WE NOW ?
              WHERE SHOULD WE GO ?

1 what is and where are
2 where go : methodology focus - context factors
3 where go : strategy focus - shallow processing
-------------------------------------------
SUMMARISING :

a reductive transformation of source text to summary text
through content condensation
by selection / generalisation on what's important

basic process model 3-stage

  I - source text interpretation to source representation
  T - source rep transformation to summary representation
  G - summary text generation from summary representation
-------------------------------------------
WORK SO FAR : major camps

  TE - text extraction (open)
       `what you see is what you get'
  FE - fact extraction (closed)
       `what you know is what you get'
  ( DT data to text )
-------------------------------------------
FEATURES

  TE : I + T via statistics/ location/ cues
       G smoothing
       `through a glass darkly'
  FE : I + T via frames
       G synthesis
       `one perspective only'

  TE generality but low quality
  FE better quality but application specific
-------------------------------------------
how get more power than TE ?
        more flexibility than FE ?

role of large scale discourse structure (source)
  critical, but long-term topic

what for nearer-term progress ?
-------------------------------------------
* METHODOLOGY *

examine CONTEXT FACTORS

  idea of general-purpose summary - ignis fatuus
  idea of basic summary - hidden factor assumptions
-------------------------------------------
INPUT FACTORS - source text

  form - structure, scale, medium, genre
         eg progress report
  subject type - ordinary/ specialised/ restricted
         eg chemical analysis
  unit - single/ multiple
         eg publication set

PURPOSE FACTORS - summary

  situation - tied/ floating
  audience - untargetted/ targetted
         eg women's magazine readers
  use - retrieve/ preview/ substitute/ refresh/ alert/ ...
         eg course synopsis
-------------------------------------------
OUTPUT FACTORS - summary text

  material - covering/ partial
         eg novel plot
  format - running/ headed
         eg test results
  style - informative/ indicative/ critical/ aggregative/ ...
         eg judicial summing up

summary FUNCTION :
  given IF data, satisfy PF requirements via OF properties
-------------------------------------------
EG book review summaries for librarian purchaser

  IF : form  structure - simple running
             scale - variable
             medium - literary prose
             genre - critique
       subject type - ordinary
       unit - single

  PF : situation  floating - service list
       audience   untargetted - educated pro
       use        substitute - treat as review

  OF : material  covering - whole review
       format    headed - book, comments
       style     indicative - area, view
-------------------------------------------
context factor analysis for informed choice
  rational design
  and EVALUATION

conventional evaluation :
  against source : ? concepts caught
                   how tell ?
  against humans : ? same concepts
                   why need ?

evaluate against constraints : ESPECIALLY use, audience
-------------------------------------------
EG library purchase case

  summary includes review basics ?
    - apply checklist
  summary allows decision ?
    - study librarian users
  allowed decision `correct' ?
    - compare summary/review decisions
-------------------------------------------
IMPLICATIONS : where next ?

factor analysis lever for methods
BUT discourse interpretation tough

FOCUS : situations for
  * indicative, skeletal summaries *
ie where user
  has loose task
      other aids
  is  knowledgeable
      interactive
so want summaries `sufficient for the day'
  eg for browsing, notification
-------------------------------------------
unadventurous ??
NO : far from trivial to do
     good grounding for better
     * USEFUL * (right now)
-------------------------------------------
* STRATEGY * (with current NLP)

not  surface text hacking
     deep fact checking
BUT  intermediate source processing
     eg parse to logical forms
        local anaphor resolution
ie get as far as can linguistically
   without domain model
-------------------------------------------
can expect :
  more visible entities/relations than given text
  more sympathetic to original than prescribed facts

build shallow source representation
derive summary representation via
  statistical data - frequency etc
  markedness data - location, cues etc
(representation allows text ties)
-------------------------------------------
CORE ARGUMENT :

full analysis impossible
but robust parsing feasible

will get enough predications for summarising -
because text is REDUNDANT
especially for key information
-------------------------------------------
CHALLENGE : exploring the detail -

  building the source predication net
  deriving the summary representation
  * evaluating the summary texts *
  ( how for loose tasks
            tolerant users ?
  )
-------------------------------------------
EG shallow approach - Richard Tucker
   ( in progress, test rig )

I : parse to logical form
    decompose to simple predications
    derive predication cohesion graph using
      common predicate
      common argument (within sentence)
      similar arguments (across sentences)
        (if same semantic head)

    *weak* structure : refers to entities
                       entities underdetermined
-------------------------------------------
T : node set selection via
      weights for edge type
      scoring function seeking
        centrality
        representativeness
        coherence
      greedy algorithm

G : synthesise text from selected predications
    *limited* data ==> `semi-text'

indicative summary noting main topics
-------------------------------------------
shallow strategy ADVANTAGES :

  we have the NLP technology to start
  we should get something practically useful
  we can learn for the tougher cases

  >>> natural attack on LONG TEXTS

so, GO FOR IT
-------------------------------------------
-------------------------------------------
Karen Sparck Jones : publications on summarising
Computer Laboratory, University of Cambridge

{\em Discourse modelling for automatic summarising}, Technical Report 290, Computer Laboratory, University of Cambridge, 1993; and in {\em Travaux du Cercle Linguistique de Prague} (Prague Linguistic Circle Papers), New Series, Volume 1, 1995, Amsterdam: John Benjamins, 201-227.

`How do I centre large-scale text structure?' {\em Workshop on Centering in Naturally-Occurring Discourse}, IRCS, University of Pennsylvania, 1993.

`Summarising as a lever for studying large-scale discourse structure', {\em ACL Workshop on Intentionality and Structure in Discourse Relations}, Columbus, OH, 1993, 125-127.

`What might be in a summary?', {\em Information Retrieval 93: Von der Modellierung zur Anwendung} (Ed.
Knorz, Krause and Womser-Hacker), Konstanz: Universitätsverlag Konstanz, 1993, 9-26.
(via FTP use ftp.cl.cam.ac.uk/public/papers/ksj/ksj-whats-in-a-summary.ps.gz - transfer in binary and then gunzip)

(Editor, with B. Endres-Niggemeyer and J. Hobbs):
{\em Summarising text for intelligent communication}, Dagstuhl Seminar Report 79, 13.12-17.12.93 (9350), IBFI, Schloss Dagstuhl, Wadern, Germany, 1995.
(Full version: via WWW http://www.bid.fh-hannover.de/SimSum/Abstract/, 1995)

`Summarising: analytic framework, key component, experimental method', in {\em Summarising text for intelligent communication} (Ed. Endres-Niggemeyer, Hobbs and Sparck Jones), Dagstuhl Seminar Report 79, 13.12-17.12.93 (9350), 1995.
(Full version of the report: via WWW http://www.bid.fh-hannover.de/SimSum/Abstract/, 1995)

`Summarisation', in {\em Survey of the State of the Art in Human Language Technology} (Ed. Cole et al), Cambridge: Cambridge University Press, 1997;
(book text published electronically via http://www.cse.ogi.edu/CSLU/HLTsurvey)

with B. Endres-Niggemeyer:
`Introduction: Automatic summarising', Editors' introduction to the special issue on summarising text, {\em Information Processing and Management}, 31, 1995, 625-630.

`Natural language processing: she needs something old and something new (maybe something borrowed and something blue, too)', Presidential Address, Association for Computational Linguistics, June 1994;
published electronically (1995) by the ACL via http://www.cs.columbia.edu/acl, and by the Computation and Language E-Print Archive, http://xxx.lanl.gov/cmp-lg/9512004.
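
Supplementary illustration: the T stage on the shallow-approach slide (node set selection over a predication cohesion graph, using edge-type weights and a greedy algorithm) could be sketched along the following lines. The slides give no weight values and no scoring formula, so everything concrete here is an assumption: the `EDGE_WEIGHTS` numbers, the function names, and a score that combines only centrality (weighted degree) and coherence (ties to nodes already chosen). Representativeness is not modelled, and the I and G stages are not shown.

```python
# Hypothetical edge-type weights; the slides name the edge types but give no values.
EDGE_WEIGHTS = {
    "common_predicate": 1.0,
    "common_argument": 2.0,
    "similar_argument": 1.5,
}


def build_graph(edges):
    """Turn (node_a, node_b, edge_type) triples into a weighted adjacency map."""
    graph = {}
    for a, b, edge_type in edges:
        w = EDGE_WEIGHTS[edge_type]
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        graph[a][b] = graph[a].get(b, 0.0) + w
        graph[b][a] = graph[b].get(a, 0.0) + w
    return graph


def greedy_select(graph, k, coherence_bonus=1.0):
    """Greedily pick up to k nodes, one at a time, by the assumed score:
    weighted degree plus bonus-weighted ties to the nodes chosen so far."""
    selected = []
    candidates = set(graph)
    while candidates and len(selected) < k:
        def gain(node):
            centrality = sum(graph[node].values())
            coherence = sum(graph[node].get(s, 0.0) for s in selected)
            return centrality + coherence_bonus * coherence

        # Sorting first makes tie-breaking deterministic (lowest node name wins).
        best = max(sorted(candidates), key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With predications labelled by plain strings, `greedy_select(build_graph(edges), k)` returns the chosen nodes in the order they were picked; a generator would then have to turn those predications into the `semi-text' the slide mentions.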