Summarization content units (SCUs)

The goal of SCU annotation is to identify sub-sentential content units that allow the information in several summaries to be compared. It is well known that, when summarizing, people make different choices about what information to include in their summary. The SCU annotation aims at highlighting what people agreed on. After the annotation is completed, some SCUs might appear in only one summary, but annotating them will allow a person to read a brand new summary and look for those SCUs in it.

An SCU consists of a label and contributors. The label is a concise English sentence that states the semantic meaning of the content unit. The contributors are snippets of text from the summaries that show the wording used in a specific summary to express the label. It is possible for an SCU to have a single contributor, when only one of the analyzed summaries expresses the label of the SCU.
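The label-plus-contributors structure can be pictured as a small record type. The following is only a sketch: the class and field names (`SCU`, `Contributor`, `summary_id`) are hypothetical and not part of any annotation tool, and the `weight` property assumes that an SCU's weight is the number of distinct summaries that contribute to it. The sample values are taken from the examples later in this document.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Contributor:
    summary_id: str  # which summary the snippet comes from (hypothetical field)
    text: str        # the snippet as worded in that summary


@dataclass
class SCU:
    label: str  # concise sentence stating the meaning of the content unit
    contributors: list[Contributor] = field(default_factory=list)

    @property
    def weight(self) -> int:
        # Assumption: weight = number of distinct contributing summaries.
        return len({c.summary_id for c in self.contributors})


# An SCU expressed by only one of the analyzed summaries.
scu = SCU(
    label="Libya was under U.N. sanctions",
    contributors=[Contributor("D", "Libya has been under U.N. sanctions")],
)
print(scu.weight)
```

Counting distinct summaries rather than snippets keeps the weight unchanged when one summary supplies several snippets for the same label.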

The definition of a content unit is somewhat fluid -- it can sometimes be a single word, but it is usually no bigger than a clause. Any event realized by a verb or a nominalized verb (e.g., "blow up" and "bombing" in the examples below) is a candidate SCU.

The three questions that will help you identify an SCU contributor are:

Is the information expressed by it repeated in some other summary? Note, the wording need not be the same for the expressed meaning to be the same; we are looking for the same meaning. When an information unit is expressed in two or more summaries, the amount of information overlap will serve as a main indication of which parts of the corresponding sentences will become contributors.
Spans of words that indicate location or time, or otherwise provide more specific information about another SCU are also SCUs. Usually these are expressed in adjuncts such as prepositional phrases and are not an obligatory argument to any verb. Noun phrases containing premodification can also be split into more than one SCU when the premodifiers include additional information. For example, if the summaries under annotation convey that there was a bombing and the location of the bombing, then the annotator would identify two SCUs, one with the main event, and one with the additional detail information.
Is the difference important for the story? Occasionally there will be minor differences in wording that if put under scrutiny could be construed to have different nuances. We are not interested in the finest grained distinctions---these will be too many to describe in a reasonable way.

Overall, the annotation involves semantic judgements, and it is thus difficult to list all possible syntactic constructions that can give rise to a content unit. The goal is to split the text into the small semantic units that the original summary writers have put together in several sentences to form their summary. During the annotation, the context of the sentence and the entire summary can be used to interpret a specific text segment.

Example 1: The four sentences below come from four different summaries A, B, C and D.

A: In 1992 the U. N. voted sanctions against Libya for its refusal to
turn over the suspects. 

B: The United Nations imposed sanctions on Libya in 1992 because of
their refusal to surrender the suspects. 

C: The U.N. imposed international air travel sanctions on Libya to
force their extradition. 

D: Since 1992 Libya has been under U.N. sanctions in effect until the suspects are turned over to United States or Britain.

Among other information, all four sentences express the fact that "Libya was under U.N. sanctions" and this is the label for the SCU. The contributors are marked in brackets below (ignore SCU2 for now.)

A: In 1992 [the U. N. voted sanctions against Libya]1 [for its refusal to
turn over the suspects.]2 

B: [The United Nations imposed sanctions on Libya]1 in 1992 [because of
their refusal to surrender the suspects.]2 

C: [The U.N. imposed]1 international air travel sanctions on Libya [to
force their extradition.]2 

D: Since 1992 [Libya has been under U.N. sanctions]1 [in effect until
the suspects are turned over]2 to United States or Britain.

Other information, such as when the sanctions were imposed, what specific sanctions were imposed, why they were imposed, etc., will form their own SCUs. Identifying a main topic event in the summaries and asking yourself questions such as those above about its specifics will help you formulate labels and identify the SCU contributors. The contributors of an SCU need not share identical wording. For example, in the sentences above, the SCU with label "The goal behind the sanctions is to make Libya surrender the suspects" is expressed by the text coindexed with "2". Sentence C differs in wording from the rest of the sentences, but the meaning is the same as that of the other contributors, expressing the fact that Libya does not want to surrender the suspects and the other nations involved want to force their extradition. (Note that this is an example of only two of the SCUs that will be derived from the sentences; the full analysis will identify more SCUs and lead to complete bracketing of the sentences.)
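The bracket notation used in these examples can be read mechanically. The following is a minimal sketch, assuming the plain-text `[snippet]N` convention shown above; the function name `extract_contributors` is hypothetical and is not part of any annotation tool.

```python
import re
from collections import defaultdict

# Matches "[some text]N": bracketed text immediately followed by an SCU index.
BRACKETED = re.compile(r"\[([^\[\]]+)\](\d+)")


def extract_contributors(sentence: str) -> dict[int, list[str]]:
    """Map each SCU index to the bracketed snippets carrying that index."""
    found = defaultdict(list)
    for text, index in BRACKETED.findall(sentence):
        # Collapse line breaks and repeated spaces inside a snippet.
        found[int(index)].append(" ".join(text.split()))
    return dict(found)


# Sentence B from Example 1, with its two bracketed contributors.
sentence_b = (
    "[The United Nations imposed sanctions on Libya]1 in 1992 "
    "[because of their refusal to surrender the suspects.]2"
)
contributors = extract_contributors(sentence_b)
print(contributors)
```

Text outside the brackets (here "in 1992") is simply skipped, which mirrors the fact that a partial analysis leaves some of the sentence unbracketed.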

Let's look at one more example of sentences from the different summaries that share some common information.

A. In 1998 [two Libyans indicted]1 [in 1991]2 for the Lockerbie [bombing]3 were
still in Libya. 

B. [Two Libyans were indicted]1 [in 1991]2 [for blowing up]3 [a Pan Am]5
[jumbo jet]4 over Lockerbie, Scotland in 1988. 

C. [Two Libyans, accused]1 by the United States and Britain [of bombing]3 [a
New York bound]6 [Pan Am]5 [jet]4 over Lockerbie, Scotland in 1988, killing
270 people, for 10 years were harbored by Libya who claimed the
suspects could not get a fair trial in America or Britain. 

D. [Two Libyan suspects were indicted]1 [in 1991]2.

All share the information that (1) "Two Libyans are held responsible for a crime". The contributors are surrounded by brackets and coindexed by 1. Note that C differs in its wording from the other sentences--accused is not the same as indicted. But because the goal of the annotation is to find as much shared information as possible, and the sense of "accused" is so close to that of "indicted", the contributors will be grouped together, and the label expresses the general meaning of both accused and indicted.

The time expression "in 1991", a prepositional phrase, forms a separate SCU because it can be omitted, for example from sentence D, without making the sentence ungrammatical or incomprehensible. There will be a loss of information, and this is why the phrase can indicate a new *content* unit! The contributors of the SCU with label "The Libyans were accused in 1991" are coindexed with "2".

Now we have to proceed and find what other information is repeated. For example, what was the crime committed? The different sentences give different amounts of detail. When deciding where to start, remember that the main goal is identifying the same information! Sentences A, B and C agree on the fact that "the crime in question is a bombing" -- the contributors are coindexed with 3.

What was bombed? "An airplane was bombed" is another SCU, with index 4. This information is expressed in two bigger noun phrases, "a Pan Am jumbo jet" and "a New York bound Pan Am jet", but "New York bound" and "Pan Am" can be omitted and the sentences will still be acceptable, so this information will be marked in separate content units (indices 6 and 5).

The contributors are simply a part of the sentence--not all grammatical arguments necessary to reconstruct the label will be included in the contributor. This is ok, because the label will "bring in" any argument needed.

It is best if the SCU contributor can be a complete grammatical phrase. But this is sometimes not possible, so use your best judgment in assigning the specific token boundaries of the contributor.

Some specific annotation rules

Length of contributors: contributors are usually not very long (average of 6 words), since the content units express small, almost atomic units of information. Thus, whenever the annotator can imagine that a part of the sentence conveying additional information about an event could be removed from the sentence, this part becomes a candidate contributor for a new SCU. For example, a sentence like "Dr. Alan Cox was tried and convicted in Winchester, England for the attempted murder of a patient in 1992." will give rise to several content units---"Dr. Cox was convicted", "The trial was in Winchester, England", "He was tried for the attempted murder of a patient" and "Dr. Cox's conviction was in 1992". The main event, location, time and additional specification each represent a content unit that could be expressed separately in another summary.

General vs. specific information: often one of the summarizers will convey more specific information than the others. For example, different summaries can contain the sentences "Dogs are used to control soccer fans", "In Italy, dogs are used to control soccer fans" and "Dogs are used to control soccer fans in the UK". All three sentences would contribute to an SCU expressing that "Dogs are used to control soccer fans" (three contributors), while the two specific locations, Italy and the UK, will be in separate SCUs with one contributor each. Similarly, any significant modification of a statement is split into a separate SCU. For example, in the sentences "Birth rates have decreased" and "Birth rates have decreased by 50%", there are two content units expressed, and the fact that the decrease was by half will be split into a separate SCU because it contributes significant new information.

Differences in meaning: In many of the examples above we saw that near paraphrases are grouped in the same content unit. A word of caution is needed here---for example, two sentences such as "China has a forced sterilization policy" and "China has a forced contraception policy" are quite similar syntactically but are semantically different, one being rather more drastic than the other, and thus each would form a separate SCU. If the two concepts were linked in the same sentence by a summarizer, as in "China has a forced contraception policy, including forced sterilization", we would have a case of general vs. specific information and the previous rule would apply.

References to time: oftentimes different people refer to the time of an event using different wording; for example, in different summaries an event can be described as having happened "in 1993", "in 1994", or "in the early 90s". Such references would usually be grouped together in one content unit, and a mention in a new summary that is close in meaning will match the entire SCU.

Most often, the clauses of a complex sentence express different details about an event or entity, and thus the goal is to limit each content unit to a clause. Occasionally, a complex sentence doesn't really contain two separate pieces of information. For example, the sentence
"The sentence will send a message that white collar crime doesn't pay."
is syntactically complex, but there are no different details and the entire sentence can be a contributor to a single content unit.

Once the annotation is completed, spellcheck the SCU labels, since later during peer annotation one would want to search over the labels.

Peer annotation

Peer annotation refers to the annotation of a new summary against an existing pyramid in order to evaluate how well content in the peer summary was chosen. The goal is to identify content units that are already expressed in the pyramid, as well as the new content units in the summary that do not appear in the pyramid at all. Content that corresponds to highly-weighted SCUs in the model pyramid is better, since it expresses information that many human summarizers have agreed on including.

The annotation is similar to the general SCU annotation---parts of the peer summary should be mapped to a corresponding SCU in the pyramid. The purpose is to identify in the peer summary near-paraphrases of SCUs in the model pyramid. The expressed meaning does not need to completely match the label of the SCU. In addition, the context of the entire peer summary as well as general common knowledge that an educated American might have can be used in the matching process. At the same time, avoid using information or context that you have gained by reading the pyramid and the human summaries that contributed to it. This is not really context or general common knowledge, and it is unlikely that a new reader of the summary will have this knowledge. For example, if the summary contains anaphoric expressions that cannot be resolved within the peer summary itself, then the sentences should be annotated accordingly, without use of the possible referent of the anaphor, even if this referent could be guessed by reading the summaries that make up the pyramid. The summary is meant to be read without knowledge of the input documents or other summaries, so if the references are bad and the summary is unclear, then the annotation should reflect this, and such vague snippets of text should not be matched to an SCU. In summary---resolving anaphora within the peer summary is ok, and it should be done during annotation (since a reader of the summary will have it available), but knowledge coming from the pyramid summaries should not be used. One can use general knowledge such as "France is in Europe", which we can assume is known to any reader and does not require reading other text on the summary topic.

For example, if a summary contains the sentence "Dogs are used to control soccer fans in Europe" and the model pyramid contains two SCUs "Dogs are used to control soccer fans" and "It is in Italy that dogs are used to control soccer fans", then the first part of the summary sentence will match the general fact SCU, while "in Europe" can be mapped to the "in Italy" SCU. These can be considered paraphrases in the context of the sentences and by using simple common knowledge.

It is always a good idea to check the contributors of an SCU to see the amount of variation between them; this will oftentimes give you an indication that less strict mappings are possible, as in the example of annotating the year an event happened ("in 1994", "in 1993", "in the early 90s"). The different contributors of a content unit express the same, or nearly the same, information, so some differences are possible and a match can still occur. For example, a peer sentence "The population of the world is likely to double - to more than 10bn people" can be mapped to an SCU with label "The world population will reach over 10 billion in 2050 with current growth rates", even though the temporal information is missing in the peer sentence. The decision to make the match can be facilitated by looking at all the contributors in the pyramid SCU and confirming that the peer expresses information that is in all of them; in this case the shared information can be the fact that the earth's population will reach 10 billion.

When the summary conveys information that is new or significantly different from that in the model pyramid, it should be split to the appropriate contributor size and mapped to the service SCU at the bottom of the annotation panel.

You will notice that automatic summaries often repeat the same information in different sentences. Do annotate both instances, adding the corresponding contributors to the appropriate SCU. Repetition is taken into account in the final summary score.

Occasionally, the labels of two SCUs sound similar and it is difficult to understand just by reading the labels what the difference between the two content units is. In such cases, click on the SCUs to see their contributors---the contributors will give you an idea of what the emphasis is in each content unit.

The same text selection from the peer cannot be matched to two different pyramid SCUs. If a clause expresses more than one SCU, a different text span that best represents the meaning of each SCU needs to be selected.

When matching a peer contributor, try to find the most highly weighted appropriate SCU in the pyramid.
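The two matching rules above (no text selection matched to two SCUs, and preference for the most highly weighted appropriate SCU) can be sketched as simple checks. This is only an illustration under stated assumptions: the names `Match`, `pick_scu` and `reused_selections` are hypothetical, selections are represented as character offsets, and the weights in the usage example are made up. Deciding which SCUs are appropriate candidates remains a human judgement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    start: int      # character offset where the peer selection starts
    end: int        # character offset where the peer selection ends (exclusive)
    scu_label: str  # label of the pyramid SCU the selection is matched to


def pick_scu(candidates: dict[str, int]) -> str:
    """Return the label of the most highly weighted candidate SCU.

    `candidates` maps the labels the annotator judged appropriate to
    their weights (hypothetical representation).
    """
    return max(candidates, key=candidates.get)


def reused_selections(matches: list[Match]) -> list[tuple[int, int]]:
    """Return (start, end) selections that are matched to more than one SCU."""
    labels_by_span: dict[tuple[int, int], set[str]] = {}
    for m in matches:
        labels_by_span.setdefault((m.start, m.end), set()).add(m.scu_label)
    return sorted(span for span, labels in labels_by_span.items() if len(labels) > 1)


# Illustrative weights, not taken from a real pyramid.
candidates = {
    "Dogs are used to control soccer fans": 3,
    "It is in Italy that dogs are used to control soccer fans": 1,
}
best = pick_scu(candidates)

matches = [
    Match(0, 36, "Dogs are used to control soccer fans"),
    Match(0, 36, "It is in Italy that dogs are used to control soccer fans"),
    Match(37, 46, "It is in Italy that dogs are used to control soccer fans"),
]
conflicts = reused_selections(matches)
print(best, conflicts)
```

Two different selections matched to the same SCU are not flagged, since repeated information in a peer summary is meant to be annotated.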

Some of the low-weight SCUs in the pyramid carry more information than the high-weighted ones, and have longer contributors. They represent less important information and you can be more liberal when matching peer content to them---it is enough for the peer to convey part of the information. The pyramid analysis is very useful in that it allows the annotator to decide which information is important and should be split into fine-grained content units, and which information is not so important.

Content units from information that does not match a pyramid SCU: very often the peer summary contains information that is not covered by any of the pyramid SCUs. Such information should be split into content units (without assigning a label). It is important that the "leftovers" are carefully split into segments that conform to the definition of a content unit. Again, complex sentences in which the different clauses convey different details will be split into separate SCUs. For example, "The unprecedented cold wave, which took the lives of 30 people in Eastern Europe, is finally ending" will be split into two SCUs, one about the number of victims and another about the end of the cold wave. Complex sentences that do not convey more than one new detail about an event or entity will remain as a single SCU contributor, as for example "Many discovered upon their retirement that their pension money no longer existed." and "Bilking a large number of people out of millions of dollars can lead to sentences that vary from ten to twenty years."

When evaluating summaries that need to be of a specific length, the final sentence is oftentimes truncated before the actual end of the sentence. If there is enough of the truncated sentence that one can get a match with a content unit, it should be annotated as a contributor. If it expresses no clear idea that can be mapped to a content unit, it should be put as a "Non-matching contributor". If some ideas are expressed in the truncated sentence, but they do not correspond to an existing SCU, add the appropriate parts to the "Non-matching SCU".

Annotation tool

The new pyramid annotation tool, DUCView (v. 1.2), is available now.
Download DUCView by clicking on this link. This is a single jar file. If your browser saves the file under a different name, just rename it DUCView.jar.

When annotating multiple peers for the same pyramid, use the