The goal of SCU annotation is to identify sub-sentential content units that allow the information in several summaries to be compared. It is well known that, when summarizing, people make different choices about what information to include in their summary. SCU annotation aims to highlight what people agreed on. After the annotation is completed, some SCUs might appear in only one summary, but their annotation will allow a person to read a brand new summary and look for those SCUs in it.
An SCU consists of a label and contributors. The label is a concise English sentence that states the semantic meaning of the content unit. The contributors are snippets of text from the summaries that show the wording used in a specific summary to express the label. It is possible for an SCU to have a single contributor, when only one of the analyzed summaries expresses the label of the SCU.
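The structure described above can be sketched as a small data type. This is only an illustration in Python, not the format used by any annotation tool; the class and field names (Contributor, SCU, summary_id, text) are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contributor:
    """A snippet of text from one summary that expresses an SCU label."""
    summary_id: str  # hypothetical identifier of the source summary, e.g. "D"
    text: str        # the wording used in that summary


@dataclass
class SCU:
    """A content unit: one label plus one or more contributors."""
    label: str
    contributors: List[Contributor] = field(default_factory=list)


# An SCU with a single contributor is valid: only one summary expresses it.
scu = SCU(
    label="Libya was under U.N. sanctions",
    contributors=[Contributor("D", "Libya has been under U.N. sanctions")],
)
print(scu.label, len(scu.contributors))
```

Keeping the label separate from the contributors mirrors the point that the label states the meaning while each contributor only records how one summary worded it.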
The definition of content unit is somewhat fluid -- it can sometimes be a single word but it is usually bigger than a clause. Any event realized by a verb or a nominalized verb (e.g., "blow up" and "bombing" in the examples below) is a candidate SCU.
The three questions that will help you identify an SCU contributor are
Overall, the annotation involves semantic judgements, and it is thus difficult to list all possible syntactic constructions that can give rise to a content unit. The goal is to split the text into the small semantic units that the original summary writers have put together in several sentences to form their summary. During the annotation, the context of the sentence and of the entire summary can be used to interpret a specific text segment.
Example 1: The four sentences below come from four different summaries A, B, C and D.
A: In 1992 the U. N. voted sanctions against Libya for its refusal to turn over the suspects.
B: The United Nations imposed sanctions on Libya in 1992 because of their refusal to surrender the suspects.
C: The U.N. imposed international air travel sanctions on Libya to force their extradition.
D: Since 1992 Libya has been under U.N. sanctions in effect until the suspects are turned over to United States or Britain.
Among other information, all four sentences express the fact that "Libya was under U.N. sanctions" and this is the label for the SCU. The contributors are marked in brackets below (ignore SCU 2 for now).
A: In 1992 [the U. N. voted sanctions against Libya]1 [for its refusal to turn over the suspects.]2
B: [The United Nations imposed sanctions on Libya]1 in 1992 [because of their refusal to surrender the suspects.]2
C: [The U.N. imposed]1 international air travel sanctions on Libya [to force their extradition.]2
D: Since 1992 [Libya has been under U.N. sanctions]1 [in effect until the suspects are turned over]2 to United States or Britain.
Other information, such as when the sanctions were imposed, what specific sanctions were imposed, why they were imposed, etc., will form their own SCUs. Identifying a main topic event in the summaries and asking yourself questions like these about its specifics will help you formulate labels and identify the SCU contributors. The contributors of an SCU need not share identical wording. For example, in the sentences above, the SCU with label "The goal behind the sanctions is to make Libya surrender the suspects" is expressed by the text coindexed with "2". Sentence B differs in wording from the rest of the sentences, but the meaning is the same as that of the other contributors, expressing the fact that Libya does not want to surrender the suspects and the other nations involved want to force their extradition. (Note that this is an example of only two SCUs that will be derived from the sentences; the full analysis will identify more SCUs and will lead to complete bracketing of the sentences.)
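The bracket-and-index notation used in Example 1 can be read mechanically. The following Python sketch is an illustration only: it assumes that brackets are never nested and that the SCU index follows the closing bracket immediately, and the names BRACKETED, extract_contributors and group_by_scu are hypothetical.

```python
import re
from collections import defaultdict

# Matches "[some text]N", where N is the SCU index right after the bracket.
# Assumption: brackets are not nested.
BRACKETED = re.compile(r"\[([^\[\]]+)\](\d+)")


def extract_contributors(annotated):
    """Map each summary id to a list of (scu_index, contributor_text) pairs."""
    return {
        summary_id: [(int(idx), text.strip())
                     for text, idx in BRACKETED.findall(sentence)]
        for summary_id, sentence in annotated.items()
    }


def group_by_scu(annotated):
    """Map each SCU index to its contributors as (summary_id, text) pairs."""
    grouped = defaultdict(list)
    for summary_id, pairs in extract_contributors(annotated).items():
        for idx, text in pairs:
            grouped[idx].append((summary_id, text))
    return dict(grouped)


example_1 = {
    "A": "In 1992 [the U. N. voted sanctions against Libya]1 "
         "[for its refusal to turn over the suspects.]2",
    "B": "[The United Nations imposed sanctions on Libya]1 in 1992 "
         "[because of their refusal to surrender the suspects.]2",
    "C": "[The U.N. imposed]1 international air travel sanctions on Libya "
         "[to force their extradition.]2",
    "D": "Since 1992 [Libya has been under U.N. sanctions]1 "
         "[in effect until the suspects are turned over]2 "
         "to United States or Britain.",
}

for idx, contributors in sorted(group_by_scu(example_1).items()):
    print(idx, contributors)
```

Grouping by index makes it easy to see that the contributors of one SCU differ in wording while sharing the meaning stated in the label.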
Let's look at one more example of sentences from the different summaries that share some common information.
A. In 1998 [two Libyans indicted]1 [in 1991]2 for the Lockerbie [bombing]3 were still in Libya.
B. [Two Libyans were indicted]1 [in 1991]2 [for blowing up]3 [a Pan Am]5 [jumbo jet]4 over Lockerbie, Scotland in 1988.
C. [Two Libyans, accused]1 by the United States and Britain [of bombing]3 [a New York bound]6 [Pan Am]5 [jet]4 over Lockerbie, Scotland in 1988, killing 270 people, for 10 years were harbored by Libya who claimed the suspects could not get a fair trial in America or Britain.
D. [Two Libyan suspects were indicted]1 [in 1991]2.
All four sentences share the information that (1) "Two Libyans are held responsible for a crime". The contributors are surrounded by brackets and coindexed by 1. Note that C differs in its wording from the other sentences--"accused" is not the same as "indicted". But because the goal of the annotation is to find as much shared information as possible, and the sense of "accused" is so close to that of "indicted", the contributors will be grouped together, and the label expresses the general meaning of both "accused" and "indicted".
The time expression "in 1991", a prepositional phrase, forms a separate SCU because it can be omitted, for example from sentence D, without making the sentence ungrammatical or incomprehensible. There will be a loss of information, and this is why the phrase can indicate a new *content* unit! The contributors of the SCU with label "The Libyans were accused in 1991" are coindexed with "2".
Now we have to proceed and find what other information is repeated. For example, what was the crime committed? The different sentences give different amounts of detail. When deciding where to start, remember that the main goal is identifying the same information! Sentences A, B and C agree on the fact that "the crime in question is a bombing" -- the contributors are coindexed with 3.
What was bombed? "An airplane was bombed" is another SCU, with index 4. This information is expressed in two bigger noun phrases, "a Pan Am jumbo jet" and "a New York bound Pan Am jet", but "New York bound" and "Pan Am" can be omitted and the sentences will still be acceptable, so this information will be marked in separate content units (indices 6 and 5, respectively).
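To see at a glance how much of this example is shared, one can tally which summaries contribute to each index. The Python sketch below is illustrative only; it assumes non-nested brackets followed directly by the index, and the names BRACKETED and summaries_per_scu are hypothetical.

```python
import re
from collections import defaultdict

# Matches "[some text]N"; assumption: brackets are not nested.
BRACKETED = re.compile(r"\[([^\[\]]+)\](\d+)")

example_2 = {
    "A": "In 1998 [two Libyans indicted]1 [in 1991]2 for the Lockerbie "
         "[bombing]3 were still in Libya.",
    "B": "[Two Libyans were indicted]1 [in 1991]2 [for blowing up]3 "
         "[a Pan Am]5 [jumbo jet]4 over Lockerbie, Scotland in 1988.",
    "C": "[Two Libyans, accused]1 by the United States and Britain "
         "[of bombing]3 [a New York bound]6 [Pan Am]5 [jet]4 over Lockerbie, "
         "Scotland in 1988, killing 270 people, for 10 years were harbored "
         "by Libya who claimed the suspects could not get a fair trial in "
         "America or Britain.",
    "D": "[Two Libyan suspects were indicted]1 [in 1991]2.",
}


def summaries_per_scu(annotated):
    """Return {scu_index: sorted list of summary ids that contribute to it}."""
    support = defaultdict(set)
    for summary_id, sentence in annotated.items():
        for _text, idx in BRACKETED.findall(sentence):
            support[int(idx)].add(summary_id)
    return {idx: sorted(ids) for idx, ids in sorted(support.items())}


for idx, ids in summaries_per_scu(example_2).items():
    print(f"SCU {idx}: {len(ids)} summaries ({', '.join(ids)})")
```

In this tally SCU 1 is supported by all four sentences, while SCU 6 ("New York bound") has a single contributor, which is allowed.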
The contributors are simply a part of the sentence--not all grammatical arguments necessary to reconstruct the label will be included in the contributor. This is ok, because the label will "bring in" any argument needed.
It is best if the SCU contributor can be a complete grammatical phrase. But this is sometimes not possible, so use your best judgment in assigning the specific token boundaries of the contributor.
Peer annotation refers to the annotation of a new summary against an existing pyramid in order to evaluate how well the content in the peer summary was chosen. The goal is to identify content units that are already expressed in the pyramid, as well as the new content units in the summary that do not appear in the pyramid at all. Content that corresponds to highly weighted SCUs in the model pyramid is better, since it expresses information that many human summarizers have agreed on including.
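The idea that matches to highly weighted SCUs count for more can be illustrated with a small Python sketch. It rests on assumptions that this text does not state: that an SCU's weight is the number of model summaries contributing to it, and that a peer's raw total is the sum of the weights of the distinct SCUs it matches. The function name raw_weight_sum and the weights below are hypothetical, and this is not presented as the official scoring formula.

```python
from typing import Dict, Iterable

# Hypothetical pyramid: SCU label -> weight.
# ASSUMPTION: weight = number of model summaries that contributed to the SCU.
pyramid_weights: Dict[str, int] = {
    "Two Libyans are held responsible for a crime": 4,
    "The Libyans were accused in 1991": 3,
    "The crime in question is a bombing": 3,
    "An airplane was bombed": 2,
}


def raw_weight_sum(matched_labels: Iterable[str], weights: Dict[str, int]) -> int:
    """Sum the weights of the distinct pyramid SCUs matched in a peer summary."""
    return sum(weights[label] for label in set(matched_labels))


peer_matches = [
    "Two Libyans are held responsible for a crime",
    "An airplane was bombed",
]
print(raw_weight_sum(peer_matches, pyramid_weights))  # 4 + 2 = 6
```

Under these assumptions, a peer that matches the weight-4 SCU gains more than one that matches only the weight-2 SCU, which is the intuition behind preferring content that many summarizers agreed on.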
The annotation is similar to the general SCU annotation---parts of the peer summary should be mapped to a corresponding SCU in the pyramid. The purpose is to identify in the peer summary near-paraphrases of SCUs in the model pyramid. The expressed meaning does not need to completely match the label of the SCU. In addition, the context of the entire peer summary, as well as general common knowledge that an educated American might have, can be used in the matching process. At the same time, avoid using information or context that you have gained by reading the pyramid and the human summaries that contributed to it. This is not really context or general common knowledge, and a new reader of the summary is unlikely to have it. For example, if the summary contains anaphoric expressions that cannot be resolved within the peer summary itself, then the sentences should be annotated accordingly, without use of the possible anaphor referent, even if this referent could be guessed by reading the summaries that make up the pyramid. The summary is meant to be read without knowledge of the input documents or other summaries, so if the references are bad and the summary is unclear, then the annotation should reflect this, and such a vague snippet of text should not be matched to an SCU. In summary---resolving anaphora within the peer summary is OK, and it should be done during annotation (since a reader of the summary will have it available), but knowledge coming from the pyramid summaries should not be used. One can use general knowledge such as "France is in Europe", which we can assume is known to any reader and does not require reading other text on the summary topic.
For example, if a summary contains the sentence "Dogs are used to control soccer fans in Europe" and the model pyramid contains two SCUs "Dogs are used to control soccer fans" and "It is in Italy that dogs are used to control soccer fans", then the first part of the summary sentence will match the general fact SCU, while "in Europe" can be mapped to the "in Italy" SCU. These can be considered paraphrases in the context of the sentences and by using simple common knowledge.
It is always a good idea to check the contributors of an SCU to see the amount of variation between them; this will often indicate that less strict mappings are possible, as in the example of annotating the year an event happened ("in 1994", "in 1993", "in the early 90s"). The different contributors of a content unit express the same, or nearly the same, information, so some differences are possible and a match can still occur. For example, a peer sentence "The population of the world is likely to double - to more than 10bn people" can be mapped to an SCU with label "The world population will reach over 10 billion in 2050 with current growth rates", even though the temporal information is missing in the peer sentence. The decision to make the match can be facilitated by looking at all the contributors in the pyramid SCU and confirming that the peer expresses information that is in all of them; in this case, the shared information can be the fact that the earth's population will reach 10 billion.
When the summary conveys information that is new or significantly different from that in the model pyramid, it should be split into contributors of the appropriate size and mapped to the service SCU at the bottom of the annotation panel.
You will notice that automatic summaries often repeat the same information in different sentences. Do annotate both instances, adding the corresponding contributors to the appropriate SCU. Repetition is taken into account in the final summary score.
Occasionally, the labels of two SCUs sound similar and it is difficult to understand, just by reading the labels, what the difference between the two content units is. In such cases, click on the SCUs to see their contributors---the contributors will give you an idea of what the emphasis is in each content unit.
The same text selection from the peer cannot be matched to two different pyramid SCUs. If a clause expresses more than one SCU, a different text span that best represents the meaning of each SCU needs to be selected.
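This constraint is easy to check mechanically once selections are recorded. The Python sketch below is an illustration under assumptions: selections are represented as hypothetical (start, end) character offsets in the peer summary, SCUs by made-up ids such as "SCU-1", and only identical selections are flagged (partially overlapping spans are not examined).

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) character offsets in the peer


def doubly_matched_spans(matches: List[Tuple[Span, str]]) -> Dict[Span, Set[str]]:
    """Return the selections that were matched to more than one SCU id."""
    scus_by_span = defaultdict(set)
    for span, scu_id in matches:
        scus_by_span[span].add(scu_id)
    return {span: ids for span, ids in scus_by_span.items() if len(ids) > 1}


matches = [
    ((0, 36), "SCU-1"),
    ((0, 36), "SCU-2"),   # same selection reused: violates the rule
    ((37, 46), "SCU-3"),
]
print(doubly_matched_spans(matches))
```

An empty result means every selection is tied to exactly one SCU; any entry returned points to a clause that should be re-split into different spans.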
When matching a peer contributor, try to find the most highly weighted appropriate SCU in the pyramid.
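As a minimal Python sketch of this preference: given the SCUs that the annotator has already judged appropriate for a contributor, pick the one with the highest weight. Deciding what is appropriate remains a human judgement and is not automated here; the function name pick_scu, the example labels and the weights are hypothetical.

```python
from typing import List, Optional, Tuple


def pick_scu(candidates: List[Tuple[str, int]]) -> Optional[str]:
    """Pick the label of the highest-weighted candidate, or None if there are none.

    `candidates` holds (label, weight) pairs that the annotator has ALREADY
    judged appropriate for the peer contributor.
    """
    if not candidates:
        return None
    # max() keeps the first of several equally weighted candidates.
    return max(candidates, key=lambda pair: pair[1])[0]


candidates = [
    ("hypothetical label X", 2),
    ("hypothetical label Y", 4),
    ("hypothetical label Z", 1),
]
print(pick_scu(candidates))
```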
Some of the low-weight SCUs in the pyramid carry more information than the highly weighted ones and have longer contributors. They represent less important information and you can be more liberal when matching peer content to them---it is enough for the peer to convey part of the information. The pyramid analysis is very useful in that it allows the annotator to decide which information is important and should be split into fine-grained content units, and which information is not so important.
Content units from information that does not match a pyramid SCU: very often the peer summary contains information that is not covered by any of the pyramid SCUs. Such information should be split into content units (without assigning a label). It is important that the "leftovers" are carefully split into segments that conform to the definition of a content unit. Again, complex sentences in which the different clauses convey different details will be split into separate SCUs. For example, "The unprecedented cold wave, which took the lives of 30 people in Eastern Europe, is finally ending" will be split into two SCUs, one about the number of victims and another about the end of the cold wave. Complex sentences that do not convey more than one new detail about an event or entity will remain as a single SCU contributor, for example "Many discovered upon their retirement that their pension money no longer existed." and "Bilking a large number of people out of millions of dollars can lead to sentences that vary from ten to twenty years."
When evaluating summaries that need to be of a specific length, the final sentence is often truncated before its actual end. If there is enough of the truncated sentence to get a match with a content unit, it should be annotated as a contributor. If it expresses no clear idea that can be mapped to a content unit, it should be put as a "Non-matching contributor". If some ideas are expressed in the truncated sentence, but they do not correspond to an existing SCU, add the appropriate parts to the "Non-matching SCU".
When annotating multiple peers for the same pyramid, use the