In this thesis I have presented ARN – an Automatic Anaphora Resolution System for Norwegian.
Anaphora are words that specify a real-world entity by referring to another textual item, the antecedent. In natural languages, anaphora are an essential part of the cohesive forces that hold discourse together. This makes anaphora resolution highly important for numerous natural language processing (NLP) applications, such as natural language interfaces, automatic text abstracting, information extraction and machine translation.
Consider the following example:
“But Mr Prodi will have to wait until after the election of a new Italian head of state in mid-May before he_1 can actually begin to appoint ministers from among his centre left coalition supporters and begin to run the country. Despite this, Mr Berlusconi has remained defiant. Speaking to supporters in the northern city of Trieste, he_2 said he_3 had no intention of making any formal telephone call to Mr Prodi conceding defeat, as he_4 believes the new centre left coalition will quickly become unglued, reports the BBC’s David Willey in Rome.”
There are several anaphora in this text, but let us concentrate on he_4. Resolving this anaphor, i.e. finding its antecedent, happens automatically for a human reader, without any conscious thought. Only if her mind is elsewhere while reading might she stop for a moment and think: “Who, Prodi or Berlusconi? Oh, that twit!”. For a machine, the possible antecedent candidates are, apart from the (for us obvious) Prodi and Berlusconi, also: the election of a new Italian head of state in mid-May, the election, a new Italian head of state, state, mid-May, he_1, ministers, his centre left coalition supporters, centre, left, coalition, supporters, the country, this, speaking to supporters, speaking, supporters, the northern city of Trieste, he_2, he_3, etc. So for a machine the task of anaphora resolution is not trivial at all.
I have developed ARN as a rule-based anaphora resolution system, on the basis of two existing systems for English: MARS (Mitkov 2001) and RAP (Lappin and Leass 1994). Designing an AR system for Norwegian has turned out to be a challenge, since features that are important for English can turn out to be simply wrong for Norwegian. Many of the resolution rules used by both RAP and MARS are based on the salience hierarchy (Grosz 1995), in which the subject gets a very high score, which in turn makes the sentential subject the most probable antecedent candidate. According to the hierarchy, direct objects are preferred to indirect objects, which are in turn preferred to prepositional and adverbial phrases. This hierarchy does not, however, seem to work for Norwegian, as implementing it impaired the results.

The explanation can be sought in the difference in information structure between Norwegian and English. While English has no problem with introducing new (and highly salient) discourse entities as subjects, Norwegian goes to great lengths to avoid this. Faarlund et al. (1997) provide the example that (1) is not a natural answer to question (2):

(1) "Nils fann pengane." 'Nils found the-money'
(2) "Kven fann pengane?" 'Who found the-money'

The natural answer would be sentence (3):

(3) "Det var Nils som fann pengane." 'It was Nils who found the-money'

Clefting is not unknown in English, but some studies (Gundel 2002, Johansson 2001) have shown that its use is much more widespread in Norwegian and Swedish. Clefting is just one of the ways of avoiding introducing new information by subject; besides it, Norwegian also uses other related constructions such as topicalisation and presentation, thus leaving many subjects as the expletive det 'it', which is highly inappropriate as a reference candidate.

Apart from these 'centering' resolution factors, ARN employs several others:

1. Gender/number/person: In order to avoid sex/gender conflicts, this factor was implemented as a preference rather than a filter, so that it takes into account that e.g. the word doctor (masc.) can denote a person of female gender.
2. Reference proximity: This factor gives higher preference to candidates in the current sentence than to candidates in the penultimate and antepenultimate sentences.
3. Boost pronoun: This factor gives preference to pronominal candidates, based on the observation that pronominalized entities tend to be more salient.
4.-8. 'Centering' factors: This group comprises the five factors mentioned above: subject preference, direct object preference, indirect object preference, adverbial phrase penalization and prepositional phrase penalization.
9. Syntactic parallelism: This factor rewards candidates that fill the same syntactic role as the anaphor.
10. Section heading preference: This factor rewards candidates that have appeared in the text title.
11. Indefiniteness penalization: This factor penalizes candidates with indefinite form.
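The combined effect of these factors can be sketched as a simple salience-scoring loop over antecedent candidates. The candidate representation, weights and score values below are hypothetical illustrations chosen for readability, not ARN's actual implementation or parameter settings:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    role: str               # 'subj', 'dobj', 'iobj', 'pp', 'advp'
    sentence_distance: int  # 0 = same sentence as the anaphor
    is_pronoun: bool = False
    gender_match: bool = True
    indefinite: bool = False
    in_heading: bool = False

# Hypothetical weights; ARN's real values are not given here.
ROLE_SCORES = {'subj': 80, 'dobj': 50, 'iobj': 40, 'pp': -20, 'advp': -20}
PROXIMITY_SCORES = {0: 50, 1: 25, 2: 10}  # current / penultimate / antepenultimate

def salience(cand: Candidate, anaphor_role: str) -> int:
    score = 0
    score += ROLE_SCORES.get(cand.role, 0)                    # factors 4-8: 'centering'
    score += PROXIMITY_SCORES.get(cand.sentence_distance, 0)  # factor 2: proximity
    if cand.gender_match:
        score += 40       # factor 1: a preference, not a hard filter
    if cand.is_pronoun:
        score += 30       # factor 3: boost pronoun
    if cand.role == anaphor_role:
        score += 20       # factor 9: syntactic parallelism
    if cand.in_heading:
        score += 20       # factor 10: section heading preference
    if cand.indefinite:
        score -= 30       # factor 11: indefiniteness penalization
    return score

def resolve(candidates, anaphor_role='subj'):
    # Pick the highest-scoring candidate as the antecedent.
    return max(candidates, key=lambda c: salience(c, anaphor_role))
```

With this weighting, a subject candidate in the current sentence (such as Mr Berlusconi for he_4 in the example above) outscores an equally well-matched subject one sentence back, which is the intended effect of combining the centering and proximity factors.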
ARN has been designed to resolve third person pronouns, with the exception of the pronoun det 'it (neut.)'. On this task it has achieved an accuracy of 70.5%.