ARNER, what kind of name is that? An automatic rule-based named entity recognizer for Norwegian

Jonsdottir, Andra Bjork

Master's thesis

Year

2003

Abstract

Understanding names and their reference is central to the analysis of unrestricted texts and poses a significant challenge for a number of Natural Language Processing (NLP) applications. Named Entity Recognition serves as an important preprocessing tool for Information Extraction (IE), Information Retrieval (IR) and Machine Translation (MT).

Petasis et al. (2000) describe Named Entity Recognition (NER) as the task of identifying proper names in running text and semantically tagging them with categories such as person, location and organization. This description captures the approach to NER presented in this thesis. The focus here is on named entities as proper names, whereas many other approaches have also treated numerical and temporal expressions as named entities.

This thesis was written as part of the Nomen Nescio project, in connection with the Text Laboratory at the University of Oslo.

The main focus of this thesis can be divided into two related parts. The first part concerns the choice of semantic categories for proper names, and the second part concerns experiments on practical solutions for automatically categorizing proper names into the chosen categories.

Before the practical work of categorizing proper names can begin, the categories have to be chosen. The following six categories were chosen:

1) Person names (e.g. names of people, pets and humanoids)

2) Location names (e.g. names of countries, cities, mountains, lakes, oceans etc.)

3) Organization names (e.g. institutions, firms, organizations, pop groups etc.)

4) Publication names (e.g. films, books, songs, short stories, papers, paintings etc.)

5) Event names (e.g. historical events, sport events etc.)

6) Miscellaneous names (e.g. products, vessels, and other proper names not belonging in any of the categories above)
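
The six categories above can be represented as a small enumeration. The following is a minimal Python sketch. The names `NameCategory` and `NamedEntity` and the example annotation are illustrative assumptions and do not come from the thesis.

```python
from dataclasses import dataclass
from enum import Enum


class NameCategory(Enum):
    """The six categories listed above (identifier names are illustrative)."""
    PERSON = "person"
    LOCATION = "location"
    ORGANIZATION = "organization"
    PUBLICATION = "publication"
    EVENT = "event"
    MISCELLANEOUS = "miscellaneous"


@dataclass(frozen=True)
class NamedEntity:
    """A proper name together with its assigned category."""
    text: str
    category: NameCategory


# Hypothetical annotation: a city name falls under the location category.
example = NamedEntity("Oslo", NameCategory.LOCATION)
```

Using a closed enumeration means that every name must receive exactly one of the six labels, with the miscellaneous category as the fallback.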

Categorizing proper names is not trivial, even when done manually, which is why considerable time was spent on developing guidelines. Chapter 5 presents the explicit guidelines for the annotation of named entities. These guidelines are now being used to re-annotate the test and training corpora for the Norwegian Nomen Nescio NER application.

The method I have used for the practical categorization is rule-based and builds on the Constraint Grammar (CG) formalism for morphosyntactic tagging. The CG formalism was developed by Fred Karlsson at the University of Helsinki. Paul Meurer at the University of Bergen has implemented a Norwegian CG tagger developed by the Text Laboratory at the University of Oslo.

The CG formalism appears to limit what can be expressed, and the problem of getting semantic information into the system had to be solved. I have tried to overcome these limitations in the following ways:

* CG cannot inspect the word strings themselves to obtain the information needed for semantic categorization. To overcome this drawback, I have used semantic labels from gazetteers, i.e. name lists, and a suffix module (chapters 6 and 7).

* CG offers no direct way to add large name lists to the formalism; for this reason, the name lists have been added to the lexicon.

* As the words in the lexicon carry no semantic labels, I have used sets in the CG tagger to "simulate semantics" (chapter 7).
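
The idea of attaching candidate semantic labels from name lists and suffixes can be sketched in a few lines of Python. This is not the thesis's implementation, which works through the tagger's lexicon and CG sets. The gazetteer entries, the suffix table and the function name `semantic_labels` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy data; the actual gazetteers and suffix lists used in
# ARNER are not given in the abstract.
GAZETTEER = {
    "Oslo": {"location"},
    "Bergen": {"location"},
}
SUFFIX_LABELS = [
    ("sen", "person"),    # assumed surname-like ending
    ("vik", "location"),  # assumed place-name ending
]


def semantic_labels(token: str) -> set[str]:
    """Return the candidate semantic labels for a single token."""
    # Labels from the name list come first (copied so the table is not mutated).
    labels = set(GAZETTEER.get(token, set()))
    # Suffix labels are only considered for capitalized tokens.
    if token[:1].isupper():
        for suffix, label in SUFFIX_LABELS:
            if token.endswith(suffix) and len(token) > len(suffix):
                labels.add(label)
    return labels
```

A token may end up with several candidate labels or none. In a CG-style system, context-sensitive rules would then select or remove candidates, so this step only proposes labels and does not decide between them.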

The Oslo-Bergen Tagger needed to be modified to handle the task of NER, with respect to both the preprocessor, as described in chapter 6, and the disambiguation module. The modifications of the preprocessor concern the identification of complex proper names. The modifications of the Oslo-Bergen Tagger include the use of regular expressions together with a Document Centered Approach (DCA), a suffix module and an expansion of the syntactic disambiguation. Additionally, the lexicon has been expanded to include proper names from various gazetteers.
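
A document-centered approach generally reuses decisions made for unambiguous occurrences of a name to label other occurrences of the same name in the same document. The sketch below illustrates that general idea under simplifying assumptions (exact string match, majority vote). The function name `propagate_labels` and the input format are hypothetical, and the sketch does not claim to show how the thesis combines DCA with regular expressions.

```python
from collections import Counter, defaultdict


def propagate_labels(mentions):
    """Fill in missing labels from labelled mentions of the same name.

    `mentions` is a list of (name, label) pairs in document order,
    where label is None if the name could not be categorized in context.
    """
    # First pass: count the labels assigned to each name in this document.
    votes = defaultdict(Counter)
    for name, label in mentions:
        if label is not None:
            votes[name][label] += 1

    # Second pass: give unlabelled mentions the most frequent label, if any.
    resolved = []
    for name, label in mentions:
        if label is None and votes[name]:
            label = votes[name].most_common(1)[0][0]
        resolved.append((name, label))
    return resolved


# Hypothetical document: the second "Hydro" inherits the first one's label,
# while "Lillehammer" stays unlabelled because no mention of it was categorized.
doc = [("Hydro", "organization"), ("Hydro", None), ("Lillehammer", None)]
resolved_doc = propagate_labels(doc)
```

The two-pass design lets a name that is only categorized late in a document still inform its earlier mentions.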

Chapter 7 describes the different parts of ARNER, an Automatic Rule-based Named Entity Recognizer for Norwegian, and the work on developing the system. The section on semantics describes how semantic information has been made available to the ARNER rules.

Chapter 8 presents the first results on the performance of the ARNER system, along with ideas on how the system may be improved. The evaluation figures show that the system needs more rules, and safer rules, and that there is a lot to gain by implementing a DCA.

The result of my work, the ARNER rules and sets, will be used in the rule-based NER system that the Norwegian Nomen Nescio group is developing.