Understanding names and their reference is central to the analysis of unrestricted texts and poses a significant challenge for a number of Natural Language Processing (NLP) applications. Named Entity Recognition serves as an important preprocessing tool for Information Extraction (IE), Information Retrieval (IR) and Machine Translation (MT).
Petasisen et al. (2000) have described Named Entity Recognition (NER) as the task of identifying and semantically tagging proper names in running texts, into categories like person, location, organization, etc. This description captures the approach to NER presented in this thesis. The focus here is on named entities as proper names, whereas many other approaches have also treated numerical and temporal expressions as named entities.
This thesis forms part of the Nomen Nescio project, in connection with the Text Laboratory at the University of Oslo.
The main focus of this thesis can be divided into two related parts. The first part concerns the choice of semantic categories for proper names, and the second part concerns experiments on practical solutions for automatically categorizing proper names into the chosen categories.
Before the practical work of categorizing proper names can begin, the choice of categories has to be made. The following six categories were chosen:
1) Person names (e.g. names of people, pets and humanoids)
2) Location names (e.g. names of countries, cities, mountains, lakes, oceans etc.)
3) Organization names (e.g. institutions, firms, organizations, pop groups etc.)
4) Publication names (e.g. films, books, songs, short stories, papers, paintings etc.)
5) Event names (e.g. historical events, sport events etc.)
6) Miscellaneous names (e.g. products, vessels, and other proper names not belonging in any of the categories above)
Categorizing proper names is not a trivial task, not even when done manually, which is why considerable time has been spent on developing guidelines. Chapter 5 gives the explicit guidelines made for the annotation of named entities. These guidelines are now being used to re-annotate the test and training corpus for the Norwegian Nomen Nescio NER application.
The method I have used for the practical categorization is a rule-based method based on the Constraint Grammar (CG) formalism for morphosyntactic tagging. The CG formalism was developed by Fred Karlsson at the University of Helsinki. Paul Meurer at the University of Bergen has implemented a Norwegian CG-tagger developed by the Text Laboratory at the University of Oslo.
There seem to be limitations to what the CG formalism permits, and the task of getting semantic information into the system needed to be resolved. I have tried to be creative in overcoming those limitations in the following ways:
* CG cannot look at the word strings themselves to get the information that is necessary for the semantic categorization. To overcome this drawback, I have used semantic labels from gazetteers, i.e. name lists, and a suffix module (chapters 6 and 7).
* CG offers no direct way to add large name lists to the formalism, and for this reason, name lists have been added to the lexicon.
* As there are no semantic labels on the words in the lexicon, I have used sets in the CG tagger to "simulate semantics" (chapter 7).
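
The three workarounds above can be illustrated with a minimal sketch in Python. It is not the ARNER implementation and does not use CG rule syntax. The gazetteer entries, the suffix list, the context set and the function names are all hypothetical, chosen only to show how labels from name lists and suffixes, together with a word set that stands in for semantic information, could feed a categorization decision.

```python
# Hypothetical gazetteer (name list): name -> semantic label.
# The entries are illustrative only, not taken from the thesis.
GAZETTEER = {
    "Oslo": "location",
    "Bergen": "location",
    "Kari": "person",
}

# Hypothetical suffix module: (suffix, label) pairs.
SUFFIX_LABELS = [
    ("sen", "person"),      # e.g. surnames such as Hansen
    ("fjord", "location"),
    ("vik", "location"),
]

# Hypothetical word set that "simulates semantics": words that tend to
# precede person names. Membership in the set stands in for a semantic tag.
PERSON_CONTEXT = {"professor", "direktør", "statsminister"}


def semantic_labels(token):
    """Collect candidate labels for a token from the gazetteer and suffixes."""
    labels = set()
    if token in GAZETTEER:
        labels.add(GAZETTEER[token])
    lowered = token.lower()
    for suffix, label in SUFFIX_LABELS:
        # Require the token to be longer than the suffix itself.
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            labels.add(label)
    return labels


def categorize(prev_word, token):
    """Pick a category from the left context first, then from token labels."""
    if prev_word.lower() in PERSON_CONTEXT:
        return "person"
    labels = semantic_labels(token)
    if len(labels) == 1:
        return next(iter(labels))
    return "unresolved"
```

In this sketch the context set takes priority over the token's own labels, which is a design choice made for brevity; a real rule set would weigh such evidence more carefully.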
The Oslo-Bergen Tagger needed to be modified to handle the task of NER, both with respect to the preprocessor, as described in chapter 6, and the disambiguation module. The modifications of the preprocessor have been related to the identification of complex proper names. The modifications of the Oslo-Bergen Tagger include the use of regular expressions together with a Document Centered Approach (DCA), a suffix module and an expansion of the syntactic disambiguation. Additionally, the lexicon has been expanded to include proper names from various gazetteers.
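
The general idea behind a document-centered approach can be sketched as follows, assuming that it means reusing the category of a name that has been classified somewhere in a document for unclassified occurrences of the same name in that document. This is a simplified illustration in Python, not the mechanism used in the modified tagger; the function name and the data layout are hypothetical.

```python
from collections import Counter, defaultdict


def propagate_labels(mentions):
    """Fill in missing labels from other mentions of the same name.

    `mentions` is a list of (name, label) pairs from one document, where
    label is None when the name could not be categorized in its context.
    """
    # Count the labels assigned to each name anywhere in the document.
    votes = defaultdict(Counter)
    for name, label in mentions:
        if label is not None:
            votes[name][label] += 1

    resolved = []
    for name, label in mentions:
        # Unlabelled mentions take the most frequent label seen for that name.
        if label is None and votes[name]:
            label = votes[name].most_common(1)[0][0]
        resolved.append((name, label))
    return resolved
```

A name that is never categorized anywhere in the document stays unlabelled, so the sketch only spreads existing decisions and never creates new ones.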
Chapter 7 gives a description of the different parts of ARNER, an Automatic Rule-based Named Entity Recognizer for Norwegian, and the work on developing the system. The section on semantics gives a description of how the semantics have been made available to the ARNER rules.
Chapter 8 presents the first results on the performance of the ARNER system, and ideas on how the system may be improved. The evaluation figures have shown that the system needs more rules, and safer rules, and that there is a lot to gain by implementing a DCA.
The result of my work, the ARNER rules and sets, will be used in the rule-based NER system that the Norwegian Nomen Nescio group is developing.