
dc.date.accessioned: 2013-03-12T11:51:24Z
dc.date.available: 2013-03-12T11:51:24Z
dc.date.issued: 2003 [en_US]
dc.date.submitted: 2003-05-05 [en_US]
dc.identifier.citation: Jonsdottir, Andra Bjork. ARNER, what kind of name is that?. Hovedoppgave, University of Oslo, 2003 [en_US]
dc.identifier.uri: http://hdl.handle.net/10852/26385
dc.description.abstract: Understanding names and their reference is central to the analysis of unrestricted texts and poses a significant challenge for a number of Natural Language Processing (NLP) applications. Named Entity Recognition serves as an important preprocessing tool for Information Extraction (IE), Information Retrieval (IR) and Machine Translation (MT). Petasisen et al. (2000) have described Named Entity Recognition (NER) as the task of identifying and semantically tagging proper names in running texts, into categories like person, location, organization, etc. This description captures the approach to NER presented in this thesis. The focus here is on named entities as proper names, whereas many other approaches have also treated numerical and temporal expressions as named entities. This thesis has served as part of the Nomen Nescio project in connection with the Text Laboratory at the University of Oslo.

The main focus of this thesis can be divided into two related parts. The first part concerns the choice of semantic categories for proper names, and the second part concerns experiments on practical solutions for automatically categorizing proper names into the chosen categories. Before the practical work of categorizing proper names can begin, the choice of categories has to be made. The following six categories were chosen:

1) Person names (e.g. names of people, pets and humanoids)
2) Location names (e.g. names of countries, cities, mountains, lakes, oceans etc.)
3) Organization names (e.g. institutions, firms, organizations, pop groups etc.)
4) Publication names (e.g. films, books, songs, short stories, papers, paintings etc.)
5) Event names (e.g. historical events, sports events etc.)
6) Miscellaneous names (e.g. products, vessels, and other proper names not belonging in any of the categories above)

Categorizing proper names is not a trivial task, not even when done manually, which is why considerable time has been spent on developing guidelines. Chapter 5 gives the explicit guidelines made for the annotation of named entities. These guidelines are now being used to re-annotate the test and training corpora for the Norwegian Nomen Nescio NER application.

The method I have used for the practical categorization is a rule-based method based on the Constraint Grammar (CG) formalism for morphosyntactic tagging. The CG formalism was developed by Fred Karlsson at the University of Helsinki. Paul Meurer at the University of Bergen has implemented a Norwegian CG-tagger developed by the Text Laboratory at the University of Oslo. There seem to be limitations to what the CG formalism permits, and the task of getting semantic information into the system needed to be resolved. I have tried to be creative in overcoming those limitations in the following ways:

* CG cannot look at the strings of words to get the information that is necessary for the semantic categorization. To overcome this drawback, I have used semantic labels from gazetteers, i.e. name lists, and a suffix module (chapters 6 and 7).
* CG does not give a direct opportunity to add large name lists into the formalism, and for this reason, name lists have been added to the lexicon.
* As there are no semantic labels on the words in the lexicon, I have used sets in the CG tagger to "simulate semantics" (chapter 7).

The Oslo-Bergen Tagger needed to be modified to handle the task of NER, with respect to both the preprocessor, as described in chapter 6, and the disambiguation module. The modifications of the preprocessor relate to the identification of complex proper names. The modifications of the Oslo-Bergen Tagger include the use of regular expressions together with a Document Centered Approach (DCA), a suffix module, and an expansion of the syntactic disambiguation. Additionally, the lexicon has been expanded to include proper names from various gazetteers.

Chapter 7 describes the different parts of ARNER, an Automatic Rule-based Named Entity Recognizer for Norwegian, and the work on developing the system. The section on semantics describes how the semantics have been made available to the ARNER rules. Chapter 8 presents the first results on the performance of the ARNER system, and ideas on how the system may be improved. The evaluation figures have shown that the system needs more rules, and safer rules, and that there is a lot to gain by implementing a DCA. The result of my work, the ARNER rules and sets, will be used in the rule-based NER system that the Norwegian Nomen Nescio group is developing. [nor]
dc.language.iso: eng [en_US]
dc.title: ARNER, what kind of name is that? : an automatic rule-based named entity recognizer for Norwegian [en_US]
dc.type: Master thesis [en_US]
dc.date.updated: 2006-01-04 [en_US]
dc.creator.author: Jonsdottir, Andra Bjork [en_US]
dc.subject.nsi: VDP::000 [en_US]
dc.identifier.bibliographiccitation: info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&rft.au=Jonsdottir, Andra Bjork&rft.title=ARNER, what kind of name is that?&rft.inst=University of Oslo&rft.date=2003&rft.degree=Hovedoppgave [en_US]
dc.identifier.urn: URN:NBN:no-9015 [en_US]
dc.type.document: Hovedoppgave [en_US]
dc.identifier.duo: 10630 [en_US]
dc.contributor.supervisor: Janne Bondi Johannessen [en_US]
dc.identifier.bibsys: 031540090 [en_US]

