A Maximum Entropy Approach to Proper Name Classification for Norwegian

Haaland, Åsne

Doctoral thesis


Year

2008

Abstract

In this Named Entity Recognition study, the proper names in Norwegian text are classified into six semantic categories: PERSON, ORGANIZATION, LOCATION, WORK, EVENT and OTHER. The first three correspond to the MUC categories of the same name. Our system performs classification only, as a separate grammatical tagger detects the names. This thesis examines in detail how useful the different features of the name and its context are for automatic classification.

We annotated a POS-tagged corpus comprising some 7 500 proper names. An off-the-shelf implementation of maximum entropy modeling was used for training and testing, with ten-fold cross-validation. Feature selection proceeds in three steps: results are first recorded for single features (or feature classes); the most important feature is then combined with a second feature to form pairs; finally, features are added incrementally.
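The three-step selection procedure can be sketched as a greedy forward search. This is a minimal illustration, not the thesis implementation: the function name `three_step_selection`, the `score` callback (which would stand in for a cross-validated F-measure), and the stopping rule (stop when no added feature improves the score) are all assumptions. The demo scores at the bottom are invented toy numbers, not results from the thesis.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Hypothetical scoring callback: maps a feature set to a quality score,
# e.g. a cross-validated F-measure. It is supplied by the caller.
Score = Callable[[FrozenSet[str]], float]


def three_step_selection(features: Sequence[str],
                         score: Score) -> Tuple[List[str], float]:
    """Greedy forward selection in the spirit of the three steps above."""
    # Step 1: score every feature on its own and keep the best one.
    singles = {f: score(frozenset([f])) for f in features}
    best = max(singles, key=singles.get)
    selected = [best]
    best_score = singles[best]

    # Step 2 is the first pass of this loop (best feature paired with each
    # remaining feature); step 3 is every later pass (incremental addition).
    remaining = [f for f in features if f != best]
    while remaining:
        trials = {f: score(frozenset(selected + [f])) for f in remaining}
        candidate = max(trials, key=trials.get)
        # Assumed stopping rule: quit when nothing improves the score.
        if trials[candidate] <= best_score:
            break
        selected.append(candidate)
        best_score = trials[candidate]
        remaining.remove(candidate)
    return selected, best_score


# Toy demo with invented additive gains (integers, purely illustrative).
TOY_GAINS = {"A": 700, "B": 60, "C": 30, "D": -10}


def toy_score(feature_set: FrozenSet[str]) -> float:
    return sum(TOY_GAINS[f] for f in feature_set)


if __name__ == "__main__":
    print(three_step_selection(list(TOY_GAINS), toy_score))
```

In a real experiment the `score` callback would train and evaluate the classifier on each fold, which makes every call expensive; the greedy search keeps the number of calls quadratic in the number of features rather than exponential.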

We achieve a cross-validation F-measure of 81.4 (2.6). PERSON scores highest with an F-measure of 89.1 (2.6), followed by LOCATION at 80.3 (3.7) and ORGANIZATION at 72.1 (4.0). Results are very poor for the three infrequent categories EVENT, WORK and OTHER.

Our classifier employs a symmetric window of size three anchored at the name, where both the name and its neighbours are represented as lemmas. At 69.7 (2.4), this was by far the most effective feature. Name lists totalling 13 thousand names accounted for half of the gain achieved by adding features to the lemma window. Other features include the last five letters of the name and its immediate left neighbour. We also record whether the name is an acronym and, for multi-word names, the distribution of capitalized first letters: Norwegian names of public institutions typically have lowercase non-first parts, as in Det humanistiske fakultet (The Faculty of Humanities). Our results are comparable with those of a memory-based system.
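Some of the features described above can be sketched as a small extraction function. This is only an illustration under stated assumptions: the function name `extract_features`, the feature keys, the boundary markers `<BOS>` and `<EOS>`, the reading of "window of size three" as one lemma on each side of the name, and the acronym test (a single all-uppercase alphabetic token) are not taken from the thesis. The lemmas in the example are supplied by hand.

```python
from typing import Dict, List, Union

FeatureValue = Union[str, bool]


def extract_features(forms: List[str], lemmas: List[str],
                     start: int, end: int) -> Dict[str, FeatureValue]:
    """Features for the name occupying forms[start:end] (end exclusive)."""
    name_forms = forms[start:end]
    name = " ".join(name_forms)

    # Lemma window: left neighbour, the name itself, right neighbour.
    left = lemmas[start - 1] if start > 0 else "<BOS>"
    right = lemmas[end] if end < len(lemmas) else "<EOS>"

    # Capitalization pattern: 'X' for a capitalized part, 'x' otherwise.
    cap_pattern = "".join("X" if w[:1].isupper() else "x" for w in name_forms)

    # Assumed acronym test: one token, alphabetic, entirely uppercase.
    is_acronym = (len(name_forms) == 1 and len(name) > 1
                  and name.isalpha() and name.isupper())

    return {
        "left_lemma": left,
        "name_lemma": " ".join(lemmas[start:end]),
        "right_lemma": right,
        "suffix5": name[-5:],  # last five letters of the name
        "is_acronym": is_acronym,
        "cap_pattern": cap_pattern,
    }


if __name__ == "__main__":
    forms = ["Hun", "studerer", "ved", "Det", "humanistiske", "fakultet", "."]
    lemmas = ["hun", "studere", "ved", "det", "humanistisk", "fakultet", "."]
    print(extract_features(forms, lemmas, 3, 6))
```

For Det humanistiske fakultet the capitalization pattern comes out as `Xxx`, which is the kind of signal the abstract associates with names of public institutions.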