An increasing volume of text is being digitized from paper or converted to plaintext from paper-centric file formats. Because such documents were typically typeset for a two-dimensional printed page, words are commonly hyphenated across line breaks. These hyphenations introduce noise into a corpus by splitting words or running them together, leading to omissions when the text is indexed. Longer content words are disproportionately affected, and in search applications hyphenation noise can also disrupt exact-phrase queries. The task of dehyphenation is to detect and remove only those hyphens that were inserted for typographical reasons at the time of typesetting, producing a text that lies closer to the original.
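To make the task concrete, the sketch below illustrates one simple, dictionary-free heuristic of the kind alluded to above (an illustration only, not one of the methods evaluated in this thesis): a line-break hyphen is removed when the joined word is attested elsewhere in the same text, and kept otherwise. The function name, the regular expressions, and the attested-elsewhere rule are assumptions made for this example.

```python
import re

# Unicode-aware "letters only" pattern (covers Norwegian letters such as æ, ø, å).
WORD = r"[^\W\d_]+"

def dehyphenate(text):
    """Join line-break hyphenations when the merged word is attested
    elsewhere in the text; otherwise keep the hyphen, since it may be
    part of a genuine compound such as "well-known"."""
    # Vocabulary of word forms as they appear within single lines.
    vocab = {w.lower() for w in re.findall(WORD, text)}

    def join(match):
        prefix, suffix = match.group(1), match.group(2)
        merged = prefix + suffix
        if merged.lower() in vocab:
            return merged                 # typographical hyphen: drop it
        return prefix + "-" + suffix      # likely a real hyphen: keep it

    # Match "frag-\nment" style line-break hyphenations.
    return re.sub(rf"({WORD})-\n({WORD})", join, text)

sample = "The corpus was indexed.\nEvery cor-\npus entry was checked."
print(dehyphenate(sample))
# -> The corpus was indexed.
#    Every corpus entry was checked.
```

Because the attested-elsewhere rule requires no precompiled dictionary or supervision, heuristics in this family are plausible candidates for the kind of language-independent methods this thesis investigates.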
In this thesis, several empirical methods for dehyphenation are described, prototyped, and evaluated on a heterogeneous sample of English and Norwegian academic texts from the Norwegian Open Research Archive (NORA). Most of the methods investigated are intended to be applicable across many alphabetic languages without requiring close supervision or previously compiled dictionaries. Recommendations and suggestions for future work are given.