Abstract
Although substantial, language-specific treebanks have been available for many years, Ancient Greek collections consistently rank in the bottom third for parsing accuracy, even with the best-performing parsers (e.g., in the CoNLL shared tasks). With recent developments such as the release of a monolingual BERT model for Ancient Greek, we see the potential to improve performance by combining contextual word embeddings with several high-performing strategies such as biaffine attention, joint modeling, and language-specific enhancements. We address both dependency parsing and grammatical tagging. We evaluate on several datasets and show that our approach is successful, achieving state-of-the-art results on all metrics and datasets we test. We further probe the implications of our findings by identifying specific linguistic traits that these methods accommodate, and contextualize the results with regard to multi-task learning and domain adaptation. Specifically, we show that domain-specific training is crucial for performance on certain datasets and that diacritics are essential to tagging accuracy in Ancient Greek in general, and we propose a novel way of incorporating them into tokenizable text. We release our code and training details for reproducibility.