Deterministic, transition-based parsing has seen a surge of interest over the past decade, with research efforts targeting Dependency Grammar, Context-Free Grammar, Head-Driven Phrase Structure Grammar (HPSG), and Combinatory Categorial Grammar. Previous work, however, has not applied the transition-based approach to parsing with hand-crafted, large-scale unification-based grammars.
Basing our studies on the English Resource Grammar (ERG), we evaluate the feasibility of transferring strategies and methods from other transition-based approaches to a semantically ‘deep’, hand-crafted HPSG. Our parsing platform, dubbed CuteForce, is a pipeline that takes pretokenized sentences as input and produces syntacto-semantic representations in accordance with the ERG framework. The pipeline comprises a preprocessing and supertagging stage, followed by a transition-based parsing stage in which both deterministic and near-deterministic strategies are evaluated. We evaluate the supertagger in isolation, and compare our overall parsing results to those of other ERG parsers. This allows us to assess the trade-offs that a transition-based parsing approach for large-scale HPSGs may entail in terms of parser precision, robustness and efficiency, compared to ‘classic’ parsing approaches.
Both the preprocessing stage and the transition-based parser rely on large amounts of training data. To ensure that we had sufficient linguistic resources for our data-driven platform, the first part of the project was devoted to extracting a corpus from Wikipedia and converting this data into a gold standard treebank (WeScience Treebank) and a ‘silver standard’ parsed corpus (WikiWoods). The use of Wikipedia as a linguistic resource has received increased attention, and we expect that the methodology for corpus acquisition presented in this thesis could also prove useful to other research initiatives.
We find that large amounts of ‘silver standard’ training data allow us to train a supertagger that reaches a previously unmatched level of supertagging accuracy for the ERG. Further, our evaluation shows that although the transition-based parser does not obtain state-of-the-art accuracy, it still reaches a high level of accuracy, coupled with much higher parsing efficiency than other parsers based on the same grammar, making it a suitable choice when speed is a high priority.