Exploring Statistical Machine Translation between Norwegian and English

Jeong, Jieun

Master thesis

Year

2018

Abstract

This thesis presents findings from experiments on statistical machine translation (SMT) between Norwegian and English. First, we extracted a Norwegian-English parallel corpus from data published by the Norwegian Language Bank to obtain data sets for training SMT systems. We examined the parallel corpus in several ways, from compiling corpus statistics to measuring domain similarity between sub-corpora that were manually divided into different domains of law. We conclude that the parallel corpus can be a valuable resource for research on SMT between Norwegian and English, as well as for other fields in natural language processing. Furthermore, we trained more than a hundred SMT systems on the parallel corpus to find more effective ways of training a better SMT system. We evaluated the effect of training data size on translation output. We also compared multiple systems trained on various combinations of domains from the corpus. The results show that an SMT system trained on limited training data can be as good as a system trained on the whole corpus.