Abstract
In this thesis, we aim to find suitable automatic evaluation metrics for topic models in the Aftenposten news domain. We review and experiment with six evaluation metrics from previous research that potentially correspond with human judgments. Most importantly, we carry out two secondary tasks using the available metadata to assess how well these metrics perform. From these tasks we find that, when comparing models with the same number of topics, the inter-doc similarity and within-doc rank metrics are the most suitable for evaluating topic models trained to rank relevant documents, and that inter-doc similarity is also suitable for evaluating models trained to classify documents into sections of the Aftenposten corpus. These metrics can be directly adopted to evaluate topic models trained for related use cases in the future.