Structurally Informed Document Classification of Norwegian Job Announcements

Moldestad, Celina Utnegaard

Master thesis

Year

2019

Abstract

The Norwegian Welfare Administration (NAV) publishes a large dataset of job announcements every year. Each announcement is assigned an occupation code, and NAV currently performs this classification manually according to the International Standard Classification of Occupations (ISCO). Manual registration is time-consuming, and human judgment can in principle introduce variation between annotators in how ISCO codes are interpreted. An automated approach using machine learning could alleviate these problems. This thesis presents research and experiments on the task of classifying Norwegian job announcements using natural language processing and convolutional neural networks. We present the public data collections and perform data exploration and analysis. The thesis addresses two topics, which we later combine in the hope of increasing the accuracy of the classifier. The first is topic segmentation. The second is document classification using a convolutional neural network architecture with dynamic pooling, where the pooling regions are based on the variable-sized segments produced by the topic segmentation algorithm. We present strong empirical results and perform a thorough hyperparameter search on the convolutional models with pre-trained word embeddings, combining randomized hyperparameter search with well-motivated hyperparameter choices.
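The idea of pooling over variable-sized segments can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `segment_max_pool`, the toy feature map, and the segment boundaries are hypothetical, and max pooling is assumed as the pooling operation. The feature map stands in for the output of a convolutional layer (one row per text position, one column per filter), and the segments stand in for the spans produced by a topic segmentation step.

```python
from typing import List, Sequence, Tuple


def segment_max_pool(
    feature_map: Sequence[Sequence[float]],
    segments: Sequence[Tuple[int, int]],
) -> List[float]:
    """Max-pool every filter over every segment and concatenate the results.

    feature_map: rows are text positions, columns are convolutional filters.
    segments: half-open (start, end) position ranges; assumed here to come
        from a topic segmentation step (hypothetical input format).
    """
    pooled: List[float] = []
    for start, end in segments:
        if not 0 <= start < end <= len(feature_map):
            raise ValueError(f"invalid segment ({start}, {end})")
        region = feature_map[start:end]
        # zip(*region) yields one tuple per filter, holding that filter's
        # activations at every position inside the segment.
        pooled.extend(max(column) for column in zip(*region))
    return pooled


# Toy example: 5 positions, 2 filters, 2 segments of different lengths.
feature_map = [
    [0.1, 0.9],
    [0.4, 0.2],
    [0.3, 0.5],
    [0.8, 0.0],
    [0.2, 0.7],
]
segments = [(0, 2), (2, 5)]
pooled = segment_max_pool(feature_map, segments)
print(pooled)
```

In this sketch the output length is the number of segments times the number of filters, so a downstream classifier layer would need a fixed number of segments per document or some padding scheme. How the thesis handles this is not stated in the abstract.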