Abstract
Efforts to use web data as corpora seek to overcome the limitations of traditional corpora by taking advantage of the web's huge size and diverse content. This thesis discusses the individual sub-tasks that make up the web corpus construction process, such as HTML markup removal, language identification, boilerplate removal, and duplicate detection. Additionally, using data provided by the Common Crawl Foundation, I develop a new, very large English corpus of more than 135 billion tokens. Finally, I evaluate the corpus by training word embeddings on it and show that the resulting model substantially outperforms models trained on other corpora on word analogy and word similarity tasks.