Abstract
In this work we present a method for creating high-quality corpora
from collections of user-generated content, which we apply to a
snapshot of Wikipedia to create a very large corpus. Both our
software implementation and the corpus are released to the public.
Our approach combines machine learning and hand-written rules to
remove a large portion of content that has little value for most
information retrieval and natural language processing tasks.
This work also contains a survey of several state-of-the-art sentence
boundary detectors, and we develop methods for improving their
performance by taking advantage of layout information. Finally, we
perform a quantitative comparison with a corpus created by an
earlier tool.