Abstract
In this work we present a method for creating high-quality corpora
from collections of user-generated content, which we apply to a
snapshot of Wikipedia to create a very large corpus. Both our
software implementation and the corpus are released to the public.
Our approach combines machine learning and hand-written rules to
remove a large portion of content that has little value for most
information retrieval and natural language processing tasks.
This work also contains a survey of several state-of-the-art sentence
boundary detectors, and we develop methods for improving their
performance by taking advantage of layout information. Finally, we
perform a quantitative comparison with a corpus created by an
earlier tool.