dc.date.accessioned	2013-05-02T10:24:33Z
dc.date.available	2013-05-02T10:24:33Z
dc.date.issued	2012	en_US
dc.date.submitted	2012-11-15	en_US
dc.identifier.citation	Solberg, Lars Jørgen. A Corpus Builder for Wikipedia. Masteroppgave, University of Oslo, 2012	en_US
dc.identifier.uri	http://hdl.handle.net/10852/34914
dc.description.abstract	In this work we present a method for creating high-quality corpora from collections of user-generated content, which we apply to a snapshot of Wikipedia to create a very large corpus. Both our software implementation and the corpus are released to the public. Our approach uses both machine learning and hand-written rules to remove a large portion of the content that has little value for most information retrieval or natural language processing tasks. This work also contains a survey of several state-of-the-art sentence boundary detectors, and we develop methods for improving their performance by taking advantage of layout information. Finally, we perform a quantitative comparison with a corpus created with an earlier tool.	eng
dc.language.iso	eng	en_US
dc.title	A Corpus Builder for Wikipedia	en_US
dc.type	Master thesis	en_US
dc.date.updated	2013-04-30	en_US
dc.creator.author	Solberg, Lars Jørgen	en_US
dc.subject.nsi	VDP::420	en_US
dc.identifier.bibliographiccitation	info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&rft.au=Solberg, Lars Jørgen&rft.title=A Corpus Builder for Wikipedia&rft.inst=University of Oslo&rft.date=2012&rft.degree=Masteroppgave	en_US
dc.identifier.urn	URN:NBN:no-33662	en_US
dc.type.document	Masteroppgave	en_US
dc.identifier.duo	172517	en_US
dc.contributor.supervisor	Stephan Oepen and Jonathon Read	en_US
dc.identifier.bibsys	131473956	en_US
dc.identifier.fulltext	Fulltext https://www.duo.uio.no/bitstream/handle/10852/34914/1/thesis.pdf