Detail of publication
Citation
Jan Švec, Jan Hoidekr, Daniel Soutner and Jan Vavruška: Web text data mining for building large scale language modelling corpus. In: Habernal, Ivan and Matoušek, Václav (eds.), Lecture Notes in Computer Science, vol. 6836, p. 356-363, Springer Berlin / Heidelberg, 2011.
Abstract
The paper describes a system for collecting a large text corpus from Internet news servers. The architecture and text preprocessing algorithms are described, along with the duplicate detection algorithm used. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with topics assigned and duplicates identified. Corpus statistics such as consistency and perplexity are presented.
Detail of publication
Title: | Web text data mining for building large scale language modelling corpus |
---|---|
Author: | Jan Švec ; Jan Hoidekr ; Daniel Soutner ; Jan Vavruška |
Language: | English |
Date of publication: | 1 Sep 2011 |
Year: | 2011 |
Type of publication: | Papers in proceedings of reviewed conferences |
Book title: | Text, Speech and Dialogue |
Series: | Lecture Notes in Computer Science |
Volume: | 6836 |
Page: | 356 - 363 |
ISBN: | 978-3-642-23537-5 |
Editor: | Habernal, Ivan and Matoušek, Václav |
Publisher: | Springer Berlin / Heidelberg |
Keywords
language modelling, Internet, topic identification, duplicity detection
BibTeX
@INPROCEEDINGS{JanSvec_2011_Webtextdatamining,
  author = {Jan \v{S}vec and Jan Hoidekr and Daniel Soutner and Jan Vavru\v{s}ka},
  title = {Web text data mining for building large scale language modelling corpus},
  year = {2011},
  publisher = {Springer Berlin / Heidelberg},
  volume = {6836},
  pages = {356-363},
  editor = {Habernal, Ivan and Matou\v{s}ek, V\'{a}clav},
  booktitle = {Text, Speech and Dialogue},
  series = {Lecture Notes in Computer Science},
  ISBN = {978-3-642-23537-5},
  url = {http://www.kky.zcu.cz/en/publications/JanSvec_2011_Webtextdatamining},
}