Web text data mining for building large scale language modelling corpus

Jan Švec; Jan Hoidekr; Daniel Soutner; Jan Vavruška

Citation

Jan Švec, Jan Hoidekr, Daniel Soutner, and Jan Vavruška: Web text data mining for building large scale language modelling corpus. In: Habernal, Ivan and Matoušek, Václav (eds.), Lecture Notes in Computer Science, vol. 6836, pp. 356-363, Springer Berlin / Heidelberg, 2011.

Abstract

The paper describes a system for collecting a large text corpus from Internet news servers. It presents the system architecture, the text preprocessing algorithms, and the duplicate detection algorithm. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with topics assigned and duplicates identified. Corpus statistics such as consistency and perplexity are presented.

Abstract (translated from Czech)

This paper describes a system for collecting large-scale text data from Internet news servers. It describes the architecture and the text preprocessing algorithms, as well as the duplicate detection algorithm used. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with assigned topics and marked duplicates. Corpus statistics such as consistency and perplexity are also given.

Publication details

Title:	Web text data mining for building large scale language modelling corpus
Authors:	Jan Švec; Jan Hoidekr; Daniel Soutner; Jan Vavruška
Title (Czech):	Získávání webových textových dat za účelem vytvoření rozsáhlého corpusu pro jazykové modelování
Publication language:	English
Publication date:	1 September 2011
Publication year:	2011
Publication type:	Paper in proceedings
Book title:	Habernal, Ivan and Matoušek, Václav
Series:	Lecture Notes in Computer Science
Volume:	6836
Pages:	356-363
ISBN:	978-3-642-23537-5
Editors:	Habernal, Ivan and Matoušek, Václav
Publisher:	Springer Berlin / Heidelberg

/ 2011-12-21 09:15:58 /

Keywords

language modelling, Internet, topic identification, duplicity detection

Keywords (translated from Czech)

modelling, Internet, topic identification, duplicate detection

BibTeX

@INPROCEEDINGS{JanSvec_2011_Webtextdatamining,
 author = {Jan \v{S}vec and Jan Hoidekr and Daniel Soutner and Jan Vavru\v{s}ka},
 title = {Web text data mining for building large scale language modelling corpus},
 year = {2011},
 publisher = {Springer Berlin / Heidelberg},
 volume = {6836},
 pages = {356--363},
 editor = {Habernal, Ivan and Matou\v{s}ek, V\'{a}clav},
 booktitle = {Habernal, Ivan and Matou\v{s}ek, V\'{a}clav},
 series = {Lecture Notes in Computer Science},
 ISBN = {978-3-642-23537-5},
 url = {http://www.kky.zcu.cz/en/publications/JanSvec_2011_Webtextdatamining},
}
