LanguageCrawl

The web data contains immense amount of data, hundreds of billion words are waiting to be extracted and used for language research. In this work we introduce our tool LanguageCrawl which allows NLP researchers to easily construct web-scale corpus from Common Crawl Archive: a petabyte scale, open repository of web crawl information. Three use-cases are presented: filtering Polish websites, building an N-gram corpora and training continuous skip-gram language model with hierarchical softmax. Each of them has been implemented within the LanguageCrawl toolkit, with the possibility to adjust specified language and N-gram ranks. Special effort has been put on high computing efficiency, by applying highly concurrent multitasking. We make our tool publicly available to enrich NLP resources. We strongly believe that our work will help to facilitate NLP research, especially in under-resourced languages, where the lack of appropriately sized corpora is a serious hindrance to applying data-intensive methods, such as deep neural networks.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
conf		conf
tools		tools
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LanguageCrawl

About

Releases

Packages

Languages

sroziewski/LanguageCrawl

Folders and files

Latest commit

History

Repository files navigation

LanguageCrawl

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages