Code to download and tokenize Wikipedia data.
You can install wikitokenizer directly from PyPI:
pip install wikitokenizer
Or from source:
git clone https://github.com/tpimentelms/wikitokenizer.git
cd wikitokenizer
pip install --editable .
Wiki tokenizer has the following main requirements:
To download and tokenize Wikipedia data for a specific language in Wiki40B:
$ tokenize_wiki_40b --language <wikipedia_language_code> --tgt-dir <tgt_dir> --break-text-mode <break_text_mode>
Where <wikipedia_language_code> is the Wikipedia language code for the desired language, <tgt_dir> is the directory where data should be saved, and <break_text_mode> is one of 'document', 'paragraph' or 'sentence'. This script then produces train.txt, validation.txt and test.txt files. To tokenize Finnish data, for example, run:
$ tokenize_wiki_40b --language fi --tgt-dir output/fi/ --break-text-mode document
To tokenize a previously downloaded file, run:
$ tokenize_wiki_file --language fi --src-fname <src_fname> --tgt-fname output/fi/wiki.txt
Finally, to fall back to multilingual tokenizer / sentencizer models (instead of language-specific ones), pass the flag --allow-multilingual when calling these scripts.
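When processing several languages, it can help to assemble the command lines programmatically. The following is a minimal sketch, not part of wikitokenizer itself: the helper names build_wiki40b_command and expected_outputs are hypothetical, the language codes in the loop are only examples, and the assumption that the three split files land directly inside <tgt_dir> is inferred from the description above rather than confirmed. It uses only the flags documented in this README.

```python
import shlex
from pathlib import Path

# Break-text modes listed in this README.
BREAK_TEXT_MODES = ("document", "paragraph", "sentence")
# Split files the README says tokenize_wiki_40b produces.
SPLIT_FILES = ("train.txt", "validation.txt", "test.txt")


def build_wiki40b_command(language, tgt_dir, break_text_mode="document",
                          allow_multilingual=False):
    """Return the argv list for one tokenize_wiki_40b call (hypothetical helper)."""
    if break_text_mode not in BREAK_TEXT_MODES:
        raise ValueError(
            f"break_text_mode must be one of {BREAK_TEXT_MODES}, got {break_text_mode!r}"
        )
    cmd = [
        "tokenize_wiki_40b",
        "--language", language,
        "--tgt-dir", str(tgt_dir),
        "--break-text-mode", break_text_mode,
    ]
    if allow_multilingual:
        # Fall back to multilingual tokenizer / sentencizer models.
        cmd.append("--allow-multilingual")
    return cmd


def expected_outputs(tgt_dir):
    """Paths of the split files, ASSUMING they are written directly into tgt_dir."""
    return [Path(tgt_dir) / name for name in SPLIT_FILES]


if __name__ == "__main__":
    # Example language codes only; substitute the ones you need.
    for lang in ("fi", "en"):
        cmd = build_wiki40b_command(lang, f"output/{lang}/")
        # Print the command instead of running it, so the sketch has no side effects.
        print(shlex.join(cmd))
        # To actually run it (requires wikitokenizer to be installed):
        # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list and passing it to subprocess.run avoids shell quoting problems, and validating the break-text mode up front catches typos before any download starts.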
Create a conda environment:
$ conda env create -f environment.yml
Then install the lib in editable mode:
$ pip install --editable .