wiki-tokenizer

Code to download and tokenize Wikipedia data.

Install

You can install wikitokenizer directly from PyPI:

pip install wikitokenizer

Or from source:

git clone https://github.com/tpimentelms/wikitokenizer.git
cd wikitokenizer
pip install --editable .

Dependencies

Wiki tokenizer has the following main requirements:

Usage

To download and tokenize Wikipedia data for a specific language in Wiki40B:

$ tokenize_wiki_40b --language <wikipedia_language_code> --tgt-dir <tgt_dir> --break-text-mode <break_text_mode>

Where <wikipedia_language_code> is the Wikipedia language code of the desired language, <tgt_dir> is the directory where the data should be saved, and <break_text_mode> is one of document, paragraph, or sentence. The script then produces train.txt, validation.txt, and test.txt files. To tokenize Finnish data, for example, run:

$ tokenize_wiki_40b --language fi --tgt-dir output/fi/ --break-text-mode document
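To process several languages in one go, the call above can be wrapped in a small Python script. This is a minimal sketch, not part of wikitokenizer: the helper names (build_wiki40b_command, missing_outputs, tokenize_languages) are hypothetical, and it assumes the three output files are written directly inside <tgt_dir>.

```python
import shlex
import subprocess
from pathlib import Path

# Values the README lists for --break-text-mode.
BREAK_TEXT_MODES = ("document", "paragraph", "sentence")
# File names the README says the script produces.
EXPECTED_FILES = ("train.txt", "validation.txt", "test.txt")


def build_wiki40b_command(language, tgt_dir, break_text_mode):
    """Return the argument list for one tokenize_wiki_40b call."""
    if break_text_mode not in BREAK_TEXT_MODES:
        raise ValueError(f"unknown break text mode: {break_text_mode!r}")
    return [
        "tokenize_wiki_40b",
        "--language", language,
        "--tgt-dir", str(tgt_dir),
        "--break-text-mode", break_text_mode,
    ]


def missing_outputs(tgt_dir):
    """Return the expected output files that are not present in tgt_dir.

    Assumption: the files are written directly inside tgt_dir.
    """
    return [name for name in EXPECTED_FILES if not (Path(tgt_dir) / name).is_file()]


def tokenize_languages(languages, root="output", break_text_mode="document"):
    """Run tokenize_wiki_40b once per language (hypothetical helper)."""
    for language in languages:
        tgt_dir = Path(root) / language
        command = build_wiki40b_command(language, tgt_dir, break_text_mode)
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(command, check=True)
        missing = missing_outputs(tgt_dir)
        if missing:
            raise RuntimeError(f"{language}: missing output files {missing}")


# Print the command for Finnish without executing it.
print(shlex.join(build_wiki40b_command("fi", "output/fi/", "document")))
```

Calling tokenize_languages(["fi", "en"]) would then download and tokenize each language into its own directory under output/, provided wikitokenizer is installed and tokenize_wiki_40b is on the PATH.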

To tokenize a previously downloaded file, run:

$ tokenize_wiki_file --language fi --src-fname <src_fname> --tgt-fname output/fi/wiki.txt

Finally, to fall back to multilingual tokenizer / sentencizer models (instead of language-specific ones), pass the flag --allow-multilingual when calling these scripts.
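As an illustration, the tokenize_wiki_file call with the fallback enabled could be assembled from Python as follows. This is a sketch only: build_wiki_file_command is a hypothetical helper, the input path is a placeholder, and --allow-multilingual is assumed to be a bare switch that takes no value.

```python
import shlex


def build_wiki_file_command(language, src_fname, tgt_fname, allow_multilingual=False):
    """Return the argument list for one tokenize_wiki_file call."""
    command = [
        "tokenize_wiki_file",
        "--language", language,
        "--src-fname", str(src_fname),
        "--tgt-fname", str(tgt_fname),
    ]
    if allow_multilingual:
        # Assumed to be a bare switch that takes no value.
        command.append("--allow-multilingual")
    return command


# "downloads/fi.txt" is a placeholder; the README leaves <src_fname> unspecified.
command = build_wiki_file_command(
    "fi", "downloads/fi.txt", "output/fi/wiki.txt", allow_multilingual=True
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) to execute the script.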

Development setup

Create a conda environment:

$ conda env create -f environment.yml

Then install the lib in editable mode:

$ pip install --editable .
