LittleLM is a character-level n-gram tiny language model toolkit.
It keeps a simple workflow:
- Read text fields from JSONL corpora.
- Split each text by whitespace into fragments.
- Build character-level n-gram counts inside each fragment.
- Save the 1- to N-gram dictionaries as JSONL files.
- Sample from generated dictionaries to produce text.
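The counting steps above can be sketched roughly as follows (the function name `ngram_counts` is illustrative, not the package API):

```python
from collections import Counter

def ngram_counts(texts, min_n=1, max_n=3):
    """Count character-level n-grams inside each whitespace fragment."""
    counts = {n: Counter() for n in range(min_n, max_n + 1)}
    for text in texts:
        for fragment in text.split():  # n-grams never cross fragment boundaries
            for n in range(min_n, max_n + 1):
                for i in range(len(fragment) - n + 1):
                    counts[n][fragment[i:i + n]] += 1
    return counts

counts = ngram_counts(["今天 天气", "今天 很好"])
print(counts[2]["今天"])  # "今天" appears in two fragments
```

Splitting first means a bigram like "天 天" is never counted, which matches the fragment-local counting described above.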
- Input must be JSONL (one JSON object per line).
- Default text key is `Content`.
- You can override it with `--text-key`.

See docs:

- docs/corpus_format.md
- docs/training.md
- docs/tuning.md
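A minimal corpus file using the default `Content` key could look like the one built here (written and read back with Python's standard `json` module; the file name is arbitrary):

```python
import json

# Two records, one JSON object per line, using the default "Content" key.
records = [
    {"Content": "午后的阳光很好"},
    {"Content": "今天 天气不错"},
]

with open("tiny.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

with open("tiny.jsonl", encoding="utf-8") as f:
    texts = [json.loads(line)["Content"] for line in f]

print(texts[0])  # 午后的阳光很好
```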
Install:

```
pip install -e .
```

Optional dev dependencies:

```
pip install -e .[dev]
```

Train a dictionary:

```
python -m littlelm.train \
  --input examples/tiny_corpus.jsonl \
  --output artifacts/dictionary \
  --min-n 1 \
  --max-n 3 \
  --text-key text
```

Sample with a direct seed:

```
python -m littlelm.sample \
  --dictionary artifacts/dictionary \
  --seed 午后 \
  --max-n 3 \
  --max-steps 20
```

Interactive sampling:

```
python -m littlelm.sample \
  --dictionary artifacts/dictionary \
  --interactive \
  --prompt-mode plain \
  --verbosity normal
```

Verbose sampling with temperature and top-k:

```
python -m littlelm.sample \
  --dictionary artifacts/dictionary \
  --seed 我 \
  --verbosity verbose \
  --temperature 0.8 \
  --top-k 10
```

Fixed n-gram order:

```
python -m littlelm.sample \
  --dictionary artifacts/dictionary \
  --seed 刚刚 \
  --n-selection-mode fixed \
  --fixed-n 3
```

Manual n weights:

```
python -m littlelm.sample \
  --dictionary artifacts/dictionary \
  --seed 今天 \
  --n-selection-mode manual \
  --n-weights "1:0.1,2:0.2,3:0.7"
```

You can also use the unified entry:

```
python -m littlelm train --input examples/tiny_corpus.jsonl --output artifacts/dictionary --min-n 1 --max-n 10 --text-key text
python -m littlelm sample --dictionary artifacts/dictionary --max-n 10 --max-steps 20
```

Note: examples/tiny_corpus.jsonl uses the key `text`, so pass `--text-key text` when training with it.
Each dictionary JSONL line:

```
{"ngram": "示例", "count": 123}
```

Repository layout:

```
src/littlelm/   # core package
scripts/        # helper scripts
examples/       # tiny demo data
docs/           # docs
tests/          # pytest tests
archive/        # legacy archived scripts
```
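Given that line format, a dictionary folder can be loaded with a short helper like this (a sketch assuming files are named `1-gram.jsonl`, `2-gram.jsonl`, and so on, matching the `*-gram.jsonl` pattern; `load_dictionary` is not part of the package API):

```python
import json
from pathlib import Path

def load_dictionary(folder):
    """Load *-gram.jsonl files into {n: {ngram: count}} maps."""
    tables = {}
    for path in Path(folder).glob("*-gram.jsonl"):
        n = int(path.name.split("-")[0])  # e.g. "2-gram.jsonl" -> 2
        table = {}
        with path.open(encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                table[record["ngram"]] = record["count"]
        tables[n] = table
    return tables
```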
- `--dictionary`: dictionary folder containing `*-gram.jsonl`.
- `--seed`: direct seed input. If provided, the interactive prompt is skipped.
- `--interactive`: explicitly enable interactive seed input.
- `--prompt-mode {plain,random}`: interactive prompt style.
- `--verbosity {quiet,normal,verbose}`: output detail level.
- `--max-n`: max n-gram order used per step.
- `--max-steps`: max generated steps.
- `--end-chars`: stop characters.
- `--temperature`: character sampling temperature.
- `--top-k`: character top-k filter (`0` means disabled).
- `--top-p`: character top-p nucleus filter.
- `--n-selection-mode {weighted,uniform,fixed,manual}`: n-value sampling strategy.
- `--fixed-n`: fixed n when mode is `fixed`.
- `--n-weights`: manual n weights when mode is `manual`.
- `--n-temperature`: temperature on the n-selection distribution.
- `quiet`: only the final generated text.
- `normal`: basic run config + final generated text.
- `verbose`: per-step structured output including:
  - current context
  - n-choice distribution
  - chosen n
  - top 10 candidate chars and probabilities
  - chosen char
  - current text
Character sampling applies in this order:
- raw conditional probabilities
- `temperature`
- `top-k`
- `top-p`
- renormalize and sample
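That ordering can be sketched as follows, assuming a plain dict of conditional counts for the current context (the function name and details are illustrative, not the package API):

```python
import random

def sample_char(counts, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a char: temperature, then top-k, then top-p, then renormalize."""
    # Raw conditional probabilities from counts.
    total = sum(counts.values())
    probs = {c: v / total for c, v in counts.items()}
    # Temperature reshapes the distribution (<1 sharpens, >1 flattens).
    weights = {c: p ** (1.0 / temperature) for c, p in probs.items()}
    items = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    # top-k: keep only the k most likely candidates (0 means disabled).
    if top_k > 0:
        items = items[:top_k]
    # top-p: keep the smallest prefix whose cumulative mass reaches p.
    mass = sum(w for _, w in items)
    kept, cum = [], 0.0
    for c, w in items:
        kept.append((c, w))
        cum += w
        if cum / mass >= top_p:
            break
    # Renormalize over the survivors and sample.
    chars = [c for c, _ in kept]
    ws = [w for _, w in kept]
    return rng.choices(chars, weights=ws, k=1)[0]
```

With `top_k=1` this degenerates to greedy decoding, which is a handy way to check the filtering steps in isolation.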
For n-value selection:
- the mode chooses the base distribution (`weighted`, `uniform`, `fixed`, `manual`)
- then `n-temperature` adjusts its sharpness
- then n is sampled
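The n-selection step can be sketched like this; the mode names come from the options above, but the weight-building rules (e.g. `weighted` favoring larger n) are assumptions, not the package's actual formulas:

```python
import random

def select_n(available_ns, mode="uniform", fixed_n=None,
             manual_weights=None, n_temperature=1.0, rng=random):
    """Pick an n-gram order: build base weights by mode, temper, then sample."""
    if mode == "fixed":
        return fixed_n
    if mode == "manual":
        weights = {n: manual_weights.get(n, 0.0) for n in available_ns}
    elif mode == "weighted":
        weights = {n: float(n) for n in available_ns}  # assumed: favor longer context
    else:  # uniform
        weights = {n: 1.0 for n in available_ns}
    # n-temperature adjusts the sharpness of the distribution before sampling.
    tempered = {n: w ** (1.0 / n_temperature) for n, w in weights.items() if w > 0}
    ns = list(tempered)
    return rng.choices(ns, weights=[tempered[n] for n in ns], k=1)[0]
```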
- Fixed a line-range reader bug in the legacy JSONL reader script.
- Fixed global variable dependency in legacy JSONL character search printer.
- Removed hard-coded local Windows paths from runnable code.
See CONTRIBUTING.md.