Generative Pre-trained Transformer Lab*

This project implements a lightweight library for monitoring the training of small LLMs, running inference, and conducting small-scale ablation studies. It also includes an interface to chat with the trained model and with models from the 🤗 API, locally or remotely.
Explore the docs »

Request Feature »

*This name is quite pompous, I admit it. Suggestions for a better one are welcome!

About The Project

This project is primarily educational*. It implements transformer-based language models from scratch to expose and understand their core mechanisms.

While modern LLMs can generate strong implementations, true understanding comes from building. This repository follows that philosophy: learning by constructing and internalizing each component, which in turn makes it possible to build on it. That said, building alone does not guarantee understanding.

"What I cannot create, I do not understand." - Richard Feynman 🐐

This is not a production-ready library. It is a lightweight, transparent playground for training small models, running experiments, ablation studies, and exploring architectural ideas.
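To make "ablation studies" concrete, here is a minimal sketch of how a grid of runs can be enumerated from a base configuration. It is not gpt-lab's API: the function `ablation_grid` and the hyperparameter names (`n_layer`, `n_head`, `lr`) are hypothetical and chosen only for illustration.

```python
from itertools import product


def ablation_grid(base: dict, sweeps: dict) -> list[dict]:
    """Expand `sweeps` into one config per combination, layered over `base`."""
    keys = list(sweeps)
    runs = []
    for values in product(*(sweeps[k] for k in keys)):
        # Swept values override the matching keys of the base config.
        runs.append({**base, **dict(zip(keys, values))})
    return runs


# Hypothetical hyperparameter names, for illustration only.
base = {"n_layer": 4, "n_head": 4, "lr": 3e-4}
runs = ablation_grid(base, {"n_layer": [2, 4], "lr": [1e-4, 3e-4, 1e-3]})
print(len(runs), "runs to train and compare")
```

Each resulting dictionary describes one small training run; comparing their metrics side by side is what the ablation amounts to.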

Components are adapted from existing work and properly credited. The goal is not to reinvent the wheel, but to understand it well enough to modify and improve it. At least, that is the intention.

*For the uninitiated, there are of course better resources available online for free. Find some in the references section.

Built With

PyTorch <3 🐐 (sorry JAX lovers)
Hugging Face 🤗 (datasets, transformers, tokenizer, hub)
(training monitoring)
(very fast tokenizer encoder)
(web interface -- not really actively developed; may have some bugs and issues)
uv (dependency management and CLI)

Get Started

This project has been developed and tested with Python 3.12. To manage dependencies, I recommend using uv.

Clone the repo

git clone git@github.com:art-test-stack/gpt-lab.git

Install dependencies
```
uv sync
```
If running on Linux with CUDA available, you can install the GPU version of PyTorch by running:
```
uv sync --extra cuda
```

Note

Make sure to adjust the CUDA version in uv.toml if needed. This extra is only available on Linux systems with compatible NVIDIA GPUs. It enables flash_attention for faster attention computation.
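After syncing, it can help to check what is actually importable in the environment. The snippet below is a minimal sketch that uses only the standard library. The module name `flash_attn` is an assumption (it is the import name of the flash-attn package); gpt-lab may wire flash attention differently.

```python
import importlib.util


def probe_environment(modules=("torch", "flash_attn")):
    """Report which optional modules are installed, without importing them."""
    # find_spec returns None when a top-level module cannot be found.
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in probe_environment().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

If `flash_attn` is reported missing on a CUDA machine, re-running `uv sync --extra cuda` is the first thing to try.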

Usage

There are many layers in the library, and many components that can be used and customized.

Development Notes

Some components are intentionally incomplete. Contributors (including automated tools) are encouraged to explore TODOs and propose improvements via pull requests.

References

Nice repositories to check out for inspiration and reference

karpathy/nanoGPT by Andrej Karpathy.
karpathy/nanochat by Andrej Karpathy.
KellerJordan/modded-nanogpt by Keller Jordan.

Some nice blogs and articles

Building a text generation model from scratch by Vincent Bons

Some bibliography

Note

All of the literature resources below contributed in some way to the development of the library. I have probably forgotten some, and I apologize for that. If you think an important paper is missing, please feel free to add it (or suggest it) via pull request. I tried to categorize the papers as well as possible to make the list easier to navigate. Some papers may fall into multiple categories, but I tried to make the classification as relevant as possible. Most papers are not directly cited in the code; I will try to add citations as much as possible in the future.

| Title | Authors | Journal | Year | DOI | Link | Category | ArxivID |
| --- | --- | --- | --- | --- | --- | --- | --- |
| dLLM: Simple Diffusion Language Modeling | Zhou et al. | arXiv | 2026 | [10.48550/arXiv.2602.22661] | https://arxiv.org/abs/2602.22661 | Diffusion | 2602.22661 |
| Denoising Diffusion Probabilistic Models | Ho et al. | NeurIPS | 2020 | [2006.11239] | https://arxiv.org/abs/2006.11239 | Diffusion | |
| Energy-Based Transformers are Scalable Learners and Thinkers | Gladstone et al. | arXiv | 2025 | [10.48550/arXiv.2507.02092] | https://arxiv.org/abs/2507.02092 | Energy-Based Models | 2507.02092 |
| HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly | Yen et al. | arXiv | 2024 | [10.48550/arXiv.2410.02694] | https://arxiv.org/abs/2410.02694 | Energy-Based Models | 2410.02694 |
| How to Train Your Energy-Based Models | Song et al. | arXiv | 2021 | [10.48550/arXiv.2101.03288] | https://arxiv.org/abs/2101.03288 | Energy-Based Models | 2101.03288 |
| Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One | Grathwohl et al. | arXiv | 2019 | [10.48550/arXiv.1912.03263] | https://arxiv.org/abs/1912.03263 | Energy-Based Models | 1912.03263 |
| A tutorial on Energy-Based Learning | LeCun et al. | MIT Press | 2006 | [eb-learning] | https://www.researchgate.net/publication/200744586_A_tutorial_on_energy-based_learning | Energy-Based Models | |
| Building Bridges between Regression, Clustering, and Classification | Stewart et al. | arXiv | 2025 | [2502.02996] | https://arxiv.org/abs/2502.02996 | General Machine Learning | 2502.02996 |
| Representation Learning: A Review and New Perspectives | Bengio et al. | arXiv | 2012 | [1206.5538] | https://arxiv.org/abs/1206.5538 | General Machine Learning | 1206.5538 |
| StarCoder 2 and The Stack v2: The Next Generation | Lozhkov et al. | arXiv | 2024 | [10.48550/arXiv.2402.19173] | https://arxiv.org/abs/2402.19173 | LLM Datasets | 2402.19173 |
| SQuAD: 100,000+ Questions for Machine Comprehension of Text | Rajpurkar et al. | arXiv | 2016 | [10.48550/arXiv.1606.05250] | https://arxiv.org/abs/1606.05250 | LLM Datasets | 1606.05250 |
| Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training | Bergsma et al. | arXiv | 2025 | [2505.13738] | https://arxiv.org/abs/2505.13738 | LLM Scaling Laws | 2505.13738 |
| Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies | Tao et al. | arXiv | 2024 | [2407.13623] | https://arxiv.org/abs/2407.13623 | LLM Scaling Laws | 2407.13623 |
| ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers | Wang and Li | arXiv | 2023 | [2310.02489] | https://arxiv.org/abs/2310.02489 | LLM shared blocks | 2310.02489 |
| Block-Recurrent Transformers | Hutchins et al. | arXiv | 2022 | [2203.07852] | https://arxiv.org/abs/2203.07852 | LLM shared blocks | 2203.07852 |
| Muon is Scalable for LLM Training | Liu et al. | arXiv | 2025 | [2502.16982] | https://arxiv.org/abs/2502.16982 | LLMs Basics | |
| Kimi K2: Open Agentic Intelligence | Kimi Team | arXiv | 2025 | [10.48550/arXiv.2507.20534] | https://arxiv.org/abs/2507.20534 | LLMs Basics | 2507.20534 |
| Recursive Language Models | Zhang et al. | arXiv | 2025 | [2512.24601] | https://arxiv.org/abs/2512.24601 | LLMs Basics | 2512.24601 |
| Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free | Qiu et al. | arXiv | 2025 | [2505.06708] | https://arxiv.org/abs/2505.06708 | LLMs Basics | 2505.06708 |
| How to Train Long-Context Language Models (Effectively) | Gao et al. | arXiv | 2024 | [10.48550/arXiv.2410.02660] | https://arxiv.org/abs/2410.02660 | LLMs Basics | 2410.02660 |
| The Zamba2 Suite: Technical Report | Glorioso et al. | arXiv | 2024 | [10.48550/arXiv.2411.15242] | https://arxiv.org/abs/2411.15242 | LLMs Basics | 2411.15242 |
| Fewer Truncations Improve Language Modeling | Ding et al. | arXiv | 2024 | [2404.10830] | https://arxiv.org/abs/2404.10830 | LLMs Basics | 2404.10830 |
| QLoRA: Efficient Finetuning of Quantized LLMs | Dettmers et al. | arXiv | 2023 | [10.48550/arXiv.2305.14314] | https://arxiv.org/abs/ | LLMs Basics | 2305.14314 |
| FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Dao | arXiv | 2023 | [10.48550/arXiv.2307.08691] | https://arxiv.org/abs/2307.08691 | LLMs Basics | 2307.08691 |
| YaRN: Efficient Context Window Extension of Large Language Models | Peng et al. | arXiv | 2023 | [10.48550/arXiv.2309.00071] | https://arxiv.org/abs/2309.00071 | LLMs Basics | 2309.00071 |
| Effective Long-Context Scaling of Foundation Models | Xiong et al. | arXiv | 2023 | [10.48550/arXiv.2309.16039] | https://arxiv.org/abs/2309.16039 | LLMs Basics | 2309.16039 |
| Mistral 7B | Jiang et al. | arXiv | 2023 | [10.48550/arXiv.2310.06825] | https://arxiv.org/abs/2310.06825 | LLMs Basics | 2310.06825 |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Gu and Dao | NeurIPS | 2023 | [2312.00752] | https://arxiv.org/abs/2312.00752 | LLMs Basics | |
| Training Compute-Optimal Large Language Models | Hoffmann et al. | arXiv | 2022 | [10.48550/arXiv.2203.15556] | https://arxiv.org/abs/2203.15556 | LLMs Basics | 2203.15556 |
| PaLM: Scaling Language Modeling with Pathways | Chowdhery et al. | arXiv | 2022 | [10.48550/arXiv.2204.02311] | https://arxiv.org/abs/2204.02311 | LLMs Basics | 2204.02311 |
| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Dao et al. | NeurIPS | 2022 | [2205.14135] | https://arxiv.org/abs/2205.14135 | LLMs Basics | |
| Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | Fedus et al. | ICML | 2021 | [2101.03961] | https://arxiv.org/abs/2101.03961 | LLMs Basics | |
| RoFormer: Enhanced Transformer with Rotary Position Embedding | Su et al. | arXiv | 2021 | [10.48550/arXiv.2104.09864] | https://arxiv.org/abs/2104.09864 | LLMs Basics | 2104.09864 |
| LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | ICLR | 2021 | [2106.09685] | https://arxiv.org/abs/2106.09685 | LLMs Basics | |
| Knowledge distillation: A good teacher is patient and consistent | Beyer et al. | arXiv | 2021 | [2106.05237] | https://arxiv.org/abs/2106.05237 | LLMs Basics | 2106.05237 |
| Language Models are Few-Shot Learners | Brown et al. | arXiv | 2020 | [10.48550/arXiv.2005.14165] | https://arxiv.org/abs/2005.14165 | LLMs Basics | 2005.14165 |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | Katharopoulos et al. | arXiv | 2020 | [10.48550/arXiv.2006.16236] | https://arxiv.org/abs/2006.16236 | LLMs Basics | 2006.16236 |
| Efficient Transformers: A Survey | Tay et al. | arXiv | 2020 | [10.48550/arXiv.2009.06732] | https://arxiv.org/abs/2009.06732 | LLMs Basics | 2009.06732 |
| Language models are unsupervised multitask learners | Radford et al. | OpenAI | 2019 | [unsupervised-multitask] | https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf | LLMs Basics | |
| Shampoo: Preconditioned Stochastic Tensor Optimization | Gupta et al. | arXiv | 2018 | [10.48550/arXiv.1802.09568] | https://arxiv.org/abs/1802.09568 | LLMs Basics | 1802.09568 |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Devlin et al. | arXiv | 2018 | [10.48550/arXiv.1810.04805] | https://arxiv.org/abs/1810.04805 | LLMs Basics | 1810.04805 |
| Attention is all you need | Vaswani et al. | arXiv | 2017 | [10.48550/arXiv.1706.03762] | https://arxiv.org/abs/1706.03762 | LLMs Basics | 1706.03762 |
| Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials | Grishina et al. | arXiv | 2025 | [2506.10935] | https://arxiv.org/abs/2506.10935 | ML Optimizers | 2506.10935 |
| Fantastic Pretraining Optimizers and Where to Find Them | Wen et al. | arXiv | 2025 | [2509.02046] | https://arxiv.org/abs/2509.02046 | ML Optimizers | 2509.02046 |
| Statistical optimal transport | Chewi et al. | arXiv | 2024 | [2407.18163] | https://arxiv.org/abs/2407.18163 | Optimal Transport | 2407.18163 |
| Tokenization Is More Than Compression | Schmidt et al. | EMNLP | 2024 | | https://aclanthology.org/2024.emnlp-main.40.pdf | Tokenization | |
| How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models | Rust et al. | ACL | 2021 | | https://aclanthology.org/2021.acl-long.243.pdf | Tokenization | |
| Observational Scaling Laws and the Predictability of Language Model Performance | Ruan et al. | arXiv | 2024 | [10.48550/arXiv.2405.10938] | https://arxiv.org/abs/2405.10938 | Alignment | 2405.10938 |
| Deep learning | LeCun et al. | Nature | 2015 | | https://www.nature.com/articles/nature14539 | Deep Learning | |
| Deep learning | Bengio et al. | | 2017 | | https://www.academia.edu/download/62266271/Deep_Learning20200303-80130-1s42zvt.pdf | Deep Learning | |
| Attention Residuals | Kimi Team | arXiv | 2026 | [10.48550/arXiv.2603.15031] | https://arxiv.org/abs/2603.15031 | LLMs Basics | 2603.15031 |
| SlimPajama-DC: Understanding Data Combinations for LLM Training | Shen et al. | arXiv | 2023 | [2309.10818] | https://arxiv.org/abs/2309.10818 | LLM Datasets | 2309.10818 |
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | DeepSeek-AI et al. | Nature 645, 633-638 (2025) | 2025 | [10.1038/s41586-025-09422-z] | https://arxiv.org/abs/2501.12948 | LLMs Basics | |

Bibliography made with art-test-stack/MyBible.

For the lazy ones

Many YouTube videos explain the components of the library and how to implement them. Here are some that I found useful:

Andrej Karpathy's YouTube channel, for his expertise in the field and his ability to explain complex concepts in a simple and intuitive way. His videos on Transformers and LLMs are particularly useful for understanding the architecture and training of these models.
Stanford's CME295 course, for the very nice lectures on Transformers and LLMs by Afshine and Shervine Amidi. CME296 is also a very promising (currently ongoing) course on diffusion & LVMs.

TODOs

Here is a non-exhaustive list of features that I aim to implement. Stars indicate the priority level. Contributions are welcome!

Tokenization ⭐️
- BPE implementation in Python
- Rust implementation
Architecture ⭐️⭐️
- ALiBi
- MoE
- Mixture of Depths
Optimization ⭐️⭐️
- Shampoo optimizer
- LION optimizer
- MARS optimizer
Precision ⭐️⭐️
- model and optimizer quantization
Training ⭐️⭐️⭐️
- fine-tuning / instruction tuning
- GRPO
Cross-lib features ⭐️⭐️⭐️
- HuggingFace integration (model loading, tokenizers, etc.)
- vLLM, DeepSpeed, Megatron-LM, etc. integration
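For contributors looking at the tokenization item, the sketch below shows the core of byte-level BPE training in plain Python. It is an illustration of the general algorithm, not the code in this repository: the function names `train_bpe`, `apply_merge`, and `encode` are hypothetical, and it omits pre-tokenization, special tokens, and any performance work.

```python
from collections import Counter


def apply_merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> list[tuple[tuple[int, int], int]]:
    """Learn up to `num_merges` merges; returns ((left, right), new_id) in order."""
    ids = list(text.encode("utf-8"))  # start from raw bytes: ids 0..255
    merges = []
    while len(merges) < num_merges:
        pairs = Counter(zip(ids, ids[1:]))  # count adjacent id pairs
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not compress anything
        new_id = 256 + len(merges)  # new ids come right after the byte range
        merges.append((pair, new_id))
        ids = apply_merge(ids, pair, new_id)
    return merges


def encode(text: str, merges: list[tuple[tuple[int, int], int]]) -> list[int]:
    """Tokenize `text` by replaying the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = apply_merge(ids, pair, new_id)
    return ids


merges = train_bpe("aaabdaaabac", num_merges=3)
print(encode("aaabdaaabac", merges))
```

Recounting all pairs after every merge is what makes this quadratic-ish loop slow on real corpora, which is the usual motivation for a Rust implementation.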

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Arthur Testard - arthur.testard.pro@gmail.com

Project Link: https://github.com/art-test-stack/gpt-lab

Citation

If you use this work in your research, please consider citing the following:

@misc{gpt_lib_2026,
  author={Testard, Arthur},
  title={gpt-lab: A light-weight library for fast-ablation studies on GPT-like LMs},
  year={2026},
  url={https://github.com/art-test-stack/gpt-lab}
}
