This project is the implementation of a light-weight library for monitoring small LLM trainings, supporting inference, for small-scale ablation studies. It also includes an interface to chat with the model, and with models from 🤗 API, locally or remotely.
Explore the docs »
Request Feature »
*This name is quite pompous, I admit it. Any suggestions for a better one are welcomed!
- Table of Contents
- About The Project
- Built With
- Get Started
- Usage
- Development Notes
- References
- TODOs
- License
- Contact
This project is primarily educational*. It implements transformer-based language models from scratch to expose and understand their core mechanisms.
While modern LLMs can generate strong implementations, true understanding comes from building. This repository follows that philosophy: learning through construction and internalization, permitting elaboration. That said, building alone does not guarantee understanding.
"What I cannot create, I do not understand." - Richard Feynman 🐐
This is not a production-ready library. It is a lightweight, transparent playground for training small models, running experiments, ablation studies, and exploring architectural ideas.
Components are adapted from existing work and properly credited. The goal is not to reinvent the wheel, but to understand it well enough to modify and improve it. At least, that is the intention.
*For the non-initiated, there is of course better resources available online and for free. Find some at the references section.
-
(web interface -- not really actively developed; may have some bugs and issues)
-
[![uv-shield]][uv-url] (dependency management and CLI)
This project has been developed and tested with Python 3.12. To manage dependencies, I recommend using uv.
- Clone the repo
git clone git@github.com:art-test-stack/gpt-lab.git
- Install dependencies
If running on Linux with CUDA available, you can install the GPU version of PyTorch by running:
uv sync
uv sync --extra cuda
Note
Make sure to adjust the CUDA version in uv.toml if needed. This extra is only available for Linux systems with compatible NVIDIA GPUs. It permits using flash_attention for faster attention computation.
There is many layers in the library, and many components that can be used and customized.
Some components are intentionally incomplete. Contributors (including automated tools) are encouraged to explore TODOs and propose improvements via pull requests.
- karpathy/nanoGPT by Andrej Karpathy.
- karpathy/nanochat by Andrej Karpathy.
- KellerJordan/modded-nanogpt by Jordan Keller.
- Attention is all you need
- Training Compute-Optimal Large Language Models
- Training language models to follow instructions with human feedback
Note
All of the literature ressources below all participated in some way to the development of the library. I have probably forgotten some, and I apologize for that. If you think some important papers are missing please feel free to add one (or suggest one) via pull request. Although, I tried to categorize the papers as possibly to make it easier to navigate. Some papers may fall to multiple categories, but I tried to make the classification as relevant as possible. Most papers are not directly cited in the code, I will try to add some as much as possible in the future.
| Title | Authors | Journal | Year | DOI | Link | Category | ArxivID |
|---|---|---|---|---|---|---|---|
| dLLM: Simple Diffusion Language Modeling | Zhou et al. | arXiv | 2026 | [10.48550/arXiv.2602.22661] | https://arxiv.org/abs/2602.22661 | Diffusion | 2602.23 |
| Denoising Diffusion Probabilistic Models | Ho et al. | NeurIPS | 2020 | [2006.11239] | https://arxiv.org/abs/2006.11239 | Diffusion | |
| Energy-Based Transformers are Scalable Learners and Thinkers | Gladstone et al. | arXiv | 2025 | [10.48550/arXiv.2507.02092] | https://arxiv.org/abs/2507.02092 | Energy-Based Models | 2507.02 |
| HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly | Yen et al. | arXiv | 2024 | [10.48550/arXiv.2410.02694] | https://arxiv.org/abs/2410.02694 | Energy-Based Models | 2410.03 |
| How to Train Your Energy-Based Models | Song et al. | arXiv | 2021 | [10.48550/arXiv.2101.03288] | https://arxiv.org/abs/2101.03288 | Energy-Based Models | 2101.03 |
| Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One | Grathwohl et al. | arXiv | 2019 | [10.48550/arXiv.1912.03263] | https://arxiv.org/abs/1912.03263 | Energy-Based Models | 1912.03 |
| A tutorial on Energy-Based Learning | LeCun et al. | MIT Press | 2006 | [eb-learning] | https://www.researchgate.net/publication/200744586_A_tutorial_on_energy-based_learning | Energy-Based Models | |
| Building Bridges between Regression, Clustering, and Classification | Stewart et al. | arXiv | 2025 | [2502.02996] | https://arxiv.org/abs/2502.02996 | General Machine Learning | 2502.03 |
| Representation Learning: A Review and New Perspectives | Bengio et al. | arXiv | 2012 | [1206.5538] | https://arxiv.org/abs/1206.5538 | General Machine Learning | 1206.55 |
| StarCoder 2 and The Stack v2: The Next Generation | Lozhkov et al. | arXiv | 2024 | [10.48550/arXiv.2402.19173] | https://arxiv.org/abs/2402.19173 | LLM Datasets | 2402.19 |
| SQUAD: 100,000+ Questions for Machine Comprehension of Text | Rajpurkar et al. | arXiv | 2016 | [10.48550/arXiv.1606.05250] | https://arxiv.org/abs/1606.05250 | LLM Datasets | 1606.05 |
| Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training | Bergsma et al. | arXiv | 2025 | [2505.13738] | https://arxiv.org/abs/2505.13738 | LLM Scaling Laws | 2505.14 |
| Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies | Tao et al. | arXiv | 2024 | [2407.13623] | https://arxiv.org/abs/2407.13623 | LLM Scaling Laws | 2407.14 |
| ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers | Wang and Li | arXiv | 2023 | [2310.02489] | https://arxiv.org/abs/2310.02489 | LLM shared blocks | 2310.02 |
| Block-Recurrent Transformers | Hutchins et al. | arXiv | 2022 | [2203.07852] | https://arxiv.org/abs/2203.07852 | LLM shared blocks | 2203.08 |
| Muon is Scalable for LLM Training | Liu et al. | 2025 | arXiv | [2502.16982] | https://arxiv.org/abs/2502.16982 | LLMs Basics | |
| KIMI K2: OPEN AGENTIC INTELLIGENCE | Kimi Team | arXiv | 2025 | [10.48550/arXiv.2507.20534] | https://arxiv.org/abs/2507.20534 | LLMs Basics | 2507.21 |
| Recursive Language Models | Zhang et al. | arXiv | 2025 | [2512.24601] | https://arxiv.org/abs/2512.24601 | LLMs Basics | 2512.25 |
| Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free | Qiu et al. | arXiv | 2025 | [2505.06708] | https://arxiv.org/abs/2505.06708 | LLMs Basics | 2505.07 |
| How to Train Long-Context Language Models (Effectively) | Gao et al. | arXiv | 2024 | [10.48550/arXiv.2410.02660] | https://arxiv.org/abs/2410.02660 | LLMs Basics | 2410.03 |
| The Zamba2 Suite: Technical Report | Glorion et al. | arXiv | 2024 | [10.48550/arXiv.2411.15242] | https://arxiv.org/abs/2411.15242 | LLMs Basics | 2411.15 |
| Fewer Truncations Improve Language Modeling | Ding et al. | arXiv | 2024 | [2404.10830] | https://arxiv.org/abs/2404.10830 | LLMs Basics | 2404.11 |
| QLoRA: Efficient Finetuning of Quantized LLMs | Dettmers et al. | arXiv | 2023 | [10.48550/arXiv.2305.14314] | https://arxiv.org/abs/ | LLMs Basics | 2305.14 |
| FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Dao | arXiv | 2023 | [10.48550/arXiv.2307.08691] | https://arxiv.org/abs/2307.08691 | LLMs Basics | 2307.09 |
| YaRN: Efficient Context Window Extension of Large Language Models | Peng et al. | arXiv | 2023 | [10.48550/arXiv.2309.00071] | https://arxiv.org/abs/2309.00071 | LLMs Basics | 2309 |
| Effective Long-Context Scaling of Foundation Models | Xiong et al. | arXiv | 2023 | [10.48550/arXiv.2309.16039] | https://arxiv.org/abs/2309.16039 | LLMs Basics | 2309.16 |
| Mistral 7B | Jiang et al. | arXiv | 2023 | [10.48550/arXiv.2310.06825] | https://arxiv.org/abs/2310.06825 | LLMs Basics | 2310.07 |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Dao | NeurIPS | 2023 | [2312.00752] | https://arxiv.org/abs/2312.00752 | LLMs Basics | |
| Training Compute-Optimal Large Language Models | Hoffmann et al. | arXiv | 2022 | [10.48550/arXiv.2203.15556] | https://arxiv.org/abs/2203.15556 | LLMs Basics | 2203.16 |
| PaLM: Scaling Language Modeling with Pathways | Chowdhery et al. | arXiv | 2022 | [10.48550/arXiv.2204.02311] | https://arxiv.org/abs/2204.02311 | LLMs Basics | 2204.02 |
| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Dao et al. | NeurIPS | 2022 | [2205.14135] | https://arxiv.org/abs/2205.14135 | LLMs Basics | |
| Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | Fedus et al. | ICML | 2021 | [2101.03961] | https://arxiv.org/abs/2101.03961 | LLMs Basics | |
| RoFormer: Enhanced Transformer with Rotary Position Embedding | Su et al. | arXiv | 2021 | [10.48550/arXiv.2104.09864] | https://arxiv.org/abs/2104.09864 | LLMs Basics | 2104.1 |
| LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | ICLR | 2021 | [2106.09685] | https://arxiv.org/abs/2106.09685 | LLMs Basics | |
| Knowledge distillation: A good teacher is patient and consistent | Beyer et al. | arXiv | 2021 | [2106.05237] | https://arxiv.org/abs/2106.05237 | LLMs Basics | 2106.05 |
| Language Models are Few-Shot Learners | Brown et al. | arXiv | 2020 | [10.48550/arXiv.2005.14165] | https://arxiv.org/abs/2005.14165 | LLMs Basics | 2005.14 |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | Katharopoulos et al. | arXiv | 2020 | [10.48550/arXiv.2006.16236] | https://arxiv.org/abs/2006.16236 | LLMs Basics | 2006.16 |
| Efficient Transformers: A Survey | Tay et al. | LLMs BasicsarXiv | 2020 | [10.48550/arXiv.2009.06732] | https://arxiv.org/abs/2009.06732 | LLMs Basics | 2009.07 |
| Language models are unsupervised multitask learners | Radford et al. | OpenAI | 2019 | [unsupervised-multitask] | https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf | LLMs Basics | |
| Shampoo: Preconditioned Stochastic Tensor Optimization | Gupta et al. | arXiv | 2018 | [10.48550/arXiv.1802.09568] | https://arxiv.org/abs/1802.09568 | LLMs Basics | 1802.1 |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Devlin et al. | arXiv | 2018 | [10.48550/arXiv.1810.04805] | https://arxiv.org/abs/1810.04805 | LLMs Basics | 1810.05 |
| Attention is all you need | Vaswani et al. | arXiv | 2017 | [10.48550/arXiv.1706.03762] | https://arxiv.org/abs/1706.03762 | LLMs Basics | 1706.04 |
| Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials | Grishina et al. | arXiv | 2025 | [2506.10935] | https://arxiv.org/abs/2506.10935 | ML Optimizers | 2506.11 |
| Fantastic Pretraining Optimizers and Where to Find Them | Wen et al. | arXiv | 2025 | [2509.02046] | https://arxiv.org/abs/2509.02046 | ML Optimizers | 2509.02 |
| Statistical optimal transport | Chewi et al. | arXiv | 2024 | [2407.18163] | https://arxiv.org/abs/2407.18163 | Optimal Transport | 2407.18 |
| Tokenization Is More Than Compression | nan | nan | [nan] | https://aclanthology.org/2024.emnlp-main.40.pdf | Tokenization | ||
| How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models | nan | nan | [nan] | https://aclanthology.org/2021.acl-long.243.pdf | Tokenization | ||
| Observational Scaling Laws and the Predictability of Language Model Performance | Ruan et al. | arXiv | 2024 | [10.48550/arXiv.2405.10938] | https://arxiv.org/abs/2405.10938 | alignment | 2405.11 |
| Deep learning | LeCun et al. | Y LeCun, Y Bengio, G Hinton - nature, 2015 - nature.com | 20 | [nan] | https://www.nature.com/articles/nature14539 | deep learning | |
| Deep learning | Bengio et al. | Y Bengio, I Goodfellow, A Courville - 2017 - academia.edu | 20 | [nan] | https://www.academia.edu/download/62266271/Deep_Learning20200303-80130-1s42zvt.pdf | deep learning | |
| Attention Residuals | Kimi Team | arXiv | 2026 | [10.48550/arXiv.2603.15031] | https://arxiv.org/abs/2603.15031 | llm basics | 2603.15 |
| SlimPajama-DC: Understanding Data Combinations for LLM Training | Shen et al. | arXiv | 2023 | [2309.10818] | https://arxiv.org/abs/2309.10818 | llm datasets | 2309.11 |
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | DeepSeek-AI et al. | Nature volume 645, pages 633-638 (2025) | 2025 | [10.1038/s41586-025-09422-z] | https://arxiv.org/abs/2501.12948 | llms basics |
Bibliography made with art-test-stack/MyBible.
There are a lot of Youtube videos that explain well the different components of the library, and how to implement them. Here are some of them that I found useful:
- Andrej Karpathy's YouTube channel for his unmatched expertise in the field, and his ability to explain complex concepts in a simple and intuitive way. His videos on Transformers and LLMs are particularly useful for understanding the architecture and training of these models.
- Stanfords CME295 course for the very nice lecture on Transformers and LLMs by Afshine and Shervine Amidi. CME296 is also a very promising (currently ongoing) course on diffusion & LVMs.
Here a non-exhaustive list of features that I aim to implement. Stars correspond to the priority level. Contributions are welcome!
- Tokenization ⭐️
- BPE implementation in Python
- Rust implementation
- Architecture ⭐️⭐️
- Alibi
- MoE
- Mixture of Depths
- Optimization ⭐️⭐️
- Shampoo optimizer
- LION optimizer
- MARS optimizer
- Precision ⭐️⭐️
- model and optimizer quantization
- Training ⭐️⭐️⭐️
- fine-tuning / intruction tuning
- grpo
- Cross-lib features ⭐️⭐️⭐️
- HuggingFace integration (model loading, tokenizers, etc.)
- vLLM, DeepSpeed, Megatron-LM, etc. integration
Distributed under the MIT License. See LICENSE.txt for more information.
Arthur Testard - arthur.testard.pro@gmail.com
Project Link: https://github.com/art-test-stack/gpt-lab
If you use this work in your research, please consider citing the following:
@misc{gpt_lib_2026,
author={Testard, Arthur},
title={gpt-lab: A light-weight library for fast-ablation studies on GPT-like LMs},
year={2026},
url={https://github.com/art-test-stack/gpt-lab}
}