art-test-stack/gpt-lab


Generative Pre-trained Transformer Lab*

This project implements a lightweight library for monitoring small LLM training runs, running inference, and conducting small-scale ablation studies. It also includes an interface for chatting with the model, or with models from the 🤗 API, locally or remotely.
Explore the docs »

Request Feature »

*This name is quite pompous, I admit. Suggestions for a better one are welcome!


About The Project

This project is primarily educational*. It implements transformer-based language models from scratch to expose and understand their core mechanisms.

While modern LLMs can generate strong implementations, true understanding comes from building. This repository follows that philosophy: learn by constructing and internalizing each component before elaborating on it. That said, building alone does not guarantee understanding.

"What I cannot create, I do not understand." - Richard Feynman 🐐

This is not a production-ready library. It is a lightweight, transparent playground for training small models, running experiments and ablation studies, and exploring architectural ideas.

Components are adapted from existing work and properly credited. The goal is not to reinvent the wheel, but to understand it well enough to modify and improve it. At least, that is the intention.

*For the non-initiated, there are of course better resources available online and for free. Find some in the References section.

Built With

  • PyTorch <3 🐐 (sorry, JAX lovers)

  • Hugging Face (datasets, transformers, tokenizers, hub)

  • wandb (training monitoring)

  • tiktoken (very fast tokenizer encoder)

  • Gradio (web interface -- not actively developed; may have some bugs and issues)

  • uv (dependency management and CLI)

Get Started

This project has been developed and tested with Python 3.12. To manage dependencies, I recommend using uv.

  1. Clone the repo
    git clone git@github.com:art-test-stack/gpt-lab.git
  2. Install dependencies
     uv sync
    If running on Linux with CUDA available, you can install the GPU version of PyTorch by running:
    uv sync --extra cuda

Note

Make sure to adjust the CUDA version in uv.toml if needed. This extra is only available on Linux systems with compatible NVIDIA GPUs; it enables flash attention for faster attention computation.
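To sanity-check the install, a minimal snippet (not part of the library) can confirm which PyTorch build was installed and whether CUDA is visible:

```python
import torch

# Report the installed PyTorch build and whether CUDA is visible.
print(torch.__version__)
print(torch.cuda.is_available())

# On supported NVIDIA GPUs, scaled_dot_product_attention can dispatch to
# the FlashAttention kernel; on CPU it falls back to a math implementation.
q = k = v = torch.randn(1, 4, 8, 16)  # (batch, heads, seq, head_dim)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # same shape as the inputs: (1, 4, 8, 16)
```

If `torch.cuda.is_available()` prints `False` after `uv sync --extra cuda`, double-check the CUDA version pinned in uv.toml against your driver.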

Usage

There are many layers in the library, and many components that can be used and customized.
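To give a flavor of the kind of component the library builds from scratch, here is a self-contained single-head causal self-attention block in plain PyTorch. The class and parameter names are illustrative only, not the library's actual API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal single-head causal self-attention (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        # Causal mask: each position attends only to itself and the past.
        seq = x.size(1)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        att = att.masked_fill(mask, float("-inf"))
        return self.proj(F.softmax(att, dim=-1) @ v)

x = torch.randn(2, 5, 16)
out = CausalSelfAttention(16)(x)
print(out.shape)  # torch.Size([2, 5, 16])
```

In practice the library's modules add multi-head projection, dropout, and fused attention kernels on top of this core mechanism.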

Development Notes

Some components are intentionally incomplete. Contributors (including automated tools) are encouraged to explore TODOs and propose improvements via pull requests.

References

Nice repositories to check out for inspiration and reference

  1. karpathy/nanoGPT by Andrej Karpathy.
  2. karpathy/nanochat by Andrej Karpathy.
  3. KellerJordan/modded-nanogpt by Keller Jordan.

Some nice blogs and articles

  1. Building a text generation model from scratch by Vincent Bons

Some bibliography

  1. Attention is all you need
  2. Training Compute-Optimal Large Language Models
  3. Training language models to follow instructions with human feedback

Note

All of the literature resources below contributed in some way to the development of the library. I have probably forgotten some, and I apologize for that. If you think important papers are missing, feel free to add or suggest one via pull request. I have tried to categorize the papers to make them easier to navigate; some fall into multiple categories, but I kept the classification as relevant as possible. Most papers are not directly cited in the code; I will try to add citations as much as possible in the future.

Title Authors Journal Year DOI Link Category ArxivID
dLLM: Simple Diffusion Language Modeling Zhou et al. arXiv 2026 [10.48550/arXiv.2602.22661] https://arxiv.org/abs/2602.22661 Diffusion 2602.23
Denoising Diffusion Probabilistic Models Ho et al. NeurIPS 2020 [2006.11239] https://arxiv.org/abs/2006.11239 Diffusion
Energy-Based Transformers are Scalable Learners and Thinkers Gladstone et al. arXiv 2025 [10.48550/arXiv.2507.02092] https://arxiv.org/abs/2507.02092 Energy-Based Models 2507.02
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly Yen et al. arXiv 2024 [10.48550/arXiv.2410.02694] https://arxiv.org/abs/2410.02694 Energy-Based Models 2410.03
How to Train Your Energy-Based Models Song et al. arXiv 2021 [10.48550/arXiv.2101.03288] https://arxiv.org/abs/2101.03288 Energy-Based Models 2101.03
Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One Grathwohl et al. arXiv 2019 [10.48550/arXiv.1912.03263] https://arxiv.org/abs/1912.03263 Energy-Based Models 1912.03
A tutorial on Energy-Based Learning LeCun et al. MIT Press 2006 [eb-learning] https://www.researchgate.net/publication/200744586_A_tutorial_on_energy-based_learning Energy-Based Models
Building Bridges between Regression, Clustering, and Classification Stewart et al. arXiv 2025 [2502.02996] https://arxiv.org/abs/2502.02996 General Machine Learning 2502.03
Representation Learning: A Review and New Perspectives Bengio et al. arXiv 2012 [1206.5538] https://arxiv.org/abs/1206.5538 General Machine Learning 1206.55
StarCoder 2 and The Stack v2: The Next Generation Lozhkov et al. arXiv 2024 [10.48550/arXiv.2402.19173] https://arxiv.org/abs/2402.19173 LLM Datasets 2402.19
SQUAD: 100,000+ Questions for Machine Comprehension of Text Rajpurkar et al. arXiv 2016 [10.48550/arXiv.1606.05250] https://arxiv.org/abs/1606.05250 LLM Datasets 1606.05
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training Bergsma et al. arXiv 2025 [2505.13738] https://arxiv.org/abs/2505.13738 LLM Scaling Laws 2505.14
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies Tao et al. arXiv 2024 [2407.13623] https://arxiv.org/abs/2407.13623 LLM Scaling Laws 2407.14
ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers Wang and Li arXiv 2023 [2310.02489] https://arxiv.org/abs/2310.02489 LLM shared blocks 2310.02
Block-Recurrent Transformers Hutchins et al. arXiv 2022 [2203.07852] https://arxiv.org/abs/2203.07852 LLM shared blocks 2203.08
Muon is Scalable for LLM Training Liu et al. arXiv 2025 [2502.16982] https://arxiv.org/abs/2502.16982 LLMs Basics
KIMI K2: OPEN AGENTIC INTELLIGENCE Kimi Team arXiv 2025 [10.48550/arXiv.2507.20534] https://arxiv.org/abs/2507.20534 LLMs Basics 2507.21
Recursive Language Models Zhang et al. arXiv 2025 [2512.24601] https://arxiv.org/abs/2512.24601 LLMs Basics 2512.25
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free Qiu et al. arXiv 2025 [2505.06708] https://arxiv.org/abs/2505.06708 LLMs Basics 2505.07
How to Train Long-Context Language Models (Effectively) Gao et al. arXiv 2024 [10.48550/arXiv.2410.02660] https://arxiv.org/abs/2410.02660 LLMs Basics 2410.03
The Zamba2 Suite: Technical Report Glorion et al. arXiv 2024 [10.48550/arXiv.2411.15242] https://arxiv.org/abs/2411.15242 LLMs Basics 2411.15
Fewer Truncations Improve Language Modeling Ding et al. arXiv 2024 [2404.10830] https://arxiv.org/abs/2404.10830 LLMs Basics 2404.11
QLoRA: Efficient Finetuning of Quantized LLMs Dettmers et al. arXiv 2023 [10.48550/arXiv.2305.14314] https://arxiv.org/abs/2305.14314 LLMs Basics 2305.14
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning Dao arXiv 2023 [10.48550/arXiv.2307.08691] https://arxiv.org/abs/2307.08691 LLMs Basics 2307.09
YaRN: Efficient Context Window Extension of Large Language Models Peng et al. arXiv 2023 [10.48550/arXiv.2309.00071] https://arxiv.org/abs/2309.00071 LLMs Basics 2309
Effective Long-Context Scaling of Foundation Models Xiong et al. arXiv 2023 [10.48550/arXiv.2309.16039] https://arxiv.org/abs/2309.16039 LLMs Basics 2309.16
Mistral 7B Jiang et al. arXiv 2023 [10.48550/arXiv.2310.06825] https://arxiv.org/abs/2310.06825 LLMs Basics 2310.07
Mamba: Linear-Time Sequence Modeling with Selective State Spaces Gu and Dao arXiv 2023 [2312.00752] https://arxiv.org/abs/2312.00752 LLMs Basics
Training Compute-Optimal Large Language Models Hoffmann et al. arXiv 2022 [10.48550/arXiv.2203.15556] https://arxiv.org/abs/2203.15556 LLMs Basics 2203.16
PaLM: Scaling Language Modeling with Pathways Chowdhery et al. arXiv 2022 [10.48550/arXiv.2204.02311] https://arxiv.org/abs/2204.02311 LLMs Basics 2204.02
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness Dao et al. NeurIPS 2022 [2205.14135] https://arxiv.org/abs/2205.14135 LLMs Basics
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity Fedus et al. ICML 2021 [2101.03961] https://arxiv.org/abs/2101.03961 LLMs Basics
RoFormer: Enhanced Transformer with Rotary Position Embedding Su et al. arXiv 2021 [10.48550/arXiv.2104.09864] https://arxiv.org/abs/2104.09864 LLMs Basics 2104.1
LoRA: Low-Rank Adaptation of Large Language Models Hu et al. ICLR 2021 [2106.09685] https://arxiv.org/abs/2106.09685 LLMs Basics
Knowledge distillation: A good teacher is patient and consistent Beyer et al. arXiv 2021 [2106.05237] https://arxiv.org/abs/2106.05237 LLMs Basics 2106.05
Language Models are Few-Shot Learners Brown et al. arXiv 2020 [10.48550/arXiv.2005.14165] https://arxiv.org/abs/2005.14165 LLMs Basics 2005.14
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention Katharopoulos et al. arXiv 2020 [10.48550/arXiv.2006.16236] https://arxiv.org/abs/2006.16236 LLMs Basics 2006.16
Efficient Transformers: A Survey Tay et al. arXiv 2020 [10.48550/arXiv.2009.06732] https://arxiv.org/abs/2009.06732 LLMs Basics 2009.07
Language models are unsupervised multitask learners Radford et al. OpenAI 2019 [unsupervised-multitask] https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf LLMs Basics
Shampoo: Preconditioned Stochastic Tensor Optimization Gupta et al. arXiv 2018 [10.48550/arXiv.1802.09568] https://arxiv.org/abs/1802.09568 LLMs Basics 1802.1
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Devlin et al. arXiv 2018 [10.48550/arXiv.1810.04805] https://arxiv.org/abs/1810.04805 LLMs Basics 1810.05
Attention is all you need Vaswani et al. arXiv 2017 [10.48550/arXiv.1706.03762] https://arxiv.org/abs/1706.03762 LLMs Basics 1706.04
Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials Grishina et al. arXiv 2025 [2506.10935] https://arxiv.org/abs/2506.10935 ML Optimizers 2506.11
Fantastic Pretraining Optimizers and Where to Find Them Wen et al. arXiv 2025 [2509.02046] https://arxiv.org/abs/2509.02046 ML Optimizers 2509.02
Statistical optimal transport Chewi et al. arXiv 2024 [2407.18163] https://arxiv.org/abs/2407.18163 Optimal Transport 2407.18
Tokenization Is More Than Compression EMNLP 2024 https://aclanthology.org/2024.emnlp-main.40.pdf Tokenization
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models ACL 2021 https://aclanthology.org/2021.acl-long.243.pdf Tokenization
Observational Scaling Laws and the Predictability of Language Model Performance Ruan et al. arXiv 2024 [10.48550/arXiv.2405.10938] https://arxiv.org/abs/2405.10938 alignment 2405.11
Deep learning LeCun et al. Nature 2015 [10.1038/nature14539] https://www.nature.com/articles/nature14539 Deep Learning
Deep Learning Goodfellow et al. MIT Press 2016 https://www.academia.edu/download/62266271/Deep_Learning20200303-80130-1s42zvt.pdf Deep Learning
Attention Residuals Kimi Team arXiv 2026 [10.48550/arXiv.2603.15031] https://arxiv.org/abs/2603.15031 LLMs Basics 2603.15
SlimPajama-DC: Understanding Data Combinations for LLM Training Shen et al. arXiv 2023 [2309.10818] https://arxiv.org/abs/2309.10818 LLM Datasets 2309.11
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning DeepSeek-AI et al. Nature 2025 [10.1038/s41586-025-09422-z] https://arxiv.org/abs/2501.12948 LLMs Basics

Bibliography made with art-test-stack/MyBible.

For the lazy ones

There are a lot of YouTube videos that explain the different components of the library well, and how to implement them. Here are some I found useful:

  1. Andrej Karpathy's YouTube channel for his unmatched expertise in the field, and his ability to explain complex concepts in a simple and intuitive way. His videos on Transformers and LLMs are particularly useful for understanding the architecture and training of these models.
  2. Stanford's CME295 course for the very nice lectures on Transformers and LLMs by Afshine and Shervine Amidi. CME296 is also a very promising (currently ongoing) course on diffusion & LVMs.

TODOs

Here is a non-exhaustive list of features I aim to implement. Stars indicate the priority level. Contributions are welcome!

  • Tokenization ⭐️
    • BPE implementation in Python
    • Rust implementation
  • Architecture ⭐️⭐️
    • Alibi
    • MoE
    • Mixture of Depths
  • Optimization ⭐️⭐️
    • Shampoo optimizer
    • LION optimizer
    • MARS optimizer
  • Precision ⭐️⭐️
    • model and optimizer quantization
  • Training ⭐️⭐️⭐️
    • fine-tuning / instruction tuning
    • GRPO
  • Cross-lib features ⭐️⭐️⭐️
    • HuggingFace integration (model loading, tokenizers, etc.)
    • vLLM, DeepSpeed, Megatron-LM, etc. integration
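As a starting point for the "BPE implementation in Python" item above, here is a minimal, illustrative sketch of one BPE training step (find the most frequent adjacent token pair, then merge it into a new token). Function names here are my own, not the library's:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most frequent adjacent token pair in a sequence."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"aaabdaaabac")          # raw bytes as initial token ids
pair = most_frequent_pair(ids)      # (97, 97), i.e. "aa"
ids = merge(ids, pair, 256)         # first new token id after the 256 bytes
print(ids)  # [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

A full BPE trainer repeats this loop until the target vocabulary size is reached; the Rust item in the list targets the same algorithm with a faster merge loop.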

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Arthur Testard - arthur.testard.pro@gmail.com

Project Link: https://github.com/art-test-stack/gpt-lab

Citation

If you use this work in your research, please consider citing the following:

@misc{gpt_lib_2026,
  author={Testard, Arthur},
  title={gpt-lab: A light-weight library for fast-ablation studies on GPT-like LMs},
  year={2026},
  url={https://github.com/art-test-stack/gpt-lab}
}

(back to top)
