AI Basics: From Neurons to Language Models

🌐 This guide is available in two languages. You're reading English. Click हिंदी / Hindi for the full Hindi version (all 24 chapters translated).

A complete, beginner-to-pro guide to building modern language models from scratch. Written in plain English with PyTorch code you can actually run.

Who this is for

  • Beginners who know a little Python and want to understand how ChatGPT-style models work, end to end.
  • Intermediate folks who can use nn.Linear but want to know why a transformer block looks the way it does.
  • Pros who want refreshers on RoPE math, GQA cache savings, MoE routing, or quantization formats.

Don't let yourself get stuck. If a section feels too dense, skip to the code, run it, and come back. Reading and running is faster than reading alone.

The learning path

The chapters build on each other. If you only have time for a fast tour, read the TL;DR at the top of each file.

| # | Topic | What you walk away with |
|---|-------|-------------------------|
| 00 | Essential Math | Matmul, gradients, chain rule, softmax, cross-entropy — the math you'll see every chapter. |
| 00a | Math for Beginners | The same math as 00, but rebuilt from scratch for high schoolers — verbose, friendly, with a Google Colab walkthrough at the end. Read this first if 00 felt too dense. |
| 01 | Neural Networks | A neural net from scratch in NumPy + PyTorch. Backprop intuition. |
| 02 | PyTorch | Tensors, autograd, modules, training loops, mixed precision. |
| 03 | GPU Computing | Why GPUs are fast, memory hierarchy, kernel fusion, profiling. |
| 04 | Data | FineWeb-Edu, DCLM, MinHash dedup, classifier filtering, sharding. |
| 05 | Model Scale | Scaling laws, Chinchilla, over-training, test-time compute. |
| 06 | Tokenization & Embeddings | BPE, vocab choices, embedding tables, tied heads. |
| 07 | Positional Encodings | Sinusoidal, ALiBi, RoPE (full math + code), YaRN. |
| 08 | Attention Mechanisms | Q/K/V, multi-head, causal masking, Flash Attention 3, FlexAttention. |
| 09 | KV Cache, MQA, GQA | The single biggest inference win, plus MLA, PagedAttention, speculative decoding. |
| 10 | Building Blocks | RMSNorm, SwiGLU, residuals, pre-norm, QK-norm. |
| 11 | Building Qwen from Scratch | Glue everything into a real, working LLM with weights you can load. |
| 12 | Quantization | INT8/INT4/FP8/NVFP4, GPTQ, AWQ, GGUF, BitNet, QLoRA. |
| 13 | Mixture of Experts | Sparse models, routing, DeepSeek-V3-style fine-grained MoE. |
| 14 | Training Small Language Models | Full pretraining + SFT + DPO/GRPO pipeline with Muon and FSDP2. |
| 15 | Reading Training Logs | W&B charts, what each metric means, debugging loss curves. |
| 16 | Frontier Models in 2026 | Gemma 4 (E2B/E4B/26B-A4B/31B Dense) and Qwen 3.6 (27B Dense / 35B-A3B MoE). |
| 17 | Production Inference | Serving 10,000 requests: vLLM/SGLang, FP8, prefix cache, HBM bandwidth, token caching, pricing per archetype. |
| 18 | AI Apps & Agents | The harness above the model: 12 usage patterns, MCP, real-world apps (Cursor, Claude Code, Cline), provider pricing, 300-line agent. |
| 19 | Evaluation | Capability benchmarks, LLM-as-judge, golden sets, agent evals (SWE-bench, OSWorld, GAIA), eval-driven development. |
| 20 | Fine-tuning Recipes | Decision tree, LoRA/QLoRA/DoRA, SFT/DPO/GRPO recipes, end-to-end Qwen 3 fine-tune in 150 lines. |
| 21 | Embeddings & RAG | Embedding models, vector DBs, hybrid search, reranking, chunking, agentic RAG, complete RAG in 80 lines. |
| 22 | Multimodal Models | VLM architectures, SigLIP-2, native multi-resolution, audio (Whisper, Realtime), QLoRA for VLMs, hallucination evals. |
| 23 | Safety & Alignment | Four kinds of safety, prompt injection, content moderation, red-teaming with PyRIT, compliance (EU AI Act, GDPR). |
| 24 | Gemma 4 Deep Dive | Full Gemma 4 reference: vocabulary glossary, architecture (PLE, hybrid attention, p-RoPE), dtype guide, real tok/s benchmarks, multimodal, thinking mode, end-to-end script. |
| 25 | Model Releases & Debugging | The release ladder (major/minor/point/patch), continual vs fresh pretrain, tool stack per role, looking inside the model with hooks, TransformerLens, logit lens, ablation, activation patching, SAEs. |
| 26 | Mechanistic Debugging in Practice | Hands-on: safetensors surgery, circuit tracing step by step, sycophancy fix with steering vectors, SAE feature suppression, breakpoints inside neurons, chart visualizations, real before/after measurements. |
| 27 | Useful, Not Benchmaxxed | The most important chapter. Goodhart's Law, what useful actually means, how to evaluate and train for users not leaderboards, the anti-pattern catalog, and case studies of products that win by ignoring benchmarks. |

How to use this guide

  1. Start with 00-essential-math.md if any of "matrix multiplication / chain rule / softmax / cross-entropy" sound fuzzy. It takes 30 minutes and the rest of the book reads twice as fast afterward.
  2. Read 01-23 in order the first time. 00-03 are foundations. 04-13 build the model. 14-15 train it. 16-17 deploy. 18 wraps it in a product. 19-23 are the operational maturity layer. 24 is a Gemma 4 reference. 25-26 are release engineering and mechanistic debugging. Read 27 before you publish a single benchmark number. It's the chapter on building useful models instead of leaderboard champions — the philosophy that decides whether the rest of the work mattered.
  3. Type the code yourself. Reading code is not the same as writing it. Open a Jupyter notebook and recreate the snippets.
  4. Run on real hardware when possible. A free Colab T4 GPU is enough for most exercises. For training, even a small CPU run teaches you the loop.
  5. Re-read the math once you've coded it. The equations make more sense after you've seen the tensor shapes flow through them (a tiny sketch of that follows this list).
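
Here is a taste of how that looks in practice. This is an illustrative sketch, not a snippet lifted from the chapters: a matmul that produces logits, a softmax, and the cross-entropy loss from chapter 00, with every tensor shape spelled out in comments.

```python
import torch
import torch.nn.functional as F

B, T, D, V = 2, 8, 16, 100            # batch, sequence length, model dim, vocab size

x = torch.randn(B, T, D)              # x: (B, T, D)  hidden states
w = torch.randn(D, V)                 # w: (D, V)     output projection
logits = x @ w                        # logits: (B, T, V)  one score per vocab token
probs = F.softmax(logits, dim=-1)     # probs: (B, T, V)   each row sums to 1

targets = torch.randint(0, V, (B, T)) # targets: (B, T)    next-token ids
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
print(probs[0, 0].sum().item(), loss.item())   # ~1.0, plus a scalar loss
```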

Prerequisites

  • Python 3.10+ comfort (functions, classes, list comprehensions).
  • High-school algebra. We explain matrix multiplication when it first appears.
  • A working PyTorch install: pip install torch numpy (a quick sanity check follows this list).
  • Optional but useful: pip install transformers datasets matplotlib.
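
To confirm the install before you start, something like this minimal check (not tied to any chapter) is enough:

```python
import numpy as np
import torch

print("numpy", np.__version__)
print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())   # False is fine on a CPU-only machine

# Autograd smoke test: the gradient of x**2 at x = 3 should be 6.
x = torch.tensor(3.0, requires_grad=True)
(x ** 2).backward()
print("d(x^2)/dx at x=3:", x.grad.item())
```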

Conventions

  • Simple English first, equations and code second.
  • All code is PyTorch 2.x. We mark CUDA-only sections clearly.
  • Tensor shapes are spelled out in comments: # x: (B, T, D) means batch, time/sequence, dimension.
  • B = batch, T = sequence length, D = model dimension, H = heads, V = vocab size.
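
For example, a shape-comment trail through a multi-head split might read like this (an illustrative sketch using the letters above, not code from a particular chapter):

```python
import torch

B, T, D, H = 2, 8, 64, 4              # batch, sequence length, model dim, heads
head_dim = D // H                      # 16 dims per head

x = torch.randn(B, T, D)              # x: (B, T, D)
q = x.view(B, T, H, head_dim)         # q: (B, T, H, D/H)  split D into H heads
q = q.transpose(1, 2)                 # q: (B, H, T, D/H)  heads act like extra batch dims
scores = q @ q.transpose(-2, -1)      # scores: (B, H, T, T)  attention logits
```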

A note on style

The goal is understanding, not impressive jargon. If you find a sentence here that sounds smart but doesn't help, that's a bug — open the file and rewrite it for the next reader.

Now go to 00-essential-math.md if you need the math refresher, or jump straight to 01-neural-networks.md.