AI Basics: From Neurons to Language Models

🌐 This guide is available in two languages. You're reading English. Click हिंदी / Hindi for the full Hindi version (all 24 chapters translated).

A complete, beginner-to-pro guide to building modern language models from scratch. Written in plain English with PyTorch code you can actually run.

Who this is for

  • Beginners who know a little Python and want to understand how ChatGPT-style models work, end to end.
  • Intermediate folks who can use nn.Linear but want to know why a transformer block looks the way it does.
  • Pros who want refreshers on RoPE math, GQA cache savings, MoE routing, or quantization formats.

Don't let yourself get stuck. If a section feels too dense, skip to the code, run it, and come back. Reading and running is faster than reading alone.

The learning path

The chapters build on each other. If you only have time for a fast tour, read the TL;DR at the top of each file.

| # | Topic | What you walk away with |
|---|-------|-------------------------|
| 00 | Essential Math | Matmul, gradients, chain rule, softmax, cross-entropy — the math you'll see every chapter. |
| 00a | Math for Beginners | The same math as 00, but rebuilt from scratch for high schoolers — verbose, friendly, with a Google Colab walkthrough at the end. Read this first if 00 felt too dense. |
| 01 | Neural Networks | A neural net from scratch in NumPy + PyTorch. Backprop intuition. |
| 02 | PyTorch | Tensors, autograd, modules, training loops, mixed precision. |
| 03 | GPU Computing | Why GPUs are fast, memory hierarchy, kernel fusion, profiling. |
| 04 | Data | FineWeb-Edu, DCLM, MinHash dedup, classifier filtering, sharding. |
| 05 | Model Scale | Scaling laws, Chinchilla, over-training, test-time compute. |
| 06 | Tokenization & Embeddings | BPE, vocab choices, embedding tables, tied heads. |
| 07 | Positional Encodings | Sinusoidal, ALiBi, RoPE (full math + code), YaRN. |
| 08 | Attention Mechanisms | Q/K/V, multi-head, causal masking, Flash Attention 3, FlexAttention. |
| 09 | KV Cache, MQA, GQA | The single biggest inference win, plus MLA, PagedAttention, speculative decoding. |
| 10 | Building Blocks | RMSNorm, SwiGLU, residuals, pre-norm, QK-norm. |
| 11 | Building Qwen from Scratch | Glue everything into a real, working LLM with weights you can load. |
| 12 | Quantization | INT8/INT4/FP8/NVFP4, GPTQ, AWQ, GGUF, BitNet, QLoRA. |
| 13 | Mixture of Experts | Sparse models, routing, DeepSeek-V3-style fine-grained MoE. |
| 14 | Training Small Language Models | Full pretraining + SFT + DPO/GRPO pipeline with Muon and FSDP2. |
| 15 | Reading Training Logs | W&B charts, what each metric means, debugging loss curves. |
| 16 | Frontier Models in 2026 | Gemma 4 (E2B/E4B/26B-A4B/31B Dense) and Qwen 3.6 (27B Dense / 35B-A3B MoE). |
| 17 | Production Inference | Serving 10,000 requests: vLLM/SGLang, FP8, prefix cache, HBM bandwidth, token caching, pricing per archetype. |
| 18 | AI Apps & Agents | The harness above the model: 12 usage patterns, MCP, real-world apps (Cursor, Claude Code, Cline), provider pricing, 300-line agent. |
| 19 | Evaluation | Capability benchmarks, LLM-as-judge, golden sets, agent evals (SWE-bench, OSWorld, GAIA), eval-driven development. |
| 20 | Fine-tuning Recipes | Decision tree, LoRA/QLoRA/DoRA, SFT/DPO/GRPO recipes, end-to-end Qwen 3 fine-tune in 150 lines. |
| 21 | Embeddings & RAG | Embedding models, vector DBs, hybrid search, reranking, chunking, agentic RAG, complete RAG in 80 lines. |
| 22 | Multimodal Models | VLM architectures, SigLIP-2, native multi-resolution, audio (Whisper, Realtime), QLoRA for VLMs, hallucination evals. |
| 23 | Safety & Alignment | Four kinds of safety, prompt injection, content moderation, red-teaming with PyRIT, compliance (EU AI Act, GDPR). |
| 24 | Gemma 4 Deep Dive | Full Gemma 4 reference: vocabulary glossary, architecture (PLE, hybrid attention, p-RoPE), dtype guide, real tok/s benchmarks, multimodal, thinking mode, end-to-end script. |
| 25 | Model Releases & Debugging | The release ladder (major/minor/point/patch), continual vs fresh pretrain, tool stack per role, looking inside the model with hooks, TransformerLens, logit lens, ablation, activation patching, SAEs. |
| 26 | Mechanistic Debugging in Practice | Hands-on: safetensors surgery, circuit tracing step by step, sycophancy fix with steering vectors, SAE feature suppression, breakpoints inside neurons, chart visualizations, real before/after measurements. |
| 27 | Useful, Not Benchmaxxed | The most important chapter. Goodhart's Law, what useful actually means, how to evaluate and train for users not leaderboards, the anti-pattern catalog, and case studies of products that win by ignoring benchmarks. |

How to use this guide

  1. Start with 00-essential-math.md if any of "matrix multiplication / chain rule / softmax / cross-entropy" sound fuzzy. It takes 30 minutes and the rest of the book reads twice as fast afterward.
  2. Read 01-23 in order the first time. 00-03 are foundations. 04-13 build the model. 14-15 train it. 16-17 deploy. 18 wraps it in a product. 19-23 are the operational maturity layer. 24 is a Gemma 4 reference. 25-26 are release engineering and mechanistic debugging. Read 27 before you publish a single benchmark number. It's the chapter on building useful models instead of leaderboard champions — the philosophy that decides whether the rest of the work mattered.
  3. Type the code yourself. Reading code is not the same as writing it. Open a Jupyter notebook and recreate the snippets.
  4. Run on real hardware when possible. A free Colab T4 GPU is enough for most exercises. For training, even a small CPU run teaches you the loop.
  5. Re-read the math once you've coded it. The equations make more sense after you've seen the tensor shapes flow through them (a tiny sketch of that follows this list).
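
Here is a taste of how that looks in practice. This is an illustrative sketch, not a snippet lifted from the chapters: a matmul that produces logits, a softmax, and the cross-entropy loss from chapter 00, with every tensor shape spelled out in comments.

```python
import torch
import torch.nn.functional as F

B, T, D, V = 2, 8, 16, 100            # batch, sequence length, model dim, vocab size

x = torch.randn(B, T, D)              # x: (B, T, D)  hidden states
w = torch.randn(D, V)                 # w: (D, V)     output projection
logits = x @ w                        # logits: (B, T, V)  one score per vocab token
probs = F.softmax(logits, dim=-1)     # probs: (B, T, V)   each row sums to 1

targets = torch.randint(0, V, (B, T)) # targets: (B, T)    next-token ids
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
print(probs[0, 0].sum().item(), loss.item())   # ~1.0, plus a scalar loss
```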

Prerequisites

  • Python 3.10+ comfort (functions, classes, list comprehensions).
  • High-school algebra. We explain matrix multiplication when it first appears.
  • A working PyTorch install: pip install torch numpy (a quick sanity check follows this list).
  • Optional but useful: pip install transformers datasets matplotlib.
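
To confirm the install before you start, something like this minimal check (not tied to any chapter) is enough:

```python
import numpy as np
import torch

print("numpy", np.__version__)
print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())   # False is fine on a CPU-only machine

# Autograd smoke test: the gradient of x**2 at x = 3 should be 6.
x = torch.tensor(3.0, requires_grad=True)
(x ** 2).backward()
print("d(x^2)/dx at x=3:", x.grad.item())
```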

Conventions

  • Simple English first, equations and code second.
  • All code is PyTorch 2.x. We mark CUDA-only sections clearly.
  • Tensor shapes are spelled out in comments: # x: (B, T, D) means batch, time/sequence, dimension.
  • B = batch, T = sequence length, D = model dimension, H = heads, V = vocab size.
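
For example, a shape-comment trail through a multi-head split might read like this (an illustrative sketch using the letters above, not code from a particular chapter):

```python
import torch

B, T, D, H = 2, 8, 64, 4              # batch, sequence length, model dim, heads
head_dim = D // H                      # 16 dims per head

x = torch.randn(B, T, D)              # x: (B, T, D)
q = x.view(B, T, H, head_dim)         # q: (B, T, H, D/H)  split D into H heads
q = q.transpose(1, 2)                 # q: (B, H, T, D/H)  heads act like extra batch dims
scores = q @ q.transpose(-2, -1)      # scores: (B, H, T, T)  attention logits
```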

A note on style

The goal is understanding, not impressive jargon. If you find a sentence here that sounds smart but doesn't help, that's a bug — open the file and rewrite it for the next reader.

Now go to 00-essential-math.md if you need the math refresher, or jump straight to 01-neural-networks.md.