A project-based roadmap for experienced backend developers transitioning into AI engineering. Built with Node.js and the official Google Gen AI SDK — no Python, no tutorials, no theory-first approach.
Each phase produces a real, working project. Concepts compound from phase to phase. By Phase 9 you have a portfolio that demonstrates the full stack of modern AI engineering — from basic LLM calls to MCP servers, persistent agent memory, and production-grade RAG evaluation.
Build first. Understand by doing. No frameworks until they earn their place.
- Every phase = one real project, 2–5 days to build
- Raw SDK calls before frameworks — you see the mechanics, not the abstraction
- Each project is independently runnable and portfolio-ready
- Mistakes and fixes documented — the learning is in the debugging
| # | Project | Core concept | Status |
|---|---|---|---|
| 01 | Smart changelog generator | Prompting, structured output, streaming | ✅ Complete |
| 02 | Code review bot | Prompt chaining, schema-first design | ✅ Complete |
| 03 | Docs Q&A API | Embeddings, vector search, grounding | ✅ Complete |
| 04 | GitHub issue triage agent | ReAct loop, function calling | ✅ Complete |
| 05 | Production AI hardening | Caching, evals, observability, cost | ✅ Complete |
| 06 | Persistent research assistant | Agent memory, contextual retrieval | ✅ Complete |
| 07 | RAG eval harness | LLM-as-judge, RAGAS metrics | ✅ Complete |
| 08 | Fine-tuning comparison | When to fine-tune, ROI, platform limits | ✅ Complete |
| 09 | Custom MCP server | Model Context Protocol, stdio transport | ✅ Complete |
| 10 | Agentic RAG | Agent-driven retrieval, native JSON mode | 🔜 Next |
| 11 | Multi-provider + LangChain | Vercel AI SDK, provider tradeoffs | 🔜 Planned |
| 12 | Local models — Ollama | Open-source, offline, $0 cost | 🔜 Planned |
| 13 | Multi-agent systems | Orchestrator + subagents, parallel execution | 🔜 Planned |
| 14 | Browser agents + long-context | Computer use, 1M token context tradeoffs | 🔜 Planned |
```
genai-roadmap/
├── README.md                  ← you are here
│
├── 01-changelog-gen/          ← Smart changelog generator
│   ├── README.md
│   ├── .env.example
│   ├── client.js
│   ├── utils.js
│   ├── commits.js
│   ├── prompts.js
│   ├── parser.js
│   ├── renderer.js
│   ├── git.js
│   └── index.js
│
├── 02-code-reviewer/          ← Code review bot
│   ├── README.md
│   ├── .env.example
│   ├── client.js
│   ├── utils.js
│   ├── schema.js
│   ├── validator.js
│   ├── prompts.js
│   ├── reviewer.js
│   ├── renderer.js
│   ├── index.js
│   └── samples/
│       ├── good.js
│       └── bad.js
│
├── 03-docs-qa/                ← Docs Q&A with RAG
│   ├── README.md
│   ├── .env.example
│   ├── client.js
│   ├── utils.js
│   ├── db.js
│   ├── chunker.js
│   ├── embedder.js
│   ├── pdf-loader.js
│   ├── ingest.js
│   ├── retriever.js
│   ├── generator.js
│   ├── query.js
│   ├── index.js
│   └── docs/
│       ├── gemini-quickstart.md
│       ├── gemini-embeddings.md
│       └── gemini-models.md
│
├── 04-issue-triage/           ← GitHub triage agent
│   ├── README.md
│   ├── .env.example
│   ├── client.js
│   ├── utils.js
│   ├── github.js
│   ├── tools.js
│   ├── executor.js
│   ├── agent.js
│   └── index.js
│
├── 05-production/             ← Production hardening
│   ├── README.md
│   ├── .env.example
│   ├── client.js
│   ├── utils.js
│   ├── rateLimiter.js
│   ├── logger.js
│   ├── promptRegistry.js
│   ├── fallback.js
│   ├── cache.js
│   ├── tokens.js
│   ├── pipeline.js
│   ├── db.js
│   ├── retriever.js
│   ├── index.js
│   ├── logs/
│   │   └── .gitkeep
│   └── evals/
│       ├── runner.js
│       └── cases.js
│
├── 06-research-assistant/     ← Persistent research assistant
│   ├── README.md
│   ├── .env.example
│   ├── client.js
│   ├── utils.js
│   ├── db.js
│   ├── memory/
│   │   ├── shortTerm.js
│   │   ├── longTerm.js
│   │   └── manager.js
│   ├── rag/
│   │   ├── chunker.js
│   │   ├── embedder.js
│   │   ├── ingest.js
│   │   └── retriever.js
│   ├── agent/
│   │   ├── prompts.js
│   │   └── assistant.js
│   ├── scripts/
│   │   └── create-docs.js
│   ├── docs/
│   └── index.js
│
├── 07-rag-evals/              ← RAG eval harness
│   ├── README.md
│   ├── .env.example
│   ├── client.js
│   ├── utils.js
│   ├── db.js
│   ├── retriever.js
│   ├── generator.js
│   ├── judge.js
│   ├── evalCases.js
│   └── runner.js
│
├── 08-finetuning/             ← Fine-tuning comparison
│   ├── README.md
│   ├── .env.example
│   ├── client.js
│   ├── utils.js
│   ├── data/
│   │   ├── generate-training-data.js
│   │   ├── training.jsonl
│   │   └── validation.jsonl
│   ├── tune.js
│   └── compare.js
│
└── 09-mcp-server/             ← Custom MCP server
    ├── README.md
    ├── .env.example
    ├── client.js
    ├── utils.js
    ├── db.js
    ├── rag/
    │   └── embedder.js
    ├── memory/
    │   └── longTerm.js
    ├── tools/
    │   ├── rag.js
    │   ├── memory.js
    │   └── github.js
    ├── server.js
    └── index.js
```
Each directory is independently runnable. Shared utilities (client.js, utils.js) are duplicated by design — no cross-phase imports, no monorepo tooling required.
These two files appear in every phase. Copy them when starting a new one:
client.js — GoogleGenAI initialisation:

```js
import { GoogleGenAI } from "@google/genai";
import dotenv from "dotenv";

dotenv.config();

export const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
```

utils.js — retry with exponential backoff + jitter:

```js
export async function withRetry(fn, retries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const isLast = attempt === retries - 1;
      if (isLast) throw err;
      const retryable = err?.status === 429 || err?.status === 503;
      if (!retryable) throw err;
      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 1000;
      console.warn(`Attempt ${attempt + 1} failed [${err.status}], retrying in ${Math.round(delay)}ms...`);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

📁 01-changelog-gen/ · Full README
Transforms raw git log output into a structured, categorised changelog using Gemini. Output is both JSON and Markdown.
What you learn: Prompt anatomy, dynamic value injection, defensive output parsing, streaming, retry with backoff.
The critical lesson: LLMs are text-in, text-out. Prompt quality directly determines output quality — what examples you show, what rules you number, what you explicitly forbid.
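A minimal sketch of that prompt anatomy (this is illustrative, not the repo's actual prompts.js): numbered rules, an explicit prohibition, and runtime values injected so the model can't fall back on its training data.

```javascript
// Hypothetical changelog prompt builder: numbered rules + injected runtime values.
// The injected date fixes the "wrong date in changelog" hallucination noted below.
function buildChangelogPrompt(commits, today) {
  return [
    "You are a release-notes generator.",
    "Rules:",
    "1. Produce exactly one changelog entry per commit. Never merge commits.",
    "2. Categorise each entry as Added, Fixed, or Changed.",
    "3. Do NOT invent a date. The release date is " + today + ".",
    "Commits:",
    ...commits.map((c) => `- ${c}`),
    "Respond with JSON only.",
  ].join("\n");
}

const prompt = buildChangelogPrompt(
  ["fix: null check in parser", "feat: add streaming output"],
  new Date().toISOString().slice(0, 10)
);
console.log(prompt);
```

The structure matters more than the wording: rules the model can count, a forbidden behaviour stated outright, and no value left for it to guess.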
```bash
cd 01-changelog-gen && npm install
node index.js
# Output: CHANGELOG.md
```

📁 02-code-reviewer/ · Full README
Runs a thorough code review on any source file — bugs, security vulnerabilities, performance issues, maintainability — with a concrete fix for each.
What you learn: Prompt chaining (2-step pipeline), schema-first design, output schema validation, input validation, few-shot prompting.
The critical lesson: Treat LLM output as untrusted external data. Silent schema drift breaks downstream code without throwing an error.
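A sketch of what "untrusted external data" means in practice, assuming a hypothetical review shape (score plus issue list), not the repo's actual schema.js: every field is checked explicitly so drift throws instead of propagating.

```javascript
// Validate a model response before anything downstream touches it.
// Hypothetical shape: { score: number, issues: [{ description, fix }] }
function parseReview(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new Error("Model did not return valid JSON");
  }
  if (typeof data.score !== "number" || data.score < 0 || data.score > 100) {
    throw new Error("score missing or out of range");
  }
  if (!Array.isArray(data.issues)) throw new Error("issues must be an array");
  for (const issue of data.issues) {
    if (typeof issue.description !== "string" || typeof issue.fix !== "string") {
      throw new Error("issue missing description/fix");
    }
  }
  return data;
}
```

Without the checks, a model that quietly renames `issues` to `problems` produces code that "works" on an empty array forever.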
```bash
cd 02-code-reviewer && npm install
node index.js samples/bad.js   # score: ~15/100, 7 issues
node index.js samples/good.js  # score: ~90/100, minimal issues
```

📁 03-docs-qa/ · Full README

Answers natural language questions grounded strictly in your documents — Gemini embeddings + pgvector + source citations on every answer.
What you learn: Embeddings, pgvector + HNSW index, paragraph-aware chunking, ingestion vs query pipeline separation, grounded generation, similarity thresholds.
The critical lesson: Chunking is the hardest part of RAG. The same embedding model must be used at ingest time and query time — mixing models produces wrong results silently.
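The retrieval core is just cosine similarity plus a threshold. pgvector computes this server-side; a plain-JS version makes the threshold logic concrete (the 0.7 cutoff here is illustrative, not the repo's tuned value):

```javascript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep only chunks above a similarity threshold; below it, answer
// "I don't have info" rather than generate from noise.
function retrieve(queryVec, chunks, threshold = 0.7) {
  return chunks
    .map((c) => ({ ...c, score: cosineSimilarity(queryVec, c.embedding) }))
    .filter((c) => c.score >= threshold)
    .sort((x, y) => y.score - x.score);
}
```

This is also why mixing embedding models fails silently: vectors from two different models still produce a number here, it just isn't a meaningful similarity.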
```bash
cd 03-docs-qa && npm install
node ingest.js
node index.js "How do I use streaming?"
node index.js "What is the capital of France?"  # → "I don't have info..."
```

Requires: PostgreSQL + pgvector (Docker).
📁 04-issue-triage/ · Full README
Triages GitHub issues autonomously — reads issues, finds duplicates, applies labels, posts comments, closes duplicates without human input.
What you learn: ReAct loop, function calling, tool executor pattern, conversation history as memory, iteration cap guardrail, temperature 0 for determinism.
The critical lesson: The model never calls GitHub. It says "I want to call search_issues." You call GitHub. You tell the model what came back. This mechanical separation is the foundation of every agent framework ever built.
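That mechanical separation fits in a few lines. The sketch below mocks the model call; the tool name and `callModel` shape are hypothetical, but the loop structure (model names a tool, you execute it, you report back, repeat under an iteration cap) is the pattern the repo implements:

```javascript
// The executor: a plain map from tool names to functions. The model never
// touches these directly; it only asks for them by name.
const tools = {
  search_issues: (args) => `found duplicate of "${args.query}": #12`,
};

// The ReAct loop. callModel is the LLM (mocked in the usage below).
function runAgent(callModel, goal, maxIterations = 5) {
  const history = [{ role: "user", text: goal }];
  for (let i = 0; i < maxIterations; i++) { // iteration cap guardrail
    const step = callModel(history);
    if (step.type === "final") return step.text;
    // Model requested a tool: WE execute it, then tell the model the result.
    const result = tools[step.tool](step.args);
    history.push({ role: "tool", name: step.tool, text: result });
  }
  throw new Error("Iteration cap reached");
}

// Mocked model: first turn asks for a tool, second turn answers.
let turn = 0;
const mockModel = () =>
  turn++ === 0
    ? { type: "tool", tool: "search_issues", args: { query: "login bug" } }
    : { type: "final", text: "Closed as duplicate of #12" };

console.log(runAgent(mockModel, "triage issue 4"));
```

Everything an agent framework adds (parallel tool calls, streaming, typed schemas) is layering on this loop.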
```bash
cd 04-issue-triage && npm install
node index.js 4  # duplicate detection
node index.js 7  # standard labelling + comment
```

Requires: GitHub fine-grained PAT with Issues read/write.
📁 05-production/ · Full README
Hardens the Phase 3 RAG pipeline for production — semantic caching, automated evals, structured logging, fallback chains, cost tracking, prompt versioning.
What you learn: Semantic caching (Redis + embeddings), eval framework with CI exit codes, JSONL structured logging, Flash → Pro fallback chain, prompt versioning registry, ai.models.countTokens().
The critical lesson: Logging is the first thing to build, not the last. The free tier's 20 RPD cap is a hard wall for pipeline work — enable billing early.
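The semantic cache idea in miniature: a query hits the cache if its embedding is close enough to a previously answered one, not only on exact string match. This in-memory sketch uses a toy letter-frequency embedder so the matching logic is visible; the Phase 05 version uses Redis and real Gemini embeddings, and the 0.95 threshold here is illustrative.

```javascript
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// In-memory stand-in for the Redis-backed cache.
class SemanticCache {
  constructor(embed, threshold = 0.95) {
    this.embed = embed;
    this.threshold = threshold;
    this.entries = [];
  }
  get(query) {
    const v = this.embed(query);
    for (const e of this.entries) {
      if (cosine(v, e.vector) >= this.threshold) return e.answer; // semantic hit
    }
    return null; // miss: caller pays for a real LLM call, then set()s it
  }
  set(query, answer) {
    this.entries.push({ vector: this.embed(query), answer });
  }
}

// Toy embedder: lowercase letter counts. Real code embeds with the same
// model used everywhere else in the pipeline.
const toyEmbed = (text) => {
  const v = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) v[i]++;
  }
  return v;
};

const cache = new SemanticCache(toyEmbed);
cache.set("How do I use streaming?", "Use the streaming API.");
console.log(cache.get("how do I use streaming?")); // same letters, cache hit
```

The linear scan is the toy part; Redis with a vector index does the same comparison at scale.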
```bash
cd 05-production && npm install
node index.js "How do I use streaming?"
node index.js eval
```

Requires: Redis (Docker), PostgreSQL + pgvector, Gemini billing enabled.
📁 06-research-assistant/ · Full README
A CLI research assistant that remembers your preferences, conclusions, and sources across sessions using short-term (in-memory) and long-term (pgvector) memory, with contextual retrieval for 49% fewer retrieval failures.
What you learn: Short-term vs long-term memory architecture, importance-weighted memory recall, recent memory fallback, contextual retrieval (Anthropic's technique).
The critical lesson: Long-term agent memory is RAG applied to the agent's own history. Same embeddings, same pgvector, same cosine similarity — different content.
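A toy version of importance-weighted recall with the recent-memory fallback (field names and the 0.6 threshold are hypothetical; the real store is pgvector): memories are ranked by similarity × importance, and if nothing clears the threshold, as happens with meta-questions like "what did we discuss?", the most recent few are loaded instead.

```javascript
// similarityOf(memory) stands in for the pgvector cosine query.
function recallMemories(memories, similarityOf, { threshold = 0.6, topK = 3 } = {}) {
  const scored = memories
    .map((m) => ({ ...m, score: similarityOf(m) * m.importance })) // weight by importance
    .filter((m) => m.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
  if (scored.length > 0) return scored;
  // Fallback: nothing semantically close, so return the most recent memories.
  return [...memories].sort((a, b) => b.createdAt - a.createdAt).slice(0, topK);
}
```

The fallback is what turns "no memories from past sessions" into a usable answer: recency is a reasonable prior when similarity has nothing to say.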
```bash
cd 06-research-assistant && npm install
node scripts/create-docs.js && node rag/ingest.js
node index.js yourname  # Session 1
node index.js yourname  # Session 2 — picks up memories
```

Requires: PostgreSQL + pgvector with memories table.
📁 07-rag-evals/ · Full README

Automated quality evaluation for the 06 RAG pipeline — four RAGAS-aligned metrics scored by an LLM judge using native JSON mode.
What you learn: LLM-as-judge pattern, faithfulness / relevance / precision / recall metrics, native JSON mode, adversarial grounding tests, parallel metric scoring with Promise.all().
The critical lesson: Low faithfulness = generator hallucinating. Low precision = retriever returning noise. Low recall = docs don't cover the topic. Each metric points to a different fix.
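The parallel scoring step looks roughly like this (`scoreMetric` is mocked here; the real harness calls gemini-2.5-flash-lite in JSON mode for each metric):

```javascript
// Score all four metrics for one eval case concurrently.
async function evaluateCase(testCase, scoreMetric) {
  const metrics = ["faithfulness", "relevance", "precision", "recall"];
  // Promise.all: one slow judge call doesn't serialise the other three.
  const scores = await Promise.all(metrics.map((m) => scoreMetric(m, testCase)));
  const result = Object.fromEntries(metrics.map((m, i) => [m, scores[i]]));
  result.overall = scores.reduce((a, b) => a + b, 0) / scores.length;
  return result;
}

// Usage with a mocked judge.
(async () => {
  const mockJudge = async (metric) => (metric === "recall" ? 0.5 : 0.9);
  console.log(await evaluateCase({ question: "q", answer: "a" }, mockJudge));
})();
```

Keeping the per-metric scores separate, rather than only an aggregate, is what makes the diagnosis above possible.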
```bash
cd 07-rag-evals && npm install
node runner.js  # exits 0 if score ≥ 0.70, exits 1 if below — CI-ready
```

📁 08-finetuning/ · Full README
Head-to-head comparison of zero-shot vs few-shot approaches on changelog generation. Includes training data prep, ROI calculation, and a documented platform limitation.
What you learn: Fine-tuning decision tree, JSONL training data format, token cost delta between approaches, fine-tuning ROI at scale.
Platform note: Gemini Developer API dropped fine-tuning support mid-2025. Zero-shot vs few-shot comparison runs fully; tuning job requires Vertex AI.
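The ROI arithmetic is simple enough to sketch. All numbers below are illustrative placeholders, not measured values from this repo: few-shot pays for its example tokens on every call, while fine-tuning bakes them into the weights.

```javascript
// Monthly prompt-token cost for one approach.
function monthlyCostUSD({ callsPerMonth, promptTokensPerCall, pricePerMTokens }) {
  return (callsPerMonth * promptTokensPerCall * pricePerMTokens) / 1e6;
}

// Hypothetical comparison at 100k calls/month, $0.10 per million tokens.
const fewShot = monthlyCostUSD({
  callsPerMonth: 100_000,
  promptTokensPerCall: 1200, // base prompt + few-shot examples
  pricePerMTokens: 0.1,
});
const tuned = monthlyCostUSD({
  callsPerMonth: 100_000,
  promptTokensPerCall: 300, // examples baked into the weights
  pricePerMTokens: 0.1,
});
console.log({ fewShot, tuned, monthlySavings: fewShot - tuned });
```

At low volume the savings rarely cover the training cost and maintenance burden, which is the point of the decision tree above.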
```bash
cd 08-finetuning && npm install
node data/generate-training-data.js
node compare.js
```

📁 09-mcp-server/ · Full README
Exposes the 06 RAG pipeline, agent memory, and 04 GitHub tools as a standardised MCP server — connectable to Claude Desktop or any MCP client without writing agent code.
What you learn: MCP tools vs resources vs prompts, Zod validation, stdio transport, tool descriptions as prompts, MCP Inspector, Claude Desktop integration.
The critical lesson: Write the MCP server once. Claude Desktop, Cursor, and your own agents all discover and use the same tools automatically.
```bash
cd 09-mcp-server && npm install
npx @modelcontextprotocol/inspector node index.js
# Then add to Claude Desktop config and restart
```

Requires: 06 database, GitHub PAT, Claude Desktop.
📁 10-agentic-rag/ · Coming soon
The agent decides when and how to retrieve — not just at query time. Native JSON schema enforcement, query rewriting, multi-hop retrieval, self-correcting retrieval loops, hybrid search.
📁 11-multi-provider/ · Coming soon
Same code reviewer from 02, rebuilt with three providers (OpenAI, Claude, Gemini) via Vercel AI SDK and LangChain. Measure quality and cost tradeoffs. First intentional use of frameworks.
📁 12-local-models/ · Coming soon
Offline-capable RAG using Ollama + Llama/Mistral. Same 03 pipeline, zero API cost, runs entirely on your machine.
📁 13-multi-agent/ · Coming soon
Orchestrator spawns specialist subagents in parallel. Real coordination, handoffs, shared memory, partial failure handling.
📁 14-browser-agents/ · Coming soon
Browser agent using Playwright. Explores the 1M token context vs RAG tradeoff — when does full-context beat retrieval?
| Tool | Role | Notes |
|---|---|---|
| `@google/genai` | Gemini SDK | Official SDK — replaces deprecated `@google/generative-ai` |
| `gemini-2.5-flash` | Generation | Fast, 1M context, best default |
| `gemini-2.5-flash-lite` | Judge / eval model | $0.10/M tokens — cheapest stable option |
| `gemini-embedding-001` | Embeddings | Replaces deprecated `text-embedding-004` (Jan 2026), 1536 dims |
| PostgreSQL + pgvector | Vector store | HNSW index, cosine similarity |
| Redis | Semantic cache | Phase 05 |
| `@octokit/rest` | GitHub API | Phases 04 + 09 |
| `@modelcontextprotocol/sdk` | MCP server | Phase 09 |
| `zod` | Schema validation | Phase 09 tool parameters |
| No LangChain (Phases 01–09) | — | Raw SDK first — frameworks introduced in Phase 11 |
Every entry below is something that actually broke during this build:
| Issue | Phase | Root cause | Fix |
|---|---|---|---|
| `@google/generative-ai` import fails | 01–04 | Old SDK deprecated | Migrated to `@google/genai` |
| JSON truncated mid-response | 01 | `maxOutputTokens: 2048` too low | Raised to 8192 |
| Wrong date in changelog | 01 | Model hallucinated from training data | Injected `new Date().toISOString()` |
| All bug fixes merged into one entry | 01 | Prompt too vague | Added rule: one entry per commit |
| `text-embedding-004` deprecated | 03 | Model retired Jan 2026 | Migrated to `gemini-embedding-001` |
| HNSW index fails on 3072 dims | 03 | pgvector caps HNSW at 2000 dims | Used `outputDimensionality: 1536` |
| Agent skips `search_issues` on obvious duplicate | 04 | Model reads issue body and reasons correctly | Expected — let it reason |
| Free tier 20 RPD wall | 05 | Google cut free tier 92% Dec 2025 | Enable billing (Tier 1) |
| "retry in 11s" misleading on daily quota error | 05 | 429 = daily cap, not per-minute | Detect `GenerateRequestsPerDayPerProjectPerModel` — wait for midnight or enable billing |
| Agent says "no memories from past sessions" | 06 | Model not emitting `[REMEMBER:]` signals | Made instruction CRITICAL + MUST in system prompt |
| Meta-questions return no memories | 06 | Semantic similarity too low for "what did we discuss?" | Added recent memory fallback — always loads last 3 |
| DBeaver can't display vector column rows | 06 | DBeaver doesn't render `vector` type | Query without embedding column |
| `ai.tunings.create` is not a function | 08 | Tuning not in Gemini Developer API JS SDK | Use REST API or Vertex AI |
| Fine-tuning REST API 400 error | 08 | Gemini Developer API dropped tuning mid-2025 | Concept documented; requires Vertex AI for execution |
```bash
git clone https://github.com/your-username/genai-roadmap
cd genai-roadmap/01-changelog-gen
npm install
cp .env.example .env  # add GEMINI_API_KEY
node index.js
```

Get a free Gemini API key at aistudio.google.com. Enable billing before Phase 05 — the free tier (20 RPD) is exhausted in minutes by pipeline work.
API keys needed across the full roadmap:
| Phase | Provider | Where |
|---|---|---|
| 01–09 | Gemini (billing enabled) | aistudio.google.com |
| 04, 09 | GitHub PAT (fine-grained) | GitHub → Settings → Developer settings |
| 11 | OpenAI | platform.openai.com — $5 min topup |
| 11 | Anthropic | console.anthropic.com — $5 free credits |
| 12 | None (Ollama local) | ollama.ai — free |
Built by a fullstack developer with 8 years of experience in Node.js, Express, Angular, and healthcare systems — transitioning into AI engineering by building, not watching tutorials.
The goal: go from "I've heard of RAG" to "I've built and debugged a production-shaped RAG system with evals, memory, and an MCP server" in under 6 weeks. This repo is the evidence it worked.