A production-grade Retrieval-Augmented Generation system with multimodal support for PDFs containing text, images, tables, and charts.
Built with Docling • Gemini VLM • Groq LLMs • Qdrant Vector DB • LangGraph
Overview • Features • Architecture • Installation • Usage • API
- Overview
- Key Features
- System Architecture
- Agentic Workflow
- Tech Stack
- Installation
- Configuration
- Usage
- API Reference
- Project Structure
- Anti-Hallucination Safeguards
- Performance & Limitations
- License
Waldo is a state-of-the-art Agentic RAG (Retrieval-Augmented Generation) pipeline designed to ingest, understand, and intelligently query complex PDF documents containing:
- 📄 Text content - Paragraphs, headings, lists
- 📊 Tables - Structured data with rows and columns
- 📈 Charts & Graphs - Bar charts, line graphs, pie charts
- 🖼️ Figures & Diagrams - Medical images, technical diagrams, illustrations
| Traditional RAG | Waldo Multimodal RAG |
|---|---|
| Text extraction only | Text, tables, AND figures |
| Static retrieval | Agentic self-correction with retries |
| Single-pass answers | Query rewriting for better results |
| No visual understanding | Gemini VLM transcription for images |
| Basic keyword search | Semantic + metadata hybrid search |
| May hallucinate | Strict anti-hallucination guardrails |
**Document Ingestion**

| Feature | Description |
|---|---|
| PDF Parsing | IBM's Docling library for high-fidelity document structure extraction |
| Figure Extraction | Identifies and extracts individual figures with bounding boxes |
| Table Detection | Preserves table structure and exports to Markdown |
| OCR Support | RapidOCR for text extraction from scanned documents |
| Chunking | LangChain's RecursiveCharacterTextSplitter (1000 chars, 200 overlap) |
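The chunking step can be sketched in plain Python. This illustrative splitter mimics the pipeline's settings (1000-character chunks, 200-character overlap); the real code uses LangChain's RecursiveCharacterTextSplitter, which additionally prefers paragraph and sentence boundaries:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive sliding-window splitter mirroring the 1000/200 settings.

    Illustrative only: RecursiveCharacterTextSplitter also tries to cut
    on "\n\n", "\n", and " " before falling back to a hard character cut.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The 200-character overlap means a sentence cut at a chunk boundary still appears whole in the next chunk, which keeps retrieval from losing boundary context.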
**Multimodal Understanding**

| Feature | Description |
|---|---|
| Gemini VLM | Transcribes figures, diagrams, and charts to searchable semantic text |
| Rich Metadata | Captures figure numbers, captions, and surrounding context |
| Shadow Text | Creates semantic representations for visual content |
| Fallback System | Uses Docling captions when Gemini quota is exhausted |
**Agentic Workflow**

| Feature | Description |
|---|---|
| Adaptive Routing | Decides between direct response vs. retrieval |
| Document Grading | LLM-based relevance scoring for each document |
| Visual Query Detection | Automatically includes figures when semantically appropriate |
| Query Rewriting | Transforms failed queries for better retrieval (max 2 retries) |
| Anti-Hallucination | Refuses to answer out-of-scope questions |
**Vector Search**

| Feature | Description |
|---|---|
| Qdrant | High-performance vector similarity search (in-memory) |
| Sentence Transformers | all-MiniLM-L6-v2 embeddings (384 dimensions) |
| TOP-K Retrieval | Fetches 10 most similar documents |
| Metadata Filtering | Filter by element_type, page_number, etc. |
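Conceptually, retrieval is cosine similarity over embeddings plus an optional payload filter. The sketch below does this in pure Python (Qdrant performs the same operation server-side against an index); the point/payload shape mirrors the collection fields, but the `search` helper itself is an illustrative stand-in, not the project's API:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, points, top_k=10, element_type=None):
    """Top-k similarity search with an optional element_type filter,
    mimicking a filtered Qdrant query over the pdf_documents collection."""
    if element_type is not None:
        points = [p for p in points if p["payload"]["element_type"] == element_type]
    ranked = sorted(points, key=lambda p: cosine(query_vec, p["vector"]), reverse=True)
    return ranked[:top_k]
```

Filtering by `element_type` before ranking is what lets a visual query restrict results to `"figure"` or `"table"` chunks.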
**Frontend**

| Feature | Description |
|---|---|
| React + Vite | Fast, responsive chat interface |
| Real-time Progress | Live ingestion status with polling |
| Inline Images | Figures and tables displayed directly in chat |
| TailwindCSS | Modern, clean styling |
```
┌──────────────────────────────────────────────────────────────────────┐
│                        FRONTEND (React + Vite)                       │
│                                                                      │
│   Upload        Chat          Reset        Inline Images             │
│   Component     Interface     Button       (Figures + Tables)        │
└──────┬──────────────┬──────────────┬─────────────────────────────────┘
       │ POST /ingest │ POST /chat   │ DELETE /reset
       ▼              ▼              ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          BACKEND (FastAPI)                           │
│                                                                      │
│  API Layer (main.py):                                                │
│  POST /ingest   POST /chat   DELETE /reset   GET /ingestion-status   │
│                                                                      │
│  IngestScript                    GraphBrain (LangGraph)              │
│  ┌──────────────────┐            ┌─────────────────────────────────┐ │
│  │ PDF Parser       │            │ Router ─▶ Retrieve ─▶ Grade     │ │
│  │  (Docling)       │            │                 (LLM Relevance) │ │
│  │      │           │            │                      │          │ │
│  │      ▼           │            │ Rewrite ◀─ Decide ─▶ Generate   │ │
│  │ Gemini           │            │ Query                (Groq LLM) │ │
│  │ Transcribe (VLM) │            └─────────────────────────────────┘ │
│  │      │           │                                                │
│  │      ▼           │                                                │
│  │ Embed & Store ───┼───────────────┐                                │
│  └──────────────────┘               ▼                                │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                      VectorStore (Qdrant)                      │  │
│  │  Collection: pdf_documents                                     │  │
│  │  Fields:                                                       │  │
│  │    • shadow_text (str)         - Searchable content            │  │
│  │    • element_type (str)        - "text", "figure", "table"     │  │
│  │    • page_number (int)         - Source page                   │  │
│  │    • original_image_path (str) - Path to extracted image       │  │
│  │    • source_pdf (str)          - Original PDF filename         │  │
│  │    • heading (str)             - Figure/Table number           │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
       │
   ┌───┴───────────────┬───────────────────┐
   ▼                   ▼                   ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│   Groq API    │ │  Gemini API   │ │    Qdrant     │
│  (LLM Chat)   │ │     (VLM)     │ │  (In-Memory)  │
│               │ │               │ │               │
│  llama-3.3-   │ │  gemini-2.0-  │ │  Collection:  │
│ 70b-versatile │ │     flash     │ │ pdf_documents │
└───────────────┘ └───────────────┘ └───────────────┘
```
The system follows a cognitive architecture with self-correction capabilities:
```
                      ┌───────────────┐
                      │  User Query   │
                      └───────┬───────┘
                              ▼
                      ┌───────────────┐
                      │    Router     │
                      │ RAG or Direct?│
                      └───────┬───────┘
          ┌───────────────────┴───────────────────┐
          ▼                                       ▼
  ┌───────────────┐                       ┌───────────────┐
  │   Greeting    │                       │   Retrieval   │
  │ Direct Answer │                       │ Top-10 Chunks │
  └───────┬───────┘                       └───────┬───────┘
          │                                       ▼
          │                               ┌───────────────┐
          │                               │     Grade     │
          │                               │ Each Document │
          │                               │  (LLM-based)  │
          │                               └───────┬───────┘
          │                       ┌───────────────┴───────────────┐
          │                Has relevant?                   No relevant?
          │                       ▼                               ▼
          │               ┌───────────────┐               ┌───────────────┐
          │               │   Generate    │               │ Rewrite Query │
          │               │Grounded Answer│               │ (Max 2 tries) │
          │               └───────┬───────┘               └───────┬───────┘
          │                       │               ┌───────────────┴───────────────┐
          │                       │        Retry available?                Max retries?
          │                       │               ▼                               ▼
          │                       │       ┌───────────────┐               ┌───────────────┐
          │                       │       │    Back to    │               │    Refuse     │
          │                       │       │   Retrieval   │               │ "I don't have │
          │                       │       └───────────────┘               │  information" │
          │                       │                                       └───────┬───────┘
          └───────────────────────┴───────────────────┬───────────────────────────┘
                                                      ▼
                                              ┌───────────────┐
                                              │   Response    │
                                              │+ Inline Image │
                                              └───────────────┘
```
| Step | Description |
|---|---|
| 1. Router | Classifies query as greeting (skip retrieval) or RAG query (retrieve documents) |
| 2. Retrieve | Semantic search in Qdrant, returns top-10 most similar chunks |
| 3. Grade | LLM evaluates each document's relevance (strict for figures, lenient for text) |
| 4. Decide | If relevant docs found → Generate; if not → Rewrite or Refuse |
| 5. Rewrite | Transforms query for better retrieval (e.g., "Fig 1" → "Figure 1 diagram") |
| 6. Generate | Synthesizes answer from relevant documents only, refuses if no context |
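The six steps above can be sketched as a plain-Python control loop. This is a simplified stand-in for the LangGraph state machine, and `route`, `retrieve`, `grade`, `rewrite`, and `generate` are hypothetical callables representing the graph nodes:

```python
def run_pipeline(query, route, retrieve, grade, rewrite, generate, max_retries=2):
    """Sketch of the agentic loop: route, retrieve, grade, then either
    generate a grounded answer, rewrite and retry, or refuse."""
    if route(query) == "direct":          # greetings skip retrieval entirely
        return generate(query, [])
    for attempt in range(max_retries + 1):
        docs = retrieve(query)            # top-10 semantic search
        relevant = [d for d in docs if grade(query, d)]
        if relevant:
            return generate(query, relevant)
        if attempt < max_retries:
            query = rewrite(query)        # e.g. "Fig 1" -> "Figure 1 diagram"
    # max retries exhausted with no relevant context: refuse
    return "I don't have information about that in the uploaded document."
```

Bounding the loop at two rewrites guarantees termination while still giving a vague query a second and third chance at retrieval.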
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 18 + Vite | Modern, fast UI framework |
| Styling | TailwindCSS | Utility-first CSS |
| Backend | FastAPI | Async Python web framework |
| PDF Processing | IBM Docling | Document structure extraction |
| OCR | RapidOCR | Text recognition for images |
| Vision LLM | Gemini 2.0 Flash | Figure/chart transcription |
| Chat LLM | Groq (Llama 3.3 70B) | Fast inference, answer generation |
| Embeddings | Sentence Transformers | all-MiniLM-L6-v2 (384 dims) |
| Vector DB | Qdrant | In-memory vector storage |
| Orchestration | LangGraph | Agentic state machine |
| Text Splitting | LangChain | RecursiveCharacterTextSplitter |
| Requirement | Version |
|---|---|
| Python | 3.11+ |
| Node.js | 18+ |
| GPU | CUDA-capable (optional, for faster inference) |
```bash
git clone https://github.com/Vasu-DevS/Waldo.git
cd Waldo

# Create virtual environment
python -m venv .venv

# Activate (Windows)
.\.venv\Scripts\activate
# Activate (Linux/Mac)
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

```bash
cd frontend
npm install
cd ..
```

```bash
# Copy example config
cp .env.example .env
# Edit .env with your API keys
```

Create a `.env` file in the project root:

```env
# Required API Keys
GROQ_API_KEY=gsk_your_groq_api_key_here
GOOGLE_API_KEY=your_google_gemini_api_key_here

# Optional: Qdrant (defaults to in-memory)
QDRANT_HOST=localhost
QDRANT_PORT=6333
```

| Service | How to Get Key |
|---|---|
| Groq | Sign up at console.groq.com → API Keys |
| Gemini | Sign up at aistudio.google.com → Get API Key |
| Service | Free Tier Limit |
|---|---|
| Groq | 30 requests/min, 6000 tokens/min |
| Gemini | 15 requests/min, 1,500 requests/day |
```bash
# From the project root, start the backend
.\.venv\Scripts\python -m uvicorn backend.main:app --port 8000 --host 0.0.0.0
```

```bash
# In a second terminal, start the frontend
cd frontend
npm run dev
```

Open http://localhost:5173 in your browser.
- 📤 Upload PDF - Drag and drop or click to upload
- ⏳ Wait for Ingestion - Progress indicator shows status (~30s for text, ~2-5min with figures)
- 💬 Ask Questions - Chat naturally about the document
- 🖼️ View Results - Relevant figures/tables appear inline
**`POST /ingest`**

Upload and process a PDF document.
Request:

```bash
curl -X POST http://localhost:8000/ingest \
  -F "file=@document.pdf" \
  -F "filename=document.pdf"
```

Response:

```json
{
  "status": "processing",
  "filename": "document.pdf",
  "message": "Ingestion started in background"
}
```

**`GET /ingestion-status`**

Check ingestion progress.
Response:
```json
{
  "status": "complete",
  "filename": "document.pdf",
  "stats": {
    "total_elements": 15,
    "text_chunks": 10,
    "tables": 2,
    "figures": 3,
    "stored": 16
  }
}
```

**`POST /chat`**

Send a query and get a response.
Request:
```json
{
  "message": "What are the key findings in Table 1?"
}
```

Response:
```json
{
  "response": "### Key Findings\n\n* **Revenue increased** by 25% in Q4\n* **Customer satisfaction** rated 4.5/5\n...",
  "relevant_documents": [
    {
      "id": "abc123",
      "element_type": "table",
      "page_number": 3,
      "original_image_path": "E:/output/table_1_page_3.png",
      "shadow_text": "Table 1 | Q4 Financial Results..."
    }
  ]
}
```

**`DELETE /reset`**

Clear the vector database for a new document.
Response:
```json
{
  "status": "success",
  "message": "Knowledge base reset complete"
}
```

**`GET /images/{filename}`**

Serve extracted figure/table images.
Example:
http://localhost:8000/images/figure_1_page_1.png
```
Waldo/
├── backend/
│   ├── main.py                        # FastAPI app & endpoints
│   ├── __init__.py
│   ├── IngestScript/
│   │   ├── ingest.py                  # Main ingestion orchestrator
│   │   ├── __init__.py
│   │   └── services/
│   │       ├── pdf_parser.py          # Docling PDF extraction
│   │       ├── gemini_transcriber.py  # VLM transcription + verification
│   │       └── vector_store.py        # Qdrant operations + embeddings
│   └── GraphBrain/
│       ├── graph.py                   # LangGraph RAG pipeline
│       └── __init__.py
├── frontend/
│   ├── src/
│   │   ├── App.jsx                    # Main React component
│   │   ├── main.jsx                   # Entry point
│   │   └── index.css                  # TailwindCSS styles
│   ├── index.html
│   ├── package.json
│   ├── vite.config.js
│   └── tailwind.config.js
├── output/                            # Extracted images (gitignored)
├── TestingDATA/                       # Sample PDFs for testing
├── .env.example                       # Environment template
├── .gitignore
├── requirements.txt
├── README.md
└── SYSTEM_CARD.md                     # Detailed system documentation
```
Waldo implements multiple layers to prevent hallucination:
```python
if len(relevant_documents) == 0:
    return "I don't have information about that in the uploaded document."
```

The system prompt explicitly forbids external knowledge:
```
CRITICAL RULE - NO HALLUCINATION:
* You can ONLY answer based on the document fragments provided.
* If the question asks about something NOT in the documents, refuse.
* NEVER make up information. NEVER use external knowledge.
```
```python
temperature=0.3  # Reduced from 0.7 for factual responses
```

Each retrieved document is evaluated by an LLM before use:
```python
prompt = f"Is this document relevant to: {query}? Answer 'yes' or 'no'."
```

Only the most relevant figure is shown per response to avoid information overload.
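The "one figure per answer" rule can be sketched as a small selection function. The document dicts follow the payload fields described earlier, but the `score` key and the function itself are illustrative assumptions, not the project's actual code:

```python
def pick_inline_figure(relevant_documents):
    """Return at most one figure to render inline: the highest-scoring
    figure among the graded-relevant documents, or None.

    Assumption: each document dict carries an `element_type` field and
    an optional similarity `score`.
    """
    figures = [d for d in relevant_documents if d.get("element_type") == "figure"]
    if not figures:
        return None
    return max(figures, key=lambda d: d.get("score", 0.0))
```

Selecting after grading (rather than before) means a figure is only ever shown when the LLM has already judged it relevant to the query.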
| Metric | Value |
|---|---|
| Text Ingestion | ~5-10 seconds per PDF |
| Figure Transcription | ~65 seconds per figure (rate limited) |
| Query Response | 1-3 seconds |
| Embedding Speed | ~30 docs/sec on GPU |
| Limitation | Workaround |
|---|---|
| Gemini Rate Limits | 65s delay between calls, fallback to Docling captions |
| Single Document | Reset required for new document |
| In-Memory Storage | Data lost on restart |
| Table OCR | Some complex tables may not extract perfectly |
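The rate-limit workaround above (65-second spacing between Gemini calls, Docling captions as fallback) can be sketched as follows. `call_gemini` and `caption_of` are hypothetical stand-ins for the transcriber's internals, and the quota error is modeled here as a plain `RuntimeError`:

```python
import time

def transcribe_with_fallback(figure, call_gemini, caption_of,
                             delay_s=65, sleep=time.sleep):
    """Sketch of the rate-limit strategy: transcribe a figure via the VLM,
    pause before the next call, and fall back to the Docling caption when
    the quota is exhausted. All callables are illustrative assumptions."""
    try:
        text = call_gemini(figure)
    except RuntimeError:          # stand-in for a quota/429 error
        return caption_of(figure)  # fallback: Docling-extracted caption
    sleep(delay_s)                 # stay under the free-tier rate limit
    return text
```

Injecting `sleep` as a parameter keeps the sketch testable without actually waiting 65 seconds per figure.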
- Multi-document support with source attribution
- Persistent Qdrant with Docker
- Streaming responses
- Table-to-SQL conversion
- Citation highlighting in PDF viewer
This project is licensed under the MIT License - see the LICENSE file for details.
Vasudev Siddh (@Vasu-DevS)
Built for SOS 42 Technical Assessment β’ January 2026
⭐ Star this repo if you found it useful!