vasu-devs/Waldo

Python FastAPI React Qdrant LangGraph Gemini Groq

🔍 Waldo - Multimodal Agentic RAG Pipeline

A production-grade Retrieval-Augmented Generation system with multimodal support for PDFs containing text, images, tables, and charts.

Built with Docling • Gemini VLM • Groq LLMs • Qdrant Vector DB • LangGraph

Overview • Features • Architecture • Installation • Usage • API


🎯 Overview

Waldo is an agentic RAG (Retrieval-Augmented Generation) pipeline that ingests complex PDF documents and answers questions about them. It handles documents containing:

  • 📄 Text content - Paragraphs, headings, lists
  • 📊 Tables - Structured data with rows and columns
  • 📈 Charts & Graphs - Bar charts, line graphs, pie charts
  • 🖼️ Figures & Diagrams - Medical images, technical diagrams, illustrations

What Makes This Different?

| Traditional RAG | Waldo Multimodal RAG |
| --- | --- |
| Text extraction only | Text, tables, AND figures |
| Static retrieval | Agentic self-correction with retries |
| Single-pass answers | Query rewriting for better results |
| No visual understanding | Gemini VLM transcription for images |
| Basic keyword search | Semantic + metadata hybrid search |
| May hallucinate | Strict anti-hallucination guardrails |

✨ Key Features

🔄 Intelligent Document Processing

| Feature | Description |
| --- | --- |
| PDF Parsing | IBM's Docling library for high-fidelity document structure extraction |
| Figure Extraction | Identifies and extracts individual figures with bounding boxes |
| Table Detection | Preserves table structure and exports to Markdown |
| OCR Support | RapidOCR for text extraction from scanned documents |
| Chunking | LangChain's RecursiveCharacterTextSplitter (1000 chars, 200 overlap) |
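
The chunking parameters in the table (1000 characters, 200 overlap) can be illustrated with a simplified fixed-window splitter. This is only a sketch and not LangChain's RecursiveCharacterTextSplitter, which additionally prefers to break on separators such as paragraph and sentence boundaries. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    pieces = chunk_text("x" * 2500)
    print([len(p) for p in pieces])  # [1000, 1000, 900]
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still retrievable in full from one of the two chunks.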

🤖 Multimodal Understanding

| Feature | Description |
| --- | --- |
| Gemini VLM | Transcribes figures, diagrams, and charts to searchable semantic text |
| Rich Metadata | Captures figure numbers, captions, and surrounding context |
| Shadow Text | Creates semantic representations for visual content |
| Fallback System | Uses Docling captions when Gemini quota is exhausted |

🧩 Agentic RAG with LangGraph

| Feature | Description |
| --- | --- |
| Adaptive Routing | Decides between direct response vs. retrieval |
| Document Grading | LLM-based relevance scoring for each document |
| Visual Query Detection | Automatically includes figures when semantically appropriate |
| Query Rewriting | Transforms failed queries for better retrieval (max 2 retries) |
| Anti-Hallucination | Refuses to answer out-of-scope questions |

💾 Vector Storage & Retrieval

| Feature | Description |
| --- | --- |
| Qdrant | High-performance vector similarity search (in-memory) |
| Sentence Transformers | all-MiniLM-L6-v2 embeddings (384 dimensions) |
| TOP-K Retrieval | Fetches 10 most similar documents |
| Metadata Filtering | Filter by element_type, page_number, etc. |
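
To make top-k retrieval with metadata filtering concrete, here is a pure-Python sketch of the idea. It does not use the Qdrant client and is not how Waldo calls Qdrant; the `search` function, the point layout, and the toy 2-dimensional vectors are assumptions for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(points, query_vector, top_k=10, element_type=None):
    """Return the top_k stored points most similar to query_vector.

    When element_type is given, only points whose payload matches are ranked.
    """
    candidates = [
        p for p in points
        if element_type is None or p["payload"]["element_type"] == element_type
    ]
    candidates.sort(
        key=lambda p: cosine_similarity(p["vector"], query_vector),
        reverse=True,
    )
    return candidates[:top_k]


# Toy stored points with the payload fields named in this README.
points = [
    {"vector": [1.0, 0.0], "payload": {"element_type": "text", "page_number": 1}},
    {"vector": [0.0, 1.0], "payload": {"element_type": "figure", "page_number": 2}},
    {"vector": [0.7, 0.7], "payload": {"element_type": "table", "page_number": 3}},
]

if __name__ == "__main__":
    hits = search(points, [1.0, 0.1], top_k=2)
    print([h["payload"]["page_number"] for h in hits])  # [1, 3]
```

Filtering before ranking keeps the result size predictable: a query restricted to `element_type="figure"` still returns up to `top_k` figures instead of whatever figures happen to be in the unfiltered top 10.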

🎨 Modern Web Interface

| Feature | Description |
| --- | --- |
| React + Vite | Fast, responsive chat interface |
| Real-time Progress | Live ingestion status with polling |
| Inline Images | Figures and tables displayed directly in chat |
| TailwindCSS | Modern, clean styling |

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                             FRONTEND (React + Vite)                             │
│                                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐ │
│  │    Upload    │  │     Chat     │  │    Reset     │  │   Inline Images      │ │
│  │  Component   │  │  Interface   │  │    Button    │  │ (Figures + Tables)   │ │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────────────────────┘ │
└─────────┼─────────────────┼─────────────────┼───────────────────────────────────┘
          │                 │                 │
          │ POST /ingest    │ POST /chat      │ DELETE /reset
          ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                BACKEND (FastAPI)                                │
│                                                                                 │
│  ┌────────────────────────────────────────────────────────────────────────────┐ │
│  │                            API Layer (main.py)                             │ │
│  │                                                                            │ │
│  │   POST /ingest    POST /chat    DELETE /reset    GET /ingestion-status     │ │
│  └────────┬──────────────┬──────────────┬─────────────────────────────────────┘ │
│           │              │              │                                       │
│           ▼              ▼              ▼                                       │
│  ┌─────────────────┐  ┌─────────────────────────────────────────────────────┐   │
│  │  IngestScript   │  │                   GraphBrain                        │   │
│  │                 │  │                  (LangGraph)                        │   │
│  │  ┌───────────┐  │  │                                                     │   │
│  │  │PDF Parser │  │  │   ┌─────────┐   ┌─────────┐   ┌─────────────────┐   │   │
│  │  │ (Docling) │  │  │   │ Router  │──▶│Retrieve │──▶│     Grade       │   │   │
│  │  └─────┬─────┘  │  │   └─────────┘   └─────────┘   │ (LLM Relevance) │   │   │
│  │        │        │  │                               └────────┬────────┘   │   │
│  │  ┌─────▼─────┐  │  │                                        │            │   │
│  │  │  Gemini   │  │  │   ┌─────────┐   ┌─────────┐   ┌────────▼────────┐   │   │
│  │  │Transcribe │  │  │   │ Rewrite │◀──│ Decide  │◀──│    Generate     │   │   │
│  │  │   (VLM)   │  │  │   │  Query  │   │         │   │  (Groq LLM)     │   │   │
│  │  └─────┬─────┘  │  │   └─────────┘   └─────────┘   └─────────────────┘   │   │
│  │        │        │  │                                                     │   │
│  │  ┌─────▼─────┐  │  └─────────────────────────────────────────────────────┘   │
│  │  │ Embed &   │  │                                                            │
│  │  │  Store    │──┼──────────────────────────────────────────┐                 │
│  │  └───────────┘  │                                          │                 │
│  └─────────────────┘                                          │                 │
│                                                               ▼                 │
│  ┌────────────────────────────────────────────────────────────────────────────┐ │
│  │                            VectorStore (Qdrant)                            │ │
│  │                                                                            │ │
│  │   Collection: pdf_documents                                                │ │
│  │   ┌───────────────────────────────────────────────────────────────────┐    │ │
│  │   │ Fields:                                                           │    │ │
│  │   │   • shadow_text (str)         - Searchable content                │    │ │
│  │   │   • element_type (str)        - "text", "figure", "table"         │    │ │
│  │   │   • page_number (int)         - Source page                       │    │ │
│  │   │   • original_image_path (str) - Path to extracted image           │    │ │
│  │   │   • source_pdf (str)          - Original PDF filename             │    │ │
│  │   │   • heading (str)             - Figure/Table number               │    │ │
│  │   └───────────────────────────────────────────────────────────────────┘    │ │
│  └────────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
                                         │
             ┌───────────────────────────┼───────────────────────────┐
             ▼                           ▼                           ▼
      ┌─────────────┐             ┌─────────────┐             ┌─────────────┐
      │  Groq API   │             │ Gemini API  │             │   Qdrant    │
      │ (LLM Chat)  │             │   (VLM)     │             │ (In-Memory) │
      │             │             │             │             │             │
      │ llama-3.3   │             │ gemini-2.0  │             │ Collection: │
      │ 70b-versat  │             │   flash     │             │pdf_documents│
      └─────────────┘             └─────────────┘             └─────────────┘

🔄 Agentic Workflow

The system follows a cognitive architecture with self-correction capabilities:

                              ┌─────────────────┐
                              │   User Query    │
                              └────────┬────────┘
                                       │
                                       ▼
                              ┌─────────────────┐
                              │     Router      │
                              │ RAG or Direct?  │
                              └────────┬────────┘
                                       │
                    ┌──────────────────┴──────────────────┐
                    │                                     │
                    ▼                                     ▼
            ┌────────────────┐                    ┌────────────────┐
            │    Greeting    │                    │   Retrieval    │
            │ Direct Answer  │                    │ Top-10 Chunks  │
            └────────────────┘                    └───────┬────────┘
                                                          │
                                                          ▼
                                                  ┌────────────────┐
                                                  │     Grade      │
                                                  │ Each Document  │
                                                  │  (LLM-based)   │
                                                  └───────┬────────┘
                                                          │
                                       ┌──────────────────┴──────────────────┐
                                       │                                     │
                                 Has Relevant?                         No Relevant?
                                       │                                     │
                                       ▼                                     ▼
                               ┌────────────────┐                    ┌────────────────┐
                               │    Generate    │                    │ Rewrite Query  │
                               │ Grounded Answer│                    │ (Max 2 tries)  │
                               └───────┬────────┘                    └───────┬────────┘
                                       │                                     │
                                       │                      ┌──────────────┴──────────────┐
                                       │                      │                             │
                                       │              Retry Available?                Max Retries?
                                       │                      │                             │
                                       │                      ▼                             ▼
                                       │              ┌────────────────┐            ┌────────────────┐
                                       │              │ Back to        │            │     Refuse     │
                                       │              │ Retrieval      │            │ "I don't have  │
                                       │              └────────────────┘            │  information"  │
                                       │                                            └───────┬────────┘
                                       │                                                    │
                                       └─────────────────────────┬──────────────────────────┘
                                                                 │
                                                                 ▼
                                                         ┌────────────────┐
                                                         │    Response    │
                                                         │ + Inline Image │
                                                         └────────────────┘

Workflow Steps Explained

| Step | Description |
| --- | --- |
| 1. Router | Classifies query as greeting (skip retrieval) or RAG query (retrieve documents) |
| 2. Retrieve | Semantic search in Qdrant, returns top-10 most similar chunks |
| 3. Grade | LLM evaluates each document's relevance (strict for figures, lenient for text) |
| 4. Decide | If relevant docs found → Generate. If not → Rewrite or Refuse |
| 5. Rewrite | Transforms query for better retrieval (e.g., "Fig 1" → "Figure 1 diagram") |
| 6. Generate | Synthesizes answer from relevant documents only, refuses if no context |
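
The steps above can be sketched as a plain Python loop. This is a simplified illustration of the control flow, not the LangGraph graph in graph.py: the `answer` function, its callable parameters, the greeting text, and the demo stand-ins are all hypothetical, and each callable would be backed by an LLM or Qdrant in the real system.

```python
MAX_RETRIES = 2  # matches the "max 2 retries" limit described above
REFUSAL = "I don't have information about that in the uploaded document."


def answer(query, is_greeting, retrieve, grade, rewrite, generate):
    """Route, retrieve, grade, then generate, rewrite, or refuse."""
    if is_greeting(query):
        # Hypothetical direct reply; greetings skip retrieval entirely.
        return "Hello! Ask me anything about the uploaded document."
    retries = 0
    while True:
        docs = retrieve(query)
        relevant = [doc for doc in docs if grade(query, doc)]
        if relevant:
            return generate(query, relevant)  # grounded answer
        if retries >= MAX_RETRIES:
            return REFUSAL  # nothing relevant after all rewrites
        query = rewrite(query)
        retries += 1


# Stand-in components so the loop can be exercised without API keys.
DOCS = ["Figure 1 diagram of the ingestion pipeline"]


def demo_answer(query):
    return answer(
        query,
        is_greeting=lambda q: q.strip().lower() in {"hi", "hello"},
        retrieve=lambda q: DOCS,
        grade=lambda q, doc: q.lower() in doc.lower(),
        rewrite=lambda q: q.replace("Fig ", "Figure "),
        generate=lambda q, docs: f"Answer grounded in {len(docs)} document(s)",
    )


if __name__ == "__main__":
    print(demo_answer("Fig 1"))  # succeeds after one rewrite
    print(demo_answer("revenue in 1850"))  # refuses after two rewrites
```

Capping the retries bounds the number of LLM calls per question: at most three retrieval and grading passes happen before the pipeline refuses.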

🛠️ Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Frontend | React 18 + Vite | Modern, fast UI framework |
| Styling | TailwindCSS | Utility-first CSS |
| Backend | FastAPI | Async Python web framework |
| PDF Processing | IBM Docling | Document structure extraction |
| OCR | RapidOCR | Text recognition for images |
| Vision LLM | Gemini 2.0 Flash | Figure/chart transcription |
| Chat LLM | Groq (Llama 3.3 70B) | Fast inference, answer generation |
| Embeddings | Sentence Transformers | all-MiniLM-L6-v2 (384 dims) |
| Vector DB | Qdrant | In-memory vector storage |
| Orchestration | LangGraph | Agentic state machine |
| Text Splitting | LangChain | RecursiveCharacterTextSplitter |

📦 Installation

Prerequisites

| Requirement | Version |
| --- | --- |
| Python | 3.11+ |
| Node.js | 18+ |
| GPU | CUDA-capable (optional, for faster inference) |

1. Clone the Repository

git clone https://github.com/Vasu-DevS/Waldo.git
cd Waldo

2. Set Up Python Environment

# Create virtual environment
python -m venv .venv

# Activate (Windows)
.\.venv\Scripts\activate

# Activate (Linux/Mac)
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

3. Install Frontend Dependencies

cd frontend
npm install
cd ..

4. Configure Environment Variables

# Copy example config
cp .env.example .env

# Edit .env with your API keys

⚙️ Configuration

Create a .env file in the project root:

# Required API Keys
GROQ_API_KEY=gsk_your_groq_api_key_here
GOOGLE_API_KEY=your_google_gemini_api_key_here

# Optional: Qdrant (defaults to in-memory)
QDRANT_HOST=localhost
QDRANT_PORT=6333
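
As a rough illustration of how these variables might be consumed, the sketch below validates the two required keys and applies defaults for the optional ones. The function `load_settings` and the convention that a missing QDRANT_HOST means in-memory storage are assumptions, not code taken from the project.

```python
import os


def load_settings(env=None):
    """Read Waldo-style settings from an environment mapping.

    Pass a dict for testing; by default the process environment is used.
    """
    env = os.environ if env is None else env
    required = ("GROQ_API_KEY", "GOOGLE_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {
        "groq_api_key": env["GROQ_API_KEY"],
        "google_api_key": env["GOOGLE_API_KEY"],
        # Assumption: no host configured means "use in-memory Qdrant".
        "qdrant_host": env.get("QDRANT_HOST"),
        "qdrant_port": int(env.get("QDRANT_PORT", "6333")),
    }


if __name__ == "__main__":
    demo = {"GROQ_API_KEY": "gsk_example", "GOOGLE_API_KEY": "example"}
    print(load_settings(demo)["qdrant_port"])  # 6333
```

Failing at startup when a key is missing surfaces configuration mistakes before the first upload rather than in the middle of an ingestion run.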

Getting API Keys

| Service | How to Get Key |
| --- | --- |
| Groq | Sign up at console.groq.com → API Keys |
| Gemini | Sign up at aistudio.google.com → Get API Key |

Rate Limits (Free Tier)

| Service | Free Tier Limit |
| --- | --- |
| Groq | 30 requests/min, 6000 tokens/min |
| Gemini | 15 requests/min, 1,500 requests/day |

🚀 Usage

Start the Backend

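# From project root (Windows)
.\.venv\Scripts\python -m uvicorn backend.main:app --port 8000 --host 0.0.0.0
# From project root (Linux/Mac)
.venv/bin/python -m uvicorn backend.main:app --port 8000 --host 0.0.0.0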

Start the Frontend

cd frontend
npm run dev

Access the Application

Open http://localhost:5173 in your browser.

Workflow

  1. 📤 Upload PDF - Drag and drop or click to upload
  2. ⏳ Wait for Ingestion - Progress indicator shows status (~30s for text, ~2-5min with figures)
  3. 💬 Ask Questions - Chat naturally about the document
  4. 🖼️ View Results - Relevant figures/tables appear inline

📡 API Reference

POST /ingest

Upload and process a PDF document.

Request:

curl -X POST http://localhost:8000/ingest \
  -F "file=@document.pdf" \
  -F "filename=document.pdf"

Response:

{
  "status": "processing",
  "filename": "document.pdf",
  "message": "Ingestion started in background"
}

GET /ingestion-status/{filename}

Check ingestion progress.

Response:

{
  "status": "complete",
  "filename": "document.pdf",
  "stats": {
    "total_elements": 15,
    "text_chunks": 10,
    "tables": 2,
    "figures": 3,
    "stored": 16
  }
}
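
Because ingestion runs in the background, a client has to poll this endpoint until processing finishes. The sketch below shows one way to do that with the standard library. It assumes the endpoint reports "processing" while work is ongoing (as in the /ingest response above) and treats any other status as final; the function names, the 2-second interval, and the timeout are hypothetical choices.

```python
import json
import time
from urllib.parse import quote
from urllib.request import urlopen


def http_fetch_status(filename, base_url="http://localhost:8000"):
    """GET /ingestion-status/{filename} and decode the JSON body."""
    with urlopen(f"{base_url}/ingestion-status/{quote(filename)}") as resp:
        return json.load(resp)


def wait_for_ingestion(filename, fetch_status=http_fetch_status,
                       interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until the status is no longer "processing" or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(filename)
        if status.get("status") != "processing":
            return status  # e.g. {"status": "complete", "stats": {...}}
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Ingestion of {filename!r} still processing after {timeout}s"
            )
        sleep(interval)
```

With the backend running, `wait_for_ingestion("document.pdf")` blocks until the status changes and returns the final JSON. The `fetch_status` and `sleep` parameters exist so the loop can be tested without a server.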

POST /chat

Send a query and get a response.

Request:

{
  "message": "What are the key findings in Table 1?"
}

Response:

{
  "response": "### Key Findings\n\n* **Revenue increased** by 25% in Q4\n* **Customer satisfaction** rated 4.5/5\n...",
  "relevant_documents": [
    {
      "id": "abc123",
      "element_type": "table",
      "page_number": 3,
      "original_image_path": "E:/output/table_1_page_3.png",
      "shadow_text": "Table 1 | Q4 Financial Results..."
    }
  ]
}

DELETE /reset

Clear the vector database for a new document.

Response:

{
  "status": "success",
  "message": "Knowledge base reset complete"
}

GET /images/{filename}

Serve extracted figure/table images.

Example:

http://localhost:8000/images/figure_1_page_1.png
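
A client can combine the /chat response with this endpoint to display figures inline. The sketch below assumes the image endpoint expects the base file name of `original_image_path`, which is consistent with the examples in this reference but is not confirmed by it; `image_urls` is a hypothetical helper.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def image_urls(chat_response, base_url="http://localhost:8000"):
    """Build /images/{filename} URLs for documents that carry an image path."""
    urls = []
    for doc in chat_response.get("relevant_documents", []):
        path = doc.get("original_image_path")
        if not path:
            continue  # plain text chunks have no extracted image
        # PureWindowsPath accepts both "/" and "\" as separators.
        name = PureWindowsPath(path).name
        urls.append(f"{base_url}/images/{quote(name)}")
    return urls


if __name__ == "__main__":
    response = {
        "relevant_documents": [
            {"element_type": "table",
             "original_image_path": "E:/output/table_1_page_3.png"},
        ]
    }
    print(image_urls(response))
```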

📁 Project Structure

Waldo/
├── backend/
│   ├── main.py                        # FastAPI app & endpoints
│   ├── __init__.py
│   │
│   ├── IngestScript/
│   │   ├── ingest.py                  # Main ingestion orchestrator
│   │   ├── __init__.py
│   │   └── services/
│   │       ├── pdf_parser.py          # Docling PDF extraction
│   │       ├── gemini_transcriber.py  # VLM transcription + verification
│   │       └── vector_store.py        # Qdrant operations + embeddings
│   │
│   └── GraphBrain/
│       ├── graph.py                   # LangGraph RAG pipeline
│       └── __init__.py
│
├── frontend/
│   ├── src/
│   │   ├── App.jsx                    # Main React component
│   │   ├── main.jsx                   # Entry point
│   │   └── index.css                  # TailwindCSS styles
│   ├── index.html
│   ├── package.json
│   ├── vite.config.js
│   └── tailwind.config.js
│
├── output/                            # Extracted images (gitignored)
├── TestingDATA/                       # Sample PDFs for testing
│
├── .env.example                       # Environment template
├── .gitignore
├── requirements.txt
├── README.md
└── SYSTEM_CARD.md                     # Detailed system documentation

🛡️ Anti-Hallucination Safeguards

Waldo implements multiple layers to prevent hallucination:

1. Zero-Context Refusal

if len(relevant_documents) == 0:
    return "I don't have information about that in the uploaded document."

2. Grounded Generation Prompt

The system prompt explicitly forbids external knowledge:

CRITICAL RULE - NO HALLUCINATION:
* You can ONLY answer based on the document fragments provided.
* If the question asks about something NOT in the documents, refuse.
* NEVER make up information. NEVER use external knowledge.

3. Low Temperature

temperature=0.3  # Reduced from 0.7 for factual responses

4. Document Grading

Each retrieved document is evaluated by an LLM before use:

prompt = f"Is this document relevant to: {query}? Answer 'yes' or 'no'."
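
A fuller sketch of the grading step might attach the document text to the prompt and parse the model's yes/no reply. The helper names below are hypothetical, and `ask_llm` stands in for whatever callable sends a prompt to the chat model and returns its text.

```python
def build_grading_prompt(query: str, document_text: str) -> str:
    """Combine the yes/no question with the document being graded."""
    return (
        f"Is this document relevant to: {query}? Answer 'yes' or 'no'.\n\n"
        f"Document:\n{document_text}"
    )


def is_relevant(llm_reply: str) -> bool:
    """Treat any reply that starts with "yes" (case-insensitive) as relevant."""
    return llm_reply.strip().lower().startswith("yes")


def filter_relevant(query, documents, ask_llm):
    """Keep only the documents the grader marks as relevant."""
    return [
        doc for doc in documents
        if is_relevant(ask_llm(build_grading_prompt(query, doc)))
    ]
```

Anything other than a clear "yes" counts as not relevant, so an ambiguous or malformed reply leads to a refusal rather than an answer built on unverified context.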

5. Figure Limiting

Only the most relevant figure is shown per response to avoid information overload.


📊 Performance & Limitations

Performance

| Metric | Value |
| --- | --- |
| Text Ingestion | ~5-10 seconds per PDF |
| Figure Transcription | ~65 seconds per figure (rate limited) |
| Query Response | 1-3 seconds |
| Embedding Speed | ~30 docs/sec on GPU |

Current Limitations

| Limitation | Workaround |
| --- | --- |
| Gemini Rate Limits | 65s delay between calls, fallback to Docling captions |
| Single Document | Reset required for new document |
| In-Memory Storage | Data lost on restart |
| Table OCR | Some complex tables may not extract perfectly |

Future Improvements

  • Multi-document support with source attribution
  • Persistent Qdrant with Docker
  • Streaming responses
  • Table-to-SQL conversion
  • Citation highlighting in PDF viewer

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


👨‍💻 Author

Vasudev Siddh (@Vasu-DevS)

Built for SOS 42 Technical Assessment • January 2026


⭐ Star this repo if you found it useful!
