Multi-Modal RAG

FastAPI, Celery, Redis, PostgreSQL, PGvector, AWS S3, Supabase, Clerk, LangGraph, Ollama, Docker, Unstructured, LangSmith, RAGAS

Problem

Enterprise teams handling sensitive documents needed a RAG system that could process multi-format files (PDF, DOCX, PPTX, URLs), extract text, tables, and images, and answer questions with citation-backed accuracy. It also had to support fully local LLM deployment for zero data leakage.

System Design

[Figures: Multi-Modal RAG system design diagrams 1 to 4]

Key Features

  • Multi-modal document processing: PDF, DOCX, PPTX, and URLs with text, table, and image extraction
  • Hybrid retrieval with vector + keyword search, multi-query expansion, reranking, and RRF
  • LangGraph multi-agent system with supervisor orchestration and SSE streaming
  • Three-layer input guardrails: toxicity, prompt-injection, and PII detection
  • Citation tracking for grounded, verifiable responses
  • Configurable local LLM support (Ollama/LLaMA) for zero data leakage
  • RAGAS-validated, with ~80% higher accuracy than a traditional RAG baseline (N = 30)

Details

Built to learn how production RAG pipelines work (multi-modal ingestion, hybrid retrieval, agentic generation, and evaluation) by building one from scratch.

Built a scalable RAG application on FastAPI. Clients upload files directly to S3 through presigned URLs, and Celery workers backed by Redis process documents asynchronously. This reduced backend load by 80% and enabled real-time background ingestion with full visibility into processing status.

Implemented multi-modal processing for 4 formats (PDF, DOCX, PPTX, URLs) using Unstructured, extracting three content types (text, tables, images) into PostgreSQL with PGvector for 1536-dimensional embeddings.
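
As a rough illustration of how the three content types and 1536-dimensional embeddings might sit in PGvector, here is a minimal sketch. The table name, column names, and the psycopg-style `%s` placeholders are assumptions made for this example, not the project's actual schema. The `vector(1536)` column type and the `<=>` cosine-distance operator come from the pgvector extension.

```python
EMBEDDING_DIM = 1536  # dimensionality stated in the project description

# Hypothetical schema: table and column names are illustrative only.
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id           BIGSERIAL PRIMARY KEY,
    document_id  BIGINT NOT NULL,
    content_type TEXT NOT NULL CHECK (content_type IN ('text', 'table', 'image')),
    content      TEXT NOT NULL,
    embedding    vector(1536) NOT NULL
);
"""

# <=> is pgvector's cosine-distance operator; smaller means more similar.
SEARCH_SQL = """
SELECT id, content_type, content
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(embedding):
    """Format a list of floats as pgvector's text input, e.g. '[0.1,0.2]'."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}"
        )
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"
```

With a database driver of your choice, `SEARCH_SQL` would be executed with `to_vector_literal(query_embedding)` and a result limit as its two parameters.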

Developed a hybrid retrieval system combining vector + keyword search, multi-query expansion, reranking, and Reciprocal Rank Fusion (RRF), achieving ~30% higher retrieval accuracy with configurable search strategies per project.
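
The fusion step can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the project's code: the function name, the chunk IDs, and the smoothing constant k = 60 (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the vector and keyword retrievers.
vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
keyword_hits = ["chunk_b", "chunk_d", "chunk_a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

Chunks that rank well in both lists rise to the top, which is why RRF works without having to normalize the two retrievers' raw scores against each other.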

Created a LangGraph-based multi-agent system with supervisor orchestration, three-layer input guardrails (toxicity, prompt-injection, PII detection), citation tracking for grounded responses, and SSE streaming that emits token and citation events for the client to consume in real time.
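
To make the streaming part concrete, here is a minimal sketch of how token and citation events could be serialized in the Server-Sent Events wire format. The event names (`token`, `citation`, `done`), the payload fields, and the function names are assumptions for this example, not the project's actual protocol.

```python
import json


def sse_event(event, data):
    """Serialize one Server-Sent Event: an event name plus a JSON payload."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_answer(tokens, citations):
    """Yield token events as they are produced, then one event per citation."""
    for token in tokens:
        yield sse_event("token", {"text": token})
    for citation in citations:
        yield sse_event("citation", citation)
    # A final marker so the client knows the stream is complete (assumed).
    yield sse_event("done", {})


# Hypothetical tokens and citation payload.
events = list(stream_answer(["Hi"], [{"doc_id": "d1", "page": 2}]))
```

In a FastAPI endpoint, a generator like this could be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`, so the browser's `EventSource` receives each event as it is yielded.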

Ensured enterprise-grade data privacy with configurable local LLM support (Ollama/LLaMA) enabling zero data leakage for sensitive documents, allowing organizations to process proprietary information on-premises without external API calls.
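
As a sketch of what "no external API calls" looks like in practice, the snippet below builds a request to a locally running Ollama server using only the Python standard library. The model name `llama3` and the helper function names are assumptions. The endpoint and port are Ollama's defaults, and the call only succeeds if Ollama is running on the same machine.

```python
import json
import urllib.request

# Ollama's default local endpoint; traffic never leaves the host.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3"):
    """Send the prompt to the local model and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Swapping between a hosted provider and a local model then becomes a matter of configuration, since only the endpoint and model name change.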

Validated quality using the RAGAS evaluation framework, demonstrating ~80% higher accuracy than a traditional RAG baseline (N = 30).