# RAG System with Multi-Model Support

A flexible Retrieval-Augmented Generation (RAG) system that lets you query your documents using local or cloud-based LLMs. It features intelligent document processing, semantic search with reranking, and a simple web interface.

## 🎯 What This System Does

1. **Ingests** your documents (Word, PDF, PowerPoint, Text, Markdown) into a vector database
2. **Retrieves** relevant chunks using semantic search when you ask a question
3. **Reranks** results to find the most relevant information
4. **Generates** comprehensive answers using your choice of LLM

## 🏗️ Architecture

```
Documents (docs/)
    ↓
Document Processing & Chunking
    ↓
Embedding Model → Vector Database (Chroma)
    ↓
Query → Semantic Search (50 chunks)
    ↓
Reranking (Top 10 chunks)
    ↓
LLM (Qwen/Mistral/OpenAI) → Answer
```

## ✨ Features

- **Multiple LLM Providers**: Ollama (local), Mistral AI, or OpenAI
- **Flexible Embeddings**: Ollama models (mxbai, nomic, bge) or FastEmbed
- **Smart Retrieval**: Semantic search + Flashrank reranking for precision
- **Document Support**: Word, PDF, PowerPoint, Text, Markdown, ODT
- **ZIP Extraction**: Automatically extracts nested ZIP archives
- **Web Interface**: Simple HTTP server with query interface
- **CPU-Friendly**: Optimized for systems without GPU

---

## 📦 Installation

### 1. Install Ollama (for local models)

**Windows (via Winget):**

```powershell
winget install Ollama.Ollama -e
```

Verify installation:

```powershell
ollama --version
```

On Windows, Ollama normally starts in the background automatically after installation. If it is not running:

```powershell
ollama serve
```

### 2. Pull Required Models

**LLM Model (for answering queries):**

```powershell
# Recommended: 7B model (4.7GB, good balance)
ollama pull qwen2.5:7b-instruct

# Alternative: Smaller for low-end CPUs (2.0GB)
ollama pull qwen2.5:3b-instruct

# Alternative: Larger for better accuracy (8.9GB)
ollama pull qwen2.5:14b-instruct
```

**Embedding Model (for semantic search):**

```powershell
# Recommended: Best for technical documents (669MB, 1024-dim)
ollama pull mxbai-embed-large

# Alternative: Lighter option (274MB, 768-dim)
ollama pull nomic-embed-text
```

### 3. Install Pandoc (for ODT files)

Optional, only needed if you have OpenDocument files:

```powershell
winget install --id JohnMacFarlane.Pandoc -e
```

### 4. Set Up Python Environment

**Create virtual environment:**

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```

**Note:** If you get an execution policy error:

```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

**Install dependencies:**

```powershell
pip install -r requirements.txt
```

---

## ⚙️ Configuration

### 1. Create .env file

```powershell
Copy-Item .env.example .env
```

### 2. Configure Settings

Edit `.env` with your preferred settings:

**LLM Configuration:**

```env
MODEL_PROVIDER=ollama                 # Options: ollama, mistral, openai
OLLAMA_MODEL=qwen2.5:7b-instruct      # The model for generating answers
OLLAMA_BASE_URL=http://localhost:11434
```

**Embedding Configuration:**

```env
EMBEDDING_PROVIDER=ollama             # Options: ollama, fastembed
EMBEDDING_MODEL=mxbai-embed-large     # Must match during ingestion & query!
```

**Retrieval Settings:**

```env
RETRIEVAL_CHUNKS=50     # How many chunks to retrieve initially
TOP_N_RERANK=10         # Final chunks sent to LLM after reranking
USE_RERANKING=true      # Enable for better accuracy
```

**Document Processing:**

```env
CHUNK_SIZE=800          # Characters per chunk (smaller = more chunks)
CHUNK_OVERLAP=160       # Overlap prevents context loss
BATCH_SIZE=100          # Documents processed per batch
```

---

## 🚀 Usage

### Step 1: Ingest Your Documents

Place your documents in the `docs/` folder, then run:

```powershell
python rag\ingest.py
```

**What happens:**

- Extracts all ZIP files (including nested archives)
- Loads documents (Word, PDF, PowerPoint, etc.)
- Splits into chunks using configured size/overlap
- Generates embeddings using your chosen model
- Stores vectors in Chroma database (`storage/chroma/`)

**⏱️ Time estimate:**

- ~458 documents = ~5-10 minutes with mxbai-embed-large
- Faster with smaller embedding models

### Step 2: Start the Frontend (Web Interface)

```powershell
python frontend\app.py
```

The server starts at: **http://127.0.0.1:8000**

Open it in your browser and start asking questions!

### Command-Line Query (Alternative)

Test queries without the web interface:

```powershell
python rag\query.py "What is the latest performance of V-PCC for gaussian splat?"
```

---

## 📊 Improving System Performance

### 🎯 Improve Answer Accuracy

#### 1. **Upgrade Embedding Model**

Better embeddings = better retrieval = better answers

| Model | Size | Dimensions | Best For | Quality |
|-------|------|------------|----------|---------|
| `mxbai-embed-large` | 669MB | 1024 | Technical docs | ⭐⭐⭐⭐⭐ |
| `bge-large` | 1.34GB | 1024 | Highest accuracy | ⭐⭐⭐⭐⭐ |
| `nomic-embed-text` | 274MB | 768 | General purpose | ⭐⭐⭐⭐ |
| `bge-small` (FastEmbed) | ~130MB | 384 | Speed over quality | ⭐⭐⭐ |

**How to upgrade:**

```powershell
# Pull new embedding model
ollama pull bge-large

# Update .env: set EMBEDDING_MODEL=bge-large

# Re-ingest documents (required!)
Remove-Item -Path "storage\chroma" -Recurse -Force
python rag\ingest.py
```

**⚠️ Important:** You MUST re-ingest when changing embedding models! Query embeddings must match stored embeddings.

#### 2. **Upgrade LLM Model**

Larger models understand context better and generate more accurate answers.

| Model | Size | RAM Needed | Speed | Quality |
|-------|------|------------|-------|---------|
| `qwen2.5:3b-instruct` | 2.0GB | 4GB | Fast | ⭐⭐⭐ |
| `qwen2.5:7b-instruct` | 4.7GB | 8GB | Medium | ⭐⭐⭐⭐ |
| `qwen2.5:14b-instruct` | 8.9GB | 16GB | Slow | ⭐⭐⭐⭐⭐ |
| `qwen2.5:32b` | 19GB | 32GB+ | Very slow | ⭐⭐⭐⭐⭐ |

**How to upgrade:**

```powershell
# Pull new model
ollama pull qwen2.5:14b-instruct

# Update .env: set OLLAMA_MODEL=qwen2.5:14b-instruct

# Restart frontend
python frontend\app.py
```

**No re-ingestion needed** when changing LLM models.

#### 3. **Tune Retrieval Settings**

Balance recall (finding relevant chunks) against precision (avoiding irrelevant chunks):

```env
# More initial chunks = better recall
RETRIEVAL_CHUNKS=100

# More reranked chunks = more context for LLM
TOP_N_RERANK=15
```

**⚠️ Warning:** Larger models handle more chunks better!

- 7B models: max 10 chunks (get overwhelmed beyond this)
- 14B models: 12-15 chunks optimal
- 32B models: 25-30 chunks

#### 4. **Enable Reranking**

Reranking dramatically improves precision by re-scoring retrieved chunks:

```env
USE_RERANKING=true
```

**Impact:** ~30-50% improvement in answer relevance for complex queries.

### ⚡ Improve Speed

#### 1. **Use Smaller Embedding Model**

Trade-off: Speed vs. accuracy. Set this in `.env`, then re-ingest, because the embedding model changes:

```env
# Fast option
EMBEDDING_MODEL=nomic-embed-text
```

#### 2. **Use Smaller LLM**

```powershell
ollama pull qwen2.5:3b-instruct
```

#### 3. **Reduce Retrieval Chunks**

```env
RETRIEVAL_CHUNKS=20     # Faster search
TOP_N_RERANK=5          # Faster reranking
```

#### 4. **Increase Chunk Size**

Fewer chunks = faster retrieval (but potentially lower accuracy):

```env
CHUNK_SIZE=1200         # Larger chunks = fewer total chunks
CHUNK_OVERLAP=200
```

**⚠️ Note:** Requires re-ingestion!

### 💪 Improve Answer Depth & Usefulness

#### 1. **Optimize Chunk Size for Your Documents**

- **Technical docs with tables/code:** Smaller chunks (600-800)
- **Long-form articles:** Medium chunks (1000-1200)
- **Books/reports:** Larger chunks (1500-2000)

#### 2. **Increase Context for LLM**

More chunks = more comprehensive answers:

```env
RETRIEVAL_CHUNKS=100
TOP_N_RERANK=15         # Only if using 14B+ model!
```

#### 3. **Use Larger Open-Source Models for Best Quality**

For maximum quality while staying open-source and cost-free:

**Option 1: Larger Qwen Models (Recommended)**

```powershell
# Best balance: quality + speed (if you have 16GB+ RAM)
ollama pull qwen2.5:14b-instruct

# Maximum quality (requires 32GB+ RAM)
ollama pull qwen2.5:32b-instruct
```

```env
OLLAMA_MODEL=qwen2.5:14b-instruct
```

**Option 2: Qwen 3 (Newest generation)**

```powershell
# Latest Qwen 3 models (better reasoning)
ollama pull qwen3:8b
ollama pull qwen3:14b
```

**Option 3: DeepSeek-R1 (Strong reasoning)**

```powershell
# Excellent for complex technical questions
ollama pull deepseek-r1:7b
ollama pull deepseek-r1:14b
```

**Quality Comparison (all open-source & free):**

- `qwen2.5:32b` ≈ GPT-4 quality (19GB, very slow on CPU)
- `qwen2.5:14b` ≈ GPT-3.5-Turbo quality (8.9GB, acceptable on CPU)
- `deepseek-r1:14b` - Excellent reasoning (9GB)
- `qwen3:14b` - Latest generation (8.9GB)

**⚠️ No cloud APIs needed!** All models run locally for free.

---

## 🔧 Advanced Configuration

### Chunk Size Guidelines

The `CHUNK_SIZE` parameter controls how documents are split.
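
As a rough illustration of how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, the sketch below splits text with a plain sliding character window. It is a simplified stand-in, not the splitter that `rag/ingest.py` actually uses (which may respect paragraph or sentence boundaries), and `split_text` is a hypothetical helper name.

```python
def split_text(text: str, chunk_size: int = 800, chunk_overlap: int = 160) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; the real ingestion
    script may split on separators rather than raw character offsets.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "x" * 2000
    for size, overlap in [(800, 160), (1200, 200)]:
        pieces = split_text(sample, size, overlap)
        print(f"CHUNK_SIZE={size}, CHUNK_OVERLAP={overlap}: {len(pieces)} chunks")
```

With the default values each window starts 640 characters after the previous one (800 - 160), so a 2,000-character document yields three chunks; at 1200/200 it yields two. Fewer chunks means fewer embeddings to compute and search, at the cost of coarser retrieval.
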
Finding the optimal size depends on your document type and questions:

**When to use SMALLER chunks (600-800):**

- Technical documents with tables and code
- Q&A scenarios (specific fact retrieval)
- Documents with dense, structured information

**When to use LARGER chunks (1200-1500):**

- Long-form content (articles, reports)
- Narrative documents (books, essays)
- When questions require broader context

**⚠️ Context Length Limits:**

- `mxbai-embed-large`: Max ~800 chars/chunk (strict limit)
- `nomic-embed-text`: Max ~1500 chars/chunk
- `bge-large`: Max ~1200 chars/chunk

These character ceilings are approximate: the models' actual limits are measured in tokens, so dense content such as code or tables reaches them sooner. If ingestion fails with "the input length exceeds the context length", reduce `CHUNK_SIZE`.

### RAM Requirements

**Ingestion:**

- Minimum: 8GB RAM
- Recommended: 16GB+ for large document sets
- Embedding models: + model size (270MB - 1.3GB)

**Query:**

- 7B LLM: 8GB minimum
- 14B LLM: 16GB minimum
- 32B LLM: 32GB+ minimum
- Reranking: ~12.5MB per chunk

### 🖥️ Hardware Impact on Performance

#### CPU vs GPU Performance

**CPU-Only Systems (current setup):**

- **7B models:** 3-5 seconds/query (acceptable)
- **14B models:** 8-15 seconds/query (slow but usable)
- **32B models:** 30-60+ seconds/query (very slow)
- **Ingestion:** 5-15 minutes for 458 documents

**With GPU (NVIDIA recommended):**

- **7B models:** 0.5-1 second/query (8-10x faster)
- **14B models:** 1-2 seconds/query (6-8x faster)
- **32B models:** 3-5 seconds/query (10-15x faster)
- **Ingestion:** 1-3 minutes (5-10x faster)

**GPU Requirements:**

- 7B models: 6GB VRAM minimum (RTX 3060, RTX 4060)
- 14B models: 10GB VRAM minimum (RTX 3080, RTX 4070)
- 32B models: 24GB VRAM minimum (RTX 3090, RTX 4090)

#### RAM Impact

| RAM | Max Model | Max Chunks | Experience |
|-----|-----------|------------|------------|
| 8GB | 7B | 50 | Basic, slow |
| 16GB | 14B | 100 | Good |
| 32GB | 32B | 300+ | Excellent |
| 64GB+ | 70B+ | 1000+ | Professional |

**⚠️ Important:** More RAM ≠ faster queries, but allows:

- Larger models (better quality)
- More retrieval chunks (better recall)
- Multiple processes without swapping

#### Storage Impact

**SSD vs HDD:**

- **SSD (Recommended):** Vector store loads in 0.5-1 second
- **HDD:** Vector store loads in 3-5 seconds
- **NVMe SSD:** Vector store loads in 0.2-0.5 seconds

**Model Storage:**

- 3B model: ~2GB
- 7B model: ~5GB
- 14B model: ~9GB
- 32B model: ~19GB
- Embedding models: 270MB - 1.3GB
- Vector database: roughly 40MB per 10,000 chunks for the raw 1024-dim float32 vectors alone (10,000 × 1024 × 4 bytes), plus chunk text and metadata

#### CPU Impact on Ingestion

**Embedding Generation (CPU-bound):**

- **8 CPU cores:** ~8-12 minutes (458 docs)
- **16 CPU cores:** ~5-8 minutes
- **32 CPU cores:** ~3-5 minutes

**Document Loading (I/O + CPU):**

- Single-core: 1 document/second
- Multi-core: 5-10 documents/second (parallel processing)

#### Optimal Hardware Recommendations

**Budget Setup ($0 upgrade cost - current):**

- CPU: Any modern CPU (4+ cores)
- RAM: 8-16GB
- Storage: Any SSD
- Model: `qwen2.5:7b-instruct`
- Expected: 3-5s queries, adequate quality

**Recommended Setup ($300-500):**

- CPU: Intel i5/Ryzen 5+ (8+ cores)
- RAM: 16GB
- Storage: SSD (500GB+)
- GPU: RTX 3060 (12GB VRAM)
- Model: `qwen2.5:14b-instruct`
- Expected: 1-2s queries, excellent quality

**Professional Setup ($1500-2000):**

- CPU: Intel i7/Ryzen 7+ (12+ cores)
- RAM: 32GB
- Storage: NVMe SSD (1TB+)
- GPU: RTX 4090 (24GB VRAM, the minimum listed above for 32B models)
- Model: `qwen2.5:32b`
- Expected: <1s queries, GPT-4 quality

**💡 Key Insight:** Even budget CPU-only setups work fine! A GPU mainly improves speed, not quality. The open-source approach keeps software costs at $0 regardless of hardware.
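
The RAM guidance in this section can be condensed into a small lookup. The sketch below is only an illustration: `recommend_model` is a hypothetical helper, the thresholds are copied from the LLM table earlier in this README, and the caller is assumed to supply the installed RAM in GB (nothing here detects it).

```python
# (minimum RAM in GB, Ollama model tag), copied from the LLM table in this README.
MODEL_BY_RAM = [
    (32, "qwen2.5:32b"),
    (16, "qwen2.5:14b-instruct"),
    (8, "qwen2.5:7b-instruct"),
    (4, "qwen2.5:3b-instruct"),
]


def recommend_model(ram_gb: float) -> str:
    """Return the largest model whose listed RAM requirement fits ram_gb."""
    for min_ram, tag in MODEL_BY_RAM:
        if ram_gb >= min_ram:
            return tag
    raise ValueError(f"{ram_gb} GB is below the 4 GB listed for the smallest model")


if __name__ == "__main__":
    for ram in (8, 16, 32):
        print(f"{ram} GB RAM -> OLLAMA_MODEL={recommend_model(ram)}")
```

The returned tag is the value you would put in `OLLAMA_MODEL` in `.env`. The table lists minimums, so on a machine that also runs other heavy applications it may be worth stepping down one size.
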
---

## 📁 Project Structure

```
LLMQwen/
├── docs/                 # Your documents go here
├── storage/
│   └── chroma/           # Vector database storage
├── rag/
│   ├── ingest.py         # Document ingestion script
│   └── query.py          # Query execution script
├── frontend/
│   ├── app.py            # Web server (backend)
│   ├── index.html        # Web interface
│   ├── script.js         # Frontend logic
│   └── style.css         # UI styling
├── .env                  # Your configuration
├── .env.example          # Configuration template
├── requirements.txt      # Python dependencies
└── README.md             # This file
```

---

## 🐛 Troubleshooting

### "Collection expecting embedding with dimension of X, got Y"

**Cause:** Embedding model mismatch between ingestion and query.

**Solution:**

```powershell
# Clear vector store
Remove-Item -Path "storage\chroma" -Recurse -Force

# Re-ingest with correct model
python rag\ingest.py
```

### "the input length exceeds the context length"

**Cause:** Chunks too large for embedding model.

**Solution:** Reduce `CHUNK_SIZE` in `.env`:

```env
CHUNK_SIZE=600
CHUNK_OVERLAP=120
```

Then re-ingest.

### Ollama Connection Error

**Cause:** Ollama service not running.

**Solution:**

```powershell
ollama serve
```

### Out of Memory During Query

**Cause:** Too many chunks for available RAM.

**Solution:** Reduce `RETRIEVAL_CHUNKS`:

```env
RETRIEVAL_CHUNKS=20
TOP_N_RERANK=5
```

---

## 📈 Performance Benchmarks

Based on 458 documents (~30,000 chunks):

| Configuration | Ingestion Time | Query Time | Accuracy* |
|---------------|----------------|------------|-----------|
| nomic + 7B | 5 min | 3-5s | ⭐⭐⭐⭐ |
| mxbai + 7B | 8 min | 3-5s | ⭐⭐⭐⭐⭐ |
| mxbai + 14B | 8 min | 8-12s | ⭐⭐⭐⭐⭐ |
| bge-large + 14B | 12 min | 8-12s | ⭐⭐⭐⭐⭐ |

*Accuracy for technical documentation queries

---

## 🤝 Contributing

Contributions welcome!
Key areas:

- Additional document loaders
- Embedding model benchmarks
- Prompt engineering improvements
- UI enhancements

---

## 📄 License

MIT License - see LICENSE file for details

---

## 🙏 Acknowledgments

Built with:

- [LangChain](https://github.com/langchain-ai/langchain) - RAG framework
- [Chroma](https://github.com/chroma-core/chroma) - Vector database
- [Ollama](https://ollama.ai/) - Local LLM runtime
- [Flashrank](https://github.com/PrithivirajDamodaran/FlashRank) - Fast reranking
- [Qwen](https://github.com/QwenLM/Qwen) - Language models