# RAG System with Qwen A Retrieval-Augmented Generation (RAG) system that lets you query your documents using Qwen models from HuggingFace Transformers locally. --- ## Table of Contents - [Installation and Setup](#installation-and-setup) - [Linux Prerequisites](#linux-prerequisites) - [Step 1: Setup Python Environment](#step-1-setup-python-environment) - [Step 2: Install Pandoc (Optional)](#step-2-install-pandoc-optional) - [Step 3: Configure Environment](#step-3-configure-environment) - [Step 4: Add Your Documents](#step-4-add-your-documents) - [Step 5: Ingest Documents](#step-5-ingest-documents) - [Step 6: Start the Frontend](#step-6-start-the-frontend) - [Command-Line Query (Optional)](#command-line-query-optional) - [Interactive Mode (Recommended)](#interactive-mode-recommended) - [Single Query Mode](#single-query-mode) - [Performance Notes](#performance-notes) - [CPU vs GPU Mode](#cpu-vs-gpu-mode) - [Model Recommendations by Hardware](#model-recommendations-by-hardware) - [Model Upgrade Guide](#model-upgrade-guide) - [Current Setup (Fast/Testing)](#current-setup-fasttesting) - [Upgrading to Production Quality](#upgrading-to-production-quality) - [🏆 Recommended Production Configurations](#-recommended-production-configurations) - [⚡ Performance Impact Summary](#-performance-impact-summary) - [🌍 Multilingual Functionality Guide](#-multilingual-functionality-guide) - [How Each Component Affects Multilingual Support](#how-each-component-affects-multilingual-support) - [Current System Multilingual Capability](#current-system-multilingual-capability) - [Upgrading to Full Multilingual Support](#upgrading-to-full-multilingual-support) - [Testing Multilingual Functionality](#testing-multilingual-functionality) - [Chunking Configuration Guide](#chunking-configuration-guide) - [What is Chunking?](#what-is-chunking) - [Current Default Settings](#current-default-settings) - [How Chunk Size Affects Quality](#how-chunk-size-affects-quality) - [Why Overlap Matters](#why-overlap-matters) - [Recommended Settings by Document Type](#recommended-settings-by-document-type) - [How to Adjust Chunking](#how-to-adjust-chunking) - [Chunk Size Impact on Your System](#chunk-size-impact-on-your-system) --- ## Installation and Setup ### Linux Prerequisites **For Ubuntu/Debian-based distributions:** ```bash # Update package list sudo apt update # Install Python 3.10+ and pip sudo apt install python3 python3-pip python3-venv # Install development tools (required for some Python packages) sudo apt install build-essential python3-dev ``` **For Fedora/RHEL/CentOS:** ```bash # Install Python 3.10+ and pip sudo dnf install python3 python3-pip # Install development tools sudo dnf groupinstall "Development Tools" sudo dnf install python3-devel ``` **For Arch Linux:** ```bash # Install Python and pip sudo pacman -S python python-pip # Install base development tools sudo pacman -S base-devel ``` **Verify Python installation:** ```bash python3 --version # Should be 3.10 or higher pip3 --version ``` ### Step 1: Setup Python Environment **Windows:** ```powershell python -m venv .venv .\.venv\Scripts\Activate.ps1 ``` **Note:** If you get an execution policy error: ```powershell Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser ``` **macOS/Linux:** ```bash python3 -m venv .venv source .venv/bin/activate ``` **Install dependencies (all platforms):** ```bash pip install -r requirements.txt ``` **Note:** The first time you run a query, the Qwen model (~28GB for 14B model) will download automatically to `~/.cache/huggingface/`. This may take some time depending on your internet connection. ### Step 2: Install Pandoc (Optional) Only needed if you have OpenDocument (.odt) files: **Windows:** ```powershell winget install --id JohnMacFarlane.Pandoc -e ``` **macOS:** ```bash brew install pandoc ``` **Linux:** ```bash # Ubuntu/Debian sudo apt update sudo apt install pandoc # Fedora/RHEL/CentOS sudo dnf install pandoc # Arch Linux sudo pacman -S pandoc ``` ### Step 3: Configure Environment **Windows:** ```powershell Copy-Item .env.example .env ``` **macOS/Linux:** ```bash cp .env.example .env ``` Edit `.env` with your settings: ```env # Model Configuration LLM_PROVIDER=transformers TRANSFORMERS_MODEL=Qwen/Qwen2.5-14B-Instruct MAX_NEW_TOKENS=4096 TEMPERATURE=0 LLM_SEED=42 QUANTIZATION=4bit # Device Configuration (auto, cuda, or cpu) LLM_DEVICE=auto EMBEDDING_DEVICE=cuda RERANKER_DEVICE=cuda # Embedding Configuration EMBEDDING_PROVIDER=huggingface EMBEDDING_MODEL=intfloat/multilingual-e5-large-instruct # Retrieval Settings RETRIEVAL_CHUNKS=100 TOP_N_RERANK=8 USE_RERANKING=true # Document Processing CHUNK_SIZE=800 CHUNK_OVERLAP=100 ``` **Note:** If using a lower-spec computer, change to `TRANSFORMERS_MODEL=Qwen/Qwen2.5-7B-Instruct` for faster performance. If you don't have a GPU, set all device settings to `cpu`. ### Step 4: Add Your Documents Place your documents (Word, PDF, PowerPoint, Text, Markdown, etc.) in the `docs/` folder. ### Step 5: Ingest Documents Run the ingestion script to process your documents: **Windows:** ```powershell python rag\ingest.py ``` **macOS/Linux:** ```bash python rag/ingest.py ``` This will: - Extract ZIP files automatically - Load and process all documents - Generate embeddings - Store vectors in the database ### Step 6: Start the Frontend Start the web interface: **Windows:** ```powershell python frontend\app.py ``` **macOS/Linux:** ```bash python frontend/app.py ``` The server will start at: **http://127.0.0.1:8000** Open this URL in your browser to start querying your documents! --- ## Command-Line Query (Optional) ### Interactive Mode (Recommended) For multiple queries without reloading the model each time: **macOS/Linux:** ```bash python rag/query_interactive.py ``` **Windows:** ```powershell python rag\query_interactive.py ``` This loads the model **once** and keeps it in memory. You can then ask multiple questions without the 15-second checkpoint loading delay. **Example session:** ``` Query: What is V-PCC? [Answer streams in real-time...] Query: How does it compare to G-PCC? [Answer streams immediately - no reload!] Query: quit ``` ### Single Query Mode For one-off queries from the command line: **macOS/Linux:** ```bash python rag/query.py "Your question here" ``` **Windows:** ```powershell python rag\query.py "Your question here" ``` **Note:** This reloads the model each time (~15s startup) --- ## Performance Notes ### CPU vs GPU Mode The system can run on either CPU or GPU for optimal performance. You can configure which device each component uses in your `.env` file: ```env # Device configuration # Options: auto (auto-detect GPU), cuda (force GPU), cpu (force CPU) LLM_DEVICE=auto # Qwen language model EMBEDDING_DEVICE=cuda # Document/query embeddings RERANKER_DEVICE=cuda # Re-ranking model ``` **Device Options:** - `auto` - Automatically detects and uses GPU if available (recommended for LLM) - `cuda` - Forces GPU usage (fastest, requires NVIDIA GPU with CUDA) - `cpu` - Forces CPU usage (slower but works on any computer) **Performance Comparison (14B model):** - **GPU Mode (cuda):** Fast responses (1-2 seconds) - **CPU-Only Mode (cpu):** Slower responses (8-15 seconds) but works on any computer - **Auto Mode (auto):** Best of both worlds - uses GPU if available, falls back to CPU **Recommended Configurations:** *For systems with NVIDIA GPU:* ```env LLM_DEVICE=auto # Use GPU if available EMBEDDING_DEVICE=cuda # Embeddings are 10-50x faster on GPU RERANKER_DEVICE=cuda # Re-ranking is faster on GPU ``` *For CPU-only systems (no GPU):* ```env LLM_DEVICE=cpu EMBEDDING_DEVICE=cpu RERANKER_DEVICE=cpu ``` *For systems with limited GPU memory:* ```env LLM_DEVICE=cpu # Save GPU memory EMBEDDING_DEVICE=cuda # Embeddings use less memory RERANKER_DEVICE=cpu # Only when needed ``` **Note:** After changing device settings, restart the application for changes to take effect. Re-ingestion is not required unless you change `EMBEDDING_DEVICE` after already ingesting documents. ### Model Recommendations by Hardware | RAM Available | Recommended Model | CPU Query Time | Quality | |---------------|-------------------|----------------|----------| | 8GB | qwen2.5:7b-instruct | 3-5 seconds | Good | | 16GB+ | qwen2.5:14b-instruct | 8-15 seconds | Excellent | | 32GB+ | qwen2.5:32b | 30-60 seconds | Best | **Note:** These times are for CPU-only mode. GPU mode is 6-10x faster. --- ## Model Upgrade Guide ### Current Setup (Fast/Testing) Your system is currently configured for **speed and testing**: - **Embedding:** `sentence-transformers/all-MiniLM-L6-v2` (384-dim, very fast) - **Reranker:** `BAAI/bge-reranker-base` (good quality) - **LLM:** `qwen2.5:14b-instruct` (excellent balance) ### Upgrading to Production Quality #### 🚀 **Embedding Model Upgrades** **Current:** `sentence-transformers/all-MiniLM-L6-v2` (384-dim) - Speed: ⚡⚡⚡⚡⚡ Very Fast (5x faster than BGE-large) - Quality: ⭐⭐⭐ Good - Use case: Testing, prototyping, fast iterations **Option 1 - Balanced:** `BAAI/bge-base-en-v1.5` (768-dim) - Speed: ⚡⚡⚡⚡ Fast (2x faster than BGE-large) - Quality: ⭐⭐⭐⭐ Very Good - Use case: Production with good performance/quality balance - **Recommended for most users** **Option 2 - Best Quality:** `BAAI/bge-large-en-v1.5` (1024-dim) - Speed: ⚡⚡⚡ Moderate - Quality: ⭐⭐⭐⭐⭐ Excellent - Use case: Production where quality is critical - Trade-off: Slower ingestion (but queries remain fast) **Option 3 - Multilingual:** `BAAI/bge-m3` (1024-dim) - Speed: ⚡⚡⚡ Moderate - Quality: ⭐⭐⭐⭐⭐ Excellent - Use case: Multi-language documents (100+ languages) - Supports: Chinese, French, Spanish, German, etc. **To upgrade embedding model:** ```env # In .env file, change: EMBEDDING_MODEL=BAAI/bge-base-en-v1.5 # or bge-large-en-v1.5 ``` Then re-run: `python rag/ingest.py` #### 🎯 **Reranker Model Upgrades** **Current:** `BAAI/bge-reranker-base` (278M params) - Speed: ⚡⚡⚡⚡ Fast - Quality: ⭐⭐⭐⭐ Very Good - Already excellent for most use cases **Option 1 - Higher Quality:** `BAAI/bge-reranker-large` (560M params) - Speed: ⚡⚡⚡ Moderate - Quality: ⭐⭐⭐⭐⭐ Excellent - Use case: When answer quality is critical - Trade-off: 2x slower reranking (still fast overall) **Option 2 - Best Available:** `BAAI/bge-reranker-v2-m3` (568M params) - Speed: ⚡⚡⚡ Moderate - Quality: ⭐⭐⭐⭐⭐ State-of-the-art - Use case: Maximum accuracy, multilingual support - Supports: 100+ languages **To upgrade reranker:** ```env # In .env file, change: RERANKER_MODEL=BAAI/bge-reranker-large # or bge-reranker-v2-m3 ``` No re-ingestion needed, changes apply immediately! #### 🤖 **LLM Model Upgrades** **Current:** `qwen2.5:14b-instruct` (14B params, 8GB VRAM/16GB RAM) - Speed: ⚡⚡⚡⚡ Fast - Quality: ⭐⭐⭐⭐ Excellent - Already very good for most tasks **Option 1 - More Capable:** `qwen2.5:32b-instruct` (32B params, 20GB VRAM/32GB RAM) - Speed: ⚡⚡⚡ Moderate (2x slower) - Quality: ⭐⭐⭐⭐⭐ Outstanding - Use case: Complex reasoning, technical documents - Requirements: 32GB+ RAM recommended **Option 2 - Maximum Quality:** `Qwen/Qwen2.5-72B-Instruct` (72B params, 48GB VRAM/64GB RAM) - Speed: ⚡⚡ Slow (5x slower) - Quality: ⭐⭐⭐⭐⭐ Best available - Use case: Research, critical analysis, highest accuracy - Requirements: 64GB+ RAM, powerful hardware **Option 3 - Faster Lightweight:** `Qwen/Qwen2.5-7B-Instruct` (7B params, 4GB VRAM/8GB RAM) - Speed: ⚡⚡⚡⚡⚡ Very Fast (2x faster) - Quality: ⭐⭐⭐ Good - Use case: Low-end hardware, quick responses **To upgrade LLM:** ```env # Update .env TRANSFORMERS_MODEL=Qwen/Qwen2.5-32B-Instruct ``` The new model will download automatically on first use. No re-ingestion needed! ### 🏆 **Recommended Production Configurations** #### **Configuration 1: Balanced (Recommended)** ```env EMBEDDING_MODEL=BAAI/bge-base-en-v1.5 RERANKER_MODEL=BAAI/bge-reranker-base TRANSFORMERS_MODEL=Qwen/Qwen2.5-14B-Instruct ``` - **Speed:** Fast - **Quality:** Very Good - **Hardware:** 16GB RAM minimum - **Best for:** Most production use cases #### **Configuration 2: Maximum Quality** ```env EMBEDDING_MODEL=BAAI/bge-large-en-v1.5 RERANKER_MODEL=BAAI/bge-reranker-v2-m3 TRANSFORMERS_MODEL=Qwen/Qwen2.5-32B-Instruct ``` - **Speed:** Moderate - **Quality:** Excellent - **Hardware:** 32GB RAM minimum - **Best for:** Critical applications, research #### **Configuration 3: Fast & Efficient (Current)** ```env EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 RERANKER_MODEL=BAAI/bge-reranker-base TRANSFORMERS_MODEL=Qwen/Qwen2.5-14B-Instruct ``` - **Speed:** Very Fast - **Quality:** Good - **Hardware:** 16GB RAM minimum - **Best for:** Testing, development, rapid iteration #### **Configuration 4: Multilingual** ```env EMBEDDING_MODEL=BAAI/bge-m3 RERANKER_MODEL=BAAI/bge-reranker-v2-m3 TRANSFORMERS_MODEL=Qwen/Qwen2.5-14B-Instruct ``` - **Speed:** Moderate - **Quality:** Excellent - **Hardware:** 16GB RAM minimum - **Best for:** Multi-language document collections ### ⚡ **Performance Impact Summary** | Component | Affects | Re-ingestion Required? | |-----------|---------|------------------------| | Embedding Model | Ingestion speed, retrieval quality | ✅ Yes | | Reranker Model | Query speed (minimal), answer quality | ❌ No | | LLM Model | Response generation speed/quality | ❌ No | **Note:** Upgrading embedding model requires re-running `python rag/ingest.py` to rebuild the vector database with new embeddings. --- ## 🌍 Multilingual Functionality Guide The chatbot **automatically responds in the language you use** to ask questions (English, French, Spanish, etc.). However, **each model component affects multilingual quality differently**: ### How Each Component Affects Multilingual Support #### **1. Embedding Model - CRITICAL for Multilingual Retrieval** 🔴 **Impact:** Determines if your question in ANY language can find relevant documents **Current Model:** `sentence-transformers/all-MiniLM-L6-v2` - ⚠️ **English-only optimized** - Non-English queries will retrieve less relevant documents - Works for English, poor for French/Spanish/other languages **Recommended for Multilingual:** ```env EMBEDDING_MODEL=BAAI/bge-m3 # or EMBEDDING_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 ``` **Why it matters:** - French question → English-focused embeddings → retrieves wrong documents → LLM gets irrelevant context → poor answer **even if LLM speaks French** - Multilingual embeddings → retrieves correct documents in any language → LLM gets relevant context → excellent answer **⚠️ Requires re-ingestion:** YES - `python rag/ingest.py` --- #### **2. Reranker Model - Important for Multilingual Precision** 🟡 **Impact:** Refines which documents are most relevant to your question **Current Model:** `BAAI/bge-reranker-base` - ⚠️ **English-focused** - Can rerank, but less accurate for non-English queries **Recommended for Multilingual:** ```env RERANKER_MODEL=BAAI/bge-reranker-v2-m3 ``` **Why it matters:** - Even if embeddings retrieve 10 good multilingual documents, English-only reranker might rank them poorly - Multilingual reranker correctly identifies the most relevant chunks in any language **⚠️ Requires re-ingestion:** NO - just update `.env` and restart --- #### **3. LLM (Text Generation Model) - Determines Answer Language** 🟢 **Impact:** Generates the actual response in the target language **Current Model:** `qwen2.5:14b-instruct` - ✅ **Excellent multilingual support** (100+ languages) - Strong in: English, Chinese, French, Spanish, German, Japanese, Korean, Arabic, and more - The prompt automatically instructs it to respond in the question's language **Alternative Multilingual LLMs:** ```env # In .env file TRANSFORMERS_MODEL=Qwen/Qwen2.5-14B-Instruct # Excellent for 100+ languages TRANSFORMERS_MODEL=Qwen/Qwen2.5-32B-Instruct # Best multilingual quality # Other alternatives: # TRANSFORMERS_MODEL=meta-llama/Llama-3.1-8B-Instruct # Good for European languages ``` **Why it matters:** - Even with perfect retrieval, if LLM doesn't support the language, answers will be poor or in wrong language - Qwen models are already excellent for multilingual - upgrading mainly improves reasoning depth --- ### Current System Multilingual Capability | Component | Current Model | Multilingual? | Impact on Non-English | |-----------|---------------|---------------|------------------------| | **Embedding** | all-MiniLM-L6-v2 | ❌ English-only | 🔴 **Poor retrieval** for non-English questions | | **Reranker** | bge-reranker-base | ⚠️ English-focused | 🟡 **Suboptimal ranking** for non-English | | **LLM** | Qwen2.5-14B-Instruct | ✅ Excellent | ✅ **Perfect responses** in any language | **Result:** The LLM **CAN respond** in French/Spanish/etc., but will work with **lower-quality context** retrieved by English-only embeddings. --- ### Upgrading to Full Multilingual Support **Recommended Configuration:** ```env # In .env file EMBEDDING_MODEL=BAAI/bge-m3 RERANKER_MODEL=BAAI/bge-reranker-v2-m3 TRANSFORMERS_MODEL=Qwen/Qwen2.5-14B-Instruct ``` **Steps:** 1. Update `.env` with multilingual models 2. Re-ingest documents: `python rag/ingest.py` (required for embedding change) 3. Restart frontend/queries **Benefits:** - ✅ Excellent retrieval for questions in **any language** - ✅ Accurate reranking regardless of language - ✅ High-quality answers in **100+ languages** **Trade-offs:** - Slightly slower (BGE-m3 is ~2x slower than all-MiniLM-L6-v2) - Larger model downloads (~3GB vs 90MB) --- ### Testing Multilingual Functionality ```powershell # English python rag/query.py "What are the latest V-PCC compression results?" # French python rag/query.py "Quels sont les derniers résultats de compression V-PCC ?" # Spanish python rag/query.py "¿Cuáles son los últimos resultados de compresión V-PCC?" ``` **Expected behavior:** - ✅ LLM responds in the correct language (works with current setup) - ⚠️ Answer quality may be lower for non-English with current English-only embeddings - ✅ Full quality in all languages after upgrading to multilingual embeddings --- ## Chunking Configuration Guide ### What is Chunking? Chunking splits large documents into smaller pieces for better retrieval and processing. **Chunk settings significantly impact answer quality!** ### Current Default Settings: ```env CHUNK_SIZE=800 # ~150-200 words, 2-3 paragraphs CHUNK_OVERLAP=100 # 12.5% overlap between chunks ``` ### How Chunk Size Affects Quality: | Chunk Size | Best For | Pros | Cons | |------------|----------|------|------| | **300-600** | FAQs, snippets, Q&A | Precise retrieval, fast | May fragment ideas | | **800-1000** | General technical docs | Balanced context/precision | Good all-around | | **1200-1500** | Dense specs, standards | Complete explanations | Slower retrieval | | **1500-2000** | Research papers, articles | Preserves narrative | May dilute relevance | ### Why Overlap Matters: **Without overlap (0):** ``` Chunk 1: "...the solution requires three steps. First," Chunk 2: "Second, process the data. Third, validate..." ``` ❌ Retrieving Chunk 2 misses "First" step! **With overlap (10-20%):** ``` Chunk 1: "...the solution requires three steps. First, initialize." Chunk 2: "...three steps. First, initialize. Second, process..." ``` ✅ Important information appears in multiple chunks! ### Recommended Settings by Document Type: #### **Dense Technical Specifications** (MPEG, ISO, IEEE standards) ```env CHUNK_SIZE=1200 CHUNK_OVERLAP=200 ``` - **Why:** Technical specs need complete multi-paragraph explanations - **Example:** Algorithm descriptions, performance tables, conformance requirements - **Impact:** Better context for complex technical questions #### **Short FAQs / Knowledge Base** ```env CHUNK_SIZE=500 CHUNK_OVERLAP=75 ``` - **Why:** Quick, focused answers without excess context - **Example:** Troubleshooting guides, quick reference docs - **Impact:** Faster, more precise retrieval #### **Long-Form Articles / Research Papers** ```env CHUNK_SIZE=1500 CHUNK_OVERLAP=300 ``` - **Why:** Preserves argument flow and narrative structure - **Example:** White papers, academic articles, detailed reports - **Impact:** Maintains logical connections between ideas #### **Mixed Document Collection** (Recommended) ```env CHUNK_SIZE=1000 CHUNK_OVERLAP=150 ``` - **Why:** Good balance for varied content types - **Example:** Mix of specs, guides, and reports - **Impact:** Versatile performance across document types ### How to Adjust Chunking: 1. **Edit `.env` file:** ```env CHUNK_SIZE=1200 CHUNK_OVERLAP=200 ``` 2. **Re-ingest your documents:** ```powershell python rag/ingest.py ``` 3. **Test with same questions** to compare quality ### Chunk Size Impact on Your System: | Setting | Total Chunks | Retrieval Speed | Context Quality | |---------|--------------|-----------------|------------------| | 500/75 | ~45,000 | Fastest | Fragmented | | 800/100 | ~29,000 | Fast | Good | | 1000/150 | ~23,000 | Medium | Better | | 1500/300 | ~15,000 | Slower | Most Complete | **Rule of Thumb:** Overlap should be 10-20% of chunk size for optimal results. ---