# Qwen RAG (CPU-friendly) - Multi-Model Support

This is a flexible local/cloud RAG setup supporting multiple LLM providers:

- **Ollama** for local models (Qwen, Mistral, Llama) - FREE & private
- **Mistral AI** cloud API - High quality, European
- **OpenAI** cloud API - GPT-4o, GPT-4o-mini
- LangChain + Chroma for retrieval
- FastEmbed embeddings (CPU-friendly, no PyTorch required)
- Nested ZIP extraction for complex document archives
- Smart caching and parallel processing

Works on Windows without GPU. Switch between providers easily via `.env` configuration.

## Prerequisites

- Python 3.9+
- Ollama installed and running (local server at `http://localhost:11434`)

### Install Ollama on Windows

If you're not sure Ollama is installed:

1) Install via Winget (requires admin approval on first use):

```powershell
winget install Ollama.Ollama -e
```

2) Start the Ollama daemon (it usually runs as a Windows service):

```powershell
ollama --version
ollama serve
```

Leave it running in a terminal, or rely on the service.

### Pull a small Qwen model for CPU

For better CPU performance, start with a smaller instruct model:

```powershell
ollama pull qwen2.5:3b-instruct
```

You can switch to larger models later (e.g., `qwen2.5:7b-instruct` or `qwen3:8b`) once you have a GPU.
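
To confirm that the Ollama server is reachable and that the model you pulled is present, a short Python check can query the server's `/api/tags` endpoint, which lists locally available models. This is a minimal sketch and not part of this repo: the helper names `fetch_local_models` and `model_is_available` are hypothetical, and it assumes the default base URL shown above.

```python
import json
import urllib.request


def fetch_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return the model names the local Ollama server reports via /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


def model_is_available(wanted: str, local_models: list[str]) -> bool:
    """True if `wanted` exactly matches one of the local model names."""
    return wanted in local_models


if __name__ == "__main__":
    try:
        models = fetch_local_models()
    except OSError as exc:  # URLError subclasses OSError (refused connection, timeout)
        print(f"Ollama does not appear to be running: {exc}")
    else:
        wanted = "qwen2.5:3b-instruct"
        found = model_is_available(wanted, models)
        print(f"{wanted}: {'found' if found else 'missing, pull it first'}")
```

If the script reports the model as missing, run the `ollama pull` command above and try again.
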
### Install Pandoc (required for ODT files)

If you plan to use `.odt` files, install Pandoc:

```powershell
winget install --id JohnMacFarlane.Pandoc -e --accept-source-agreements --accept-package-agreements
```

## Setup Python environment

From the repo root (`qwen/` folder):

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

**Note:** If you get an execution policy error when activating the venv, run:

```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

## Configure

Copy `.env.example` to `.env` and configure your settings:

```powershell
Copy-Item .env.example .env
```

### Key Configuration Options:

- `MODEL_PROVIDER` – Choose: `ollama`, `mistral`, or `openai`
- `OLLAMA_MODEL` – default is `qwen2.5:7b-instruct`; set it to `qwen2.5:3b-instruct` if you pulled only the smaller model
- `MISTRAL_API_KEY` / `OPENAI_API_KEY` – For cloud providers
- `DOCS_DIR` – folder with your documents (default: `docs`)
- `CHROMA_DIR` – vector DB storage (default: `storage/chroma`)
- `RETRIEVAL_CHUNKS` – initial chunks to retrieve (default: 100)
- `TOP_N_RERANK` – final chunks sent to LLM (default: 8)
- `USE_RERANKING` – enable for better accuracy (default: true)

---

## 🔍 How the RAG Pipeline Works

Understanding the retrieval and reranking process:

### The 3-Step Process

```
📚 Your Database (3,000+ chunks)
    ↓
STEP 1: Semantic Search (RETRIEVAL_CHUNKS)
    ↓
Top 100 similar chunks
    ↓
STEP 2: Reranking (TOP_N_RERANK)
    ↓
Best 8 relevant chunks
    ↓
STEP 3: LLM generates answer
```

### Detailed Explanation

**Step 1: Semantic Search (`RETRIEVAL_CHUNKS`)**

- Searches **ALL documents** in your database
- Compares your question's embedding to every chunk's embedding
- Returns the most **similar** chunks
- Example: Top 100 most similar chunks from 3,000+ total
- ⚡ Fast - uses vector similarity

**Step 2: Reranking (`TOP_N_RERANK`)**

- Takes the chunks from Step 1
- Uses the Flashrank model to re-score them more accurately
- Keeps only the **best** chunks
- Example: Best 8 out of 100
- ⚠️ **RAM Usage:** ~12.5 MB per chunk being reranked
- 🎯 More accurate than semantic search alone

**Step 3: LLM Processing**

- LLM receives only `TOP_N_RERANK` chunks
- Generates the answer based on those chunks
- The model must be able to handle that much context without being overwhelmed

### Configuration Guidelines

**RETRIEVAL_CHUNKS** (Cast a wide net):

- Searches across all documents, returns top N most similar
- **Recommended:** 100-200 for good coverage
- **Max by RAM:**
  - 8GB RAM → 300 chunks max
  - 16GB RAM → 700 chunks max
  - 32GB RAM → 1,500 chunks max
- ⚠️ Higher values = more RAM used in the reranking step

**TOP_N_RERANK** (Final chunks to LLM):

- **qwen2.5:7b** → 6-8 chunks (optimal; the model is overwhelmed beyond this)
- **qwen2.5:14b** → 12-15 chunks
- **qwen2.5:32b** → 25-30 chunks
- ⚠️ **More chunks ≠ better answers** with smaller models!

### Why This Design?

**You cannot send all documents to the LLM:**

- 1,000+ documents = millions of tokens
- LLMs have context limits (8k-128k tokens)
- Would be extremely slow and expensive

**RAG solution:**

- Semantic search already "sees" all documents
- Retrieves the most relevant subset
- Reranking filters to the highest-quality chunks
- LLM gets focused, relevant context

---

## 🌍 Why No Multilingual Support?
This project **does not include automatic multilingual support** for a good reason:

### ⚠️ Small Models Perform Poorly in Non-English Languages

**Current Model (`qwen2.5:7b-instruct`):**

- Trained predominantly on English data
- **Significantly worse quality** in other languages
- Non-English responses are often less detailed, less accurate, and miss important nuances
- Translation overhead reduces reasoning capacity

**Why Small Models Struggle:**

- Most training data is English (70-90% of training corpus)
- 7B parameters aren't enough for strong multilingual capabilities
- Model spends cognitive capacity on translation instead of reasoning

### 🚀 If You Need Multilingual Support

**Option 1: Use larger Qwen models (14B+)**

```powershell
ollama pull qwen2.5:14b-instruct  # Better multilingual
ollama pull qwen2.5:32b-instruct  # Best multilingual
```

**Option 2: Use specialized multilingual models**

```powershell
# Aya 23 is published in the Ollama library under the name "aya"
ollama pull aya:8b   # Optimized for 23 languages
ollama pull aya:35b  # Best multilingual performance
```

**Recommendation:** For production use with multiple languages, upgrade to 14B+ models or use Aya. Otherwise, **ask questions in English for best results**.

---

## Multi-Model Provider Setup

Your RAG system supports multiple LLM providers. Choose based on your needs:

### 🚀 Quick Start

Edit your `.env` file and set `MODEL_PROVIDER`:

```env
# Set exactly one of the following
MODEL_PROVIDER=ollama   # Local (free, private)
MODEL_PROVIDER=mistral  # Cloud API (paid)
MODEL_PROVIDER=openai   # Cloud API (paid)
```

### Option 1: Ollama (Local - FREE) 🏠

- Supports `.md`, `.txt`, `.docx`, `.pptx`, `.pdf`, `.odt` files and nested ZIP archives
- For `.odt` files, Pandoc must be installed (see Prerequisites above)
- FastEmbed uses ONNX under the hood and is lightweight for CPU
- Smart caching skips re-ingestion if documents haven't changed
- Parallel processing speeds up document loading
- Streaming responses provide immediate feedback
- Switch between Ollama/Mistral/OpenAI without code changes

**Setup:** Already configured!
Just pull different models:

```powershell
# Fast & free models
ollama pull qwen2.5:3b-instruct        # Small, fast
ollama pull qwen2.5:7b-instruct        # Balanced
ollama pull mistral:7b-instruct-v0.3   # Alternative

# Larger models (need good CPU/GPU)
ollama pull qwen2.5:14b-instruct
ollama pull mixtral:8x7b
```

**Configuration:**

```env
MODEL_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:7b-instruct
OLLAMA_BASE_URL=http://localhost:11434
```

**No API key needed!**

### Option 2: Mistral AI (Cloud API) ☁️

**Best for:** High quality, faster than local large models, European company

**Setup:**

1. Get API key: https://console.mistral.ai/
2. Install package: `pip install langchain-mistralai`

**Configuration:**

```env
MODEL_PROVIDER=mistral
MISTRAL_API_KEY=your_actual_api_key_here
MISTRAL_MODEL=mistral-large-latest
```

**Model Options:**

- `mistral-large-latest` - Most capable (expensive)
- `mistral-medium-latest` - Balanced
- `mistral-small-latest` - Fast & cheap

**Pricing:** ~$2-8 per 1M tokens

### Option 3: OpenAI (Cloud API) 🤖

**Best for:** Highest quality (GPT-4), well-tested, most features

**Setup:**

1. Get API key: https://platform.openai.com/api-keys
2. Install package: `pip install langchain-openai`

**Configuration:**

```env
MODEL_PROVIDER=openai
OPENAI_API_KEY=your_actual_api_key_here
OPENAI_MODEL=gpt-4o-mini
```

**Model Options:**

- `gpt-4o` - Most capable (expensive)
- `gpt-4o-mini` - Great balance (recommended)
- `gpt-3.5-turbo` - Fast & cheap

**Pricing:** ~$0.15-15 per 1M tokens

### Provider Comparison

| Feature | Ollama | Mistral AI | OpenAI |
|---------|--------|------------|--------|
| **Cost** | Free | ~$2-8/1M tokens | ~$0.15-15/1M tokens |
| **Privacy** | ✅ 100% local | ❌ Cloud | ❌ Cloud |
| **Response time (small model)** | ~15s | ~3-5s | ~3-5s |
| **Response time (large model)** | ~30-60s | ~5-10s | ~5-10s |
| **Quality (small model)** | Good | Excellent | Excellent |
| **Quality (large model)** | Very Good | Excellent | Outstanding |
| **Setup** | Easy | API key | API key |
| **Internet required** | No | Yes | Yes |

### Recommendations

**For Development/Testing:** ✅ Ollama (free, private, no limits)

**For Production:**

- ✅ Mistral AI for good quality + reasonable cost
- ✅ OpenAI GPT-4o-mini for best balance
- ✅ OpenAI GPT-4o for highest quality

**For Maximum Privacy:** ✅ Ollama only (everything local)

### Switching Between Providers

No code changes needed!
Just edit `.env`:

```powershell
# Try different providers
python .\rag\query.py "test question"

# Check active provider
Get-Content .env | Select-String "MODEL_PROVIDER"
```

---

## Ingest documents

Put `.md`, `.txt`, `.docx`, `.pptx`, `.pdf`, `.odt` files or **ZIP archives** (including nested ZIPs) in the `docs/` folder:

```powershell
python .\rag\ingest.py
```

This will:

- Extract nested ZIP files automatically
- Load all supported document types
- Build a Chroma vector store under `storage/chroma`
- Cache results to skip re-ingestion if files are unchanged
- Use parallel processing for faster PDF loading

**Supported formats:** PDF, Word (.docx), PowerPoint (.pptx), Markdown (.md), Text (.txt), ODT (.odt)

## Ask questions (RAG)

### Option 1: Command Line Interface (CLI)

```powershell
python .\rag\query.py "What does this project do?"
```

The script:

- Retrieves relevant chunks from your documents
- Uses streaming responses (answer appears immediately)
- Shows query completion time
- Cites sources from your documents
- Automatically detects language and responds accordingly

### Option 2: Web Interface (Graphical)

```powershell
python .\frontend\app.py
```

Then open your browser to: **http://localhost:8000**

The web interface provides:

- Clean, user-friendly chat interface
- Real-time streaming responses
- Source citations with document links
- Language auto-detection (ask in any language)
- Provider and model information displayed
- No terminal needed - just type and ask!

**To stop the server:** Press `Ctrl+C` in the terminal

## Upgrading to Qwen 8B later

When you have a GPU, pull and use a larger model:

```powershell
ollama pull qwen3:8b
# Then set this line in .env (it is not a PowerShell command):
#   OLLAMA_MODEL=qwen3:8b
```

## Notes

- Supports `.md`, `.txt`, `.docx`, `.pptx`, `.pdf`, and `.odt` files
- For `.odt` files, Pandoc must be installed (see Prerequisites above)
- FastEmbed uses ONNX under the hood and is lightweight for CPU
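
The retrieval settings described under Configure and Configuration Guidelines can be sanity-checked before a run. The sketch below is illustrative only and is not code from this repo: `RagSettings`, `load_rag_settings`, and `estimate_rerank_ram_mb` are hypothetical names, the defaults are the ones documented in this README, and the RAM figure uses the rough ~12.5 MB-per-chunk estimate quoted for the reranking step.

```python
import os
from dataclasses import dataclass

# Rough per-chunk figure quoted in this README; an estimate, not a measurement.
MB_PER_RERANKED_CHUNK = 12.5


@dataclass(frozen=True)
class RagSettings:
    retrieval_chunks: int
    top_n_rerank: int
    use_reranking: bool


def load_rag_settings(env=None) -> RagSettings:
    """Read the documented settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    settings = RagSettings(
        retrieval_chunks=int(env.get("RETRIEVAL_CHUNKS", "100")),
        top_n_rerank=int(env.get("TOP_N_RERANK", "8")),
        use_reranking=env.get("USE_RERANKING", "true").strip().lower() == "true",
    )
    # The reranker can only keep chunks that Step 1 actually retrieved.
    if settings.top_n_rerank > settings.retrieval_chunks:
        raise ValueError("TOP_N_RERANK cannot exceed RETRIEVAL_CHUNKS")
    return settings


def estimate_rerank_ram_mb(settings: RagSettings) -> float:
    """Approximate extra RAM used while reranking the retrieved chunks."""
    if not settings.use_reranking:
        return 0.0
    return settings.retrieval_chunks * MB_PER_RERANKED_CHUNK


if __name__ == "__main__":
    current = load_rag_settings()
    print(current, f"~{estimate_rerank_ram_mb(current):.0f} MB for reranking")
```

With the documented defaults this works out to 100 × 12.5 MB = 1,250 MB, and the suggested 8GB ceiling of 300 chunks to 300 × 12.5 MB = 3,750 MB.
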