# Qwen RAG (CPU-friendly) - Multi-Model Support

A flexible local/cloud RAG setup with multiple LLM providers and a CPU-friendly retrieval stack:
- **Ollama** for local models (Qwen, Mistral, Llama) - FREE & private
- **Mistral AI** cloud API - High quality, European
- **OpenAI** cloud API - GPT-4o, GPT-4o-mini
- LangChain + Chroma for retrieval
- FastEmbed embeddings (CPU-friendly, no PyTorch required)
- Nested ZIP extraction for complex document archives
- Smart caching and parallel processing

Works on Windows without a GPU. Switch between providers via the `.env` configuration.

## Prerequisites

- Python 3.9+
- Ollama installed and running (local server at `http://localhost:11434`)

### Install Ollama on Windows
If you're not sure Ollama is installed:
1) Install via Winget (requires admin approval on first use):
```powershell
winget install Ollama.Ollama -e
```
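2) Verify the install and, if needed, start the Ollama server (on Windows it usually already runs in the background after installation):
```powershell
ollama --version
ollama serve   # only needed if the server is not already running
```
If you start it manually, leave it running in a terminal; otherwise rely on the background process.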

### Pull a small Qwen model for CPU
For better CPU performance, start with a smaller instruct model:
```powershell
ollama pull qwen2.5:3b-instruct
```
You can switch to larger models later (e.g., `qwen2.5:7b-instruct` or `qwen3:8b`) once you have a GPU.

### Install Pandoc (required for ODT files)
If you plan to use `.odt` files, install Pandoc:
```powershell
winget install --id JohnMacFarlane.Pandoc -e --accept-source-agreements --accept-package-agreements
```

## Setup Python environment
From the repo root (`qwen/` folder):
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

**Note:** If you get an execution policy error when activating the venv, run:
```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

## Configure
Copy `.env.example` to `.env` and configure your settings:

```powershell
Copy-Item .env.example .env
```

### Key Configuration Options:
- `MODEL_PROVIDER` – Choose: `ollama`, `mistral`, or `openai`
- `OLLAMA_MODEL` – default is `qwen2.5:7b-instruct`; set it to `qwen2.5:3b-instruct` if you pulled the smaller model above
- `MISTRAL_API_KEY` / `OPENAI_API_KEY` – For cloud providers
- `DOCS_DIR` – folder with your documents (default: `docs`)
- `CHROMA_DIR` – vector DB storage (default: `storage/chroma`)
- `RETRIEVAL_CHUNKS` – initial chunks to retrieve (default: 100)
- `TOP_N_RERANK` – final chunks sent to LLM (default: 8)
- `USE_RERANKING` – enable for better accuracy (default: true)
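
As a rough illustration of how these options might be read at startup, here is a minimal sketch using only the standard library. The function name `load_settings` and the fallback of `MODEL_PROVIDER` to `ollama` are assumptions for this example; the other defaults are the ones listed above. The project's actual loader may differ.

```python
import os

# Defaults taken from the option list above.
DEFAULTS = {
    "MODEL_PROVIDER": "ollama",  # assumed fallback; the README does not state one
    "OLLAMA_MODEL": "qwen2.5:7b-instruct",
    "DOCS_DIR": "docs",
    "CHROMA_DIR": "storage/chroma",
    "RETRIEVAL_CHUNKS": "100",
    "TOP_N_RERANK": "8",
    "USE_RERANKING": "true",
}


def load_settings(env=None):
    """Return typed settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "provider": raw["MODEL_PROVIDER"].strip().lower(),
        "ollama_model": raw["OLLAMA_MODEL"],
        "docs_dir": raw["DOCS_DIR"],
        "chroma_dir": raw["CHROMA_DIR"],
        "retrieval_chunks": int(raw["RETRIEVAL_CHUNKS"]),
        "top_n_rerank": int(raw["TOP_N_RERANK"]),
        "use_reranking": raw["USE_RERANKING"].strip().lower() in ("1", "true", "yes"),
    }
```

Passing a plain dictionary instead of `os.environ` (for example `load_settings({"TOP_N_RERANK": "6"})`) makes the parsing easy to try without touching your real `.env`.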

---

## 🔍 How the RAG Pipeline Works

Understanding the retrieval and reranking process:

### The 3-Step Process

```
📚 Your Database (3,000+ chunks)
         ↓
    STEP 1: Semantic Search (RETRIEVAL_CHUNKS)
         ↓
    Top 100 similar chunks
         ↓
    STEP 2: Reranking (TOP_N_RERANK)
         ↓
    Best 8 relevant chunks
         ↓
    STEP 3: LLM generates answer
```

### Detailed Explanation

**Step 1: Semantic Search (`RETRIEVAL_CHUNKS`)**
- Searches **ALL documents** in your database
- Compares your question's embedding to every chunk's embedding
- Returns the most **similar** chunks
- Example: Top 100 most similar chunks from 3,000+ total
- ⚡ Fast - uses vector similarity

**Step 2: Reranking (`TOP_N_RERANK`)**
- Takes the chunks from Step 1
- Uses Flashrank model to re-score them more accurately
- Keeps only the **best** chunks
- Example: Best 8 out of 100
- ⚠️ **RAM Usage:** ~12.5 MB per chunk being reranked
- 🎯 More accurate than semantic search alone

**Step 3: LLM Processing**
- LLM receives only `TOP_N_RERANK` chunks
- Generates answer based on those chunks
- Model must handle context size without being overwhelmed
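
The three steps can be sketched in plain Python. This is a toy illustration, not the project's code: the two-dimensional vectors stand in for real FastEmbed embeddings, and `word_overlap` is a hypothetical stand-in for the Flashrank scoring model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, k):
    """Step 1: rank chunks by embedding similarity and keep the top k."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]


def word_overlap(query, text):
    """Toy scorer (assumption): fraction of query words found in the text."""
    words = set(query.lower().split())
    return len(words & set(text.lower().split())) / len(words) if words else 0.0


def rerank(query, candidates, top_n, score_fn=word_overlap):
    """Step 2: re-score the candidates with a more precise scorer, keep top_n."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return ranked[:top_n]


def build_prompt(query, chunks):
    """Step 3: only the reranked chunks are placed in the LLM prompt."""
    context = "\n\n".join(c["text"] for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


chunks = [
    {"text": "Ollama runs models locally", "vec": [1.0, 0.0]},
    {"text": "Chroma stores embeddings", "vec": [0.0, 1.0]},
    {"text": "Ollama serves a local API", "vec": [0.9, 0.1]},
]
query = "ollama local api"
candidates = retrieve([1.0, 0.0], chunks, k=2)   # plays the role of RETRIEVAL_CHUNKS
best = rerank(query, candidates, top_n=1)        # plays the role of TOP_N_RERANK
print(build_prompt(query, best))
```

In this toy data the similarity search ranks "Ollama runs models locally" first, while the reranker promotes "Ollama serves a local API" because it matches more of the question. That reordering is the reason for running a second, slower scoring pass on a small candidate set.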

### Configuration Guidelines

**RETRIEVAL_CHUNKS** (Cast a wide net):
- Searches across all documents, returns top N most similar
- **Recommended:** 100-200 for good coverage
- **Max by RAM:** 
  - 8GB RAM → 300 chunks max
  - 16GB RAM → 700 chunks max
  - 32GB RAM → 1,500 chunks max
- ⚠️ Higher values = more RAM used in reranking step
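
The RAM limits above follow from the per-chunk figure quoted in Step 2. A small sketch of the arithmetic, taking the ~12.5 MB per chunk estimate from this README at face value (it is not measured here, and `rerank_ram_mb` is a name made up for the example):

```python
MB_PER_RERANKED_CHUNK = 12.5  # estimate quoted in this README, not a measurement


def rerank_ram_mb(retrieval_chunks):
    """Approximate RAM used while reranking the retrieved chunks, in MB."""
    return retrieval_chunks * MB_PER_RERANKED_CHUNK


for total_gb, max_chunks in [(8, 300), (16, 700), (32, 1500)]:
    used_gb = rerank_ram_mb(max_chunks) / 1024
    print(f"{total_gb} GB RAM, {max_chunks} chunks: ~{used_gb:.1f} GB for reranking")
```

Under that estimate, each suggested maximum uses roughly half of total RAM for reranking, which leaves the remainder for the operating system, the vector store, and a local model.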

**TOP_N_RERANK** (Final chunks to LLM):
- **qwen2.5:7b** → 6-8 chunks (optimal, model overwhelmed beyond this)
- **qwen2.5:14b** → 12-15 chunks
- **qwen2.5:32b** → 25-30 chunks
- ⚠️ **More chunks ≠ better answers** with smaller models!

### Why This Design?

**You cannot send all documents to the LLM:**
- 1,000+ documents = millions of tokens
- LLMs have context limits (8k-128k tokens)
- Would be extremely slow and expensive

**RAG solution:**
- Semantic search already "sees" all documents
- Retrieves most relevant subset
- Reranking filters to highest quality
- LLM gets focused, relevant context

---

## 🌍 Why No Multilingual Support?

This project **does not include automatic multilingual support** for a good reason:

### ⚠️ Small Models Perform Poorly in Non-English Languages

**Current Model (`qwen2.5:7b-instruct`):**
- Trained predominantly on English and Chinese data
- **Significantly worse quality** in other languages
- Non-English responses are often less detailed, less accurate, and miss important nuances
- Translation overhead reduces reasoning capacity

**Why Small Models Struggle:**
- Most training data is English (70-90% of training corpus)
- 7B parameters aren't enough for strong multilingual capabilities
- Model spends cognitive capacity on translation instead of reasoning

### 🚀 If You Need Multilingual Support

**Option 1: Use larger Qwen models (14B+)**
```powershell
ollama pull qwen2.5:14b-instruct  # Better multilingual
ollama pull qwen2.5:32b-instruct  # Best multilingual
```

**Option 2: Use specialized multilingual models**
```powershell
ollama pull aya:8b    # Aya 23, optimized for 23 languages
ollama pull aya:35b   # Best multilingual performance
```

**Recommendation:** For production use with multiple languages, upgrade to 14B+ models or use Aya. Otherwise, **ask questions in English for best results**.

---

## Multi-Model Provider Setup

Your RAG system supports multiple LLM providers. Choose based on your needs:

### 🚀 Quick Start

Edit your `.env` file and set `MODEL_PROVIDER` to exactly one of these values (keep a single `MODEL_PROVIDER` line):

```env
MODEL_PROVIDER=ollama    # Local (free, private)
MODEL_PROVIDER=mistral   # Cloud API (paid)
MODEL_PROVIDER=openai    # Cloud API (paid)
```

### Option 1: Ollama (Local - FREE) 🏠

**Setup:** Already configured! Just pull different models:

```powershell
# Fast & free models
ollama pull qwen2.5:3b-instruct     # Small, fast
ollama pull qwen2.5:7b-instruct     # Balanced
ollama pull mistral:7b-instruct-v0.3 # Alternative

# Larger models (need good CPU/GPU)
ollama pull qwen2.5:14b-instruct
ollama pull mixtral:8x7b
```

**Configuration:**
```env
MODEL_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:7b-instruct
OLLAMA_BASE_URL=http://localhost:11434
```

**No API key needed!**
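
To confirm that the server at `OLLAMA_BASE_URL` is reachable and see which models are already pulled, a short standard-library check can help. This sketch assumes Ollama's `GET /api/tags` endpoint, which lists local models; the helper name `list_ollama_models` is made up for this example.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return the names of locally available models, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running".
        return None
    return [model.get("name") for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; start it with 'ollama serve'.")
else:
    print("Available models:", ", ".join(models) or "(none pulled yet)")
```

If the model named in `OLLAMA_MODEL` is missing from the list, pull it with `ollama pull` before running a query.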

### Option 2: Mistral AI (Cloud API) ☁️

**Best for:** High quality, faster than local large models, European company

**Setup:**
1. Get API key: https://console.mistral.ai/
2. Install package: `pip install langchain-mistralai`

**Configuration:**
```env
MODEL_PROVIDER=mistral
MISTRAL_API_KEY=your_actual_api_key_here
MISTRAL_MODEL=mistral-large-latest
```

**Model Options:**
- `mistral-large-latest` - Most capable (expensive)
- `mistral-medium-latest` - Balanced
- `mistral-small-latest` - Fast & cheap

**Pricing:** ~$2-8 per 1M tokens

### Option 3: OpenAI (Cloud API) 🤖

**Best for:** Highest quality (GPT-4o), well-tested, most features

**Setup:**
1. Get API key: https://platform.openai.com/api-keys
2. Install package: `pip install langchain-openai`

**Configuration:**
```env
MODEL_PROVIDER=openai
OPENAI_API_KEY=your_actual_api_key_here
OPENAI_MODEL=gpt-4o-mini
```

**Model Options:**
- `gpt-4o` - Most capable (expensive)
- `gpt-4o-mini` - Great balance (recommended)
- `gpt-3.5-turbo` - Fast & cheap

**Pricing:** ~$0.15-15 per 1M tokens

### Provider Comparison

| Feature | Ollama | Mistral AI | OpenAI |
|---------|--------|------------|--------|
| **Cost** | Free | ~$2-8/1M tokens | ~$0.15-15/1M tokens |
| **Privacy** | ✅ 100% local | ❌ Cloud | ❌ Cloud |
| **Speed (small)** | ~15s | ~3-5s | ~3-5s |
| **Speed (large)** | ~30-60s | ~5-10s | ~5-10s |
| **Quality (small)** | Good | Excellent | Excellent |
| **Quality (large)** | Very Good | Excellent | Outstanding |
| **Setup** | Easy | API key | API key |
| **Internet required** | No | Yes | Yes |

### Recommendations

**For Development/Testing:** ✅ Ollama (free, private, no limits)

**For Production:**
- ✅ Mistral AI for good quality + reasonable cost
- ✅ OpenAI GPT-4o-mini for best balance
- ✅ OpenAI GPT-4o for highest quality

**For Maximum Privacy:** ✅ Ollama only (everything local)

### Switching Between Providers

No code changes needed. Change `MODEL_PROVIDER` in `.env`, then re-run your query:

```powershell
# Re-run a query after changing MODEL_PROVIDER in .env
python .\rag\query.py "test question"

# Check active provider
Get-Content .env | Select-String "MODEL_PROVIDER"
```
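
One way a script could turn `MODEL_PROVIDER` into a concrete provider and model, failing early when a cloud API key is missing, is sketched below. The variable names come from the configuration examples in this README. The function `resolve_provider`, the fallback to `ollama`, and the error messages are assumptions for illustration, not the project's actual code.

```python
import os

# Variable names and default models come from the configuration examples above.
PROVIDERS = {
    "ollama": {"model_var": "OLLAMA_MODEL", "default": "qwen2.5:7b-instruct", "key_var": None},
    "mistral": {"model_var": "MISTRAL_MODEL", "default": "mistral-large-latest", "key_var": "MISTRAL_API_KEY"},
    "openai": {"model_var": "OPENAI_MODEL", "default": "gpt-4o-mini", "key_var": "OPENAI_API_KEY"},
}


def resolve_provider(env=None):
    """Return (provider, model) or raise ValueError on a bad configuration."""
    env = os.environ if env is None else env
    name = env.get("MODEL_PROVIDER", "ollama").strip().lower()  # assumed fallback
    if name not in PROVIDERS:
        raise ValueError(f"Unknown MODEL_PROVIDER {name!r}; expected one of {sorted(PROVIDERS)}")
    spec = PROVIDERS[name]
    if spec["key_var"] and not env.get(spec["key_var"]):
        raise ValueError(f"{spec['key_var']} must be set when MODEL_PROVIDER={name}")
    return name, env.get(spec["model_var"], spec["default"])
```

Validating the key up front means a misconfigured `.env` fails with a clear message before any documents are retrieved.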

---

## Ingest documents
Put `.md`, `.txt`, `.docx`, `.pptx`, `.pdf`, `.odt` files or **ZIP archives** (including nested ZIPs) in the `docs/` folder, then run the ingestion script:

```powershell
python .\rag\ingest.py
```

This will:
- Extract nested ZIP files automatically
- Load all supported document types
- Build a Chroma vector store under `storage/chroma`
- Cache results to skip re-ingestion if files unchanged
- Use parallel processing for faster PDF loading
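
Nested ZIP extraction can be handled recursively with the standard `zipfile` module. The sketch below works in memory and is only an illustration of the idea; `iter_zip_members` is a hypothetical helper, and the real `ingest.py` may extract to disk instead.

```python
import io
import zipfile


def iter_zip_members(data, prefix=""):
    """Yield (path, bytes) for every file in a ZIP, descending into inner ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = f"{prefix}{info.filename}"
            if info.filename.lower().endswith(".zip"):
                # Treat the inner archive as a folder and recurse into it.
                yield from iter_zip_members(payload, prefix=f"{path}/")
            else:
                yield path, payload
```

For example, `dict(iter_zip_members(open("docs/archive.zip", "rb").read()))` would map each inner path to its raw bytes (the file name here is a placeholder). A production version would also want limits on recursion depth and total extracted size to guard against malicious archives.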

**Supported formats:** PDF, Word (.docx), PowerPoint (.pptx), Markdown (.md), Text (.txt), ODT (.odt)
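
The "skip re-ingestion if files unchanged" behaviour can be approximated with a content fingerprint. This is a minimal sketch of one possible approach, not a description of how `ingest.py` actually caches; the function names and the idea of a stamp file are assumptions.

```python
import hashlib
from pathlib import Path


def docs_fingerprint(docs_dir):
    """Hash relative paths and file contents so any change yields a new digest."""
    digest = hashlib.sha256()
    root = Path(docs_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_ingest(docs_dir, stamp_file):
    """True when no fingerprint is stored or the documents have changed."""
    stamp = Path(stamp_file)
    if not stamp.exists():
        return True
    return stamp.read_text(encoding="utf-8").strip() != docs_fingerprint(docs_dir)


def mark_ingested(docs_dir, stamp_file):
    """Store the current fingerprint after a successful ingestion run."""
    stamp = Path(stamp_file)
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.write_text(docs_fingerprint(docs_dir), encoding="utf-8")
```

Keep the stamp file outside the documents folder (for instance next to the vector store) so that writing it does not change the fingerprint. Hashing full contents is simple but reads every file; comparing sizes and modification times would be faster at the cost of missing some edits.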

## Ask questions (RAG)

### Option 1: Command Line Interface (CLI)
```powershell
python .\rag\query.py "What does this project do?"
```

The script:
- Retrieves relevant chunks from your documents
- Uses streaming responses (answer appears immediately)
- Shows query completion time
- Cites sources from your documents
- Detects the question's language and responds in it (answers are most reliable in English with small models; see "Why No Multilingual Support?" above)

### Option 2: Web Interface (Graphical)
```powershell
python .\frontend\app.py
```

Then open your browser to: **http://localhost:8000**

The web interface provides:
- Clean, user-friendly chat interface
- Real-time streaming responses
- Source citations with document links
- Language auto-detection (any language is accepted, though English gives the best results with small models)
- Provider and model information displayed
- No terminal needed - just type and ask!

**To stop the server:** Press `Ctrl+C` in the terminal

## Upgrading to Qwen 8B later
When you have a GPU, pull and use a larger model:
```powershell
ollama pull qwen3:8b
# then set this line in .env (not a PowerShell command):
# OLLAMA_MODEL=qwen3:8b
```

## Notes
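- Supports `.md`, `.txt`, `.docx`, `.pptx`, `.pdf`, and `.odt` files, plus nested ZIP archives
- For `.odt` files, Pandoc must be installed (see Prerequisites above)
- FastEmbed uses ONNX under the hood and is lightweight on CPU
- Smart caching skips re-ingestion if documents haven't changed
- Parallel processing speeds up document loading
- Streaming responses provide immediate feedback
- Switch between Ollama/Mistral/OpenAI without code changes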