Arthur Delarue
committed
# Qwen RAG (CPU-friendly) - Multi-Model Support
This is a flexible local/cloud RAG setup supporting multiple LLM providers:
- **Ollama** for local models (Qwen, Mistral, Llama) - FREE & private
- **Mistral AI** cloud API - High quality, European
- **OpenAI** cloud API - GPT-4o, GPT-4o-mini
- LangChain + Chroma for retrieval
- FastEmbed embeddings (CPU-friendly, no PyTorch required)
- Nested ZIP extraction for complex document archives
- Smart caching and parallel processing
Works on Windows without GPU. Switch between providers easily via `.env` configuration.
## Prerequisites
- Python 3.9+
- Ollama installed and running (local server at `http://localhost:11434`)
### Install Ollama on Windows
If you're not sure Ollama is installed:
1) Install via Winget (requires admin approval on first use):
```powershell
winget install Ollama.Ollama -e
```
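2) Verify the installation, then start the Ollama server if it is not already running (on Windows, the installer normally starts it in the background):
```powershell
ollama --version   # confirms the CLI is installed
ollama serve       # starts the server; fails if one is already listening on port 11434
```
If you start it manually, leave that terminal open; otherwise rely on the background instance.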
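To confirm the server is reachable before running any RAG script, a small check along these lines can help. It is a minimal sketch using only the standard library; the function name `ollama_is_running` is hypothetical and not part of this project, and the URL is the default one mentioned in the prerequisites.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat all as "not running".
        return False
```

Calling `ollama_is_running()` with no arguments checks the default local address; pass a different `base_url` if you changed the port.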
### Pull a small Qwen model for CPU
For better CPU performance, start with a smaller instruct model:
```powershell
ollama pull qwen2.5:3b-instruct
```
You can switch to larger models later (e.g., `qwen2.5:7b-instruct` or `qwen3:8b`) once you have a GPU.
### Install Pandoc (required for ODT files)
If you plan to use `.odt` files, install Pandoc:
```powershell
winget install --id JohnMacFarlane.Pandoc -e --accept-source-agreements --accept-package-agreements
```
## Setup Python environment
From the repo root (`qwen/` folder):
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
**Note:** If you get an execution policy error when activating the venv, run:
```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```
Copy `.env.example` to `.env` and configure your settings:
```powershell
Copy-Item .env.example .env
```
### Key Configuration Options:
- `MODEL_PROVIDER` – Choose: `ollama`, `mistral`, or `openai`
- `OLLAMA_MODEL` – default is `qwen2.5:7b-instruct`
- `MISTRAL_API_KEY` / `OPENAI_API_KEY` – For cloud providers
- `DOCS_DIR` – folder with your documents (default: `docs`)
- `CHROMA_DIR` – vector DB storage (default: `storage/chroma`)
- `RETRIEVAL_CHUNKS` – initial chunks to retrieve (default: 100)
- `TOP_N_RERANK` – final chunks sent to LLM (default: 8)
- `USE_RERANKING` – enable for better accuracy (default: true)
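
As an illustration of how these options could be read in Python, here is a minimal sketch that applies the defaults listed above. The `RagSettings` class and `load_settings` function are hypothetical names, the `ollama` fallback for `MODEL_PROVIDER` is an assumption, and the project's actual loading code may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RagSettings:
    model_provider: str
    docs_dir: str
    chroma_dir: str
    retrieval_chunks: int
    top_n_rerank: int
    use_reranking: bool


def load_settings(env=None) -> RagSettings:
    """Build settings from a mapping (defaults to os.environ) using the README defaults."""
    env = os.environ if env is None else env
    settings = RagSettings(
        # Assumption: fall back to the local provider when nothing is set.
        model_provider=env.get("MODEL_PROVIDER", "ollama").strip().lower(),
        docs_dir=env.get("DOCS_DIR", "docs"),
        chroma_dir=env.get("CHROMA_DIR", "storage/chroma"),
        retrieval_chunks=int(env.get("RETRIEVAL_CHUNKS", "100")),
        top_n_rerank=int(env.get("TOP_N_RERANK", "8")),
        use_reranking=env.get("USE_RERANKING", "true").strip().lower() in {"1", "true", "yes"},
    )
    # The reranker can only keep chunks that retrieval returned in the first place.
    if settings.top_n_rerank > settings.retrieval_chunks:
        raise ValueError("TOP_N_RERANK cannot exceed RETRIEVAL_CHUNKS")
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the function easy to try out, for example `load_settings({"TOP_N_RERANK": "6"})`.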
---
## 🔍 How the RAG Pipeline Works
Understanding the retrieval and reranking process:
### The 3-Step Process
```
📚 Your Database (3,000+ chunks)
↓
STEP 1: Semantic Search (RETRIEVAL_CHUNKS)
↓
Top 100 similar chunks
↓
STEP 2: Reranking (TOP_N_RERANK)
↓
Best 8 relevant chunks
↓
STEP 3: LLM generates answer
```
### Detailed Explanation
**Step 1: Semantic Search (`RETRIEVAL_CHUNKS`)**
- Searches **ALL documents** in your database
- Compares your question's embedding to every chunk's embedding
- Returns the most **similar** chunks
- Example: Top 100 most similar chunks from 3,000+ total
- ⚡ Fast - uses vector similarity
**Step 2: Reranking (`TOP_N_RERANK`)**
- Takes the chunks from Step 1
- Uses Flashrank model to re-score them more accurately
- Keeps only the **best** chunks
- Example: Best 8 out of 100
- ⚠️ **RAM Usage:** ~12.5 MB per chunk being reranked
- 🎯 More accurate than semantic search alone
**Step 3: LLM Processing**
- LLM receives only `TOP_N_RERANK` chunks
- Generates answer based on those chunks
- Model must handle context size without being overwhelmed
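
The three steps can be sketched end to end in plain Python. This is a toy illustration, not the project's code: the three-dimensional embeddings are made up, and the word-overlap scorer is only a stand-in for a real reranking model such as Flashrank.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vec, chunks, k):
    """Step 1: rank chunks by embedding similarity and keep the top k."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:k]


def rerank(question, candidates, top_n):
    """Step 2: re-score candidates with a toy relevance function and keep top_n."""
    q_words = set(question.lower().split())

    def overlap(chunk):
        return len(q_words & set(chunk["text"].lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:top_n]


def build_prompt(question, chunks):
    """Step 3 input: only the reranked chunks are placed in the prompt."""
    context = "\n\n".join(c["text"] for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    {"text": "Ollama serves local models on port 11434", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Chroma stores chunk embeddings on disk", "embedding": [0.1, 0.9, 0.0]},
    {"text": "Pandoc converts ODT files to text", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Ollama can pull Qwen models", "embedding": [0.8, 0.2, 0.1]},
]
question = "which port does ollama use"
query_vec = [1.0, 0.0, 0.0]  # toy embedding of the question

candidates = semantic_search(query_vec, chunks, k=3)   # plays the role of RETRIEVAL_CHUNKS
best = rerank(question, candidates, top_n=1)           # plays the role of TOP_N_RERANK
prompt = build_prompt(question, best)
```

Here `k` and `top_n` mirror `RETRIEVAL_CHUNKS` and `TOP_N_RERANK`: the first stage is cheap and broad, the second is slower but only sees the shortlisted candidates.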
### Configuration Guidelines
**RETRIEVAL_CHUNKS** (Cast a wide net):
- Searches across all documents, returns top N most similar
- **Recommended:** 100-200 for good coverage
- **Max by RAM:**
  - 8GB RAM → 300 chunks max
  - 16GB RAM → 700 chunks max
  - 32GB RAM → 1,500 chunks max
- ⚠️ Higher values = more RAM used in reranking step
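
These limits follow from the ~12.5 MB per chunk figure quoted for the reranking step. The sketch below just redoes that arithmetic; the helper names are hypothetical. For example, 300 chunks comes to about 3.7 GB, a little under half of an 8GB machine.

```python
MB_PER_CHUNK = 12.5  # reranking memory per chunk, as quoted in this README


def rerank_ram_mb(retrieval_chunks: int) -> float:
    """Approximate RAM used by reranking for a given RETRIEVAL_CHUNKS value."""
    return retrieval_chunks * MB_PER_CHUNK


def max_chunks_for_budget(budget_mb: float) -> int:
    """Largest RETRIEVAL_CHUNKS value that fits within a RAM budget."""
    return int(budget_mb // MB_PER_CHUNK)


for total_gb, chunk_count in [(8, 300), (16, 700), (32, 1500)]:
    used_gb = rerank_ram_mb(chunk_count) / 1024
    print(f"{total_gb} GB machine, {chunk_count} chunks: ~{used_gb:.1f} GB for reranking")
```

`max_chunks_for_budget` goes the other way: give it the RAM you are willing to spend on reranking and it returns a ceiling for `RETRIEVAL_CHUNKS`.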
**TOP_N_RERANK** (Final chunks to LLM):
- **qwen2.5:7b** → 6-8 chunks (optimal; the model gets overwhelmed beyond this)
- **qwen2.5:14b** → 12-15 chunks
- **qwen2.5:32b** → 25-30 chunks
- ⚠️ **More chunks ≠ better answers** with smaller models!
### Why This Design?
**You cannot send all documents to the LLM:**
- 1,000+ documents = millions of tokens
- LLMs have context limits (8k-128k tokens)
- Would be extremely slow and expensive
**RAG solution:**
- Semantic search already "sees" all documents
- Retrieves most relevant subset
- Reranking filters to highest quality
- LLM gets focused, relevant context
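
A rough token estimate makes the difference concrete. The numbers below are assumptions for illustration only: about 4 characters per token for English text and a hypothetical chunk size of 1,000 characters, neither of which is taken from this project's settings.

```python
CHARS_PER_TOKEN = 4   # assumption: rough average for English text
CHUNK_CHARS = 1000    # assumption: hypothetical chunk size


def estimate_tokens(num_chunks: int, chunk_chars: int = CHUNK_CHARS) -> int:
    """Very rough token count for a number of chunks."""
    return num_chunks * chunk_chars // CHARS_PER_TOKEN


def fits_context(num_chunks: int, context_limit: int, reserved_for_answer: int = 1000) -> bool:
    """True if the chunks plus room for the answer fit in the context window."""
    return estimate_tokens(num_chunks) + reserved_for_answer <= context_limit


print(estimate_tokens(3000))         # the whole database of 3,000 chunks
print(estimate_tokens(8))            # only the TOP_N_RERANK chunks
print(fits_context(3000, 128_000))   # whole database against a 128k window
print(fits_context(8, 8_000))        # 8 chunks against an 8k window
```

Under these assumptions, the full database overflows even a 128k-token window, while 8 reranked chunks fit comfortably in 8k.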
---
## 🌍 Why No Multilingual Support?
This project **does not include automatic multilingual support** for a good reason:
### ⚠️ Small Models Perform Poorly in Non-English Languages
**Current Model (`qwen2.5:7b-instruct`):**
- Trained mostly on English and Chinese data; other languages are less well represented
- **Significantly worse quality** in other languages
- Non-English responses are often less detailed, less accurate, and miss important nuances
- Translation overhead reduces reasoning capacity
**Why Small Models Struggle:**
- Most training data is English (70-90% of training corpus)
- 7B parameters aren't enough for strong multilingual capabilities
- Model spends cognitive capacity on translation instead of reasoning
### 🚀 If You Need Multilingual Support
**Option 1: Use larger Qwen models (14B+)**
```powershell
ollama pull qwen2.5:14b-instruct # Better multilingual
ollama pull qwen2.5:32b-instruct # Best multilingual
```
**Option 2: Use specialized multilingual models**
```powershell
ollama pull aya:8b # Aya 23, optimized for 23 languages
ollama pull aya:35b # Aya 23, best multilingual performance
```
**Recommendation:** For production use with multiple languages, upgrade to 14B+ models or use Aya. Otherwise, **ask questions in English for best results**.
---
## Multi-Model Provider Setup
Your RAG system supports multiple LLM providers. Choose based on your needs:
### 🚀 Quick Start
Edit your `.env` file and set `MODEL_PROVIDER`:
```env
# Set exactly one provider; the alternatives are shown commented out
MODEL_PROVIDER=ollama      # Local (free, private)
# MODEL_PROVIDER=mistral   # Cloud API (paid)
# MODEL_PROVIDER=openai    # Cloud API (paid)
```
### Option 1: Ollama (Local - FREE) 🏠
**Setup:** Already configured! Just pull different models:
```powershell
# Fast & free models
ollama pull qwen2.5:3b-instruct       # Small, fast
ollama pull qwen2.5:7b-instruct       # Balanced
ollama pull mistral:7b-instruct-v0.3  # Alternative
# Larger models (need good CPU/GPU)
ollama pull qwen2.5:14b-instruct
ollama pull mixtral:8x7b
```
**Configuration:**
```env
MODEL_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:7b-instruct
OLLAMA_BASE_URL=http://localhost:11434
```
**No API key needed!**
### Option 2: Mistral AI (Cloud API) ☁️
**Best for:** High quality, faster than local large models, European company
**Setup:**
1. Get API key: https://console.mistral.ai/
2. Install package: `pip install langchain-mistralai`
**Configuration:**
```env
MODEL_PROVIDER=mistral
MISTRAL_API_KEY=your_actual_api_key_here
MISTRAL_MODEL=mistral-large-latest
```
**Model Options:**
- `mistral-large-latest` - Most capable (expensive)
- `mistral-medium-latest` - Balanced
- `mistral-small-latest` - Fast & cheap
**Pricing:** ~$2-8 per 1M tokens
### Option 3: OpenAI (Cloud API) 🤖
**Best for:** Highest quality (GPT-4), well-tested, most features
**Setup:**
1. Get API key: https://platform.openai.com/api-keys
2. Install package: `pip install langchain-openai`
**Configuration:**
```env
MODEL_PROVIDER=openai
OPENAI_API_KEY=your_actual_api_key_here
OPENAI_MODEL=gpt-4o-mini
```
**Model Options:**
- `gpt-4o` - Most capable (expensive)
- `gpt-4o-mini` - Great balance (recommended)
- `gpt-3.5-turbo` - Fast & cheap
**Pricing:** ~$0.15-15 per 1M tokens
### Provider Comparison
| Feature | Ollama | Mistral AI | OpenAI |
|---------|--------|------------|--------|
| **Cost** | Free | ~$2-8/1M tokens | ~$0.15-15/1M tokens |
| **Privacy** | ✅ 100% local | ❌ Cloud | ❌ Cloud |
| **Response time (small model)** | ~15s | ~3-5s | ~3-5s |
| **Response time (large model)** | ~30-60s | ~5-10s | ~5-10s |
| **Quality (small model)** | Good | Excellent | Excellent |
| **Quality (large model)** | Very Good | Excellent | Outstanding |
| **Setup** | Easy | API key | API key |
| **Internet required** | ❌ No | ✅ Yes | ✅ Yes |
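
The per-token prices translate to per-query costs with simple arithmetic. The sketch below assumes a hypothetical 2,500 tokens per query (retrieved chunks, question, and answer combined) and a single blended price per million tokens; real providers usually price input and output tokens separately, so treat the result as an order-of-magnitude estimate.

```python
def query_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of one request at a single blended price per million tokens."""
    return total_tokens * usd_per_million_tokens / 1_000_000


TOKENS_PER_QUERY = 2_500  # assumption: retrieved chunks + question + answer

for label, price in [("low end ($0.15/1M)", 0.15), ("high end ($15/1M)", 15.0)]:
    per_query = query_cost_usd(TOKENS_PER_QUERY, price)
    print(f"{label}: ${per_query:.6f} per query, ${per_query * 1000:.2f} per 1,000 queries")
```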
### Recommendations
**For Development/Testing:** ✅ Ollama (free, private, no limits)
**For Production:**
- ✅ Mistral AI for good quality + reasonable cost
- ✅ OpenAI GPT-4o-mini for best balance
- ✅ OpenAI GPT-4o for highest quality
**For Maximum Privacy:** ✅ Ollama only (everything local)
### Switching Between Providers
No code changes needed. Change `MODEL_PROVIDER` in `.env`, then rerun a query to try the new provider:
```powershell
# Try different providers
python .\rag\query.py "test question"
# Check active provider
Get-Content .env | Select-String "MODEL_PROVIDER"
```
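
One way such a switch could be wired up is a small dispatch on `MODEL_PROVIDER`. This is a sketch under assumptions, not the project's actual code: `resolve_provider` and `build_chat_model` are hypothetical names, and the use of the `langchain-ollama` package for the local provider is an assumption (the README only names `langchain-mistralai` and `langchain-openai`).

```python
import os

# Which variable holds the model name for each provider (names as used in this README).
MODEL_ENV_VARS = {
    "ollama": "OLLAMA_MODEL",
    "mistral": "MISTRAL_MODEL",
    "openai": "OPENAI_MODEL",
}


def resolve_provider(env=None):
    """Return (provider, model_name) from MODEL_PROVIDER and the matching *_MODEL variable."""
    env = os.environ if env is None else env
    provider = env.get("MODEL_PROVIDER", "").strip().lower()
    if provider not in MODEL_ENV_VARS:
        raise ValueError(f"MODEL_PROVIDER must be one of {sorted(MODEL_ENV_VARS)}, got {provider!r}")
    model = env.get(MODEL_ENV_VARS[provider], "")
    if not model:
        raise ValueError(f"{MODEL_ENV_VARS[provider]} is not set")
    return provider, model


def build_chat_model(env=None):
    """Instantiate a LangChain chat model for the configured provider."""
    env = os.environ if env is None else env
    provider, model = resolve_provider(env)
    # Imports are deferred so only the selected provider's package must be installed.
    if provider == "ollama":
        from langchain_ollama import ChatOllama  # assumption: langchain-ollama is installed
        return ChatOllama(model=model, base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"))
    if provider == "mistral":
        from langchain_mistralai import ChatMistralAI
        return ChatMistralAI(model=model)  # picks up MISTRAL_API_KEY from the process environment
    from langchain_openai import ChatOpenAI
    return ChatOpenAI(model=model)  # picks up OPENAI_API_KEY from the process environment
```

Keeping validation in `resolve_provider` means a typo in `.env` fails early with a clear message instead of surfacing later as an import or authentication error.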
---
Put `.md`, `.txt`, `.docx`, `.pptx`, `.pdf`, `.odt` files or **ZIP archives** (including nested ZIPs) in the `docs/` folder:
This will:
- Extract nested ZIP files automatically
- Load all supported document types
- Build a Chroma vector store under `storage/chroma`
- Cache results to skip re-ingestion if files are unchanged
- Use parallel processing for faster PDF loading
**Supported formats:** PDF, Word (.docx), PowerPoint (.pptx), Markdown (.md), Text (.txt), ODT (.odt)
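
Nested ZIP extraction can be done recursively with the standard `zipfile` module. The following is a minimal sketch, not the project's implementation: `extract_nested_zips` and its `max_depth` guard are hypothetical, and it assumes an inner archive such as `reports.zip` should be unpacked into a sibling `reports/` folder.

```python
import zipfile
from pathlib import Path


def extract_nested_zips(archive: Path, dest: Path, max_depth: int = 5) -> list[Path]:
    """Extract archive into dest, then recurse into any ZIP files found inside.

    Returns the non-ZIP files that were extracted.
    """
    if max_depth < 0:
        raise ValueError(f"ZIP nesting too deep at {archive}")
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)

    extracted = []
    # sorted() snapshots the listing, so files unpacked by recursive calls
    # are not visited a second time by this loop.
    for path in sorted(dest.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() == ".zip":
            inner_dest = path.with_suffix("")  # e.g. reports.zip -> reports/
            extracted.extend(extract_nested_zips(path, inner_dest, max_depth - 1))
            path.unlink()  # drop the inner archive once it has been unpacked
        else:
            extracted.append(path)
    return extracted
```

The depth limit is a simple guard against archives that nest indefinitely; the returned paths can then be handed to whichever document loader matches each file extension.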
### Option 1: Command Line Interface (CLI)
```powershell
python .\rag\query.py "What does this project do?"
```
The script:
- Retrieves relevant chunks from your documents
- Uses streaming responses (answer appears immediately)
- Shows query completion time
- Cites sources from your documents
- Automatically detects language and responds accordingly
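
Streaming with a completion time can be sketched as follows. The token source here is a fake generator so the example runs without a model; in a LangChain setup the tokens would typically come from a chat model's `.stream()` call instead. Both function names are hypothetical.

```python
import sys
import time
from typing import Iterable, Iterator


def fake_token_stream(text: str) -> Iterator[str]:
    """Stand-in for a streaming LLM call: yields the answer word by word."""
    for word in text.split(" "):
        yield word + " "


def stream_answer(tokens: Iterable[str], out=sys.stdout) -> tuple[str, float]:
    """Write tokens as they arrive; return the full answer and elapsed seconds."""
    start = time.perf_counter()
    parts = []
    for token in tokens:
        out.write(token)
        out.flush()  # show each token immediately instead of waiting for a full line
        parts.append(token)
    elapsed = time.perf_counter() - start
    out.write(f"\n(completed in {elapsed:.2f}s)\n")
    return "".join(parts), elapsed


answer, seconds = stream_answer(
    fake_token_stream("This project answers questions about your documents.")
)
```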
### Option 2: Web Interface (Graphical)
```powershell
python .\frontend\app.py
```
Then open your browser to: **http://localhost:8000**
The web interface provides:
- Clean, user-friendly chat interface
- Real-time streaming responses
- Source citations with document links
- Language auto-detection (ask in any language)
- Provider and model information displayed
- No terminal needed - just type and ask!
**To stop the server:** Press `Ctrl+C` in the terminal
## Upgrading to Qwen 8B later
When you have a GPU, pull and use a larger model:
```powershell
ollama pull qwen3:8b
# Then set this in .env (it is a .env entry, not a PowerShell command):
# OLLAMA_MODEL=qwen3:8b
```
## Notes
- Supports `.md`, `.txt`, `.docx`, `.pptx`, `.pdf`, and `.odt` files, plus nested ZIP archives
- For `.odt` files, Pandoc must be installed (see Prerequisites above)
- FastEmbed uses ONNX under the hood and is lightweight for CPU