Arthur Delarue
committed
# Qwen RAG (CPU-friendly) - Multi-Model Support
This is a flexible local/cloud RAG setup that supports multiple LLM providers. It includes:
- **Ollama** for local models (Qwen, Mistral, Llama) - FREE & private
- **Mistral AI** cloud API - High quality, European
- **OpenAI** cloud API - GPT-4o, GPT-4o-mini
- LangChain + Chroma for retrieval
- FastEmbed embeddings (CPU-friendly, no PyTorch required)
- Nested ZIP extraction for complex document archives
- Smart caching and parallel processing
Works on Windows without GPU. Switch between providers easily via `.env` configuration.
## Prerequisites
- Python 3.9+
- Ollama installed and running (local server at `http://localhost:11434`)
### Install Ollama on Windows
If you're not sure Ollama is installed:
1) Install via Winget (requires admin approval on first use):
```powershell
winget install Ollama.Ollama -e
```
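2) Check the install, then start the Ollama server if it is not already running (on Windows it usually runs in the background after installation):
```powershell
ollama --version
ollama serve
```
If you start it manually, leave it running in a terminal; otherwise rely on the background process.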
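To confirm the server is reachable before running the RAG scripts, a small check along these lines may help. It is a minimal sketch, not part of this repo: the function name `ollama_models` is made up for illustration, and it assumes the default base URL shown in Prerequisites. It calls Ollama's `/api/tags` endpoint, which lists the locally pulled models.

```python
import json
import urllib.error
import urllib.request


def ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return the names of locally pulled models, or None if the server is unreachable."""
    try:
        # GET /api/tags lists the models available to the local Ollama server.
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON.
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


if __name__ == "__main__":
    models = ollama_models()
    if models is None:
        print("Ollama is not reachable; start it with `ollama serve`.")
    else:
        print("Ollama is running. Models:", ", ".join(models) or "(none pulled yet)")
```

An empty model list means the server is up but nothing has been pulled yet.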
### Pull a small Qwen model for CPU
For better CPU performance, start with a smaller instruct model:
```powershell
ollama pull qwen2.5:3b-instruct
```
You can switch to larger models later (e.g., `qwen2.5:7b-instruct` or `qwen3:8b`) once you have a GPU.
### Install Pandoc (required for ODT files)
If you plan to use `.odt` files, install Pandoc:
```powershell
winget install --id JohnMacFarlane.Pandoc -e --accept-source-agreements --accept-package-agreements
```
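If you want to verify from Python that Pandoc is on your `PATH` before ingesting `.odt` files, something like the following should work. This is only a sketch; `pandoc_path` is a hypothetical helper, not a function from this repo.

```python
import shutil


def pandoc_path():
    """Return the full path to the pandoc executable, or None if it is not on PATH."""
    return shutil.which("pandoc")


if __name__ == "__main__":
    path = pandoc_path()
    if path:
        print(f"Pandoc found at {path}")
    else:
        print("Pandoc not found; install it before using .odt files.")
```

You may need to open a new terminal after installing Pandoc so that the updated `PATH` is picked up.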
## Setup Python environment
From the repo root (`qwen/` folder):
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
**Note:** If you get an execution policy error when activating the venv, run:
```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```
Copy `.env.example` to `.env` and configure your settings:
```powershell
Copy-Item .env.example .env
```
### Key Configuration Options:
- `MODEL_PROVIDER` – Choose: `ollama`, `mistral`, or `openai`
- `MISTRAL_API_KEY` / `OPENAI_API_KEY` – For cloud providers
- `DOCS_DIR` – folder with your documents (default: `docs`)
- `CHROMA_DIR` – vector DB storage (default: `storage/chroma`)
- `RETRIEVAL_CHUNKS` – how many chunks to retrieve (default: 12)
- `USE_RERANKING` – enable for better accuracy (default: true)
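To illustrate how these options and their defaults could be read in Python, here is a minimal sketch. The `Settings` dataclass and `load_settings` function are hypothetical and may not match the repo's actual code. The defaults for `DOCS_DIR`, `CHROMA_DIR`, `RETRIEVAL_CHUNKS`, and `USE_RERANKING` come from the list above; the `ollama` default for `MODEL_PROVIDER` is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:  # hypothetical container, not the project's actual class
    model_provider: str
    docs_dir: str
    chroma_dir: str
    retrieval_chunks: int
    use_reranking: bool


def _as_bool(value):
    """Interpret common truthy strings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None):
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return Settings(
        # Assumed default; the README does not state one for MODEL_PROVIDER.
        model_provider=env.get("MODEL_PROVIDER", "ollama").strip().lower(),
        docs_dir=env.get("DOCS_DIR", "docs"),
        chroma_dir=env.get("CHROMA_DIR", "storage/chroma"),
        retrieval_chunks=int(env.get("RETRIEVAL_CHUNKS", "12")),
        use_reranking=_as_bool(env.get("USE_RERANKING", "true")),
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to test without touching your real `.env`.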
---
## Multi-Model Provider Setup
Your RAG system supports multiple LLM providers. Choose based on your needs:
### 🚀 Quick Start
Edit your `.env` file and set `MODEL_PROVIDER` to exactly one provider:
```env
# Keep one line active and leave the others commented out
MODEL_PROVIDER=ollama # Local (free, private)
# MODEL_PROVIDER=mistral # Cloud API (paid)
# MODEL_PROVIDER=openai # Cloud API (paid)
```
### Option 1: Ollama (Local - FREE) 🏠
**Setup:** Already configured! Just pull different models:
```powershell
# Fast & free models
ollama pull qwen2.5:3b-instruct # Small, fast
ollama pull qwen2.5:7b-instruct # Balanced
ollama pull mistral:7b-instruct-v0.3 # Alternative
# Larger models (need good CPU/GPU)
ollama pull qwen2.5:14b-instruct
ollama pull mixtral:8x7b
```
**Configuration:**
```env
MODEL_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:7b-instruct
OLLAMA_BASE_URL=http://localhost:11434
```
**No API key needed!**
### Option 2: Mistral AI (Cloud API) ☁️
**Best for:** High quality, faster than local large models, European company
**Setup:**
1. Get API key: https://console.mistral.ai/
2. Install package: `pip install langchain-mistralai`
**Configuration:**
```env
MODEL_PROVIDER=mistral
MISTRAL_API_KEY=your_actual_api_key_here
MISTRAL_MODEL=mistral-large-latest
```
**Model Options:**
- `mistral-large-latest` - Most capable (expensive)
- `mistral-medium-latest` - Balanced
- `mistral-small-latest` - Fast & cheap
**Pricing:** ~$2-8 per 1M tokens
### Option 3: OpenAI (Cloud API) 🤖
**Best for:** Highest quality (GPT-4), well-tested, most features
**Setup:**
1. Get API key: https://platform.openai.com/api-keys
2. Install package: `pip install langchain-openai`
**Configuration:**
```env
MODEL_PROVIDER=openai
OPENAI_API_KEY=your_actual_api_key_here
OPENAI_MODEL=gpt-4o-mini
```
**Model Options:**
- `gpt-4o` - Most capable (expensive)
- `gpt-4o-mini` - Great balance (recommended)
- `gpt-3.5-turbo` - Fast & cheap
**Pricing:** ~$0.15-15 per 1M tokens
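Per-million-token prices are easier to judge against a concrete workload. The sketch below does the arithmetic for a hypothetical workload of 1,000 queries at about 4,000 tokens each; both numbers are illustrative assumptions, not measurements from this project. The price ranges are the approximate ones quoted in this README.

```python
def estimate_cost_usd(total_tokens, price_per_million_usd):
    """Cost in USD for a token count at a given price per 1M tokens."""
    return total_tokens / 1_000_000 * price_per_million_usd


# Hypothetical workload: 1,000 queries at about 4,000 tokens each (prompt + answer).
tokens = 1_000 * 4_000

# Approximate price ranges quoted in this README (USD per 1M tokens).
ranges = {"Mistral AI": (2.0, 8.0), "OpenAI": (0.15, 15.0)}

for provider, (low_price, high_price) in ranges.items():
    low = estimate_cost_usd(tokens, low_price)
    high = estimate_cost_usd(tokens, high_price)
    print(f"{provider}: ${low:.2f} to ${high:.2f}")
```

Real bills usually price input and output tokens differently, so treat this as a rough bound and check each provider's pricing page.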
### Provider Comparison
| Feature | Ollama | Mistral AI | OpenAI |
|---------|--------|------------|--------|
| **Cost** | Free | ~$2-8/1M tokens | ~$0.15-15/1M tokens |
| **Privacy** | ✅ 100% local | ❌ Cloud | ❌ Cloud |
| **Speed (small)** | ~15s | ~3-5s | ~3-5s |
| **Speed (large)** | ~30-60s | ~5-10s | ~5-10s |
| **Quality (small)** | Good | Excellent | Excellent |
| **Quality (large)** | Very Good | Excellent | Outstanding |
| **Setup** | Easy | API key | API key |
| **Internet required** | ❌ No | ✅ Yes | ✅ Yes |
### Recommendations
**For Development/Testing:** ✅ Ollama (free, private, no limits)
**For Production:**
- ✅ Mistral AI for good quality + reasonable cost
- ✅ OpenAI GPT-4o-mini for best balance
- ✅ OpenAI GPT-4o for highest quality
**For Maximum Privacy:** ✅ Ollama only (everything local)
### Switching Between Providers
No code changes needed! Change `MODEL_PROVIDER` in `.env`, then rerun a query:
```powershell
# Try different providers
python .\rag\query.py "test question"
# Check active provider
Get-Content .env | Select-String "MODEL_PROVIDER"
```
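One way to picture how a single `.env` value can select the provider is a small lookup that validates the configuration before any model is created. This is a hedged sketch only: `resolve_provider` is a hypothetical helper, and the repo's real code may be structured differently. The variable names are the ones used in the configuration examples in this README.

```python
import os

# Variable names are taken from the configuration examples in this README.
_PROVIDERS = {
    "ollama": {"model_var": "OLLAMA_MODEL", "key_var": None},
    "mistral": {"model_var": "MISTRAL_MODEL", "key_var": "MISTRAL_API_KEY"},
    "openai": {"model_var": "OPENAI_MODEL", "key_var": "OPENAI_API_KEY"},
}


def resolve_provider(env=None):
    """Return (provider, model, api_key) or raise ValueError on bad config."""
    env = os.environ if env is None else env
    provider = env.get("MODEL_PROVIDER", "").strip().lower()
    if provider not in _PROVIDERS:
        raise ValueError(
            f"MODEL_PROVIDER must be one of {sorted(_PROVIDERS)}, got {provider!r}"
        )
    spec = _PROVIDERS[provider]
    model = env.get(spec["model_var"], "").strip()
    if not model:
        raise ValueError(f"{spec['model_var']} is not set")
    api_key = None
    if spec["key_var"] is not None:
        # Cloud providers need a key; Ollama runs locally without one.
        api_key = env.get(spec["key_var"], "").strip()
        if not api_key:
            raise ValueError(f"{spec['key_var']} is required for provider {provider!r}")
    return provider, model, api_key
```

Failing early with a clear message is friendlier than a late authentication error from a cloud API.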
---
Put `.md`, `.txt`, `.docx`, `.pptx`, `.pdf`, `.odt` files or **ZIP archives** (including nested ZIPs) in the `docs/` folder.
Ingestion will:
- Extract nested ZIP files automatically
- Load all supported document types
- Build a Chroma vector store under `storage/chroma`
- Cache results to skip re-ingestion if files are unchanged
- Use parallel processing for faster PDF loading
**Supported formats:** PDF, Word (.docx), PowerPoint (.pptx), Markdown (.md), Text (.txt), ODT (.odt)
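Nested ZIP extraction can be sketched with the standard library alone. The function below is an illustration under stated assumptions, not the repo's implementation: `extract_nested_zips` and its `max_depth` guard are hypothetical, and `dest` is assumed to be a separate folder that does not contain the archive itself.

```python
import zipfile
from pathlib import Path


def extract_nested_zips(archive, dest, max_depth=5):
    """Extract `archive` into `dest`, then unpack any ZIPs found inside it."""
    archive, dest = Path(archive), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    if max_depth <= 0:
        return  # stop descending; guards against deeply nested archives
    # Snapshot the list before recursing, since recursion adds new files.
    for inner in list(dest.rglob("*.zip")):
        if zipfile.is_zipfile(inner):
            # Unpack next to the inner archive, in a folder named after it.
            extract_nested_zips(inner, inner.with_suffix(""), max_depth - 1)
            inner.unlink()  # remove the inner archive once unpacked
```

For example, `extract_nested_zips("docs/archive.zip", "docs/_extracted/archive")` would leave only regular files under the target folder (both paths here are made up for illustration).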
## Ask questions (RAG)
```powershell
python .\rag\query.py "What does this project do?"
```
The script:
- Retrieves relevant chunks from your documents
- Uses streaming responses (answer appears immediately)
- Shows query completion time
- Cites sources from your documents
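The streaming and timing behaviour can be illustrated without any LLM at all. In this sketch, `stream_answer` and `fake_model` are hypothetical names; `fake_model` stands in for a real streaming call, and nothing here is taken from `rag/query.py`.

```python
import sys
import time


def stream_answer(tokens, out=sys.stdout):
    """Write tokens as they arrive; return the full text and elapsed seconds."""
    start = time.perf_counter()
    parts = []
    for token in tokens:
        out.write(token)
        out.flush()  # show each piece immediately instead of buffering
        parts.append(token)
    elapsed = time.perf_counter() - start
    out.write(f"\n(completed in {elapsed:.2f}s)\n")
    return "".join(parts), elapsed


def fake_model():
    """Stand-in for a real streaming LLM call."""
    for word in ["Retrieval", " grounds", " the", " answer", "."]:
        yield word


if __name__ == "__main__":
    stream_answer(fake_model())
```

Because the function accepts any iterable of strings, the same loop works whether the tokens come from a local or a cloud model.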
## Upgrading to Qwen 8B later
When you have a GPU, pull and use a larger model:
```powershell
ollama pull qwen3:8b
# Then set this in .env (it is not a PowerShell command):
# OLLAMA_MODEL=qwen3:8b
```
## Notes
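- Supports `.md`, `.txt`, `.docx`, `.pptx`, `.pdf`, `.odt` files and nested ZIP archives
- For `.odt` files, Pandoc must be installed (see Prerequisites above)
- FastEmbed uses ONNX under the hood and is lightweight for CPU
- Smart caching skips re-ingestion if documents haven't changed
- Parallel processing speeds up document loading
- Streaming responses provide immediate feedback
- Switch between Ollama/Mistral/OpenAI without code changes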
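Ingestion skips work when the documents are unchanged. One common way to detect that is to fingerprint the documents folder and compare it with a stored value. The sketch below assumes this approach; the repo may do it differently, and `docs_fingerprint`, `needs_ingest`, `mark_ingested`, and the state file are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def docs_fingerprint(docs_dir):
    """Hash the relative path, size and mtime of every file under docs_dir."""
    digest = hashlib.sha256()
    root = Path(docs_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        record = f"{path.relative_to(root).as_posix()}|{stat.st_size}|{stat.st_mtime_ns}\n"
        digest.update(record.encode("utf-8"))
    return digest.hexdigest()


def needs_ingest(docs_dir, state_file):
    """Return True when no fingerprint is stored or the documents have changed."""
    state = Path(state_file)
    if not state.exists():
        return True
    saved = json.loads(state.read_text()).get("fingerprint")
    return saved != docs_fingerprint(docs_dir)


def mark_ingested(docs_dir, state_file):
    """Record the current fingerprint; call this after a successful ingest."""
    # Keep state_file outside docs_dir, or it would change the fingerprint.
    state = Path(state_file)
    state.parent.mkdir(parents=True, exist_ok=True)
    state.write_text(json.dumps({"fingerprint": docs_fingerprint(docs_dir)}))
```

Hashing metadata rather than file contents keeps the check fast for large PDFs, at the cost of a re-ingest when a file is touched but not actually modified.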