# RAG System with Multi-Model Support
A flexible Retrieval-Augmented Generation (RAG) system that lets you query your documents using local or cloud-based LLMs. Features intelligent document processing, semantic search with reranking, and a simple web interface.
## 🎯 What This System Does
1. **Ingests** your documents (Word, PDF, PowerPoint, Text, Markdown) into a vector database
2. **Retrieves** relevant chunks using semantic search when you ask a question
3. **Reranks** results to find the most relevant information
4. **Generates** comprehensive answers using your choice of LLM
## 🏗️ Architecture
```
Documents (docs/)
↓
Document Processing & Chunking
↓
Embedding Model → Vector Database (Chroma)
↓
Query → Semantic Search (50 chunks)
↓
Reranking (Top 10 chunks)
↓
LLM (Qwen/Mistral/OpenAI) → Answer
```
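The retrieve-then-rerank flow in the diagram can be sketched in plain Python. This is a toy illustration, not the project's code: the in-memory `store`, the two-dimensional vectors, and the `keyword_overlap` scorer are hypothetical stand-ins for Chroma, a real embedding model, and Flashrank.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, store, k):
    """Stage 1: return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return ranked[:k]

def rerank(query, candidates, top_n, score_fn):
    """Stage 2: re-score the candidates with a second scorer and keep top_n."""
    ranked = sorted(candidates, key=lambda item: score_fn(query, item["text"]), reverse=True)
    return ranked[:top_n]

def keyword_overlap(query, text):
    """Toy stand-in for a reranker: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Hypothetical store; real vectors would come from the embedding model.
store = [
    {"text": "Chroma stores the chunk vectors", "vector": [1.0, 0.0]},
    {"text": "Flashrank reranks retrieved chunks", "vector": [0.8, 0.6]},
    {"text": "Pandoc converts ODT files", "vector": [0.0, 1.0]},
]
query = "which component reranks chunks"
candidates = retrieve([0.9, 0.1], store, k=2)
context = rerank(query, candidates, top_n=1, score_fn=keyword_overlap)
print(context[0]["text"])  # Flashrank reranks retrieved chunks
```

The first stage favors the Chroma chunk by vector similarity, and the second stage promotes the Flashrank chunk because it shares more words with the question. That reordering is the point of retrieving a wide set (50) and then narrowing it (10).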
## ✨ Features
- **Multiple LLM Providers**: Ollama (local), Mistral AI, or OpenAI
- **Flexible Embeddings**: Ollama models (mxbai, nomic, bge) or FastEmbed
- **Smart Retrieval**: Semantic search + Flashrank reranking for precision
- **Document Support**: Word, PDF, PowerPoint, Text, Markdown, ODT
- **ZIP Extraction**: Automatically extracts nested ZIP archives
- **Web Interface**: Simple HTTP server with query interface
- **CPU-Friendly**: Optimized for systems without GPU
---
## 📦 Installation
### 1. Install Ollama (for local models)
**Windows (via Winget):**
```powershell
winget install --id Ollama.Ollama -e
```
Verify installation:
```powershell
ollama --version
```
Ollama runs as a Windows service automatically. If not running:
```powershell
ollama serve
```
### 2. Pull Required Models
**LLM Model (for answering queries):**
```powershell
# Recommended: 7B model (4.7GB, good balance)
ollama pull qwen2.5:7b-instruct
# Alternative: Smaller for low-end CPUs (2.0GB)
ollama pull qwen2.5:3b-instruct
# Alternative: Larger for better accuracy (8.9GB)
ollama pull qwen2.5:14b-instruct
```
**Embedding Model (for semantic search):**
```powershell
# Recommended: Best for technical documents (669MB, 1024-dim)
ollama pull mxbai-embed-large
# Alternative: Lighter option (274MB, 768-dim)
ollama pull nomic-embed-text
```
### 3. Install Pandoc (for ODT files)
Optional, only if you have OpenDocument files:
```powershell
winget install --id JohnMacFarlane.Pandoc -e
```
### 4. Setup Python Environment
**Create virtual environment:**
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```
**Note:** If you get an execution policy error:
```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```
**Install dependencies:**
```powershell
pip install -r requirements.txt
```
---
## ⚙️ Configuration
### 1. Create .env file
```powershell
Copy-Item .env.example .env
```
### 2. Configure Settings
Edit `.env` with your preferred settings:
**LLM Configuration:**
```env
MODEL_PROVIDER=ollama # Options: ollama, mistral, openai
OLLAMA_MODEL=qwen2.5:7b-instruct # The model for generating answers
OLLAMA_BASE_URL=http://localhost:11434
```
**Embedding Configuration:**
```env
EMBEDDING_PROVIDER=ollama # Options: ollama, fastembed
EMBEDDING_MODEL=mxbai-embed-large # Must match during ingestion & query!
```
**Retrieval Settings:**
```env
RETRIEVAL_CHUNKS=50 # How many chunks to retrieve initially
TOP_N_RERANK=10 # Final chunks sent to LLM after reranking
USE_RERANKING=true # Enable for better accuracy
```
**Document Processing:**
```env
CHUNK_SIZE=800 # Characters per chunk (smaller = more chunks)
CHUNK_OVERLAP=160 # Overlap prevents context loss
BATCH_SIZE=100 # Documents processed per batch
```
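The settings above interact: the overlap must be smaller than the chunk size, and `TOP_N_RERANK` cannot exceed `RETRIEVAL_CHUNKS`. A minimal sketch of reading and validating them follows. The `load_settings` helper and its checks are assumptions for illustration; the project's own scripts may load `.env` differently.

```python
import os

# Defaults mirror the values shown in the configuration blocks above.
DEFAULTS = {
    "MODEL_PROVIDER": "ollama",
    "EMBEDDING_PROVIDER": "ollama",
    "RETRIEVAL_CHUNKS": "50",
    "TOP_N_RERANK": "10",
    "USE_RERANKING": "true",
    "CHUNK_SIZE": "800",
    "CHUNK_OVERLAP": "160",
    "BATCH_SIZE": "100",
}

def _clean(value):
    """Drop a trailing inline comment and surrounding whitespace."""
    return value.split("#", 1)[0].strip()

def load_settings(env=None):
    """Read settings from a mapping (os.environ by default) and validate them."""
    env = os.environ if env is None else env
    raw = {key: _clean(env.get(key, default)) for key, default in DEFAULTS.items()}
    settings = {
        "model_provider": raw["MODEL_PROVIDER"].lower(),
        "embedding_provider": raw["EMBEDDING_PROVIDER"].lower(),
        "retrieval_chunks": int(raw["RETRIEVAL_CHUNKS"]),
        "top_n_rerank": int(raw["TOP_N_RERANK"]),
        "use_reranking": raw["USE_RERANKING"].lower() in ("1", "true", "yes"),
        "chunk_size": int(raw["CHUNK_SIZE"]),
        "chunk_overlap": int(raw["CHUNK_OVERLAP"]),
        "batch_size": int(raw["BATCH_SIZE"]),
    }
    if settings["model_provider"] not in ("ollama", "mistral", "openai"):
        raise ValueError(f"Unknown MODEL_PROVIDER: {settings['model_provider']}")
    if settings["embedding_provider"] not in ("ollama", "fastembed"):
        raise ValueError(f"Unknown EMBEDDING_PROVIDER: {settings['embedding_provider']}")
    if not 0 <= settings["chunk_overlap"] < settings["chunk_size"]:
        raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    if settings["top_n_rerank"] > settings["retrieval_chunks"]:
        raise ValueError("TOP_N_RERANK cannot exceed RETRIEVAL_CHUNKS")
    return settings

print(load_settings({"CHUNK_SIZE": "1200", "CHUNK_OVERLAP": "200"})["chunk_size"])  # 1200
```

Passing a plain dictionary keeps the function easy to test. Calling `load_settings()` with no argument reads the process environment instead.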
---
## 🚀 Usage
### Step 1: Ingest Your Documents
Place your documents in the `docs/` folder, then run:
```powershell
python rag\ingest.py
```
**What happens:**
- Extracts all ZIP files (including nested archives)
- Loads documents (Word, PDF, PowerPoint, etc.)
- Splits into chunks using configured size/overlap
- Generates embeddings using your chosen model
- Stores vectors in Chroma database (`storage/chroma/`)
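The nested ZIP step in the list above can be approximated with the standard library. This is a sketch of one possible approach, not the code in `rag/ingest.py`: the `extract_nested_zips` helper and the choice to unpack each archive into a sibling folder named after it are assumptions.

```python
import zipfile
from pathlib import Path

def extract_nested_zips(root):
    """Extract every .zip under root, repeating until no unprocessed archive remains.

    Returns the number of archives extracted.
    """
    root = Path(root)
    done = set()
    while True:
        # Archives unpacked in the previous pass may have revealed new ones.
        pending = [p for p in root.rglob("*.zip") if p not in done]
        if not pending:
            return len(done)
        for archive in pending:
            target = archive.with_suffix("")  # docs/a.zip -> docs/a/
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(target)
            done.add(archive)
```

Calling `extract_nested_zips("docs")` before loading documents would unpack `docs/a.zip` into `docs/a/` and then pick up any archive found inside it. A corrupt archive raises `zipfile.BadZipFile`, which a real ingestion script would likely want to catch and log.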
**⏱️ Time estimate:**
- ~458 documents = ~5-10 minutes with mxbai-embed-large
- Faster with smaller embedding models
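The splitting step uses `CHUNK_SIZE` and `CHUNK_OVERLAP`. A simplified character-window splitter shows how the two values interact. It is an illustration only: the actual ingestion script may use a library splitter that also respects paragraph and sentence boundaries, and `split_text` is a hypothetical name.

```python
def split_text(text, chunk_size=800, chunk_overlap=160):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "".join(chr(97 + i % 26) for i in range(2000))  # 2000 characters of filler
chunks = split_text(sample)
print([len(c) for c in chunks])            # [800, 800, 720]
print(chunks[0][-160:] == chunks[1][:160])  # True
```

With the default settings each window advances 640 characters, so the last 160 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence cut at a boundary retrievable from at least one chunk.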
### Step 2: Start the Frontend (Web Interface)
```powershell
python frontend\app.py
```
The server starts at: **http://127.0.0.1:8000**
Open in your browser and start asking questions!
### Command-Line Query (Alternative)
Test queries without the web interface:
```powershell
python rag\query.py "What is the latest performance of V-PCC for gaussian splat?"
```
## 📊 Improving System Performance
### 🎯 Improve Answer Accuracy
#### 1. **Upgrade Embedding Model**
Better embeddings = better retrieval = better answers
| Model | Size | Dimensions | Best For | Quality |
|-------|------|------------|----------|---------|
| `mxbai-embed-large` | 669MB | 1024 | Technical docs | ⭐⭐⭐⭐⭐ |
| `bge-large` | 1.34GB | 1024 | Highest accuracy | ⭐⭐⭐⭐⭐ |
| `nomic-embed-text` | 274MB | 768 | General purpose | ⭐⭐⭐⭐ |
| `bge-small` (FastEmbed) | ~130MB | 384 | Speed over quality | ⭐⭐⭐ |
**How to upgrade:**
```powershell
# Pull new embedding model
ollama pull bge-large
# Update .env: set EMBEDDING_MODEL=bge-large
# Re-ingest documents (required!)
Remove-Item -Path "storage\chroma" -Recurse -Force
python rag\ingest.py
```
**⚠️ Important:** You MUST re-ingest when changing embedding models! Query embeddings must match stored embeddings.
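A dimension error only catches part of this problem: `mxbai-embed-large` and `bge-large` both produce 1024-dimensional vectors, so switching between them raises no error yet still returns poor matches. One way to guard against that is to record the model name next to the vector store at ingestion time and compare it at query time. The marker file and both helpers below are hypothetical and not part of the project.

```python
import json
import tempfile
from pathlib import Path

MARKER_NAME = "embedding_model.json"  # hypothetical marker file, not created by the project

def record_model(store_dir, model_name, dimension):
    """Write the embedding model used for ingestion next to the vector store."""
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    (store / MARKER_NAME).write_text(json.dumps({"model": model_name, "dimension": dimension}))

def check_model(store_dir, model_name, dimension):
    """Raise if the store was built with a different embedding model."""
    marker = Path(store_dir) / MARKER_NAME
    if not marker.exists():
        return  # nothing recorded yet, so nothing to compare
    saved = json.loads(marker.read_text())
    if (saved["model"], saved["dimension"]) != (model_name, dimension):
        raise RuntimeError(
            f"Store was built with {saved['model']} ({saved['dimension']}-dim) "
            f"but the query uses {model_name} ({dimension}-dim); re-ingest first"
        )

with tempfile.TemporaryDirectory() as store_dir:
    record_model(store_dir, "mxbai-embed-large", 1024)
    check_model(store_dir, "mxbai-embed-large", 1024)  # same model: passes silently
    try:
        check_model(store_dir, "bge-large", 1024)      # same size, different model
    except RuntimeError as err:
        print(err)
```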
#### 2. **Upgrade LLM Model**
Larger models understand context better and generate more accurate answers.
| Model | Size | RAM Needed | Speed | Quality |
|-------|------|------------|-------|---------|
| `qwen2.5:3b-instruct` | 2.0GB | 4GB | Fast | ⭐⭐⭐ |
| `qwen2.5:7b-instruct` | 4.7GB | 8GB | Medium | ⭐⭐⭐⭐ |
| `qwen2.5:14b-instruct` | 8.9GB | 16GB | Slow | ⭐⭐⭐⭐⭐ |
| `qwen2.5:32b` | 19GB | 32GB+ | Very slow | ⭐⭐⭐⭐⭐ |
**How to upgrade:**
```powershell
# Pull new model
ollama pull qwen2.5:14b-instruct
# Update .env: set OLLAMA_MODEL=qwen2.5:14b-instruct
# Restart frontend
python frontend\app.py
```
**No re-ingestion needed** when changing LLM models.
#### 3. **Tune Retrieval Settings**
Balance between recall (finding relevant chunks) and precision (avoiding irrelevant chunks):
```env
# More initial chunks = better recall
RETRIEVAL_CHUNKS=100
# More reranked chunks = more context for LLM
TOP_N_RERANK=15
```
**⚠️ Warning:** Match the number of reranked chunks to the model size. Smaller models lose accuracy when given too much context:
- 7B models: max 10 chunks (get overwhelmed beyond this)
- 14B models: 12-15 chunks optimal
- 32B models: 25-30 chunks
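The guidance above can be encoded as a small lookup so that `TOP_N_RERANK` follows the chosen model. This is a sketch: `suggest_top_n` is a hypothetical helper, it returns the upper end of each range listed above, and it assumes the parameter count appears right after the colon in the Ollama tag.

```python
import re

def suggest_top_n(model_tag):
    """Map an Ollama tag such as 'qwen2.5:14b-instruct' to a TOP_N_RERANK ceiling."""
    size_part = model_tag.split(":", 1)[-1]  # text after the colon, e.g. '14b-instruct'
    match = re.match(r"(\d+(?:\.\d+)?)b", size_part)
    if match is None:
        raise ValueError(f"Cannot read a parameter count from {model_tag!r}")
    billions = float(match.group(1))
    if billions < 14:
        return 10  # 7B-class models: at most 10 chunks
    if billions < 32:
        return 15  # 14B-class models: 12-15 chunks
    return 30      # 32B-class models: 25-30 chunks

for tag in ("qwen2.5:7b-instruct", "qwen2.5:14b-instruct", "qwen2.5:32b"):
    print(tag, suggest_top_n(tag))
```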
#### 4. **Enable Reranking**
Reranking dramatically improves precision by re-scoring retrieved chunks:
```env
USE_RERANKING=true
```
**Impact:** ~30-50% improvement in answer relevance for complex queries.
### ⚡ Improve Speed
#### 1. **Use Smaller Embedding Model**
Trade-off: Speed vs. accuracy
```env
# Fast option (re-ingest after changing the embedding model)
EMBEDDING_MODEL=nomic-embed-text
```
#### 2. **Use Smaller LLM**
```powershell
ollama pull qwen2.5:3b-instruct
```
#### 3. **Reduce Retrieval Chunks**
```env
RETRIEVAL_CHUNKS=20 # Faster search
TOP_N_RERANK=5 # Faster reranking
```
#### 4. **Increase Chunk Size**
Fewer chunks = faster retrieval (but potentially lower accuracy):
```env
CHUNK_SIZE=1200 # Larger chunks = fewer total chunks
CHUNK_OVERLAP=200
```
**⚠️ Note:** Requires re-ingestion!
### 💪 Improve Answer Depth & Usefulness
#### 1. **Optimize Chunk Size for Your Documents**
- **Technical docs with tables/code:** Smaller chunks (600-800)
- **Long-form articles:** Medium chunks (1000-1200)
- **Books/reports:** Larger chunks (1500-2000)
#### 2. **Increase Context for LLM**
More chunks = more comprehensive answers:
```env
RETRIEVAL_CHUNKS=100
TOP_N_RERANK=15 # Only if using 14B+ model!
```
#### 3. **Use Larger Open-Source Models for Best Quality**
For maximum quality while staying open-source and cost-free:
**Option 1: Larger Qwen Models (Recommended)**
```powershell
# Best balance: quality + speed (if you have 16GB+ RAM)
ollama pull qwen2.5:14b-instruct
# Maximum quality (requires 32GB+ RAM)
ollama pull qwen2.5:32b-instruct
```
```env
OLLAMA_MODEL=qwen2.5:14b-instruct
```
**Option 2: Qwen 3 (Newest generation)**
```powershell
# Latest Qwen 3 models (better reasoning)
ollama pull qwen3:8b
ollama pull qwen3:14b
```
**Option 3: DeepSeek-R1 (Strong reasoning)**
```powershell
# Excellent for complex technical questions
ollama pull deepseek-r1:7b
ollama pull deepseek-r1:14b
```
**Quality Comparison (all open-source & free):**
- `qwen2.5:32b` ≈ GPT-4 quality (19GB, very slow on CPU)
- `qwen2.5:14b` ≈ GPT-3.5-Turbo quality (8.9GB, acceptable on CPU)
- `deepseek-r1:14b` - Excellent reasoning (9GB)
- `qwen3:14b` - Latest generation (8.9GB)
**⚠️ No cloud APIs needed!** All models run locally for free.
---
## 🔧 Advanced Configuration
### Chunk Size Guidelines
The `CHUNK_SIZE` parameter controls how documents are split. The best value depends on your document type and the kind of questions you ask:
**When to use SMALLER chunks (600-800):**
- Technical documents with tables and code
- Q&A scenarios (specific fact retrieval)
- Documents with dense, structured information
**When to use LARGER chunks (1200-1500):**
- Long-form content (articles, reports)
- Narrative documents (books, essays)
- When questions require broader context
**⚠️ Context Length Limits:**
- `mxbai-embed-large`: Max ~800 chars/chunk (strict limit)
- `nomic-embed-text`: Max ~1500 chars/chunk
- `bge-large`: Max ~1200 chars/chunk
If ingestion fails with "context length exceeded", reduce CHUNK_SIZE.
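A pre-flight check can catch an oversized `CHUNK_SIZE` before a long ingestion run fails partway through. The sketch below uses the approximate character limits listed above. The `check_chunk_size` helper is hypothetical, and because the real limits are token-based, treat a clean result as a hint rather than a guarantee.

```python
# Approximate per-chunk character limits, copied from the list above.
MAX_CHARS = {
    "mxbai-embed-large": 800,
    "nomic-embed-text": 1500,
    "bge-large": 1200,
}

def check_chunk_size(model_name, chunk_size):
    """Return a warning string if chunk_size exceeds the model's limit, else None."""
    limit = MAX_CHARS.get(model_name)
    if limit is None:
        return None  # unknown model: nothing to compare against
    if chunk_size > limit:
        return (
            f"CHUNK_SIZE={chunk_size} exceeds ~{limit} chars for {model_name}; "
            "reduce it before ingesting"
        )
    return None

print(check_chunk_size("mxbai-embed-large", 1200))  # prints a warning
print(check_chunk_size("nomic-embed-text", 1200))   # None
```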
### RAM Requirements
**Ingestion:**
- Minimum: 8GB RAM
- Recommended: 16GB+ for large document sets
- Embedding models: + model size (270MB - 1.3GB)
**Query:**
- 7B LLM: 8GB minimum
- 14B LLM: 16GB minimum
- 32B LLM: 32GB+ minimum
- Reranking: ~12.5MB per chunk
### 🖥️ Hardware Impact on Performance
#### CPU vs GPU Performance
**CPU-Only Systems (current setup):**
- **7B models:** 3-5 seconds/query (acceptable)
- **14B models:** 8-15 seconds/query (slow but usable)
- **32B models:** 30-60+ seconds/query (very slow)
- **Ingestion:** 5-15 minutes for 458 documents
**With GPU (NVIDIA recommended):**
- **7B models:** 0.5-1 second/query (8-10x faster)
- **14B models:** 1-2 seconds/query (6-8x faster)
- **32B models:** 3-5 seconds/query (10-15x faster)
- **Ingestion:** 1-3 minutes (5-10x faster)
**GPU Requirements:**
- 7B models: 6GB VRAM minimum (RTX 3060, RTX 4060)
- 14B models: 10GB VRAM minimum (RTX 3080, RTX 4070)
- 32B models: 24GB VRAM minimum (RTX 3090, RTX 4090)
#### RAM Impact
| RAM | Max Model | Max Chunks | Experience |
|-----|-----------|------------|------------|
| 8GB | 7B | 50 | Basic, slow |
| 16GB | 14B | 100 | Good |
| 32GB | 32B | 300+ | Excellent |
| 64GB+ | 70B+ | 1000+ | Professional |
**⚠️ Important:** More RAM ≠ faster queries, but allows:
- Larger models (better quality)
- More retrieval chunks (better recall)
- Multiple processes without swapping
#### Storage Impact
**SSD vs HDD:**
- **SSD (Recommended):** Vector store loads in 0.5-1 second
- **HDD:** Vector store loads in 3-5 seconds
- **NVMe SSD:** Vector store loads in 0.2-0.5 second
**Model Storage:**
- 3B model: ~2GB
- 7B model: ~5GB
- 14B model: ~9GB
- 32B model: ~19GB
- Embedding models: 270MB - 1.3GB
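- Vector database: roughly 40MB or more per 10,000 chunks at 1024 dimensions (raw float32 vectors alone; stored text and index overhead add to this)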
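Vector store size scales with both the chunk count and the embedding dimension. A rough lower bound can be computed directly, assuming each vector component is stored as a 4-byte float (an assumption about the storage format) and ignoring chunk text, metadata, and index overhead:

```python
def raw_vector_megabytes(num_chunks, dimensions, bytes_per_value=4):
    """Lower bound on vector storage in MB: chunks x dimensions x bytes per value."""
    return num_chunks * dimensions * bytes_per_value / 1_000_000

for model, dims in [("mxbai-embed-large", 1024), ("nomic-embed-text", 768)]:
    size = raw_vector_megabytes(30_000, dims)
    print(f"{model}: {size:.1f} MB for 30,000 chunks")
# mxbai-embed-large: 122.9 MB for 30,000 chunks
# nomic-embed-text: 92.2 MB for 30,000 chunks
```

The actual on-disk footprint of `storage/chroma/` will be larger than this figure, so measuring the folder after ingestion gives the most reliable number.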
#### CPU Impact on Ingestion
**Embedding Generation (CPU-bound):**
- **8 CPU cores:** ~8-12 minutes (458 docs)
- **16 CPU cores:** ~5-8 minutes
- **32 CPU cores:** ~3-5 minutes
**Document Loading (I/O + CPU):**
- Single-core: 1 document/second
- Multi-core: 5-10 documents/second (parallel processing)
#### Optimal Hardware Recommendations
**Budget Setup ($0 upgrade cost - current):**
- CPU: Any modern CPU (4+ cores)
- RAM: 8-16GB
- Storage: Any SSD
- Model: `qwen2.5:7b-instruct`
- Expected: 3-5s queries, adequate quality
**Recommended Setup ($300-500):**
- CPU: Intel i5/Ryzen 5+ (8+ cores)
- RAM: 16GB
- Storage: SSD (500GB+)
- GPU: RTX 3060 (12GB VRAM)
- Model: `qwen2.5:14b-instruct`
- Expected: 1-2s queries, excellent quality
**Professional Setup ($1500-2000):**
- CPU: Intel i7/Ryzen 7+ (12+ cores)
- RAM: 32GB
- Storage: NVMe SSD (1TB+)
- GPU: RTX 4080/4090 (16-24GB VRAM)
- Model: `qwen2.5:32b`
- Expected: <1s queries, GPT-4 quality
**💡 Key Insight:** Even budget CPU-only setups work fine! A GPU mainly improves speed, not quality. The open-source approach keeps software and API costs at $0 regardless of hardware.
---
## 📁 Project Structure
```
LLMQwen/
├── docs/ # Your documents go here
├── storage/
│ └── chroma/ # Vector database storage
├── rag/
│ ├── ingest.py # Document ingestion script
│ └── query.py # Query execution script
├── frontend/
│ ├── app.py # Web server (backend)
│ ├── index.html # Web interface
│ ├── script.js # Frontend logic
│ └── style.css # UI styling
├── .env # Your configuration
├── .env.example # Configuration template
├── requirements.txt # Python dependencies
└── README.md # This file
```
---
## 🐛 Troubleshooting
### "Collection expecting embedding with dimension of X, got Y"
**Cause:** Embedding model mismatch between ingestion and query.
**Solution:**
```powershell
# Clear vector store
Remove-Item -Path "storage\chroma" -Recurse -Force
# Re-ingest with correct model
python rag\ingest.py
```
### "the input length exceeds the context length"
**Cause:** Chunks too large for embedding model.
**Solution:** Reduce CHUNK_SIZE in .env:
```env
CHUNK_SIZE=600
CHUNK_OVERLAP=120
```
Then re-ingest.
### Ollama Connection Error
**Cause:** Ollama service not running.
**Solution:**
```powershell
ollama serve
```
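To confirm whether Ollama is reachable before starting the frontend, a short standard-library probe is enough. This sketch requests Ollama's `/api/tags` endpoint (which lists local models) at the `OLLAMA_BASE_URL` from the configuration. The `ollama_is_up` helper is hypothetical and not part of the project.

```python
import urllib.request

def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the server at base_url answers /api/tags with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors (URLError subclasses OSError).
        return False

print("Ollama reachable:", ollama_is_up())
```

If this prints `False` while `ollama serve` is running, check that `OLLAMA_BASE_URL` in `.env` matches the host and port Ollama is listening on.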
### Out of Memory During Query
**Cause:** Too many chunks for available RAM.
**Solution:** Reduce RETRIEVAL_CHUNKS:
```env
RETRIEVAL_CHUNKS=20
TOP_N_RERANK=5
```
---
## 📈 Performance Benchmarks
Based on 458 documents (~30,000 chunks):
| Configuration | Ingestion Time | Query Time | Accuracy* |
|---------------|----------------|------------|-----------|
| nomic + 7B | 5 min | 3-5s | ⭐⭐⭐⭐ |
| mxbai + 7B | 8 min | 3-5s | ⭐⭐⭐⭐⭐ |
| mxbai + 14B | 8 min | 8-12s | ⭐⭐⭐⭐⭐ |
| bge-large + 14B | 12 min | 8-12s | ⭐⭐⭐⭐⭐ |
*Accuracy for technical documentation queries
---
## 🤝 Contributing
Contributions welcome! Key areas:
- Additional document loaders
- Embedding model benchmarks
- Prompt engineering improvements
- UI enhancements
---
## 📄 License
MIT License - see LICENSE file for details
---
## 🙏 Acknowledgments
Built with:
- [LangChain](https://github.com/langchain-ai/langchain) - RAG framework
- [Chroma](https://github.com/chroma-core/chroma) - Vector database
- [Ollama](https://ollama.ai/) - Local LLM runtime
- [Flashrank](https://github.com/PrithivirajDamodaran/FlashRank) - Fast reranking
- [Qwen](https://github.com/QwenLM/Qwen) - Language models