# RAG System with Multi-Model Support
A flexible Retrieval-Augmented Generation (RAG) system that lets you query your documents using local or cloud-based LLMs. Features intelligent document processing, semantic search with reranking, and a simple web interface.
## 🎯 What This System Does
1. **Ingests** your documents (Word, PDF, PowerPoint, Text, Markdown) into a vector database
2. **Retrieves** relevant chunks using semantic search when you ask a question
3. **Reranks** results to find the most relevant information
4. **Generates** comprehensive answers using your choice of LLM
## 🏗️ Architecture
```
Documents (docs/)
↓
Document Processing & Chunking
↓
Embedding Model → Vector Database (Chroma)
↓
Query → Semantic Search (50 chunks)
↓
Reranking (Top 10 chunks)
↓
LLM (Qwen/Mistral/OpenAI) → Answer
```
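
The retrieve-then-rerank flow above can be sketched in a few lines of plain Python. This is an illustrative toy, not the project's code: `embed`, `retrieve`, and `rerank` are hypothetical names, a bag-of-words counter stands in for the real embedding model, and a word-overlap score stands in for Flashrank.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """Stage 1: keep the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Stage 2: re-score the candidates with a second scorer and keep top_n."""
    q_terms = set(embed(query))

    def jaccard(chunk: str) -> float:
        terms = set(embed(chunk))
        union = q_terms | terms
        return len(q_terms & terms) / len(union) if union else 0.0

    return sorted(candidates, key=jaccard, reverse=True)[:top_n]
```

In the pipeline drawn above, the first stage would return 50 candidates and the second would keep the 10 best before they are passed to the LLM.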
## ✨ Features
- **Multiple LLM Providers**: Ollama (local), Mistral AI, or OpenAI
- **Flexible Embeddings**: Ollama models (mxbai, nomic, bge) or FastEmbed
- **Smart Retrieval**: Semantic search + Flashrank reranking for precision
- **Document Support**: Word, PDF, PowerPoint, Text, Markdown, ODT
- **ZIP Extraction**: Automatically extracts nested ZIP archives
- **Web Interface**: Simple HTTP server with query interface
- **CPU-Friendly**: Optimized for systems without GPU
---
## 📦 Installation
### 1. Install Ollama (for local models)
**Windows (via Winget):**
```powershell
winget install --id Ollama.Ollama -e
```
Verify installation:
```powershell
ollama --version
```
Ollama runs as a Windows service automatically. If not running:
```powershell
ollama serve
```
### 2. Pull Required Models
**LLM Model (for answering queries):**
```powershell
# Recommended: 7B model (4.7GB, good balance)
ollama pull qwen2.5:7b-instruct
# Alternative: Smaller for low-end CPUs (2.0GB)
ollama pull qwen2.5:3b-instruct
# Alternative: Larger for better accuracy (8.9GB)
ollama pull qwen2.5:14b-instruct
```
**Embedding Model (for semantic search):**
```powershell
# Recommended: Best for technical documents (669MB, 1024-dim)
ollama pull mxbai-embed-large
# Alternative: Lighter option (274MB, 768-dim)
ollama pull nomic-embed-text
```
### 3. Install Pandoc (for ODT files)
Optional, only if you have OpenDocument files:
```powershell
winget install --id JohnMacFarlane.Pandoc -e
```
### 4. Setup Python Environment
**Create virtual environment:**
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```
**Note:** If you get an execution policy error:
```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```
**Install dependencies:**
```powershell
pip install -r requirements.txt
```
---
## ⚙️ Configuration
### 1. Create .env file
```powershell
Copy-Item .env.example .env
```
### 2. Configure Settings
Edit `.env` with your preferred settings:
**LLM Configuration:**
```env
MODEL_PROVIDER=ollama # Options: ollama, mistral, openai
OLLAMA_MODEL=qwen2.5:14b-instruct # The model for generating answers
OLLAMA_BASE_URL=http://localhost:11434
```
**Embedding Configuration:**
```env
EMBEDDING_PROVIDER=ollama # Options: ollama, fastembed
EMBEDDING_MODEL=mxbai-embed-large # Must match during ingestion & query!
```
**Retrieval Settings:**
```env
RETRIEVAL_CHUNKS=100 # How many chunks to retrieve initially
TOP_N_RERANK=15 # Final chunks sent to LLM after reranking
USE_RERANKING=true # Enable for better accuracy
```
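
As a rough illustration of how a script might consume these values, the sketch below reads them from a mapping such as `os.environ`. The `RetrievalSettings` class and `load_retrieval_settings` function are hypothetical, the fallback values simply mirror the example above, and the code assumes inline comments were already stripped by whatever loads `.env`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalSettings:
    retrieval_chunks: int
    top_n_rerank: int
    use_reranking: bool


def load_retrieval_settings(env=None) -> RetrievalSettings:
    """Read the three retrieval variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    settings = RetrievalSettings(
        retrieval_chunks=int(env.get("RETRIEVAL_CHUNKS", "100")),
        top_n_rerank=int(env.get("TOP_N_RERANK", "15")),
        # Accept the usual truthy spellings; anything else disables reranking.
        use_reranking=env.get("USE_RERANKING", "true").strip().lower()
        in {"1", "true", "yes"},
    )
    # The reranker can only keep chunks that the first stage returned.
    if settings.top_n_rerank > settings.retrieval_chunks:
        raise ValueError("TOP_N_RERANK cannot exceed RETRIEVAL_CHUNKS")
    return settings
```

Failing early on an inconsistent pair is a design choice of this sketch: it surfaces a typo in `.env` at startup instead of during a query.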
**Document Processing:**
```env
CHUNK_SIZE=800 # Characters per chunk (smaller = more chunks)
CHUNK_OVERLAP=160 # Overlap prevents context loss
BATCH_SIZE=100 # Documents processed per batch
```
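
To make the interaction between these three values concrete, here is a minimal fixed-window sketch. It is an assumption-laden stand-in, not the splitter used by `rag\ingest.py` (which is not shown here): `chunk_text` and `batched` are hypothetical helpers that cut on raw character offsets.

```python
from typing import Iterator


def chunk_text(text: str, chunk_size: int = 800, overlap: int = 160) -> list[str]:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def batched(items: list[str], batch_size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

With the values above, each window advances 640 characters, so the last 160 characters of one chunk reappear at the start of the next.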
---
## 🚀 Usage
### Step 1: Ingest Your Documents
Place your documents in the `docs/` folder, then run:
```powershell
python rag\ingest.py
```
**What happens:**
- Extracts all ZIP files (including nested archives)
- Loads documents (Word, PDF, PowerPoint, etc.)
- Splits into chunks using configured size/overlap
- Generates embeddings using your chosen model
- Stores vectors in Chroma database (`storage/chroma/`)
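
The first step, extracting archives that may contain further archives, could look roughly like the sketch below. It uses only the standard library; `extract_nested_zips`, the `max_depth` safety limit, and the choice to unpack each archive into a sibling folder named after it are assumptions for illustration, not the behavior of `rag\ingest.py`.

```python
import zipfile
from pathlib import Path


def extract_nested_zips(root: Path, max_depth: int = 5) -> list[Path]:
    """Extract every .zip under root, then any archives found inside them.

    Each archive is unpacked into a sibling folder named after it
    (docs/a.zip -> docs/a/). Returns the archives that were extracted.
    """
    root = Path(root)
    extracted: list[Path] = []
    seen: set[Path] = set()
    for _ in range(max_depth):  # each pass handles one more nesting level
        pending = [
            p for p in sorted(root.rglob("*.zip"))
            if p.is_file() and p not in seen
        ]
        if not pending:
            break
        for archive in pending:
            seen.add(archive)
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(archive.with_suffix(""))
            extracted.append(archive)
    return extracted
```

Only run something like this on archives you trust: extraction writes whatever the archive contains to disk.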
**⏱️ Time estimate:**
- ~458 documents = ~5-10 minutes with mxbai-embed-large
- Faster with smaller embedding models
### Step 2: Start the Frontend (Web Interface)
```powershell
python frontend\app.py
```
The server starts at: **http://127.0.0.1:8000**
Open in your browser and start asking questions!
### Command-Line Query (Alternative)
Test queries without the web interface:
```powershell
python rag\query.py "What is the latest performance of V-PCC for gaussian splat?"
```
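
How `rag\query.py` builds its prompt is not shown in this README. As a hedged sketch of the final generation step, the snippet below numbers the retrieved chunks, wraps them in a prompt, and sends it to Ollama's `/api/generate` endpoint using only the standard library. `build_prompt`, `ask_ollama`, and the prompt wording are assumptions for illustration.

```python
import json
import urllib.request


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def ask_ollama(
    question: str,
    chunks: list[str],
    model: str = "qwen2.5:14b-instruct",
    base_url: str = "http://localhost:11434",
) -> str:
    """Send one non-streaming generate request to a running Ollama server."""
    payload = {
        "model": model,
        "prompt": build_prompt(question, chunks),
        "stream": False,  # ask for a single JSON object instead of a stream
    }
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `ask_ollama` requires Ollama to be running and the model to be pulled; `build_prompt` can be inspected on its own without a server.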
## 📊 Improving System Performance
### 🎯 Improve Answer Accuracy
#### 1. **Upgrade Embedding Model**
Better embeddings = better retrieval = better answers
| Model | Size | Dimensions | Best For | Quality |
|-------|------|------------|----------|---------|
| `mxbai-embed-large` | 669MB | 1024 | Technical docs | ⭐⭐⭐⭐⭐ |
| `bge-large` | 1.34GB | 1024 | Highest accuracy | ⭐⭐⭐⭐⭐ |
| `nomic-embed-text` | 274MB | 768 | General purpose | ⭐⭐⭐⭐ |
| `bge-small` (FastEmbed) | ~130MB | 384 | Speed over quality | ⭐⭐⭐ |
**How to upgrade:**
```powershell
# Pull new embedding model
ollama pull bge-large
# In .env, set: EMBEDDING_MODEL=bge-large
# Re-ingest documents (required!)
Remove-Item -Path "storage\chroma" -Recurse -Force
python rag\ingest.py
```
**⚠️ Important:** You MUST re-ingest when changing embedding models! Query embeddings must match stored embeddings.
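
A dimension check alone does not catch every mismatch: `mxbai-embed-large` and `bge-large` both produce 1024-dimensional vectors, so swapping one for the other raises no error but compares vectors from unrelated spaces. One possible guard, sketched below under stated assumptions, is to record the model name next to the vector store at ingestion time and compare it before querying. The marker file name and both function names are hypothetical and not part of this project.

```python
from pathlib import Path

MARKER_NAME = "embedding_model.txt"  # hypothetical marker file


def record_embedding_model(store_dir: Path, model: str) -> None:
    """Call after ingestion: remember which model produced the stored vectors."""
    store_dir.mkdir(parents=True, exist_ok=True)
    (store_dir / MARKER_NAME).write_text(model, encoding="utf-8")


def assert_same_embedding_model(store_dir: Path, model: str) -> None:
    """Call before querying: refuse to mix vectors from different models."""
    marker = store_dir / MARKER_NAME
    if not marker.exists():
        return  # nothing recorded yet, so there is nothing to compare
    stored = marker.read_text(encoding="utf-8").strip()
    if stored != model:
        raise RuntimeError(
            f"Vector store was built with {stored!r} but EMBEDDING_MODEL is "
            f"{model!r}; delete the store and re-ingest."
        )
```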
#### 2. **Upgrade LLM Model**
Larger models understand context better and generate more accurate answers.
| Model | Size | RAM Needed | Speed | Quality |
|-------|------|------------|-------|---------|
| `qwen2.5:3b-instruct` | 2.0GB | 4GB | Fast | ⭐⭐⭐ |
| `qwen2.5:7b-instruct` | 4.7GB | 8GB | Medium | ⭐⭐⭐⭐ |
| `qwen2.5:14b-instruct` | 8.9GB | 16GB | Slow | ⭐⭐⭐⭐⭐ |
| `qwen2.5:32b` | 19GB | 32GB+ | Very slow | ⭐⭐⭐⭐⭐ |
**How to upgrade:**
```powershell
# Pull new model
ollama pull qwen2.5:14b-instruct
# In .env, set: OLLAMA_MODEL=qwen2.5:14b-instruct
# Restart frontend
python frontend\app.py
```
**No re-ingestion needed** when changing LLM models.
#### 3. **Tune Retrieval Settings**
Balance between recall (finding relevant chunks) and precision (avoiding irrelevant chunks):
```env
# More initial chunks = better recall
RETRIEVAL_CHUNKS=100
# More reranked chunks = more context for LLM
TOP_N_RERANK=15
```
**⚠️ Warning:** Larger models handle more chunks better!
- 7B models: max 10 chunks (get overwhelmed beyond this)
- 14B models: 12-15 chunks optimal
- 32B models: 25-30 chunks
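
To get a feel for how much text each setting sends to the model, the back-of-the-envelope sketch below converts a chunk count into an approximate token count. The figure of 4 characters per token is a rough rule of thumb for English text, not a property of any specific tokenizer, and `estimate_context_tokens` is a hypothetical helper.

```python
def estimate_context_tokens(
    top_n: int, chunk_size: int, chars_per_token: float = 4.0
) -> int:
    """Approximate tokens of retrieved context sent to the LLM per query."""
    return round(top_n * chunk_size / chars_per_token)


# With CHUNK_SIZE=800: 10 chunks is about 2,000 tokens, 15 about 3,000,
# and 30 about 6,000 (before the question and instructions are added).
for top_n in (10, 15, 30):
    print(top_n, "chunks ->", estimate_context_tokens(top_n, 800), "tokens")
```

Compare the result against the context window configured for your model: context that does not fit is typically truncated rather than used.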
#### 4. **Enable Reranking**
Reranking dramatically improves precision by re-scoring retrieved chunks:
```env
USE_RERANKING=true
```
**Impact:** ~30-50% improvement in answer relevance for complex queries.
### ⚡ Improve Speed
#### 1. **Use Smaller Embedding Model**
Trade-off: Speed vs. accuracy
```env
# Fast option (re-ingest after changing the embedding model)
EMBEDDING_MODEL=nomic-embed-text
```
#### 2. **Use Smaller LLM**
```powershell
ollama pull qwen2.5:3b-instruct
```
#### 3. **Reduce Retrieval Chunks**
```env
RETRIEVAL_CHUNKS=20 # Faster search
TOP_N_RERANK=5 # Faster reranking
```
#### 4. **Increase Chunk Size**
Fewer chunks = faster retrieval (but potentially lower accuracy):
```env
CHUNK_SIZE=1200 # Larger chunks = fewer total chunks
CHUNK_OVERLAP=200
```
**⚠️ Note:** Requires re-ingestion!
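
The effect of chunk size on the number of chunks can be worked out directly. The sketch below assumes plain fixed-size windows that advance by `chunk_size - overlap` characters, which is a simplification of whatever splitter the ingestion script uses; `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Number of fixed-size windows needed to cover total_chars characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap  # new characters covered by each extra chunk
    return 1 + math.ceil((total_chars - chunk_size) / step)


# One million characters of text:
print(estimate_chunk_count(1_000_000, 800, 160))   # 1563
print(estimate_chunk_count(1_000_000, 1200, 200))  # 1000
```

Under these assumptions, moving from 800/160 to 1200/200 cuts the chunk count by roughly a third, which shrinks both ingestion time and the vector store.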
### 💪 Improve Answer Depth & Usefulness
#### 1. **Optimize Chunk Size for Your Documents**
- **Technical docs with tables/code:** Smaller chunks (600-800)
- **Long-form articles:** Medium chunks (1000-1200)
- **Books/reports:** Larger chunks (1500-2000)
#### 2. **Increase Context for LLM**
More chunks = more comprehensive answers:
```env
RETRIEVAL_CHUNKS=100
TOP_N_RERANK=15 # Only if using 14B+ model!
```
#### 3. **Use Larger Open-Source Models for Best Quality**
For maximum quality while staying open-source and cost-free:
**Option 1: Larger Qwen Models (Recommended)**
```powershell
# Best balance: quality + speed (if you have 16GB+ RAM)
ollama pull qwen2.5:14b-instruct
# Maximum quality (requires 32GB+ RAM)
ollama pull qwen2.5:32b-instruct
```
```env
OLLAMA_MODEL=qwen2.5:14b-instruct
```
**Option 2: Qwen 3 (Newest generation)**
```powershell
# Latest Qwen 3 models (better reasoning)
ollama pull qwen3:8b
ollama pull qwen3:14b
```
**Option 3: DeepSeek-R1 (Strong reasoning)**
```powershell
# Excellent for complex technical questions
ollama pull deepseek-r1:7b
ollama pull deepseek-r1:14b
```
**Quality Comparison (all open-source & free):**
- `qwen2.5:32b` ≈ GPT-4 quality (19GB, very slow on CPU)
- `qwen2.5:14b` ≈ GPT-3.5-Turbo quality (8.9GB, acceptable on CPU)
- `deepseek-r1:14b` - Excellent reasoning (9GB)
- `qwen3:14b` - Latest generation (8.9GB)
**⚠️ No cloud APIs needed!** All models run locally for free.
---
## 🔧 Advanced Configuration
### Chunk Size Guidelines
The `CHUNK_SIZE` parameter controls how documents are split. The best value depends on your document types and the kind of questions you ask:
**When to use SMALLER chunks (600-800):**
- Technical documents with tables and code
- Q&A scenarios (specific fact retrieval)
- Documents with dense, structured information
**When to use LARGER chunks (1200-1500):**
- Long-form content (articles, reports)
- Narrative documents (books, essays)
- When questions require broader context
**⚠️ Context Length Limits:**
- `mxbai-embed-large`: Max ~800 chars/chunk (strict limit)
- `nomic-embed-text`: Max ~1500 chars/chunk
- `bge-large`: Max ~1200 chars/chunk
If ingestion fails with "context length exceeded", reduce CHUNK_SIZE.
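
A check like the following could catch an oversized `CHUNK_SIZE` before a long ingestion run starts. The limits are copied from the list above and are approximate character counts, not exact tokenizer limits; `MAX_CHUNK_CHARS` and `validate_chunk_size` are hypothetical names.

```python
# Approximate per-model limits, in characters per chunk (from the list above).
MAX_CHUNK_CHARS = {
    "mxbai-embed-large": 800,
    "nomic-embed-text": 1500,
    "bge-large": 1200,
}


def validate_chunk_size(model: str, chunk_size: int) -> None:
    """Raise if chunk_size exceeds the approximate limit listed for the model."""
    limit = MAX_CHUNK_CHARS.get(model)
    if limit is None:
        return  # unknown model: no documented limit to check against
    if chunk_size > limit:
        raise ValueError(
            f"CHUNK_SIZE={chunk_size} exceeds the ~{limit} character limit "
            f"for {model}; reduce CHUNK_SIZE before ingesting."
        )
```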
### RAM Requirements
**Ingestion:**
- Minimum: 8GB RAM
- Recommended: 16GB+ for large document sets
- Embedding models: + model size (270MB - 1.3GB)
**Query:**
- 7B LLM: 8GB minimum
- 14B LLM: 16GB minimum
- 32B LLM: 32GB+ minimum
- Reranking: ~12.5MB per chunk
### 🖥️ Hardware Impact on Performance
#### 🔍 Understanding CPU, GPU, and RAM Roles
Your RAG system has **3 main operations**, and each hardware component plays a specific role:
**📥 1. INGESTION (Creating Vector Database)**
```
Documents → Text Extraction → Chunking → Embedding Creation → Store in ChromaDB
```
**Component Roles During Ingestion:**
- **CPU:**
  - Reads files from disk (unzipping, PDF parsing)
  - Splits text into chunks
  - **Runs embedding model** (mxbai-embed-large) to convert chunks to vectors
  - Saves vectors to ChromaDB database
  - **🎯 Impact:** More CPU cores = faster parallel processing (5-15 min for 458 docs)
- **GPU:**
  - **Can accelerate embeddings** if Ollama uses GPU mode
  - **🎯 Impact:** **5-10x faster ingestion** (1-3 min instead of 5-15 min)
  - **Note:** Your current RTX 5050 8GB works for embeddings during ingestion!
- **RAM:**
  - Temporarily holds document content and chunks before processing
  - **🎯 Impact:** 16GB allows processing large document batches smoothly
  - **Insufficient RAM = system swaps to disk = dramatically slower**
**🔎 2. QUERY RETRIEVAL (Finding Relevant Context)**
```
Your Question → Embedding → Search Vector DB → Retrieve Chunks → Rerank → Top 15 Chunks
```
**Component Roles During Retrieval:**
- **CPU:**
  - Converts your question to an embedding vector
  - Searches ChromaDB database (vector similarity calculation)
  - Runs reranking model (Flashrank) on 100 → 15 chunks
  - **🎯 Impact:** Fast enough (0.5-1 second total), rarely a bottleneck
- **GPU:**
  - **Not used** for retrieval in this project
  - Embeddings and reranking run on CPU only
  - **🎯 Impact:** None
- **RAM:**
  - Loads vector database into memory
  - Holds 100 retrieved chunks during reranking
  - **🎯 Impact:** 16GB is more than enough for retrieval
**🤖 3. ANSWER GENERATION (LLM Response)**
```
Your Question + Top 15 Chunks → LLM (qwen2.5:14b-instruct) → Detailed Answer
```
**Component Roles During Answer Generation:**
- **CPU:**
  - **Runs the entire LLM** when GPU is disabled (OLLAMA_NUM_GPU=0)
  - Processes tokens one-by-one through 14 billion parameters
  - **🎯 Impact:** 8-15 seconds per answer (slow but works!)
- **GPU:**
  - **Runs the entire LLM** when GPU is enabled (default)
  - Processes tokens MUCH faster using parallel computation
  - **🎯 Impact:** **6-10x faster** (1-2 sec instead of 8-15 sec)
  - **⚠️ Problem:** Your RTX 5050 (8GB VRAM) is too small for 14B model (needs 10GB)
  - **Why you use CPU mode:** 14B doesn't fit in 8GB VRAM → CUDA error → must disable GPU
- **RAM:**
  - **Stores the entire LLM model** in CPU mode
  - 14B model = ~9GB loaded into RAM
  - Also holds context (your question + 15 chunks)
  - **🎯 Impact:** 16GB is minimum for 14B, 32GB better for 32B
  - **Insufficient RAM = model won't load at all**
#### 📊 Component Impact Summary Table
| Component | Ingestion Speed | Query Retrieval | Answer Speed | Answer Quality |
|-----------|----------------|-----------------|--------------|----------------|
| **CPU** | ⭐⭐⭐ Major | ⭐⭐⭐ Critical | ⭐⭐⭐ Critical (CPU mode) | ❌ No impact |
| **GPU** | ⭐⭐ Helpful | ❌ Not used | ⭐⭐⭐ Critical (GPU mode) | ❌ No impact |
| **RAM** | ⭐ Minor | ⭐ Minor | ⭐⭐⭐ Critical | ⭐⭐ Indirect* |
| **Storage (SSD)** | ⭐ Minor | ⭐ Minor | ❌ No impact | ❌ No impact |
**\*** More RAM → allows larger models → better quality answers
#### 🎯 Why Your Current Setup Uses CPU-Only Mode
**Your Hardware:** RTX 5050 8GB VRAM, 16GB RAM, 24 CPU cores
**The Problem:**
1. **Embeddings (mxbai-embed-large):** 669MB → ✅ Fits in 8GB GPU → Works great!
2. **14B LLM Model:** Needs ~10GB VRAM → ❌ Only 8GB available → CUDA error!
**Your Choice:** Use 14B on CPU (slow but best quality) instead of 7B on GPU (fast but lower quality)
**Result:**
- **During ingestion:** GPU helps with embeddings → faster (2-5 min)
- **During queries:** LLM runs on CPU → slower (8-15 sec) but highest quality answers
#### CPU vs GPU Performance
**CPU-Only Systems (current setup):**
- **7B models:** 3-5 seconds/query (acceptable)
- **14B models:** 8-15 seconds/query (slow but usable)
- **32B models:** 30-60+ seconds/query (very slow)
- **Ingestion:** 5-15 minutes for 458 documents
**With GPU (NVIDIA recommended):**
- **7B models:** 0.5-1 second/query (8-10x faster)
- **14B models:** 1-2 seconds/query (6-8x faster)
- **32B models:** 3-5 seconds/query (10-15x faster)
- **Ingestion:** 1-3 minutes (5-10x faster)
**GPU Requirements:**
- 7B models: 6GB VRAM minimum (RTX 3060, RTX 4060)
- 14B models: 10GB VRAM minimum (RTX 3080, RTX 4070)
- 32B models: 24GB VRAM minimum (RTX 3090, RTX 4090)
**⚠️ GPU Memory Insufficient?**
If you have a GPU but get CUDA errors (e.g., RTX 5050 with 8GB trying to run 14B):
- **Option 1:** Force CPU-only mode (see Troubleshooting section)
- **Option 2:** Use smaller model (7B works on 8GB VRAM)
- **Trade-off:** CPU is slower but works with any model size
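
The decision described above can be written down as a small lookup. The sketch below uses the VRAM minimums from the GPU Requirements list; `MIN_VRAM_GB`, `parameter_count_b`, and `choose_device` are hypothetical helpers, and parsing the size out of the model tag is an assumption about Ollama tag naming such as `qwen2.5:14b-instruct`.

```python
import re

# Minimum VRAM in GB per model size in billions of parameters (from the list above).
MIN_VRAM_GB = {7: 6, 14: 10, 32: 24}


def parameter_count_b(model_tag: str) -> int:
    """Extract the size from a tag like 'qwen2.5:14b-instruct' (returns 14)."""
    match = re.search(r":(\d+)b", model_tag.lower())
    if not match:
        raise ValueError(f"cannot find a parameter count in {model_tag!r}")
    return int(match.group(1))


def choose_device(model_tag: str, vram_gb: float) -> str:
    """Return 'gpu' if the model meets the listed VRAM minimum, else 'cpu'."""
    size = parameter_count_b(model_tag)
    if size not in MIN_VRAM_GB:
        raise ValueError(f"no VRAM figure listed for {size}B models")
    return "gpu" if vram_gb >= MIN_VRAM_GB[size] else "cpu"
```

For the 8GB card used as the running example, `choose_device("qwen2.5:14b-instruct", 8)` returns `"cpu"` while `choose_device("qwen2.5:7b-instruct", 8)` returns `"gpu"`, matching the two options listed above.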
#### RAM Impact
| RAM | Max Model | Max Chunks | Experience |
|-----|-----------|------------|------------|
| 8GB | 7B | 50 | Basic, slow |
| 16GB | 14B | 100 | Good |
| 32GB | 32B | 300+ | Excellent |
| 64GB+ | 70B+ | 1000+ | Professional |
**⚠️ Important:** More RAM ≠ faster queries, but allows:
- Larger models (better quality)
- More retrieval chunks (better recall)
- Multiple processes without swapping
#### Storage Impact
**SSD vs HDD:**
- **SSD (Recommended):** Vector store loads in 0.5-1 second
- **HDD:** Vector store loads in 3-5 seconds
- **NVMe SSD:** Vector store loads in 0.2-0.5 second
**Model Storage:**
- 3B model: ~2GB
- 7B model: ~5GB
- 14B model: ~9GB
- 32B model: ~19GB
- Embedding models: 270MB - 1.3GB
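- Vector database: roughly 40MB of raw vectors per 10,000 chunks at 1024 dimensions (float32, 4 bytes per value), plus stored text and index overhead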
#### CPU Impact on Ingestion
**Embedding Generation (CPU-bound):**
- **8 CPU cores:** ~8-12 minutes (458 docs)
- **16 CPU cores:** ~5-8 minutes
- **32 CPU cores:** ~3-5 minutes
**Document Loading (I/O + CPU):**
- Single-core: 1 document/second
- Multi-core: 5-10 documents/second (parallel processing)
#### Optimal Hardware Recommendations
**Budget Setup ($0 upgrade cost - current):**
- CPU: Any modern CPU (4+ cores)
- RAM: 8-16GB
- Storage: Any SSD
- Model: `qwen2.5:7b-instruct`
- Expected: 3-5s queries, adequate quality
**Recommended Setup ($300-500):**
- CPU: Intel i5/Ryzen 5+ (8+ cores)
- RAM: 16GB
- Storage: SSD (500GB+)
- GPU: RTX 3060 (12GB VRAM)
- Model: `qwen2.5:14b-instruct`
- Expected: 1-2s queries, excellent quality
**Professional Setup ($1500-2000):**
- CPU: Intel i7/Ryzen 7+ (12+ cores)
- RAM: 32GB
- Storage: NVMe SSD (1TB+)
- GPU: RTX 4080/4090 (16-24GB VRAM)
- Model: `qwen2.5:32b`
- Expected: <1s queries, GPT-4 quality
**💡 Key Insight:** Even budget CPU-only setups work fine! GPU mainly improves speed, not quality. The open-source approach keeps costs at $0 regardless of hardware.
---
## 📁 Project Structure
```
LLMQwen/
├── docs/                # Your documents go here
├── storage/
│   └── chroma/          # Vector database storage
├── rag/
│   ├── ingest.py        # Document ingestion script
│   └── query.py         # Query execution script
├── frontend/
│   ├── app.py           # Web server (backend)
│   ├── index.html       # Web interface
│   ├── script.js        # Frontend logic
│   └── style.css        # UI styling
├── .env                 # Your configuration
├── .env.example         # Configuration template
├── requirements.txt     # Python dependencies
└── README.md            # This file
```
---
## 🐛 Troubleshooting
### "Collection expecting embedding with dimension of X, got Y"
**Cause:** Embedding model mismatch between ingestion and query.
**Solution:**
```powershell
# Clear vector store
Remove-Item -Path "storage\chroma" -Recurse -Force
# Re-ingest with correct model
python rag\ingest.py
```
### "the input length exceeds the context length"
**Cause:** Chunks too large for embedding model.
**Solution:** Reduce CHUNK_SIZE in .env:
```env
CHUNK_SIZE=600
CHUNK_OVERLAP=120
```
Then re-ingest.
### Ollama Connection Error
**Cause:** Ollama service not running.
**Solution:**
```powershell
ollama serve
```
### Out of Memory During Query
**Cause:** Too many chunks for available RAM.
**Solution:** Reduce RETRIEVAL_CHUNKS:
```env
RETRIEVAL_CHUNKS=20
TOP_N_RERANK=5
```
### CUDA Error / GPU Out of Memory
**Error:** `llama runner process has terminated: CUDA error`
**Cause:** Model too large for your GPU VRAM, or GPU memory is full from other processes.
**GPU VRAM Requirements:**
- 7B models: 6GB VRAM minimum
- 14B models: 10GB VRAM minimum
- 32B models: 24GB VRAM minimum
**Check your GPU:**
```powershell
nvidia-smi
```
**Solution 1: Force CPU-Only Mode (if insufficient VRAM)**
Permanently enable CPU-only mode:
```powershell
[System.Environment]::SetEnvironmentVariable('OLLAMA_NUM_GPU', '0', 'User')
$env:OLLAMA_NUM_GPU = '0'
```
Restart your terminal and frontend. Models will run on CPU (slower but works).
**Solution 2: Use Smaller Model (if you want GPU speed)**
Switch to 7B model if you have 8GB VRAM:
```powershell
ollama pull qwen2.5:7b-instruct
```
Update `.env`:
```env
OLLAMA_MODEL=qwen2.5:7b-instruct
```
**Solution 3: Free GPU Memory**
Close other GPU-consuming applications (games, video editing, etc.) and restart Ollama:
```powershell
Get-Process ollama -ErrorAction SilentlyContinue | Stop-Process -Force
Start-Sleep -Seconds 3
# Ollama will auto-restart
```
**To Re-Enable GPU Mode Later:**
```powershell
[System.Environment]::SetEnvironmentVariable('OLLAMA_NUM_GPU', '1', 'User')
$env:OLLAMA_NUM_GPU = '1'
```
---
## 📈 Performance Benchmarks
Based on 458 documents (~30,000 chunks):
**GPU Mode:**
| Configuration | Ingestion Time | Query Time | Accuracy* |
|---------------|----------------|------------|-----------|
| nomic + 7B (GPU) | 2 min | 0.5-1s | ⭐⭐⭐⭐ |
| mxbai + 7B (GPU) | 3 min | 0.5-1s | ⭐⭐⭐⭐⭐ |
| mxbai + 14B (GPU) | 3 min | 1-2s | ⭐⭐⭐⭐⭐ |
**CPU-Only Mode (OLLAMA_NUM_GPU=0):**
| Configuration | Ingestion Time | Query Time | Accuracy* |
|---------------|----------------|------------|-----------|
| nomic + 7B (CPU) | 5 min | 3-5s | ⭐⭐⭐⭐ |
| mxbai + 7B (CPU) | 8 min | 3-5s | ⭐⭐⭐⭐⭐ |
| mxbai + 14B (CPU) | 8 min | 8-15s | ⭐⭐⭐⭐⭐ |
| bge-large + 14B (CPU) | 12 min | 8-15s | ⭐⭐⭐⭐⭐ |
**\*** Accuracy for technical documentation queries
---
## 🤝 Contributing
Contributions welcome! Key areas:
- Additional document loaders
- Embedding model benchmarks
- Prompt engineering improvements
- UI enhancements
---
## 📄 License
MIT License - see LICENSE file for details
---
## 🙏 Acknowledgments
Built with:
- [LangChain](https://github.com/langchain-ai/langchain) - RAG framework
- [Chroma](https://github.com/chroma-core/chroma) - Vector database
- [Ollama](https://ollama.ai/) - Local LLM runtime
- [Flashrank](https://github.com/PrithivirajDamodaran/FlashRank) - Fast reranking
- [Qwen](https://github.com/QwenLM/Qwen) - Language models