Blame
|
1 | # LAN Machines |
||||||
| 2 | ||||||||
| 3 | ← [[Home]] |
|||||||
| 4 | ||||||||
| 5 | Physical machines on the LAN that are not LXC containers — primarily Ollama inference servers. |
|||||||
| 6 | ||||||||
| 7 | --- |
|||||||
| 8 | ||||||||
| 9 | ## Machines |
|||||||
| 10 | ||||||||
| 11 | | IP | Hardware | Role | |
|||||||
| 12 | |----|----------|------| |
|||||||
| 13 | | `192.168.2.11` | RTX 3060 (12 GB VRAM) | Ollama — large models | |
|||||||
| 14 | | `192.168.2.40` | RTX 2060 Super (16 GB RAM) | Ollama — small models | |
|||||||
| 15 | | `192.168.2.73` | GTX 1050 | Local workstation | |
|||||||
| 16 | ||||||||
| 17 | --- |
|||||||
| 18 | ||||||||
| 19 | ## Ollama @ 192.168.2.11 (RTX 3060) |
|||||||
| 20 | ||||||||
| 21 | Primary inference server for large models. API: `http://192.168.2.40:11434` |
|||||||
| 22 | ||||||||
| 23 | **Running models:** |
|||||||
| 24 | ||||||||
| 25 | | Model | Notes | |
|||||||
| 26 | |-------|-------| |
|||||||
| 27 | | `qwen3.6:27b` | 27B MoE — fits in 12 GB VRAM | |
|||||||
| 28 | | `qwen3.5` | | |
|||||||
| 29 | | `ministral-3` | | |
|||||||
| 30 | | `llama3.2` | | |
|||||||
| 31 | | `llama3.1:8b` | Aliased as `llama3.1:8b-gpu` in LiteLLM | |
|||||||
| 32 | | `llava:7b` | Vision/OCR — used by paperless-gpt | |
|||||||
| 33 | | `nomic-embed-text` | Embeddings for Qdrant (vector size 192) | |
|||||||
| 34 | ||||||||
| 35 | --- |
|||||||
| 36 | ||||||||
| 37 | ## Ollama @ 192.168.2.40 (RTX 2060 Super) |
|||||||
| 38 | ||||||||
| 39 | Secondary inference server for small models. API: `http://192.168.2.40:11434` |
|||||||
| 40 | ||||||||
| 41 | Models stored at `C:\Users\Damien\.ollama\` (Windows machine). |
|||||||
| 42 | ||||||||
| 43 | **Running models:** |
|||||||
| 44 | ||||||||
| 45 | | Model | Notes | |
|||||||
| 46 | |-------|-------| |
|||||||
| 47 | | `llama3.1:8b` | | |
|||||||
| 48 | | `llama3.2:3b` | | |
|||||||
| 49 | ||||||||
| 50 | --- |
|||||||
| 51 | ||||||||
| 52 | ## Model Sizing Notes |
|||||||
| 53 | ||||||||
| 54 | - **RTX 3060 (12 GB):** fits up to ~14B dense or ~27B MoE at Q4_K_M |
|||||||
| 55 | - **Qwen3-coder-30b-a3b** (MoE) needs ~22 GB VRAM at Q4_K_M — exceeds both `.11` and `.40`. Not runnable locally. |
|||||||
| 56 | - **Rule of thumb:** Q4_K_M quantization uses roughly 0.5–0.6 GB per billion parameters for dense models; MoE models use much less because only a fraction of params activate per token. |
|||||||
| 57 | ||||||||
| 58 | --- |
|||||||
| 59 | ||||||||
| 60 | ## LiteLLM Integration |
|||||||
| 61 | ||||||||
| 62 | Both Ollama servers are configured as backends in LiteLLM on PCT 109. See [[AI Stack]] for the full model list and proxy config. Reference Ollama models in LiteLLM as their configured model names (e.g. `qwen3.6:27b`, `llama3.1:8b`). |
|||||||
