LAN Machines
← Home
Physical machines on the LAN that are not LXC containers — primarily Ollama inference servers.
Machines
| IP | Hardware | Role |
|---|---|---|
192.168.2.11 |
RTX 3060 (12 GB VRAM) | Ollama — large models |
192.168.2.40 |
RTX 2060 Super (16 GB RAM) | Ollama — small models |
192.168.2.73 |
GTX 1050 | Local workstation |
Ollama @ 192.168.2.11 (RTX 3060)
Primary inference server for large models. API: http://192.168.2.40:11434
Running models:
| Model | Notes |
|---|---|
qwen3.6:27b |
27B MoE — fits in 12 GB VRAM |
qwen3.5 |
|
ministral-3 |
|
llama3.2 |
|
llama3.1:8b |
Aliased as llama3.1:8b-gpu in LiteLLM |
llava:7b |
Vision/OCR — used by paperless-gpt |
nomic-embed-text |
Embeddings for Qdrant (vector size 192) |
Ollama @ 192.168.2.40 (RTX 2060 Super)
Secondary inference server for small models. API: http://192.168.2.40:11434
Models stored at C:\Users\Damien\.ollama\ (Windows machine).
Running models:
| Model | Notes |
|---|---|
llama3.1:8b |
|
llama3.2:3b |
Model Sizing Notes
- RTX 3060 (12 GB): fits up to ~14B dense or ~27B MoE at Q4_K_M
- Qwen3-coder-30b-a3b (MoE) needs ~22 GB VRAM at Q4_K_M — exceeds both
.11and.40. Not runnable locally. - Rule of thumb: Q4_K_M quantization uses roughly 0.5–0.6 GB per billion parameters for dense models; MoE models use much less because only a fraction of params activate per token.
LiteLLM Integration
Both Ollama servers are configured as backends in LiteLLM on PCT 109. See AI Stack for the full model list and proxy config. Reference Ollama models in LiteLLM as their configured model names (e.g. qwen3.6:27b, llama3.1:8b).
