LAN Machines

Physical machines on the LAN that are not LXC containers — primarily Ollama inference servers.

Machines

IP	Hardware	Role
`192.168.2.11`	RTX 3060 (12 GB VRAM)	Ollama — large models
`192.168.2.40`	RTX 2060 Super (16 GB RAM)	Ollama — small models
`192.168.2.73`	GTX 1050	Local workstation

Ollama @ 192.168.2.11 (RTX 3060)

Primary inference server for large models. API: http://192.168.2.40:11434

Running models:

Model	Notes
`qwen3.6:27b`	27B MoE — fits in 12 GB VRAM
`qwen3.5`
`ministral-3`
`llama3.2`
`llama3.1:8b`	Aliased as `llama3.1:8b-gpu` in LiteLLM
`llava:7b`	Vision/OCR — used by paperless-gpt
`nomic-embed-text`	Embeddings for Qdrant (vector size 192)

Ollama @ 192.168.2.40 (RTX 2060 Super)

Secondary inference server for small models. API: http://192.168.2.40:11434

Models stored at C:\Users\Damien\.ollama\ (Windows machine).

Running models:

Model	Notes
`llama3.1:8b`
`llama3.2:3b`

Model Sizing Notes

RTX 3060 (12 GB): fits up to ~14B dense or ~27B MoE at Q4_K_M
Qwen3-coder-30b-a3b (MoE) needs ~22 GB VRAM at Q4_K_M — exceeds both .11 and .40. Not runnable locally.
Rule of thumb: Q4_K_M quantization uses roughly 0.5–0.6 GB per billion parameters for dense models; MoE models use much less because only a fraction of params activate per token.

LiteLLM Integration

Both Ollama servers are configured as backends in LiteLLM on PCT 109. See AI Stack for the full model list and proxy config. Reference Ollama models in LiteLLM as their configured model names (e.g. qwen3.6:27b, llama3.1:8b).