LAN Machines

Home

Physical machines on the LAN that are not LXC containers — primarily Ollama inference servers.


Machines

IP Hardware Role
192.168.2.11 RTX 3060 (12 GB VRAM) Ollama — large models
192.168.2.40 RTX 2060 Super (16 GB RAM) Ollama — small models
192.168.2.73 GTX 1050 Local workstation

Ollama @ 192.168.2.11 (RTX 3060)

Primary inference server for large models. API: http://192.168.2.40:11434

Running models:

Model Notes
qwen3.6:27b 27B MoE — fits in 12 GB VRAM
qwen3.5
ministral-3
llama3.2
llama3.1:8b Aliased as llama3.1:8b-gpu in LiteLLM
llava:7b Vision/OCR — used by paperless-gpt
nomic-embed-text Embeddings for Qdrant (vector size 192)

Ollama @ 192.168.2.40 (RTX 2060 Super)

Secondary inference server for small models. API: http://192.168.2.40:11434

Models stored at C:\Users\Damien\.ollama\ (Windows machine).

Running models:

Model Notes
llama3.1:8b
llama3.2:3b

Model Sizing Notes

  • RTX 3060 (12 GB): fits up to ~14B dense or ~27B MoE at Q4_K_M
  • Qwen3-coder-30b-a3b (MoE) needs ~22 GB VRAM at Q4_K_M — exceeds both .11 and .40. Not runnable locally.
  • Rule of thumb: Q4_K_M quantization uses roughly 0.5–0.6 GB per billion parameters for dense models; MoE models use much less because only a fraction of params activate per token.

LiteLLM Integration

Both Ollama servers are configured as backends in LiteLLM on PCT 109. See AI Stack for the full model list and proxy config. Reference Ollama models in LiteLLM as their configured model names (e.g. qwen3.6:27b, llama3.1:8b).

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9