Local LLM Inference Server
A self-hosted, dual-GPU inference rig serving a fleet of large language models with hot-swapping, speculative decoding, full observability, and measured, validated performance tuning.
2x 16GB VRAM
GPUs
6 models, 25B–35B
Model fleet
256K tokens
Max context
~1.5–2x (MTP)
Decode speedup
65–80%
Draft acceptance
5–30 sec
Cold model swap
~17% effective clock
Memory OC gain
6x sustained runs
Stress validation
A home-built local AI infrastructure project: a dedicated Linux server running a dual-GPU (2x 16GB VRAM) inference rig behind an OpenAI-compatible API. Rather than keeping one model loaded at all times, the stack hot-swaps between a fleet of open-weight LLMs (25B–35B parameter class) on demand, automatically freeing and reallocating GPU memory per request — all while exposing a single stable API endpoint that any OpenAI-compatible client or coding agent can talk to. Every performance change (GPU clocks, decoding strategy, context sizing) was validated with real before/after benchmarks rather than assumed — the same rigor expected of production infrastructure work.
Highlights
- Dual-GPU inference server running a model-swapping proxy in front of llama.cpp, exposing a single OpenAI-compatible API across a fleet of open-weight models
- On-demand hot-swapping between multiple 25B–35B-class models in 5–30 seconds cold, instant on subsequent requests — automatically freeing and reloading GPU memory with no manual intervention
- Extra-long context windows (up to 256K tokens) across the entire model fleet, including multimodal (vision) support on one model
- Self-speculative multi-token-prediction (MTP) decoding on select models, cutting generation time by ~1.5–2x with a ~65–80% draft-acceptance rate — no separate draft model required
- Full observability stack — Prometheus + Grafana dashboards tracking GPU utilization, VRAM, temperatures, and host metrics in real time
- Validated GPU memory overclocking (structured A/B testing across multiple offsets) delivering ~17% higher effective memory clock with zero throughput regressions after 6x sustained stress-test cycles and a full reboot re-validation
- Rigorous change-management discipline: a live core-clock experiment that risked corrupting an in-flight speculative-decoding session was caught, diagnosed, and reverted — with the root cause and safe operating procedure documented for future changes
- Benchmarked model upgrade path — evaluated a new coding-focused model against the existing fleet on public benchmarks (SWE-bench Verified, Terminal-Bench, NL2Repo) before adopting it, rather than upgrading on vendor claims alone
- Wired into day-to-day developer tooling — coding agents and IDE integrations route requests to the local fleet exactly like a cloud LLM provider
- Kept current with routine OS/package maintenance and health checks as part of ongoing operation, not a set-it-and-forget-it box
Architecture & Infrastructure
Model-swapping API gateway
A lightweight proxy sits in front of llama.cpp and exposes one stable OpenAI-compatible endpoint. Requesting a different model by name transparently unloads the current one, frees its VRAM, and loads the next — clients never need to know which model is physically resident.
VRAM-aware capacity planning
With 32GB of VRAM total across two GPUs, only one 25–35B-class model fits at a time. Every model in the fleet is tuned (quantization level, KV-cache precision, context length) to leave a safe, measured VRAM margin at full 256K context rather than guessing at headroom.
Speculative decoding for real throughput gains
Several models use self-speculative multi-token prediction — a draft head baked into the same model file predicts multiple tokens ahead, which the main model verifies in a single pass. Measured draft-acceptance rates of 65–80% translate to a genuine ~1.5–2x faster generation, with no extra model to manage.
Full-stack observability
Prometheus scrapes GPU, host, and inference-server metrics; Grafana visualizes them across host health, per-GPU utilization/VRAM/temperature/power, and request-level dashboards — the same observability pattern used for production ML-serving infrastructure.
Evidence-based performance tuning
GPU memory overclocking was pushed and validated incrementally with repeatable benchmarks (determinism checks, sustained stress runs, full reboot cycles) rather than applied blindly. An unstable setting was identified and rolled back before it could affect production use, and a separate live core-clock change that silently degraded speculative-decoding accuracy was caught, root-caused, and reverted with a documented safe-change procedure going forward.
Benchmark-driven model evaluation
New models are evaluated against the existing fleet on public, third-party benchmarks (SWE-bench Verified, Terminal-Bench, NL2Repo) before being promoted to production use — treating model selection as a measured engineering decision, not a trend follow.