Local AI / Infra2026

Local LLM Inference Server

A self-hosted, dual-GPU inference rig serving a fleet of large language models with hot-swapping, speculative decoding, full observability, and measured, validated performance tuning.

Local AILLM Inferencellama.cppGPU ComputingPerformance TuningObservabilityLinuxSelf-Hosted

2x 16GB VRAM

GPUs

6 models, 25B–35B

Model fleet

256K tokens

Max context

~1.5–2x (MTP)

Decode speedup

65–80%

Draft acceptance

5–30 sec

Cold model swap

~17% effective clock

Memory OC gain

6x sustained runs

Stress validation

A home-built local AI infrastructure project: a dedicated Linux server running a dual-GPU (2x 16GB VRAM) inference rig behind an OpenAI-compatible API. Rather than keeping one model loaded at all times, the stack hot-swaps between a fleet of open-weight LLMs (25B–35B parameter class) on demand, automatically freeing and reallocating GPU memory per request — all while exposing a single stable API endpoint that any OpenAI-compatible client or coding agent can talk to. Every performance change (GPU clocks, decoding strategy, context sizing) was validated with real before/after benchmarks rather than assumed — the same rigor expected of production infrastructure work.

Highlights

Dual-GPU inference server running a model-swapping proxy in front of llama.cpp, exposing a single OpenAI-compatible API across a fleet of open-weight models
On-demand hot-swapping between multiple 25B–35B-class models in 5–30 seconds cold, instant on subsequent requests — automatically freeing and reloading GPU memory with no manual intervention
Extra-long context windows (up to 256K tokens) across the entire model fleet, including multimodal (vision) support on one model
Self-speculative multi-token-prediction (MTP) decoding on select models, cutting generation time by ~1.5–2x with a ~65–80% draft-acceptance rate — no separate draft model required
Full observability stack — Prometheus + Grafana dashboards tracking GPU utilization, VRAM, temperatures, and host metrics in real time
Validated GPU memory overclocking (structured A/B testing across multiple offsets) delivering ~17% higher effective memory clock with zero throughput regressions after 6x sustained stress-test cycles and a full reboot re-validation
Rigorous change-management discipline: a live core-clock experiment that risked corrupting an in-flight speculative-decoding session was caught, diagnosed, and reverted — with the root cause and safe operating procedure documented for future changes
Benchmarked model upgrade path — evaluated a new coding-focused model against the existing fleet on public benchmarks (SWE-bench Verified, Terminal-Bench, NL2Repo) before adopting it, rather than upgrading on vendor claims alone
Wired into day-to-day developer tooling — coding agents and IDE integrations route requests to the local fleet exactly like a cloud LLM provider
Kept current with routine OS/package maintenance and health checks as part of ongoing operation, not a set-it-and-forget-it box

Architecture & Infrastructure

Model-swapping API gateway

A lightweight proxy sits in front of llama.cpp and exposes one stable OpenAI-compatible endpoint. Requesting a different model by name transparently unloads the current one, frees its VRAM, and loads the next — clients never need to know which model is physically resident.

VRAM-aware capacity planning

With 32GB of VRAM total across two GPUs, only one 25–35B-class model fits at a time. Every model in the fleet is tuned (quantization level, KV-cache precision, context length) to leave a safe, measured VRAM margin at full 256K context rather than guessing at headroom.

Speculative decoding for real throughput gains

Several models use self-speculative multi-token prediction — a draft head baked into the same model file predicts multiple tokens ahead, which the main model verifies in a single pass. Measured draft-acceptance rates of 65–80% translate to a genuine ~1.5–2x faster generation, with no extra model to manage.

Full-stack observability

Prometheus scrapes GPU, host, and inference-server metrics; Grafana visualizes them across host health, per-GPU utilization/VRAM/temperature/power, and request-level dashboards — the same observability pattern used for production ML-serving infrastructure.

Evidence-based performance tuning

GPU memory overclocking was pushed and validated incrementally with repeatable benchmarks (determinism checks, sustained stress runs, full reboot cycles) rather than applied blindly. An unstable setting was identified and rolled back before it could affect production use, and a separate live core-clock change that silently degraded speculative-decoding accuracy was caught, root-caused, and reverted with a documented safe-change procedure going forward.

Benchmark-driven model evaluation

New models are evaluated against the existing fleet on public, third-party benchmarks (SWE-bench Verified, Terminal-Bench, NL2Repo) before being promoted to production use — treating model selection as a measured engineering decision, not a trend follow.

Stack

llama.cppllama-swapLinuxCUDAPrometheusGrafana