06 · Bare metal → API
AI Inference Stack
In production · Internal · Hardware to API, end-to-end
The plumbing that runs underneath every product on this site. A complete AI inference stack - bare-metal NVIDIA GPU server builds, CUDA toolchain, model deployment, vLLM, load balancing, and production monitoring. Full vertical ownership from hardware to API.
Why I run my own
Hosted inference is fine until the moment a vendor changes pricing, deprecates a model, throttles you, or simply doesn't have the open-weights model you need. For a studio shipping AI products end-to-end, owning the inference stack is the difference between margin and no margin - and between an experiment and a defensible product.
Running the stack myself also means I can do things that hosted APIs make awkward: speech-aware open-weights LLMs, self-hosted TTS, image generation with open-weights diffusion models, and small fine-tuned classifiers - all on the same hardware, all reachable from the same internal network, with no per-token surprises.
What's in the stack
Hardware & host
NVIDIA GPU servers I racked and configured myself. A bare-metal hypervisor with GPU-passthrough VMs for isolation per workload. Dual-WAN failover so a single ISP outage doesn't take a production service down.
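A rough sketch of the per-uplink health check behind that failover - the interface names and probe target are placeholders, and the real failover happens at the router rather than in a script:

```python
import subprocess

# Hypothetical WAN interface names; the real ones depend on the host's network config.
WAN_INTERFACES = ["wan0", "wan1"]
PROBE_TARGET = "1.1.1.1"  # any reliably reachable public IP works as a probe


def uplink_is_up(interface: str) -> bool:
    """Send a single ICMP probe out of a specific interface (Linux ping -I)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-I", interface, PROBE_TARGET],
        capture_output=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    for iface in WAN_INTERFACES:
        print(f"{iface}: {'up' if uplink_is_up(iface) else 'DOWN'}")
```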
Drivers & runtime
CUDA toolchain and the NVIDIA Container Toolkit for Docker. Deep-learning runtime matched to the driver. Reproducible builds - the same image runs on dev and prod; the only variable is which device it's mapped to.
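A quick sanity check run inside a container to confirm the runtime and the mapped device line up with the host driver - this assumes a PyTorch-based image; other runtimes expose equivalent calls:

```python
# Run inside the inference container after mapping a device into it.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible - check the container's device mapping")

print(f"CUDA runtime bundled with the image: {torch.version.cuda}")
print(f"Visible devices: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```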
Model serving
vLLM for LLMs (PagedAttention, continuous batching). Custom servers for the more specialised workloads - open-weights image generation, self-hosted TTS, speech-aware LLMs. Each model behind a stable internal HTTP API.
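Because vLLM exposes an OpenAI-compatible API, any service on the internal network talks to a model with nothing more than an HTTP call - the hostname, port, and model name below are illustrative:

```python
# Minimal client call against a vLLM instance serving the OpenAI-compatible chat API.
import requests

VLLM_URL = "http://gpu-node-1.internal:8000/v1/chat/completions"  # hypothetical internal host

response = requests.post(
    VLLM_URL,
    json={
        "model": "my-open-weights-model",  # whichever model the server was launched with
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
        "temperature": 0.2,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```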
Networking & access
A mesh VPN between services, so the inference plane never touches the public internet. Tunnel-based ingress for any externally facing endpoints - no open ports, no public IPs, no exposed SSH.
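At the socket level the rule is simple: services bind to the host's mesh-VPN address rather than 0.0.0.0, so nothing on a public interface can even open a connection. A minimal sketch, with a made-up address from the CGNAT range most mesh VPNs use:

```python
# Bind an internal endpoint to the mesh-VPN address only, never to all interfaces.
from http.server import BaseHTTPRequestHandler, HTTPServer

VPN_ADDRESS = "100.64.0.12"  # hypothetical mesh-VPN IP of this host
PORT = 8080


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")


if __name__ == "__main__":
    # Binding to the VPN address (not "" / 0.0.0.0) means only peers on the
    # mesh can reach this service at all.
    HTTPServer((VPN_ADDRESS, PORT), HealthHandler).serve_forever()
```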
Orchestration
Docker Compose per host, with shared volumes for models and a clean separation between dev and prod deploys. The same patterns this very portfolio site runs on.
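The deploys themselves are Compose files, but the pattern is easy to show with the Docker SDK: same image, same read-only model volume, and the only differences between dev and prod are the GPU and the port. Image name, paths, and ports below are illustrative:

```python
# Illustration of the per-host deploy pattern using the Docker SDK (docker-py);
# in practice the same thing is declared in a Compose file.
import docker
from docker.types import DeviceRequest

client = docker.from_env()


def deploy(env: str, gpu_id: str, host_port: int):
    """Same image, same model volume; only the GPU and port differ by environment."""
    return client.containers.run(
        image="registry.internal/vllm-serve:latest",  # hypothetical internal image
        name=f"vllm-{env}",
        detach=True,
        ports={"8000/tcp": host_port},
        volumes={"/srv/models": {"bind": "/models", "mode": "ro"}},  # shared model store
        device_requests=[DeviceRequest(device_ids=[gpu_id], capabilities=[["gpu"]])],
    )


deploy("dev", gpu_id="1", host_port=8001)
deploy("prod", gpu_id="0", host_port=8000)
```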
Observability
Per-GPU utilisation, per-model latency and throughput, queue depth, and request timing. Alerts when a queue grows or a GPU starts thermal-throttling - not just when something has already crashed.
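The per-GPU half of that loop is straightforward with NVML - a trimmed-down sketch, with a placeholder temperature threshold standing in for the real alerting rules:

```python
# Per-GPU side of the monitoring loop via NVML (pynvml).
import pynvml

TEMP_ALERT_C = 85  # hypothetical threshold for "about to thermal-throttle"

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU {i}: {util.gpu}% util, {mem.used / mem.total:.0%} memory, {temp}°C")
        if temp >= TEMP_ALERT_C:
            print(f"ALERT: GPU {i} at {temp}°C - investigate before it throttles")
finally:
    pynvml.nvmlShutdown()
```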
Stack