Daz Williams


06 · Bare metal → API

AI Inference Stack

In production · Internal · Hardware to API, end-to-end

The plumbing that runs underneath every product on this site. A complete AI inference stack - bare-metal NVIDIA GPU server builds, CUDA toolchain, model deployment, vLLM, load balancing, and production monitoring. Full vertical ownership from hardware to API.

Why I run my own

Hosted inference is fine until the moment a vendor changes pricing, deprecates a model, throttles you, or simply doesn't have the open-weights model you need. For a studio shipping AI products end-to-end, owning the inference stack is the difference between margin and no margin - and between an experiment and a defensible product.

Running the stack myself also means I can do things that hosted APIs make awkward: speech-aware open-weights LLMs, self-hosted voice TTS, image generation with open-weights diffusion models, and small fine-tuned classifiers - all on the same hardware, all reachable from the same internal network, with no per-token surprises.

What's in the stack

Hardware & host

NVIDIA GPU servers I racked and configured myself. A bare-metal hypervisor with GPU-passthrough VMs for isolation per workload. Dual-WAN failover so a single ISP outage doesn't take a production service down.

Drivers & runtime

CUDA toolchain, NVIDIA Container Toolkit for Docker. Deep-learning runtime matched to the driver. Reproducible builds - the same image runs on dev and prod; the only variable is which device it's mapped to.
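That driver/runtime match is worth checking in code before a container ever starts. A minimal sketch of the idea - the query flags are standard `nvidia-smi` options, but the helper names and the notion of gating deploys on the parsed report are mine, not a fixed part of the stack:

```python
import csv
import io
import subprocess


def parse_gpu_report(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output.

    Expected row shape: index, name, driver_version, memory.total [MiB].
    """
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        index, name, driver, memory = (f.strip() for f in fields)
        rows.append({
            "index": int(index),
            "name": name,
            "driver": driver,
            "memory_mib": int(memory.split()[0]),  # "49140 MiB" -> 49140
        })
    return rows


def query_gpus() -> list[dict]:
    """Ask the host driver directly; fails loudly if the driver is absent."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,driver_version,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_report(out)
```

A deploy script can then refuse to start a workload whose image was built against a newer CUDA runtime than the reported driver supports.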

Model serving

vLLM for LLMs (PagedAttention, continuous batching). Custom servers for the more specialised workloads - open-weights image generation, self-hosted TTS, speech-aware LLMs. Each model behind a stable internal HTTP API.
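vLLM ships an OpenAI-compatible HTTP server, so a client inside the network only needs the standard chat-completions shape. A sketch, assuming an internal hostname (`inference.internal` is a placeholder) and default port:

```python
import json
import urllib.request

# Placeholder internal endpoint; vLLM's OpenAI-compatible server
# exposes /v1/chat/completions by default.
VLLM_URL = "http://inference.internal:8000/v1/chat/completions"


def build_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat payload for a vLLM backend."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }


def complete(model: str, prompt: str) -> str:
    """POST to the internal vLLM server and return the first choice."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        VLLM_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Because the wire format is OpenAI-compatible, swapping a hosted model for a self-hosted one is a URL change, not a client rewrite.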

Networking & access

A mesh VPN between services, so the inference plane never touches the public internet. Tunnel-based ingress for any externally facing endpoints - no open ports, no public IPs, no exposed SSH.

Orchestration

Docker Compose per host, with shared volumes for models and a clean separation between dev and prod deploys. The same patterns this very portfolio site runs on.
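The per-host layout looks roughly like this - a sketch, with illustrative service and volume names; the GPU reservation block is the standard Compose device-reservation syntax:

```yaml
# One host's compose file (names are illustrative).
services:
  llm:
    image: vllm/vllm-openai:latest
    volumes:
      - models:/models          # shared model cache across services
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  models:
```

Dev and prod use the same file with different override layers, which is what keeps the "same image, different device" property honest.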

Observability

Per-GPU utilisation, per-model latency and throughput, queue depth, and request timing. Alerts when a queue grows or a GPU starts thermal-throttling - not just when something has already crashed.
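The alerting logic itself is simple threshold checks over polled samples. A minimal sketch - the thresholds and field names here are illustrative, not the production values:

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One polling-interval snapshot of a GPU and its request queue."""
    queue_depth: int
    temperature_c: float
    clock_mhz: int
    max_clock_mhz: int


def alerts(sample: GpuSample,
           queue_limit: int = 32,
           temp_limit_c: float = 83.0,
           clock_ratio_floor: float = 0.9) -> list[str]:
    """Fire before things crash: growing queues and sagging clocks."""
    fired = []
    if sample.queue_depth > queue_limit:
        fired.append(f"queue depth {sample.queue_depth} > {queue_limit}")
    if sample.temperature_c >= temp_limit_c:
        fired.append(f"GPU at {sample.temperature_c:.0f}C, near throttle point")
    # Clocks well below max at high load usually mean thermal throttling.
    if sample.clock_mhz < sample.max_clock_mhz * clock_ratio_floor:
        fired.append("clocks below 90% of max - likely thermal throttling")
    return fired
```

The point of the clock check is that a throttling GPU still "works" - latency just quietly doubles - so waiting for a crash-style alert means waiting too long.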

Stack

NVIDIA GPUs · CUDA · Deep-learning runtime · vLLM · Hypervisor + GPU-passthrough VMs · Container orchestration · Mesh VPN · Tunnel-based ingress · nginx · Python API service · Linux