Daz Williams


05 · Image generation

Flux Image Engine

In production · Internal · Built end-to-end on bare metal

A production image-generation server I built from scratch on open-weights diffusion models, with custom queuing and batching on top of NVIDIA GPUs I racked and configured myself. It has served hundreds of thousands of images.

What it does

Image generation looks simple from the outside - prompt in, picture out. In production it becomes a question of throughput: how many concurrent generations can a single GPU handle, how do you queue requests fairly without starving anyone, and how do you batch them to get the most out of the hardware without ballooning latency for the next request in line.
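The batching trade-off can be sketched in a few lines: wait long enough to fill a batch and throughput improves, but wait too long and the first request in the batch eats the delay. A minimal sketch, not the engine's actual code — the function name and parameters are illustrative:

```python
import queue
import time

def collect_batch(q, max_batch=8, max_wait_ms=25):
    """Pull up to max_batch requests, waiting at most max_wait_ms
    past the first arrival so tail latency stays bounded."""
    batch = [q.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing more arrived inside the window
    return batch

requests = queue.Queue()
for i in range(5):
    requests.put(f"req-{i}")
print(collect_batch(requests, max_batch=3))  # ['req-0', 'req-1', 'req-2']
```

Tuning `max_batch` and `max_wait_ms` is exactly the throughput-versus-latency dial described above.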

The engine combines a request-level queue, a per-GPU worker pool, and a small batching scheduler that groups compatible requests on the same step boundary. It's the workhorse underneath multiple internal products and has run for months without a manual restart.
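"Compatible" in this context means requests whose tensors can be stacked and advanced one denoising step together. A hedged sketch of how such grouping might look — the request fields and model names are assumptions, not the engine's real schema:

```python
from collections import defaultdict

def group_compatible(requests):
    """Group pending requests that could share a denoising batch:
    same model, same resolution, and same step count can be stacked
    along the batch dimension and stepped on the same boundary."""
    groups = defaultdict(list)
    for req in requests:
        key = (req["model"], req["width"], req["height"], req["steps"])
        groups[key].append(req)
    return list(groups.values())

pending = [
    {"id": 1, "model": "fast", "width": 512,  "height": 512,  "steps": 20},
    {"id": 2, "model": "fast", "width": 512,  "height": 512,  "steps": 20},
    {"id": 3, "model": "hq",   "width": 1024, "height": 1024, "steps": 40},
]
print([[r["id"] for r in g] for g in group_compatible(pending)])  # [[1, 2], [3]]
```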

Engineering highlights

  • Bare-metal NVIDIA + CUDA setup. Hardware build, driver and CUDA toolchain, model deployment. Not a hosted API I subscribed to - physical GPUs I racked myself.
  • Custom queue + batcher. Per-priority queues, fairness across tenants, and step-aligned batching that gets close to the GPU's peak throughput without sacrificing tail latency.
  • Two-model serving. A fast/cheap model for low-latency turns, a higher-fidelity model for the prestige work - model is selected per-request, weights pinned in VRAM where possible.
  • Observability. Per-request timing, per-GPU utilisation, queue-depth alarms. When a GPU pegs, you know why.
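The fairness property in the first two bullets — strict priority between levels, no tenant starving another within a level — can be sketched with a two-level queue that round-robins across tenants. A minimal illustration under assumed names, not the production implementation:

```python
from collections import OrderedDict, deque

class FairQueue:
    """Strict priority between levels; round-robin across tenants
    within a level, so one busy tenant cannot starve the rest."""

    def __init__(self, levels=("high", "normal")):
        # level -> tenant -> deque of requests (insertion order = RR order)
        self.levels = {lvl: OrderedDict() for lvl in levels}

    def put(self, level, tenant, request):
        self.levels[level].setdefault(tenant, deque()).append(request)

    def get(self):
        for tenants in self.levels.values():
            if not tenants:
                continue
            tenant, q = next(iter(tenants.items()))
            req = q.popleft()
            del tenants[tenant]       # rotate tenant to the back...
            if q:
                tenants[tenant] = q   # ...only if it still has work
            return req
        return None

fq = FairQueue()
fq.put("normal", "a", "a1")
fq.put("normal", "a", "a2")
fq.put("normal", "b", "b1")
print([fq.get(), fq.get(), fq.get()])  # ['a1', 'b1', 'a2']
```

Tenant `b` gets served between `a`'s two requests even though `a` enqueued first — the interleaving the fairness bullet describes.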

Stack

Python · Python API framework · Deep learning runtime · NVIDIA / CUDA · Open-weights diffusion models · Redis (queue) · Docker · Hypervisor + GPU-passthrough VMs · Mesh VPN