llm-d Joins CNCF Sandbox: Kubernetes-Native Distributed LLM Inference

On March 24, 2026, the CNCF announced that llm-d was accepted as a Sandbox project: a Kubernetes-native distributed inference framework backed by Red Hat, Google, IBM Research, NVIDIA, AMD, and CoreWeave. The pitch is ambitious: treat distributed LLM inference as a first-class cloud native workload, with the same operational rigor as any other Kubernetes service.
If you've operated LLM workloads in production, you know the pain. Standard Kubernetes round-robin load balancing falls apart when requests are stateful, expensive, and non-uniform. A single long-context chat session can fragment your KV cache across nodes. Co-locating prefill and decode phases on the same pod wastes GPU memory bandwidth. And without cache-aware routing, your prefix hit rate tanks under multi-tenant load.
llm-d solves this by sitting between high-level control planes like KServe and low-level inference engines like vLLM. It implements the Gateway API Inference Extension (GAIE) with the Endpoint Picker Protocol (EPP) for programmable, cache-aware routing. It disaggregates prefill and decode into independently scalable pods. It tiers KV cache across GPU, CPU, and storage. And it does all of this with accelerator-agnostic support: NVIDIA, AMD, Intel, Google TPU.
Why LLM Inference Needs Distributed Architecture
Traditional web applications are fast fashion: requests are short-lived, uniform, and stateless. Your load balancer distributes traffic round-robin across identical replicas, and every request costs roughly the same to process. This works because HTTP requests are cheap and interchangeable.
LLM inference is bespoke tailoring. Each request has a unique "shape": the number of input tokens (prompt length) and output tokens (generation length) varies wildly across use cases:
- RAG workloads: long inputs (retrieved documents + system prompt), short outputs
- Reasoning/agent workloads: short or medium inputs, very long outputs with multiple thinking loops
- Chat sessions: multi-turn requests that reuse the same conversation context
These differences compound. A 32K-context RAG query can cost roughly 100× more to process than a 128-token chat completion. When your load balancer spreads these requests uniformly across replicas, you get cache fragmentation, queue imbalance, and tail latency spikes.
Four characteristics make LLM serving unique:
Requests are expensive and non-uniform. A single request can run for seconds or minutes, with resource utilization varying by orders of magnitude based on prompt length and model size.
Routing to cache-local replicas matters. Many workloads have multi-turn patterns: agentic tool calls, code completion, chat sessions. If requests are routed to replicas that already have the conversation context in KV cache, you skip expensive prefill computation. Prefix cache hits can reduce latency by 10-100×.
Prefill and decode have different resource profiles. Prefill (processing the prompt) is compute-bound and runs in parallel over all tokens. Decode (generating tokens) is memory bandwidth-bound and runs sequentially. Co-locating both phases on the same replica leads to inefficient resource use.
Quality of service requirements vary widely. Code completion needs O(ms) latency. Chat agents tolerate O(seconds). Batch summarization is O(minutes) or O(hours). Tight latency SLOs are exponentially more expensive: you need infrastructure that can match request urgency to resource allocation.
Standard Kubernetes services don't understand any of this. They route round-robin. They scale replicas uniformly. They treat all requests as interchangeable. llm-d was built to close that gap.
llm-d Architecture: Gateway API Inference Extension in Action
llm-d sits in the middle of the inference stack:
```
┌─────────────────────────────────────────────────────────┐
│ Control Plane: KServe (optional)                        │
│ - LLMInferenceService CRD                               │
│ - High-level model management                           │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ Orchestration: llm-d                                    │
│ - Inference Gateway (Envoy + EPP)                       │
│ - Inference Scheduler (cache-aware routing)             │
│ - Variant Autoscaler (QoS-aware scaling)                │
│ - KV Cache Manager (tiered offloading)                  │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ Execution: vLLM                                         │
│ - Model inference engine                                │
│ - Automatic prefix caching                              │
│ - Continuous batching                                   │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ Infrastructure: Kubernetes                              │
│ - LeaderWorkerSet (multi-node orchestration)            │
│ - Gateway API (traffic management)                      │
│ - HPA/KEDA (autoscaling)                                │
└─────────────────────────────────────────────────────────┘
```
The architecture is modular by design. You can run llm-d standalone or as a data-plane extension to KServe. The key components:
Inference Gateway with Endpoint Picker Protocol
llm-d implements the Gateway API Inference Extension (GAIE), a Kubernetes standard for inference-aware traffic management. The gateway uses Envoy's external processing (ext-proc) to intercept requests and route them based on:
- Model name and variant
- KV cache locality (which replicas have the prefix cached)
- Current load and queue depth
- Request priority and QoS class
The Endpoint Picker Protocol (EPP) is the magic. Instead of round-robin, the gateway queries the llm-d inference scheduler, which scores each replica based on cache state and load. Requests land on the replica that minimizes TTFT (Time to First Token) and maximizes cache hit rate.
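In practice, the scheduler is configured as a pipeline of filter and scorer plugins that each contribute to a replica's score. A hedged sketch of what such a configuration can look like; the `apiVersion`, plugin names, and field names here are illustrative and vary between releases, so check the llm-d and GAIE docs for the exact schema:

```yaml
# Illustrative EPP scheduler configuration (plugin names and weights
# are examples, not a definitive schema).
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: prefix-cache-scorer          # favors replicas that already hold the prompt prefix
  - type: queue-scorer                 # penalizes replicas with deep request queues
  - type: kv-cache-utilization-scorer  # penalizes replicas with nearly full KV cache
  - type: max-score-picker             # selects the highest-scoring endpoint
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 2                      # cache locality weighted above load signals
      - pluginRef: queue-scorer
        weight: 1
      - pluginRef: kv-cache-utilization-scorer
        weight: 1
      - pluginRef: max-score-picker
```

The weights are the interesting knob: raising the prefix-cache scorer's weight trades load balance for cache hit rate, which is usually the right trade for multi-turn workloads.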
Native Orchestration with LeaderWorkerSet
Large models don't fit on a single GPU, whether dense like Llama 3.1 405B or Mixture of Experts (MoE) like DeepSeek-R1. They require tensor parallelism (TP) or expert parallelism (EP) across multiple nodes. llm-d uses LeaderWorkerSet (LWS), a Kubernetes primitive for orchestrating multi-node replicas with fast interconnects.
LWS handles:
- Gang scheduling (all nodes start together)
- Health checking (if one node fails, restart the group)
- Rolling updates (update replicas without downtime)
- Topology-aware placement (keep TP/EP groups on the same rack or fabric)
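The bullets above map to concrete LWS fields. A minimal sketch for a model replica spanning two pods, assuming the vLLM OpenAI-compatible image; the image tag, resource values, and model are placeholders to adapt:

```yaml
# Minimal LeaderWorkerSet sketch: each replica is a leader pod plus one
# worker pod, restarted together as a gang (values are placeholders).
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llama-70b
spec:
  replicas: 2                               # two independent model replicas
  leaderWorkerTemplate:
    size: 2                                 # pods per replica (leader + 1 worker)
    restartPolicy: RecreateGroupOnPodRestart  # one pod fails -> restart the whole group
    leaderTemplate:
      spec:
        containers:
          - name: vllm-leader
            image: vllm/vllm-openai:latest
            resources:
              limits:
                nvidia.com/gpu: "8"
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:latest
            resources:
              limits:
                nvidia.com/gpu: "8"
```

`RecreateGroupOnPodRestart` is what gives you the "if one node fails, restart the group" semantics; without it a half-alive TP group would silently serve nothing.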
Hierarchical KV Cache Management
llm-d tiers KV cache across three levels:
- GPU memory: fastest, smallest (80GB on H100)
- CPU memory: slower, larger (TBs available)
- Storage (NVMe/SSD): slowest, largest (multi-TB)
The KV cache manager automatically offloads cold cache entries to CPU or storage, freeing GPU memory for active requests. When a cached request returns, the manager prefetches the KV state back to GPU. This tiering increases effective cache capacity by 10-100× at the cost of higher latency for cache misses.
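In deployments today, the CPU and storage tiers are typically provided by a KV connector such as LMCache plugged into vLLM. A sketch of what enabling a CPU tier can look like in a pod spec; the connector name, flags, and environment variables vary by vLLM and LMCache version, so treat these as assumptions to verify:

```yaml
# Hedged sketch: vLLM container args enabling KV cache offload via an
# LMCache-style connector (flag/variable names vary by version).
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=Qwen/Qwen3-32B
      - --kv-transfer-config
      - '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'
    env:
      - name: LMCACHE_LOCAL_CPU          # turn on the CPU tier
        value: "True"
      - name: LMCACHE_MAX_LOCAL_CPU_SIZE # cap the CPU tier, in GiB
        value: "64"
```

Size the CPU tier against your prefix reuse window: if hot prefixes cycle faster than the tier can hold them, you pay the offload cost without the hit-rate benefit.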
Example Inference Gateway configuration (from llm-d repo):
```yaml
apiVersion: inference.networking.k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: qwen3-32b-pool
spec:
  targetPortNumber: 8000
  endpointPicker:
    extProc:
      service:
        name: llm-d-scheduler
        namespace: llm-d
      failureMode: FailOpen
  targetModels:
    - name: qwen3-32b
      weight: 1
```
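An InferencePool receives traffic from a Gateway listener through a standard HTTPRoute whose backendRef points at the pool instead of a Service. A minimal companion route might look like this; the gateway name is a placeholder:

```yaml
# Companion HTTPRoute: forwards all traffic from a Gateway listener to
# the qwen3-32b-pool InferencePool (gateway name is a placeholder).
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen3-32b-route
spec:
  parentRefs:
    - name: inference-gateway        # your Gateway resource
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool        # backend is the pool, not a Service
          name: qwen3-32b-pool
```

Because the backend is an InferencePool, endpoint selection is delegated to the EPP instead of kube-proxy, which is exactly where the cache-aware routing happens.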
The Four "Well-Lit Paths"
llm-d doesn't try to be everything to everyone. Instead, it provides "well-lit paths": validated, benchmarked deployment patterns for common production scenarios. Each path is a Helm chart with tested configurations and reproducible workflows.
1. Intelligent Inference Scheduling
Use case: Multi-tenant SaaS with shared model endpoints
What it does: Deploys vLLM behind the Inference Gateway with prefix-cache-aware routing. The scheduler tracks KV cache state across replicas and routes requests to maximize cache hits.
Performance: On Qwen3-32B with 8×vLLM pods and 16×NVIDIA H100:
- 3× lower TTFT at 4 QPS compared to baseline Kubernetes service
- 2× higher QPS while meeting P95 TTFT ≤ 2s SLO
- Sustained throughput of ~120k tokens/second
When to use it: You're running a shared model endpoint with multi-turn workloads (chat, agents, code completion). Cache locality matters more than raw throughput.
2. Prefill/Decode Disaggregation
Use case: Large models (70B+) with long prompts or high concurrency
What it does: Splits prefill and decode phases into separate pod types. Prefill pods process prompts and transfer KV state to decode pods, which handle token generation.
Performance: For prefill-heavy workloads (20:1 input/output ratio):
- Reduced TTFT (prefill pods optimized for compute)
- Predictable TPOT (decode pods optimized for memory bandwidth)
- Independent scaling of prefill vs decode capacity
Trade-offs: Adds network complexity. Requires fast interconnect (InfiniBand, RDMA, or high-speed Ethernet) for KV transfer. Not worth it for small models (<13B) or short prompts.
When to use it: You're serving large models with long system prompts, RAG contexts, or batch summarization workloads.
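At the engine level, the split shows up as two pod templates for the same model with different KV-transfer roles. A hedged sketch using vLLM-style KV-transfer configuration; the connector name and JSON schema differ across vLLM versions (llm-d's path uses NIXL-based transfer), so verify against your release:

```yaml
# Hedged sketch of the prefill/decode split: same model, two container
# specs, distinguished by KV-transfer role (schema varies by version).
# Prefill pods: compute-bound, produce KV state for long prompts.
- name: vllm-prefill
  image: vllm/vllm-openai:latest
  args:
    - --model=Qwen/Qwen3-32B
    - --kv-transfer-config
    - '{"kv_connector": "NixlConnector", "kv_role": "kv_producer"}'
# Decode pods: memory-bandwidth-bound, consume KV state and generate tokens.
- name: vllm-decode
  image: vllm/vllm-openai:latest
  args:
    - --model=Qwen/Qwen3-32B
    - --kv-transfer-config
    - '{"kv_connector": "NixlConnector", "kv_role": "kv_consumer"}'
```

Since the two roles are separate pod templates, they scale independently, which is the whole point: add prefill capacity for long-prompt bursts without over-provisioning decode.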
3. Wide Expert Parallelism
Use case: Very large MoE models such as DeepSeek-R1 and Mixtral (dense models like Llama 3.1 405B scale with tensor parallelism instead)
What it does: Scales MoE models across multiple nodes with expert parallelism. Each node hosts a subset of experts, and requests are routed to the right experts dynamically.
Performance: Enables serving models that don't fit on a single node while maintaining low latency. DeepSeek-R1 (671B total parameters, 37B active) can be served with sub-second TTFT on a cluster of H100s.
When to use it: You need to serve frontier models that exceed single-node memory. Requires fast interconnect (NVLink, InfiniBand) for expert routing.
4. Tiered Prefix Cache
Use case: Long-context workloads (100K+ tokens) or high concurrency
What it does: Offloads KV cache from GPU to CPU memory and storage. Increases effective cache size at the cost of higher latency for offloaded entries.
Performance: Can increase cache capacity from 80GB (GPU) to 512GB+ (CPU + NVMe). Hit rate improves by 2-5× for long-context workloads.
When to use it: You're running RAG with long retrieved contexts, multi-hour chat sessions, or batch processing with context reuse.
Performance Benchmarks
The CNCF announcement included benchmarks from the llm-d v0.5 release. The test setup: Qwen3-32B model, 8×vLLM pods, 16×NVIDIA H100 GPUs, multi-tenant SaaS workload with shared customer contexts.
Key results:
| Metric | Baseline K8s Service | llm-d with Cache-Aware Routing |
|---|---|---|
| TTFT at 4 QPS | ~600ms | ~200ms (3× lower) |
| Max throughput under SLO (P95 TTFT ≤ 2s) | ~60k tok/s | ~120k tok/s (2× higher) |
| Throughput degradation point | ~70k tok/s | Sustains to ~120k tok/s |
Source: llm-d v0.5 benchmarks, CNCF announcement (March 2026)
The baseline service degrades rapidly under load because cache fragmentation worsens as QPS rises. llm-d keeps TTFT low and stable by routing requests to cache-local replicas.
Prefill/Decode disaggregation benchmarks (from Red Hat Developer article):
- 20:1 ISL/OSL ratio (long prompts, short responses): significant TTFT reduction
- Decode-only scaling: higher throughput for chat workloads
- Network overhead: low-latency KV transfer over InfiniBand
Deployment Considerations
Accelerator Agnosticism
llm-d was founded with the principle: any model, any accelerator, any cloud. The project supports accelerators including:
- NVIDIA: H100, L40S, RTX 4000/6000 Ada (full feature set)
- AMD: MI300X, MI250 (via ROCm, vLLM backend)
- Intel: Gaudi2/3 (via Habana Labs backend)
- Google: TPU v4/v5 (via JAX/vLLM TPU backend)
This matters if you're avoiding vendor lock-in or running in hybrid environments. The inference scheduler adapts routing policies based on accelerator characteristics: for example, prioritizing cache locality on memory-constrained GPUs.
Deployment Patterns
llm-d targets two primary use cases:
Self-hosted LLM: Single workload across tens or hundreds of nodes. Example: internal code completion service for a 10,000-engineer org.
Model-as-a-Service: Multi-tenant platform with many users and workloads sharing one or more LLM deployments. Example: startup offering Llama 3.1 API with usage-based billing.
The deployment time is 15-20 minutes on a managed Kubernetes cluster (DigitalOcean, GKE, EKS) using the automated scripts. The Helm charts provide reusable building blocks, but won't support every possible configuration: advanced teams can customize the scheduler filters and scoring algorithms.
Operational Costs
Distributed inference trades infrastructure complexity for performance gains. The operational cost profile differs significantly from traditional web workloads:
- GPU utilization: Cache-aware routing improves GPU utilization by 30-50% compared to round-robin, but requires more sophisticated monitoring
- Network costs: P/D disaggregation and expert parallelism demand high-bandwidth, low-latency interconnects (InfiniBand, RDMA-over-Ethernet)
- Memory overhead: Tiered KV caching increases CPU and storage requirements proportionally to cache hit rate targets
- Engineering time: Teams need inference-specific expertise (TTFT/TPOT tuning, KV cache behavior) beyond standard Kubernetes operations
Competitive Landscape
llm-d isn't the only player in distributed inference:
AIBrix (ByteDance): Integrated serving platform with lifecycle management, StormService for P/D disaggregation, and high-density LoRA support. More opinionated than llm-d, but requires buying into the full platform.
NVIDIA Dynamo: High-performance framework optimized for NVIDIA infrastructure. Supports multi-node disaggregation and LeaderWorkerSet. Best-in-class performance on NVIDIA hardware, but vendor-locked.
KServe: High-level control plane for model serving. llm-d complements KServe: it's a data-plane extension, not a replacement.
Volcano/Kthena: Batch scheduling and gang scheduling. llm-d uses standard Kubernetes primitives (LWS) instead.
llm-d differentiates by being vendor-neutral, modular, and API-standard-aligned (Gateway API Inference Extension). Trade-off: more flexibility means more configuration decisions. Teams choosing llm-d gain accelerator freedom but lose the opinionated guardrails of integrated platforms like AIBrix or Dynamo.
Gotchas
KV cache locality is everything. If your routing decisions ignore cache state, you'll get worse performance than a simple round-robin setup. The EPP scheduler is the core of llm-d: don't skip understanding how it scores replicas.
Disaggregation adds network complexity. Prefill/decode split requires fast interconnect for KV transfer. If you're running on commodity Ethernet without RDMA, the network overhead can eat your performance gains. Start with intelligent scheduling, add disaggregation only if your workload justifies it.
Model access requirements. If you're deploying Llama 3.x models, you need a HuggingFace token and must accept the Meta license agreement. Approval takes a few hours. For quick testing, use open alternatives like Google Gemma, Qwen, or Microsoft Phi-3.
Monitoring gap. Standard Kubernetes observability (Prometheus, Grafana) doesn't track inference-specific metrics out of the box. You need TTFT, TPOT, KV cache hit rate, and queue depth. llm-d exports these metrics, but you'll need to build custom dashboards.
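As a starting point for those dashboards, the engine-level histograms can be rolled into Prometheus recording rules for the metrics named above. The metric names below are vLLM's exported names; verify them against your engine version before relying on them:

```yaml
# Example Prometheus recording rules for inference SLO metrics
# (vllm:* metric names assumed from vLLM's exporter; verify per version).
groups:
  - name: llm-inference-slo
    rules:
      - record: job:ttft_p95:5m        # P95 Time to First Token over 5m
        expr: histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
      - record: job:tpot_p95:5m        # P95 Time Per Output Token over 5m
        expr: histogram_quantile(0.95, sum by (le) (rate(vllm:time_per_output_token_seconds_bucket[5m])))
      - record: job:queue_depth:avg    # average requests waiting across replicas
        expr: avg(vllm:num_requests_waiting)
```

Recording rules keep the expensive `histogram_quantile` aggregations precomputed, so SLO alerts and Grafana panels stay cheap even at high scrape cardinality.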
LeaderWorkerSet is alpha. LWS is the recommended primitive for multi-node orchestration, but it's still alpha. For production, consider using StatefulSet with pod affinity rules or wait for LWS to graduate.
Wrap-up
llm-d joining the CNCF Sandbox marks a maturation point for AI infrastructure on Kubernetes. The project standardizes distributed inference as a first-class workload, with vendor-neutral APIs, validated deployment patterns, and measurable performance gains.
The core insight: LLM serving isn't just another HTTP service. It's stateful, non-uniform, and cache-sensitive. Treating it as such, with inference-aware routing, disaggregated phases, and hierarchical cache management, unlocks order-of-magnitude improvements in latency and throughput.
If you're operating LLM workloads on Kubernetes, llm-d is worth evaluating. NVIDIA Dynamo 1.0, released the same week, takes a different approach: hyperscale-optimized and NVIDIA-native, revealing the trade-offs between vendor-neutral modularity and vertical integration.