vllm

5 posts

Production LLM Serving on Kubernetes: vLLM + KServe Stack

Deploy vLLM with KServe on Kubernetes: InferenceService CRD, KEDA autoscaling on queue depth, and distributed KV cache with LMCache for production inference.

by KubeDojo

llm-d cncf sandbox

llm-d Joins CNCF Sandbox: Kubernetes-Native Distributed LLM Inference

llm-d, now a CNCF Sandbox project, provides Kubernetes-native distributed inference with KV-cache-aware routing, prefill/decode disaggregation, and accelerator-agnostic serving.

by KubeDojo

nvidia dynamo inference

NVIDIA Dynamo 1.0: The Inference Operating System for AI Factories

Production deployment patterns for NVIDIA Dynamo 1.0 on EKS and GKE: disaggregated serving, KV-aware routing, and gotchas from real deployments.

by KubeDojo

hami gpu virtualization

HAMi: GPU Virtualization as the Reference Pattern for AI Infrastructure

HAMi (CNCF Sandbox) emerged at KubeCon EU 2026 as the reference implementation for GPU resource management across NVIDIA, AMD, Huawei, and Cambricon accelerators.

by KubeDojo

gateway-api inference-extension ai-gateway

AI Gateway Working Group and Gateway API Inference Extension GA

The GA release of the Gateway API Inference Extension standardizes model-aware routing for self-hosted LLMs, with KV-cache-aware scheduling and LoRA adapter affinity.

by KubeDojo