
Production LLM Serving on Kubernetes: vLLM + KServe Stack
Deploy vLLM with KServe on Kubernetes: InferenceService CRD, KEDA autoscaling on queue depth, and distributed KV cache with LMCache for production inference.
2 posts

Deploy vLLM with KServe on Kubernetes: InferenceService CRD, KEDA autoscaling on queue depth, and distributed KV cache with LMCache for production inference.

llm-d was accepted as a CNCF Sandbox project, providing Kubernetes-native distributed inference with KV-cache-aware routing, prefill/decode disaggregation, and accelerator-agnostic serving.