
Production LLM Serving on Kubernetes: vLLM + KServe Stack
Deploy vLLM with KServe on Kubernetes: InferenceService CRD, KEDA autoscaling on queue depth, and distributed KV cache with LMCache for production inference.
9 posts

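The deployment path described above centers on KServe's InferenceService CRD fronting a vLLM container. A minimal sketch of such a manifest (the name, model, and image tag are illustrative assumptions, not from the posts):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-vllm                # hypothetical service name
spec:
  predictor:
    minReplicas: 1
    containers:
      - name: kserve-container    # KServe's conventional container name
        image: vllm/vllm-openai:latest
        args:
          - --model
          - meta-llama/Llama-3.1-8B-Instruct   # hypothetical model
        ports:
          - containerPort: 8000   # vLLM's OpenAI-compatible server port
        resources:
          limits:
            nvidia.com/gpu: "1"
```

KServe then handles routing and revision management for the predictor, while autoscaling can be delegated to KEDA as the later posts describe.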
Scraping KEDA operator metrics, building Grafana dashboards for scaling events, and diagnosing common ScaledObject issues in production.

Combining KEDA's event-driven pod scaling with Karpenter's just-in-time node provisioning for a fully reactive, cost-efficient Kubernetes autoscaling stack.

ScaledJob creates one Kubernetes Job per event, scales dynamically, and lets long-running batch workloads terminate cleanly.
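The ScaledJob pattern above can be sketched as a minimal manifest; the worker image and SQS queue are hypothetical placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: batch-worker              # hypothetical name
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: worker
            image: example.com/batch-worker:latest   # hypothetical image
        restartPolicy: Never      # each Job runs to completion, then exits
  pollingInterval: 30             # seconds between queue checks
  maxReplicaCount: 50
  successfulJobsHistoryLimit: 3
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs  # hypothetical
        queueLength: "5"
        awsRegion: us-east-1
```

Because each Job terminates on its own rather than being scaled down, long-running batch work is never killed mid-task, which is the key contrast with Deployment-based scaling.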

How the KEDA HTTP Add-on intercepts traffic to scale HTTP workloads to zero, and when the Prometheus scaler is better.

Using KEDA's Prometheus scaler to drive autoscaling from any PromQL query — replacing Prometheus Adapter with a simpler, more flexible approach.
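A minimal ScaledObject driven by the Prometheus scaler looks like the sketch below; the target Deployment name and the `vllm:num_requests_waiting` query are assumptions chosen to match the series' vLLM theme:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler               # hypothetical name
spec:
  scaleTargetRef:
    name: vllm-deployment         # hypothetical Deployment to scale
  minReplicaCount: 0              # scale-to-zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(vllm:num_requests_waiting)   # any PromQL expression works here
        threshold: "10"           # target value per replica
```

Any PromQL expression can stand in for the query, which is what makes this simpler than maintaining a Prometheus Adapter configuration for the HPA's external metrics API.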

How KEDA's Kafka and SQS scalers calculate lag and queue depth, with TriggerAuthentication patterns and production edge cases.
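A hedged sketch of the Kafka-scaler plus TriggerAuthentication pairing described above; the Secret, consumer group, and topic names are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-creds               # hypothetical name
spec:
  secretTargetRef:                # pull SASL credentials from a Secret
    - parameter: username
      name: kafka-secret          # hypothetical Secret
      key: username
    - parameter: password
      name: kafka-secret
      key: password
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-scaler
spec:
  scaleTargetRef:
    name: consumer                # hypothetical Deployment
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: my-group   # lag is computed per consumer group
        topic: events
        lagThreshold: "50"        # target lag per replica
      authenticationRef:
        name: kafka-creds
```

KEDA computes the group's total lag on the topic and divides it by `lagThreshold` to derive the desired replica count, which is the calculation the post unpacks.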

How KEDA extends Kubernetes HPA with 65+ scalers, scale-to-zero, and a two-phase architecture for event-driven pod autoscaling.

How kagent, Agent Sandbox, KEDA, and OPA/Kyverno form the production stack for agentic AI on Kubernetes.