
NVIDIA AI Cluster Runtime: Validated GPU Kubernetes Recipes
NVIDIA released AI Cluster Runtime, an open-source project providing validated, version-locked Kubernetes configurations for GPU infrastructure.
12 posts

Kubernetes 1.36 decouples scheduling policy from runtime instances with Workload API v1alpha2, standalone PodGroups, and a dedicated group scheduling cycle.

CNCF launched v1.0 of the Kubernetes AI Conformance Program defining baseline capabilities for running AI workloads across conformant clusters.

llm-d was accepted as a CNCF Sandbox project, providing Kubernetes-native distributed inference with KV-cache-aware routing, prefill/decode disaggregation, and accelerator-agnostic serving.

Production deployment patterns for NVIDIA Dynamo 1.0 on EKS and GKE — disaggregated serving, KV-aware routing, and gotchas from real deployments.

DRA went GA in Kubernetes v1.34 and continues evolving — replacing Device Plugins with richer semantics including DeviceClass, ResourceClaim, CEL-based filtering, and topology awareness.
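As a rough illustration of the richer semantics DRA brings, a ResourceClaim under the GA `resource.k8s.io/v1` API can select devices with a CEL expression; the device class name and memory attribute below are placeholders, not a real driver:

```yaml
# Hedged sketch: claim one GPU whose advertised memory is at least 40Gi.
# "gpu.example.com" and its capacity key are illustrative names.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-large-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com   # DeviceClass published by the driver
        count: 1
        selectors:
        - cel:
            # CEL-based filtering on device attributes/capacity
            expression: device.capacity["gpu.example.com"].memory.compareTo(quantity("40Gi")) >= 0
```

A Pod then references the claim by name in `spec.resourceClaims`, and the scheduler allocates a matching device rather than an opaque extended resource.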

How Karpenter's groupless, pod-driven provisioning model solves the scaling limitations that plagued Kubernetes Cluster Autoscaler for years.

How Karpenter automatically replaces drifted nodes, consolidates underutilized capacity, and respects disruption budgets to keep clusters lean and current.
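To make the two Karpenter posts above concrete, here is a hedged sketch of a `karpenter.sh/v1` NodePool combining pod-driven provisioning requirements with consolidation and a disruption budget (instance types and the node class name are illustrative):

```yaml
# Hedged sketch: groupless GPU pool with consolidation and a disruption budget.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g5.xlarge", "g5.2xlarge"]   # placeholder instance types
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                          # assumed pre-existing node class
  limits:
    cpu: "1000"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
    - nodes: "10%"   # disrupt at most 10% of this pool's nodes at a time
```

Karpenter launches nodes sized to pending pods rather than scaling a fixed node group, then drifts, consolidates, and replaces them within the stated budget.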

How KEDA extends Kubernetes HPA with 65+ scalers, scale-to-zero, and a two-phase architecture for event-driven pod autoscaling.
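A minimal sketch of that model: a ScaledObject handles the 0↔1 transition itself and delegates 1↔N to an HPA it manages under the hood. The Deployment name, Prometheus address, and query are placeholders:

```yaml
# Hedged sketch: event-driven scaling of a Deployment, including scale-to-zero.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-scaler
spec:
  scaleTargetRef:
    name: consumer            # assumed Deployment in the same namespace
  minReplicaCount: 0          # scale-to-zero when the trigger is idle
  maxReplicaCount: 20
  triggers:
  - type: prometheus          # one of KEDA's many scalers
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # placeholder
      query: sum(rate(http_requests_total[2m]))          # placeholder
      threshold: "100"
```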

What KubeDojo is and what you'll find here: deep dives into real code, honest explorations of the Kubernetes ecosystem, and structured learning paths to master every certification.

A practical map of the five CNCF Kubernetes certifications — what each one covers, how exams work, and which path fits your career.

How kagent, Agent Sandbox, KEDA, and OPA/Kyverno form the production stack for agentic AI on Kubernetes.