KubeDojo

CNCF Certified Kubernetes AI Conformance Program

by KubeDojo · 17 min read

GPU scheduling works differently on every Kubernetes platform. DRA implementations vary. Inference routing is a patchwork of vendor-specific solutions. If you've tried moving an AI training job from one managed Kubernetes service to another, you know the drill: rewrite the resource requests, swap the device plugin, reconfigure the autoscaler, and hope the gang scheduling solution you picked is available on the target cluster.

The CNCF launched the Certified Kubernetes AI Conformance Program at KubeCon NA 2025 to fix exactly this. Modeled on the original Kubernetes conformance program that brought 100+ distributions to a common API baseline, this new initiative defines the minimum capabilities a platform must offer to reliably run AI and ML workloads. The goal: if your training job or inference service works on one conformant platform, it works on any of them.

This article walks through what the program requires, how the conformance checklist is structured, and what it means for teams choosing where to run production AI workloads.

What the Program Covers

The AI Conformance Program targets three workload categories:

  • Training: distributed jobs that need accelerators, predictable scheduling, and all-or-nothing placement.
  • Inference: model serving where latency, traffic routing, and autoscaling matter.
  • Agentic workloads: multi-step workflows combining tools, memory, and long-running tasks.

Not every workload type uses every conformant feature. A single-GPU inference service doesn't need gang scheduling. A training job doesn't need Gateway API routing. The point is that a conformant platform supports the full surface area, so you don't discover missing capabilities after you've committed to a vendor.

The v1.35 checklist is strongest for training and inference. Agentic workloads, where long-running agents combine tools, memory, and iterative reasoning, are the newest pattern. The conformance requirements don't yet address their specific needs (persistent state across restarts, dynamic tool provisioning, long-lived WebSocket connections). The working group has flagged agentic workloads as a research area for v2.0, but for now, conformance mainly tells you whether training and inference will work reliably.

The Checklist

Conformance is validated through a YAML checklist, versioned per Kubernetes release. The latest is AIConformance-1.35.yaml. Each item has a level:

  • MUST: mandatory for conformance. Fail one, fail the whole assessment.
  • SHOULD: recommended but not blocking. Platforms are expected to implement these over time.
  • N/A: allowed only with justification. "We don't support this feature" is not a valid reason. You need to explain why the requirement's context doesn't apply to your platform. Example: cluster autoscaling can be N/A for bare-metal on-premises deployments where node pools don't scale dynamically.
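For illustration, an N/A entry in a submitted checklist might look like the sketch below. The `justification` field name is an assumption based on the program's description; the actual submission schema may differ.

```yaml
# Hypothetical N/A entry in a vendor's submitted checklist (field names assumed)
spec:
  schedulingOrchestration:
    - id: cluster_autoscaling
      level: N/A
      justification: "Bare-metal on-premises deployment; node pools are
        static, with no dynamic provisioning mechanism to scale."
```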

The checklist covers six domains with 12 total items. Here's the full breakdown.

Figure 1: The six requirement domains and their checklist items. Gold items are MUST requirements; gray items are SHOULD.

The Six Requirement Domains

Accelerators

This is the heaviest section, with one MUST and three SHOULDs.

# AIConformance-1.35.yaml — accelerators section (descriptions trimmed)
spec:
  accelerators:
    - id: dra_support
      description: "Support Dynamic Resource Allocation (DRA) APIs to enable
        more flexible and fine-grained resource requests beyond simple counts."
      level: MUST
    - id: driver_runtime_management
      description: "Provide a verifiable mechanism for ensuring that compatible
        accelerator drivers and corresponding container runtime configurations
        are correctly installed and maintained on nodes with accelerators.
        ..." # forward-looking DRA verification note trimmed
      level: SHOULD
    - id: gpu_sharing
      description: "For accelerators that support static GPU sharing, provide
        well-defined mechanisms for at least one GPU sharing strategy to
        improve utilization for workloads that do not require a full
        dedicated GPU. ..." # hardware partitioning, time-slicing, and
        # forward-looking DRA dynamic sharing notes trimmed
      level: SHOULD
    - id: virtualized_accelerator
      description: "For accelerators that support virtualized accelerator
        technologies (e.g. vGPU), provide well-defined mechanisms for these
        to be exposed and managed. ..." # consistency and DRA notes trimmed
      level: SHOULD

DRA is the only MUST here. The old device plugin model (nvidia.com/gpu: 1) treats GPUs as opaque integers. DRA, stable since Kubernetes 1.34, works more like PersistentVolumeClaim: you define resource classes, claim specific device capabilities, and let the scheduler handle placement. This matters for multi-GPU training jobs where topology awareness (which GPUs share an NVLink bus) can significantly affect throughput.
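A minimal sketch of that PVC-like flow, assuming the `resource.k8s.io/v1` API that went stable in 1.34; the device class name and image are placeholders, not any vendor's real driver:

```yaml
# Sketch: claiming a GPU via DRA instead of an opaque nvidia.com/gpu count.
# "gpu.example.com" is a placeholder DeviceClass published by a DRA driver.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: Never
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: trainer
      image: ghcr.io/example/trainer:latest  # placeholder image
      resources:
        claims:
          - name: gpu  # the scheduler allocates a device satisfying the claim
```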

The three SHOULD items reflect where the ecosystem is heading. GPU sharing (MIG partitioning, time-slicing, MPS) and vGPU support aren't mandatory yet, but the checklist is clear that platforms should expose fractional GPU resources as schedulable units once the hardware supports it through DRA. Watch the SHOULD items closely. They signal what becomes MUST in v2.0. If your platform doesn't support GPU sharing today, it will likely need to by the next certification cycle.
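On NVIDIA hardware, for instance, time-slicing is typically enabled through the device plugin's sharing configuration. A minimal sketch (the replica count is illustrative):

```yaml
# Sketch: NVIDIA device plugin config enabling time-slicing, so each
# physical GPU is advertised as 4 schedulable nvidia.com/gpu units.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```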

Networking

# AIConformance-1.35.yaml — networking section
spec:
  networking:
    - id: ai_inference
      description: "Support the Kubernetes Gateway API with an implementation
        for advanced traffic management for inference services, which enables
        capabilities like weighted traffic splitting, header-based routing
        (for OpenAI protocol headers), and optional integration with service meshes."
      level: MUST

One item, and it's a MUST. Inference services need more than basic Service load balancing. The requirement specifically calls out Gateway API support with weighted traffic splitting (for canary deployments of model versions) and header-based routing for OpenAI protocol headers. This is how you route requests to different model backends based on the model field in the OpenAI-compatible API request.
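A sketch of what this looks like with a plain Gateway API HTTPRoute; the gateway name, backend Services, and the routing header are all placeholders, and real inference gateways often layer dedicated CRDs on top of this:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-canary
spec:
  parentRefs:
    - name: inference-gateway      # placeholder Gateway
  rules:
    - matches:
        - headers:
            - name: x-model-name   # placeholder header; real setups may route
              value: llama-3-70b   # on the OpenAI request's model field
      backendRefs:
        - name: llama-3-70b-stable # placeholder Services, weighted 90/10
          port: 8000
          weight: 90
        - name: llama-3-70b-canary
          port: 8000
          weight: 10
```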

This is the requirement most tightly coupled to a specific ecosystem choice. Gateway API is the community's answer to Ingress, but inference routing is still evolving fast. Projects like kgateway (formerly Gloo Gateway) and Kubermatic's KubeLB AI Gateway are building on top of Gateway API, but the implementations vary significantly. Ask your vendor which Gateway API controller they certified with, and whether it supports the inference-specific features or just the base spec.

Scheduling and Orchestration

# AIConformance-1.35.yaml — scheduling section (descriptions trimmed)
spec:
  schedulingOrchestration:
    - id: gang_scheduling
      description: "The platform must allow for the installation and successful
        operation of at least one gang scheduling solution that ensures
        all-or-nothing scheduling for distributed AI workloads (e.g. Kueue,
        Volcano, etc.) ..." # vendor demonstration requirement trimmed
      level: MUST
    - id: cluster_autoscaling
      description: "If the platform provides a cluster autoscaler or an
        equivalent mechanism, it must be able to scale up/down node groups
        containing specific accelerator types based on pending pods requesting
        those accelerators."
      level: MUST
    - id: pod_autoscaling
      description: "If the platform supports the HorizontalPodAutoscaler,
        it must function correctly for pods utilizing accelerators.
        ..." # custom metrics requirement trimmed
      level: MUST

Three MUSTs. Gang scheduling prevents the classic distributed training deadlock: Job A grabs 4 of 8 required GPUs, Job B grabs the other 4, and both sit forever waiting for resources that will never free up. The checklist requires at least one solution (Kueue, Volcano, or equivalent) that guarantees all-or-nothing placement.
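With Kueue, for instance, the all-or-nothing behavior looks roughly like the sketch below; the queue name and image are placeholders, and the ClusterQueue/LocalQueue setup is omitted:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ddp-train
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue  # placeholder LocalQueue
spec:
  parallelism: 8
  completions: 8
  suspend: true  # Kueue unsuspends the Job only once all 8 pods can be admitted together
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: ghcr.io/example/trainer:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```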

The autoscaling requirements are conditional: "if the platform provides" a cluster autoscaler or HPA. But the bar is specific. The cluster autoscaler must handle accelerator-typed node groups, not just generic CPU/memory pools. HPA must work with custom metrics relevant to AI workloads (GPU utilization, queue depth, tokens per second), not just CPU percentage.
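An HPA against such a custom metric might be sketched like this; the metric name assumes a vLLM-style per-pod metric exposed through a metrics adapter, and the Deployment name is a placeholder:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server               # placeholder inference Deployment
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_tokens_per_second  # hypothetical custom metric name
        target:
          type: AverageValue
          averageValue: "500"      # scale out when pods average > 500 tok/s
```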

Gang scheduling is the requirement most likely to expose differences between conformant platforms. Kueue is the community standard, but it's maturing fast. Features like MultiKueue (cross-cluster job dispatching) and ProvisioningRequest (just-in-time node provisioning) shipped recently. If your vendor certified with an older Kueue release, you may have gang scheduling on paper but miss the features that make it production-ready for large-scale training. Ask which version they tested against.

Observability

# AIConformance-1.35.yaml — observability section (descriptions trimmed)
spec:
  observability:
    - id: accelerator_metrics
      description: "For supported accelerator types, the platform must allow
        for the installation and successful operation of at least one
        accelerator metrics solution that exposes fine-grained performance
        metrics via a standardized, machine-readable metrics endpoint.
        ..." # core metrics list, OTel alignment, and managed solution
        # opt-out notes trimmed
      level: MUST
    - id: ai_service_metrics
      description: "Provide a monitoring system capable of discovering and
        collecting metrics from workloads that expose them in a standard
        format (e.g. Prometheus exposition format).
        ..." # framework integration note trimmed
      level: MUST

Two MUSTs. The full accelerator_metrics description requires per-device utilization and memory usage at minimum, plus temperature, power draw, and interconnect bandwidth where the hardware exposes them. The checklist recommends alignment with emerging standards like OpenTelemetry for interoperability, though it stops short of mandating a specific format.

The second item requires the platform to discover and collect metrics from workloads that expose Prometheus-format endpoints. This isn't about shipping a specific monitoring stack. It's about ensuring your vLLM server's /metrics endpoint gets scraped without custom integration work on each platform.
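With the Prometheus Operator, for example, discovery of such an endpoint is typically a small ServiceMonitor; the label selector and port name here are placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
spec:
  selector:
    matchLabels:
      app: vllm          # placeholder label on the inference Service
  endpoints:
    - port: http         # placeholder named port on the Service
      path: /metrics
      interval: 15s
```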

Security

# AIConformance-1.35.yaml — security section
spec:
  security:
    - id: secure_accelerator_access
      description: "Ensure that access to accelerators from within containers
        is properly isolated and mediated by the Kubernetes resource management
        framework (device plugin or DRA) and container runtime, preventing
        unauthorized access or interference between workloads."
      level: MUST

One MUST. GPUs are shared resources in multi-tenant clusters. Without proper isolation, a container could access another workload's GPU memory, which is a data exfiltration risk. The requirement is straightforward: accelerator access must go through the Kubernetes resource management framework (device plugin or DRA), and the container runtime must enforce isolation between workloads.

Operators

# AIConformance-1.35.yaml — operator section
spec:
  operator:
    - id: robust_controller
      description: "The platform must prove that at least one complex AI
        operator with a CRD (e.g., Ray, Kubeflow) can be installed and
        functions reliably. This includes verifying that the operator's pods
        run correctly, its webhooks are operational, and its custom resources
        can be reconciled."
      level: MUST

One MUST. This is a practical sanity check. AI frameworks like Ray and Kubeflow are complex operators that stress the API server, webhook infrastructure, and CRD reconciliation loop. If a platform can't reliably run these operators, the rest of the conformance checklist is academic. The requirement asks vendors to demonstrate that at least one such operator installs cleanly, its webhooks fire, and custom resources reconcile.

Who's Already Certified

The initial v1.0 certifications were announced at KubeCon NA 2025 in Atlanta. The roster includes:

| Category | Platforms |
| --- | --- |
| Major cloud | AWS EKS, Google GKE, Microsoft Azure AKS, Oracle OCI |
| European / sovereign | Kubermatic KKP, SUSE, Gardener |
| AI-native infrastructure | CoreWeave, Akamai |
| Open source | Red Hat OpenShift, Sidero Labs (Talos), VMware VKS |
Checklists are available for Kubernetes v1.33, v1.34, and v1.35. Each platform certifies per-product and per-configuration. A cloud deployment and an air-gapped deployment of the same product are separate certifications.

The breadth of initial participants matters. It's not just the hyperscalers checking a compliance box. European vendors like Kubermatic explicitly frame conformance as a digital sovereignty play: "The future of AI will be built on open standards, not walled gardens." CoreWeave and Akamai represent the GPU-native infrastructure layer that didn't exist in the original K8s conformance era.

Gotchas

Self-assessment today, automated tests later. The v1.0 process is a structured self-assessment: fill out the YAML checklist, provide links to public documentation as evidence, and submit a pull request. CNCF reviews submissions within 10 business days. Automated conformance tests are planned for 2026, which will raise the bar significantly.

N/A is not a free pass. Vendors can mark a requirement as "Not Applicable," but the justification matters. "We don't support this feature" gets rejected. A valid justification explains why the requirement's context doesn't apply: bare-metal platforms can justify N/A for cluster autoscaling because there are no dynamic node pools to scale. Misunderstanding this distinction is the fastest way to get a submission sent back.

Annual renewal, aligned with Kubernetes releases. Certifications are valid for one year and must be renewed with each Kubernetes minor version. A platform certified for v1.34 needs to re-certify for v1.35. This keeps the program aligned with the Kubernetes release cycle and prevents stale certifications from lingering.

v2.0 will expand scope. The current checklist is deliberately focused. The v2.0 roadmap, expected in 2026, will likely add requirements for advanced inference patterns, enhanced monitoring metrics, and stricter security for model serving. If you're building conformance into your platform strategy, plan for the requirements to grow.

Conformance doesn't mean identical. Two conformant platforms can implement requirements differently. GKE enables DRA by default and ships Gateway API as a managed feature. Kubermatic bundles Kueue in its default application catalog and integrates the KubeLB AI Gateway. Both are conformant, but the operational experience differs. Conformance guarantees the capabilities exist, not that they're configured identically.

Wrap-up

Twelve requirements across six domains. That's the current bar for Kubernetes AI Conformance: DRA, gang scheduling, Gateway API routing, accelerator metrics, isolated GPU access, and operator reliability.

The program is the community's bet that open standards will beat proprietary AI infrastructure stacks, and the initial roster of certified platforms (hyperscalers, European vendors, GPU-native providers) suggests the bet is landing. With 82% of organizations building custom AI solutions and 58% already on Kubernetes, the fragmentation cost is real enough that vendors are willing to certify.

Before committing to a platform for production AI workloads, check its conformance status at github.com/cncf/k8s-ai-conformance. If it's not on the list, ask why. The checklist is public, the bar is clearly defined, and there's no technical reason a production-grade Kubernetes platform can't meet it.
