KubeDojo

Dynamic Resource Allocation (DRA) GA: The New GPU Interface for Kubernetes

by KubeDojo · 17 min read

If you've operated GPU workloads on Kubernetes, you know the gap. You request nvidia.com/gpu: 1 and hope the scheduler places your Pod on a node with an available GPU. You can't ask for a specific GPU model. You can't express "two GPUs on the same PCIe root complex." You can't share a GPU across containers without resorting to time-slicing hacks configured outside Kubernetes.

Dynamic Resource Allocation changed this. DRA went GA in Kubernetes v1.34, replacing the count-based Device Plugin model with a claim-based API that treats GPUs as first-class, attribute-aware resources. Instead of resources.limits, you create a ResourceClaim that references a DeviceClass with CEL filters. The scheduler matches claims to ResourceSlices published by DRA drivers, then places Pods on nodes that can access the allocated devices.

This article walks through the DRA architecture, the migration path from Device Plugins, and real examples from the NVIDIA and AMD DRA drivers. You'll see how to write CEL filters for GPU selection, set up DeviceClasses, and deploy workloads that share GPUs across containers.

DRA Architecture: DeviceClass, ResourceClaim, ResourceSlice

DRA introduces four API kinds in the resource.k8s.io/v1 group. Understanding how they interact is the foundation for migrating from Device Plugins.

DeviceClass

A DeviceClass defines a category of devices. Think of it as a StorageClass for GPUs — it tells workload operators what types of devices exist and how to request them.

# deviceclass.yaml (k8s.io/examples/dra/deviceclass.yaml)
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  - cel:
      expression: |-
        device.driver == "gpu.nvidia.com"

This DeviceClass selects all devices managed by the gpu.nvidia.com driver. The CEL expression can filter on any attribute published in the ResourceSlice — driver name, device model, memory capacity, topology hints.

Cluster admins typically create DeviceClass objects during driver installation. The NVIDIA DRA driver, for example, creates gpu.nvidia.com and mig.nvidia.com classes out of the box.
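A MIG-oriented class created by the driver follows the same pattern, filtering on the device type instead of just the driver name. The following is a sketch, not the exact manifest the NVIDIA driver ships — the class name and the `type == "mig"` attribute value are assumptions based on the attribute conventions shown later in this article:

```yaml
# Sketch of a MIG DeviceClass (illustrative; the driver-shipped
# class may use different selectors or additional filters)
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: mig.nvidia.com
spec:
  selectors:
  - cel:
      expression: |-
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].type == "mig"
```

Workloads that request `deviceClassName: mig.nvidia.com` then only ever match MIG partitions, never whole GPUs.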

ResourceSlice

A ResourceSlice represents one or more devices attached to nodes. DRA drivers create and manage these objects — you don't create them manually. Each slice publishes device attributes that CEL filters can match against.

Here's a representative ResourceSlice from the AMD GPU DRA driver advertising an MI300X (attributes based on github.com/ROCm/k8s-gpu-dra-driver docs/driver-attributes.md):

# ResourceSlice example (representative, based on github.com/ROCm/k8s-gpu-dra-driver driver attributes)
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu-worker-2-gpu.amd.com-5fddf
spec:
  driver: gpu.amd.com
  nodeName: gpu-worker-2
  devices:
  - name: gpu-56-184
    attributes:
      cardIndex:
        int: 56
      deviceID:
        string: "8462659767828489944"
      driverVersion:
        version: 6.12.12
      family:
        string: AI
      partitionProfile:
        string: spx_nps1
      pciAddr:
        string: "0003:00:04.0"
      productName:
        string: AMD_Instinct_MI300X_OAM
      resource.kubernetes.io/pcieRoot:
        string: pci0003:00
      type:
        string: amdgpu
    capacity:
      computeUnits:
        value: "304"
      simdUnits:
        value: "1216"
      memory:
        value: 196592Mi

Every GPU gets this treatment — PCI address, partition profile, driver version, memory, compute units. The scheduler sees the entire cluster's GPU landscape, not just a count per node.

ResourceClaim vs ResourceClaimTemplate

You request devices using either a ResourceClaim or a ResourceClaimTemplate. The choice depends on whether you want to share devices across Pods or give each Pod its own device.

ResourceClaim — manual creation, shared access, lifecycle independent of Pods:

# resourceclaim.yaml (k8s.io/examples/dra/resourceclaim.yaml)
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-gpu-claim
spec:
  devices:
    requests:
    - name: single-gpu-claim
      exactly:
        deviceClassName: nvidia-gpu
        allocationMode: All
        selectors:
        - cel:
            expression: |-
              device.attributes["gpu.nvidia.com"].type == "gpu"

Multiple Pods can reference shared-gpu-claim and share the allocated GPU. This is ideal for inference workloads where multiple replicas can use the same device.
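A Pod references a manually created claim by name via `resourceClaimName` (as opposed to `resourceClaimTemplateName`, shown later). A minimal sketch, with a placeholder image:

```yaml
# Pod sharing the pre-created claim (image is a placeholder)
apiVersion: v1
kind: Pod
metadata:
  name: inference-replica-0
spec:
  containers:
  - name: app
    image: my-inference-app  # placeholder
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-gpu-claim  # the ResourceClaim above
```

A second Pod with the same `resourceClaimName` shares the allocated GPU rather than triggering a new allocation.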

ResourceClaimTemplate — automatic per-Pod generation, dedicated devices, lifecycle bound to Pod:

# resourceclaimtemplate.yaml (k8s.io/examples/dra/resourceclaimtemplate.yaml)
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: per-pod-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu-claim
        exactly:
          deviceClassName: nvidia-gpu
          selectors:
            - cel:
                expression: |-
                  device.attributes["gpu.nvidia.com"].type == "gpu" &&
                  device.capacity["gpu.nvidia.com"].memory == quantity("64Gi")

Kubernetes generates a fresh ResourceClaim for each Pod created from a Deployment or Job. Each Pod gets its own GPU. Use this for training workloads where each replica needs dedicated hardware.
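Wiring the template into a Deployment is a one-line reference. In this sketch (image and replica count are placeholders), each of the three replicas gets its own generated claim, and therefore its own GPU:

```yaml
# Deployment using the per-pod-gpu template (image is a placeholder)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trainer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: trainer
  template:
    metadata:
      labels:
        app: trainer
    spec:
      containers:
      - name: trainer
        image: my-training-app  # placeholder
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: per-pod-gpu
```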

DRA Workflow

The DRA workflow ties these components together:

  1. Admin creates DeviceClass — defines GPU categories with CEL selectors
  2. DRA Driver creates ResourceSlice — advertises GPU attributes to the cluster
  3. Operator creates ResourceClaim — requests devices for a workload
  4. Scheduler matches Claim to Slice — finds nodes with matching devices
  5. Pod scheduled on node — kubelet attaches allocated device to container

Figure 1: DRA scheduling flow — the scheduler matches ResourceClaims to ResourceSlices, then places Pods on nodes with allocated devices

Migration: From Device Plugin to DRA

The migration from Device Plugins to DRA is more about changing how you request devices than replacing your entire GPU stack. The Device Plugin model isn't going away — but DRA is the future for attribute-aware scheduling.

What changes

Request syntax — from count-based to claim-based:

# Device Plugin (old)
containers:
- name: app
  image: my-gpu-app
  resources:
    limits:
      nvidia.com/gpu: 1

# DRA (new)
containers:
- name: app
  image: my-gpu-app
  resources:
    claims:
    - name: gpu-claim
resourceClaims:
- name: gpu-claim
  resourceClaimTemplateName: per-pod-gpu

Scheduling semantics — the scheduler now matches claims to ResourceSlices using CEL filters. You can express constraints that were impossible before: "two GPUs from the same parent device," "GPUs with at least 64Gi memory," "GPUs on the same PCIe root complex."
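The "same PCIe root complex" constraint, for instance, can be expressed with `matchAttribute` against the standard `resource.kubernetes.io/pcieRoot` attribute shown in the ResourceSlice earlier. A sketch, reusing the `nvidia-gpu` class from the DeviceClass example:

```yaml
# Two GPUs constrained to the same PCIe root complex (sketch)
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: two-gpus-same-pcie-root
spec:
  devices:
    constraints:
    - matchAttribute: "resource.kubernetes.io/pcieRoot"  # both requests must resolve to the same value
      requests: ["gpu0", "gpu1"]
    requests:
    - name: gpu0
      exactly:
        deviceClassName: nvidia-gpu
    - name: gpu1
      exactly:
        deviceClassName: nvidia-gpu
```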

What stays

Device Plugin metrics — kubelet still exposes PodResourcesLister gRPC service for monitoring. Tools like kubectl top and Prometheus exporters continue working.

nvidia-smi visibility — once the Pod is running, nvidia-smi -L shows the same GPU list. The allocation mechanism changed, not the runtime behavior.

When to migrate

New GPU workloads — start with DRA. The claim-based model gives you more flexibility and better scheduling.

Existing Device Plugin deployments — no rush. Device Plugins continue working. Migrate when you need attribute-based scheduling or GPU sharing across containers.

When DRA is worth the migration

Migrate to DRA when:

  • You need GPU sharing across containers or Pods
  • You're running MIG workloads and want partition-aware scheduling
  • You have multi-vendor GPUs and want a unified scheduling model
  • You need topology constraints (same PCIe root, same NVLink domain)

Stay on Device Plugins when:

  • Your workloads only need whole-GPU allocation
  • You're on Kubernetes < v1.34
  • Your GPU operator doesn't yet support DRA

Real example: NVIDIA GPU Operator with MIG + DRA on AKS

Azure AKS demonstrates the migration path. The GPU Operator installs both the Device Plugin and DRA driver, but you disable the Device Plugin to let DRA take over:

# operator-install.yaml (blog.aks.azure.com/2026/03/03/multi-instance-gpu-with-dra-on-aks)
mig:
  strategy: single
devicePlugin:
  enabled: false  # DRA replaces Device Plugin
driver:
  enabled: true
toolkit:
  env:
  - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
    value: "false"

The DRA driver installation then points to the driver root managed by the Operator:

# dra-install.yaml (blog.aks.azure.com/2026/03/03/multi-instance-gpu-with-dra-on-aks)
gpuResourcesEnabledOverride: true
resources-computeDomains:
  enabled: false
controller:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.azure.com/mode
            operator: In
            values:
            - system
nvidiaDriverRoot: "/run/nvidia/driver"

With MIG enabled and DRA managing allocation, you create a ResourceClaimTemplate for MIG partitions:

# mig-gpu-1g ResourceClaimTemplate (blog.aks.azure.com/2026/03/03/multi-instance-gpu-with-dra-on-aks)
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: mig-gpu-1g
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: nvidia-mig
          count: 1

A TensorFlow Job then references the template:

# samples-tf-mnist-demo Job (blog.aks.azure.com/2026/03/03/multi-instance-gpu-with-dra-on-aks, adapted from k8s.io/examples/dra/dra-example-job.yaml)
apiVersion: batch/v1
kind: Job
metadata:
  name: samples-tf-mnist-demo
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: mig-gpu-1g

The Job runs on a MIG partition, not a full GPU. Multiple Jobs can share the same physical GPU, each with isolated memory and compute.

Setting Up DRA: NVIDIA and AMD Drivers

Setting up DRA requires installing the DRA driver for your GPU vendor. The drivers handle ResourceSlice creation, device allocation, and CDI injection.

NVIDIA k8s-dra-driver-gpu

The NVIDIA DRA driver manages two resource types:

GPUs — traditional GPU allocation with MIG support (beta, not yet officially supported as of v25.3.0)

ComputeDomains — abstraction for Multi-Node NVLink (MNNVL), officially supported

Installation uses Helm. First, create a kind cluster (or use your production cluster):

git clone https://github.com/NVIDIA/k8s-dra-driver-gpu.git
cd k8s-dra-driver-gpu
export KIND_CLUSTER_NAME="kind-dra-1"
./demo/clusters/kind/build-dra-driver-gpu.sh
./demo/clusters/kind/create-cluster.sh
./demo/clusters/kind/install-dra-driver-gpu.sh

The driver creates ResourceSlices automatically. Verify with:

kubectl get resourceslices

AMD k8s-gpu-dra-driver

The AMD driver integrates with ROCm and publishes ResourceSlices with detailed attributes. Each MI300X gets advertised with:

  • PCI address (pciAddr: "0003:00:04.0")
  • Partition profile (partitionProfile: "spx_nps1")
  • Memory capacity (memory: 196592Mi)
  • Compute units (computeUnits: "304")
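These attributes make it straightforward to pin a class to a specific model. A sketch (the class name `amd-mi300x` is illustrative, the attribute values come from the ResourceSlice above):

```yaml
# DeviceClass matching only MI300X devices (illustrative name)
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: amd-mi300x
spec:
  selectors:
  - cel:
      expression: |-
        device.driver == "gpu.amd.com" &&
        device.attributes["gpu.amd.com"].productName == "AMD_Instinct_MI300X_OAM"
```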

The driver also supports partitioned GPUs. A child device references its parent:

# Partitioned GPU ResourceSlice (representative, based on github.com/ROCm/k8s-gpu-dra-driver driver attributes)
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu-worker-2-gpu.amd.com-5fddf
spec:
  driver: gpu.amd.com
  nodeName: gpu-worker-2
  devices:
  - name: gpu-4-132
    attributes:
      parentDeviceID:
        string: "16912319329091163297"
      parentPciAddr:
        string: "0002:00:01.0"
      partitionProfile:
        string: cpx_nps1
      productName:
        string: AMD_Instinct_MI300X_OAM
      type:
        string: amdgpu-partition
    capacity:
      computeUnits:
        value: "38"
      memory:
        value: 24574Mi

GPU sharing across containers

The NVIDIA driver demo shows two containers sharing one GPU via a ResourceClaimTemplate:

# gpu-test2.yaml (github.com/NVIDIA/k8s-dra-driver-gpu/demo/specs/quickstart/v1/)
apiVersion: v1
kind: Pod
metadata:
  name: pod
  namespace: gpu-test2
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu

Both containers reference the same claim. When they run nvidia-smi -L, they see the same UUID:

[pod/pod/ctr0] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
[pod/pod/ctr1] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)

CEL Filters: Expressive GPU Selection

CEL (Common Expression Language) filters are the heart of DRA's expressiveness. You can filter on any attribute published in the ResourceSlice.

Basic filter

Select all devices from a driver:

selectors:
- cel:
    expression: device.driver == "gpu.nvidia.com"

Attribute filter

Select a specific GPU model:

selectors:
- cel:
    expression: |-
      device.attributes["gpu.amd.com"].productName == "AMD_Instinct_MI300X_OAM" &&
      device.attributes["gpu.amd.com"].partitionProfile == "spx_nps1"

Capacity filter

Request GPUs with minimum memory:

selectors:
- cel:
    expression: |-
      device.attributes["driver.example.com"].type == "gpu" &&
device.capacity["driver.example.com"].memory.compareTo(quantity("64Gi")) >= 0

Topology constraint

Allocate two GPU partitions from the same parent device:

# ResourceClaim with constraints (github.com/ROCm/k8s-gpu-dra-driver)
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: two-partitions-same-parent
  namespace: gpu-test
spec:
  devices:
    constraints:
    - matchAttribute: "gpu.amd.com/deviceID"
      requests: ["gpu1", "gpu2"]
    requests:
    - name: gpu1
      exactly:
        deviceClassName: gpu.amd.com
        selectors:
        - cel:
            expression: device.attributes["gpu.amd.com"].type == "amdgpu-partition"
    - name: gpu2
      exactly:
        deviceClassName: gpu.amd.com
        selectors:
        - cel:
            expression: device.attributes["gpu.amd.com"].type == "amdgpu-partition"

The matchAttribute constraint ensures both gpu1 and gpu2 have the same deviceID — they're partitions of the same physical GPU.

Gotchas

DRA requires Kubernetes v1.34+ — check before deploying. Run kubectl get deviceclasses. If you get error: the server doesn't have a resource type "deviceclasses", your cluster is too old or the API group is disabled.

ResourceClaim must exist before Pod references it — unless you use ResourceClaimTemplate. If you manually create a ResourceClaim and reference it in a Pod, the Pod won't schedule until the claim exists. ResourceClaimTemplate avoids this by generating claims automatically.

Admin access requires namespace labeling — the adminAccess field in a ResourceClaim grants privileged device access. Starting with Kubernetes v1.34, you can only use this field in namespaces labeled resource.kubernetes.io/admin-access: "true" (case-sensitive).

Device taints can block scheduling — DRA supports device taints (alpha in v1.33). A device with a NoSchedule taint won't be allocated. A NoExecute taint evicts Pods already using the device. Taints are applied via DeviceTaintRule objects.
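A taint is applied cluster-wide through a DeviceTaintRule. The following is a sketch based on the alpha API; the `v1alpha3` group, field names, and the driver/key values here may change as the feature matures:

```yaml
# Quarantine a single device (alpha API, sketch; driver and key are illustrative)
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: quarantine-flaky-gpu
spec:
  deviceSelector:
    driver: gpu.example.com
    device: gpu-0
  taint:
    key: example.com/unhealthy
    value: "true"
    effect: NoSchedule  # NoExecute would also evict running Pods
```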

Pre-scheduled Pods bypass the scheduler — if you set spec.nodeName directly, the scheduler doesn't run. ResourceClaims may not be allocated. Use a node selector instead:

spec:
  nodeSelector:
    kubernetes.io/hostname: name-of-the-intended-node

Binding conditions wait up to 600 seconds — alpha feature in v1.34. External device preparation (e.g., fabric-attached GPUs) can delay binding. The scheduler waits for conditions like dra.example.com/is-prepared to become True. Configure the timeout in KubeSchedulerConfiguration.

Wrap-up

DRA replaces the count-based Device Plugin model with a claim-based API that treats GPUs as first-class, attribute-aware resources. You create DeviceClass objects to categorize devices, ResourceClaim or ResourceClaimTemplate to request them, and CEL filters to express precise requirements.

The migration path is straightforward: install the DRA driver, create or use existing DeviceClass objects, and update your workload manifests to use resources.claims instead of resources.limits. Both the NVIDIA and AMD drivers are usable today, with the NVIDIA driver officially supporting ComputeDomains for MNNVL (its GPU allocation support is still beta) and the AMD driver providing full GPU attribute visibility.

For AI workloads, DRA is the foundation for more advanced scheduling patterns. Kueue uses DRA for batch workload management. HAMi provides GPU virtualization on top of DRA. The Gateway API inference extension integrates with DRA for model-aware routing.

Start by deploying the NVIDIA or AMD DRA driver on a test cluster. Migrate a single GPU workload and compare the scheduling behavior against your current Device Plugin setup. The claim-based model takes more YAML upfront, but you get back something Device Plugins never offered: the ability to ask for exactly what you need.

The claim-based model restores Kubernetes' core promise: you declare what you need, and the scheduler handles the rest. No more node labels for GPU models. No more manual coordination between cluster admins and workload operators. Just declarative, attribute-aware GPU allocation.
