KubeDojo

GPU and AI/ML Workload Scaling with Karpenter

by KubeDojo · 15 min read

GPU nodes cost anywhere from $3 to more than $30 per hour depending on the accelerator. A p4d.24xlarge with eight A100 GPUs runs about $32/hour on-demand. Every minute that GPU sits idle during an image pull, a consolidation replacement, or a partial scheduling failure is money burned with nothing to show for it.

Karpenter's just-in-time provisioning is a natural fit for GPU workloads. It can launch exactly the right accelerated instance type for a pending pod, skip the ASG guesswork, and tear down expensive nodes the moment they're empty. But the defaults that work for CPU clusters can waste thousands on accelerated hardware. A consolidation policy that recycles "underutilized" nodes ignores GPU occupancy. A drift-triggered AMI update can break CUDA compatibility mid-training. A 10GB container image can leave a GPU node idle for ten minutes before any work starts.

This article covers the Karpenter configuration patterns that prevent these problems: dedicated GPU NodePools, cold start mitigation, disruption protection for training jobs, GPU sharing with time-slicing and MIG, and gang scheduling for distributed workloads.

Dedicated GPU NodePools

The first rule of GPU scaling with Karpenter: separate your GPU capacity into a dedicated NodePool with taints. Without this separation, Karpenter might schedule a lightweight logging pod onto a $30/hour GPU node simply because it has available CPU and memory.

The aws-samples/karpenter-blueprints repository provides a production-ready starting point. The EC2NodeClass configures the underlying AWS infrastructure:

# gpu-nodeclass.yaml (aws-samples/karpenter-blueprints)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu
spec:
  amiSelectorTerms:
  - alias: al2023@latest
  role: "KarpenterNodeRole"
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      iops: 10000
      throughput: 125
      volumeSize: 100Gi
      volumeType: gp3
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster

The 100Gi gp3 volume with 10,000 IOPS matters. GPU container images are large, and the default volume size and throughput can bottleneck image extraction.

The NodePool is where GPU isolation happens:

# gpu-nodepool.yaml (aws-samples/karpenter-blueprints)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  limits:
    cpu: 100
    memory: 100Gi
    nvidia.com/gpu: 5
  template:
    metadata:
      labels:
        nvidia.com/gpu.present: "true"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        name: gpu
        kind: EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]
        - key: karpenter.k8s.aws/instance-gpu-manufacturer
          operator: In
          values: ["nvidia"]
      expireAfter: 720h
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

Three things to note here. First, the nvidia.com/gpu taint ensures only pods that explicitly tolerate GPUs land on these nodes. Second, the nvidia.com/gpu: 5 resource limit caps the total GPU count Karpenter can provision, preventing runaway spend. Third, consolidationPolicy: WhenEmpty avoids consolidating nodes that still have GPU pods running.

Broaden your instance selection

The blueprint above restricts to the g category. For production, widen the selection to include training-class instances and require a minimum generation:

# gpu-nodepool-broad.yaml
requirements:
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["g", "p"]
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values: ["3"]
  - key: karpenter.k8s.aws/instance-gpu-manufacturer
    operator: In
    values: ["nvidia"]

This gives Karpenter a wider menu. If you need a specific GPU, use the karpenter.k8s.aws/instance-gpu-name label in your pod's nodeSelector to target T4, A10G, A100, or H100 instances without constraining the NodePool itself.
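For example, a pod can pin itself to A10G hardware without narrowing the NodePool (a sketch; the pod name and image are placeholders):

```yaml
# inference-a10g-pod.yaml (sketch; name and image are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: inference-a10g
spec:
  nodeSelector:
    karpenter.k8s.aws/instance-gpu-name: a10g
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: inference
    image: my-registry/inference:latest
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
```

Karpenter resolves the label to a matching instance type (a g5 family instance for A10G), so the same NodePool serves T4 dev pods and A100 training pods side by side.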

warning: Some P5 and G6 instances currently report a GPU count of 0 to the DescribeInstanceTypes API. If these end up in a general-purpose NodePool, Karpenter may schedule non-GPU pods onto them. Exclude them explicitly with a NotIn operator until AWS patches the API reporting.
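Until then, a NotIn requirement in the general-purpose NodePool keeps these families out (a sketch; adjust the family list to whatever the API misreports in your region):

```yaml
# general-nodepool-exclusion.yaml (snippet, sketch)
requirements:
  - key: karpenter.k8s.aws/instance-family
    operator: NotIn
    values: ["p5", "g6"]
```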

Solving the Cold Start Problem

AI container images routinely exceed 10GB. A GPU node that takes 5 to 10 minutes pulling an image before any work starts is burning expensive compute on I/O. Three approaches solve this at different layers.

Bottlerocket EBS snapshot pre-seeding

Bottlerocket stores container images on a separate data volume (/dev/xvdb). You can pre-pull your ML images onto an EC2 instance, snapshot that volume, and reference the snapshot in your EC2NodeClass. Every new GPU node boots with images already present.

# gpu-nodeclass-bottlerocket.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-bottlerocket
spec:
  amiSelectorTerms:
  - alias: bottlerocket@latest
  role: "KarpenterNodeRole"
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 10Gi
      volumeType: gp3
  - deviceName: /dev/xvdb
    ebs:
      volumeSize: 100Gi
      volumeType: gp3
      snapshotID: snap-0abc123def456789
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster

This approach requires Bottlerocket as the AMI family. The dual-volume architecture that makes snapshot pre-seeding possible does not exist on AL2023 or Amazon Linux 2.

AWS measured pod startup dropping from 49 seconds to 3 seconds for a 4.93GB image using this approach. For 10GB+ GPU images, the savings are even more dramatic.

The trade-off: you need to rebuild the snapshot whenever your base image changes. Automate this in CI using aws-samples/bottlerocket-images-cache.

Peer-to-peer image distribution

Spegel and Dragonfly let nodes pull image layers from neighboring nodes instead of the registry. When Karpenter scales from one GPU node to ten, the second through tenth nodes pull from the first rather than hammering the registry. No snapshot management required, but the first node still pays the full pull cost.

Lazy loading with Seekable OCI

Seekable OCI and eStargz allow containers to start before the full image downloads. The runtime fetches layers on demand. This works well for inference workloads where the model loads after the container starts, but training jobs that need the entire image at startup see less benefit.

Which to pick: Bottlerocket snapshots for predictable workloads with a known set of images. Lazy loading for inference services with varied models. Peer-to-peer distribution for mixed clusters where snapshot management is too much overhead.

Protecting Long-Running Training Jobs

A model training run can take hours or days. Karpenter's consolidation, drift detection, and node expiration are all designed to keep clusters lean, but they can terminate a 48-hour training job mid-epoch if you don't configure protection.

The do-not-disrupt annotation

The karpenter.sh/do-not-disrupt: "true" annotation on a pod tells Karpenter to skip voluntary disruption for that node. Consolidation and drift won't touch the node while the annotated pod is running.

# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: trainer
        image: my-registry/llm-trainer:v2.1
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: 32Gi
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never

warning: The do-not-disrupt annotation blocks voluntary disruptions only. Spot interruptions are involuntary, and Karpenter cannot prevent them. For training jobs longer than two hours, use On-Demand capacity.

Pin your GPU AMIs

If your EC2NodeClass uses alias: al2023@latest, a new EKS AMI release triggers drift detection. Karpenter recycles your GPU nodes to pick up the new AMI, potentially breaking CUDA driver compatibility. Pin the version:

# gpu-nodeclass-pinned.yaml
amiSelectorTerms:
- alias: al2023@v20240807

Update the pinned version on your schedule, after validating CUDA compatibility in a staging environment.

terminationGracePeriod for checkpointing

When a node must be terminated (Spot interruption, forced expiration), terminationGracePeriod gives your training pods time to save a checkpoint before forceful deletion:

# gpu-nodepool-training.yaml (snippet)
spec:
  template:
    spec:
      terminationGracePeriod: 1800s
      expireAfter: Never

The 1800s (30 minutes) gives most training frameworks enough time to write a checkpoint. Tune this to your specific workload: PyTorch Lightning checkpointing on a large model may need more, while a small fine-tuning job may need less. Setting expireAfter: Never prevents the default 30-day expiration from killing long training runs. Combined with the do-not-disrupt annotation, this configuration keeps training nodes stable until the job completes.

GPU Sharing with Time-Slicing and MIG

Not every GPU workload needs exclusive device access. Development notebooks, small inference models, and CI/CD pipeline tests often use a fraction of a GPU's capacity. Sharing a single GPU across multiple pods improves utilization on expensive hardware.

Time-slicing

The NVIDIA device plugin supports time-slicing through a ConfigMap that controls how many virtual GPUs the scheduler sees:

# nvidia-device-plugin-config.yaml (NVIDIA device plugin)
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 3

With replicas: 3, one physical GPU appears as three nvidia.com/gpu resources to the Kubernetes scheduler. Three pods can each request one GPU and share the physical device through round-robin context switching.

The critical limitation: time-slicing provides no memory isolation. One pod can allocate all available VRAM and crash every other pod sharing the GPU. Use time-slicing for dev/test environments where this risk is acceptable.

Multi-Instance GPU (MIG)

MIG is available on A100 and H100 GPUs. It partitions a single GPU into up to seven isolated instances, each with dedicated memory, compute cores, and cache. Unlike time-slicing, a MIG partition cannot steal memory from its neighbors.

The trade-off is flexibility. MIG partitions must be configured at the node level before pods schedule, and not all partition sizes are available on all GPUs. For production inference requiring isolation, MIG is the right choice. For cost-sensitive development, time-slicing is simpler.
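Once a MIG profile has been applied to the node (the NVIDIA GPU Operator manages this via a node label), each partition surfaces as its own extended resource. A sketch of a pod requesting one partition; the resource name depends on the configured profile, and 1g.5gb assumes an A100 40GB split:

```yaml
# mig-inference-pod.yaml (sketch; name and image are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: inference
    image: my-registry/inference:latest
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
```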

Approach           Isolation        Memory safety   Best for
Exclusive access   Full             Full            Training, large models
Time-slicing       None             None            Dev/test, notebooks
MIG                Hardware-level   Full            Production inference

Gang Scheduling for Distributed Training

Distributed training with PyTorch DDP or Horovod requires all worker pods to start simultaneously. If Karpenter provisions seven out of eight required nodes and the eighth fails due to capacity, seven GPUs sit idle waiting. Without gang scheduling, partial allocation wastes expensive hardware.

Kueue

Kueue is a Kubernetes-native job queueing system. It holds a job's pods until sufficient capacity exists for the entire group, then releases them together. Kueue integrates with Karpenter by creating pending pods that Karpenter provisions nodes for, but only after Kueue has verified the full resource requirement can be met.
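A minimal queue setup might look like the following sketch; the names and the 8-GPU quota are illustrative. Jobs opt in by carrying the kueue.x-k8s.io/queue-name label:

```yaml
# kueue-gpu-queue.yaml (sketch; names and quota are illustrative)
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-on-demand
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: training
spec:
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: gpu-on-demand
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 8   # Kueue admits a job only if its full GPU request fits
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-ml
  namespace: default
spec:
  clusterQueue: training
```

A job requesting eight GPUs stays suspended until all eight fit within the quota, so Karpenter never provisions a partial set of nodes for it.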

NVIDIA KAI Scheduler

The KAI Scheduler takes gang scheduling further with topology-aware placement, hierarchical PodGroups, and Dominant Resource Fairness. It schedules pods onto nodes with optimal GPU interconnect topology, reducing NCCL communication overhead for distributed training. KAI is hosted under the Linux Foundation and is production-ready for large GPU clusters.

EFA for NCCL communication

Distributed training on P4 and P5 instances requires Elastic Fabric Adapter (EFA) for NCCL's low-latency collective operations. A common cause of NCCL timeouts: the EC2NodeClass security group doesn't include a self-referential rule allowing all ports and protocols from itself. Without this rule, NCCL traffic between GPU nodes is silently blocked, and training hangs with cryptic timeout errors.
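The fix, expressed as a CloudFormation sketch (the logical IDs are placeholders; the same rules can be added in the console or with the AWS CLI). EFA requires both a self-referencing inbound and outbound rule:

```yaml
# self-referencing rules for NCCL/EFA traffic (CloudFormation sketch)
NodeSecurityGroupSelfIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref NodeSecurityGroup          # the SG matched by securityGroupSelectorTerms
    IpProtocol: "-1"                         # all protocols, all ports
    SourceSecurityGroupId: !Ref NodeSecurityGroup
NodeSecurityGroupSelfEgress:
  Type: AWS::EC2::SecurityGroupEgress
  Properties:
    GroupId: !Ref NodeSecurityGroup
    IpProtocol: "-1"
    DestinationSecurityGroupId: !Ref NodeSecurityGroup
```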

Prefix assignment for pod density

GPU instances often have a low pod limit due to ENI constraints. If you're time-slicing and want to run 20 pods on one node, you'll hit the IP limit. Enable VPC CNI prefix assignment mode to increase the number of IPs per ENI, ensuring pod density is limited by GPU capacity rather than networking.
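Enabling it comes down to one environment variable on the VPC CNI's aws-node DaemonSet (shown here as the relevant container fragment; it can also be set through the EKS add-on configuration). Note that the node's max-pods limit typically needs raising as well, for example via the kubelet block in the EC2NodeClass:

```yaml
# aws-node DaemonSet container env (fragment)
env:
- name: ENABLE_PREFIX_DELEGATION
  value: "true"
```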

Gotchas

P5/G6 GPU count reporting. Some newer instances report zero GPUs to the DescribeInstanceTypes API. If these instances are in a general-purpose NodePool, Karpenter schedules non-GPU pods onto them. Exclude these families with a NotIn requirement until the API is fixed.

Drift-triggered CUDA breakage. Using alias: al2023@latest in your EC2NodeClass means every new EKS AMI release triggers drift, recycling your GPU fleet. Pin your AMI version and update on your own schedule after testing CUDA compatibility.

Time-slicing OOM cascades. Time-slicing advertises virtual GPUs to the scheduler, but there's no memory isolation. One pod exhausting VRAM takes down every pod on the shared GPU. Use MIG for workloads that can't tolerate this risk.

Missing EFA security group rule. Forgetting the self-referential security group rule for EFA causes NCCL timeouts that look like training code hangs. The fix is a single security group rule allowing all traffic from itself.

WhenEmptyOrUnderutilized on GPU NodePools. This consolidation policy evaluates utilization by CPU and memory. A node running one GPU pod at 5% CPU utilization looks "underutilized" to Karpenter, even though the GPU is fully occupied. Use WhenEmpty for GPU NodePools.

Default expireAfter mid-training. The default expireAfter of 720 hours (30 days) triggers node replacement. For training NodePools, set expireAfter: Never and rely on the do-not-disrupt annotation or terminationGracePeriod to manage node lifecycle.

Wrap-up

GPU hardware is too expensive for CPU-cluster defaults. Every Karpenter setting, from consolidation policy to AMI selection to node expiration, needs re-evaluation when accelerated instances enter the picture.

Start with a dedicated GPU NodePool and the WhenEmpty consolidation policy. Add Bottlerocket snapshots once your image set stabilizes. Layer in do-not-disrupt annotations as training jobs move to production. Each layer addresses a specific failure mode, and you can adopt them incrementally as your GPU footprint grows. The configurations here, drawn from the Karpenter blueprints and AWS best practices, give you the foundation to scale GPU workloads without paying for idle accelerators.
