GPU and AI/ML Workload Scaling with Karpenter

GPU nodes cost $3 to $30 per hour depending on the accelerator. A p4d.24xlarge with eight A100 GPUs runs about $32/hour on-demand. Every minute that GPU sits idle during an image pull, a consolidation replacement, or a partial scheduling failure is money burned with nothing to show for it.
Karpenter's just-in-time provisioning is a natural fit for GPU workloads. It can launch exactly the right accelerated instance type for a pending pod, skip the ASG guesswork, and tear down expensive nodes the moment they're empty. But the defaults that work for CPU clusters can waste thousands on accelerated hardware. A consolidation policy that recycles "underutilized" nodes ignores GPU occupancy. A drift-triggered AMI update can break CUDA compatibility mid-training. A 10GB container image can leave a GPU node idle for ten minutes before any work starts.
This article covers the Karpenter configuration patterns that prevent these problems: dedicated GPU NodePools, cold start mitigation, disruption protection for training jobs, GPU sharing with time-slicing and MIG, and gang scheduling for distributed workloads.
Dedicated GPU NodePools
The first rule of GPU scaling with Karpenter: separate your GPU capacity into a dedicated NodePool with taints. Without this separation, Karpenter might schedule a lightweight logging pod onto a $30/hour GPU node simply because it has available CPU and memory.
The aws-samples/karpenter-blueprints repository provides a production-ready starting point. The EC2NodeClass configures the underlying AWS infrastructure:
```yaml
# gpu-nodeclass.yaml (aws-samples/karpenter-blueprints)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: "KarpenterNodeRole"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        iops: 10000
        throughput: 125
        volumeSize: 100Gi
        volumeType: gp3
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```
The 100Gi gp3 volume with 10,000 IOPS matters: GPU container images are large, and the default volume size and baseline IOPS can bottleneck image pulls and extraction. Consider raising throughput above the 125 MiB/s gp3 baseline as well, since layer extraction is often throughput-bound.
The NodePool is where GPU isolation happens:
```yaml
# gpu-nodepool.yaml (aws-samples/karpenter-blueprints)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  limits:
    cpu: 100
    memory: 100Gi
    nvidia.com/gpu: 5
  template:
    metadata:
      labels:
        nvidia.com/gpu.present: "true"
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]
        - key: karpenter.k8s.aws/instance-gpu-manufacturer
          operator: In
          values: ["nvidia"]
      expireAfter: 720h
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
```
Three things to note here. First, the nvidia.com/gpu taint ensures only pods that explicitly tolerate GPUs land on these nodes. Second, the nvidia.com/gpu: 5 resource limit caps the total GPU count Karpenter can provision, preventing runaway spend. Third, consolidationPolicy: WhenEmpty avoids consolidating nodes that still have GPU pods running.
Broaden your instance selection
The blueprint above restricts to the g category. For production, widen the selection to include training-class instances and require a minimum generation:
```yaml
# gpu-nodepool-broad.yaml (snippet)
requirements:
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["g", "p"]
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt          # Karpenter supports Gt/Lt, not Gte; Gt 3 means generation >= 4
    values: ["3"]
  - key: karpenter.k8s.aws/instance-gpu-manufacturer
    operator: In
    values: ["nvidia"]
```
This gives Karpenter a wider menu. If you need a specific GPU, use the karpenter.k8s.aws/instance-gpu-name label in your pod's nodeSelector to target T4, A10G, A100, or H100 instances without constraining the NodePool itself.
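For example, a pod can pin itself to A100-backed capacity without narrowing the NodePool. A minimal sketch — the image name is a placeholder, and the label value should be verified against what Karpenter reports on your nodes:

```yaml
# pod targeting a specific GPU model via nodeSelector (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: inference-a100
spec:
  nodeSelector:
    karpenter.k8s.aws/instance-gpu-name: a100
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: model-server
      image: my-registry/model-server:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```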
warning: Some P5 and G6 instances currently report a GPU count of 0 to the DescribeInstanceTypes API. If these end up in a general-purpose NodePool, Karpenter may schedule non-GPU pods onto them. Exclude them explicitly with a NotIn operator until AWS patches the API reporting.
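One way to express that exclusion, sketched with illustrative family values — check DescribeInstanceTypes output to confirm which families misreport in your region:

```yaml
# exclude instance families with broken GPU count reporting (snippet)
requirements:
  - key: karpenter.k8s.aws/instance-family
    operator: NotIn
    values: ["p5", "g6"]  # illustrative; verify the affected families
```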
Solving the Cold Start Problem
AI container images routinely exceed 10GB. A GPU node that takes 5 to 10 minutes pulling an image before any work starts is burning expensive compute on I/O. Three approaches solve this at different layers.
Bottlerocket EBS snapshot pre-seeding
Bottlerocket stores container images on a separate data volume (/dev/xvdb). You can pre-pull your ML images onto an EC2 instance, snapshot that volume, and reference the snapshot in your EC2NodeClass. Every new GPU node boots with images already present.
```yaml
# gpu-nodeclass-bottlerocket.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-bottlerocket
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest
  role: "KarpenterNodeRole"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 10Gi
        volumeType: gp3
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        snapshotID: snap-0abc123def456789
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```
This approach requires Bottlerocket as the AMI family. The dual-volume architecture that makes snapshot pre-seeding possible does not exist on AL2023 or Amazon Linux 2.
AWS measured pod startup dropping from 49 seconds to 3 seconds for a 4.93GB image using this approach. For 10GB+ GPU images, the savings are even more dramatic.
The trade-off: you need to rebuild the snapshot whenever your base image changes. Automate this in CI using aws-samples/bottlerocket-images-cache.
Peer-to-peer image distribution
Spegel and Dragonfly let nodes pull image layers from neighboring nodes instead of the registry. When Karpenter scales from one GPU node to ten, the second through tenth nodes pull from the first rather than hammering the registry. No snapshot management required, but the first node still pays the full pull cost.
Lazy loading with Seekable OCI
Seekable OCI and eStargz allow containers to start before the full image downloads. The runtime fetches layers on demand. This works well for inference workloads where the model loads after the container starts, but training jobs that need the entire image at startup see less benefit.
Which to pick: Bottlerocket snapshots for predictable workloads with a known set of images. Lazy loading for inference services with varied models. Peer-to-peer distribution for mixed clusters where snapshot management is too much overhead.
Protecting Long-Running Training Jobs
A model training run can take hours or days. Karpenter's consolidation, drift detection, and node expiration are all designed to keep clusters lean, but they can terminate a 48-hour training job mid-epoch if you don't configure protection.
The do-not-disrupt annotation
The karpenter.sh/do-not-disrupt: "true" annotation on a pod tells Karpenter to skip voluntary disruption for that node. Consolidation and drift won't touch the node while the annotated pod is running.
```yaml
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: trainer
          image: my-registry/llm-trainer:v2.1
          resources:
            requests:
              nvidia.com/gpu: 1
              memory: 32Gi
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never
```
warning: The do-not-disrupt annotation blocks voluntary disruptions only. Spot interruptions are involuntary, and Karpenter cannot prevent them. For training jobs longer than two hours, use On-Demand capacity.
Pin your GPU AMIs
If your EC2NodeClass uses alias: al2023@latest, a new EKS AMI release triggers drift detection. Karpenter recycles your GPU nodes to pick up the new AMI, potentially breaking CUDA driver compatibility. Pin the version:
```yaml
# gpu-nodeclass-pinned.yaml (snippet)
amiSelectorTerms:
  - alias: al2023@v20240807
```
Update the pinned version on your schedule, after validating CUDA compatibility in a staging environment.
terminationGracePeriod for checkpointing
When a node must be terminated (Spot interruption, forced expiration), terminationGracePeriod gives your training pods time to save a checkpoint before forceful deletion:
```yaml
# gpu-nodepool-training.yaml (snippet)
spec:
  template:
    spec:
      terminationGracePeriod: 1800s
      expireAfter: Never
```
The 1800s (30 minutes) gives most training frameworks enough time to write a checkpoint. Tune this to your specific workload: PyTorch Lightning checkpointing on a large model may need more, while a small fine-tuning job may need less. Setting expireAfter: Never prevents the default 30-day expiration from killing long training runs. Combined with the do-not-disrupt annotation, this configuration keeps training nodes stable until the job completes.
GPU Sharing with Time-Slicing and MIG
Not every GPU workload needs exclusive device access. Development notebooks, small inference models, and CI/CD pipeline tests often use a fraction of a GPU's capacity. Sharing a single GPU across multiple pods improves utilization on expensive hardware.
Time-slicing
The NVIDIA device plugin supports time-slicing through a ConfigMap that controls how many virtual GPUs the scheduler sees:
```yaml
# nvidia-device-plugin-config.yaml (NVIDIA device plugin)
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 3
```
With replicas: 3, one physical GPU appears as three nvidia.com/gpu resources to the Kubernetes scheduler. Three pods can each request one GPU and share the physical device through round-robin context switching.
The critical limitation: time-slicing provides no memory isolation. One pod can allocate all available VRAM and crash every other pod sharing the GPU. Use time-slicing for dev/test environments where this risk is acceptable.
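Sharing is transparent to workloads: each pod requests a whole nvidia.com/gpu as usual, and with replicas: 3 up to three such pods can land on one physical device. A minimal sketch (the image name is a placeholder):

```yaml
# notebook pod sharing a time-sliced GPU (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: dev-notebook
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: notebook
      image: my-registry/jupyter-gpu:latest
      resources:
        limits:
          nvidia.com/gpu: 1  # one of three time-sliced replicas on the device
```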
Multi-Instance GPU (MIG)
MIG is available on A100 and H100 GPUs. It partitions a single GPU into up to seven isolated instances, each with dedicated memory, compute cores, and cache. Unlike time-slicing, a MIG partition cannot steal memory from its neighbors.
The trade-off is flexibility. MIG partitions must be configured at the node level before pods schedule, and not all partition sizes are available on all GPUs. For production inference requiring isolation, MIG is the right choice. For cost-sensitive development, time-slicing is simpler.
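When the device plugin runs with a MIG strategy that exposes partitions as their own extended resources, pods request a slice by name rather than a whole GPU. A sketch, assuming a mixed MIG strategy and a 1g.5gb profile configured on an A100 node:

```yaml
# pod requesting a MIG partition (assumes migStrategy: mixed and a 1g.5gb profile)
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: server
      image: my-registry/inference:latest
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1  # isolated slice with dedicated memory and compute
```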
| Approach | Isolation | Memory safety | Best for |
|---|---|---|---|
| Exclusive access | Full | Full | Training, large models |
| Time-slicing | None | None | Dev/test, notebooks |
| MIG | Hardware-level | Full | Production inference |
Gang Scheduling for Distributed Training
Distributed training with PyTorch DDP or Horovod requires all worker pods to start simultaneously. If Karpenter provisions seven out of eight required nodes and the eighth fails due to capacity, seven GPUs sit idle waiting. Without gang scheduling, partial allocation wastes expensive hardware.
Kueue
Kueue is a Kubernetes-native job queueing system. It holds a job's pods until sufficient capacity exists for the entire group, then releases them together. Kueue integrates with Karpenter by creating pending pods that Karpenter provisions nodes for, but only after Kueue has verified the full resource requirement can be met.
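A minimal sketch of gating a distributed job behind Kueue, assuming a LocalQueue named gpu-queue already exists and maps to a ClusterQueue with GPU quota:

```yaml
# distributed training job admitted as a unit by Kueue (queue name is an assumption)
apiVersion: batch/v1
kind: Job
metadata:
  name: ddp-train
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue
spec:
  suspend: true        # Kueue unsuspends the job only when all eight pods fit
  parallelism: 8
  completions: 8
  template:
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: worker
          image: my-registry/ddp-trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never
```

The job is created suspended; Kueue flips suspend to false once quota for the whole group is available, and only then does Karpenter see eight pending pods to provision for.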
NVIDIA KAI Scheduler
The KAI Scheduler takes gang scheduling further with topology-aware placement, hierarchical PodGroups, and Dominant Resource Fairness. It schedules pods onto nodes with optimal GPU interconnect topology, reducing NCCL communication overhead for distributed training. KAI is hosted under the Linux Foundation and is production-ready for large GPU clusters.
EFA for NCCL communication
Distributed training on P4 and P5 instances requires Elastic Fabric Adapter (EFA) for NCCL's low-latency collective operations. A common cause of NCCL timeouts: the EC2NodeClass security group doesn't include a self-referential rule allowing all ports and protocols from itself. Without this rule, NCCL traffic between GPU nodes is silently blocked, and training hangs with cryptic timeout errors.
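On the Kubernetes side, pods opt into EFA by requesting the device-plugin resource. A fragment, assuming the aws-efa-k8s-device-plugin is installed; the interface count varies by instance type:

```yaml
# training container resources on a p4d node (snippet; counts are illustrative)
resources:
  limits:
    nvidia.com/gpu: 8
    vpc.amazonaws.com/efa: 4  # p4d.24xlarge exposes four EFA interfaces
```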
Prefix assignment for pod density
GPU instances often have a low pod limit due to ENI constraints. If you're time-slicing and want to run 20 pods on one node, you'll hit the IP limit. Enable VPC CNI prefix assignment mode to increase the number of IPs per ENI, ensuring pod density is limited by GPU capacity rather than networking.
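Prefix assignment is an environment toggle on the VPC CNI's aws-node DaemonSet. The fragment below shows the relevant setting; apply it through whatever manages your CNI (EKS add-on configuration, Helm values) rather than hand-editing the DaemonSet:

```yaml
# aws-node DaemonSet container env (snippet)
env:
  - name: ENABLE_PREFIX_DELEGATION
    value: "true"
```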
Gotchas
P5/G6 GPU count reporting. Some newer instances report zero GPUs to the DescribeInstanceTypes API. If these instances are in a general-purpose NodePool, Karpenter schedules non-GPU pods onto them. Exclude these families with a NotIn requirement until the API is fixed.
Drift-triggered CUDA breakage. Using alias: al2023@latest in your EC2NodeClass means every new EKS AMI release triggers drift, recycling your GPU fleet. Pin your AMI version and update on your own schedule after testing CUDA compatibility.
Time-slicing OOM cascades. Time-slicing advertises virtual GPUs to the scheduler, but there's no memory isolation. One pod exhausting VRAM takes down every pod on the shared GPU. Use MIG for workloads that can't tolerate this risk.
Missing EFA security group rule. Forgetting the self-referential security group rule for EFA causes NCCL timeouts that look like training code hangs. The fix is a single security group rule allowing all traffic from itself.
WhenEmptyOrUnderutilized on GPU NodePools. This consolidation policy evaluates utilization by CPU and memory. A node running one GPU pod at 5% CPU utilization looks "underutilized" to Karpenter, even though the GPU is fully occupied. Use WhenEmpty for GPU NodePools.
Default expireAfter mid-training. The default expireAfter of 720 hours (30 days) triggers node replacement. For training NodePools, set expireAfter: Never and rely on the do-not-disrupt annotation or terminationGracePeriod to manage node lifecycle.
Wrap-up
GPU hardware is too expensive for CPU-cluster defaults. Every Karpenter setting, from consolidation policy to AMI selection to node expiration, needs re-evaluation when accelerated instances enter the picture.
Start with a dedicated GPU NodePool and the WhenEmpty consolidation policy. Add Bottlerocket snapshots once your image set stabilizes. Layer in do-not-disrupt annotations as training jobs move to production. Each layer addresses a specific failure mode, and you can adopt them incrementally as your GPU footprint grows. The configurations here, drawn from the Karpenter blueprints and AWS best practices, give you the foundation to scale GPU workloads without paying for idle accelerators.