KubeDojo

Disruption, Drift, and Consolidation — Karpenter's Node Lifecycle

by KubeDojo · 16 min read

You updated an AMI selector on your EC2NodeClass, pushed the change, and twenty minutes later half your nodes are cycling. Or you scaled a Deployment from 20 replicas down to 5 and nothing happened for hours. Both behaviors are Karpenter working as designed. The mechanics behind each are different.

Karpenter's disruption controller is the engine that keeps your nodes current and right-sized. It runs two automated graceful methods in a fixed order: Drift first, then Consolidation. Without understanding its control flow, you're guessing at why nodes cycle or don't.

This article covers the disruption controller's control flow, drift detection, consolidation mechanics, the budget and annotation system that rate-limits both, and the forceful methods (expiration, interruption) that operate outside those limits.

The Disruption Controller Loop

The disruption controller runs continuously, executing one method at a time in strict order: Drift, then Consolidation. Each pass follows the same six-step flow:

  1. Identify candidates for the current method. Nodes with pods that cannot be evicted are skipped.
  2. Check disruption budgets. If the NodePool's budget is exhausted, stop.
  3. Simulate scheduling to determine whether replacement nodes are needed.
  4. Taint the candidate with karpenter.sh/disrupted:NoSchedule to prevent new pods from landing.
  5. Pre-spin replacements and wait for them to become ready.
  6. Delete the original node and let the Termination Controller handle graceful shutdown.

If a replacement node fails to initialize, Karpenter un-taints the original and restarts from step 1 with the first disruption method. This is the safety valve that prevents consolidation from stranding pods on a failed replacement.

The Termination Controller handles the actual shutdown. Karpenter sets a finalizer on every node and NodeClaim it provisions. When a node is deleted, the finalizer blocks removal while the controller:

  1. Taints the node with karpenter.sh/disrupted:NoSchedule
  2. Evicts pods via the Kubernetes Eviction API, respecting PDBs
  3. Waits for all VolumeAttachments to detach
  4. Terminates the underlying cloud instance
  5. Removes the finalizer

Static pods, pods tolerating the disruption taint, and pods in terminal phase (Succeeded/Failed) are ignored during eviction.
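You can see both mechanisms directly on a draining node's manifest. A sketch of the relevant fields, assuming illustrative node and instance names (the finalizer name shown, karpenter.sh/termination, is the one current Karpenter releases use; verify against your version):

# v1 Node during termination (sketch; names illustrative)
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal
  finalizers:
    - karpenter.sh/termination    # blocks object deletion until drain completes
spec:
  taints:
    - key: karpenter.sh/disrupted # step 1: no new pods land here
      effect: NoSchedule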

Drift Detection

Drift answers a simple question: does the running node still match what the NodePool and EC2NodeClass say it should be? Karpenter annotates both CRDs with a hash of the NodeClaimTemplateSpec. When the hash changes, existing NodeClaims are marked Drifted.
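The hash lands as annotations on the NodePool itself. A sketch, assuming the annotation names used in recent v1 releases (the hash value is illustrative):

# karpenter.sh/v1 NodePool with drift-hash annotations (sketch)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
  annotations:
    karpenter.sh/nodepool-hash: "12393960163388511505"  # hash of NodeClaimTemplateSpec
    karpenter.sh/nodepool-hash-version: "v3"            # hash algorithm version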

What triggers drift

Changes to these fields cause drift:

NodePool: spec.template.spec.requirements (instance types, zones, architectures, capacity types)

EC2NodeClass: spec.subnetSelectorTerms, spec.securityGroupSelectorTerms, spec.amiSelectorTerms

There are also special cases where drift occurs without a CRD change. The most common: a new AMI is published that matches your existing amiSelectorTerms. Karpenter discovers it, compares it to what the NodeClaim is running, and marks the NodeClaim as drifted. This is how automated AMI rollouts work with Karpenter.

Here's a concrete example. You change the AMI selector on your EC2NodeClass to pin a new AMI family:

# karpenter.k8s.aws/v1 EC2NodeClass (amiSelectorTerms change triggers drift)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        environment: prod
    # Adding or changing these tags causes all NodeClaims
    # using this EC2NodeClass to be marked Drifted

Every NodeClaim referencing this EC2NodeClass gets the Drifted status condition. Karpenter then replaces them through the standard disruption flow, rate-limited by your budgets.
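You can confirm this with kubectl get nodeclaim <name> -o yaml. A sketch of the status condition to look for (timestamp illustrative):

# NodeClaim status after drift detection (sketch)
status:
  conditions:
    - type: Drifted
      status: "True"
      lastTransitionTime: "2024-01-01T00:00:00Z"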

Conversely, not all CRD changes cause drift. Relaxing a requirement (adding m5.2xlarge to an existing list of [m5.large, m5.xlarge]) does not drift nodes that are already running a compatible instance type.

What does NOT trigger drift

Behavioral fields are explicitly excluded:

  • spec.weight
  • spec.limits
  • spec.disruption.* (consolidation policy, budgets, consolidateAfter)

These control Karpenter's behavior, not the desired state of the node. Changing your consolidation policy from WhenEmpty to WhenEmptyOrUnderutilized takes effect immediately without cycling any nodes.

Similarly, changing spec.template.spec.expireAfter on the NodePool does not update existing NodeClaims directly. It induces drift, and the replacement NodeClaims get the new value. This is a subtle but important distinction: you're getting a rolling update through drift, not a direct field change.

Consolidation Mechanics

Consolidation is Karpenter's cost optimization engine. It identifies nodes that are empty or underutilized and either removes them or replaces them with cheaper alternatives.

Two policies

WhenEmpty: Only consolidates nodes with zero non-daemon workload pods. Conservative, safe for workloads that are expensive to reschedule.

WhenEmptyOrUnderutilized: Actively right-sizes the cluster. Karpenter evaluates whether pods on a node could fit on other existing nodes or on a cheaper replacement. This is where real cost savings happen.

The consolidateAfter field controls how long Karpenter waits after a pod schedules or is removed before evaluating the node. Setting it to 1m means Karpenter won't react to brief scaling bursts. Setting it to Never disables consolidation entirely.

# karpenter.sh/v1 NodePool (disruption block)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

Three mechanisms, in order

Karpenter tries consolidation in a specific sequence:

  1. Empty node deletion: Remove nodes with no workload pods. Multiple empty nodes are deleted in parallel.
  2. Multi-node consolidation: Try to delete two or more nodes simultaneously, optionally launching a single cheaper replacement. Karpenter uses a heuristic to find likely candidates since evaluating all combinations is impractical.
  3. Single-node consolidation: Evaluate each node individually. Delete it if pods fit elsewhere, or replace it if a cheaper instance type works.

When multiple candidates qualify, Karpenter prefers to consolidate the node that disrupts workloads the least: nodes running fewer pods first, then nodes expiring soon, then nodes with lower priority pods.

Spot-to-Spot consolidation

By default, Spot nodes only get deletion consolidation (removing empty Spot nodes). Replacement consolidation for Spot requires the SpotToSpotConsolidation feature gate.

The challenge with Spot replacement is avoiding a "race to the bottom" where Karpenter repeatedly replaces instances with ever-cheaper but less available options. Karpenter solves this by requiring a minimum of 15 instance types cheaper than the current candidate for single-node Spot-to-Spot consolidation. It then passes those types to the EC2 Fleet API using the price-capacity-optimized allocation strategy, which balances price against interruption likelihood.

Multi-node Spot-to-Spot consolidation (many nodes to one) does not have this 15-type requirement, since collapsing multiple nodes into one doesn't create the same price-chasing dynamic.

If your NodePool doesn't have enough instance type flexibility, Karpenter logs an Unconsolidatable event:

Events:
  Type     Reason                Age    From        Message
  ----     ------                ----   ----        -------
  Normal   Unconsolidatable      31s    karpenter   SpotToSpotConsolidation requires 15 cheaper
                                                    instance type options than the current
                                                    candidate to consolidate, got 1
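One way to satisfy the requirement is to express requirements by category and generation rather than pinning explicit instance types, which leaves Karpenter a wide pool of cheaper Spot options. A sketch, assuming a hypothetical NodePool name (the label keys are the standard well-known Karpenter labels):

# karpenter.sh/v1 NodePool with broad Spot flexibility (sketch)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-flexible
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]   # compute, general purpose, memory optimized
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]             # generation 5 and newer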

Diagnosing unconsolidatable nodes

Karpenter periodically emits events explaining why nodes aren't being consolidated. Common reasons:

Events:
  Type     Reason                Age               From        Message
  ----     ------                ----              ----        -------
  Normal   Unconsolidatable      66s               karpenter   pdb default/inflate-pdb prevents
                                                               pod evictions
  Normal   Unconsolidatable      33s (x3 over 30m) karpenter   can't replace with a lower-priced node

For comparison, a healthy consolidation replacement looks like this:

Events:
  Type     Reason              Age    From        Message
  ----     ------              ----   ----        -------
  Normal   DisruptionReplacing 2m     karpenter   Replacing node with a node from types
                                                   m5.xlarge, c5.xlarge, m6i.xlarge, c6i.xlarge
                                                   and 10 other(s)

These events are your primary diagnostic tool. If you expect a node to consolidate but it doesn't, check kubectl describe node <name> for Unconsolidatable events.

Disruption Budgets and Controls

Disruption budgets rate-limit how fast Karpenter can cycle nodes through voluntary methods (drift, consolidation, empty node removal). They do not apply to forceful methods like expiration.

Budget configuration

Budgets are set on the NodePool under spec.disruption.budgets. Each budget can specify:

  • nodes: An absolute number ("5") or percentage ("20%") of nodes that may be disrupted simultaneously
  • reasons: The disruption methods this budget applies to (Drifted, Underutilized, Empty)
  • schedule and duration: A cron window during which this budget is active

When no reasons are specified, the budget applies to all methods. When multiple budgets are active, Karpenter takes the minimum (most restrictive).

# karpenter.sh/v1 NodePool (budget examples)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
    - nodes: "20%"
      reasons:
      - "Empty"
      - "Drifted"
    - nodes: "5"
    - nodes: "0"
      schedule: "@daily"
      duration: 10m
      reasons:
      - "Underutilized"

This configuration allows up to 20% of nodes to be disrupted for drift or emptiness, caps total simultaneous disruptions at 5, and blocks consolidation of underutilized nodes for the first 10 minutes of each day (UTC).

The budget calculation: allowed = roundup(total * percentage) - currently_deleting - notready. With 19 nodes and a 20% budget, that's roundup(19 * 0.2) = 4 allowed disruptions.

Pod-level controls

PodDisruptionBudgets block eviction at the Kubernetes API level. Every PDB on every pod on a node must simultaneously permit eviction for Karpenter to voluntarily disrupt that node. A single blocking PDB on any pod stops the entire node from being considered.

The complexity compounds with multiple PDBs:

  • Pod A matches PDB-1 (maxUnavailable: 0) and PDB-2 (maxUnavailable: 1): blocked by PDB-1
  • Pod B matches PDB-3 (minAvailable: 100%): blocked
  • Pod C has no PDB: fine, but irrelevant since A and B block the node
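A PDB shaped like the blocking PDB-1 above might look like this (name and selector are illustrative):

# policy/v1 PDB that blocks all voluntary eviction for its pods (sketch)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-1
spec:
  maxUnavailable: 0        # zero disruption allowed: eviction always denied
  selector:
    matchLabels:
      app: pod-a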

The karpenter.sh/do-not-disrupt: "true" annotation works like a permanent blocking PDB for a single pod. Apply it to a pod, Deployment template, or node to prevent voluntary disruption entirely.

# do-not-disrupt annotation on a Deployment
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"

Warning: Pods with do-not-disrupt are not evicted by the Termination Controller either. The node stays in draining state until the pod completes, enters a terminal phase, or terminationGracePeriod elapses.

terminationGracePeriod

Set on the NodePool at spec.template.spec.terminationGracePeriod, this field caps the maximum drain time. Once it elapses, remaining pods are forcibly deleted and the instance terminates.

The critical override: when terminationGracePeriod is configured, drift can proceed even on nodes with blocking PDBs or do-not-disrupt pods. This ensures that critical updates (AMI patches for CVEs, for example) cannot be blocked indefinitely by misconfigured applications.

Maximum node lifetime = expireAfter + terminationGracePeriod. A node with expireAfter: 720h and terminationGracePeriod: 48h has an absolute maximum lifetime of 768 hours (32 days).
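Both fields live side by side in the NodeClaim template. A sketch of the 720h + 48h example above:

# karpenter.sh/v1 NodePool bounding maximum node lifetime (sketch)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      expireAfter: 720h            # drain begins after 30 days
      terminationGracePeriod: 48h  # drain is forced to complete within 48h
      # absolute maximum lifetime: 720h + 48h = 768h (32 days)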

Forceful Methods: Expiration and Interruption

Graceful methods respect budgets and give you control over pace. Forceful methods don't wait.

Expiration

expireAfter sets the maximum node lifetime. The default is 720h (30 days). When a node exceeds this duration, Karpenter begins draining it immediately. Expiration is not rate-limited by disruption budgets.

Setting expireAfter to Never disables expiration entirely. This is appropriate for static workloads that never need node rotation, but most clusters benefit from regular cycling to pick up OS patches and reduce the risk of long-running node issues (memory leaks in system processes, file fragmentation).

Interruption

When interruption handling is enabled via the --interruption-queue CLI argument, Karpenter watches an SQS queue for involuntary disruption events:

  • Spot interruption warnings: 2-minute notice before EC2 reclaims the instance
  • Scheduled maintenance events: Health events from AWS
  • Instance stopping/terminating events: Direct EC2 lifecycle events

On receiving a Spot interruption warning, Karpenter immediately provisions a replacement node while draining the interrupted node in parallel. Typical Karpenter node startup is well under two minutes, so the replacement is usually ready before the window expires.

The SQS queue is populated by EventBridge rules that forward interruption events from AWS services. The Karpenter Getting Started CloudFormation template provisions this infrastructure.
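If you deploy Karpenter with Helm, the flag is commonly set through chart values. A sketch, assuming the settings.interruptionQueue key from the Karpenter Helm chart (the queue name is an example):

# Helm values enabling interruption handling (sketch)
settings:
  interruptionQueue: my-cluster-karpenter-queue  # maps to --interruption-queue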

Node Auto Repair

Node Auto Repair (alpha in v1.1.0) automatically replaces unhealthy nodes based on status conditions. When a node's Ready condition is False or Unknown for 30 minutes, or an accelerated hardware condition (AcceleratedHardwareReady, StorageReady, NetworkingReady, KernelReady, ContainerRuntimeReady) fails for its toleration duration, Karpenter forcefully terminates and replaces the node.

Safety mechanism: Karpenter will not repair if more than 20% of nodes in a NodePool are unhealthy, preventing cascading failures from turning a partial outage into a full one.

Enable it with the NodeRepair=true feature gate and deploy a node monitoring agent (or Node Problem Detector) to surface the required status conditions.

Gotchas

Changing expireAfter on the NodePool induces drift, not a direct update. Existing NodeClaims keep their original expireAfter value. Karpenter detects the mismatch as drift and replaces nodes through the graceful drift path, which IS rate-limited by disruption budgets. The replacement NodeClaims get the new expireAfter. If you have strict budgets, this rolling update could take hours.

WhenEmptyOrUnderutilized without consolidateAfter or budgets causes churn. Karpenter evaluates consolidation as soon as a pod is removed. During a rolling update, this can trigger rapid node replacement before the new pods schedule. Set consolidateAfter: 1m and pair it with disruption budgets.

A single blocking PDB blocks the entire node. If Pod A has a PDB with maxUnavailable: 0, no other pod on that node can be voluntarily evicted, even if those pods have permissive PDBs or none at all. Audit your PDBs before wondering why consolidation isn't working.

do-not-disrupt persists through draining. The termination controller won't evict annotated pods. The node sits in draining state until the pod finishes, enters a terminal phase, or terminationGracePeriod kicks in and forces deletion.

Spot-to-Spot consolidation requires 15+ cheaper instance types. If your NodePool constrains to a narrow set of instance families, single-node Spot consolidation silently does nothing. Check Unconsolidatable events. The fix is broader instance type flexibility in your NodePool requirements.

Behavioral fields don't trigger drift. Changing spec.disruption.consolidationPolicy, spec.limits, or spec.weight takes effect immediately on the controller's next loop. No nodes are cycled. This is by design, but it means you can't use drift to gradually roll out a new consolidation strategy.

Wrap-up

The disruption controller loop is deterministic: drift first, consolidation second, budgets enforced at every step.

Start by auditing your PDBs. A single maxUnavailable: 0 on a forgotten PDB silently blocks consolidation across every node it touches. Then set consolidateAfter to match your deployment cadence (1-2 minutes for most clusters) and add reason-specific budgets to separate drift velocity from consolidation velocity. Use terminationGracePeriod as the safety net for critical AMI rollouts that can't be blocked by misconfigured applications. When consolidation isn't happening, check Unconsolidatable events before changing any configuration.
