KubeDojo

Observability and Troubleshooting for Karpenter

by KubeDojo · 18 min read
Monday morning, 07:43 UTC. Karpenter's logs show five NodeClaims launched in the last ten minutes. The EC2 console confirms five running instances. But kubectl get nodes shows zero new nodes. Pods sit pending. The controller says everything is fine. Without the right metrics, you have no idea where the process broke.

This is a real scenario. It happened to PerfectScale when a buggy Ubuntu EKS AMI shipped with a broken kubelet bootstrap script. Karpenter did its job, the cloud provider did its job, but the nodes never joined the cluster. The only signal was a divergence between two Prometheus counters that nobody was watching.

Every production Karpenter deployment needs monitoring. This article covers how to scrape Karpenter's Prometheus metrics correctly, which signals matter for each operational concern, how to build dashboards and alerts that catch problems early, and a troubleshooting decision tree for the most common failures.

The Metrics Endpoint and Scraping Setup

Karpenter exposes Prometheus-format metrics at :8080/metrics, configurable via the METRICS_PORT environment variable. The health probe lives on a separate port, 8081, serving Kubernetes liveness checks. This two-port split is a common source of misconfiguration: scraping port 8081 returns a 200 OK but no metric data.

The simplest way to start scraping is through the Helm chart's built-in ServiceMonitor. Enable it during installation or upgrade:

# karpenter helm install (excerpt)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set serviceMonitor.enabled=true \
  --wait

If you need a standalone ServiceMonitor (e.g., you installed Karpenter without the Helm chart's monitoring support), create one manually:

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - karpenter
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - path: /metrics
      port: http-metrics
      interval: 60s

Verify Prometheus is scraping Karpenter by checking the targets page or running:

# Verify scrape target is up
up{job="karpenter"}

A value of 1 confirms the scrape is succeeding. If the target does not appear, check that your ServiceMonitor labels match the Prometheus Operator's serviceMonitorSelector.

warning: If you are using custom labels on your Prometheus Operator to discover ServiceMonitors, add them to additionalLabels in the Helm chart values. A missing label is the second most common reason metrics never appear in Prometheus, right after scraping the wrong port.

The HA scraping problem

Karpenter runs two replicas by default, but only the leader pod actively handles provisioning and disruption. The standby pod sits idle for failover. When you scrape through a Service, Prometheus can hit either pod. The standby reports zero or stale values for most operational metrics, which skews aggregations and makes dashboards unreliable.

Two approaches handle this:

  1. Accept and aggregate. Use sum() or max() by instance in your PromQL queries. Since only the leader increments counters, the standby contributes zeros that don't affect sums. This is the pragmatic default.
  2. Target the leader only. Use a PodMonitor with a label selector that matches the leader election label. This is cleaner but couples your scraping to Karpenter's internal leader election mechanism.

Use sum() and move on. It handles the standby pod's zeros correctly, does not couple your scraping to Karpenter's leader election internals, and works without changes when you scale replicas.
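For example, a nodepool-level launch rate that stays correct no matter which replica Prometheus happened to scrape (the standby's zero-valued series simply drop out of the sum):

```promql
# PromQL: NodeClaim launches per nodepool over the last hour,
# summed across all Karpenter replicas
sum by (nodepool) (increase(karpenter_nodeclaims_created_total[1h]))
```

The same pattern applies to any of the counters discussed below: wrap the series in rate() or increase(), then sum across instances.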

Key Metrics to Track

Karpenter exposes over 60 metrics across nine subsystems. Rather than walking through them alphabetically, here are the ones that matter, organized by the operational question they answer.

Provisioning health

The most important operational signal in Karpenter is the delta between nodes launched and nodes registered.

  • karpenter_nodeclaims_created_total: total NodeClaims created, labeled by reason and nodepool
  • karpenter_nodeclaims_terminated_total: total terminated, labeled by nodepool and capacity_type
  • karpenter_nodeclaims_disrupted_total: total disrupted, labeled by reason (consolidation, drift, expiry)
  • karpenter_pods_startup_duration_seconds: time from pod creation to running state

When karpenter_nodeclaims_created_total increases but you don't see corresponding nodes joining the cluster, something is failing between EC2 launch and kubelet registration. This exact scenario hit PerfectScale in production: Karpenter was launching instances and logging success, but a buggy Ubuntu EKS AMI caused kubelet to crash on startup. The nodes never registered, pods stayed pending, and the only signal was the divergence between launched and registered counts.

Disruption activity

# PromQL: disruption decisions in the last hour, broken down by reason
sum by (reason) (increase(karpenter_voluntary_disruption_decisions_total[1h]))

A healthy cluster shows a steady trickle of consolidation decisions, occasional drift replacements, and rare expiry events. Something like {consolidation: 3, drift: 1, expiry: 0} over an hour is normal for a moderately active cluster.

The diagnostic signal is the gap between eligible and decided. karpenter_voluntary_disruption_eligible_nodes shows how many nodes Karpenter wants to disrupt. If this stays high while decisions stay at zero, something is blocking disruption:

# PromQL: detect blocked disruption (eligible nodes but no decisions)
karpenter_voluntary_disruption_eligible_nodes > 5
  and on()
increase(karpenter_voluntary_disruption_decisions_total[1h]) == 0

The most common blockers: PodDisruptionBudgets that prevent eviction, the karpenter.sh/do-not-disrupt: "true" annotation on pods, or scheduling constraints that prevent consolidation candidates from being replaced. Check karpenter_voluntary_disruption_queue_failures_total to see how many disruption attempts failed and why.

Scheduling performance

karpenter_scheduler_scheduling_duration_seconds tracks how long Karpenter's scheduling simulation takes. A sustained increase means the controller is spending more time evaluating options before acting. karpenter_scheduler_queue_depth shows pending pods waiting for scheduling. If queue depth is high but scheduling duration is normal, the bottleneck is likely on the cloud provider side.
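To track scheduling latency over time, a percentile is more robust than an average. A sketch, assuming the metric is exposed as a Prometheus histogram with the standard _bucket series:

```promql
# PromQL: p95 scheduling simulation latency over 5-minute windows
histogram_quantile(0.95,
  sum by (le) (rate(karpenter_scheduler_scheduling_duration_seconds_bucket[5m]))
)
```

Graph this alongside karpenter_scheduler_queue_depth: rising p95 with flat queue depth points at scheduling complexity, while flat p95 with rising queue depth points at the cloud provider.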

NodePool capacity

# PromQL: percentage of NodePool limit consumed
sum by (nodepool, resource_type) (karpenter_nodepools_usage)
/
sum by (nodepool, resource_type) (karpenter_nodepools_limit)
* 100

This is the query behind the KarpenterNodepoolNearCapacity alert. When a NodePool approaches its CPU or memory limit, new pods targeting that pool will go pending regardless of available cloud capacity.

Cloud provider health

karpenter_cloudprovider_errors_total counts API failures, labeled by controller, method, and error. karpenter_cloudprovider_duration_seconds tracks call latency. Together, they surface throttling, quota exhaustion, and region-specific outages.

On AWS, the interruption subsystem adds two additional metrics: karpenter_interruption_received_messages_total tracks SQS events for Spot interruptions, scheduled maintenance, and instance state changes. karpenter_interruption_message_queue_duration_seconds measures how long events sit before Karpenter processes them.
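To see which kinds of interruption events dominate, break the counter down by its type label. A sketch, assuming the AWS provider attaches a message_type label to this counter:

```promql
# PromQL: interruption events per hour, broken down by event type
sum by (message_type) (increase(karpenter_interruption_received_messages_total[1h]))
```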

Grafana Dashboards and Alerting

Community dashboards

The kubernetes-autoscaling-mixin project provides three Jsonnet-generated Grafana dashboards specifically for Karpenter:

  • Overview (22171): NodePool summary, node and pod counts, resource usage vs limits. Breaks down by zone, architecture, instance type, and capacity type.
  • Activity (22172): Scale-up and scale-down events with the reasoning behind each disruption decision. Includes pod phase transitions and startup durations.
  • Performance (22173): Cloud provider errors, node termination duration, pod startup latency, interruption queue depth, work queue metrics, and controller reconciliation rates.

Import any of these by dashboard ID in Grafana. They work with the standard karpenter job label.

An alternative is the PerfectScale dashboard (20398), which takes a different layout approach and works with Karpenter >= 0.33.

PrometheusRule alerts

The kubernetes-autoscaling-mixin includes production-ready alert rules. Here are the five Karpenter alerts from the project, slightly reformatted for clarity:

# prometheus-rules.yaml (from kubernetes-autoscaling-mixin)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter
      rules:
        - alert: KarpenterCloudProviderErrors
          expr: |
            sum(
              increase(
                karpenter_cloudprovider_errors_total{
                  controller!~"nodeclaim.termination|node.termination",
                  error!="NodeClaimNotFoundError"
                }[5m]
              )
            ) by (provider, controller, method) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Cloud provider errors detected"

        - alert: KarpenterNodepoolNearCapacity
          expr: |
            sum by (nodepool, resource_type) (karpenter_nodepools_usage)
            /
            sum by (nodepool, resource_type) (karpenter_nodepools_limit)
            * 100 > 75
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "NodePool {{ $labels.nodepool }} approaching limit"

        - alert: KarpenterNodeClaimsTerminationDurationHigh
          expr: |
            sum(rate(karpenter_nodeclaims_termination_duration_seconds_sum[5m]))
              by (nodepool)
            /
            sum(rate(karpenter_nodeclaims_termination_duration_seconds_count[5m]))
              by (nodepool) > 1200
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Node termination taking over 20 minutes"

        - alert: KarpenterClusterStateNotSynced
          expr: |
            sum(karpenter_cluster_state_synced) == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Karpenter cluster state out of sync"

        - alert: KarpenterSchedulerQueueDepthHigh
          expr: |
            sum(karpenter_scheduler_queue_depth) > 50
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Over 50 pods waiting in scheduler queue"

The exclusion filter on KarpenterCloudProviderErrors is intentional: termination controllers and NodeClaimNotFoundError produce transient errors during normal operation that should not trigger alerts.

The missing alert: nodes not registering

None of the standard alert sets catch the scenario where Karpenter launches instances that never join the cluster. Add this:

# prometheus-rules.yaml (custom)
- alert: KarpenterNodesNotRegistering
  expr: |
    sum by (nodepool) (increase(karpenter_nodeclaims_created_total[15m]))
    -
    sum by (nodepool) (increase(karpenter_nodes_created_total[15m]))
    > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "NodeClaims created but nodes not registering in {{ $labels.nodepool }}"

The increase() wrapper is essential here. Both karpenter_nodeclaims_created_total and karpenter_nodes_created_total are monotonically increasing counters. Subtracting raw counter values would produce a non-zero result even during normal operation. Using increase() over a 15-minute window compares recent activity only: if more NodeClaims were created than nodes in that window, something is preventing registration. This alert would have caught the PerfectScale AMI incident before the team noticed pending pods.

Controller Logs and Debugging

Karpenter emits structured JSON logs with a logger field that groups entries by controller. The three you will read most often:

  • controller.provisioner: pod evaluation and scheduling decisions
  • controller.nodeclaim.lifecycle: node launch, registration, and initialization
  • controller.disruption: consolidation, drift replacement, and expiry
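To read one controller's stream in isolation, filter the JSON logs on the logger field. A sketch, assuming jq is installed and the Helm-default deployment name:

```shell
# Stream Karpenter logs, keeping only disruption-controller entries
kubectl logs -n karpenter deploy/karpenter -f \
  | jq -c 'select(.logger == "controller.disruption")'
```

Swap the logger value to controller.provisioner or controller.nodeclaim.lifecycle to follow the other two streams.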

The provisioning sequence

A healthy node launch produces this log sequence:

{"level":"INFO","time":"2026-03-22T02:24:16.114Z","logger":"controller.provisioner","message":"found provisionable pod(s)","Pods":"default/api-server-7b4f6d-x2k9p","duration":"10.5ms"}
{"level":"INFO","time":"2026-03-22T02:24:19.028Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","NodeClaim":{"name":"default-sfpsl"},"provider-id":"aws:///us-east-1b/i-01234567adb205c7e","instance-type":"c5.2xlarge","zone":"us-east-1b","capacity-type":"spot","allocatable":{"cpu":"8","memory":"16Gi"}}
{"level":"INFO","time":"2026-03-22T02:26:19.028Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","NodeClaim":{"name":"default-sfpsl"},"Node":{"name":"ip-10-0-12-34.us-east-1.compute.internal"}}
{"level":"INFO","time":"2026-03-22T02:26:52.642Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","NodeClaim":{"name":"default-sfpsl"},"Node":{"name":"ip-10-0-12-34.us-east-1.compute.internal"}}

If you see "launched nodeclaim" but never "registered nodeclaim," the instance started but kubelet failed to join the cluster. Check the instance directly via SSM.

Enabling debug logging

# Via Helm
$ helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --set logLevel=debug --reuse-values

# Or patch the deployment directly
$ kubectl set env deployment/karpenter -n karpenter LOG_LEVEL=debug

Inspecting NodeClaims

$ kubectl get nodeclaims
NAME            TYPE         ZONE         NODE                          READY   AGE
default-t5k2p   c5.large     us-east-1a   ip-10-0-12-192.ec2.internal   True    12m
default-8g4n1   m5.2xlarge   us-east-1b   ip-10-0-45-21.ec2.internal    True    4m

A NodeClaim with an empty NODE column and READY=False that persists beyond a few minutes means the instance launched but never registered. kubectl describe nodeclaim <name> shows the events timeline: Launched, Registered, Initialized. Missing events point to where the process broke.
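To surface stuck NodeClaims without eyeballing the table, select those with no node name in status. A sketch, assuming jq is available and that .status.nodeName stays unset until registration:

```shell
# List NodeClaims that launched but have not registered a Node
kubectl get nodeclaims -o json \
  | jq -r '.items[] | select(.status.nodeName == null) | .metadata.name'
```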

Troubleshooting Decision Tree

Pods stuck pending

Start with the scheduler queue: karpenter_scheduler_queue_depth tells you if Karpenter is aware of the pending pods. If queue depth is zero, the pods don't match any NodePool's requirements. Check nodeSelector, tolerations, and affinity against your NodePool specs.

If queue depth is high, Karpenter is trying but failing. Check karpenter_cloudprovider_errors_total for API failures and controller logs for "no instance type met the scheduling requirements" messages. This often means the requested resources exceed what any instance type in the NodePool can provide, or all offerings in the requested zones are unavailable.

Nodes launched but not registering

This is the most insidious failure mode because Karpenter reports success. The controller logs show "launched nodeclaim" and the EC2 console shows running instances, but kubectl get nodes shows nothing new.

Common causes:

  • AMI bugs: kubelet fails on startup due to incompatible flags or missing dependencies.
  • IAM misconfiguration: the node role is missing from the aws-auth ConfigMap.
  • Network issues: the instance cannot reach the API server (security groups, NACLs, or missing VPC endpoints).

Debug by connecting to the instance via SSM and checking kubelet logs:

$ INSTANCE_ID=$(kubectl get nodeclaim <name> -o jsonpath='{.status.providerID}' | cut -d/ -f5)
$ aws ssm start-session --target $INSTANCE_ID
$ sudo journalctl -u kubelet

Excessive node churn

If nodes are being created and destroyed in rapid succession, check for interaction between Karpenter and AWS Node Termination Handler. When NTH has enableRebalanceDraining: true, it removes nodes on spot rebalance recommendations. Karpenter replaces them, potentially with the same instance type, triggering another rebalance recommendation. The fix: disable enableRebalanceDraining in NTH, or remove NTH entirely since Karpenter handles spot interruptions natively through its SQS-based interruption handling.
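One way to quantify churn is to compare launches against terminations over a short window; sustained high values on both sides suggest a replace loop rather than organic scaling. A sketch with illustrative thresholds, to be tuned to your cluster's normal scaling rate:

```promql
# PromQL: possible churn loop - many launches AND many terminations per hour
sum(increase(karpenter_nodeclaims_created_total[1h])) > 10
  and
sum(increase(karpenter_nodeclaims_terminated_total[1h])) > 10
```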

Also check consolidateAfter on your NodePools. A very short value causes aggressive consolidation that conflicts with workload startup times.
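A less aggressive setting gives workloads time to start before their nodes become consolidation candidates. A sketch against the v1 NodePool API; tune the duration to your workloads' startup times:

```yaml
# nodepool.yaml (excerpt): delay consolidation after pod scheduling changes
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
```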

Cloud provider errors in private clusters

Karpenter needs to reach the STS endpoint for credential exchange and the EC2 API for instance management. In a private cluster without NAT gateways, missing VPC endpoints cause timeout errors:

Post "https://sts.us-east-1.amazonaws.com/": dial tcp: lookup sts.us-east-1.amazonaws.com: i/o timeout

Create VPC endpoints for sts, ec2, ec2messages, and ssm. For pricing data, there is no VPC endpoint for the Price List API. Set AWS_ISOLATED_VPC=true in the Karpenter configuration to suppress the pricing lookup errors. Karpenter bundles pricing data in the binary, updated with each release.

Gotchas

  • NTH recursive loop: Node Termination Handler with rebalance draining enabled plus Karpenter managing the same Spot nodes creates a launch-terminate cycle. Karpenter does not know why NTH removed the node and may launch the same instance type again. Disable enableRebalanceDraining or remove NTH entirely.
  • Memory overhead estimation: VM_MEMORY_OVERHEAD_PERCENT defaults to 7.5%. If your instance types have a different hypervisor overhead, Karpenter overestimates available memory and bin-packs too aggressively. Monitor the ConsistentStateFound condition on NodeClaims. A False status means actual capacity did not match the estimate.
  • Consolidation blocked by PDBs: karpenter_voluntary_disruption_eligible_nodes stays high but decisions stay at zero. PodDisruptionBudgets or the karpenter.sh/do-not-disrupt: "true" annotation are the usual culprits. Check karpenter_voluntary_disruption_queue_failures_total for failure counts.
  • Spot instance type diversity: If your NodePool allows only a few instance types, Spot interruptions cluster. karpenter_interruption_received_messages_total spikes but Karpenter cannot find replacement capacity. Widen the allowed instance types via the NodePool's requirements (e.g., the node.kubernetes.io/instance-type or karpenter.k8s.aws/instance-family requirement key).

Wrap-up

The two metrics that catch Karpenter problems before users notice: the NodeClaim-to-Node registration delta and the cloud provider error rate. Start with the kubernetes-autoscaling-mixin dashboards and the custom registration alert from this article, then build runbooks around the troubleshooting decision tree as your team encounters new failure modes.
