Observability and Troubleshooting for Karpenter

Monday morning, 07:43 UTC. Karpenter's logs show five NodeClaims launched in the last ten minutes. The EC2 console confirms five running instances. But kubectl get nodes shows zero new nodes. Pods sit pending. The controller says everything is fine. Without the right metrics, you have no idea where the process broke.
This is a real scenario. It happened to PerfectScale when a buggy Ubuntu EKS AMI shipped with a broken kubelet bootstrap script. Karpenter did its job, the cloud provider did its job, but the nodes never joined the cluster. The only signal was a divergence between two Prometheus counters that nobody was watching.
Every production Karpenter deployment needs monitoring. This article covers how to scrape Karpenter's Prometheus metrics correctly, which signals matter for each operational concern, how to build dashboards and alerts that catch problems early, and a troubleshooting decision tree for the most common failures.
The Metrics Endpoint and Scraping Setup
Karpenter exposes Prometheus-format metrics at :8080/metrics, configurable via the METRICS_PORT environment variable. The health probe lives on a separate port, 8081, serving Kubernetes liveness checks. This two-port split is a common source of misconfiguration: scraping port 8081 returns a 200 OK but no metric data.
The simplest way to start scraping is through the Helm chart's built-in ServiceMonitor. Enable it during installation or upgrade:
```shell
# karpenter helm install (excerpt)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set serviceMonitor.enabled=true \
  --wait
```
If you need a standalone ServiceMonitor (e.g., you installed Karpenter without the Helm chart's monitoring support), create one manually:
```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - karpenter
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - path: /metrics
      port: http-metrics
      interval: 60s
```
Verify Prometheus is scraping Karpenter by checking the targets page or running:
```promql
# Verify scrape target is up
up{job="karpenter"}
```
A value of 1 confirms the scrape is succeeding. If the target does not appear, check that your ServiceMonitor labels match the Prometheus Operator's serviceMonitorSelector.
Warning: If your Prometheus Operator uses custom labels to discover ServiceMonitors, add them to `additionalLabels` in the Helm chart values. A missing label is the second most common reason metrics never appear in Prometheus, right after scraping the wrong port.
The HA scraping problem
Karpenter runs two replicas by default, but only the leader pod actively handles provisioning and disruption. The standby pod sits idle for failover. When you scrape through a Service, Prometheus can hit either pod. The standby reports zero or stale values for most operational metrics, which skews aggregations and makes dashboards unreliable.
Two approaches handle this:
- Accept and aggregate. Use `sum()` or `max()` by instance in your PromQL queries. Since only the leader increments counters, the standby contributes zeros that don't affect sums. This is the pragmatic default.
- Target the leader only. Use a PodMonitor with a label selector that matches the leader election label. This is cleaner but couples your scraping to Karpenter's internal leader election mechanism.

Use `sum()` and move on. It handles the standby pod's zeros correctly, does not couple your scraping to Karpenter's leader election internals, and works without changes when you scale replicas.
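As an illustration, a dashboard query that stays correct with two replicas might look like this (the metric name is covered in the next section; the query shape is a sketch):

```promql
# Sum across both replicas: the standby exports zeros,
# so the total reflects only the leader's activity
sum by (nodepool) (rate(karpenter_nodeclaims_created_total[5m]))
```

Scaling to three or more replicas requires no change to the query.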
Key Metrics to Track
Karpenter exposes over 60 metrics across nine subsystems. Rather than walking through them alphabetically, here are the ones that matter, organized by the operational question they answer.
Provisioning health
The most important operational signal in Karpenter is the delta between nodes launched and nodes registered.
| Metric | What it tells you |
|---|---|
| `karpenter_nodeclaims_created_total` | Total NodeClaims created, labeled by reason and nodepool |
| `karpenter_nodeclaims_terminated_total` | Total terminated, labeled by nodepool and capacity_type |
| `karpenter_nodeclaims_disrupted_total` | Total disrupted, labeled by reason (consolidation, drift, expiry) |
| `karpenter_pods_startup_duration_seconds` | Time from pod creation to running state |
When karpenter_nodeclaims_created_total increases but you don't see corresponding nodes joining the cluster, something is failing between EC2 launch and kubelet registration. This exact scenario hit PerfectScale in production: Karpenter was launching instances and logging success, but a buggy Ubuntu EKS AMI caused kubelet to crash on startup. The nodes never registered, pods stayed pending, and the only signal was the divergence between launched and registered counts.
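To watch this divergence on a dashboard, a pair of panels over a shared window works (a sketch; karpenter_nodes_created_total is the registration-side counter used by the custom alert later in this article):

```promql
# NodeClaims launched in the last 15 minutes
sum(increase(karpenter_nodeclaims_created_total[15m]))

# Nodes that actually registered in the same window
sum(increase(karpenter_nodes_created_total[15m]))
```

In a healthy cluster the two lines track each other; a persistent gap means instances are launching but not joining.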
Disruption activity
```promql
# PromQL: disruption decisions in the last hour, broken down by reason
sum by (reason) (increase(karpenter_voluntary_disruption_decisions_total[1h]))
```
A healthy cluster shows a steady trickle of consolidation decisions, occasional drift replacements, and rare expiry events. Something like {consolidation: 3, drift: 1, expiry: 0} over an hour is normal for a moderately active cluster.
The diagnostic signal is the gap between eligible and decided. karpenter_voluntary_disruption_eligible_nodes shows how many nodes Karpenter wants to disrupt. If this stays high while decisions stay at zero, something is blocking disruption:
```promql
# PromQL: detect blocked disruption (eligible nodes but no decisions)
karpenter_voluntary_disruption_eligible_nodes > 5
and on()
increase(karpenter_voluntary_disruption_decisions_total[1h]) == 0
```
The most common blockers: PodDisruptionBudgets that prevent eviction, the karpenter.sh/do-not-disrupt: "true" annotation on pods, or scheduling constraints that prevent consolidation candidates from being replaced. Check karpenter_voluntary_disruption_queue_failures_total to see how many disruption attempts failed and why.
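A quick check for failed attempts might look like this (a sketch; grouping labels vary by Karpenter version, so confirm against your /metrics output before relying on them):

```promql
# Disruption queue failures over the last hour; any non-zero value
# means eviction or replacement attempts are being rejected
increase(karpenter_voluntary_disruption_queue_failures_total[1h]) > 0
```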
Scheduling performance
karpenter_scheduler_scheduling_duration_seconds tracks how long Karpenter's scheduling simulation takes. A sustained increase means the controller is spending more time evaluating options before acting. karpenter_scheduler_queue_depth shows pending pods waiting for scheduling. If queue depth is high but scheduling duration is normal, the bottleneck is likely on the cloud provider side.
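Assuming the duration metric is exposed as a Prometheus histogram (the _bucket suffix below is the standard convention, but verify it against your /metrics output), a tail-latency panel could be:

```promql
# p99 scheduling simulation latency over 5-minute windows
histogram_quantile(0.99,
  sum by (le) (rate(karpenter_scheduler_scheduling_duration_seconds_bucket[5m])))
```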
NodePool capacity
```promql
# PromQL: percentage of NodePool limit consumed
sum by (nodepool, resource_type) (karpenter_nodepools_usage)
/
sum by (nodepool, resource_type) (karpenter_nodepools_limit)
* 100
```
This is the query behind the KarpenterNodepoolNearCapacity alert. When a NodePool approaches its CPU or memory limit, new pods targeting that pool will go pending regardless of available cloud capacity.
Cloud provider health
karpenter_cloudprovider_errors_total counts API failures, labeled by controller, method, and error. karpenter_cloudprovider_duration_seconds tracks call latency. Together, they surface throttling, quota exhaustion, and region-specific outages.
On AWS, the interruption subsystem adds two additional metrics: karpenter_interruption_received_messages_total tracks SQS events for Spot interruptions, scheduled maintenance, and instance state changes. karpenter_interruption_message_queue_duration_seconds measures how long events sit before Karpenter processes them.
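A sketch of an interruption-activity panel; the message_type label name is an assumption to verify against your scrape before using it in a dashboard:

```promql
# Interruption events per hour, broken down by event kind
sum by (message_type) (increase(karpenter_interruption_received_messages_total[1h]))

# Average time events wait in the queue before Karpenter acts on them
sum(rate(karpenter_interruption_message_queue_duration_seconds_sum[5m]))
/
sum(rate(karpenter_interruption_message_queue_duration_seconds_count[5m]))
```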
Grafana Dashboards and Alerting
Community dashboards
The kubernetes-autoscaling-mixin project provides three Jsonnet-generated Grafana dashboards specifically for Karpenter:
- Overview (22171): NodePool summary, node and pod counts, resource usage vs limits. Breaks down by zone, architecture, instance type, and capacity type.
- Activity (22172): Scale-up and scale-down events with the reasoning behind each disruption decision. Includes pod phase transitions and startup durations.
- Performance (22173): Cloud provider errors, node termination duration, pod startup latency, interruption queue depth, work queue metrics, and controller reconciliation rates.
Import any of these by dashboard ID in Grafana. They work with the standard karpenter job label.
An alternative is the PerfectScale dashboard (20398), which takes a different layout approach and works with Karpenter >= 0.33.
PrometheusRule alerts
The kubernetes-autoscaling-mixin includes production-ready alert rules. Here are the five Karpenter alerts from the project, slightly reformatted for clarity:
```yaml
# prometheus-rules.yaml (from kubernetes-autoscaling-mixin)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter
      rules:
        - alert: KarpenterCloudProviderErrors
          expr: |
            sum(
              increase(
                karpenter_cloudprovider_errors_total{
                  controller!~"nodeclaim.termination|node.termination",
                  error!="NodeClaimNotFoundError"
                }[5m]
              )
            ) by (provider, controller, method) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Cloud provider errors detected"
        - alert: KarpenterNodepoolNearCapacity
          expr: |
            sum by (nodepool, resource_type) (karpenter_nodepools_usage)
            /
            sum by (nodepool, resource_type) (karpenter_nodepools_limit)
            * 100 > 75
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "NodePool {{ $labels.nodepool }} approaching limit"
        - alert: KarpenterNodeClaimsTerminationDurationHigh
          expr: |
            sum(rate(karpenter_nodeclaims_termination_duration_seconds_sum[5m]))
            by (nodepool)
            /
            sum(rate(karpenter_nodeclaims_termination_duration_seconds_count[5m]))
            by (nodepool) > 1200
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Node termination taking over 20 minutes"
        - alert: KarpenterClusterStateNotSynced
          expr: |
            sum(karpenter_cluster_state_synced) == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Karpenter cluster state out of sync"
        - alert: KarpenterSchedulerQueueDepthHigh
          expr: |
            sum(karpenter_scheduler_queue_depth) > 50
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Over 50 pods waiting in scheduler queue"
```
The exclusion filter on KarpenterCloudProviderErrors is intentional: termination controllers and NodeClaimNotFoundError produce transient errors during normal operation that should not trigger alerts.
The missing alert: nodes not registering
None of the standard alert sets catch the scenario where Karpenter launches instances that never join the cluster. Add this:
```yaml
# prometheus-rules.yaml (custom)
- alert: KarpenterNodesNotRegistering
  expr: |
    sum by (nodepool) (increase(karpenter_nodeclaims_created_total[15m]))
    -
    sum by (nodepool) (increase(karpenter_nodes_created_total[15m]))
    > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "NodeClaims created but nodes not registering in {{ $labels.nodepool }}"
```
The increase() wrapper is essential here. Both karpenter_nodeclaims_created_total and karpenter_nodes_created_total are monotonically increasing counters. Subtracting raw counter values would produce a non-zero result even during normal operation. Using increase() over a 15-minute window compares recent activity only: if more NodeClaims were created than nodes in that window, something is preventing registration. This alert would have caught the PerfectScale AMI incident before the team noticed pending pods.
Controller Logs and Debugging
Karpenter emits structured JSON logs with a logger field that groups entries by controller. The three you will read most often:
- `controller.provisioner`: pod evaluation and scheduling decisions
- `controller.nodeclaim.lifecycle`: node launch, registration, and initialization
- `controller.disruption`: consolidation, drift replacement, and expiry
The provisioning sequence
A healthy node launch produces this log sequence:
```json
{"level":"INFO","time":"2026-03-22T02:24:16.114Z","logger":"controller.provisioner","message":"found provisionable pod(s)","Pods":"default/api-server-7b4f6d-x2k9p","duration":"10.5ms"}
{"level":"INFO","time":"2026-03-22T02:24:19.028Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","NodeClaim":{"name":"default-sfpsl"},"provider-id":"aws:///us-east-1b/i-01234567adb205c7e","instance-type":"c5.2xlarge","zone":"us-east-1b","capacity-type":"spot","allocatable":{"cpu":"8","memory":"16Gi"}}
{"level":"INFO","time":"2026-03-22T02:26:19.028Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","NodeClaim":{"name":"default-sfpsl"},"Node":{"name":"ip-10-0-12-34.us-east-1.compute.internal"}}
{"level":"INFO","time":"2026-03-22T02:26:52.642Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","NodeClaim":{"name":"default-sfpsl"},"Node":{"name":"ip-10-0-12-34.us-east-1.compute.internal"}}
```
If you see "launched nodeclaim" but never "registered nodeclaim," the instance started but kubelet failed to join the cluster. Check the instance directly via SSM.
Enabling debug logging
```shell
# Via Helm
$ helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
    --set logLevel=debug --reuse-values

# Or patch the deployment directly
$ kubectl set env deployment/karpenter -n karpenter LOG_LEVEL=debug
```
Inspecting NodeClaims
```shell
$ kubectl get nodeclaims
NAME            TYPE         ZONE         NODE                          READY   AGE
default-t5k2p   c5.large     us-east-1a   ip-10-0-12-192.ec2.internal   True    12m
default-8g4n1   m5.2xlarge   us-east-1b   ip-10-0-45-21.ec2.internal    True    4m
```
A NodeClaim with an empty NODE column and READY=False that persists beyond a few minutes means the instance launched but never registered. kubectl describe nodeclaim <name> shows the events timeline: Launched, Registered, Initialized. Missing events point to where the process broke.
Troubleshooting Decision Tree
Pods stuck pending
Start with the scheduler queue: karpenter_scheduler_queue_depth tells you if Karpenter is aware of the pending pods. If queue depth is zero, the pods don't match any NodePool's requirements. Check nodeSelector, tolerations, and affinity against your NodePool specs.
If queue depth is high, Karpenter is trying but failing. Check karpenter_cloudprovider_errors_total for API failures and controller logs for "no instance type met the scheduling requirements" messages. This often means the requested resources exceed what any instance type in the NodePool can provide, or all offerings in the requested zones are unavailable.
Nodes launched but not registering
This is the most insidious failure mode because Karpenter reports success. The controller logs show "launched nodeclaim" and the EC2 console shows running instances, but kubectl get nodes shows nothing new.
Common causes:
- AMI bugs: kubelet fails on startup due to incompatible flags or missing dependencies.
- IAM misconfiguration: the node role is missing from the `aws-auth` ConfigMap.
- Network issues: the instance cannot reach the API server (security groups, NACLs, or missing VPC endpoints).
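For the IAM case, the node role must be mapped in the aws-auth ConfigMap (EKS access entries are the newer alternative). A minimal excerpt, with the account ID and role name as placeholders:

```yaml
# kube-system/aws-auth excerpt: map the Karpenter node role
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/KarpenterNodeRole-my-cluster  # placeholder
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```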
Debug by connecting to the instance via SSM and checking kubelet logs:
```shell
# Extract the EC2 instance ID from the NodeClaim's provider ID
$ INSTANCE_ID=$(kubectl get nodeclaim <name> -o jsonpath='{.status.providerID}' | cut -d/ -f5)
$ aws ssm start-session --target $INSTANCE_ID
$ sudo journalctl -u kubelet
```
Excessive node churn
If nodes are being created and destroyed in rapid succession, check for interaction between Karpenter and AWS Node Termination Handler. When NTH has enableRebalanceDraining: true, it removes nodes on spot rebalance recommendations. Karpenter replaces them, potentially with the same instance type, triggering another rebalance recommendation. The fix: disable enableRebalanceDraining in NTH, or remove NTH entirely since Karpenter handles spot interruptions natively through its SQS-based interruption handling.
Also check consolidateAfter on your NodePools. A very short value causes aggressive consolidation that conflicts with workload startup times.
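A NodePool excerpt showing where these settings live (values are illustrative, not recommendations; field names per the karpenter.sh/v1 API):

```yaml
# NodePool disruption settings (illustrative values)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m   # too short a value churns nodes faster than workloads start
```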
Cloud provider errors in private clusters
Karpenter needs to reach the STS endpoint for credential exchange and the EC2 API for instance management. In a private cluster without NAT gateways, missing VPC endpoints cause timeout errors:
```
Post "https://sts.us-east-1.amazonaws.com/": dial tcp: lookup sts.us-east-1.amazonaws.com: i/o timeout
```
Create VPC endpoints for sts, ec2, ec2messages, and ssm. For pricing data, there is no VPC endpoint for the Price List API. Set AWS_ISOLATED_VPC=true in the Karpenter configuration to suppress the pricing lookup errors. Karpenter bundles pricing data in the binary, updated with each release.
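Setting the flag via the controller environment looks like this (a deployment excerpt; the Helm chart also exposes it, commonly as settings.isolatedVPC — verify the key against your chart version):

```yaml
# Karpenter controller container env excerpt
env:
  - name: AWS_ISOLATED_VPC
    value: "true"   # skip Price List API lookups; use pricing data bundled in the binary
```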
Gotchas
- NTH recursive loop: Node Termination Handler with rebalance draining enabled plus Karpenter managing the same Spot nodes creates a launch-terminate cycle. Karpenter does not know why NTH removed the node and may launch the same instance type again. Disable `enableRebalanceDraining` or remove NTH entirely.
- Memory overhead estimation: `VM_MEMORY_OVERHEAD_PERCENT` defaults to 7.5%. If your instance types have a different hypervisor overhead, Karpenter overestimates available memory and bin-packs too aggressively. Monitor the `ConsistentStateFound` condition on NodeClaims. A `False` status means actual capacity did not match the estimate.
- Consolidation blocked by PDBs: `karpenter_voluntary_disruption_eligible_nodes` stays high but decisions stay at zero. PodDisruptionBudgets or the `karpenter.sh/do-not-disrupt: "true"` annotation are the usual culprits. Check `karpenter_voluntary_disruption_queue_failures_total` for failure counts.
- Spot instance type diversity: If your NodePool allows only a few instance types, Spot interruptions cluster. `karpenter_interruption_received_messages_total` spikes but Karpenter cannot find replacement capacity. Widen the instance-type requirements in your NodePool spec.
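Instance diversity is expressed through NodePool requirements rather than an explicit type list. A sketch using the standard AWS well-known label keys:

```yaml
# NodePool requirements excerpt: allow many instance shapes for Spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["4", "8", "16"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```

The broader the requirement set, the more Spot pools Karpenter can draw from when one pool is interrupted.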
Wrap-up
The two metrics that catch Karpenter problems before users notice: the NodeClaim-to-Node registration delta and the cloud provider error rate. Start with the kubernetes-autoscaling-mixin dashboards and the custom registration alert from this article, then build runbooks around the troubleshooting decision tree as your team encounters new failure modes.
This post is part of the Karpenter — Kubernetes Node Autoscaling from Setup to Optimization collection (6 of 6)