KubeDojo

Observability and Troubleshooting for KEDA


Your queue depth is climbing, pods aren't scaling, and the KEDA operator logs show nothing obvious. Without metrics, you're guessing. With them, you know exactly which scaler failed, why, and how long ago.

KEDA exposes Prometheus metrics across its three components: the operator, the metrics server, and the admission webhooks. Each serves a different role in the scaling pipeline, and each has its own set of metrics worth monitoring. The goal: make KEDA's scaling decisions visible, alertable, and debuggable before your users notice something is wrong.

Figure 1 — KEDA's three components expose metrics on :8080, scraped by Prometheus via ServiceMonitor. Grafana dashboards and PrometheusRules consume the data. kubectl --raw queries the External Metrics API directly.

Enabling Prometheus Metrics

KEDA ships with Prometheus metrics disabled by default. Every prometheus.*.enabled flag in the Helm chart is false out of the box. Deploy KEDA with default values and you get zero metrics, not even from the operator.

To enable scraping across all three components:

# kedacore/charts/keda/values.yaml (prometheus section)
prometheus:
  metricServer:
    enabled: true
    port: 8080
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus    # must match your Prometheus operator's serviceMonitorSelector
  operator:
    enabled: true
    port: 8080
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus
  webhooks:
    enabled: true
    port: 8080
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus

Each component (operator, metrics server, webhooks) has independent configuration for metric exposure, ServiceMonitor creation, and PodMonitor creation. The additionalLabels field is critical: if your Prometheus instance uses a serviceMonitorSelector (most Prometheus Operator installations do), the labels must match or the ServiceMonitor will be silently ignored.

warning: Setting prometheus.operator.enabled: true without prometheus.operator.serviceMonitor.enabled: true exposes the metrics endpoint but does not create a ServiceMonitor. Prometheus won't scrape it unless you create the ServiceMonitor manually or configure a Pod annotation-based scrape.
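If you go the manual route, a minimal ServiceMonitor sketch might look like this. The namespace, selector labels, and port name below are assumptions — verify them against the metrics Service the chart actually created (`kubectl get svc -n keda --show-labels`) before applying:

```yaml
# Hypothetical ServiceMonitor for the KEDA operator metrics endpoint.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keda-operator
  namespace: keda
  labels:
    release: prometheus        # must match your serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: keda-operator       # assumed label; confirm on your Service
  endpoints:
    - port: metrics            # assumed port name; confirm on your Service
      path: /metrics
```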

If you're not using the Prometheus Operator, you can skip ServiceMonitors and use Pod annotations instead:

# Pod annotation-based scraping (alternative to ServiceMonitor)
podAnnotations:
  keda:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

Key Metrics Reference

All three KEDA components expose metrics on port 8080 at /metrics. The operator metrics are the ones you'll reference most often.

Operator Metrics

| Metric | Type | What it tells you |
| --- | --- | --- |
| keda_scaler_active | Gauge | Whether a scaler is active (1) or inactive (0). First thing to check when scaling stops. |
| keda_scaler_metrics_value | Gauge | The current metric value per scaler. This is what the HPA sees when computing target replicas. |
| keda_scaler_metrics_latency_seconds | Gauge | Time to retrieve the metric from each scaler. Spikes here often mean the external system is slow, not KEDA. |
| keda_scaler_detail_errors_total | Counter | Errors per individual scaler: auth failures, unreachable endpoints, malformed responses. |
| keda_scaled_object_errors_total | Counter | Errors aggregated per ScaledObject. |
| keda_scaled_job_errors_total | Counter | Errors aggregated per ScaledJob. |
| keda_scaled_object_paused | Gauge | Whether a ScaledObject is paused (1) or active (0). Catches forgotten maintenance pauses. |
| keda_internal_scale_loop_latency_seconds | Gauge | Deviation between expected and actual scale loop execution. Growing values mean the operator is overloaded. |
| keda_resource_registered_total | Gauge | Total KEDA CRDs per namespace. Useful for inventory and capacity planning. |
| keda_trigger_registered_total | Gauge | Total triggers per type. Shows your scaler distribution. |
| keda_build_info | Info | Static build information (version, git commit, Go runtime). |

note: When you deploy KEDA without any ScaledObjects or ScaledJobs, the only metric visible is keda_build_info. Scaler-related metrics only appear once at least one ScaledObject with active triggers is deployed. This catches people off guard: they enable Prometheus scraping, see only keda_build_info, and assume the export is broken.

Metrics Server Metrics

The metrics server exposes gRPC metrics for the internal communication between the operator and the metrics API server:

  • keda_internal_metricsservice_grpc_client_started_total / handled_total — request counts
  • keda_internal_metricsservice_grpc_client_handling_seconds — response latency histogram

These matter when debugging metric delivery failures. If the operator collects metrics but the HPA doesn't see them, the gRPC channel between operator and metrics server is the next place to look.
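A quick way to spot a failing channel is to filter the handled-request counter by response code. This sketch assumes the standard go-grpc-prometheus labels (`grpc_code`, `grpc_method`) are present on these series — confirm against your scraped data:

```promql
# Non-OK gRPC responses between operator and metrics server, per method
sum(rate(keda_internal_metricsservice_grpc_client_handled_total{grpc_code!="OK"}[5m]))
  by (grpc_method, grpc_code)
```

A sustained non-zero rate here, with healthy operator metrics, points at the metrics server or the network path between the two components.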

Admission Webhooks Metrics

  • keda_webhook_scaled_object_validation_total — total validation requests
  • keda_webhook_scaled_object_validation_errors — validation failures

These are mostly relevant during rollout of new ScaledObject configurations. A spike in validation errors means your manifests have schema issues.

Grafana Dashboards

KEDA offers a premade Grafana dashboard, and the community has built richer alternatives.

Official Dashboard

The KEDA project maintains a dashboard at kedacore/keda/config/grafana/keda-dashboard.json. It has two sections: metrics server visualization (gRPC request rates, latency) and scale target replica changes over time. Template variables include datasource, namespace, scaledObject, scaledJob, scaler, and metric.

The limitation: it focuses on the metrics server's internal gRPC metrics, not on the operator-level scaler health metrics that you need for day-to-day monitoring. You won't see keda_scaler_active or keda_scaler_detail_errors_total on this dashboard.

Community Dashboards

The kubernetes-autoscaling-mixin project provides two dashboards that are more useful for operations:

Scaled Object dashboard (Grafana ID 23951):

  • Total ScaledObject count and status overview
  • Per-object scaler metric values and collection latency
  • Error rates per ScaledObject
  • Direct links to the HPA dashboard for deeper scaling insights

Scaled Job dashboard (Grafana ID 23952):

  • Same structure as Scaled Object, tailored for ScaledJobs
  • Links to the Kubernetes Workload dashboard for job resource consumption

Import them directly from Grafana Labs. They require the KEDA operator metrics to be enabled (the Helm prometheus.operator.enabled: true flag covered earlier).

If you're building your own panels, these PromQL queries cover the basics:

# Scaler health overview — which scalers are active?
keda_scaler_active{namespace="$namespace", scaledObject="$scaledObject"}

# Metric collection latency per scaler (p99 over 5m). The metric is a gauge,
# so use quantile_over_time, not histogram_quantile (there are no _bucket series)
quantile_over_time(0.99, keda_scaler_metrics_latency_seconds[5m])

# Error rate per ScaledObject over the last hour
sum(increase(keda_scaled_object_errors_total[1h])) by (scaledObject, exported_namespace)

Start with the community dashboards for daily operations. Import the official dashboard if you need to debug the gRPC channel between operator and metrics server.

Prometheus Alerts

The same kubernetes-autoscaling-mixin project provides five production-ready alerts for KEDA. Here are the PromQL expressions you'll deploy:

# adinhodovic/kubernetes-autoscaling-mixin/prometheus_alerts.yaml (KEDA rules)
- alert: KedaScaledObjectErrors
  expr: |
    sum(
      increase(
        keda_scaled_object_errors_total{
          job="keda-operator"
        }[10m]
      )
    ) by (cluster, job, exported_namespace, scaledObject) > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    description: >-
      KEDA scaled objects are experiencing errors.
      Check the scaled object {{ $labels.scaledObject }}
      in namespace {{ $labels.exported_namespace }}.

- alert: KedaScalerLatencyHigh
  expr: |
    avg(
      keda_scaler_metrics_latency_seconds{
        job="keda-operator"
      }
    ) by (cluster, job, exported_namespace, scaledObject, scaler) > 5
  for: 10m
  labels:
    severity: warning
  annotations:
    description: >-
      Metric latency for scaler {{ $labels.scaler }}
      for the object {{ $labels.scaledObject }}
      has exceeded acceptable limits.

- alert: KedaScaledObjectPaused
  expr: |
    max(
      keda_scaled_object_paused{
        job="keda-operator"
      }
    ) by (cluster, job, exported_namespace, scaledObject) > 0
  for: 25h
  labels:
    severity: warning
  annotations:
    description: >-
      The scaled object {{ $labels.scaledObject }}
      in namespace {{ $labels.exported_namespace }}
      is paused for longer than 25h.

- alert: KedaScalerDetailErrors
  expr: |
    sum(
      increase(
        keda_scaler_detail_errors_total{
          job="keda-operator"
        }[10m]
      )
    ) by (cluster, job, exported_namespace, scaledObject, type, scaler) > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    description: >-
      Errors in KEDA scaler {{ $labels.scaler }}
      for {{ $labels.scaledObject }}
      in namespace {{ $labels.exported_namespace }}.

The full set also includes KedaScaledJobErrors (same pattern as KedaScaledObjectErrors but for ScaledJobs).

Tuning these thresholds: The 5-second latency threshold on KedaScalerLatencyHigh works for fast scalers like Prometheus and Redis. SQS and Kafka scalers involve API calls that can take longer under load. Adjust per scaler type. The 25-hour KedaScaledObjectPaused threshold catches maintenance pauses that were never unpaused. Shorten it if you don't routinely pause ScaledObjects for full-day windows.
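One way to split the threshold is to duplicate the latency alert with a scaler-type matcher. The `scaler` label values below are assumptions for illustration — check which values your deployment actually emits (`count by (scaler) (keda_scaler_metrics_latency_seconds)`) before using them:

```promql
# Looser 15s threshold for scalers that call slower external APIs
avg(
  keda_scaler_metrics_latency_seconds{
    job="keda-operator",
    scaler=~"awsSqsQueueScaler|kafkaScaler"   # assumed label values
  }
) by (cluster, job, exported_namespace, scaledObject, scaler) > 15
```

Keep the original 5-second rule for everything else by excluding the same scalers with a negated matcher.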

You can deploy these using the Helm chart's built-in PrometheusRules support:

# kedacore/charts/keda/values.yaml (prometheusRules section)
prometheus:
  operator:
    prometheusRules:
      enabled: true
      alerts:
        - alert: KedaScaledObjectErrors
          expr: sum(increase(keda_scaled_object_errors_total[10m])) by (scaledObject) > 0
          for: 1m
          labels:
            severity: warning

This simplified example groups only by scaledObject. The community version groups by (cluster, job, exported_namespace, scaledObject) for multi-namespace granularity. Keep the full label set if you run KEDA across multiple namespaces.

Or import the community alerts file directly into your Prometheus deployment.

Querying Metrics with kubectl

You don't always need Prometheus to check what KEDA sees. The external metrics API is queryable directly with kubectl.

List all external metrics KEDA exposes:

$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "externalmetrics",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": ["get"]
    }
  ]
}

Query a specific metric value for a ScaledObject:

$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/production/s0-prometheus-http_requests_per_second?labelSelector=scaledobject.keda.sh%2Fname%3Dapi-scaler"
{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metricName": "s0-prometheus-http_requests_per_second",
      "metricLabels": null,
      "timestamp": "2026-03-31T10:15:00Z",
      "value": "42"
    }
  ]
}

To find the metric names KEDA generates for a ScaledObject:

$ kubectl get scaledobject api-scaler -n production -o jsonpath='{.status.externalMetricNames}'
["s0-prometheus-http_requests_per_second"]

warning: CPU and memory scalers do not register as external metrics. They use the metrics.k8s.io API (provided by metrics-server), not external.metrics.k8s.io. Querying the external metrics endpoint for CPU/memory scalers returns nothing.

These kubectl queries are useful for quick sanity checks: if KEDA reports a metric value of 0 but your Prometheus shows 42, the problem is in the scaler configuration (wrong query, wrong server address, auth failure), not in the HPA.
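When the response is long, jq makes the comparison quick. A sketch against a captured sample payload — in practice, pipe the `kubectl get --raw` output straight into jq instead of the variable used here:

```shell
# Sample ExternalMetricValueList payload (stand-in for real kubectl output)
response='{"kind":"ExternalMetricValueList","items":[{"metricName":"s0-prometheus-http_requests_per_second","value":"42"}]}'

# Extract the metric value KEDA reports to the HPA
echo "$response" | jq -r '.items[0].value'
```

If this value disagrees with what your metric source shows, the scaler configuration is the problem, not the HPA.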

Troubleshooting Common Failures

ScaledObject Not Scaling

Check keda_scaler_active first. A value of 0 means the scaler is inactive, either because the external metric source is unreachable or because the scaler is returning errors.

$ kubectl describe scaledobject api-scaler -n production
# Look for:
#   Conditions:
#     Type    Status  Reason
#     Active  False   FailedGetExternalMetric
#     Ready   True    ScaledObjectReady

If Active is False, check the operator logs:

$ kubectl logs deployment/keda-operator -n keda | grep "api-scaler"

Common causes: expired credentials in a TriggerAuthentication, wrong serverAddress in a Prometheus trigger, network policies blocking egress from the KEDA namespace.

"Only keda_build_info Visible"

This is the most frequently reported KEDA metrics issue (GitHub issue #5599). Two possible causes:

  1. No scalers deployed yet. Scaler metrics only appear after at least one ScaledObject with active triggers exists. Deploy a ScaledObject and the remaining metrics populate within one scale loop interval (default 30s).
  2. Helm prometheus.*.enabled not set. The default is false for all three components. Verify with helm get values keda -n keda and confirm the prometheus section is configured.
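To see what the operator actually exposes, port-forward and curl the endpoint (`kubectl port-forward -n keda deploy/keda-operator 8080:8080`, then `curl -s localhost:8080/metrics`). The sketch below runs the same check against a sample payload from a fresh install; the version and commit values are made up for illustration:

```shell
# Sample /metrics exposition from an operator with no ScaledObjects deployed
metrics='# HELP keda_build_info Info metric
# TYPE keda_build_info gauge
keda_build_info{version="2.x",git_commit="deadbeef"} 1'

# List the KEDA metric names present — only keda_build_info on a fresh install
echo "$metrics" | grep -o '^keda_[a-z_]*' | sort -u
```

Seeing more than `keda_build_info` in the real output confirms export works and any gap is on the Prometheus scraping side.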

Scaler Errors Spiking

Watch keda_scaler_detail_errors_total increasing. This counter increments per scaler, with labels for the scaler name and ScaledObject.

Common causes:

  • Expired credentials: TriggerAuthentication referencing a Secret with rotated credentials
  • Endpoint unreachable: Prometheus server moved, SQS endpoint changed, DNS resolution failing
  • Query returning no data: PromQL query valid but matching zero time series (label mismatch after a deployment change)

$ kubectl get triggerauthentication -n production -o yaml
# Verify secretTargetRef points to a current, valid Secret

Scale Loop Latency Growing

keda_internal_scale_loop_latency_seconds measures how far behind the scale loop is running compared to its expected interval. A value of 0 means on-time. Growing values mean the operator is spending too long collecting metrics.

Check per-scaler latency to find the bottleneck:

# PromQL: which scaler is slow?
topk(5, keda_scaler_metrics_latency_seconds)

Common causes: too many ScaledObjects per operator instance, slow external APIs, network latency between KEDA and the metric source. Consider splitting workloads across namespaces or reducing poll frequency on slow scalers.
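Raising the polling interval on a slow scaler is a one-line change on the ScaledObject. The names, query, and threshold below are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler              # illustrative name
  namespace: production
spec:
  scaleTargetRef:
    name: api                   # illustrative Deployment name
  pollingInterval: 60           # default is 30; raise it for slow external APIs
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(http_requests_total[2m]))
        threshold: "50"
```

The trade-off is slower reaction to load changes, so reserve longer intervals for the scalers that actually show up in the latency query.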

ScaledObject Stuck at minReplicaCount

The scaler returns a value but it's below the threshold, so KEDA keeps the deployment at minReplicaCount instead of scaling to zero. Use the kubectl metric query to see what KEDA actually sees:

$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/production/s0-prometheus-http_requests_per_second?labelSelector=scaledobject.keda.sh%2Fname%3Dapi-scaler"

Compare this value against your ScaledObject's threshold. If the metric value is 1 and the threshold is 50, sitting at minReplicaCount is correct behavior. If the metric value is 0 and you still have pods running, check whether minReplicaCount has been set to 1 or higher — it defaults to 0, so a non-zero floor means someone set it explicitly.
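The two fields to compare sit side by side in the spec; a fragment with illustrative values:

```yaml
# ScaledObject fragment — illustrative values
spec:
  minReplicaCount: 0     # the default; set >0 only if you want warm replicas
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        threshold: "50"  # what the kubectl-reported metric value is compared against
```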

Gotchas

  • All Prometheus flags default to false. Deploying KEDA with default Helm values gives you zero metrics. Enable all three: prometheus.operator.enabled, prometheus.metricServer.enabled, prometheus.webhooks.enabled.

  • keda_build_info is a red herring. It's the only metric visible until you deploy at least one ScaledObject. People see it and assume the rest is broken. It's not. Deploy a ScaledObject and wait 30 seconds.

  • The official Grafana dashboard focuses on gRPC metrics. The premade keda-dashboard.json visualizes the metrics server's internal gRPC communication, not the operator-level scaler health metrics you'll need daily. The community dashboards from kubernetes-autoscaling-mixin are more operationally useful.

  • Scaler latency != KEDA latency. A spike in keda_scaler_metrics_latency_seconds usually means the external system (Prometheus, SQS, Kafka) is slow, not that KEDA has a bug. Check the external system first.

  • CPU/memory scalers use a different API. They go through metrics.k8s.io, not external.metrics.k8s.io. Querying the external metrics endpoint for them returns nothing, which looks like a broken ScaledObject but is actually expected behavior.

Wrap-up

Enable metrics scraping from day one. Deploy the community dashboards and alerts before you have your first scaling incident, not after.

With these metrics and alerts in place, the next step is building runbooks. Each alert in the community set maps to a diagnostic sequence: KedaScalerLatencyHigh leads you to check the external metric source, KedaScaledObjectErrors leads you to TriggerAuthentication secrets and operator logs, KedaScaledObjectPaused leads you to whoever paused it and forgot. Document those sequences for your on-call team. Event-driven autoscaling is only as trustworthy as the operational playbook behind it.
