KubeDojo

Custom Metrics and Prometheus-Based Scaling with KEDA

KubeDojo
by KubeDojo·18 min read·
Custom Metrics and Prometheus-Based Scaling with KEDA

Kubernetes HPA scales on CPU and memory. Most production workloads need to scale on something else entirely: request rate, queue depth, error count, p95 latency, active sessions, orders per second. The metric that reflects real demand is almost never CPU utilization.

The traditional bridge between Prometheus and HPA is the Prometheus Adapter. It works, technically. But configuring it has driven more than a few teams to question their career choices. KEDA's Prometheus scaler replaces that adapter with a single ScaledObject: one PromQL query, one threshold, and the same scale-to-zero capability that makes KEDA worth running in the first place.

This article covers the Prometheus scaler in depth: how it works under the hood, the critical distinction between activation and scaling thresholds, authentication patterns, managed cloud Prometheus integration, and production-ready PromQL query design.

The Problem with Prometheus Adapter

Before KEDA, the standard path from Prometheus to HPA was the Prometheus Adapter, a Kubernetes SIG project that registers as a custom metrics API server. You define rules in a ConfigMap that tell the adapter how to discover Prometheus metrics, associate them with Kubernetes resources, rename them for the metrics API, and construct the PromQL query.

Here is what a real Prometheus Adapter rule looks like:

# prometheus-adapter ConfigMap rules
rules:
  - seriesQuery: '{__name__=~"^container_.*",container!="POD",namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: { resource: 'namespace' }
        pod: { resource: 'pod' }
    name:
      matches: '^container_(.*)_seconds_total$'
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>,container!="POD"}[2m])) by (<<.GroupBy>>)'

Each rule has four parts: discovery (which Prometheus series to find), association (which Kubernetes resources they map to), naming (how to expose them in the custom metrics API), and querying (how to transform them into PromQL). The template variables <<.Series>>, <<.LabelMatchers>>, and <<.GroupBy>> are adapter-specific syntax that most teams find cryptic on first (and second, and third) encounter.

The pain compounds at scale:

  • Single ConfigMap: All rules live in one ConfigMap. Hit the 1MB etcd size limit and you're stuck. Every team shares the same file, and one broken rule can take down metrics for every workload.
  • Single Prometheus instance: The adapter connects to exactly one Prometheus server. If your metrics span multiple Prometheus instances or you run Thanos, you need additional infrastructure.
  • No scale-to-zero: The adapter feeds metrics to HPA, and HPA doesn't scale below minReplicas. There is no activation phase.
  • Operational burden: The adapter is another deployment to run, monitor, and upgrade. Its failure mode is silent: metrics stop updating, HPA decisions go stale, and nobody notices until latency spikes.

KEDA eliminates all of these by moving the Prometheus query from a shared ConfigMap into a per-workload ScaledObject.

How the Prometheus Scaler Works

KEDA's Prometheus scaler is an HTTP client. On each polling interval, the KEDA operator sends a GET request to the Prometheus /api/v1/query endpoint, parses the response, and feeds the resulting float value to HPA as an external metric.

KEDA Prometheus scaler architecture showing the metrics flow from application pods through Prometheus to KEDA and HPA Figure 1: KEDA's Prometheus scaler replaces the Prometheus Adapter, polling Prometheus directly via PromQL and feeding metric values to HPA as external metrics.

The source code in prometheus_scaler.go shows the flow clearly. The ExecutePromQuery function builds the URL:

<serverAddress>/api/v1/query?query=<url-encoded-query>&time=<RFC3339-now>

It sends the request with the configured authentication headers, parses the JSON response, and returns a single float64 value. If the result set is empty and ignoreNullValues is true (the default), the scaler returns 0. If the result set contains more than one element, the scaler returns an error. Your PromQL query must return exactly one value.

Here is a minimal ScaledObject that scales a deployment based on HTTP request rate:

# keda-prometheus-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: my-deployment
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
      threshold: '100'
      activationThreshold: '5'

The key parameters:

Parameter Purpose Required
serverAddress Prometheus endpoint URL (supports VictoriaMetrics, Thanos) Yes
query PromQL query returning a single scalar/vector element Yes
threshold Target value for 1→N scaling (float) Yes
activationThreshold Target value for 0→1 scaling (float, default 0) No
namespace For namespaced queries (Thanos setups) No
ignoreNullValues Treat empty results as 0 (default true) No
unsafeSsl Skip TLS certificate validation No
timeout Per-trigger HTTP client timeout (overrides global) No
customHeaders Additional HTTP headers for the query No

Activation vs Scaling Thresholds

This is the single most important concept for the Prometheus scaler, and the one most commonly misconfigured.

Activation vs scaling thresholds showing KEDA operator handling 0-to-1 and HPA handling 1-to-N Figure 2: KEDA's two-phase model. The operator handles activation (0↔1) using activationThreshold, then hands off to HPA for scaling (1↔N) using threshold.

KEDA operates in two phases:

  1. Activation phase (0↔1): The KEDA operator decides whether to scale the workload from zero to one replica (or back to zero). This decision uses the activationThreshold.
  2. Scaling phase (1↔N): Once at least one replica is running, HPA takes over and scales between 1 and maxReplicaCount. HPA uses the threshold value as its target metric.

The activationThreshold defaults to 0, which means any metric value greater than 0 activates the deployment. The critical detail: activation occurs when the metric is strictly greater than the activationThreshold, not greater-than-or-equal. With the default of 0, a metric value of exactly 0 keeps the deployment scaled to zero, while 0.001 activates it.

The threshold works differently. KEDA uses the Value external metric type by default, so HPA calculates desired replicas as ceil(currentMetricValue / threshold). If your threshold is 100 and the current metric value is 450, HPA targets 5 replicas.

warning: The activation threshold takes priority over the scaling threshold. If you set threshold: 10 and activationThreshold: 50, a metric value of 40 keeps the deployment at zero replicas, even though HPA would want 4 instances. The scaler is not active, so HPA never gets consulted.

When minReplicaCount is 1 or higher, the scaler is always active and activationThreshold is irrelevant. The two-threshold model only matters when you're using scale-to-zero.

A practical example: you scale on sum(rate(http_requests_total{...}[2m])). Set activationThreshold: 5 to ignore background noise (health checks, monitoring scrapers). Set threshold: 100 to target one pod per 100 requests per second. The deployment stays at zero during quiet hours, wakes up when real traffic arrives, and HPA handles the rest.

Authentication and TLS

Production Prometheus instances are rarely open. KEDA supports four authentication modes through TriggerAuthentication, and they can be combined (e.g., authModes: "tls,basic").

Bearer token

# keda-prometheus-bearer-auth.yaml
apiVersion: v1
kind: Secret
metadata:
  name: keda-prom-secret
stringData:
  bearerToken: "your-prometheus-bearer-token"
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-prom-creds
spec:
  secretTargetRef:
    - parameter: bearerToken
      name: keda-prom-secret
      key: bearerToken
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaledobject
spec:
  scaleTargetRef:
    name: my-deployment
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server:9090
        query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
        threshold: '100'
        authModes: "bearer"
      authenticationRef:
        name: keda-prom-creds

TLS with client certificates

For mutual TLS, set authModes: "tls" and provide cert, key, and ca through the TriggerAuthentication. This is common in enterprise environments with internal CAs.

You can also set a CA certificate without any other auth mode. If your Prometheus endpoint uses a self-signed certificate or an enterprise CA, provide the ca parameter and KEDA will trust it regardless of the selected authModes.

Custom auth headers

For Prometheus behind a reverse proxy or API gateway that expects a custom header:

# keda-prometheus-custom-auth.yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-prom-custom
spec:
  secretTargetRef:
    - parameter: customAuthHeader
      name: keda-prom-secret
      key: customAuthHeader    # e.g., "X-Auth-Token"
    - parameter: customAuthValue
      name: keda-prom-secret
      key: customAuthValue     # the token value

One TriggerAuthentication can be referenced by multiple ScaledObjects, so a team with 20 workloads scaling from the same Prometheus instance defines credentials once.

Cloud-Managed Prometheus Services

All three major cloud providers offer managed Prometheus, and KEDA integrates with each using cloud-native identity mechanisms. No static credentials needed.

Amazon Managed Prometheus (AMP)

# keda-amp-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-aws
spec:
  podIdentity:
    provider: aws
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: amp-scaledobject
spec:
  scaleTargetRef:
    name: my-deployment
  triggers:
    - type: prometheus
      authenticationRef:
        name: keda-trigger-auth-aws
      metadata:
        serverAddress: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-abc123"
        query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
        threshold: '100'
        awsRegion: us-east-1
        identityOwner: operator

The identityOwner: operator tells KEDA that the operator pod (not the workload pod) assumes the IAM role. Under the hood, the scaler uses AWS SigV4 request signing via NewSigV4RoundTripper to authenticate every query to AMP.

Azure Monitor Managed Prometheus

# keda-azure-prometheus-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-managed-prometheus-auth
spec:
  podIdentity:
    provider: azure-workload
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: azure-prometheus-scaledobject
spec:
  scaleTargetRef:
    name: my-deployment
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://my-workspace.eastus.prometheus.monitor.azure.com
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
      threshold: '100'
    authenticationRef:
      name: azure-managed-prometheus-auth

The identity needs the Monitoring Data Reader role on the Azure Monitor Workspace. No other authModes can be combined with Azure Workload Identity.

Google Managed Prometheus

# keda-gcp-prometheus-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ClusterTriggerAuthentication
metadata:
  name: google-workload-identity-auth
spec:
  podIdentity:
    provider: gcp
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gcp-prometheus-scaledobject
spec:
  scaleTargetRef:
    name: my-deployment
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://monitoring.googleapis.com/v1/projects/my-project/location/global/prometheus
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
      threshold: '100'
    authenticationRef:
      kind: ClusterTriggerAuthentication
      name: google-workload-identity-auth

GCP uses ClusterTriggerAuthentication for cluster-wide scope. The Google Service Account needs the roles/monitoring.viewer role. The source code resolves GCP credentials through GetGCPOAuth2HTTPTransport with the monitoring.read OAuth scope.

Production Patterns and PromQL Design

Fallback replicas

When Prometheus is unreachable, you don't want your workload to scale to zero because the scaler can't get metrics. The fallback mechanism provides a safety net:

# keda-prometheus-fallback.yaml
spec:
  fallback:
    failureThreshold: 3
    replicas: 6

After 3 consecutive failures to query Prometheus, KEDA scales the deployment to 6 replicas. This keeps the workload running at a known-safe capacity until Prometheus recovers.

ignoreNullValues

The default (true) treats empty Prometheus responses as a metric value of 0. This is safe for most cases: if nobody is sending requests, the rate is genuinely 0.

Set it to false when a missing metric means the target is down, not idle. If your deployment is supposed to always emit metrics and Prometheus suddenly returns nothing, that's a monitoring failure, not a scaling signal. With ignoreNullValues: false, the scaler returns an error, which prevents scaling decisions based on phantom zeros.

PromQL query design

Your query must return a single value. This means aggregation is almost always required:

# Good: aggregated to a single value
sum(rate(http_requests_total{deployment="my-deployment"}[2m]))

# Bad: returns one value per pod (multiple results → scaler error)
rate(http_requests_total{deployment="my-deployment"}[2m])

The rate() window matters. A 30-second window is noisy and can cause rapid scaling oscillation. A 10-minute window is sluggish and misses traffic bursts. Two minutes ([2m]) is a reasonable default for most HTTP workloads. Match it to your application's traffic pattern.

RED metrics as scaling signals

The RED method (Rate, Errors, Duration) maps naturally to scaling triggers:

# Rate: scale on throughput (requests per second)
sum(rate(http_requests_total{deployment="my-deployment"}[2m]))

# Errors: scale on error rate (5xx responses per second)
sum(rate(http_requests_total{deployment="my-deployment",status=~"5.."}[2m]))

# Duration: scale on p95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{deployment="my-deployment"}[2m])) by (le)
)

Scaling on latency is the most nuanced. A rising p95 often means your pods are saturated, so adding replicas helps. But latency spikes caused by downstream dependencies (a slow database, an external API) won't improve with more replicas. Use latency-based scaling when you've verified the bottleneck is in the scaled workload itself.

Scaling modifiers

When scaling on multiple signals, KEDA's scalingModifiers feature lets you compose trigger values with a formula:

# keda-scaling-modifiers-example.yaml
advanced:
  scalingModifiers:
    formula: "error_rate < 0.05 ? request_rate : request_rate * 2"
    target: "100"
triggers:
  - type: prometheus
    name: request_rate
    metadata:
      serverAddress: http://prometheus-server:9090
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
      threshold: '100'
  - type: prometheus
    name: error_rate
    metadata:
      serverAddress: http://prometheus-server:9090
      query: |
        sum(rate(http_requests_total{deployment="my-deployment",status=~"5.."}[2m]))
        / sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
      threshold: '0.05'

This scales on request rate normally, but doubles the scaling signal when the error rate exceeds 5%. The formula language supports arithmetic, ternary operators, and aggregation functions across named triggers.

Metric caching

KEDA's metrics server normally queries the scaler on every HPA poll (roughly every 15 seconds). Enable useCachedMetrics: true in the trigger spec to cache the value between polling intervals, reducing load on your Prometheus instance:

# keda-cached-metrics.yaml
triggers:
  - type: prometheus
    useCachedMetrics: true
    metadata:
      serverAddress: http://prometheus-server:9090
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
      threshold: '100'

This is especially valuable when your Prometheus is resource-constrained or when the query is expensive (high-cardinality labels, long range vectors).

Gotchas

Your query must return exactly one value. A rate() without sum() returns one value per pod. KEDA errors on multiple results. Always aggregate with sum(), avg(), or another function that collapses to a single element.

Inf values are treated as null. If Prometheus returns +Inf (which can happen with rate() over counter resets or division by zero), the scaler returns 0 when ignoreNullValues is true, or errors when it's false. If your scaling decisions seem stuck at zero, check whether your PromQL can produce infinity edge cases.

Pod identity is exclusive. When using podIdentity (AWS, Azure, or GCP), you cannot combine it with any authModes. The source code enforces this in checkAuthConfigWithPodIdentity. If you need TLS on top of pod identity, the TLS termination has to happen at the network layer (a service mesh or load balancer), not in KEDA's HTTP client.

The timeout parameter accepts human-readable durations. Not just milliseconds. Values like "30s" or "2m" work and override the global KEDA_HTTP_DEFAULT_TIMEOUT. Set this when your Prometheus instance is slow to respond under load, rather than letting the scaler fail with a timeout error.

activationThreshold is strictly greater-than, not greater-than-or-equal. With the default of 0, a metric value of exactly 0 does not activate the scaler. A value of 0.001 does. If you're debugging why your workload isn't waking up, check whether the metric is returning exactly your activation value.

Wrap-up

KEDA's Prometheus scaler replaces Prometheus Adapter's shared ConfigMap with per-workload ScaledObjects, adds scale-to-zero, and integrates natively with managed cloud Prometheus. The two numbers to get right: activationThreshold for when your workload wakes up, threshold for how many replicas HPA targets once it's running.

For workloads that need to scale on HTTP traffic specifically, KEDA also offers a dedicated HTTP Add-on with built-in traffic interception and scale-to-zero. That's a different approach with its own trade-offs, covered in the next article in this collection.

KubeDojo
KubeDojo

Mastering the Kubernetes ecosystem — depth-first, no hype.

Subscribe to KubeDojo

Get the latest articles delivered to your inbox.

Related Articles