deployments rolling-updates kubernetes cka rollbacks

Deployments: Rolling Updates and Rollbacks

by Alexis Kinsella·May 6, 2026·18 min read

Three deployments. kubectl get deployments -A shows them stuck at 2/5 READY. UP-TO-DATE doesn't match AVAILABLE. You're on-call and need to understand fast: is this a rollout in progress, a stalled rollout, or a degraded deployment?

This is exactly what the CKA Workloads & Scheduling domain tests. Not "what field controls rolling update behavior," but "why is this rollout stuck, what does the deployment status tell you, and how do you fix it?" That spans both Workloads (15%) and Troubleshooting (30%) simultaneously. The same diagnostic process you use here applies to half the troubleshooting scenarios on the exam.

This article covers how rolling updates work under the hood, how to read deployment status conditions precisely, the full kubectl rollout toolkit, and the five most common reasons rollouts stall.

The Rolling Update Mechanism

A rolling update is a choreography between two ReplicaSets. When you trigger a rollout, the Deployment controller creates a new ReplicaSet and begins scaling it up while scaling down the old one. The rate of that exchange is controlled by two fields: maxSurge and maxUnavailable.

maxSurge is how many pods above the desired count can exist during the rollout. maxUnavailable is how many pods may be unavailable, relative to the desired count, at any moment. Both default to 25%, but they round differently: maxSurge rounds up, maxUnavailable rounds down. For a 4-replica deployment, that means at most 5 pods running at peak (ceil(0.25 × 4) = 1 surge) and at most 1 unavailable (floor(0.25 × 4) = 1). The rounding difference matters more at low replica counts. For a 3-replica deployment, floor(0.25 × 3) = 0, so the rollout cannot terminate any old pod until a replacement is Ready. Combined with ceil(0.25 × 3) = 1 surge, it proceeds by creating one new pod first, then swapping old pods for new ones one at a time. With maxUnavailable: 0, surge capacity is required to make progress.
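The rounding is easy to check by hand. Here is a minimal sketch in shell integer arithmetic, assuming percentage values only. The function names resolve_surge and resolve_unavailable are illustrative and not part of kubectl:

```shell
# Resolve a percentage-based maxSurge / maxUnavailable to a pod count.
# Integer arithmetic only: surge rounds up, unavailable rounds down.
resolve_surge() {        # usage: resolve_surge <replicas> <percent>
  echo $(( ($1 * $2 + 99) / 100 ))
}

resolve_unavailable() {  # usage: resolve_unavailable <replicas> <percent>
  echo $(( ($1 * $2) / 100 ))
}

# Print the resolved values for a few replica counts at the 25% default.
for replicas in 3 4 10; do
  echo "replicas=$replicas surge=$(resolve_surge "$replicas" 25) unavailable=$(resolve_unavailable "$replicas" 25)"
done
```

For 3, 4, and 10 replicas this prints surge/unavailable pairs of 1/0, 1/1, and 3/2, matching the worked numbers above.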

Not every change to a Deployment triggers a rollout. Only changes to .spec.template do. Scaling replicas, adding labels to the Deployment metadata, changing annotations on the Deployment object: none of those create a new ReplicaSet. This matters when you're wondering why a metadata change didn't cycle your pods.
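One way to confirm this on a live cluster is to count ReplicaSets before and after a change. The sketch below assumes the Deployment's ReplicaSets carry a label you can select on. The helper name rs_count and the app=web selector in the usage note are hypothetical:

```shell
# Count the ReplicaSets that match a label selector in a namespace.
rs_count() {  # usage: rs_count <namespace> <label-selector>
  kubectl get replicasets -n "$1" -l "$2" -o name | wc -l | tr -d ' '
}
```

Run rs_count production app=web, scale the Deployment, and run it again: the count should be unchanged. After a pod template change such as kubectl set image, the count should grow by one, until revisionHistoryLimit starts pruning old ReplicaSets.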

What production deployments actually configure

CoreDNS is a good example of an explicit, conservative strategy. As cluster DNS, it cannot afford unnecessary unavailability during a rollout:

# coredns/deployment — kubernetes/coredns.yaml.sed (lines 84-91)
spec:
  # replicas: not specified here:
  # 1. Default is 1.
  # 2. Will be tuned in real time if DNS horizontal auto-scaling is turned on.
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1

maxUnavailable: 1 (absolute, not percentage) means at most one CoreDNS pod can be down during a rollout. maxSurge is unset, so it defaults to 25% rounded up. The replicas field is intentionally omitted and managed by the DNS autoscaler. CoreDNS also runs with priorityClassName: system-cluster-critical and podAntiAffinity using requiredDuringSchedulingIgnoredDuringExecution on kubernetes.io/hostname. Because that rule forces DNS pods onto distinct nodes, a surge pod can only schedule on a node that isn't already running CoreDNS; allowing one pod to be unavailable lets the rollout make progress even when no such node is free.

Compare that to prometheus-operator, a leader-elected controller that runs as a single replica:

# prometheus-operator/kube-prometheus — manifests/prometheusOperator-deployment.yaml (lines 11-12)
spec:
  replicas: 1
  # no strategy field — Kubernetes defaults (25%/25%) apply

No strategy field. The default 25%/25% works fine for a single-replica controller. With one replica the defaults resolve to maxSurge 1 and maxUnavailable 0, so during a rollout the new pod starts and becomes Ready before the old one terminates. The two pods overlap briefly, but leader election ensures only one is active at a time. The manifest's real operational investment is elsewhere: explicit resource limits and requests (cpu: 200m/100m, memory: 200Mi/100Mi) and full security hardening (allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, capabilities: drop: ALL). Rolling strategy is secondary when availability is already limited to one.

cert-manager's values.yaml documents the pattern explicitly for the single-replica + leader-election case:

# cert-manager/cert-manager — deploy/charts/cert-manager/values.yaml (lines 122-131)
# Deployment update strategy for the cert-manager controller deployment.
# For more information, see the [Kubernetes documentation](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy).
#
# For example:
#  strategy:
#    type: RollingUpdate
#    rollingUpdate:
#      maxSurge: 0
#      maxUnavailable: 1
strategy: {}

maxSurge: 0 with maxUnavailable: 1 is the no-surge pattern: the old pod terminates before the new one starts. No extra capacity consumed, no surge-related scheduling pressure. cert-manager ships with empty strategy: {} (Kubernetes defaults) but documents this explicitly for resource-constrained environments.
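If you want to apply a strategy like this imperatively rather than through Helm values, a merge patch works. The sketch below only builds the patch body; the helper names json_value and strategy_patch are illustrative, and the deployment name and namespace in the usage note are placeholders:

```shell
# Emit a JSON value: percentages must be quoted strings, integers stay bare.
json_value() {
  case $1 in
    *%) printf '"%s"' "$1" ;;
    *)  printf '%s' "$1" ;;
  esac
}

# Build a merge patch that sets the RollingUpdate strategy.
strategy_patch() {  # usage: strategy_patch <maxSurge> <maxUnavailable>
  printf '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":%s,"maxUnavailable":%s}}}}\n' \
    "$(json_value "$1")" "$(json_value "$2")"
}
```

For the no-surge pattern, pass the output to kubectl patch deployment/<name> -n <namespace> --type=merge -p "$(strategy_patch 0 1)".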

revisionHistoryLimit and its cost

Every rollout creates a new ReplicaSet. By default, Kubernetes retains 10 old ReplicaSets in etcd. cert-manager exposes revisionHistoryLimit as a top-level Helm value applied via the controller deployment template:

# cert-manager/cert-manager — deploy/charts/cert-manager/templates/deployment.yaml (lines 19-21)
# ... (line 18: template comment on if-statement semantics, omitted for brevity)
{{- if not (has (quote .Values.global.revisionHistoryLimit) (list "" (quote ""))) }}
revisionHistoryLimit: {{ .Values.global.revisionHistoryLimit }}
{{- end }}

The commented-out default in values.yaml is revisionHistoryLimit: 1. For high-churn controllers, trimming history reduces etcd noise. Setting it to 0 eliminates rollback entirely. That's a permanent trade-off: once the old ReplicaSets are gone, kubectl rollout undo has nothing to restore.

Reading Deployment Status

kubectl get deployments shows three status columns that are not synonyms:

READY (ready/desired): pods that have passed readiness probes
UP-TO-DATE: pods running the new pod template
AVAILABLE: pods that have been ready for at least minReadySeconds (default 0, so effectively equal to READY unless you've set it)

A healthy rolling deployment in progress shows UP-TO-DATE climbing while READY temporarily dips. When UP-TO-DATE and AVAILABLE both equal the desired count and READY shows desired/desired, the rollout is complete.

The summary columns tell you what's happening. The .status.conditions field tells you why:

$ kubectl get deployment argocd-server -n argocd \
    -o jsonpath='{.status.conditions}' | jq .
[
  {
    "lastTransitionTime": "2026-03-01T12:00:00Z",
    "message": "Deployment has minimum availability.",
    "reason": "MinimumReplicasAvailable",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2026-03-01T12:00:00Z",
    "message": "ReplicaSet \"argocd-server-6d9f7b8c9\" has successfully progressed.",
    "reason": "NewReplicaSetAvailable",
    "status": "True",
    "type": "Progressing"
  }
]

Three condition types matter:

Condition	Status	Meaning
`Available`	True	Minimum replicas are available
`Progressing`	True	Rollout underway or completed successfully
`Progressing`	False	Stalled past `progressDeadlineSeconds` (reason: `ProgressDeadlineExceeded`)
`ReplicaFailure`	True	ReplicaSet can't create pods (for example quota exceeded or admission rejection)

Figure 2: Deployment Status: Four States

progressDeadlineSeconds defaults to 600. When a rollout hasn't made progress for that window, Kubernetes sets Progressing=False. This is a diagnostic signal, not a control action.

warning: progressDeadlineSeconds does not trigger an automatic rollback. Kubernetes marks the rollout failed and keeps retrying. You must run kubectl rollout undo yourself, or fix the underlying cause and wait for the rollout to complete.

kubectl describe deployment surfaces this through Events and Conditions in human-readable form. On the CKA, describe is often faster than inspecting raw JSON conditions.
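When you do want the raw condition, a jsonpath filter pulls out just the Progressing reason. The sketch below maps that reason to a one-word rollout state. The function name rollout_state is illustrative, and the reason strings are the ones the Deployment controller sets to the best of my knowledge. Treat the mapping as a starting point, not an exhaustive list:

```shell
# Map the Progressing condition's reason to a one-word rollout state.
rollout_state() {  # usage: rollout_state <deployment> <namespace>
  reason=$(kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}')
  case $reason in
    NewReplicaSetAvailable)   echo "complete" ;;
    ProgressDeadlineExceeded) echo "stalled" ;;
    DeploymentPaused)         echo "paused" ;;
    NewReplicaSetCreated|FoundNewReplicaSet|ReplicaSetUpdated)
                              echo "in-progress" ;;
    *)                        echo "unknown: ${reason:-no Progressing condition}" ;;
  esac
}
```

Note that "complete" describes the rollout only. A Deployment whose pods crashed after a successful rollout still reports NewReplicaSetAvailable, so check the Available condition separately to spot a degraded deployment.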

The kubectl rollout Toolkit

Status and CI/CD gating

kubectl rollout status watches the rollout and exits 0 on success, non-zero on failure. That exit code makes it useful as a pipeline gate:

$ kubectl set image deployment/argocd-server \
    argocd-server=quay.io/argoproj/argocd:v2.12.0 -n argocd
$ kubectl rollout status deployment/argocd-server -n argocd --timeout=5m
Waiting for deployment "argocd-server" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "argocd-server" rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for deployment "argocd-server" rollout to finish: 1 old replicas are pending termination...
deployment "argocd-server" successfully rolled out

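Without --timeout, it waits until the rollout completes or the Deployment exceeds progressDeadlineSeconds, at which point it exits non-zero. In CI, always set an explicit timeout so the wait is bounded by your pipeline's budget rather than by the Deployment's settings.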
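Because Kubernetes never rolls back on its own, a pipeline that wants that behavior has to do it explicitly. Here is a minimal sketch of such a gate. The function name gate_rollout and the 5m default are assumptions, and whether an automatic undo is appropriate depends on your release process:

```shell
# Wait for a rollout; on failure or timeout, roll back and report failure.
gate_rollout() {  # usage: gate_rollout <deployment> <namespace> [timeout]
  if kubectl rollout status "deployment/$1" -n "$2" --timeout="${3:-5m}"; then
    echo "rollout ok"
  else
    echo "rollout failed, rolling back" >&2
    kubectl rollout undo "deployment/$1" -n "$2"
    return 1
  fi
}
```

Called as gate_rollout argocd-server argocd 5m, it returns 0 only when the rollout finished inside the timeout, so the pipeline step still fails after a rollback.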

History and revision tracking

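Revision history lives in the Deployment's old ReplicaSets, not in the Deployment itself: each ReplicaSet carries a deployment.kubernetes.io/revision annotation. The CHANGE-CAUSE column in rollout history comes from the kubernetes.io/change-cause annotation, which you set on the Deployment and which the controller copies to the ReplicaSet for the current revision: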

$ kubectl rollout history deployment/argocd-server -n argocd
REVISION  CHANGE-CAUSE
1         initial deploy
2         upgrade to v2.12.0
3         <none>

The --record flag that used to set this annotation has been deprecated since Kubernetes 1.22, so don't rely on it. Set the annotation manually after updating the image:

$ kubectl annotate deployment/argocd-server \
    kubernetes.io/change-cause="upgrade to v2.12.0" -n argocd

Inspect what a specific revision contains:

$ kubectl rollout history deployment/argocd-server -n argocd --revision=2
deployment.apps/argocd-server with revision #2
Pod Template:
  Labels: app.kubernetes.io/name=argocd-server
          pod-template-hash=6d9f7b8c9
  Annotations: kubernetes.io/change-cause=upgrade to v2.12.0
  Containers:
   argocd-server:
    Image: quay.io/argoproj/argocd:v2.12.0

Rollback

kubectl rollout undo scales the previous ReplicaSet back up and scales the current one down. It's the rolling update mechanism in reverse:

# Roll back to the previous revision
$ kubectl rollout undo deployment/argocd-server -n argocd

# Roll back to a specific revision
$ kubectl rollout undo deployment/argocd-server -n argocd --to-revision=1

After a rollback, the restored revision gets a new revision number. With the history above, rolling back to revision 1 turns it into revision 4, and revision 1 disappears from the list. Revision numbers are not reused.
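You can watch the renumbering happen by reading the revision annotation that the controller keeps on the Deployment. A small sketch follows; the helper name current_revision is illustrative:

```shell
# Print the Deployment's current revision number.
# Dots inside the annotation key are escaped for jsonpath.
current_revision() {  # usage: current_revision <deployment> <namespace>
  kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}'
}
```

Run it before and after kubectl rollout undo: the number goes up even though the pod template is an older one.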

Pause and resume

Pausing a Deployment stops pod template changes from triggering rollouts. This lets you make multiple changes without an intermediate rollout for each:

$ kubectl rollout pause deployment/argocd-server -n argocd
$ kubectl set image deployment/argocd-server \
    argocd-server=quay.io/argoproj/argocd:v2.12.1 -n argocd
$ kubectl set resources deployment/argocd-server \
    -c argocd-server --limits=cpu=500m,memory=512Mi -n argocd
# One rollout with all changes
$ kubectl rollout resume deployment/argocd-server -n argocd

While a Deployment is paused, its Progressing condition reports reason DeploymentPaused, and the progress deadline is not evaluated until you resume. If you pause during an active rollout, that rollout also pauses mid-flight.

Restart without spec change

kubectl rollout restart cycles all pods without changing the image or container configuration. Useful for picking up rotated secrets, refreshing in-memory state, or forcing a recycle after node-level changes:

$ kubectl rollout restart deployment/argocd-server -n argocd

Under the hood, it adds a kubectl.kubernetes.io/restartedAt annotation to .spec.template.metadata, which counts as a template change and triggers a normal rolling update.
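A rough manual equivalent makes the mechanism concrete. The sketch below builds a merge patch that sets the same annotation. The helper name restart_patch is illustrative, and kubectl's own implementation may format the timestamp differently:

```shell
# Build a merge patch that stamps the pod template with a restart time.
# Any change to the annotation value counts as a template change.
restart_patch() {  # usage: restart_patch [timestamp]
  ts=${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  printf '{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"%s"}}}}}\n' "$ts"
}
```

Applying it with kubectl patch deployment/argocd-server -n argocd --type=merge -p "$(restart_patch)" should trigger the same kind of rolling update, which is why a restart also shows up as a new revision in rollout history.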

Diagnosing Stuck Rollouts

When a rollout stalls, pod state is the first place to look. kubectl get pods -n <ns> shows you the new pods; their status tells you almost everything.

1. Image pull failure. New pods stuck in ImagePullBackOff or ErrImagePull. The event from kubectl describe pod reads: Failed to pull image "quay.io/argoproj/argocd:bad-tag": ... not found. The old ReplicaSet stays scaled at its current count; the new RS can't reach ready. Fix the image reference and the rollout continues automatically.

2. Insufficient resources. New pods stuck in Pending with event 0/3 nodes are available: 3 Insufficient memory. The scheduler can't place them. Check node capacity against pod resource requests; the new pod template may have higher requests than the old one. kubectl describe nodes | grep -A5 "Allocated resources" shows what's committed per node.

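3. Resource quota exceeded. The new pods are never created, so there is nothing Pending to inspect: the new ReplicaSet sits below its desired count with FailedCreate events, and kubectl describe deployment shows condition ReplicaFailure=True with message exceeded quota: requests.cpu. The namespace quota doesn't have headroom for surge pods. Either raise the quota or set maxSurge: 0 (with maxUnavailable of at least 1) to avoid creating additional pods during the rollout.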

4. Readiness probe failing. New pods start and run, but never become Ready: kubectl get pods shows 0/1 READY for new pods. kubectl describe pod shows the readiness probe failing with its specific error. From the Deployment's perspective, this is silent until progressDeadlineSeconds fires: UP-TO-DATE climbs but AVAILABLE doesn't. This is the hardest failure to spot from the Deployment view alone.

5. Taint/toleration mismatch. New pods in Pending with event node(s) had untolerated taint {node.kubernetes.io/unreachable: NoExecute}. The pod template doesn't tolerate a taint that now exists on all schedulable nodes. Check whether taint conditions on nodes changed since the last successful rollout.

The diagnostic sequence: kubectl get pods to identify state, kubectl describe pod <failing-pod> to read events, kubectl describe deployment to read conditions. That order works for every failure mode.
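The first step of that sequence can be sketched as a small triage helper that maps the STATUS and READY columns of kubectl get pods to the likely failure mode. The function names classify_pod and triage_pods are illustrative, and the mapping is a heuristic: a Running pod that is not Ready may simply still be starting.

```shell
# Map a pod's STATUS and READY columns to the likely failure mode.
classify_pod() {  # usage: classify_pod <STATUS> <READY, e.g. 0/1>
  case $1 in
    ImagePullBackOff|ErrImagePull)
      echo "image pull failure" ;;
    Pending)
      echo "unschedulable (resources or taints)" ;;
    Running)
      # READY is "<ready>/<total>"; compare the two halves.
      if [ "${2%/*}" -lt "${2#*/}" ]; then
        echo "readiness probe failing"
      else
        echo "healthy"
      fi ;;
    *)
      echo "other: $1" ;;
  esac
}

# Print one line per pod in the namespace with its likely failure mode.
triage_pods() {  # usage: triage_pods <namespace>
  kubectl get pods -n "$1" --no-headers |
    while read -r name ready status _rest; do
      echo "$name: $(classify_pod "$status" "$ready")"
    done
}
```

A quota failure produces no new pods at all, so it will not show up here. If triage_pods lists only healthy old pods while the rollout is stuck, go straight to kubectl describe deployment.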

Gotchas

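The single-replica deadlock. With maxUnavailable: 25% (default) and a 1-replica deployment, 25% of 1 rounds down to 0, so the rollout cannot terminate the old pod first; it depends entirely on the surge pod (ceil(0.25 × 1) = 1) becoming Ready. If that pod can't be scheduled or never passes readiness, the rollout stalls with the old pod still serving. Setting both maxSurge: 0 and maxUnavailable: 0 explicitly is rejected by API validation, and when percentages make both resolve to 0 the controller treats maxUnavailable as 1 so the rollout can still move. For single-replica deployments, choose deliberately: maxUnavailable: 1 with maxSurge: 0 (termination before replacement, brief downtime) or maxSurge: 1 with maxUnavailable: 0 (replacement before termination, needs spare capacity).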

Scaling during a rollout doesn't reset it. If you kubectl scale deployment/foo --replicas=10 while a rollout is in progress, Kubernetes spreads the new replicas proportionally across old and new ReplicaSets. The rollout continues at the new scale, not from scratch.

revisionHistoryLimit: 0 permanently eliminates rollback. Once old ReplicaSets are deleted, there's no recovery from a bad deploy short of redeploying. For production deployments, keep at least 2-3 revisions.

Only .spec.template changes trigger rollouts. kubectl annotate deployment/foo kubernetes.io/change-cause="..." does not trigger a rollout. It modifies the Deployment metadata, not the pod template. Annotate after updating the image, not as a separate operation expecting pod cycling.

progressDeadlineSeconds fires, nothing happens automatically. Already covered above, but worth repeating as a separate mental model: a failed progress condition is information only. You own the remediation.

Practice Scenarios

These cover the CKA exam patterns for deployments and rollouts:

Scenario 1: Update image and document the change

kubectl set image deployment/web-server nginx=nginx:1.25 -n production
kubectl annotate deployment/web-server \
  kubernetes.io/change-cause="upgrade nginx to 1.25" -n production
kubectl rollout status deployment/web-server -n production

Scenario 2: Roll back to a specific revision

kubectl rollout history deployment/web-server -n production
kubectl rollout undo deployment/web-server --to-revision=2 -n production
kubectl rollout status deployment/web-server -n production

Scenario 3: Debug a stalled rollout

kubectl get deployments -n production
kubectl describe deployment/web-server -n production  # read Conditions and Events
kubectl get pods -n production                         # identify pod states
kubectl describe pod <failing-pod> -n production       # read pod events

Scenario 4: Configure strategy for a single-replica deployment to avoid downtime

kubectl patch deployment/web-server -n production --type=merge -p \
  '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":1,"maxUnavailable":0}}}}'

maxUnavailable: 0 with maxSurge: 1 ensures the single replica is never terminated before its replacement is ready. The new pod starts, passes readiness, then the old one terminates.

Wrap-up

Rolling updates are the intersection of Workloads and Troubleshooting on the CKA. The mechanism is simple; the operational skill is reading status conditions accurately and knowing which kubectl commands surface the right information quickly.

When a rollout stalls: start with pod state, then read the pod events, then the deployment conditions. That sequence works for every failure mode listed above.

Next up: ConfigMaps and Secrets, which cover injecting runtime configuration into the same pods you just deployed, without triggering unnecessary rollouts.
