Init container cascade when every kubectl patch reverts in 10 seconds

The Slack ping came in at 2:14 am. Two replicas of the fanout service were stuck in Init:1/3 and the deploy queue behind them had grown to seven changes. The on-call engineer had already tried the obvious move, kubectl edit deployment, and the changes had reverted within ten seconds. By the time we joined the bridge, they had patched the same field four times in twenty minutes and were starting to wonder if etcd was corrupted. The shape of the failure was wrong, though. Init containers do not normally cascade across three different upstream dependencies at once; either the failures shared a common upstream cause, or the spec was being rewritten under us.

Kubernetes recovery | 11 min read
Problem signal
  • Pods stuck in Init:0/3 or Init:1/3 with no forward progress and no clear log story
  • kubectl edit deployment changes revert within ten to fifteen seconds, every time
  • Three init containers each failing in a different protocol layer (TCP dial timeout, NXDOMAIN, AMQP ACCESS_REFUSED)
  • A topology or schema ConfigMap claims state that the live broker or database disagrees with
  • No activeDeadlineSeconds on the Pod spec and no timeout on the init container wait loops, so a transient failure wedges the Pod indefinitely
The 2 am page

Two replicas wedged, seven changes queued, four failed patches

When we joined the bridge, the on-call engineer had already burned forty minutes on what looked like a config drift bug. The fanout service in the platform namespace had two replicas, both stuck in Init:1/3. The init container chain had three steps (wait-for-redis, wait-for-mongodb, wait-for-rabbitmq) and the redis step was failing on a hardcoded IPv4 address that did not match the live Service. They patched the env var on the Deployment. The init container restarted. Ten seconds later the IP was back. They patched it again. Same thing.

Their working hypothesis was etcd corruption or a faulty kube-apiserver caching layer. We have seen both before, but neither matches the symptom shape here. Etcd corruption surfaces as 5xx responses to kubectl, not as silent successful PATCHes that revert. We needed to find what was doing the reverting before we wasted any more time on the symptoms.

What we thought it was first

Two wrong guesses before the real culprit became visible

The first guess was a GitOps controller with self-heal enabled. ArgoCD does this with syncPolicy.automated.selfHeal: true. Flux does this with its Kustomization controller. Both will revert a kubectl patch within seconds if the live spec drifts from the source of truth in git. We checked the cluster for both. No Argo Application referenced the fanout namespace. Flux was not installed at all.
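
Both checks take under a minute. A sketch of what we ran, assuming ArgoCD's CRDs are installed (the Flux command simply errors if the CRD is absent, which is itself the answer):

$ kubectl get applications.argoproj.io -A -o json \
    | jq -r '.items[] | select(.spec.syncPolicy.automated.selfHeal == true) | "\(.metadata.name) -> \(.spec.destination.namespace)"'
# no line mentioning the platform namespace

$ kubectl get kustomizations.kustomize.toolkit.fluxcd.io -A
error: the server doesn't have a resource type "kustomizations"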

The second guess was a mutating admission webhook. A custom webhook that rewrites init container specs at admission time could in theory produce this pattern, except admission webhooks fire on create and update, not on a ten-second timer. We ran kubectl get mutatingwebhookconfigurations and the output was empty. That ruled it out.

The reverting was not coming from inside the cluster. It had to be coming from the node itself. We SSHed to the node where one of the fanout pods was scheduled and went looking. Within two minutes we had it.

$ ssh node-01 'ps -ef | grep admission'
root  1842  ... /usr/bin/supervisord -c /etc/supervisor/conf.d/admission.conf
root  2104  ... /bin/bash /var/lib/apex/admission.sh

$ ssh node-01 'cat /etc/supervisor/conf.d/admission.conf'
[program:admission]
command=/var/lib/apex/admission.sh
autorestart=true
startsecs=5

A supervisord-managed script on the node was the reverter. autorestart=true meant killing it bought us at most a few seconds.
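
If you need an enforcer like this paused while you work, go through supervisord rather than kill, or it comes straight back. A sketch, assuming supervisorctl is on the node's PATH:

$ ssh node-01 'supervisorctl stop admission'
admission: stopped

$ ssh node-01 'mv /etc/supervisor/conf.d/admission.conf /root/ && supervisorctl reread && supervisorctl update'

In our case we left it running and fixed its input instead, which is the safer order of operations.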

What was actually overwriting our patches

The stored ConfigMap was the source of truth, not the live Deployment

The script at /var/lib/apex/admission.sh ran every ten seconds. It read three fields (redis-host, mongodb-host, amqp-uri) from a ConfigMap called fanout-init-config and patched them straight into the init container env vars on the live Deployment. The ConfigMap was the source of truth. The Deployment was a downstream artifact. Patching the Deployment was about as durable as writing in pencil.
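
We did not keep a copy of the script, but its loop was structurally this (a reconstruction, not the original source; container and env indices are illustrative):

#!/bin/bash
# Reconstructed shape of /var/lib/apex/admission.sh -- illustrative, not the original
while true; do
  HOST=$(kubectl get configmap fanout-init-config -n platform -o jsonpath='{.data.redis-host}')
  # Unconditionally stomp the live Deployment with whatever the ConfigMap says
  kubectl patch deployment fanout -n platform --type=json -p "[
    {\"op\": \"replace\",
     \"path\": \"/spec/template/spec/initContainers/0/env/0/value\",
     \"value\": \"$HOST\"}
  ]"
  # ...same pattern for mongodb-host and amqp-uri
  sleep 10
done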

sequenceDiagram
  participant Engineer
  participant Deployment
  participant Admission as node script
  participant ConfigMap as fanout-init-config
  Engineer->>Deployment: kubectl edit (fix redis-host)
  Deployment-->>Engineer: spec updated
  Note over Admission: tick every 10s
  Admission->>ConfigMap: read fields
  ConfigMap-->>Admission: stale values
  Admission->>Deployment: patch init container env
  Deployment-->>Engineer: changes reverted

The reverting loop. Edit the ConfigMap, not the Deployment.
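
Which makes the durable fix one write to the ConfigMap; the enforcer then propagates it on its next tick. The host fixes looked like this (the values are the corrected ones from the diagnosis below; the amqp-uri needed broker-side work first):

$ kubectl patch configmap fanout-init-config -n platform --type=merge \
    -p '{"data":{"redis-host":"redis.platform.svc.cluster.local","mongodb-host":"mongodb.platform.svc.cluster.local"}}'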

This pattern shows up in places where the original GitOps story had gaps and someone wrote a node-side enforcer as a stopgap. Then the team rotated, the wiki page got out of date, and the enforcer kept running. We have seen this exact shape three times in the last year. Twice with supervisord scripts. Once with a systemd timer. The fix is always the same: find the source of truth before patching anything, and if you cannot find it in under fifteen minutes, stop and look on the nodes.

Three init containers, three different protocols

What each failure actually told us, and the fourth fix that did not show in any log

Once we knew to edit the ConfigMap, we still had three concurrent faults to diagnose. Each init container was failing in a different layer of the network stack, and each one had its own diagnostic signature.

The redis init container was dialing 10.43.181.44 on port 6379 and getting i/o timeout after thirty seconds. We compared against the live Service and got back a different ClusterIP.

$ kubectl get svc redis -n platform -o jsonpath='{.spec.clusterIP}'
10.43.218.92

$ kubectl logs fanout-7d4b9c-xx -c wait-for-redis -n platform | tail -3
dial tcp 10.43.181.44:6379: i/o timeout
dial tcp 10.43.181.44:6379: i/o timeout
dial tcp 10.43.181.44:6379: i/o timeout

The hardcoded IP had no relationship to the live Service. ClusterIPs are not stable across Service recreation. Hardcoding one is a time bomb.
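
The value we wrote into the ConfigMap was the Service DNS name, which tracks whatever ClusterIP the Service currently owns. A quick confirmation from inside the cluster, using a throwaway Pod:

$ kubectl run dnscheck -n platform --rm -it --restart=Never \
    --image=busybox:1.36 -- nslookup redis.platform.svc.cluster.local
# resolves to the live ClusterIP (10.43.218.92) no matter how often the Service is recreated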

The mongodb init container was logging 'lookup mongo.platform.svc.cluster.local on 10.43.0.10:53: no such host'. The live Service was named mongodb, not mongo. One character off, NXDOMAIN. We caught it by running kubectl get svc -n platform and reading the actual Service name out loud. The hostname in the ConfigMap had been typed from memory by someone who remembered the team's old naming convention.

The rabbitmq init container was the most interesting of the three. The TCP connection succeeded. The AMQP frame negotiation succeeded. Authentication succeeded. The vhost open returned ACCESS_REFUSED. The URI was amqp://app:app@rabbitmq:5672/fanout-internal. We port-forwarded to the management API and listed valid vhosts.

$ kubectl port-forward -n platform svc/rabbitmq 15672:15672 &
$ curl -s -u app:app http://localhost:15672/api/vhosts | jq -r '.[].name'
/
/platform

# fanout-internal does not exist on this broker

The URI parsed cleanly and authenticated cleanly. The failure was at vhost open, not at auth. Enumerate vhosts before assuming bad credentials.
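
The fix was broker-side. Through the same port-forward, creating the vhost and granting the app user permissions is two management API calls (a sketch; the calls need an account with administrator rights, not necessarily app itself):

$ curl -s -u admin:**** -X PUT http://localhost:15672/api/vhosts/fanout-internal
$ curl -s -u admin:**** -X PUT http://localhost:15672/api/permissions/fanout-internal/app \
    -H 'content-type: application/json' \
    -d '{"configure": ".*", "write": ".*", "read": ".*"}'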

There was a fourth fix that did not show up in any log. Nothing bounded how long an init container could wait: activeDeadlineSeconds was unset on the Pod spec, and the wait loops inside the init containers had no timeout of their own (Kubernetes has no per-container activeDeadlineSeconds). Even after the three protocol bugs were resolved, a transient DNS hiccup or broker restart would have hung an init container indefinitely instead of failing fast and letting the kubelet retry. We wrapped each wait loop in a 120-second timeout and set activeDeadlineSeconds: 600 at the Pod level. Defense in depth, because a per-attempt timeout does not catch the case where the kubelet keeps reconciling a structurally stuck Pod.
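
Concretely, the deadline pair looks like this on the Pod template (a sketch; the redis-cli wait loop stands in for whatever the real init image runs):

spec:
  activeDeadlineSeconds: 600          # Pod-level: fail the Pod outright if init never completes
  initContainers:
  - name: wait-for-redis
    image: redis:7-alpine             # carries both redis-cli and a timeout applet
    command: ["sh", "-c"]
    args:
      - |
        # Per-attempt deadline: 120s, then fail fast and let the kubelet retry
        timeout 120 sh -c \
          'until redis-cli -h redis.platform.svc.cluster.local ping; do sleep 2; done'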

The look-alike ConfigMap we almost broke

A second ConfigMap with the same shape, intentionally broken, was a load-bearing canary

Before we patched fanout-init-config, we almost made one more mistake. There was a second ConfigMap in the same namespace called fanout-init-config-canary. Same shape, same broken-looking IP, same broken-looking AMQP URI. It was labeled role: protected and annotated with purpose: chaos-canary. A drift-detection job in the cluster read it every fifteen minutes to confirm its own detection logic still fired on broken inputs. If we had run a sed-style global replace across all matching ConfigMaps (which is exactly what a tired engineer at 3 am tends to do) we would have silenced the canary and the team would have learned about the next round of real drift only when a customer noticed.

When you patch infrastructure under pressure, target the named resource, not the pattern. Read the labels and annotations of every resource you are about to touch. A surprising number of clusters have load-bearing decoys you do not know about until you break them. We have written more on this in the Kubernetes and CI/CD stabilization pillar.
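
The two habits that would have caught the canary, in command form:

# Read labels before touching anything that matches your pattern
$ kubectl get configmap -n platform \
    -o custom-columns='NAME:.metadata.name,ROLE:.metadata.labels.role'
NAME                        ROLE
fanout-init-config          <none>
fanout-init-config-canary   protected

# If you must act on a pattern, exclude protected resources explicitly
$ kubectl get configmap -n platform -l 'role!=protected' -o name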

What we changed afterwards

Source-of-truth guard, deadline defense, a validation Job, and convergence checks

The fanout service was the visible failure, but the recovery exposed five underlying gaps in the team's release flow. We left four durable changes in place before disconnecting from the bridge.

The fanout-init-config ConfigMap is now committed in git and synced via a real GitOps controller, and the node-side admission script was rewritten to refuse to overwrite a Deployment if the ConfigMap's content hash does not match a known-good baseline annotation. The script can still enforce, but it cannot enforce a broken state.
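
The guard itself is a few lines of the rewritten script. A sketch, with a hypothetical annotation key (apex.io/approved-hash) standing in for the real baseline marker:

# Refuse to enforce if the ConfigMap's content hash drifted from the approved baseline
CURRENT=$(kubectl get configmap fanout-init-config -n platform -o json \
    | jq -cS '.data' | sha256sum | cut -d' ' -f1)
BASELINE=$(kubectl get configmap fanout-init-config -n platform \
    -o jsonpath='{.metadata.annotations.apex\.io/approved-hash}')
if [ "$CURRENT" != "$BASELINE" ]; then
  echo "hash $CURRENT does not match approved $BASELINE; refusing to enforce" >&2
  exit 1
fi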

Every Deployment in the platform namespace now bounds init time at two levels: a 120-second timeout wrapped around each init container's wait loop, and activeDeadlineSeconds: 600 on the Pod spec. The pair matters. The per-container timeout fails fast on an individual attempt; the Pod-level deadline stops the kubelet from looping retries on a Pod that is structurally wrong.

A pre-deployment validation Job runs as part of the release flow. It carries the label validation: predeploy, restartPolicy: OnFailure, activeDeadlineSeconds: 120, and a validator that does four real checks: the redis, mongodb, and rabbitmq Services each have non-empty Endpoints, and the broker reports every binding the topology ConfigMap claims to have declared. Topology drift was the other half of this incident; the binding count had silently dropped from five to three after a partial migration three weeks earlier, and nobody had noticed because the topology-version annotation still said 5.

# Snippet from the topology-reconcile Job that fixed the broker drift
apiVersion: batch/v1
kind: Job
metadata:
  name: topology-reconcile-2026-05-15
  labels:
    validation: predeploy
spec:
  activeDeadlineSeconds: 120
  template:
    spec:
      restartPolicy: OnFailure
      volumes:
      - name: topology
        configMap:
          name: fanout-topology            # assumed name for the topology ConfigMap
      containers:
      - name: reconcile
        image: rabbitmq:3.13-management    # stock image lacks yq/jq; bake them in or swap the image
        volumeMounts:
        - name: topology
          mountPath: /config
        env:                               # broker credentials from a Secret (assumed name)
        - name: USER
          valueFrom: {secretKeyRef: {name: rabbitmq-admin, key: username}}
        - name: PASS
          valueFrom: {secretKeyRef: {name: rabbitmq-admin, key: password}}
        command: ["/bin/bash", "-c"]
        args:
          - |
            set -euo pipefail
            EXPECTED=$(yq '.bindings | length' /config/topology.yaml)
            # -I=0 emits one compact JSON object per line, so the read loop is split-safe
            while read -r b; do
              EX=$(echo "$b" | jq -r '.exchange')
              QU=$(echo "$b" | jq -r '.queue')
              RK=$(echo "$b" | jq -r '.["routing-key"]')
              rabbitmqadmin -H rabbitmq -u "$USER" -p "$PASS" declare binding \
                source="$EX" destination="$QU" destination_type=queue routing_key="$RK"
            done < <(yq -o=json -I=0 '.bindings[]' /config/topology.yaml)
            ACTUAL=$(curl -s -u "$USER:$PASS" http://rabbitmq:15672/api/bindings | jq 'length')
            [ "$ACTUAL" -ge "$EXPECTED" ] || exit 1

Reconcile via Job, not via kubectl exec. The Job is observable, retryable, and leaves an audit record.

The team's rollback runbook now requires two consecutive green health observations twenty seconds apart before a rollout is declared finished. Single-shot green is not enough on a cluster that has a ten-second admission tick, because you can catch the Pod between reverts and declare victory ninety seconds before the next failure cascade. We learned to distrust single-shot green the hard way on a different engagement, and that is now the default in every recovery handover we ship.
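
The gate is mechanical enough to script. A sketch, using ready-replica count as the green signal (swap in whatever health check your service actually exposes):

# Two consecutive green observations, 20 seconds apart, or the rollout is not done
green() {
  ready=$(kubectl get deploy fanout -n platform -o jsonpath='{.status.readyReplicas}')
  desired=$(kubectl get deploy fanout -n platform -o jsonpath='{.spec.replicas}')
  [ "${ready:-0}" -eq "${desired:-1}" ]
}
if green; then
  sleep 20
  green && echo "rollout converged" && exit 0
fi
echo "not converged; keep watching" >&2
exit 1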

If you are looking at a cluster where every patch reverts within seconds, do not patch faster. Stop patching and find what is doing the reverting. The fix itself is usually ten minutes once you know where the source of truth lives. Finding the source of truth is what takes the hour. If you want a second pair of eyes on a system that is in this state, request an infrastructure review and we will be on a bridge with you the same day.
