
When a hardening rollout breaks 8 layers and your own reconciler fights you

The first thing the on-call team tried was patching the status ConfigMap. Five apps showed Progressing in the platform's bleater-status object, the dashboard had been red for four hours, and somebody figured a kubectl patch on the status keys would at least quiet the pages while they investigated. The patch lasted about ten seconds. The internal reconciler running in the bleater-system namespace rewrote the ConfigMap on its next tick, every key back to Progressing, and the pages started again. That was the moment they called us. The hardening rollout that had run the night before had not broken one thing. It had broken eight.

K8s reliability | 12 min read
Problem signal
  • Apps stuck in Progressing or Degraded for hours after a hardening or security pass, with no single obvious cause in the events stream
  • Status ConfigMaps written by an in-cluster reconciler get rewritten within 10 to 15 seconds of any kubectl patch
  • kubectl delete job hangs on suspended PreSync Jobs because a hook-cleanup finalizer is still attached
  • Migration init containers crash-loop with pg_isready succeeding but a schema verification step failing
  • kubectl patch on a RoleBinding is rejected with 'cannot change roleRef' when you try to point it at a different Role
The patch that survived ten seconds

Why editing the status ConfigMap was the wrong instinct

The team had built a small in-house control plane the year before. A Python reconciler Pod in bleater-system watched the managed workloads, computed health from live cluster signals, and wrote a bleater-status ConfigMap every ten to fifteen seconds. Five apps reported there: an auth service, a profile service, a timeline service, a fanout service, and a primary application that handled the user-facing API. None of them used a full GitOps platform. The reconciler followed GitOps conventions (PreSync hook Jobs, sync windows, hook-cleanup finalizers), but it operated on raw Kubernetes primitives: ConfigMaps, Jobs, Roles. No CRDs.

That detail matters because when the on-call lead patched the status ConfigMap to mark the apps healthy, the reconciler was doing its job. It read the live cluster, saw the upstream signals were still bad, and rewrote the status. The patch was not wrong because patching ConfigMaps is wrong. It was wrong because the bleater-status object was not an input to the system. It was an output. Editing an output to fix a system is the same shape of mistake as editing a Prometheus metric to fix a service.

We have seen this pattern enough times to write it down as a rule. If a controller is rewriting your patches in under a minute, the object you are patching is derived state. Find the inputs. The reconciler source was eighty lines of Python and it took two minutes to read. The health predicate was an AND-chain across eight signals: lock state, orphaned hook Job (finalizer still attached), schema version, RBAC capability, PVC bound, ResourceQuota headroom, NetworkPolicy egress, and init container health. Any one of the first seven returning bad meant Progressing; a failing init container meant Degraded. All eight were bad.

# the AND-chain we found in the reconciler
def app_health(app):
    if lock_status() == 'locked':
        return 'Progressing'
    if orphan_hook_present(app):
        return 'Progressing'
    if schema_declared_version() < required_version():
        return 'Progressing'
    if not migration_rbac_capable():
        return 'Progressing'
    if not pvc_bound(app):
        return 'Progressing'
    if quota_exhausted():
        return 'Progressing'
    if not egress_allows_db(app):
        return 'Progressing'
    if init_container_failing(app):
        return 'Degraded'
    return 'Healthy'

The reconciler's health function. Eight independent signals, all gating. Every patch to the output ConfigMap was wasted work until every signal flipped.
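
An early-return chain like this reports only the first bad gate, which hides the rest of the cascade. A minimal sketch of the alternative, assuming the gate results have already been collected into a dict: the `snapshot` mapping and the `failing_gates` and `summarize` names are hypothetical and are not taken from the client's reconciler.

```python
from typing import Dict, List


def failing_gates(signals: Dict[str, bool]) -> List[str]:
    """Return the name of every gate whose check is False, in insertion order."""
    return [name for name, ok in signals.items() if not ok]


def summarize(signals: Dict[str, bool]) -> str:
    """Mirror the AND-chain: Healthy only when no gate is failing."""
    bad = failing_gates(signals)
    if not bad:
        return "Healthy"
    return "blocked by: " + ", ".join(bad)


# Hypothetical snapshot of the eight inputs; True means the gate passes.
snapshot = {
    "lock_released": False,
    "no_orphan_hook": False,
    "schema_current": False,
    "migration_rbac": False,
    "pvc_bound": False,
    "quota_headroom": False,
    "egress_to_db": False,
    "init_container_ok": False,
}
print(summarize(snapshot))
```

Listing all failing inputs at once is what turns "the dashboard is red" into an inventory you can order and work through.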

Eight failures wearing one hat

What the inventory pass turned up in the bleater namespace

We started with the inventory, because the ticket told us almost nothing. A real P1 page rarely enumerates faults; it tells you what is on fire and gives you the namespace. We ran the kind of get-everything pass we always run on a strange namespace.

kubectl get pods,configmaps,jobs,deployments,roles,rolebindings,serviceaccounts,pvc,resourcequota,networkpolicy -n bleater
kubectl get events -n bleater --sort-by=.lastTimestamp | tail -40
kubectl describe pod -n bleater | grep -A5 'Init Containers\|Events:'

The first three commands we ran. The namespace had about sixteen pre-existing platform workloads from other teams sharing label values with the five managed apps.

What came back was a layered mess. A suspended PreSync Job named auth-presync-migrate-legacy7r2x with a hook-cleanup finalizer and no hook-delete-policy. A second suspended Job named fanout-presync-validate that looked identical but carried the hook-delete-policy annotation and a bleater.io/owner label pointing at platform-team. A hook-reconciliation-lock ConfigMap with status: locked and a stale lock-reason from the night of the rollout. The primary application's pod in Init:CrashLoopBackOff with kubectl logs --previous showing the init container failing after pg_isready returned ok. A bleat-db-schema ConfigMap declaring version=2 with no tables-v3 key. A migration script that contained psql ... || exit 0 and had no set -e.

And then the governance layer, which is where the rollout had really gotten out of hand. A RoleBinding named migration-runner-binding pointed at migration-runner-role-v1, which had read-only verbs. A migration-runner-role-v2 existed alongside it, unbound, with create:jobs and patch:configmaps. A PersistentVolumeClaim named bleat-migration-pvc was Pending with an event saying storageclass.storage.k8s.io "fast-ssd-tier" not found, on a k3s cluster where the only storage class was local-path. A ResourceQuota set to pods: 1. A NetworkPolicy with egress: [] denying everything outbound including DNS.

Each one of those, taken alone, was a small fix. Taken together, they gated each other. The migration could not run because the RBAC was wrong. The repair Pods could not schedule because the quota was at one. The init container could not reach Postgres because the NetworkPolicy denied egress. The schema could not advance because the script swallowed errors. The reconciler refused to mark anything healthy until all of them resolved. The hardening rollout had tightened every knob at once and the knobs were not independent.

flowchart TD
  Q[ResourceQuota pods=1] -->|blocks scheduling of| R[Repair workloads]
  RBAC[RoleBinding to v1] -->|blocks| MIG[Migration Job creating resources]
  NP[NetworkPolicy egress empty] -->|blocks| DB[bleat-service to Postgres]
  PVC[PVC Pending fast-ssd-tier] -->|blocks| INIT[Init container mount]
  SCRIPT[migrate.sh exit 0] -->|swallows errors of| SCHEMA[schema v3]
  LOCK[hook-reconciliation-lock locked] -->|blocks| RECON[Reconciler advancing]
  ORPHAN[Orphan PreSync Job + finalizer] -->|blocks| RECON
  R --> MIG
  MIG --> SCHEMA
  INIT --> DB
  SCRIPT --> SCHEMA
  SCHEMA --> RECON

The dependency graph we drew on the bridge call. Cascade order falls out of the arrows.
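
The same graph can be turned into a repair order mechanically. The sketch below re-encodes the arrows from the flowchart as a "blocked by" mapping and runs a topological sort with Python's standard `graphlib`. The node names are our shorthand for the boxes in the diagram, and the `repair_order` helper is hypothetical.

```python
from graphlib import TopologicalSorter

# Each key can only be repaired after everything in its set.
# Node names are shorthand for the boxes in the flowchart.
blocked_by = {
    "repair_workloads": {"quota"},
    "migration_job": {"rbac", "repair_workloads"},
    "init_mount": {"pvc"},
    "db_reachable": {"network_policy", "init_mount"},
    "schema_v3": {"migrate_script", "migration_job"},
    "reconciler": {"lock", "orphan_job", "schema_v3"},
}


def repair_order(graph):
    """Return one valid order in which every blocker precedes what it blocks."""
    return list(TopologicalSorter(graph).static_order())


order = repair_order(blocked_by)
print(order)
```

A topological sort gives one valid order, not the only one. It does not know that the fixes themselves need Pods to schedule, which is the practical reason the quota went first.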

Why we raised the quota before anything else

The order of repair when faults gate each other

The instinct on a multi-fault incident is to start with the most visible symptom. The CrashLoopBackOff is loud. The lock is loud. The orphan Job is loud. None of those were the right first move. The right first move was the boring one: raise the ResourceQuota, because every other fix needed to schedule a Pod. A pods quota counts every non-terminal Pod in the namespace, so with the existing workloads already past pods: 1, the API server would reject any new Pod we tried to create. We raised it to pods: 8, cpu: 2, memory: 1Gi. Production limits. We did not delete the quota.
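
The quota arithmetic is worth spelling out. A ResourceQuota is enforced when a Pod is created, so a hard limit set below current usage does not evict anything; it just leaves zero headroom. A small sketch, where the usage figure of five Pods is a hypothetical number chosen for illustration:

```python
def pod_headroom(hard_pods: int, used_pods: int) -> int:
    """Number of additional Pods the quota will still admit; never negative."""
    return max(hard_pods - used_pods, 0)


used = 5  # hypothetical count of Pods already counted against the quota

print(pod_headroom(1, used))  # headroom under the rollout's pods: 1
print(pod_headroom(8, used))  # headroom after raising the quota to pods: 8
```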

Then the RBAC. We described both Roles and confirmed v2 had the verbs the migration Job needed. Patching the existing RoleBinding to swing roleRef to v2 returned the error we expected.

$ kubectl patch rolebinding migration-runner-binding -n bleater \
    --type='json' -p='[{"op":"replace","path":"/roleRef/name","value":"migration-runner-role-v2"}]'
The RoleBinding "migration-runner-binding" is invalid: roleRef: Invalid value: rbac.RoleRef{...}: cannot change roleRef

$ kubectl get rolebinding migration-runner-binding -n bleater -o yaml > /tmp/rb.yaml
# edit /tmp/rb.yaml: set roleRef.name to migration-runner-role-v2 and drop
# server-set metadata (resourceVersion, uid, creationTimestamp) before reapplying
$ kubectl delete rolebinding migration-runner-binding -n bleater
$ kubectl apply -f /tmp/rb.yaml

roleRef is immutable. The only path is delete-and-recreate, with the existing object as a template so you do not lose subjects.
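
The hand edit of the exported manifest can also be scripted. The following is a sketch rather than the exact procedure from the incident: the `rebind` helper and the subject name are hypothetical, and the list of server-populated fields is what we would strip by habit.

```python
import copy

# Metadata populated by the API server; a manifest used to create a
# fresh object should not carry these over from the export.
SERVER_FIELDS = ("resourceVersion", "uid", "creationTimestamp",
                 "managedFields", "generation", "selfLink")


def rebind(exported: dict, new_role: str) -> dict:
    """Return a create-ready copy of a RoleBinding that points at new_role."""
    rb = copy.deepcopy(exported)  # leave the exported template untouched
    for field in SERVER_FIELDS:
        rb["metadata"].pop(field, None)
    rb["roleRef"]["name"] = new_role
    return rb


exported = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "migration-runner-binding", "namespace": "bleater",
                 "resourceVersion": "12345", "uid": "hypothetical-uid"},
    # The subject name below is hypothetical.
    "subjects": [{"kind": "ServiceAccount", "name": "migration-runner",
                  "namespace": "bleater"}],
    "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role",
                "name": "migration-runner-role-v1"},
}
fixed = rebind(exported, "migration-runner-role-v2")
print(fixed["roleRef"]["name"], fixed["subjects"])
```

Working from a copy of the export keeps the subjects intact and leaves the original available if the recreate has to be rolled back.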

PVC next. We listed storage classes, saw local-path was the only one, exported the existing PVC, changed storageClassName, deleted, reapplied. The PVC sat Pending for a few more seconds until we scheduled a consumer Pod against it, because local-path on k3s binds on first consumer.

Then the NetworkPolicy. We did not delete it. The deny-by-default posture was the right posture; the rollout had just forgotten to allow anything. We added three explicit egress rules: same-namespace for the Postgres reach, kube-system on UDP 53 for DNS, and the metrics endpoints the reconciler scraped. The deny-all stayed in place for everything else.
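
The shape of a default-deny policy with three explicit allows looks roughly like the manifest built below. This is a sketch under stated assumptions: the policy name, the metrics namespace, and the metrics port are hypothetical placeholders, since the real values are specific to the client's cluster. The `kubernetes.io/metadata.name` label is the one Kubernetes sets automatically on namespaces.

```python
import json


def egress_policy(name: str, namespace: str,
                  metrics_namespace: str, metrics_port: int) -> dict:
    """Build a NetworkPolicy that denies egress by default and allows three paths."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},          # applies to every Pod in the namespace
            "policyTypes": ["Egress"],  # anything not listed below stays denied
            "egress": [
                # 1. Same-namespace traffic, which covers the path to Postgres.
                {"to": [{"podSelector": {}}]},
                # 2. DNS lookups against kube-system.
                {"to": [{"namespaceSelector": {"matchLabels": {
                    "kubernetes.io/metadata.name": "kube-system"}}}],
                 "ports": [{"protocol": "UDP", "port": 53}]},
                # 3. Metrics endpoints (namespace and port are placeholders).
                {"to": [{"namespaceSelector": {"matchLabels": {
                    "kubernetes.io/metadata.name": metrics_namespace}}}],
                 "ports": [{"protocol": "TCP", "port": metrics_port}]},
            ],
        },
    }


policy = egress_policy("bleater-egress", "bleater", "monitoring", 9090)
print(json.dumps(policy, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output can be saved to a file and applied with kubectl apply -f.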

Then the lock and the orphan. The lock was a one-line patch to set status: unlocked and to replace lock-reason with resolved-2024-hardening-rollback. We left an audit value rather than blanking the field. The orphan Job hung on delete because of the finalizer. The strip-then-delete sequence is muscle memory at this point but it is worth showing because plenty of teams reach for --force first, which is the wrong tool.

# strip the finalizer first, then delete cleanly
kubectl patch job auth-presync-migrate-legacy7r2x -n bleater \
  --type=json -p='[{"op":"remove","path":"/metadata/finalizers"}]'
kubectl delete job auth-presync-migrate-legacy7r2x -n bleater

# do NOT touch fanout-presync-validate. it has hook-delete-policy set,
# carries bleater.io/owner=platform-team, and the reconciler manages it.
kubectl get job fanout-presync-validate -n bleater \
  -o jsonpath='{.metadata.annotations.argocd\.argoproj\.io/hook-delete-policy}'
# => HookSucceeded

Strip then delete. The decoy Job looks identical to the orphan from a distance; the discriminator is the hook-delete-policy annotation and the ownership label.
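
The discriminator can be written down as a predicate so nobody has to eyeball it at 3 a.m. A minimal sketch, assuming Job objects loaded from kubectl get job -o json: the annotation key and owner label are the ones shown above, while the `is_orphan_hook` name and the finalizer strings in the sample data are hypothetical.

```python
HOOK_POLICY = "argocd.argoproj.io/hook-delete-policy"
OWNER_LABEL = "bleater.io/owner"


def is_orphan_hook(job: dict) -> bool:
    """True when a suspended Job holds a finalizer but nothing claims it."""
    meta = job.get("metadata", {})
    annotations = meta.get("annotations") or {}
    labels = meta.get("labels") or {}
    suspended = bool(job.get("spec", {}).get("suspend", False))
    return (
        suspended
        and bool(meta.get("finalizers"))
        and HOOK_POLICY not in annotations
        and OWNER_LABEL not in labels
    )


# Sample objects; the finalizer names are hypothetical.
orphan = {
    "metadata": {"name": "auth-presync-migrate-legacy7r2x",
                 "finalizers": ["example.io/hook-cleanup"]},
    "spec": {"suspend": True},
}
managed = {
    "metadata": {"name": "fanout-presync-validate",
                 "finalizers": ["example.io/hook-cleanup"],
                 "annotations": {HOOK_POLICY: "HookSucceeded"},
                 "labels": {OWNER_LABEL: "platform-team"}},
    "spec": {"suspend": True},
}
print(is_orphan_hook(orphan), is_orphan_hook(managed))
```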

We have written more on cleaning up GitOps-style state safely in our Kubernetes and CI/CD stabilization playbook, including the finalizer-strip pattern and how to tell a managed Job from an orphaned one without guessing.

The fixes that had to be repairs, not deletes

Don't weaken governance to silence alarms

Halfway through the recovery the client's platform lead asked the obvious question. Why not just delete the ResourceQuota and the NetworkPolicy until things stabilize, then put them back? It would have shaved twenty minutes. We said no, and the reason is worth writing down, because it is the part of incident work that teams under pressure get wrong most often.

Governance controls exist for a reason. Someone put a pod quota on that namespace originally because something had blown up the namespace before. Someone put the deny-all egress on because the auth service should not be able to call random external endpoints. The rollout had mangled the values, not the intent. Deleting the controls would have restored the workloads and silenced the alarms. It would have also removed two of the few real defenses that namespace had, with no scheduled work item to put them back. We have watched teams do this in March and find the controls still missing in November. The graveyard of post-incident TODOs is full of governance restore tickets that never got worked.

So we repaired. The quota went up to production limits in place. The NetworkPolicy got explicit allow rules added while the default-deny stayed. The PVC got a real storage class while the claim itself stayed at the same name and the same size. The orphan Job got deleted, because a stale suspended PreSync Job genuinely is garbage, but the cascade infrastructure stayed. Same controls, working values.

The migration script was the other repair-not-delete case. The version we found had this pattern:

# what we found
#!/bin/bash
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -f /migrations/v3.sql || exit 0
echo "migration complete"

# what we replaced it with
#!/bin/bash
set -euo pipefail
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" \
  -v ON_ERROR_STOP=1 \
  -f /migrations/v3.sql
echo "migration complete"

|| exit 0 is the single worst suffix in any migration script. set -e and ON_ERROR_STOP=1 together mean a failing SQL statement actually fails the Job, which is what the reconciler was waiting to see.
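
The difference is easy to demonstrate without a database. In the sketch below, the shell builtin false stands in for a failing psql call. The `exit_code` helper is hypothetical and assumes a POSIX sh is on the PATH.

```python
import subprocess


def exit_code(script: str) -> int:
    """Run a snippet under sh and return its exit status."""
    return subprocess.run(["sh", "-c", script], capture_output=True).returncode


# 'false' plays the role of a psql invocation that failed.
swallowed = exit_code("false || exit 0; echo 'migration complete'")
strict = exit_code("set -e; false; echo 'migration complete'")

print("with || exit 0:", swallowed)
print("with set -e:   ", strict)
```

The first snippet exits 0 even though the command failed, so a Job wrapping it would be marked as succeeded. set -e alone only covers the shell side. psql is documented to return a non-zero status for errors inside a script file only when ON_ERROR_STOP is set, so both settings are needed.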

After the script was patched, the migration Job ran successfully under the new RoleBinding, applied the v3 schema, and we read the tables back out of Postgres directly rather than trusting the script's exit code. The bleat-db-schema ConfigMap got tables-v3 written from observed pg_tables output. Not from the migration's stated intent. From the live database. If you ever find yourself writing schema declarations from anything other than what is actually in the database, you are setting up the next incident.
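
Writing the declaration from observed state can be as small as the sketch below. The assumptions are ours: the query targets the public schema, the table names are hypothetical, and storing the tables as a sorted comma-separated string is our guess at a format; only the version and tables-v3 key names come from the ConfigMap itself.

```python
import json
from typing import Iterable

# Query whose output feeds the declaration; the 'public' schema is an assumption.
PG_TABLES_QUERY = "SELECT tablename FROM pg_tables WHERE schemaname = 'public'"


def schema_patch(observed_tables: Iterable[str], version: int) -> str:
    """Build a JSON merge patch for the schema ConfigMap from observed tables."""
    tables = sorted(set(observed_tables))
    if not tables:
        # An empty result means the read-back failed; do not declare anything.
        raise ValueError("refusing to declare a schema with no observed tables")
    return json.dumps({"data": {
        "version": str(version),  # ConfigMap values must be strings
        f"tables-v{version}": ",".join(tables),
    }})


# Hypothetical table names standing in for the rows the query returned.
patch = schema_patch(["users", "bleats", "users"], 3)
print(patch)
```

The resulting string can be passed to kubectl patch configmap bleat-db-schema -n bleater --type merge -p. Refusing to write on an empty read-back is the point: a declaration is only made when the database has been observed.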

If your control plane is gaslighting your operators

When in-house reconcilers and hardening rollouts collide

The hard part of this kind of incident is not any single fault. The hard part is that an internal control plane is opinionated about state in ways that are not documented anywhere except in the reconciler's source code. When five apps are red and the dashboard says nothing changed, your team can spend an hour patching outputs that get reverted before they understand the inputs. Hardening rollouts make this worse, because they touch ResourceQuotas and NetworkPolicies and RBAC in the same change window, and the rollback path almost never accounts for the case where the controls themselves were the right idea but the values were wrong.

We run these recovery engagements every week. The in-house reconciler pattern shows up at almost every SaaS company past Series A that decided not to run ArgoCD or Flux directly. The shape of the failure is always the same: a small Python or Go service that watches a namespace and writes a status object, an operations team that does not own the reconciler code, and a control plane that fights every cosmetic fix because that is what it was built to do. We have seen the RoleBinding immutability case four times this quarter alone. The NetworkPolicy egress-without-DNS case shows up after every security audit cycle.

If you are watching a namespace where the status object keeps reverting your changes, or where a hardening pass cascaded across half a dozen layers and your team is debating whether to delete the controls to get back to green, book an infrastructure review with our team and we will be on a bridge call with you the same day. We will read your reconciler, draw the dependency graph for the cascade, and walk the repair order with your on-call. The goal is not to get the dashboard green by morning. The goal is to get it green without leaving a graveyard of governance restore tickets behind it.
