GitOps drift triage checklist for production teams

GitOps drift becomes expensive when teams can no longer tell whether Git, the rendered manifests, or the cluster state is authoritative. This checklist helps production teams triage drift quickly, before sync problems turn into a release backlog or a rollback incident.

GitOps recovery | 9 min read

Problem signals

  • ArgoCD shows a recurring OutOfSync state after teams apply emergency fixes directly to the cluster.
  • Rendered manifests differ between CI, local tooling, and the GitOps controller.
  • One application recovers while related services remain stuck on older assumptions.
  • Every drift incident becomes an argument about which source of truth should win.

Why teams get stuck

Drift triage fails when ownership is vague

Most GitOps drift incidents are not caused by GitOps itself. They happen because runtime changes, manifest generation, and release approval paths evolved separately. Once those paths diverge, the controller is only surfacing a disagreement that already exists.

Triage needs to answer three questions in order: what changed, which source of truth should win, and what can be reconciled safely without creating fresh release risk.

Triage sequence

Five checks that restore signal fast

Work through this checklist in order, starting with the highest-impact application.

1. Freeze low-priority sync churn

Pause non-critical promotions so new drift does not enter while you inspect the failing app set.

2. Capture the last healthy revision

Record the last known healthy commit, sync result, and app health state before changing anything.
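
As a minimal sketch of that record, the Python below pulls the revision, sync result, and health state out of an Application object that has already been exported as JSON. The field paths (status.sync, status.health, status.history) follow the ArgoCD Application resource as commonly documented, but treat them as assumptions and check them against your controller version. The sample values and the capture_state name are hypothetical, and the history list is assumed to be ordered oldest first. The code only records state; whether a revision was actually healthy is still a judgment the team has to confirm.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RevisionRecord:
    app: str
    target_revision: Optional[str]
    sync_status: Optional[str]
    health_status: Optional[str]
    last_deployed_revision: Optional[str]
    last_deployed_at: Optional[str]


def capture_state(app: dict) -> RevisionRecord:
    """Record revision, sync result, and health before anything is changed."""
    status = app.get("status", {})
    history = status.get("history") or []
    # Assumption: history entries are appended, so the last one is the newest.
    last = history[-1] if history else {}
    return RevisionRecord(
        app=app.get("metadata", {}).get("name", "<unknown>"),
        target_revision=status.get("sync", {}).get("revision"),
        sync_status=status.get("sync", {}).get("status"),
        health_status=status.get("health", {}).get("status"),
        last_deployed_revision=last.get("revision"),
        last_deployed_at=last.get("deployedAt"),
    )


# Hypothetical export of one Application; all values are illustrative only.
sample_app = {
    "metadata": {"name": "checkout"},
    "status": {
        "sync": {"status": "OutOfSync", "revision": "abc1234"},
        "health": {"status": "Degraded"},
        "history": [
            {"revision": "9f8e7d6", "deployedAt": "2024-01-10T09:00:00Z"},
            {"revision": "abc1234", "deployedAt": "2024-01-12T14:30:00Z"},
        ],
    },
}

record = capture_state(sample_app)
print(json.dumps(asdict(record), indent=2))
```

Storing the printed record alongside the incident notes gives later steps a fixed reference point, even if the application is resynced in the meantime.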

3. Compare rendered outputs

Diff CI-rendered manifests, Git-stored definitions, and ArgoCD-rendered output for the same target revision.
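
One way to make this comparison repeatable is a small pairwise diff over the rendered text from each source. The sketch below uses only Python's standard difflib module. The source labels ("ci", "git", "argocd") and the manifest snippets are hypothetical, and it assumes each render has already been saved as text for the same target revision. A plain text diff is sensitive to key ordering and whitespace, so normalizing the YAML first may be needed in practice.

```python
import difflib
from itertools import combinations


def diff_renders(renders: dict[str, str]) -> dict[tuple[str, str], list[str]]:
    """Return a unified diff for every pair of render sources that disagree."""
    mismatches: dict[tuple[str, str], list[str]] = {}
    for (name_a, text_a), (name_b, text_b) in combinations(renders.items(), 2):
        diff = list(
            difflib.unified_diff(
                text_a.splitlines(),
                text_b.splitlines(),
                fromfile=name_a,
                tofile=name_b,
                lineterm="",
            )
        )
        if diff:  # identical renders produce an empty diff
            mismatches[(name_a, name_b)] = diff
    return mismatches


# Hypothetical renders of the same revision from three places.
renders = {
    "ci": "kind: Deployment\nspec:\n  replicas: 2\n",
    "git": "kind: Deployment\nspec:\n  replicas: 2\n",
    "argocd": "kind: Deployment\nspec:\n  replicas: 3\n",
}

mismatches = diff_renders(renders)
for pair, lines in mismatches.items():
    print(f"{pair[0]} vs {pair[1]}")
    print("\n".join(lines))
```

Which pairs disagree narrows the search: if CI and Git match but the controller's output differs from both, the render step inside the controller is the first place to look.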

4. Trace cluster-side mutations

List hotfixes, kubectl patches, admission mutations, or controller side effects that bypassed Git.
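
Kubernetes records which field managers have written to an object in metadata.managedFields, which can serve as a starting point for this list. The sketch below filters those entries for writers other than the GitOps controller. It assumes the resource was exported as JSON with managed fields included (for example with kubectl get ... -o json --show-managed-fields). The manager name "argocd-controller" is an assumption about how your controller identifies itself, and the sample entries are hypothetical. Admission webhook changes are usually attributed to the request that triggered them, so they may not appear as separate managers and still need to be traced by other means.

```python
from typing import NamedTuple


class Mutation(NamedTuple):
    manager: str
    operation: str
    time: str


def non_gitops_writers(resource: dict, gitops_managers: set[str]) -> list[Mutation]:
    """List managedFields entries written by anything other than GitOps tooling."""
    entries = resource.get("metadata", {}).get("managedFields") or []
    found = [
        Mutation(
            manager=e.get("manager", "<unknown>"),
            operation=e.get("operation", ""),
            time=e.get("time", ""),
        )
        for e in entries
        if e.get("manager") not in gitops_managers
    ]
    # RFC 3339 timestamps in UTC sort correctly as plain strings.
    return sorted(found, key=lambda m: m.time)


# Hypothetical resource; manager names and times are illustrative only.
deployment = {
    "metadata": {
        "name": "checkout",
        "managedFields": [
            {"manager": "argocd-controller", "operation": "Apply",
             "time": "2024-01-12T14:30:00Z"},
            {"manager": "kubectl-patch", "operation": "Update",
             "time": "2024-01-13T02:15:00Z"},
            {"manager": "kube-controller-manager", "operation": "Update",
             "time": "2024-01-12T14:31:00Z"},
        ],
    }
}

for m in non_gitops_writers(deployment, {"argocd-controller"}):
    print(f"{m.time}  {m.manager}  {m.operation}")
```

Built-in controllers that legitimately update status will also show up here, so the output is a list to review rather than a list of violations.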

5. Reconcile one bounded slice

Fix one application or dependency chain completely before widening scope to the whole environment.
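
To keep the slice bounded, it can help to write down exactly which applications belong to it before reconciling. The sketch below assumes you maintain a simple map of each application to the applications it depends on. That map, the application names, and the bounded_slice helper are all hypothetical. It walks the chain depth first and returns dependencies before the applications that need them.

```python
def bounded_slice(root: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Return root plus everything it transitively depends on, dependencies first."""
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(app: str) -> None:
        if app in seen:  # already handled; also stops cycles from looping forever
            return
        seen.add(app)
        for dep in depends_on.get(app, []):
            visit(dep)
        ordered.append(app)  # appended only after all of its dependencies

    visit(root)
    return ordered


# Hypothetical dependency map for one environment.
depends_on = {
    "checkout": ["payments", "cart"],
    "payments": ["ledger"],
    "cart": [],
    "search": ["catalog"],
}

print(bounded_slice("checkout", depends_on))
# prints ['ledger', 'payments', 'cart', 'checkout']
```

Everything outside the returned list, such as search and catalog in this example, stays out of scope until the slice is fully reconciled.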

Artifact

GitOps drift ownership matrix

Drift source                      Recovery owner
Manifest values mismatch          Platform or release owner
Controller/plugin render mismatch Platform engineering
Manual cluster patch              On-call with reconciliation SLA
Unexpected runtime mutation       Platform + workload owner

Drift gets easier to close when each class of mismatch has a named owner. Without that, teams keep fixing symptoms while the same drift source reappears on the next release.
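
The matrix can also be kept as data so that triage findings are routed to an owner automatically. The sketch below encodes the four rows above as a Python enum and lookup table. The route_findings helper and the sample findings are hypothetical, and the owner strings are copied from the matrix rather than taken from any real on-call system.

```python
from enum import Enum


class DriftSource(Enum):
    MANIFEST_VALUES = "Manifest values mismatch"
    RENDER = "Controller/plugin render mismatch"
    MANUAL_PATCH = "Manual cluster patch"
    RUNTIME_MUTATION = "Unexpected runtime mutation"


OWNERS: dict[DriftSource, str] = {
    DriftSource.MANIFEST_VALUES: "Platform or release owner",
    DriftSource.RENDER: "Platform engineering",
    DriftSource.MANUAL_PATCH: "On-call with reconciliation SLA",
    DriftSource.RUNTIME_MUTATION: "Platform + workload owner",
}

# Fail early if a drift class is ever added without a named owner.
unowned = [s for s in DriftSource if s not in OWNERS]
assert not unowned, f"drift sources without an owner: {unowned}"


def route_findings(findings: list[tuple[str, DriftSource]]) -> dict[str, list[str]]:
    """Group affected applications by the owner responsible for each drift class."""
    queues: dict[str, list[str]] = {}
    for app, source in findings:
        queues.setdefault(OWNERS[source], []).append(app)
    return queues


# Hypothetical triage findings: (application, classified drift source).
findings = [
    ("checkout", DriftSource.MANUAL_PATCH),
    ("cart", DriftSource.RENDER),
    ("payments", DriftSource.MANUAL_PATCH),
]

for owner, apps in route_findings(findings).items():
    print(f"{owner}: {', '.join(apps)}")
```

Keeping the table in one place means a new class of drift cannot be introduced without someone deciding who owns it.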

Common mistakes

What makes GitOps drift worse

  • Resyncing repeatedly before comparing rendered output and cluster mutations.
  • Allowing emergency cluster patches without a deadline for reconciling them back into Git.
  • Treating every OutOfSync result as equivalent instead of classifying the drift source.
  • Trying to restore all applications at once instead of one bounded recovery slice.