GitOps drift triage checklist for production teams
GitOps drift becomes expensive when teams can no longer tell whether Git, the rendered manifests, or the live cluster state is authoritative. This checklist helps production teams triage drift quickly, before sync problems turn into a release backlog or a rollback incident.
- Argo CD shows a recurring OutOfSync state after teams apply emergency cluster fixes.
- Rendered manifests differ between CI, local tooling, and the GitOps controller.
- One application recovers while related services remain stuck on older assumptions.
- Every drift incident becomes an argument about which source of truth should win.
Drift triage fails when ownership is vague
Most GitOps drift incidents are not caused by GitOps itself. They happen because runtime changes, manifest generation, and release approval paths evolved separately. Once those paths drift apart, the controller is only surfacing the disagreement.
Triage needs to answer three questions in order: what changed, which source of truth should win, and what can be reconciled safely without creating fresh release risk.
Five checks that restore signal fast
Use this checklist in order on the highest-impact application first.
1. Freeze low-priority sync churn
Pause non-critical promotions so new drift does not enter while you inspect the failing app set.
2. Capture the last healthy revision
Record the last known healthy commit, sync result, and app health state before changing anything.
3. Compare rendered outputs
Diff the CI-rendered manifests, the definitions stored in Git, and the output Argo CD renders for the same target revision.
4. Trace cluster-side mutations
List hotfixes, kubectl patches, admission mutations, or controller side effects that bypassed Git.
5. Reconcile one bounded slice
Fix one application or dependency chain completely before widening scope to the whole environment.
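Check 3 is the easiest one to script. The following is a minimal sketch, assuming each render has already been exported as text for the same target revision; the source labels (`ci`, `git`, `controller`) and the function name `diff_renders` are hypothetical, and the comparison is a plain line diff rather than a structural YAML comparison.

```python
import difflib
from itertools import combinations


def diff_renders(renders):
    """Return a unified diff for every pair of renders that disagree.

    `renders` maps a source label (hypothetical names such as "ci", "git",
    "controller") to the manifest text that source produced for the same
    target revision. Pairs that match are left out of the result.
    """
    diffs = {}
    for left, right in combinations(sorted(renders), 2):
        # Ignore trailing whitespace so formatting noise is not reported as drift.
        left_lines = [line.rstrip() for line in renders[left].splitlines()]
        right_lines = [line.rstrip() for line in renders[right].splitlines()]
        delta = list(
            difflib.unified_diff(
                left_lines, right_lines, fromfile=left, tofile=right, lineterm=""
            )
        )
        if delta:
            diffs[(left, right)] = delta
    return diffs


if __name__ == "__main__":
    # Illustrative data only: the controller rendered an older image tag.
    sample = {
        "ci": "replicas: 3\nimage: app:1.4.2\n",
        "git": "replicas: 3\nimage: app:1.4.2\n",
        "controller": "replicas: 3\nimage: app:1.4.1\n",
    }
    for pair, delta in diff_renders(sample).items():
        print(pair)
        print("\n".join(delta))
```

If only the pairs involving one source show a diff, that source is the likely outlier, which narrows the question of which render path to inspect first.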
GitOps drift ownership matrix
| Drift source | Recovery owner |
| --- | --- |
| Manifest values mismatch | Platform or release owner |
| Controller/plugin render mismatch | Platform engineering |
| Manual cluster patch | On-call with reconciliation SLA |
| Unexpected runtime mutation | Platform + workload owner |
Drift gets easier to close when each class of mismatch has a named owner. Without that, teams keep fixing symptoms while the same drift source reappears on the next release.
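One way to keep the matrix from living only in a document is to encode it as data that triage tooling can read. This is a sketch under that assumption: the owner strings come from the matrix above, while the enum, the `RECOVERY_OWNER` mapping, and `group_by_owner` are hypothetical names introduced for illustration.

```python
from enum import Enum


class DriftSource(Enum):
    MANIFEST_VALUES_MISMATCH = "Manifest values mismatch"
    RENDER_MISMATCH = "Controller/plugin render mismatch"
    MANUAL_CLUSTER_PATCH = "Manual cluster patch"
    UNEXPECTED_RUNTIME_MUTATION = "Unexpected runtime mutation"


# Owners copied from the ownership matrix; one entry per drift class.
RECOVERY_OWNER = {
    DriftSource.MANIFEST_VALUES_MISMATCH: "Platform or release owner",
    DriftSource.RENDER_MISMATCH: "Platform engineering",
    DriftSource.MANUAL_CLUSTER_PATCH: "On-call with reconciliation SLA",
    DriftSource.UNEXPECTED_RUNTIME_MUTATION: "Platform + workload owner",
}

# Fail at import time if a drift class is ever added without a named owner.
assert set(RECOVERY_OWNER) == set(DriftSource)


def group_by_owner(incidents):
    """Group (app, drift source) pairs under the owner named in the matrix."""
    grouped = {}
    for app, source in incidents:
        grouped.setdefault(RECOVERY_OWNER[source], []).append(app)
    return grouped


if __name__ == "__main__":
    # Application names are illustrative only.
    open_incidents = [
        ("checkout", DriftSource.MANUAL_CLUSTER_PATCH),
        ("search", DriftSource.RENDER_MISMATCH),
    ]
    print(group_by_owner(open_incidents))
```

The import-time check is the useful part: an unowned drift class becomes a failing build rather than an argument during an incident.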
What makes GitOps drift worse
- Resyncing repeatedly before comparing rendered output and cluster mutations.
- Allowing emergency cluster patches without a reconciliation deadline back into Git.
- Treating every OutOfSync result as equivalent instead of classifying the drift source.
- Trying to restore all applications at once instead of one bounded recovery slice.
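The second item, emergency patches with no reconciliation deadline, can be turned into a simple check. The sketch below assumes the team keeps some record of each emergency patch; the `EmergencyPatch` fields, the function name, and the 24-hour default are all hypothetical and should be replaced with whatever SLA the team has actually agreed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class EmergencyPatch:
    app: str
    applied_at: datetime
    # Git commit that carried the patch back into the repository, if any.
    reconciled_commit: Optional[str] = None


def overdue_patches(patches, now, sla=timedelta(hours=24)):
    """Return patches that are still missing from Git after the SLA window.

    The 24-hour default is an assumed placeholder, not a recommendation.
    """
    return [
        patch
        for patch in patches
        if patch.reconciled_commit is None and now - patch.applied_at > sla
    ]


if __name__ == "__main__":
    # Illustrative records only.
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    patches = [
        EmergencyPatch("checkout", datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)),
        EmergencyPatch("search", datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)),
    ]
    for patch in overdue_patches(patches, now):
        print(f"{patch.app}: patch not reconciled into Git within SLA")
```

Running a check like this on a schedule gives the on-call owner a concrete list to close, instead of discovering the unreconciled patch at the next sync.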
Continue GitOps recovery
If drift is already blocking release confidence, move from triage into a focused review.