Failed Terraform apply incident response checklist
A failed Terraform apply can become a multi-hour outage if the response is ad hoc. This checklist gives teams a repeatable sequence for containing impact and recovering safely.
Use this checklist when any of the following occurs:
- `terraform apply` aborts after partial resource changes.
- Provider/API errors leave state and runtime misaligned.
- Unexpected dependency failures occur mid-apply.
- The rollback path is unclear under release pressure.
Contain first, reconcile second, optimize later
The primary objective after a failed apply is to contain blast radius. Do not immediately retry broad applies. First establish what changed, what failed, and what customer-facing risk exists right now.
Incident response succeeds when ownership is explicit and communication is fast. Assign one incident lead and one Terraform operator. Avoid multi-operator improvisation.
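One way to establish what changed before any retry is to compare the resource addresses tracked in state before and after the failed apply. The sketch below assumes two text captures of `terraform state list` output (one address per line) are available, the earlier one saved before the apply started. The function names and sample addresses are hypothetical and only illustrate the comparison.

```python
from __future__ import annotations


def parse_state_list(text: str) -> set[str]:
    """Parse `terraform state list` output: one resource address per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def diff_state_addresses(before: str, after: str) -> dict[str, list[str]]:
    """Compare two `terraform state list` captures taken around the failed apply."""
    old, new = parse_state_list(before), parse_state_list(after)
    return {
        "added": sorted(new - old),      # recorded in state before the failure
        "removed": sorted(old - new),    # destroyed or otherwise dropped from state
        "unchanged": sorted(old & new),  # still tracked; attributes may still differ
    }


# Hypothetical captures; the addresses are illustrative only.
before = "aws_instance.web\naws_security_group.web\n"
after = "aws_instance.web\naws_lb.front\n"
report = diff_state_addresses(before, after)
print(report["added"])    # ['aws_lb.front']
print(report["removed"])  # ['aws_security_group.web']
```

An address-level diff only shows which resources entered or left state. It does not show attribute drift or resources that exist at runtime but were never recorded, so it complements a review of the provider console or API rather than replacing it.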
Failed apply incident response sequence
Use this checklist in order during high-pressure incidents.
1. Pause further applies
Freeze non-critical Terraform changes until state and runtime are understood.
2. Capture evidence
Save apply logs, state snapshot, and provider/API errors immediately.
3. Assess runtime impact
Identify affected services, user-facing symptoms, and critical dependencies.
4. Contain blast radius
Apply targeted mitigations for impacted resources only, with owner approval.
5. Reconcile state intentionally
Run `terraform import`, `terraform state mv`, or `terraform state rm` only after confirming what actually exists at runtime and what the configuration is meant to describe.
6. Run scoped validation
Run `terraform plan` with `-target` against the affected resources and review the result before re-enabling the normal apply flow.
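Steps 2 and 6 of the sequence above can be sketched as a small helper that builds the command lines instead of typing them under pressure. This is a minimal sketch, not a prescribed tool: the function names and the `incident.tfplan` file name are assumptions, and the helper only constructs argument lists for the operator to review and run. The exit-code meanings are those documented for `terraform plan -detailed-exitcode`.

```python
from __future__ import annotations


def evidence_commands() -> list[list[str]]:
    """Read-only commands whose output is worth saving for step 2."""
    return [
        ["terraform", "version"],
        ["terraform", "state", "pull"],  # prints the current state; redirect to a file
        ["terraform", "state", "list"],  # one tracked resource address per line
    ]


def scoped_plan_command(targets: list[str], out_file: str = "incident.tfplan") -> list[str]:
    """Build a plan limited to the given resource addresses (step 6)."""
    if not targets:
        raise ValueError("refusing to build an unscoped plan during an incident")
    cmd = ["terraform", "plan", "-detailed-exitcode", f"-out={out_file}"]
    cmd += [f"-target={address}" for address in targets]
    return cmd


def describe_plan_exit(code: int) -> str:
    """Interpret the exit status of `terraform plan -detailed-exitcode`."""
    return {0: "no changes", 1: "error", 2: "changes present"}.get(code, "unknown")


cmd = scoped_plan_command(["aws_lb.front"])  # hypothetical address
print(" ".join(cmd))
# terraform plan -detailed-exitcode -out=incident.tfplan -target=aws_lb.front
```

Refusing an empty target list is a deliberate choice here: it keeps the helper from silently producing the broad plan the checklist warns against.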
Incident communication template
- Incident: Failed Terraform apply
- Status: Containment in progress / Reconciliation in progress / Resolved
- Impact: services/users affected
- Scope: resource groups and environments involved
- Next update: timestamp + owner
Communication lag often causes more business damage than the original apply failure. Timebox updates and assign one owner for external status.
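The template and the timeboxing advice can be combined in a small formatter that fills in the fields and computes the next update time. This is a sketch under stated assumptions: the function name, the 30-minute default timebox, and the sample values are all hypothetical, and teams would adjust them to their own incident process.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

STATUSES = ("Containment in progress", "Reconciliation in progress", "Resolved")


def render_update(status: str, impact: str, scope: str, owner: str,
                  now: datetime, timebox_minutes: int = 30) -> str:
    """Fill the incident template; `now` must be timezone-aware."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    # The 30-minute default is an assumed timebox, not a recommendation from the checklist.
    next_update = (now + timedelta(minutes=timebox_minutes)).astimezone(timezone.utc)
    return "\n".join([
        "Incident: Failed Terraform apply",
        f"Status: {status}",
        f"Impact: {impact}",
        f"Scope: {scope}",
        f"Next update: {next_update:%Y-%m-%d %H:%M} UTC ({owner})",
    ])


update = render_update(
    status="Containment in progress",
    impact="checkout API returning errors",  # illustrative value
    scope="prod, load balancer resources",   # illustrative value
    owner="incident lead",
    now=datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc),
)
print(update.splitlines()[-1])  # Next update: 2024-01-15 10:30 UTC (incident lead)
```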
Post-incident controls to prevent recurrence
- Document root cause and missing pre-apply checks.
- Add policy or pipeline guardrails for the failed pattern.
- Update rollback runbook with concrete decision points.
- Schedule a short reconciliation review within 72 hours.
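As one example of a pipeline guardrail, a pre-apply check can inspect the JSON form of a saved plan (as produced by `terraform show -json <planfile>`) and stop the pipeline when it contains more deletions than expected. The sketch below assumes the failed pattern involved unexpected destroys. The function names, the zero-delete default, and the sample plan are hypothetical, and the sample is trimmed to the `resource_changes` fields the check reads.

```python
from __future__ import annotations


def destructive_changes(plan: dict) -> list[str]:
    """Return addresses whose planned actions include a delete."""
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:  # covers plain deletes and replacements
            flagged.append(change["address"])
    return flagged


def guardrail_passes(plan: dict, max_deletes: int = 0) -> bool:
    """Allow the apply only if deletions stay within the agreed limit."""
    return len(destructive_changes(plan)) <= max_deletes


# Hypothetical plan, trimmed to the fields used above.
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
        {"address": "aws_lb.front", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
}
print(destructive_changes(sample_plan))  # ['aws_lb.front']
print(guardrail_passes(sample_plan))     # False
```

A pipeline step could call `guardrail_passes` and exit non-zero when it returns `False`, which routes the plan to a human reviewer instead of an automatic apply.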
Repeated failed applies usually indicate state, module, and process debt at the same time.