Failed Terraform apply incident response checklist

A failed Terraform apply can become a multi-hour outage if response is ad hoc. This checklist gives teams a repeatable incident sequence to contain impact and recover safely.

IaC incident response | 10 min read

When to trigger

`terraform apply` aborts after partial resource changes.
Provider/API errors leave state and runtime misaligned.
Unexpected dependency failures occur mid-apply.
The rollback path is unclear under release pressure.

Response principle

Contain first, reconcile second, optimize later

The primary objective after a failed apply is to contain the blast radius. Do not immediately retry a broad apply. First establish what changed, what failed, and what customer-facing risk exists right now.
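
One way to establish what changed is to compare resource addresses between a pre-apply state snapshot and the state left behind by the failed apply. The sketch below assumes both snapshots were exported with `terraform show -json` and relies on the `values.root_module` layout of that output. The function names and example addresses are illustrative and not part of any Terraform tooling. It compares addresses only, so it will not detect in-place attribute changes.

```python
from __future__ import annotations

from typing import Iterator


def iter_addresses(module: dict) -> Iterator[str]:
    """Yield resource addresses from a module node, recursing into child modules."""
    for resource in module.get("resources", []):
        yield resource["address"]
    for child in module.get("child_modules", []):
        yield from iter_addresses(child)


def state_addresses(state_doc: dict) -> set[str]:
    """Collect every resource address in a `terraform show -json` state document."""
    # An empty state may carry no "values" key at all, so fall back to {}.
    root = (state_doc.get("values") or {}).get("root_module", {})
    return set(iter_addresses(root))


def diff_states(before: dict, after: dict) -> dict[str, list[str]]:
    """List addresses that appeared in or vanished from state between snapshots."""
    before_addrs = state_addresses(before)
    after_addrs = state_addresses(after)
    return {
        "added": sorted(after_addrs - before_addrs),
        "removed": sorted(before_addrs - after_addrs),
    }


if __name__ == "__main__":
    # Hypothetical snapshots; in practice, load them with json.load() from files.
    before = {"values": {"root_module": {"resources": [
        {"address": "aws_s3_bucket.logs"},
    ]}}}
    after = {"values": {"root_module": {
        "resources": [{"address": "aws_s3_bucket.logs"}],
        "child_modules": [{
            "address": "module.db",
            "resources": [{"address": "module.db.aws_db_instance.main"}],
        }],
    }}}
    print(diff_states(before, after))
```

Addresses listed under `added` were recorded before the apply aborted, which makes them a reasonable first list of resources to check against runtime.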

Incident response succeeds when ownership is explicit and communication is fast. Assign one incident lead and one Terraform operator. Avoid multi-operator improvisation.

Checklist

Failed apply incident response sequence

Use this checklist in order during high-pressure incidents.

1. Pause further applies

Freeze non-critical Terraform changes until state and runtime are understood.

2. Capture evidence

Save the apply logs, a state snapshot, and any provider/API errors immediately, before anything else modifies state.

3. Assess runtime impact

Identify affected services, user-facing symptoms, and critical dependencies.

4. Contain blast radius

Apply targeted mitigations for impacted resources only, with owner approval.

5. Reconcile state intentionally

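Import, move, or remove resources in state (for example with `terraform import`, `terraform state mv`, or `terraform state rm`) only after confirming what actually exists at runtime and what the desired state is.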

6. Run scoped validation

Run and review targeted plans (for example `terraform plan -target=ADDRESS`) before re-enabling the normal apply flow.
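
The scoped validation in step 6 can be partly automated. The following is a minimal sketch, assuming the plan was saved with `terraform plan -out=tfplan` and converted with `terraform show -json tfplan`. It reads the `resource_changes` array of that JSON and reports any address that would be modified outside an approved scope. The function name and the example addresses are hypothetical.

```python
from __future__ import annotations

# Plan actions that do not modify real infrastructure.
NON_MUTATING = {"no-op", "read"}


def out_of_scope_changes(plan_doc: dict, allowed: set[str]) -> list[str]:
    """Return addresses the plan would change that are not in the allowed scope."""
    offenders = []
    for rc in plan_doc.get("resource_changes", []):
        actions = set(rc.get("change", {}).get("actions", []))
        # Skip entries that would only read or leave the resource untouched.
        if actions <= NON_MUTATING:
            continue
        if rc["address"] not in allowed:
            offenders.append(rc["address"])
    return sorted(offenders)


if __name__ == "__main__":
    # Hypothetical plan document; in practice, json.load() the saved plan JSON.
    plan = {"resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    ]}
    unexpected = out_of_scope_changes(plan, allowed={"aws_instance.web"})
    if unexpected:
        print("Plan touches resources outside the approved scope:", unexpected)
```

A non-empty result is a signal to stop and review the plan with the incident lead rather than a reason to widen the allowed set on the spot.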

Artifact

Incident communication template

Incident: Failed Terraform apply
Status: Containment in progress / Reconciliation in progress / Resolved
Impact: services/users affected
Scope: resource groups and environments involved
Next update: timestamp + owner

Communication lag often causes more business damage than the original apply failure. Timebox updates and assign one owner for external status.
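
To keep updates consistent under pressure, the template above can be filled in programmatically. This is a small sketch, not a prescribed tool: the `IncidentUpdate` class, its field names, and the timestamp format are assumptions chosen for illustration. The status values are the three listed in the template.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# The three statuses named in the communication template.
STATUSES = (
    "Containment in progress",
    "Reconciliation in progress",
    "Resolved",
)


@dataclass
class IncidentUpdate:
    status: str
    impact: str            # services/users affected
    scope: str             # resource groups and environments involved
    next_update: datetime  # should be timezone-aware
    owner: str             # single owner for the next update

    def render(self) -> str:
        """Render the update as plain text in the template's field order."""
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        when = self.next_update.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        return "\n".join([
            "Incident: Failed Terraform apply",
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"Scope: {self.scope}",
            f"Next update: {when} ({self.owner})",
        ])


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    update = IncidentUpdate(
        status="Containment in progress",
        impact="checkout API returning errors for some users",
        scope="prod / networking resource group",
        next_update=datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
        owner="incident lead",
    )
    print(update.render())
```

Rejecting unknown statuses keeps every update on the same three-stage vocabulary, which makes a thread of updates easier to scan.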

After-action

Post-incident controls to prevent recurrence

Document root cause and missing pre-apply checks.
Add policy or pipeline guardrails for the failed pattern.
Update rollback runbook with concrete decision points.
Schedule a short reconciliation review within 72 hours.

Repeated failed applies usually indicate state, module, and process debt at the same time.