Failed Terraform apply incident response checklist

A failed Terraform apply can turn into a multi-hour outage if the response is ad hoc. This checklist gives teams a repeatable incident sequence for containing impact and recovering safely.

IaC incident response | 10 min read

When to trigger
  • `terraform apply` aborts after partial resource changes.
  • Provider/API errors leave state and runtime misaligned.
  • Unexpected dependency failures occur mid-apply.
  • The rollback path is unclear and the team is under release pressure.

Response principle

Contain first, reconcile second, optimize later

The primary objective after a failed apply is to contain the blast radius. Do not immediately retry a full apply. First establish what changed, what failed, and what customer-facing risk exists right now.

Incident response succeeds when ownership is explicit and communication is fast. Assign one incident lead and one Terraform operator. Only the operator runs Terraform commands, so that nobody makes conflicting changes to the same state.

Checklist

Failed apply incident response sequence

Use this checklist in order during high-pressure incidents.

1. Pause further applies

Freeze non-critical Terraform changes to the affected configurations until the gap between state and runtime is understood.
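A freeze is easier to hold when the pipeline enforces it rather than relying on a chat message. The sketch below assumes a hypothetical marker file, `.terraform-freeze`, in the repository root and a hypothetical `critical` override. Neither is a Terraform feature; they are illustrative conventions a team would have to adopt.

```python
from pathlib import Path

# Hypothetical marker file name; Terraform itself has no freeze mechanism.
FREEZE_MARKER = ".terraform-freeze"


def apply_allowed(repo_root: str, critical: bool = False) -> bool:
    """Return True if the pipeline may run an apply for this repository."""
    frozen = (Path(repo_root) / FREEZE_MARKER).exists()
    # During a freeze, only changes explicitly marked critical go through.
    return critical or not frozen
```

A pipeline step could call `apply_allowed(".")` before `terraform apply` and exit early when it returns `False`.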

2. Capture evidence

Save the apply logs, a state snapshot, and any provider/API error output immediately, before any further command changes the state.
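As a minimal sketch of this step, the function below copies the apply log and a state snapshot into a timestamped directory. It assumes the apply output was already written to a log file and that `terraform state pull` can reach the backend from the working directory. The directory and file names are illustrative, and the `pull_state` parameter is a hypothetical hook that lets the snapshot source be swapped out.

```python
import shutil
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional


def capture_evidence(workdir: str, apply_log: str, out_root: str,
                     pull_state: Optional[Callable[[], str]] = None) -> Path:
    """Copy the apply log and a state snapshot into a new evidence directory."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out = Path(out_root) / f"failed-apply-{stamp}"
    out.mkdir(parents=True, exist_ok=False)

    # Keep the original log untouched; work from the copy.
    shutil.copy2(apply_log, out / "apply.log")

    if pull_state is None:
        def pull_state() -> str:
            # `terraform state pull` prints the current state to stdout.
            result = subprocess.run(
                ["terraform", "state", "pull"],
                cwd=workdir, check=True, capture_output=True, text=True,
            )
            return result.stdout

    (out / "state-snapshot.tfstate").write_text(pull_state())
    return out
```

State snapshots can contain secrets, so the evidence directory should live somewhere with the same access controls as the state backend.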

3. Assess runtime impact

Identify affected services, user-facing symptoms, and critical dependencies.

4. Contain blast radius

Apply targeted mitigations for impacted resources only, with owner approval.

5. Reconcile state intentionally

Run state operations (import, move, remove) only after confirming what actually exists at runtime and what the configuration is meant to describe.
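The decision behind each state operation can be written down explicitly. The sketch below maps one resource's situation to a candidate command, under the simplifying assumption that the operator has already verified whether the resource exists in state and at runtime. The function name and the example address are hypothetical; a renamed address would call for `terraform state mv` instead and is not covered here.

```python
from typing import List, Optional


def reconcile_command(address: str, in_state: bool, in_runtime: bool,
                      runtime_id: Optional[str] = None) -> Optional[List[str]]:
    """Suggest a state command for one resource, or None if none is needed."""
    if in_runtime and not in_state:
        # Created by the failed apply but never recorded: bring it into state.
        if not runtime_id:
            raise ValueError("import needs the provider-specific resource ID")
        return ["terraform", "import", address, runtime_id]
    if in_state and not in_runtime:
        # Recorded in state but gone at runtime: drop the stale entry.
        return ["terraform", "state", "rm", address]
    # Present in both (or in neither): no state surgery suggested.
    return None
```

The function only returns the command as a list of arguments; it does not run anything. That keeps a human review step between the decision and the state change.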

6. Run scoped validation

Run plans targeted at the affected resources and confirm they show only the expected changes before re-enabling the normal apply flow.
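One way to make a plan check reviewable is to summarize the machine-readable plan. The sketch below assumes the plan was saved with `terraform plan -target=<address> -out=<file>` and converted with `terraform show -json <file>`, then loaded into a dict. It reads the `resource_changes` list and each entry's `change.actions`; the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_plan(plan: dict) -> Dict[str, List[str]]:
    """Group resource addresses by planned action, skipping no-ops."""
    summary: Dict[str, List[str]] = defaultdict(list)
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        key = "+".join(actions) or "unknown"
        if key != "no-op":
            summary[key].append(rc["address"])
    return dict(summary)


def destructive_addresses(plan: dict) -> List[str]:
    """List resources whose planned actions include a delete (or replace)."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]
```

A non-empty result from `destructive_addresses` is a reasonable signal to stop and get owner approval before applying.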

Artifact

Incident communication template

Incident: Failed Terraform apply
Status: Containment in progress / Reconciliation in progress / Resolved
Impact: services/users affected
Scope: resource groups and environments involved
Next update: timestamp + owner

Communication lag often causes more business damage than the original apply failure. Timebox updates and assign one owner for external status.
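Teams that post updates from a script or chat bot may prefer to generate the template above rather than retype it. The following is a minimal sketch; the class name, field names, and the way the owner is appended to the last line are assumptions, not part of the template itself.

```python
from dataclasses import dataclass

# The three status values come from the template's Status line.
STATUSES = (
    "Containment in progress",
    "Reconciliation in progress",
    "Resolved",
)


@dataclass
class IncidentUpdate:
    status: str
    impact: str
    scope: str
    next_update: str  # timestamp of the next update
    owner: str        # single owner for external status

    def render(self) -> str:
        """Return the update as plain text in the template's field order."""
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        return "\n".join([
            "Incident: Failed Terraform apply",
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"Scope: {self.scope}",
            f"Next update: {self.next_update} ({self.owner})",
        ])
```

Rejecting unknown status values keeps every update aligned with the three phases the template names.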

After-action

Post-incident controls to prevent recurrence

  • Document root cause and missing pre-apply checks.
  • Add policy or pipeline guardrails for the failed pattern.
  • Update rollback runbook with concrete decision points.
  • Schedule a short reconciliation review within 72 hours.

Repeated failed applies usually indicate state, module, and process debt at the same time.