How we recovered tfstate after force-unlock raced a CI apply
The engineer pinged us at 4:48 pm on a Thursday. They had been trying to push a small IAM change to staging, terraform apply had failed with Error acquiring the state lock, and they did what most of us have done at least once: they ran terraform force-unlock with the ID from the error message and re-ran apply. The apply went through. Ten minutes later a teammate on a different branch ran terraform plan and the plan output wanted to destroy and recreate 38 resources that were sitting healthy in AWS, returning 200s, serving traffic. By the time we joined the bridge, the original engineer was halfway convinced they needed to let Terraform rebuild the whole staging environment. They did not. The cloud was fine. The state file was the thing that was broken.
- terraform plan shows -/+ destroy and recreate for resources nobody touched and that are healthy in the cloud
- Teammates see Error: state snapshot was created by Terraform v1.5.7, which is newer than current v1.5.4
- S3 bucket versioning shows two or three tfstate writes inside a 60 to 90 second window
- The DynamoDB lock table is empty but the state file timestamps do not line up with anyone's apply log
- Someone on the team ran terraform force-unlock in the last hour
A stale lock from a dead CI job
The first wrong model was reasonable. The engineer saw Error acquiring the state lock, looked at the lock ID, did not recognize it, and assumed it was a leftover from a CI job that had crashed earlier in the week. They had seen stale locks before. The fix last time was force-unlock. So they ran it again.
What they did not check was whether the lock holder was actually still alive. The CI job that held the lock was a scheduled terraform plan cycle running on a 15-minute cadence, and that particular run was on the slow side because the workspace had grown to about 600 resources. It was not stuck. It was just working. The force-unlock removed the lock entry from DynamoDB while the CI process was still very much holding an in-memory version of the state file, mid-refresh. Two writers, no coordination.
When the engineer's apply finished, it wrote its version of the state to S3. About forty seconds later, the CI run finished its refresh and wrote its version of the state to S3 on top of that. Two uncoordinated writes, each process believing it held the latest state, the later one overwriting the earlier. S3 versioning preserved both, but the current version, the one Terraform reads, was a Frankenstein.
Three S3 versions in 90 seconds, and a plan that wanted to destroy healthy infrastructure
We pulled the S3 object versions for the state file first. That is the single most useful command in a Terraform state incident, and most teams do not run it until someone external suggests it.
# The filter starts early enough to include the last write before the collision.
aws s3api list-object-versions \
  --bucket acme-tfstate-staging \
  --prefix env/staging/terraform.tfstate \
  --query 'Versions[?LastModified>=`2024-01-18T16:40:00Z`].[VersionId,LastModified,Size]' \
  --output table

# Output (abridged):
# VersionId      LastModified          Size
# 9f3aV2.JqL...  2024-01-18T16:51:12Z  412847
# 8h2nB1.KpM...  2024-01-18T16:50:31Z  408992
# 7g1mA0.LoN...  2024-01-18T16:49:48Z  411203
# 6f0lZ9.MnO...  2024-01-18T16:42:15Z  411198  <-- last known good
Three writes inside 84 seconds. The 16:42 version was the last clean write before the collision.
Three writes in 84 seconds was the smoking gun. A healthy workspace writes state once per apply, and the next write is usually hours away. Three writes that close together meant at least two processes had been racing. We cross-checked against the CI logs and the engineer's shell history and confirmed: the CI plan cycle had been refreshing state from 16:49:48 onwards, the engineer's force-unlock landed at 16:50:18, the engineer's apply wrote state at 16:50:31, and the CI refresh wrote its stale view back at 16:51:12. The 16:51 write was the one Terraform was now reading, and it had been built from a refresh that started before half the engineer's changes existed.
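The "writes too close together" check can be sketched as a small shell function. This is a minimal illustration, not the tooling used on the bridge: the function name `find_write_bursts` and the 120-second default window are assumptions, and it relies on GNU `date -d` for timestamp parsing.

```shell
#!/usr/bin/env bash
# Sketch: flag state writes that landed suspiciously close together.
# Reads "VERSION_ID TIMESTAMP" lines on stdin (any order) and prints every
# pair of consecutive writes that are less than WINDOW seconds apart.
# find_write_bursts is a hypothetical helper name; requires GNU date.
find_write_bursts() {
  local window="${1:-120}"   # assumed default window, in seconds
  local id ts epoch prev_id="" prev_epoch=""
  # Convert each timestamp to epoch seconds, then sort oldest first.
  while read -r id ts; do
    [[ -z "$id" ]] && continue
    printf '%s %s\n' "$(date -u -d "$ts" +%s)" "$id"
  done | sort -n | while read -r epoch id; do
    if [[ -n "$prev_epoch" ]] && (( epoch - prev_epoch < window )); then
      printf '%s -> %s: %ds apart\n' "$prev_id" "$id" "$(( epoch - prev_epoch ))"
    fi
    prev_id="$id"
    prev_epoch="$epoch"
  done
}
```

Fed the four timestamps from the listing above, it reports two pairs, 43 and 41 seconds apart, and stays silent about the 16:42 write. Input can come from `aws s3api list-object-versions` with `--query 'Versions[].[VersionId,LastModified]' --output text`.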
That explained the plan output. The state Terraform was reading said the resources had attributes that did not match reality. Plan diffed state against the cloud, saw the mismatch, and proposed the only thing it knows how to propose: destroy and recreate. The cloud was correct. The state was lying. If we had let the apply run, we would have taken a healthy staging environment offline for somewhere between 40 minutes and two hours to rebuild things that did not need rebuilding.
Restore the pre-collision state version, then import only what actually drifted
The recovery had two parts and an order that mattered. First, replace the corrupted live state with the last clean S3 version. Second, figure out which resources genuinely changed during the collision window and re-import only those. Skipping the second step is how teams end up with the same incident a week later, because real changes from the engineer's apply have been silently rolled back.
Before touching anything we pulled a local backup of the current (broken) state. If our restore went wrong, we wanted a way back.
# 1. Backup the current broken state to local disk
aws s3api get-object \
  --bucket acme-tfstate-staging \
  --key env/staging/terraform.tfstate \
  ./tfstate.broken.$(date +%s).json

# 2. Restore the last known good version in place
aws s3api copy-object \
  --bucket acme-tfstate-staging \
  --key env/staging/terraform.tfstate \
  --copy-source 'acme-tfstate-staging/env/staging/terraform.tfstate?versionId=6f0lZ9.MnO...' \
  --metadata-directive REPLACE

# 3. Confirm the active version is now the restored one
aws s3api head-object \
  --bucket acme-tfstate-staging \
  --key env/staging/terraform.tfstate \
  --query 'VersionId'
The copy-object call writes the old version as a new current version. Do not delete versions; you want the audit trail intact.
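One thing worth checking after a manual restore: when the S3 backend is configured with a DynamoDB table, Terraform also records an MD5 digest of the state under a `LockID` ending in `-md5`. If that digest does not match the restored object, Terraform reports a checksum mismatch instead of reading the state. The sketch below compares the two. The helper name `digest_matches` is hypothetical, and it assumes `md5sum` from GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: compare a downloaded state file's MD5 with the digest stored in the
# lock table. digest_matches is a hypothetical helper; requires md5sum.
digest_matches() {
  local state_file="$1" expected="$2" actual
  actual="$(md5sum "$state_file" | awk '{print $1}')"
  if [[ "$actual" == "$expected" ]]; then
    echo "digest ok: $actual"
  else
    echo "digest mismatch: state=$actual table=$expected"
    return 1
  fi
}

# Usage against the bucket and table named in this post (not run here):
#   aws s3api get-object --bucket acme-tfstate-staging \
#     --key env/staging/terraform.tfstate ./restored.tfstate
#   expected=$(aws dynamodb get-item --table-name acme-tfstate-locks \
#     --key '{"LockID":{"S":"acme-tfstate-staging/env/staging/terraform.tfstate-md5"}}' \
#     --query 'Item.Digest.S' --output text)
#   digest_matches ./restored.tfstate "$expected"
```

On a mismatch, Terraform's own error message prints the digest it calculated, and the `Digest` attribute has to be updated to that value before plan will read the restored state.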
With the state restored, we ran terraform plan. The output was much shorter, around six resources, and they were the ones the engineer had actually changed in their apply. That was the divergence window: changes that had been made for real in AWS but that the restored state did not know about. Each of those needed a terraform import to reattach the live resource to the state. We did them one at a time, ran plan between each, and watched the diff shrink.
# Example: the engineer had created a new IAM role during their apply.
# The restored state predates it, but the role exists in AWS.
terraform import \
  module.platform.aws_iam_role.svc_runner \
  acme-staging-svc-runner

# After each import, re-run plan and confirm the resource is no longer in the diff.
terraform plan -out=/tmp/plan.out

# Repeat for each resource genuinely changed during the divergence window:
# - 1 IAM role
# - 1 IAM role policy attachment
# - 2 security group rules
# - 1 SSM parameter
# - 1 Lambda permission
Import surgically. Do not bulk-import; you want a clean plan after each step so you can spot collateral damage.
After the sixth import, terraform plan returned No changes. That was the success signal. The state matched the cloud, the engineer's intended changes were preserved, and nothing healthy had been destroyed. Total time on the bridge from first page to clean plan was 2 hours 40 minutes. About 45 minutes of that was the investigation; the rest was careful, slow imports with verification between each one.
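The "plan is empty" check can also be made machine-readable rather than eyeballed. With `-detailed-exitcode`, `terraform plan` exits 0 when there are no changes, 2 when a diff remains, and 1 on error. A minimal sketch, with `plan_is_clean` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: turn terraform plan's detailed exit code into an explicit verdict.
# plan_is_clean is a hypothetical helper name, not part of Terraform.
plan_is_clean() {
  local rc=0
  terraform plan -input=false -detailed-exitcode >/dev/null || rc=$?
  case "$rc" in
    0) echo "clean: state matches configuration and cloud"; return 0 ;;
    2) echo "diff remains: keep importing"; return 2 ;;
    *) echo "plan failed (exit $rc): stop and investigate"; return 1 ;;
  esac
}
```

Running this between imports gives a yes/no answer, and it keeps "plan errored" separate from "plan still has a diff", which are easy to conflate when scrolling through output late in an incident.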
flowchart TD
A[terraform plan shows mass destroy/recreate] --> B{Are the resources actually broken in cloud?}
B -- No, healthy --> C[State file is the problem, not cloud]
B -- Yes, broken --> Z[Different incident; investigate cloud-side]
C --> D[list-object-versions on tfstate]
D --> E{Multiple writes in short window?}
E -- Yes --> F[Identify last clean version pre-collision]
E -- No --> Y[Investigate other corruption causes]
F --> G[Backup current broken state locally]
G --> H[copy-object to restore clean version]
H --> I[terraform plan: short diff = divergence window]
I --> J[terraform import each drifted resource]
J --> K{Plan empty?}
K -- No --> J
K -- Yes --> L[Recovery complete; write postmortem]
Decision flow we use for any state-collision incident. The first branch matters most: confirm the cloud is healthy before touching state.
Two tempting shortcuts that would have made it worse
Two shortcuts came up on the bridge that we ruled out. They are worth naming because both of them sound reasonable when you are tired.
1. Let terraform apply rebuild everything
The plan was already there. Just type yes. This would have caused 30 to 90 minutes of staging downtime for resources that did not need rebuilding, broken any data-layer resources with state of their own, and lost the audit trail of what had actually changed.
2. terraform refresh to fix the state
Refresh updates state from the live infrastructure for known resources. It does not learn about resources the state has forgotten, and it cannot undo a structurally corrupted state. Refresh on a Frankenstein state can deepen the damage by writing the merged view back as the new truth.
We have written about the broader pattern in the Terraform state recovery playbook, specifically the rule we now apply on every state incident: the state file is the suspect until proven otherwise. Cloud is healthy until you have evidence it is not. That ordering keeps you from running destructive applies under time pressure.
A pre-apply lock check that prints the holder's age
The team made two changes the week after the incident. Both are small. Both have already paid for themselves.
The first change is a pre-apply wrapper script that reads the DynamoDB lock table before terraform apply runs. If a lock exists, the script prints the lock holder, when the lock was acquired, and how long ago that was. If the lock is younger than the workspace's typical apply duration plus a safety margin, the script refuses to run and tells the engineer to wait. If the lock is genuinely old (older than any plausible live process), the script still does not force-unlock automatically; it prints the exact force-unlock command and makes the engineer paste it. The friction is the point.
#!/usr/bin/env bash
# pre-apply-lock-check.sh
set -euo pipefail
WORKSPACE="${1:?workspace name required}"
LOCK_TABLE="acme-tfstate-locks"
MAX_PLAUSIBLE_APPLY_SECONDS=1800 # 30 minutes
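# The lock item's LockID is "<bucket>/<key>". The "<bucket>/<key>-md5" item
# holds only the state digest and has no Info attribute.
LOCK_ITEM=$(aws dynamodb get-item \
  --table-name "$LOCK_TABLE" \
  --key "{\"LockID\":{\"S\":\"acme-tfstate-staging/env/${WORKSPACE}/terraform.tfstate\"}}" \
  --output json 2>/dev/null || echo '{}')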
if [[ "$(echo "$LOCK_ITEM" | jq -r '.Item // empty')" == "" ]]; then
echo "No lock. Safe to proceed."
exit 0
fi
HOLDER=$(echo "$LOCK_ITEM" | jq -r '.Item.Info.S' | jq -r '.Who + " @ " + .Operation')
CREATED=$(echo "$LOCK_ITEM" | jq -r '.Item.Info.S' | jq -r '.Created')
# date -d is GNU date syntax; on macOS laptops, install coreutils and use gdate.
AGE=$(( $(date +%s) - $(date -d "$CREATED" +%s) ))
echo "Lock present."
echo " Holder: $HOLDER"
echo " Created: $CREATED"
echo " Age: ${AGE}s"
if (( AGE < MAX_PLAUSIBLE_APPLY_SECONDS )); then
echo
echo "REFUSING TO PROCEED. Lock is younger than max plausible apply duration."
echo "Wait for the current holder to finish, or confirm out-of-band that it is dead."
exit 1
fi
echo
echo "Lock is older than ${MAX_PLAUSIBLE_APPLY_SECONDS}s. It may be stale."
echo "To force-unlock, run manually (do NOT automate this):"
echo " terraform force-unlock $(echo "$LOCK_ITEM" | jq -r '.Item.Info.S' | jq -r '.ID')"
exit 2
We run this from CI and from a pre-apply git hook on engineer laptops. Same script, same rules, both places.
The second change is operational. The team's runbook now says: if you ever run force-unlock, page the on-call channel immediately with the lock ID and the reason. That single message would have caught this incident before it became one. The CI job would have replied within seconds that it was still running, and the engineer would have known to wait the eight minutes instead of clobbering the state.
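The runbook rule is easier to follow if composing the message takes no thought. The sketch below only formats the announcement and refuses to do so without a reason. The function name `announce_force_unlock` and the message layout are assumptions, and how the line reaches the on-call channel is left to whatever chat tooling the team already uses.

```shell
#!/usr/bin/env bash
# Sketch: build the message to post BEFORE running terraform force-unlock.
# announce_force_unlock and the message layout are hypothetical.
announce_force_unlock() {
  local lock_id="${1:-}" reason="${2:-}"
  if [[ -z "$lock_id" || -z "$reason" ]]; then
    echo "usage: announce_force_unlock LOCK_ID REASON" >&2
    return 1
  fi
  printf 'FORCE-UNLOCK requested by %s at %s | lock=%s | reason=%s\n' \
    "${USER:-unknown}" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$lock_id" "$reason"
}
```

Requiring a reason is deliberate: writing one sentence about why the holder is believed dead is often enough to prompt a second look.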
We have stopped recommending that teams treat force-unlock as a routine command. It is a recovery command. It belongs in the same mental category as DROP TABLE: technically available, occasionally necessary, never the first thing you reach for. Terraform's state lock has no built-in expiry; it is released when the holder exits cleanly, so a lock that is still present usually means a process is still working. Wait it out, or confirm the holder is dead. Those are the only two paths.
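For the wait-it-out path, Terraform can do the waiting itself: the `-lock-timeout` flag retries lock acquisition for the given duration before failing. A minimal sketch, where the wrapper name `apply_waiting_for_lock`, the `LOCK_WAIT` variable, and the 20-minute default are all assumptions rather than the team's settings:

```shell
#!/usr/bin/env bash
# Sketch: wait for the current lock holder instead of reaching for force-unlock.
# apply_waiting_for_lock and LOCK_WAIT are hypothetical names; 20m is an
# assumed default, to be tuned to the workspace's typical apply duration.
apply_waiting_for_lock() {
  local wait="${LOCK_WAIT:-20m}"
  terraform apply -input=false -lock-timeout="$wait" "$@"
}
```

For example, `LOCK_WAIT=10m apply_waiting_for_lock /tmp/plan.out` applies a saved plan and waits up to ten minutes for the lock before giving up.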
When the state file is the suspect and the clock is running
The hard part of state-collision incidents is not the recovery commands. The commands are mechanical once you know the shape of the problem. The hard part is the 20 minutes before that, when an apply plan is sitting in your terminal showing 30+ destroys, someone senior is asking on Slack whether you can just run it, and you have to decide whether the cloud is broken or the state is. Get that wrong under pressure and you cause the outage you were trying to prevent.
We run these recovery engagements every week. The force-unlock-collision pattern has shown up four times this quarter alone, in three different shapes: a CI plan racing an engineer apply (this one), two engineers applying simultaneously after a Slack misunderstanding, and a long-running import operation that an engineer killed because they thought it had hung. The recovery shape is the same. The diagnostic discipline of confirming the cloud is healthy before touching state is the same. The thing that changes is which version of state is the right one to restore to, and that takes practice to spot quickly.
If you are staring at a terraform plan that wants to destroy resources you know are healthy, do not run apply. Book an infrastructure review with our team and we will be on a bridge with you the same day to work through the state restore and the surgical imports. We have done this enough times that we can usually have you back to an empty plan inside three hours.