
Why a Terraform apply hangs 90 minutes on a custom provider with no timeout

Two hundred destroys that needed 40 seconds of real work hung for 90 minutes. The platform team kicked off a terraform apply to remove stale config entries from an internal service, watched the output stop at minute 12, and then stared at a frozen terminal until someone finally ran kill -9. By that point the state file was half-updated, the DynamoDB lock was still held, and nobody was sure which of the 200 entries had actually been deleted. The custom Terraform provider doing the destroys had a synchronous HTTP call with no context timeout, and the backend behind it was rate-limiting at 5 RPS. Neither side was wrong on its own. The contract between them was broken.

Problem signal
  • terraform apply prints no output for 20+ minutes after destroys begin, no progress, no errors
  • The backend service is healthy on its dashboard but throttling requests at a low RPS limit
  • kill -9 on the terraform process leaves the DynamoDB state lock held forever
  • After force-unlock, terraform state list shows resources that no longer exist in the cloud
  • The custom provider in use was written internally and documents no support for a timeouts {} block
Forty seconds of work, ninety minutes of silence

What the team thought was happening, and what was actually happening

The first assumption was that the internal config service was hung. It was not. Its dashboard showed it healthy and serving requests, just slowly. The second assumption was that terraform was making progress and just not printing anything. That one was half true. Terraform was making progress, at exactly 5 deletes per second, which is the rate limit the backend was enforcing. With 200 entries that is 40 seconds of real work. The team waited 90 minutes.

The reason for the gap was a custom Terraform provider written by a previous platform team. Its Delete function looked roughly like the snippet below. No context. No timeout. No backoff and no bound on retries. No progress emission back to Terraform's UI layer. When the backend returned a 429, the provider slept for a second, swallowed the error, and tried again. Forever. Because the provider never returned from Delete, Terraform saw a call still in flight and waited.

func resourceConfigEntryDelete(d *schema.ResourceData, meta interface{}) error {
    client := meta.(*ConfigClient)
    id := d.Id()

    // No context. No timeout. No bound on retries.
    for {
        err := client.DeleteEntry(id)
        if err == nil {
            return nil
        }
        if isRateLimited(err) {
            time.Sleep(1 * time.Second)
            continue
        }
        return err
    }
}

The shape of the broken Delete function (reconstructed from the provider source)

What this should have been is below. Declaring a schema.ResourceTimeout on the resource lets users set a timeouts {} block on it. The context carries that deadline. When the deadline expires, the provider returns an error, Terraform reports the failed destroy, and the resource stays in state for the next run (tainting applies to failed creates, not failed destroys). The call does not stay silently in-progress for the rest of human history.

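import (
    "context"

    "github.com/hashicorp/terraform-plugin-sdk/v2/diag"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/retry"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func resourceConfigEntryDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
    client := meta.(*ConfigClient)
    id := d.Id()

    // Retry only on rate limiting, and only until the delete timeout expires.
    err := retry.RetryContext(ctx, d.Timeout(schema.TimeoutDelete), func() *retry.RetryError {
        err := client.DeleteEntryWithContext(ctx, id)
        if err == nil {
            return nil
        }
        if isRateLimited(err) {
            return retry.RetryableError(err)
        }
        return retry.NonRetryableError(err)
    })
    if err != nil {
        // RetryContext returns a plain error; convert it to diagnostics.
        return diag.FromErr(err)
    }
    return nil
}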

What the Delete function should look like

Why kill -9 left us worse off

The half-updated state and the stuck DynamoDB lock

When the engineer finally ran kill -9 on the terraform process, two things happened that compounded the problem. First, the DynamoDB lock entry stayed exactly where it was. Terraform releases its lock on graceful shutdown, not on SIGKILL. So the next person who ran terraform plan got the familiar error and assumed someone else was still working on it. They were not. The lock was a ghost.

Second, the destroys had been running serially, throttled by the backend's 5 RPS limit, for the 12 minutes before the hang became obvious (the team realized later they had actually waited longer than they thought before noticing the silence), and roughly 60 of the 200 entries had actually been deleted from the backend. Terraform had updated its in-memory state as each delete returned, but it had not yet flushed that state to the remote backend: in this setup remote state was persisted at the end of the apply, not after each resource, and SIGKILL gives Terraform no chance to write a final snapshot. So all 60 of those successful deletes were lost from the state file. The cloud was missing 60 entries that tfstate still claimed existed.

Before doing anything else we confirmed the terraform process was actually dead on the operator's machine. ps aux | grep terraform, on the actual machine, not a tmux pane from yesterday. We have force-unlocked locks that turned out to belong to a process still doing useful work, and the damage is worse than a stuck lock. Once confirmed dead, terraform force-unlock with the lock ID from the error message released DynamoDB.

# 1. Confirm no terraform process is running on the operator's machine
ssh operator-host 'ps aux | grep -v grep | grep terraform'

# 2. Release the lock (lock ID comes from the error message)
terraform force-unlock 7c4a3e22-1b9d-4e8a-b6d7-9f2a8c5e4d11

# 3. See what state thinks vs what the cloud actually has
terraform plan -refresh-only

# 4. Apply the refresh so state matches reality
terraform apply -refresh-only

The recovery sequence after confirming the process is dead

Reconciling state against a half-finished destroy

Scripting state rm and import for 200 entries

After the refresh-only apply, state and cloud agreed on what existed. But the original goal, deleting all 200 entries, was still only partially done. We now had two populations to handle: entries that still existed both in tfstate and in cloud (the destroy had not gotten to them), and entries that had been removed from cloud during the hung apply and were no longer in tfstate either (the refresh had cleaned them up). The first group we could destroy normally. The second group needed nothing further.

Where it got annoying was a third population we discovered later: a smaller set of entries that had been deleted from cloud by the hung apply, but where the refresh had failed to notice because the provider's Read function had the same no-timeout bug and was returning stale cached data. Those entries were ghosts in tfstate. For each one we had to run terraform state rm by address. With 47 of them, we scripted it from a diff.

# Pull current tfstate resource list
terraform state list | grep config_entry > tfstate_entries.txt

# Pull live entries from the backend; --retry waits and retries on transient
# errors, including HTTP 429
curl -sf --retry 5 "$CONFIG_API/entries" | jq -r '.[].id' > live_entries.txt

# Entries in tfstate but not in cloud: these are ghosts.
# Prefix the IDs to match state addresses, then sort both sides identically for comm.
comm -23 <(sort tfstate_entries.txt) \
         <(sed 's|^|module.config.config_entry.|' live_entries.txt | sort) > ghosts.txt

# Remove them from state
while IFS= read -r addr; do
  terraform state rm "$addr"
done < ghosts.txt

Generating the state rm commands from a diff between tfstate and the live backend
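The same diff can be expressed as a plain set difference, which is easier to unit-test than a shell pipeline. This is a minimal sketch under assumptions: ghostAddresses is a hypothetical helper, and it assumes every entry ID maps to a state address by prepending a fixed prefix, as in the script above. Resources declared with for_each or count would need a different mapping.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ghostAddresses returns the state addresses under addrPrefix whose entry ID
// is absent from the live backend. addrPrefix is the assumed prefix that
// turns an entry ID into a resource address.
func ghostAddresses(stateAddrs, liveIDs []string, addrPrefix string) []string {
	live := make(map[string]struct{}, len(liveIDs))
	for _, id := range liveIDs {
		live[addrPrefix+id] = struct{}{}
	}

	var ghosts []string
	for _, addr := range stateAddrs {
		if !strings.HasPrefix(addr, addrPrefix) {
			continue // not a config entry; leave it alone
		}
		if _, ok := live[addr]; !ok {
			ghosts = append(ghosts, addr)
		}
	}
	sort.Strings(ghosts)
	return ghosts
}

func main() {
	// Illustrative data only: three entries in state, two still live.
	state := []string{
		"module.config.config_entry.alpha",
		"module.config.config_entry.beta",
		"module.config.config_entry.gamma",
		"module.network.other_resource.main",
	}
	live := []string{"alpha", "gamma"}

	for _, addr := range ghostAddresses(state, live, "module.config.config_entry.") {
		fmt.Println("terraform state rm", addr)
	}
}
```

Printing the commands instead of running them leaves room to review the list before anything is removed from state.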

For the inverse case (entry exists in cloud but not in tfstate), the recovery is terraform import. We did not hit this on this incident but we have hit it on similar ones, and the same diff approach works in the other direction. The general pattern for any half-finished Terraform operation against a custom provider is laid out in our Terraform state recovery playbook.

What the provider should have done

The contract every custom Terraform provider has to honor

A custom Terraform provider is a contract. Terraform's whole supervision model assumes the provider plays by it. The contract is short: Create, Read, Update, and Delete each accept a context, each respect the user's timeouts {} block, each emit clear errors when something goes wrong, and each return in bounded time. When a provider violates the contract, Terraform's user-facing behavior degrades in ways that look like Terraform bugs but are not.

Internal providers skip the contract more often than vendor ones, because the team that writes the provider also runs the backend it talks to, and they convince themselves they have full visibility. They do not. terraform-cli is a separate process. It cannot see your retry loop. It cannot see your in-flight HTTP call. All it sees is a function that has not yet returned. The fix for this provider was four changes:

1. Accept context on every CRUD function

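Migrate from the legacy schema.CreateFunc signatures to the context-aware schema.CreateContextFunc variants, set through the CreateContext, ReadContext, UpdateContext, and DeleteContext fields. terraform-plugin-sdk v2 still accepts the legacy fields but deprecates them, and the legacy functions receive no context at all, so treat this change as non-optional.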

2. Declare and honor timeouts on every resource

Add a Timeouts: &schema.ResourceTimeout{Create: schema.DefaultTimeout(5 * time.Minute), Delete: schema.DefaultTimeout(5 * time.Minute)} block on every resource schema. Use d.Timeout(schema.TimeoutDelete) inside the function.

3. Replace internal retry loops with retry.RetryContext

The retry helper respects the context deadline and surfaces retryable vs non-retryable errors cleanly. Hand-rolled for-loops over time.Sleep do not.

4. Pin the fixed version via .terraform.lock.hcl

Release a new patch version of the provider, update the lockfile, and remove the old version from your internal registry so nobody can fall back to it.

The apply pattern itself also needed a change. Destroying 200 entries in one shot against a 5 RPS backend is asking for trouble even with a correct provider, because a 5-minute timeout per resource is generous when one resource genuinely takes 200ms but useless when the queue ahead of you is 199 other deletes. We split future bulk operations into batches of 10 using -target, or we push the backend team to expose a bulk delete endpoint. The provider then wraps the bulk endpoint as a single resource operation instead of looping.
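The batching step can be sketched as a small helper that turns a list of resource addresses into groups of -target flags, one group per terraform invocation. This is an illustrative sketch: targetBatches and the generated addresses are hypothetical, and the batch size of 10 simply mirrors the number mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// targetBatches splits addrs into groups of at most size and renders each
// group as the -target flags for a single terraform invocation.
func targetBatches(addrs []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(addrs); start += size {
		end := start + size
		if end > len(addrs) {
			end = len(addrs)
		}
		flags := make([]string, 0, end-start)
		for _, addr := range addrs[start:end] {
			flags = append(flags, "-target="+addr)
		}
		batches = append(batches, flags)
	}
	return batches
}

func main() {
	// Hypothetical addresses; in practice these would come from terraform state list.
	var addrs []string
	for i := 1; i <= 25; i++ {
		addrs = append(addrs, fmt.Sprintf("module.config.config_entry.e%02d", i))
	}

	for i, flags := range targetBatches(addrs, 10) {
		fmt.Printf("# batch %d (%d resources)\n", i+1, len(flags))
		fmt.Println("terraform apply", strings.Join(flags, " "))
	}
}
```

Each batch then fails or succeeds on its own, so a throttled backend costs one small retry rather than leaving 200 resources in an unknown state.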

flowchart TD
  A[terraform apply] -->|calls Delete with context| B[Custom provider]
  B -->|HTTP DELETE with deadline| C[Internal config service]
  C -->|429 rate limit| B
  B -->|RetryableError until ctx deadline| A
  A -->|on deadline: report error, keep resource in state, release lock| D[Clean failure state]
  style D fill:#1f3a1f,color:#fff

The relationship that broke and what fixes each side

If you are looking at a hung apply right now

When a custom provider has left your state in an unknown shape

Hung Terraform applies against internal providers are the kind of incident that sounds boring in a postmortem and feels terrifying in the moment. You cannot tell if the apply is still doing useful work or stuck forever. You cannot kill it without risking a half-finished state. You cannot force-unlock until you are certain the process is dead. And once you do recover, you do not actually know which resources got modified and which did not, because the provider did not emit progress and the state file was not flushed.

We run these recovery engagements often enough that the script above is templated. The no-timeout custom provider pattern shows up in maybe one in five of the Terraform recoveries we have done this year, almost always with internal providers written years ago by an engineer who has since left. The fix is mechanical once you know the shape of the failure: confirm process death, force-unlock, refresh-only plan, diff state against cloud, reconcile with state rm and import, then patch the provider so it cannot happen again.

If you are staring at a hung apply right now and you are not sure whether to kill it, book an infrastructure review with our team and we will be on a bridge with you the same day. If the apply is already dead and you are sorting through the wreckage, the same engagement covers the state reconciliation and the provider fix together.
