Why one shared Terraform module made every PR a 14-service change

The PR that shipped the bug had three approvals and a comment that read "LGTM, plans look normal." The plans were not normal. They were 14 separate terraform plan outputs stacked in the CI log, each touching 80 to 120 resources, totaling around 1,400 resource changes for what the author described as a typo fix in a shared module. Buried somewhere in plan number nine was a change to an IAM policy attachment that broke three services on apply. Nobody had read past plan three. The team had spent six months congratulating themselves on collapsing 8,000 lines of Terraform into 1,200, and the bill for that consolidation had just arrived.

Terraform & IaC debt | 11 min read

Problem signal

Every PR touching a shared module shows N service plans in CI, each with 50+ resource changes
Reviewers approve with "plans look normal" without scrolling through them
A single shared module has accumulated 25 to 40 input variables to handle per-service edge cases
CI plan time grows with every consumer added, because each consumer plan has to load and refresh its own remote state
A bug in the module breaks multiple unrelated services in the same apply window

The PR that broke three services had a clean LGTM

How a 1,400-resource plan output stopped being read

The original consolidation was, on paper, exactly the refactor every platform team is told to do. Fourteen service-specific Terraform configs, each maintained by a different feature team, each with its own subtle drift from the others. The platform team pulled the common shape out into one service-stack module, parameterized the differences, and pointed all 14 services at it. Eight thousand lines of HCL became twelve hundred. A change to add a shared observability sidecar landed across all 14 services in a single PR. Everyone celebrated.

The failure mode took six months to surface because the early signals looked like wins. Module changes shipped faster than the per-service changes they replaced. The platform team felt productive. What nobody tracked was that the CI plan output for every module PR had grown from one service's plan to fourteen, and the reviewers had silently adapted by reading the first plan, skimming the second, and rubber-stamping the rest.

Then a module-level change to how IAM policies were attached introduced a subtle bug: for services that overrode the default policy document, the new code path replaced the default rather than merging with it. Three of the 14 services overrode that default. The plan output showed the destruction and recreation of those policy attachments quite clearly, somewhere around line 870 of the GitHub diff view. The PR had three approvals.

Title: fix: typo in service-stack variable description

Diff: 1 line changed (a comment)

CI: terraform-plan-all ✓
  - service-a/plan: 84 changes
  - service-b/plan: 91 changes
  - service-c/plan: 102 changes
  - service-d/plan: 88 changes
  ... (10 more)

Total: 1,388 resource changes

Reviews:
  @platform-lead   approved 'LGTM, plans look normal'
  @service-b-eng   approved 'looks fine'
  @service-g-eng   approved

What the PR description looked like, paraphrased from the post-mortem
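The replace-versus-merge distinction behind the bug can be sketched in a few lines. This is a hypothetical reconstruction, not the module's actual code: it assumes the AWS provider, a custom_iam_policy_json input, and an invented default statement, and it uses the aws_iam_policy_document data source's source_policy_documents and override_policy_documents arguments to show what a merge looks like.

```hcl
# Hypothetical sketch; variable name, sid, and actions are illustrative only.
variable "custom_iam_policy_json" {
  type    = string
  default = null
}

# The module's built-in default policy (invented for this example).
data "aws_iam_policy_document" "default" {
  statement {
    sid       = "ReadOwnConfig"
    actions   = ["ssm:GetParameter"]
    resources = ["*"]
  }
}

locals {
  # Replace: as soon as a service supplies an override, every default
  # statement is dropped from the policy it ends up with.
  policy_replaced = coalesce(
    var.custom_iam_policy_json,
    data.aws_iam_policy_document.default.json,
  )
}

# Merge: default statements are kept, and the override is layered on top.
# Override statements replace defaults only where the sid matches.
data "aws_iam_policy_document" "merged" {
  source_policy_documents   = [data.aws_iam_policy_document.default.json]
  override_policy_documents = var.custom_iam_policy_json == null ? [] : [var.custom_iam_policy_json]
}
```

Under these assumptions, a service that never sets the override sees no difference between the two paths, which is why only the overriding services would be affected.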

What we thought it was, what it actually was

The fix that was not reviewer discipline

The first instinct, and the one the team had spent two weeks pursuing before we got involved, was that this was a code review hygiene problem. They had written a PR template that required reviewers to acknowledge they had read each plan. They had a Slack bot that posted a daily "unreviewed plan changes" count. The platform lead had given a brown bag talk titled "Read Your Plans." None of it stuck, because none of it could stick. Asking a human to read plan output covering some 1,400 resource changes for a one-character comment fix is asking them to do something nobody should do, and they will not do it for long even if you make them feel guilty about it.

The actual problem was structural. The module had become a dependency surface that 14 consumers were forced to redeploy together, on every change, whether the change affected them or not. That is not a code review problem. That is the same coupling problem distributed systems people argue about with monoliths and microservices, except it had snuck in through the back door of a Terraform refactor. The cost of coupling does not show up the day you consolidate. It shows up the first time a small change has to ship and the blast radius is the entire fleet.

We have written more on the broader pattern in the Terraform and IaC debt pillar, but the specific recovery for this shape of problem has three layers, and they have to land in order.

Pinning the module per service was the bleeding stopper

How we cut the blast radius from 14 to 1 in an afternoon

The immediate move was to stop every consumer from being forced to re-plan on every module change. The mechanism is dumb and effective: pin each service's module reference to an explicit git ref instead of letting them all track main.

# Before: every consumer floats on main
module "service" {
  source = "git::https://github.com/org/modules.git//service-stack"
  name   = "auth-api"
  # ...
}

# After: every consumer is pinned to an explicit version
module "service" {
  source = "git::https://github.com/org/modules.git//service-stack?ref=v1.4.2"
  name   = "auth-api"
  # ...
}

Before and after: the module block in each service's Terraform config

After the pin, a module change ships as a tagged release in the modules repo, then ships to consumers one at a time via a per-service PR that bumps the ref. Each of those PRs shows exactly one service's plan, and that plan is short enough to read. The reviewers can do their job again. The author has to think about which services they actually want this change in, in what order, and on what schedule.
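To make the staged rollout concrete, the sketch below shows two consumers midway through a bump. The v1.4.3 tag, the file paths, and the payments-api name are hypothetical; the point is only that two services can sit on different releases at the same time, each moved by its own PR.

```hcl
# service-a/main.tf (hypothetical path): already bumped in its own PR.
# v1.4.3 is an assumed next release tag.
module "service" {
  source = "git::https://github.com/org/modules.git//service-stack?ref=v1.4.3"
  name   = "auth-api"
  # ... remaining inputs unchanged
}

# service-b/main.tf (hypothetical path): stays on the previous release
# until its own bump PR is reviewed and merged.
module "service" {
  source = "git::https://github.com/org/modules.git//service-stack?ref=v1.4.2"
  name   = "payments-api" # hypothetical service name
  # ... remaining inputs unchanged
}
```

The diff in each bump PR is a single ref change, so the plan attached to it reflects only what the new release does to that one service.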

There is a real cost to this, and we want to name it honestly: you have given up some of the consolidation win. You can no longer ship an observability change to all 14 services in one PR. You can ship it in one tagged module release plus 14 small bump PRs, which is more clicks. We have not had a client regret the trade once they lived with it for a month. The clicks are cheap; the missed bug in plan number nine is not.

flowchart LR
  subgraph Before
    M1[Module main branch] --> S1A[service-a plan]
    M1 --> S1B[service-b plan]
    M1 --> S1C[14 plans total in one PR]
  end
  subgraph After
    R[Module repo PR] --> M2[Module v1.4.2 tag]
    M2 --> P1[service-a bump PR: 1 plan]
    M2 --> P2[service-b bump PR: 1 plan]
    M2 --> P3[bump rolls per service]
  end

The change-propagation shape before and after pinning

Tiering inputs and splitting by change velocity

Why the 30-input module became 5 inputs plus an advanced object

Pinning bought time. It did not fix the underlying reason the module had become hard to change. We sat with the platform team and looked at all 30 inputs the module had grown. Most services used 5 of them. The other 25 existed because, over six months, individual services had asked for an escape hatch ("can the module take a custom IAM policy document?", "can we override the security group rules?", "can we set a node selector?") and the module owners had said yes, every time, because saying no felt like blocking a teammate. The module had become an everything-bagel.

We refactored the input surface into two tiers. The common five became first-class top-level inputs. The other 25 went into an optional advanced object with optional() fields, so a normal consumer never sees them and an exotic consumer has to opt in deliberately.

# Common path: every service uses these
variable "name"        { type = string }
variable "image"       { type = string }
variable "replicas"    { type = number }
variable "environment" { type = string }
variable "port"        { type = number }

# Escape hatch: explicit, optional, and visible in code review.
# optional() attributes require Terraform 1.3 or later.
variable "advanced" {
  type = object({
    custom_iam_policy_json = optional(string)
    extra_sg_rules         = optional(list(object({ ... })))
    node_selector          = optional(map(string))
    # ... 22 more rarely-used knobs
  })
  default = {}
}

variables.tf after the tiering refactor
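From the consumer side, the two tiers might look like the sketch below. The image references, replica counts, ports, service names, and node selector label are all hypothetical; only the variable names come from the variables.tf above.

```hcl
# Typical consumer (one service's config): the five common inputs and nothing else.
module "service" {
  source      = "git::https://github.com/org/modules.git//service-stack?ref=v1.4.2"
  name        = "auth-api"
  image       = "registry.example.com/auth-api:1.0.0" # hypothetical image
  replicas    = 3
  environment = "prod"
  port        = 8080
}

# Exotic consumer (a different service's config): opts in to one escape hatch.
# Every other field of "advanced" stays null because it is declared optional().
module "service" {
  source      = "git::https://github.com/org/modules.git//service-stack?ref=v1.4.2"
  name        = "batch-worker" # hypothetical service name
  image       = "registry.example.com/batch-worker:1.0.0"
  replicas    = 2
  environment = "prod"
  port        = 9090

  advanced = {
    node_selector = { workload = "batch" } # hypothetical label
  }
}
```

The design benefit is that any use of an escape hatch now shows up as an advanced block in the consumer's diff, where a reviewer can see it and ask why it is there.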

Then we did the harder work: splitting the module along change-velocity boundaries. The monitoring submodule was changing roughly weekly, the database submodule once a quarter, and the networking submodule about twice a year. Bundling them together meant every monitoring tweak forced a re-plan of database and networking resources for all 14 consumers. We pulled them apart into separate modules with separate version pins, so a consumer can bump monitoring from v2.1 to v2.2 without touching database at all.

This is the same argument microservice advocates make about service boundaries, and the rule of thumb is the same: couple things that change together, decouple things that change at different rates. The cost of getting it wrong in Terraform is not latency or distributed-transaction pain. It is plan output nobody reads, and bugs that ship because of it.
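A consumer of the split modules could then look something like this. It is a sketch under assumptions: the subdirectory names, the per-module tag prefixes, and every version other than monitoring v2.2 are invented for illustration, and each module's real inputs are left out.

```hcl
# Hypothetical layout: one module per change-velocity tier,
# each pinned to its own tag series.

# Changes roughly weekly: bumped often, in small PRs.
module "monitoring" {
  source = "git::https://github.com/org/modules.git//monitoring?ref=monitoring-v2.2"
  name   = "auth-api"
}

# Changes about once a quarter: untouched by a monitoring bump.
module "database" {
  source = "git::https://github.com/org/modules.git//database?ref=database-v1.3"
  name   = "auth-api"
}

# Changes about twice a year.
module "networking" {
  source = "git::https://github.com/org/modules.git//networking?ref=networking-v1.0"
  name   = "auth-api"
}
```

With this shape, moving monitoring from v2.1 to v2.2 is a one-line change to a single source string, and the resulting plan should contain only monitoring resources.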

Pin per consumer

Each service references the module at an explicit git tag. Module changes ship as releases, then propagate per service.

Tier the inputs

Five common inputs stay first-class. The long tail moves into an optional advanced object so escape hatches are explicit.

Split by change velocity

Submodules that change at different rates become separate modules with separate version pins. A weekly change does not drag a quarterly one along.

Gate multi-service plans

An OPA or CI check fails any PR whose plan touches more than three workspaces unless the description includes allow-multi-service: yes.

The OPA gate is worth its own sentence. It is twenty lines of Rego that counts distinct workspaces touched by the plan and fails the PR over a threshold unless the author explicitly opts in. It does not prevent fleet-wide changes; it forces the author to acknowledge they are making one. That single check has caught two accidental fleet-wide PRs at the clients who have adopted it, both of which would have shipped under the old regime.

If your shared modules feel like this right now

When the consolidation win has become a coupling tax

The hard part of this kind of recovery is not the technical work. Pinning a module ref is ten minutes of typing per service. Splitting a module along change-velocity lines is a weekend. The hard part is convincing the team that the consolidation they are proud of has become a liability, and doing the unwind in an order that does not cause an outage. We have done this engagement at four SaaS platforms in the last year, and the pattern of "reviewer fatigue followed by a buried bug" shows up in three of the four. The fourth caught it before the bug shipped, only because their CI plan output had grown past GitHub's diff size limit and forced the conversation.

We run these recovery engagements every week. If your platform team is shipping module changes that produce thousand-line plan outputs and your reviewers have started writing "plans look normal" without reading them, the next bug is already on its way. Book an infrastructure review with our team and we will spend a 30-minute diagnostic call this week mapping your module consumer graph and naming the first three pins to put in place.
