Infrastructure review and recovery for SaaS teams

Insights

Clear, opinionated recovery notes for SaaS infrastructure

This is the discovery library for teams who want to understand failure patterns, recovery logic, and the kind of review InfraForge runs before they submit a review request.

Terraform and IaC recovery

Kubernetes and GitOps stability

Migration and audit readiness

Decision aids, not filler

How to use this page

Read by issue cluster, not by publish order

The useful question is not "what was published last?" It is "which problem class matches the pressure the team is under right now?"

Best way to navigate

Terraform and IaC reliability.
Kubernetes, GitOps, and release stability.
Migration recovery, audit readiness, and control design.

When to stop reading

If the failure pattern is already familiar, request the review.
If you need proof, use case studies next.
If you just need the checklist, download the PDF directly.

Start here

Featured recovery notes

A short set of strong entry points before you go broader.

Infrastructure insight

InfraForge Note

recovery guidance for SaaS teams

risk / recovery / owners

ArgoCD CVE-2022-24348: a Secret leak that hid in log volume

How a ConfigMap path traversal under ArgoCD CVE-2022-24348 leaked a cross-namespace Keycloak Secret for 3 days, and the recovery sequence that actually stopped it.

GitOps recovery | 11 min read

Why Grafana OnCall acknowledgments hang after a Helm upgrade migration

A partial Django migration left Grafana OnCall with a missing column. Acks returned 500 for 72 hours while alerts piled into zombie incidents. Here's the fix.

Migration recovery | 11 min read

Why a deleted backup Lambda kept billing 9,400 EBS snapshots

An EBS Snapshot line of $1,830 a month came from a Lambda deleted a year earlier. Here is how we found the 9,408 orphans and the tagging rule we wrote.

Cloud cost triage | 10 min read

Why one shared Terraform module made every PR a 14-service change

A consolidated Terraform module turned every PR into 14 service plans with 1,400 resource changes. How we pinned, tiered, and split it back apart.

Terraform & IaC debt | 11 min read

All recovery notes

Browse the full library

INFRA

ArgoCD CVE-2022-24348: a Secret leak that hid in log volume

How a ConfigMap path traversal under ArgoCD CVE-2022-24348 leaked a cross-namespace Keycloak Secret for 3 days, and the recovery sequence that actually stopped it.

GitOps recovery | 11 min read

Why Grafana OnCall acknowledgments hang after a Helm upgrade migration

A partial Django migration left Grafana OnCall with a missing column. Acks returned 500 for 72 hours while alerts piled into zombie incidents. Here's the fix.

Migration recovery | 11 min read

Why a deleted backup Lambda kept billing 9,400 EBS snapshots

An EBS Snapshot line of $1,830 a month came from a Lambda deleted a year earlier. Here is how we found the 9,408 orphans and the tagging rule we wrote.

Cloud cost triage | 10 min read

Why one shared Terraform module made every PR a 14-service change

A consolidated Terraform module turned every PR into 14 service plans with 1,400 resource changes. How we pinned, tiered, and split it back apart.

Terraform & IaC debt | 11 min read

When ArgoCD shows Healthy but Keycloak silently strips JWT claims

ArgoCD synced a Keycloak realm ConfigMap with OVERWRITE strategy and silently stripped JWT claims across six clients. Here is how we recovered without dropping sessions.

GitOps recovery | 11 min read

Why a Terraform apply hangs 90 minutes on a custom provider with no timeout

A 200-entry destroy hung for 90 minutes because a custom Terraform provider skipped context timeouts. How we recovered the half-updated state and fixed the provider.

Terraform state recovery | 11 min read

Grafana 'No Data' after migration: 7 reconcilers we had to kill first

Grafana dashboards went blank post-migration and every fix reverted in minutes. Here is how we found the reconcilers and restored the observability stack.

K8s reliability | 11 min read

When MinIO Deny Wins Cause Silent Upload Failure

A MinIO bucket policy with an explicit Deny silently swallowed 12k uploads while the SDK returned 200. Here is how we found it and the audit alert we added.

Object storage recovery | 11 min read

ArgoCD Drift: Three Namespaces, One JWT Hotfix

A JWT rotation hotfix left three ConfigMaps in three different states and Git stale. Here is how we found the canonical truth and committed it back without breaking auth.

GitOps recovery | 11 min read

How we recovered tfstate after force-unlock raced a CI apply

A force-unlock collided with a running CI apply and corrupted tfstate. Here is how we restored the S3 version and re-imported the drifted resources.

Terraform state recovery | 11 min read

Why terraform apply fails when plan passes: the map(any) trap

A 15th map(any) input collided with an existing key three module layers down. plan passed, apply failed. Here is how we traced it and untangled the root.

IaC recovery | 11 min read

Why a forgotten RDS replica added $8,600 to one AWS bill

How a cross-AZ RDS read replica left over from a load test retried writes every 50ms and quietly tripled an AWS bill in six days.

Cost spike triage | 9 min read

Init container cascade when every kubectl patch reverts in 10 seconds

Three init containers stuck in cascade and every kubectl patch reverted within ten seconds. Here is how we found the source of truth and fixed it.

Kubernetes recovery | 11 min read

CONTROL

Audit readiness

Change Control

approvals, rollback, evidence

approval / rollback / evidence

Infrastructure change control checklist for audit-ready SaaS teams

A practical checklist for approvals, validation evidence, rollback discipline, and audit-ready change control.

Audit readiness | 9 min read

GITOPS

GitOps recovery

Drift Triage

classify, compare, reconcile

render / cluster / git

GitOps drift triage checklist for production teams

A fast triage sequence for classifying GitOps drift, comparing rendered output, and restoring sync trust.

GitOps recovery | 9 min read

CUTOVER

Migration recovery

Blast Radius

dependencies, cutover, risk

routing / identity / data

Migration blast radius mapping framework for SaaS platforms

A practical framework to map hidden migration dependencies and contain post-cutover reliability risk.

Migration recovery | 10 min read

ARGOCD

Kubernetes reliability

ArgoCD Sync

out of sync -> safe again

sync / render / rollback

ArgoCD sync failed recovery playbook for production teams

A recovery sequence for repeated ArgoCD sync failures, drift reconciliation, and safer release flow.

Kubernetes reliability | 11 min read

DECISION

IaC strategy

Terragrunt vs Terraform

ownership and change safety

layers / envs / owners

Terragrunt vs Terraform for growth-stage SaaS: decision framework

A practical decision framework for selecting IaC structure based on ownership and change safety.

IaC strategy | 12 min read

EVIDENCE

Audit readiness

Evidence Pack

proof that survives scrutiny

logs / change / owners

Audit evidence pack for SaaS infrastructure teams

A practical evidence-pack structure for audit readiness without slowing product delivery.

Audit readiness | 10 min read

SPEND

Cost control

Spend Triage

find drivers before broad cuts

compute / data / egress

Cloud cost spike triage framework for engineering leads

A systems-first triage flow to isolate spend drivers and reduce cost safely.

Cost control | 10 min read

RELEASE

Kubernetes reliability

Release Runbook

promote, verify, roll back

build / deploy / verify

Kubernetes release stabilization runbook

A practical runbook to make rollouts deterministic and rollback paths reliable.

Kubernetes reliability | 11 min read

POSTCUT

Migration recovery

Stabilization

30-day recovery sequence

week 1 / week 2 / week 4

Post-migration stabilization checklist for SaaS teams

A 30-day stabilization sequence for teams whose platform got shakier after migration.

Migration recovery | 10 min read

DETECT

IaC prevention

Drift Detection

workflow teams actually keep up

detect / triage / reconcile

Terraform drift detection workflow teams actually maintain

A practical drift detection workflow with ownership, triage, and reconciliation rules that hold up under pressure.

IaC prevention | 11 min read

REFACTOR

IaC scalability

Module Refactor

lower coupling in phases

split / test / handoff

Terraform module refactor strategy for growth-stage SaaS

A phased module refactor strategy that lowers coupling and avoids production disruptions.

IaC scalability | 12 min read

INCIDENT

IaC incident response

Apply Incident

contain, inspect, recover

freeze / state / recover

Failed Terraform apply incident response checklist

A practical incident sequence to contain impact, reconcile state, and prevent repeat failures.

IaC incident response | 10 min read

STATE

IaC recovery

State Recovery

repair trust in state

locks / owners / imports

Terraform state recovery playbook for SaaS teams

A practical sequence to repair state trust, reduce blast radius, and restore predictable infrastructure changes.

IaC recovery | 12 min read

GUARDRAIL

IaC safety

Apply Guardrails

review and rollback discipline

review / plan / rollback

Safe Terraform apply guardrails for production SaaS

A guardrail system for CI/CD, review, and rollback that makes Terraform applies boring again.

IaC safety | 11 min read

REVIEW

Review checklist

Review Prep

send the right signals first

critical path / risk / owners

Infrastructure review checklist for SaaS teams under pressure

A fast decision guide for when to request a review and what to prepare so the response is actionable.

Review checklist | 8 min read

RECOVERY

IaC recovery

Drift Recovery

safe change control restored

freeze / inventory / apply

Terraform drift recovery: stabilize IaC without stalling delivery

A practical recovery plan for drift, fear-of-apply, and brittle modules with guardrails that last.

IaC recovery | 9 min read

Checklist

Prefer the PDF?

Use the checklist when you want a short review aid without reading through the full article library.

The Infrastructure Review Checklist is public and ready to download.

Use it to map critical paths, drift signals, release safety questions, and evidence gaps before you request the review.

Need a review?

If the platform feels fragile, stop reading and request the review.