Insights

Why a forgotten RDS replica added $8,600 to one AWS bill

The finance lead forwarded the AWS bill on a Monday morning with three question marks in the subject line. The number had gone from a steady $3,200/month to $11,800 in six days. The on-call engineer's first guess, sensible enough, was that a data scientist had left a cross-region Athena job running over the weekend. It was not. It was an RDS read replica in a different AZ from its primary, provisioned a month earlier for a one-off load test, never decommissioned, retrying a replication-stream write every 50 milliseconds because somebody had flipped the primary's binlog format mid-stream. Nobody had read from the replica in three weeks. It had been quietly burning cross-AZ data transfer the whole time.

Cost spike triage | 9 min read

Problem signal

AWS bill jumped 2-4x in under a week with no traffic or feature change
Cost Explorer concentrates the spike on DataTransfer-Regional-Bytes and RDSInstance line items
An RDS read replica sits in a different AZ than its primary and shows jagged ReplicaLag (spikes to 30s, drops to 0.5s, repeats)
No application config or BI tool actually points at the replica's endpoint
Recent schema or replication change on the primary that nobody coordinated with replica owners

What we thought it was first

Chasing the analytics query that did not exist

Almost every cost spike I have seen in the last three years gets blamed on analytics first. There is usually a junior data person, a notebook, a forgotten SELECT *, and a story everyone tells themselves. So we did the natural thing. We pulled the Athena query history for the previous ten days. Nothing unusual. We checked Redshift, which the team barely uses. Idle. We checked the data warehouse cluster's autoscaling history. Flat.

The clue was in Cost Explorer, but only when we grouped by usage type instead of by service. The RDS line item was up, sure, but the line item that had really moved was DataTransfer-Regional-Bytes. That is the meter for cross-AZ traffic inside a single region. Analytics queries do not typically light that meter up unless somebody has put a compute node in one AZ and the data in another, which would have been a much weirder problem.

Cross-AZ data transfer at that volume meant something was constantly shipping bytes between two availability zones. The shape of the bill said: find the thing that talks to itself across AZs at high frequency.

The diagnostic turn

How we found the orphan replica

We listed every RDS instance in the account and compared the AZ of each replica to its primary. One read replica was in us-east-1b while its primary was in us-east-1a. That alone is not a problem; cross-AZ replicas exist for legitimate HA reasons. What was odd was that this replica was tagged with nothing. No Owner. No Purpose. No Environment. Just the default Name tag, which read load-test-replica-temp.

# List replicas with their AZ and their primary's AZ
aws rds describe-db-instances \
  --query 'DBInstances[?ReadReplicaSourceDBInstanceIdentifier!=`null`].[DBInstanceIdentifier,AvailabilityZone,ReadReplicaSourceDBInstanceIdentifier,DBInstanceStatus]' \
  --output table

# Then for each primary, get its AZ
aws rds describe-db-instances \
  --db-instance-identifier <primary-id> \
  --query 'DBInstances[0].AvailabilityZone'

The two commands that surfaced the orphan in about 30 seconds.

The replica's CloudWatch ReplicaLag metric was the giveaway that this was not a healthy idle replica. It would spike to 30 seconds, drop to 0.5 seconds, spike again, every minute or so. That sawtooth pattern means the replication thread is failing and retrying. We pulled the replica's error log and found the same line repeating roughly every 50 milliseconds: a binlog format mismatch. Someone had changed the primary from MIXED to ROW format three weeks earlier, and the replica had been retrying the broken stream ever since.

Every retry shipped a chunk of binlog across the AZ boundary. At 50ms intervals, 24 hours a day, for three weeks. That was the bill.

What we did before deleting anything

The five-minute check that prevents the worse outcome

The instinct, when you have found the thing burning money, is to kill it immediately. We did not. The worse outcome here is not 'replica costs another hour of cross-AZ transfer'. The worse outcome is 'replica gets deleted, a quarterly BI dashboard breaks on Friday, and finance is back in your inbox with a different question'.

So we did the cheap verification first. We grepped the application monorepo for the replica's endpoint hostname. Zero hits. We checked the BI tool's data sources (Metabase in this case). Nothing pointed at it. We checked the data team's Airflow DAGs. Clean. We checked Terraform state to see how it had been created. It was in a workspace tagged load-test that had not been touched in a month, and the engineer who created it had left the company three weeks earlier.

If something had pointed at it

The right move would have been to keep the replica, fix the binlog format, and decide whether the read pattern actually justified cross-AZ. Deletion would have caused a worse incident than the cost spike.

Nothing pointed at it

Delete with --skip-final-snapshot. The replica was already corrupted by the binlog mismatch; a final snapshot was worthless. Cost stopped accruing within minutes.

aws rds delete-db-instance \
  --db-instance-identifier load-test-replica-temp \
  --skip-final-snapshot

The actual delete, once we were confident nothing depended on the replica.

What we changed afterwards

Tag hygiene, expiration sweeps, and an anomaly budget that would have caught this on day 2

Forgotten resources are the largest single category of cloud waste I see in client accounts. Bigger than oversized instances. Bigger than reserved-instance gaps. The fix is mechanical. Every cost-generating resource needs three tags: Owner, Purpose, ExpiresAt. ExpiresAt is the one most teams skip and the one that does the work.

We deployed a small Lambda on a weekly schedule that walks RDS, EC2, ELB, ElastiCache, and OpenSearch, finds resources past their ExpiresAt date or missing tags entirely, and posts to a Slack channel pinging the Owner. The owner has two weeks to either re-tag with a new ExpiresAt or delete. Resources with no Owner go to the platform team's queue. The first sweep flagged 47 resources across the account. Six of them were costing real money.

flowchart TD
  A[Weekly Lambda runs] --> B{Resource has<br/>Owner, Purpose,<br/>ExpiresAt tags?}
  B -- no --> C[Post to platform team queue]
  B -- yes --> D{ExpiresAt<br/>in past?}
  D -- no --> E[Skip]
  D -- yes --> F[DM the Owner in Slack]
  F --> G{Owner responds<br/>within 14 days?}
  G -- extends --> H[Update ExpiresAt]
  G -- no response --> I[Auto-tag for deletion<br/>review next sweep]

The sweep logic. About 180 lines of Python in practice.

The second change was AWS Budgets with anomaly detection scoped per service. The team had a single account-wide budget set at $5,000/month, which is useless for catching this kind of incident because the spike was concentrated in one service and the account total only crossed $5,000 on day five. A per-service budget on RDS set at $4,000 with a 20% variance threshold would have fired on day 2. The alert that matters is the one that fires before you have spent the money, not after.

The third change was a process one. The original binlog format change had been an uncoordinated database tweak from a senior engineer who had not realized a replica existed. Schema and replication changes now require a checklist that includes 'list all replicas of this primary and confirm they support the new config' as a pre-flight step. It is not glamorous. It would have prevented the entire incident.

If your AWS bill just jumped and you do not know why

Where cost spike triage gets stuck

The hard part of a cost spike is not finding the resource. It is being confident enough to delete it. Most teams we work with have at least one orphan RDS, ElastiCache, or NAT gateway they are afraid to touch because nobody remembers what depends on it. The triage takes a day; the courage to act takes a week of meetings. By then the bill has run another $2,000.

We run cost spike triage engagements every month. We have seen the orphan-replica case four times this year, the NAT-gateway-in-the-wrong-AZ case more often than that, and a half dozen variants of 'load test that never got cleaned up' across CloudWatch Logs, OpenSearch, and Aurora Serverless. The pattern is almost always the same: a resource that nobody owns, a tag policy that was never enforced, and a budget alert tuned too coarse to catch concentration in a single service. We have written more on the underlying patterns in the cloud cost spikes problem brief and across our services.

If your AWS bill jumped this month and you cannot point at the resource with confidence, book an infrastructure review with our team and we will start with a 30-minute diagnostic call this week. Cost stops accruing the day we find the orphan.

Use these related pages to continue recovery

Request Review Download Checklist Case Studies