Why Grafana OnCall acknowledgments hang after a Helm upgrade migration

The call did not come from our on-call rotation. It came from a customer who noticed two unrelated degradations on their side and asked why we had not paged. We had not paged because Grafana OnCall had been silently swallowing alerts for roughly 72 hours. Every new firing alert was being deduplicated into the same zombie incident, and every attempt to acknowledge or resolve that incident returned HTTP 500. The on-call engineer who first tried to clear it that morning had assumed the spinner was a UI bug and moved on. The thing meant to wake us up was the thing that was broken.

Migration recovery | 11 min read

Problem signal
  • OnCall UI Acknowledge and Resolve buttons spin and time out with a generic 500
  • New alerts from real degradations get deduplicated into an incident that cannot be cleared
  • OnCall pod logs show ORM errors referencing a column that does not exist in the table
  • The Helm post-upgrade migration job reported success but Postgres logs show a lock_timeout on one ALTER TABLE
  • There is no Prometheus alert on OnCall's own API error rate, so the regression went undetected

The alerting platform was the incident

72 hours of swallowed alerts and one zombie incident absorbing all of them

When we got on the bridge, OnCall's incident list looked almost healthy. Two incidents in firing state, both from three days earlier, both with zero acknowledgment events. That should have been impossible. The on-call rotation had been live the whole time, and the runbook said any firing incident over 15 minutes old gets escalated. Nothing had been escalated because nothing new had appeared. Every alert fired by Prometheus Alertmanager in those 72 hours had been deduplicated by labels and folded into one of those two zombies.
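The folding behaviour can be modelled in a few lines. This is a deliberately simplified sketch, not OnCall's actual grouping code: the `AlertGroup` class, the `fingerprint` key built from sorted labels, and the rule that only a resolved group stops absorbing alerts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AlertGroup:
    """Hypothetical stand-in for an incident; not OnCall's real model."""
    key: tuple
    resolved: bool = False
    alerts: list = field(default_factory=list)


def fingerprint(labels: dict) -> tuple:
    # Assumed dedup key: the alert's label pairs, order-independent.
    return tuple(sorted(labels.items()))


def ingest(groups: list, labels: dict) -> AlertGroup:
    """Attach an alert to an unresolved group with the same key, else open a new one."""
    key = fingerprint(labels)
    for group in groups:
        if group.key == key and not group.resolved:
            group.alerts.append(labels)  # folded in: no new incident, no new page
            return group
    group = AlertGroup(key=key, alerts=[labels])
    groups.append(group)
    return group
```

In this model, a group that can never be resolved keeps absorbing every matching alert, which is the shape of what we saw.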

The first thing we tried was the obvious one. Click Acknowledge in the UI. The spinner ran for about 20 seconds and the page returned a 500. Same for Resolve. Same for Snooze. Same when we called the API directly with curl. The web pods were up, the database was reachable, Redis was fine. Nothing in any dashboard suggested a problem, because nobody had built a dashboard that watched OnCall itself.

$ curl -s -X POST -H "Authorization: Bearer $TOKEN" \
    https://oncall.internal/api/v1/alert_groups/I8KZ.../acknowledge/
{"detail": "Internal server error"}

# from the oncall-engine pod
$ kubectl logs deploy/oncall-engine -c engine --tail=50 | grep -A2 ERROR
DatabaseError: column alerts_alertgroup.acknowledged_by_confirmation_phone does not exist
LINE 1: ...ledged_by_user_id", "alerts_alertgroup"."acknowledged_by_co...

The ORM was reaching for a column the table did not have.

Why the migration job exited 0 with a half-finished schema

A silent ALTER TABLE timeout the Helm hook never noticed

Our first guess was a bad release. The previous Helm upgrade had bumped OnCall by a minor version, and we assumed the new application code was looking at a field that genuinely had not shipped yet. That was wrong. The release notes said the column had been added in this version, and django_migrations on the OnCall database said the migration had been applied. Both things were true, and the column was still not there.
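A check along these lines would surface that kind of disagreement between the migration record and the live table. It is a minimal sketch, assuming a DB-API cursor that uses the `%s` parameter style (as psycopg2 does); the function name `missing_columns` and the caller-supplied `expected` set are hypothetical.

```python
def missing_columns(cursor, table: str, expected: set) -> set:
    """Return the expected column names that the live table does not have.

    `cursor` is any DB-API cursor using the %s parameter style.
    `expected` is the set of column names the application code relies on.
    """
    cursor.execute(
        "SELECT column_name FROM information_schema.columns "
        "WHERE table_name = %s",
        (table,),
    )
    actual = {row[0] for row in cursor.fetchall()}
    return set(expected) - actual
```

A non-empty result for a table whose migration is recorded as applied means the record and the schema have drifted apart.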

The clue was in Postgres logs from three days earlier, exactly when the Helm post-upgrade hook ran the migration job. One line, easy to miss, in the middle of dozens of normal statement logs:

2024-XX-XX 02:14:07 UTC ERROR:  canceling statement due to lock timeout
2024-XX-XX 02:14:07 UTC STATEMENT:  ALTER TABLE alerts_alertgroup
    ADD COLUMN acknowledged_by_confirmation_phone varchar(20) NULL;
2024-XX-XX 02:14:07 UTC LOG:  duration: 30001.114 ms

alerts_alertgroup is one of the highest-write tables in OnCall. ALTER TABLE needs an ACCESS EXCLUSIVE lock on the table, and at 02:14 a backlog of inserts was holding locks that conflict with it. The ALTER waited until it hit the lock_timeout we had set globally to 30 seconds (a sensible default we put in years ago to stop one bad migration from wedging the whole database), and Postgres cancelled the statement. The migration script caught the exception, logged it to stderr, moved on to the next statement, and finished. The Helm hook checked the job's exit code, saw 0, and marked the release Succeeded. ArgoCD synced. The new pods rolled. And from that moment, every code path that touched the new column returned 500.
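The failure mode is a runner that logs an error and still exits 0. The sketch below shows the opposite behaviour under stated assumptions: `run_statements` and its `execute` callable are hypothetical names, not the script that shipped with the chart.

```python
import sys


def run_statements(execute, statements) -> int:
    """Run statements in order and return a process exit code.

    `execute` is a hypothetical callable that runs one SQL statement and
    raises on failure (for example a bound cursor.execute).
    """
    for sql in statements:
        try:
            execute(sql)
        except Exception as exc:  # e.g. a lock_timeout cancellation
            print(f"ERROR: {sql!r} failed: {exc}", file=sys.stderr)
            return 1  # stop here: later statements may depend on this one
    return 0
```

Called as `sys.exit(run_statements(cursor.execute, statements))`, a cancelled ALTER fails the job, and a Helm hook that only checks the exit code sees the failure.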

The migration was also blocked from completing on a retry because the previous attempt had left a trigger in place on alerts_alertgroup, which we only found by checking pg_trigger directly. Without dropping that trigger first, re-running the migration would have hit the same lock window and failed the same way.

flowchart TD
  A[Helm upgrade] --> B[Post-upgrade migration job]
  B --> C{ALTER TABLE alerts_alertgroup}
  C -->|lock_timeout 30s| D[Statement cancelled]
  D --> E[Script catches exception, logs to stderr]
  E --> F[Migration script exits 0]
  F --> G[Helm marks release Succeeded]
  G --> H[New pods deployed, expect new column]
  H --> I[Every ack/resolve returns 500]
  I --> J[Alerts deduplicated into zombie incidents]

Drop the trigger, add the column, then unstick the zombies

Why we forward-fixed instead of rolling the Helm release back

We considered rolling back to the previous OnCall version. It looked clean on paper: the old image did not need the missing column, so the schema would match again and acks would work. We talked ourselves out of it for two reasons. First, the new pods had been running for three days and had written data shaped for the new version, including new fields in adjacent tables. A rollback would have meant either accepting writes that the old code did not understand or restoring a 72-hour-old database snapshot, which would erase three days of incident history including the zombies we wanted to clean up. Second, the next upgrade would just hit the same lock_timeout the same way. We would be back here in a week.

Forward-fix it was. The sequence had to be careful, because the table was still taking writes and we were going to ALTER it. We picked a low-write window, paused Celery workers that wrote to alerts_alertgroup (not the web tier, which we wanted up so the API stayed responsive), and ran the work inside one transaction:

-- 1. confirm the column is genuinely missing
SELECT column_name FROM information_schema.columns
WHERE table_name = 'alerts_alertgroup'
  AND column_name = 'acknowledged_by_confirmation_phone';
-- (0 rows)

-- 2. find the blocking trigger left over from the failed attempt
SELECT tgname FROM pg_trigger
WHERE tgrelid = 'alerts_alertgroup'::regclass
  AND tgname LIKE 'pgtrigger_%';

-- 3. drop it inside the same transaction we ALTER in
BEGIN;
SET LOCAL lock_timeout = '5min';
DROP TRIGGER IF EXISTS pgtrigger_oncall_protect_finished
  ON alerts_alertgroup;
ALTER TABLE alerts_alertgroup
  ADD COLUMN acknowledged_by_confirmation_phone varchar(20) NULL;
COMMIT;

Raise lock_timeout for this transaction only; do not touch the global.

We did not change the global lock_timeout. Setting it LOCAL inside the transaction lets this one ALTER wait up to five minutes, and any other migration that runs in normal conditions still gets the 30-second guard. Once the column existed, we unpaused the Celery workers and watched the engine pod logs. The 500s stopped within seconds.
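The same transaction-scoped timeout can be applied from a script. This is a sketch under assumptions: `conn` is a DB-API connection with psycopg2-style `%s` parameters and autocommit off, and `alter_with_local_lock_timeout` is a hypothetical helper name. It uses Postgres's `set_config` function with `is_local` set to true, which is the function form of SET LOCAL.

```python
def alter_with_local_lock_timeout(conn, ddl: str, timeout: str = "5min") -> None:
    """Run one DDL statement with a lock_timeout that ends with the transaction.

    Assumes `conn` is not in autocommit mode, so the first execute opens a
    transaction implicitly and the setting is scoped to it.
    """
    cur = conn.cursor()
    try:
        # Third argument true = local to the current transaction only.
        cur.execute("SELECT set_config('lock_timeout', %s, true)", (timeout,))
        cur.execute(ddl)
        conn.commit()
    except Exception:
        conn.rollback()  # a timeout leaves no partial change behind
        raise
    finally:
        cur.close()
```

Because the setting is local, the global 30-second guard is back in force as soon as the transaction commits or rolls back.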

That left the zombies. Acknowledging them was not enough. An acknowledged incident still sits in the firing state from OnCall's deduplication perspective, so new alerts would still fold into it. We had to mark them resolved. We did it through the API first to make sure the lifecycle hooks fired and downstream integrations got the resolved webhook, and only fell back to a direct UPDATE for two records that the API still refused for an unrelated reason (their integration had been deleted, so the API could not look up the routing). For those, we set resolved=TRUE and resolved_at to the current timestamp in the database directly, with a note in the incident's raw payload explaining the manual close.

We then fired a synthetic alert from Alertmanager, watched a new incident appear, acked it from the UI in under two seconds, resolved it, and confirmed that a follow-up alert created a fresh incident instead of folding into the resolved one. That was the real all-clear.
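A synthetic alert of this kind can be pushed through Alertmanager's v2 API (POST /api/v2/alerts). The sketch below is illustrative: the alert name, the labels, and the helper names are hypothetical, and the labels would need to match a route that actually forwards to OnCall.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_synthetic_alert(now: datetime, ttl_minutes: int = 5) -> list:
    """Build a one-alert payload for Alertmanager's POST /api/v2/alerts."""
    return [{
        "labels": {
            "alertname": "OncallSyntheticCanary",  # hypothetical name
            "severity": "info",
        },
        "annotations": {"summary": "Synthetic end-to-end test of OnCall"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }]


def send(alertmanager_url: str, payload: list) -> int:
    """POST the payload to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        alertmanager_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Keeping the labels identical between the first alert and the follow-up is what makes the second half of the test meaningful: the follow-up should open a new incident only because the first one is resolved.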

What we wired up so the next silent migration trips an alarm

Meta-monitoring for the platform that does the monitoring

The thing that kept us up afterward was not the migration. Migrations fail. Database locks happen. The thing that kept us up was that OnCall had been broken for three days and not one signal in our monitoring stack had told us. We had alerts on Prometheus being down, on Alertmanager being down, on Grafana being down, on every customer-facing service. We had nothing watching the incident management platform itself.

We added two rules the same week. The first is a straight error-rate alert on OnCall's API. If more than 1% of requests to /api/v1/ return 5xx for five minutes, page the platform team at critical severity. Five minutes is short enough to catch a real outage but long enough that a brief burst of errors during a rolling deploy does not page. We picked critical because if OnCall is degraded, no other page can be trusted to arrive; alerts get swallowed.

groups:
- name: oncall-meta
  rules:
  - alert: OncallApiErrorRateHigh
    expr: |
      sum(rate(django_http_responses_total_by_status_total{job="oncall",status=~"5.."}[5m]))
      /
      sum(rate(django_http_responses_total_by_status_total{job="oncall"}[5m]))
      > 0.01
    for: 5m
    labels:
      severity: critical
      service: oncall
    annotations:
      summary: "OnCall API returning >1% 5xx for 5m"
      runbook: "https://internal/runbooks/oncall-api-errors"

  - alert: OncallMigrationJobStderr
    expr: |
      sum(increase(kube_job_status_failed{namespace="oncall"}[10m])) > 0
      or
      sum(increase(log_messages_total{namespace="oncall",app="migration",level="ERROR"}[10m])) > 0
    for: 1m
    labels:
      severity: critical
      service: oncall

The second rule catches a migration job that logs errors even if its exit code is 0.
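For the first rule, the interaction between the 1% threshold and the hold time can be reasoned about with a rough model. This is not how Prometheus evaluates rules internally; `should_page` and `hold_evaluations` are hypothetical, and the number of evaluations that corresponds to `for: 5m` depends on the evaluation interval.

```python
def should_page(samples, threshold=0.01, hold_evaluations=5) -> bool:
    """Return True once the 5xx ratio exceeds `threshold` on
    `hold_evaluations` consecutive evaluations.

    `samples` is a list of (errors_5xx, total_requests) pairs, one per
    rule evaluation. A single healthy evaluation resets the streak.
    """
    streak = 0
    for errors, total in samples:
        ratio = errors / total if total else 0.0
        streak = streak + 1 if ratio > threshold else 0
        if streak >= hold_evaluations:
            return True
    return False
```

The reset on any healthy evaluation is why a short error burst during a deploy stays quiet while a sustained failure pages.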

The second rule is the lesson from this specific incident. Helm trusts the exit code. Our migration script swallowed the failed statement and carried on. The only place the truth lived was the job's log stream. We now alert on ERROR-level log lines from any pod with the migration label in the oncall namespace, regardless of whether the job reported success. We have caught two real issues with this rule in the months since (neither as bad as this one, both worth knowing about within minutes instead of days).
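The same idea can be applied as a gate inside a pipeline, before any alert fires. This is a minimal sketch that assumes error lines carry a level token such as ERROR; the function names and the token list are hypothetical and would need adjusting to the real log format.

```python
import re

# Assumed log shape: a level token such as "ERROR" somewhere in the line.
ERROR_LINE = re.compile(r"\b(ERROR|CRITICAL|FATAL)\b")


def error_lines(log_text: str) -> list:
    """Return the log lines that carry an error-level token."""
    return [line for line in log_text.splitlines() if ERROR_LINE.search(line)]


def job_really_succeeded(exit_code: int, log_text: str) -> bool:
    """Treat a job as successful only if it exited 0 and logged no errors."""
    return exit_code == 0 and not error_lines(log_text)
```

Applied to the job output from this incident, the exit code alone says success and the log scan says failure.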

The broader pattern, and one we now apply on every recovery engagement we run, is that any tool you depend on to notice problems needs an independent way to notice when that tool itself is the problem. We have written more about this category of failure in our migration recovery work, because the same shape appears in database cutovers, queue platform upgrades, and identity provider migrations: the system you rely on to tell you the truth is the system that has stopped telling the truth, and you only find out from a customer.

If your OnCall is doing this right now

When acks are silently 500ing and you cannot tell what data is real

The hard part of this incident is not the SQL. The hard part is making the call between forward-fix and rollback when your incident history, your zombie state, and your live alert routing are all entangled in a database that is currently being written to by application code that expects a schema it does not have. Roll back without a plan and you lose three days of incident records. Forward-fix without checking for leftover triggers and migration locks and your second attempt fails the same way as the first. Run an ALTER on a hot table during business hours and you find out what your application's actual timeout tolerance is.

We do these engagements every few weeks. A partial Django migration on Grafana OnCall is the specific case we have now seen three times this year, twice from lock_timeout and once from a custom trigger that blocked the ALTER outright. Adjacent variants we have handled: Sentry post-deploy migrations that left a column nullable when the code expected NOT NULL, Mattermost upgrades where one index creation timed out, Keycloak realm migrations that completed on the primary but failed on a replica. The pattern is identical and the recovery sequence rhymes.

If your team is staring at a 500 on every ack and trying to decide whether to roll back the Helm release, book an infrastructure review with our team and we will be on a bridge with you the same day. We will help you confirm the schema delta, plan the forward-fix or the rollback with the data implications spelled out, and clean up the zombie incidents without losing the history you need for the postmortem.
