Grafana 'No Data' after migration: 7 reconcilers we had to kill first

The first fix lasted 90 seconds. We had corrected the Grafana datasource URL from prometheus:9999 back to prometheus:9090, watched the pod roll, refreshed the dashboard, and seen one panel come alive. By the time we opened a second tab, the ConfigMap was back to 9999. That was the real incident. The 'No Data' dashboards were a symptom of an observability stack that someone, or something, was actively re-corrupting from at least seven places we had not yet found.

K8s reliability | 11 min read

Problem signal

  • Grafana dashboards show 'No Data' on every panel after a cluster migration, and kubectl edit fixes revert within 1-3 minutes
  • Prometheus targets page is empty or stuck on a namespace that does not exist anymore
  • ClusterRoleBindings you just recreated reference a ClusterRole name nobody on the team typed
  • ps aux shows kworker-looking processes with elevated CPU that hold open file descriptors to a kubeconfig
  • kubectl get cronjobs -A shows entries in namespaces nobody on the platform team remembers creating

The fix that lasted 90 seconds

Why we stopped fixing config and started looking for what was undoing it

The team that called us had been at this for nine hours. After a cluster migration, every Grafana dashboard was blank. The on-call had walked through the obvious things. The Prometheus datasource in Grafana pointed at port 9999. The Loki datasource pointed at port 3199. The Prometheus scrape config had annotation keys nobody recognized (prometheus_io_metrics_enabled instead of prometheus_io_scrape) and targeted a namespace that did not exist. The Grafana deployment had a config-validator init container running sleep 3600. Each one of those was a real bug. Each one of those, fixed in isolation, would revert before the next pod rolled out.

The shape of what they were describing was not a botched migration. A botched migration leaves bad state. This was bad state being re-applied. When manual kubectl edits revert in minutes, the question is no longer 'what is wrong with the manifest'; it is 'what process has write access and is reconciling against a corrupt source of truth'. We told them to stop fixing config until we had inventoried every actor that could write to the cluster.
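One quick way to start that inventory is to ask the API server which clients have written to a reverting object. This is a minimal sketch, not the exact command from the engagement: it assumes kubectl 1.21 or newer, and the function name last_writers and the object named in the example are hypothetical.

```shell
# Print one line per field manager recorded on an object:
# timestamp, manager name, operation. Read-only.
last_writers() {  # usage: last_writers <namespace> <type/name>
  local ns="$1" obj="$2"
  kubectl -n "$ns" get "$obj" --show-managed-fields \
    -o jsonpath='{range .metadata.managedFields[*]}{.time}{"\t"}{.manager}{"\t"}{.operation}{"\n"}{end}' \
    | sort
}

# Example (hypothetical object name):
#   last_writers monitoring configmap/grafana-datasources
```

The manager name is reported by the client, so a hostile writer can pick an innocent-looking one. Treat the output as a lead to follow, not as proof; a manager that updated the object seconds after your own edit is the one to chase.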

This sounds obvious written down. In the middle of an incident, with executives asking for an ETA on dashboards, the instinct is to keep patching. We have run this play enough times now to know the patching never converges. You burn three more hours and your changes still revert. The only path out is persistence-first triage.

A kworker thread holding a kubeconfig

Seven places state was being rewritten from

We started on the nodes. ps auxf on each worker showed a process named [kworker/u8:2-events_unbound]. Square brackets usually mean a kernel thread, and you learn early not to touch kernel threads. We almost moved on. The thing that snagged our attention was CPU: a real kernel worker thread on an idle-ish node should not be sitting at 12 percent. We pulled its open file descriptors.

$ ls -l /proc/$(pgrep -f 'kworker/u8:2')/fd/ 2>/dev/null | head
lr-x------ 1 root root 64 ... 3 -> /root/.kube/config
lrwx------ 1 root root 64 ... 7 -> socket:[884213]
lr-x------ 1 root root 64 ... 9 -> /opt/.reconciler/state.json
$ cat /proc/$(pgrep -f 'kworker/u8:2')/comm
kworker/u8:2-events_unbound
$ readlink /proc/$(pgrep -f 'kworker/u8:2')/exe
/opt/.reconciler/agent

Kernel threads do not hold kubeconfigs or have an exe link. This was a userspace binary with a spoofed comm name.
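The same check can be run across every process on a node instead of one PID at a time. The sketch below is an illustration under assumptions, not the tooling used in the incident: the function name find_fake_kthreads and its optional proc-root argument are hypothetical, and it needs root on a live node so every exe link is readable.

```shell
# List processes that look like kernel threads in ps output (empty cmdline,
# or an argv rewritten to start with '[') but still have an exe link.
# Real kernel threads have no executable, so readlink on exe fails for them.
find_fake_kthreads() {  # usage: find_fake_kthreads [proc_root]
  local proc_root="${1:-/proc}" dir pid cmd exe comm
  for dir in "$proc_root"/[0-9]*; do
    [ -d "$dir" ] || continue
    pid="${dir##*/}"
    # Read the file rather than testing its size: /proc entries report size 0.
    cmd="$(cat "$dir/cmdline" 2>/dev/null | tr '\0' ' ')"
    case "$cmd" in
      ''|'['*) ;;        # looks like a kernel thread: keep checking
      *) continue ;;     # ordinary userspace process: skip
    esac
    exe="$(readlink "$dir/exe" 2>/dev/null)" || continue
    [ -n "$exe" ] || continue
    comm="$(cat "$dir/comm" 2>/dev/null)"
    printf '%s\t%s\t%s\n' "$pid" "$comm" "$exe"
  done
}

# Example: find_fake_kthreads        # scans /proc on the current node
```

Each output line is PID, comm, and the resolved executable path. Anything it prints deserves a look; a healthy node normally prints nothing.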

That was reconciler one. The same trick was on every node, with comm names rotating through plausible kernel-thread names (kworker variants, flush-dm-0, mm_percpu_wq). We collected the binary, killed every instance, removed the systemd unit that was respawning it, and moved on. Then we did the boring sweep nobody wants to do in the middle of an incident.

  • kubectl get cronjobs -A surfaced config-audit in kube-system and prometheus-metrics-federation in cattle-monitoring-system. Neither was ours. Both ran every 60 seconds and wrote ConfigMaps.
  • systemctl list-timers on each node showed k8s-health-monitor.timer firing every two minutes against the API server with a node-local kubeconfig.
  • ls /etc/cron.d/ showed a host cron entry running a script under /opt/.reconciler/ once a minute, a belt-and-braces backup to the systemd timer.
  • kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations turned up pod-policy-webhook, namespace-policy-webhook, and the one that hurt us most, rbac-policy-enforcer.
  • The immutable attribute (chattr +i) was set on /etc/cron.d/k8s-health and on the corrupted ConfigMap manifests staged on disk. Edits failed with 'Operation not permitted', even as root.
  • Finalizers on the CronJobs prevented kubectl delete from completing until we patched them off.
  • PodSecurity labels on cattle-monitoring-system were set to enforce a baseline that blocked our debug pods from running.

Seven places. Any one of them, left running, would have re-corrupted the stack within minutes of our fixes. Some teams have a reconciler. This cluster had a mesh of them, each one a backup for the others. That is not a thing healthy infrastructure does; it is a thing a previous incident or a hostile takeover does. Either way, the response is the same.
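The sweep above can be collected into two read-only functions, one run from a workstation with cluster access and one run on each node as root. This is a sketch under assumptions rather than the script used during the incident: the function names are hypothetical, and the host sweep only covers systemd timers and /etc/cron.d.

```shell
# Cluster-side inventory: CronJobs (with any finalizers), admission
# webhooks, and PodSecurity enforce labels. Nothing here writes.
sweep_cluster() {
  echo "## CronJobs, all namespaces"
  kubectl get cronjobs -A \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,SCHEDULE:.spec.schedule,FINALIZERS:.metadata.finalizers
  echo "## Admission webhook configurations"
  kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
  echo "## PodSecurity enforce label per namespace"
  kubectl get namespaces -L pod-security.kubernetes.io/enforce
}

# Host-side inventory: timers, plus cron.d entries with the chattr +i
# flag called out. Run on every node as root.
sweep_host() {  # usage: sweep_host [cron_dir]
  local cron_dir="${1:-/etc/cron.d}"
  echo "## systemd timers"
  systemctl list-timers --all --no-pager
  echo "## $cron_dir entries (IMMUTABLE = chattr +i is set)"
  lsattr -a "$cron_dir" 2>/dev/null |
    awk '{ flag = ($1 ~ /i/) ? "IMMUTABLE" : "ok"; print flag, $0 }'
}
```

The output is meant to be read with the team in the room: every CronJob, webhook, timer, and cron entry either has an owner who can explain it or goes on the kill list.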

Why we deleted the webhooks before touching RBAC

The order we neutralized things, and why order matters

There is a trap in this kind of cleanup. If you fix the visible problem before you neutralize the actor reverting it, you have wasted a fix and burned credibility with the room. The worst version of this in our case was the RBAC webhook. The Prometheus ClusterRoleBinding had been deleted entirely, and the deployment had been swapped to the default service account. The obvious move was to recreate the CRB and patch the deployment back to a proper SA.

We tried it once, in a scratch namespace, just to see. The CRB came back with roleRef pointing at a ClusterRole that did not exist. The mutating webhook was matching anything with 'prometheus' or 'monitoring' in the name and silently rewriting the roleRef. If we had run that against the real CRB in production with the team watching, we would have looked like we did not know what we were doing, and the fix would not have worked.
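A lower-risk variant of that experiment is a server-side dry run, which still passes through admission webhooks but persists nothing. The sketch below is an alternative to the scratch test described above, not what was run on the day; the function names and the probe binding name are hypothetical.

```shell
# Show which resources each mutating webhook configuration targets.
webhook_targets() {
  kubectl get mutatingwebhookconfigurations \
    -o custom-columns='NAME:.metadata.name,RESOURCES:.webhooks[*].rules[*].resources,FAILURE:.webhooks[*].failurePolicy'
}

# Submit a ClusterRoleBinding as a server-side dry run and compare the
# roleRef that comes back with the one that was sent.
# Returns 0 if unchanged, 1 if admission rewrote it, 2 if the request failed.
check_roleref_mutation() {  # usage: check_roleref_mutation <binding-name> <clusterrole>
  local name="$1" role="$2" got
  got="$(kubectl create clusterrolebinding "$name" \
    --clusterrole="$role" --serviceaccount=default:default \
    --dry-run=server -o jsonpath='{.roleRef.name}')" || return 2
  if [ "$got" = "$role" ]; then
    echo "roleRef unchanged: $got"
  else
    echo "roleRef rewritten by admission: sent $role, got $got"
    return 1
  fi
}

# Example: the probe name contains 'monitoring' so a name-matching
# webhook would pick it up.
#   check_roleref_mutation probe-monitoring view
```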

flowchart TD
  A[Find reconcilers] --> B[Remove finalizers and chattr -i]
  B --> C[Delete admission webhooks]
  C --> D[Stop CronJobs, timers, host cron]
  D --> E[Kill host reconciler processes]
  E --> F[Verify nothing writes for 60s]
  F --> G[Fix config: Prometheus, Grafana, Loki]
  G --> H[Recreate RBAC and ServiceAccount]
  H --> I[Restart pods, observe 2 min stability window]

Neutralize first, then fix. RBAC and any 'monitoring'-named resource go last because the webhook would mutate them on creation.

So the order was: strip finalizers from the CronJobs, chattr -i on the immutable files, delete the three webhook configurations, suspend and delete the CronJobs in kube-system and cattle-monitoring-system, mask the systemd timer, remove the host cron entry, kill the userspace reconciler processes on every node and remove their systemd unit. Then we sat for 60 seconds and watched. No ConfigMap mutations. No Deployment patches. Quiet cluster. That was the first time in nine hours the cluster had been quiet, and you could feel the room exhale.
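The neutralization sequence can be sketched as a plan that prints its commands by default and only executes them when asked. The resource names come from the sweep above. The run wrapper, the APPLY variable, and the assumption that /etc/cron.d/k8s-health is the host cron entry in question are illustrative, not a record of the exact commands used. The cluster and host halves are separate functions because they run on different machines.

```shell
# Print each command; execute it only when APPLY=1 is set.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "+ $*"; fi; }

CRONJOBS="kube-system/config-audit cattle-monitoring-system/prometheus-metrics-federation"
WEBHOOKS="pod-policy-webhook namespace-policy-webhook rbac-policy-enforcer"

neutralize_cluster() {
  local cj wh
  # 1. Strip finalizers so the later deletes can complete.
  for cj in $CRONJOBS; do
    run kubectl -n "${cj%%/*}" patch cronjob "${cj##*/}" --type=merge -p '{"metadata":{"finalizers":null}}'
  done
  # 2. Delete the admission webhooks, whichever kind each one is.
  for wh in $WEBHOOKS; do
    run kubectl delete validatingwebhookconfigurations,mutatingwebhookconfigurations "$wh" --ignore-not-found
  done
  # 3. Suspend, then delete, the CronJobs.
  for cj in $CRONJOBS; do
    run kubectl -n "${cj%%/*}" patch cronjob "${cj##*/}" --type=merge -p '{"spec":{"suspend":true}}'
    run kubectl -n "${cj%%/*}" delete cronjob "${cj##*/}"
  done
}

neutralize_host() {  # run on each node as root
  run chattr -i /etc/cron.d/k8s-health
  run systemctl mask --now k8s-health-monitor.timer
  run rm -f /etc/cron.d/k8s-health
}
```

Running neutralize_cluster with no environment set prints the plan for review; APPLY=1 neutralize_cluster executes it.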

The order we put it back together

Restoring the observability stack once writes were ours alone

With the reconcilers gone, the config fixes were the easy part. We did them top-down by data flow: scrape config, then service routing, then the consumers.

# 1. Prometheus ConfigMap: restore annotation keys, fix namespace, shorten scrape interval
kubectl -n monitoring get cm prometheus-config -o yaml > /tmp/prom-cm.yaml
# edit: prometheus_io_metrics_* -> prometheus_io_scrape, plus the matching path and port keys
#       namespaces: [bleater-nonexistent] -> the real app namespace
#       scrape_interval: 300s -> 30s
kubectl apply -f /tmp/prom-cm.yaml

# 2. Prometheus Service: targetPort 9099 -> 9090
kubectl -n monitoring patch svc prometheus --type=json \
  -p='[{"op":"replace","path":"/spec/ports/0/targetPort","value":9090}]'

# 3. Service account and RBAC (webhooks already deleted)
kubectl -n monitoring create sa prometheus
kubectl create clusterrolebinding prometheus \
  --clusterrole=prometheus --serviceaccount=monitoring:prometheus
kubectl -n monitoring set serviceaccount deploy/prometheus prometheus

# 4. Prometheus readiness probe: port 9099 /-/healthz -> 9090 /-/ready
# 5. Loki: drop -server.http-listen-port=3199 arg, fix svc selector loki-server -> loki
# 6. Grafana: remove init container, fix probe ports, drop GF_SERVER_HTTP_PORT,
#    fix volume refs (-v2 -> base name), reset admin secret
# 7. Delete NetworkPolicy grafana-egress-restrict
kubectl -n monitoring delete networkpolicy grafana-egress-restrict

We applied these as separate kubectl operations on purpose, not a single helm rollout, so we could verify each one stuck before moving on.
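The per-step check can be as small as re-reading one field after a delay. A minimal sketch, assuming the field of interest can be addressed with a kubectl jsonpath expression; the function name verify_sticks is hypothetical.

```shell
# Wait, re-read a single field, and fail if it no longer holds the value
# that was just applied. Returns 0 if it stuck, 1 if it reverted, 2 on error.
verify_sticks() {  # usage: verify_sticks <ns> <type/name> <jsonpath> <expected> [seconds]
  local ns="$1" res="$2" path="$3" want="$4" delay="${5:-30}" got
  sleep "$delay"
  got="$(kubectl -n "$ns" get "$res" -o jsonpath="$path")" || return 2
  if [ "$got" = "$want" ]; then
    echo "ok: $res $path is still $got after ${delay}s"
  else
    echo "REVERTED: $res $path is $got, expected $want" >&2
    return 1
  fi
}

# Example, after step 2 above:
#   verify_sticks monitoring svc/prometheus '{.spec.ports[0].targetPort}' 9090
```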

After every step we waited 30 seconds and re-read the resource. Nothing reverted. We rolled the Grafana deployment, watched it come up clean with no init container blocking startup, hit the Prometheus targets page and saw 11 active up series including the application pods, then loaded a dashboard. Data. The two-minute stability window passed with no drift. We held the bridge for another 20 minutes anyway, because the team needed to see it not break more than they needed us to leave.
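A stability window like this can be made mechanical by snapshotting version markers before and after the wait. The sketch below is one way to do it, under assumptions: the function names are hypothetical, it watches only ConfigMaps, Services, and Deployments, and it compares Deployment generation rather than resourceVersion so that ordinary status updates during a rollout do not count as drift.

```shell
# One line per object: kind, name, and a marker that changes on a write.
snapshot() {  # usage: snapshot <namespace>
  kubectl -n "$1" get configmaps,services --no-headers \
    -o custom-columns=KIND:.kind,NAME:.metadata.name,VER:.metadata.resourceVersion
  kubectl -n "$1" get deployments --no-headers \
    -o custom-columns=KIND:.kind,NAME:.metadata.name,GEN:.metadata.generation
}

# Compare two snapshots taken <seconds> apart.
# Returns 0 if nothing changed, 1 on drift, 2 if a snapshot failed.
stability_window() {  # usage: stability_window <namespace> [seconds]
  local ns="$1" secs="${2:-120}" before after
  before="$(snapshot "$ns")" || return 2
  sleep "$secs"
  after="$(snapshot "$ns")" || return 2
  if [ "$before" = "$after" ]; then
    echo "no drift in $ns over ${secs}s"
    return 0
  fi
  printf 'drift in %s\n--- before\n%s\n--- after\n%s\n' "$ns" "$before" "$after" >&2
  return 1
}

# Example: stability_window monitoring 120
```

A ConfigMap that some component legitimately rewrites on a timer would show up as drift here, so a failure is a prompt to look at which line changed, not an automatic verdict.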

What we changed in our own playbook

Persistence-first triage is now the default for post-migration observability failures

We have changed how we open any incident where fixes do not stick. The first 15 minutes are no longer spent on config. They are spent on a sabotage sweep: cronjobs in every namespace (not just the obvious ones, cattle-monitoring-system bit us and we have seen it bite others), systemd timers on every node, /etc/cron.d, validating and mutating webhooks, finalizers on resources we expect to delete, immutable file attributes on staged manifests, and a ps auxf on every node with an eye on anything in square brackets that has an exe link.

We also changed how we think about kubectl edit during a live incident. If a change has to land and the cluster has any chance of having a reconciler we have not yet found, we apply through git and watch the apply, not edit the live object. It is slower by 90 seconds and saves you from spending an hour wondering why your fix evaporated. We have written more on the same instinct in our notes on Kubernetes release failures and on ArgoCD self-heal traps, which is the friendly version of this same pattern.
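With manifests in git, kubectl diff gives a cheap drift check before and after each apply: it exits 0 when the live objects match the manifests, 1 when they differ, and a higher code on error. A minimal sketch, assuming the manifests live in a local checkout; the function name check_drift and the path in the example are hypothetical.

```shell
# Compare live cluster state against manifests from the git checkout.
# Returns 0 if in sync, 1 if the live objects differ, 2 if the diff failed.
check_drift() {  # usage: check_drift <file-or-directory>
  local rc
  kubectl diff -f "$1"
  rc=$?
  case "$rc" in
    0) echo "in sync: $1" ;;
    1) echo "DRIFT: live objects differ from $1"; return 1 ;;
    *) echo "kubectl diff failed (exit $rc)" >&2; return 2 ;;
  esac
}

# Example (hypothetical path):
#   kubectl apply -f manifests/monitoring/ && sleep 60 && check_drift manifests/monitoring/
```

Drift that appears a minute after a clean apply means something other than you is still writing.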

The non-obvious lesson from this incident is that hostile or accidental reconcilers do not announce themselves. The kworker spoof was the cleverest piece; it would have survived a casual ps. The cattle-monitoring-system namespace looked legitimate to anyone who had ever run Rancher. The webhook had a name (rbac-policy-enforcer) that sounded like something a security team would install. In each case the move that surfaced it was boring: enumerate the category exhaustively, then ask which entries the team can account for. Anything they cannot account for is the answer.

If your post-migration monitoring keeps un-fixing itself

When fixes revert, the problem is not the fix

The hard part of incidents like this is not the Prometheus annotation key or the Grafana port. Those take 20 minutes once the cluster stops fighting you. The hard part is having the discipline to stop patching and inventory every actor that can write to your cluster, especially when leadership is asking for an ETA and your instinct is to keep typing. The hard part is also knowing what the categories of reconciler are. If you have never had to look for a mutating webhook that rewrites RBAC, or a host process pretending to be a kworker, the search takes hours. If you have seen it before, it takes 15 minutes.

We run these recovery engagements every week. We have seen the kworker spoof twice this year, the cattle-monitoring-system CronJob trick three times, and the RBAC-mutating webhook in two unrelated post-migration incidents. The playbook is portable; the patience to run it before patching is the part teams in the middle of an outage struggle with, and that is usually why they call us.

If your dashboards are blank after a migration and your fixes are not sticking, book an infrastructure review with our team and we will be on a bridge with you the same day. Bring node SSH access, kubectl with cluster-admin, and a list of every namespace you can name. We will handle the rest.
