Why MinIO uploads return 200 and never land: a deny-wins IAM trap

The dashboards were green. The api-gateway logged 12,400 successful media POSTs over six hours, the storage service SDK reported 200 on every PutObject, and the fanout queue happily processed every notification. The MinIO bucket had gained zero new objects in the same window. Users were seeing broken image tiles in their feeds and the on-call team had spent three hours chasing the fanout service because that was the only place the symptom was visible. The actual problem was an explicit Deny on s3:PutObject sitting inside a bucket policy that had been added during a security hardening sprint two days earlier, and MinIO was doing exactly what S3 IAM semantics say it should do: deny wins, even when the user policy says Allow.

Problem signal
  • Upload endpoints return HTTP 200 but the object never appears in the bucket
  • Bucket notification webhooks fire and downstream consumers process phantom events
  • Grafana shows upload throughput as healthy because SDK success metrics dominate the panel
  • Users report broken image links while every service-level dashboard is green
  • A recent IAM or bucket policy change correlates in time with the start of phantom uploads

12,400 successful uploads, zero new objects

The discrepancy that should have been the first alert

We came in during the third hour of the incident. The team had been chasing the fanout consumer because user reports were all of the form 'my avatar is broken' and the only service touching media after upload was fanout. Their working theory was that fanout was racing the CDN, or that the notification payload was missing a key, or that signed URLs were expiring early. Three engineers were staring at fanout-service logs and finding nothing wrong, because there was nothing wrong with fanout-service.

The question we asked, which is the question we always ask first when an upload pipeline misbehaves: how many objects has the bucket actually gained in the last hour? Not how many uploads the API recorded. Not how many notifications fanout received. How many real objects exist now that did not exist sixty minutes ago? We listed the bucket with mc and the answer was zero. The bucket had not gained a single object since 02:14 that morning, which lined up almost exactly with the rollout of the security hardening PR the platform team had merged two days prior.

# count objects added in the last hour
mc find local/bleater-media --newer-than 1h | wc -l
# 0

# meanwhile the storage-service success counter
curl -s http://prometheus/api/v1/query \
  --data-urlencode 'query=sum(increase(storage_service_put_object_success_total[1h]))'
# {"status":"success","data":{"result":[{"value":[..., "2074"]}]}}

Two views of the same hour. The SDK was confident. The bucket was not.

Once we had that gap on a shared screen the room changed. The fanout investigation got paused. The new question was: why is the SDK reporting success for writes that never persisted?

What the SDK thought, and what the server actually did

Where the 200 came from when the object never landed

This is the part of the story that is worth understanding even if you never touch MinIO. The storage service was using a streaming PutObject path. The client opens a connection, the server accepts headers and begins reading the body, and the bucket notification configuration is wired to fire on the API receipt of the PutObject call. In a healthy run, the server then writes the object, the response is 200, and the notification correctly reflects a real write. In our broken run, the server accepted the headers, fired the notification, evaluated the IAM policies, hit the explicit Deny, and closed the stream. The client SDK saw the connection close after headers were ack'd and treated it as success because the response framing looked clean enough at the transport layer. The notification had already gone out. The audit log recorded the deny. Nobody was reading the audit log.

Enabling the MinIO audit target was the diagnostic turn. Two commands and the lie unwound itself.

mc admin config set local audit_webhook:1 \
  endpoint="http://collector:8080/minio-audit" enable=on
mc admin service restart local

# tail the collector for a few seconds
# {"api":{"name":"PutObject","bucket":"bleater-media",
#        "object":"avatars/u-83421.jpg","status":"AccessDenied",
#        "statusCode":403},
#  "requestClaims":{"accessKey":"storage-service"},
#  "error":{"message":"Access Denied.",
#           "source":["cmd/auth-handler.go:checkRequestAuthTypeCredential"]}}

Audit log showed 403 AccessDenied on every PutObject from the storage-service identity. The client never saw it.

The storage-service identity had a user policy that explicitly granted s3:PutObject on arn:aws:s3:::bleater-media/*. We confirmed this in two seconds. Which meant the deny had to be coming from somewhere else.

Where the explicit Deny was hiding

The bucket policy nobody had read since the hardening PR

MinIO, like S3, evaluates IAM in two layers. The user (or service account) policy attached to the identity is one layer. The bucket policy attached to the resource is the other. An explicit Deny in either layer overrides any Allow in either layer. The hardening PR had added a bucket policy intended to lock down a different identity, an analytics reader that had been overprovisioned, and the author had written it as a Deny with a NotPrincipal exception, which inverted the intended scope. The effective rule said: deny s3:PutObject on this bucket for everyone who is not the analytics-reader identity. Which of course included the storage service.

# quote the URL so the shell does not treat '?' as a glob character
curl -s -u "$ADMIN:$SECRET" \
  "http://minio:9000/minio/admin/v3/get-bucket-policy?bucket=bleater-media" \
  | jq .

# {
#   "Version": "2012-10-17",
#   "Statement": [
#     {
#       "Sid": "RestrictWritesToAnalyticsReader",
#       "Effect": "Deny",
#       "NotPrincipal": { "AWS": ["arn:aws:iam:::user/analytics-reader"] },
#       "Action": ["s3:PutObject"],
#       "Resource": ["arn:aws:s3:::bleater-media/*"]
#     }
#   ]
# }

The bucket policy that swallowed every write. NotPrincipal with Deny is a footgun in any S3-compatible IAM.

We have seen NotPrincipal misused in three separate engagements this year. It reads as if it means 'apply this rule to everyone except this principal' the same way a NotAction would, but the semantics interact badly with cross-account and service-account identities. If you are writing a Deny that you want scoped to a specific identity, write the Deny with Principal naming the identity you mean to block. Do not invert it. The blast radius of a wrong inversion is the entire bucket.
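
To make the deny-wins interaction concrete, here is a minimal sketch of the evaluation order in Python. It is a toy model, not MinIO's or AWS's evaluator: actions are matched exactly, resources and wildcards are ignored, and the user policy is modelled as a statement naming its own identity. The `Statement` class, `is_allowed`, and the three constants are hypothetical names introduced only for this illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Statement:
    """Toy policy statement: exact-match actions, no resources or wildcards."""
    effect: str  # "Allow" or "Deny"
    actions: frozenset[str]
    principals: frozenset[str] = field(default_factory=frozenset)
    not_principals: frozenset[str] = field(default_factory=frozenset)

    def applies_to(self, identity: str, action: str) -> bool:
        if action not in self.actions:
            return False
        if self.not_principals:
            # NotPrincipal matches every identity except the ones listed.
            return identity not in self.not_principals
        return identity in self.principals


def is_allowed(identity: str, action: str, statements: list[Statement]) -> bool:
    """Deny-wins evaluation across all layers (user policy plus bucket policy)."""
    matching = [s for s in statements if s.applies_to(identity, action)]
    if any(s.effect == "Deny" for s in matching):
        return False  # an explicit Deny beats every Allow
    return any(s.effect == "Allow" for s in matching)  # no Allow means denied


# Assumption: the user policy is simplified to a single Allow for the identity.
USER_ALLOW = Statement("Allow", frozenset({"s3:PutObject"}),
                       principals=frozenset({"storage-service"}))
# The inverted statement from the hardening PR.
INVERTED_DENY = Statement("Deny", frozenset({"s3:PutObject"}),
                          not_principals=frozenset({"analytics-reader"}))
# A Deny that names the identity it is meant to block.
SCOPED_DENY = Statement("Deny", frozenset({"s3:PutObject", "s3:DeleteObject"}),
                        principals=frozenset({"analytics-reader"}))
```

In this model, `is_allowed("storage-service", "s3:PutObject", [USER_ALLOW, INVERTED_DENY])` evaluates to `False`, because the NotPrincipal Deny matches every identity other than analytics-reader. Swapping in `SCOPED_DENY` makes the same call return `True` while analytics-reader stays blocked.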

Before we touched anything we wanted to rule out the obvious adjacent causes, because removing a security-hardening policy at 06:00 without confirmation is the kind of fix that becomes its own incident. We checked credential expiry on the storage-service service account (valid for another 47 days), checked network policy for any new egress restrictions from the storage-service namespace (none), and confirmed bucket versioning was off so we were not chasing delete markers. The audit log had already told us the answer; we just wanted the rollback to be unambiguous when we wrote it up.

Removing the Deny without re-opening the bucket

The four-minute patch and the queue we had to reconcile

Two questions before patching. First, did we want to fix the bucket policy in place, or revert the hardening PR entirely? We chose patch in place. The hardening PR had also tightened three other identities correctly, and reverting would have undone work that was real. Second, did we want to leave the analytics-reader restriction in some form? Yes, but written correctly. We rewrote the statement as an explicit Deny on the analytics-reader principal for write actions, which is what the author had intended.

cat > /tmp/bleater-media-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BlockAnalyticsReaderWrites",
      "Effect": "Deny",
      "Principal": { "AWS": ["arn:aws:iam:::user/analytics-reader"] },
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::bleater-media/*"]
    }
  ]
}
EOF

curl -s -u "$ADMIN:$SECRET" \
  -X PUT \
  --data-binary @/tmp/bleater-media-policy.json \
  "http://minio:9000/minio/admin/v3/set-bucket-policy?bucket=bleater-media"

# validate with a real write from the storage-service identity
curl -s -X PUT -T /tmp/canary.bin \
  -H "Authorization: ...storage-service-sigv4..." \
  http://minio:9000/bleater-media/canary/$(date +%s).bin

mc ls local/bleater-media/canary/ | tail -1
# [2024-...] 4.0KiB STANDARD 1717420831.bin

Replace the inverted NotPrincipal with an explicit Principal Deny, then prove with a canary that the storage-service identity can write.

The canary landed. Real uploads from the application resumed within the next minute as new requests came in. That fixed the forward path. It did not fix the past six hours.

The phantom notification problem was harder to bound. The fanout service had processed roughly 12,400 notification events for objects that did not exist, which meant roughly 12,400 timeline references to media that would 404 forever. We pulled the notification log from the RabbitMQ stream and diffed it against the actual object listing in the bucket. The count of phantom references came in at 12,387. We pushed a one-shot reconciliation job that re-emitted upload prompts to the affected users for any media uploaded in that window, because we had no way to recover the original bytes; the storage service had streamed them to a connection that was closed before persistence.
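
The diff itself is a set difference between keys that were announced and keys that exist. The sketch below shows that step in Python under stated assumptions: `find_phantom_keys` is a hypothetical helper, the notified keys are taken to be already extracted from the notification log, and the bucket keys are taken to come from a recursive listing. It is not the reconciliation job that ran during the incident.

```python
from typing import Iterable, List


def find_phantom_keys(notified_keys: Iterable[str],
                      bucket_keys: Iterable[str]) -> List[str]:
    """Return keys that were announced by a notification but are not in the bucket.

    Order follows the notification log, and each key is reported once even if
    the log contains redelivered events for it.
    """
    existing = set(bucket_keys)
    already_reported = set()
    phantoms = []
    for key in notified_keys:
        if key in existing or key in already_reported:
            continue
        already_reported.add(key)
        phantoms.append(key)
    return phantoms
```

Deduplicating matters because a message stream can redeliver events, and counting a redelivery as a second phantom would inflate the number of users to re-prompt.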

sequenceDiagram
  participant C as Client
  participant SS as storage-service
  participant M as MinIO
  participant N as RabbitMQ
  participant F as fanout-service
  C->>SS: POST /media (image bytes)
  SS->>M: PutObject (stream)
  M->>N: bucket notification (API receipt)
  N->>F: notify object created
  M->>M: evaluate IAM, hit Deny
  M-->>SS: connection closed (SDK reads as 200)
  SS-->>C: 200 OK
  F->>M: GET object for processing
  M-->>F: 404 (object never persisted)
  Note over F: phantom notification, broken link in feed

The notification fires before the deny evaluation completes. Every layer below MinIO sees success.

The synthetic that would have caught this in about three minutes

What we changed so the next deny-wins conflict is not silent

The deeper lesson here is not about MinIO. It is that SDK success and server persistence are different facts, and most observability stacks conflate them. Every metric on the storage service dashboard came from the SDK return code. Every metric on the fanout dashboard came from notification receipt. Nothing in the stack was sourced from the only ground truth that mattered, which was the count of objects actually present in the bucket. The hardening PR could have done much worse than this and we would still have been blind.

We made three changes after this incident. First, a synthetic that writes a canary object every 60 seconds and then lists the bucket to confirm the canary is there. The metric is the gap between writes and confirmed reads, and it alerts when that gap exceeds two intervals. This is the kind of probe we now build into every object-storage path we touch. Second, the MinIO audit webhook now ships to the log aggregation pipeline with a Loki alert rule on any sustained rate of statusCode 403 for PutObject, scoped per identity. Third, we wrote a pre-merge check for bucket policy changes that flags any statement using NotPrincipal with Effect Deny and requires an explicit reviewer sign-off.
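
The canary probe from the first change can be sketched as follows. This is a minimal illustration, not the deployed synthetic: `probe_once` and `CanaryGap` are hypothetical names, the storage client is passed in as two plain callables so no particular SDK is assumed, and the gap is assumed to be measured as time since the last confirmed canary.

```python
from typing import Callable, Iterable


def probe_once(put_object: Callable[[str, bytes], None],
               list_keys: Callable[[str], Iterable[str]],
               now: float,
               prefix: str = "canary/") -> bool:
    """Write one canary, then confirm it through a listing, not the write result."""
    key = f"{prefix}{int(now)}.bin"
    try:
        put_object(key, b"canary")
    except Exception:
        pass  # a raised error and a silent drop both show up as a missing key
    return key in set(list_keys(prefix))


class CanaryGap:
    """Tracks time since the last confirmed canary and decides when to alert."""

    def __init__(self, started_at: float, interval_s: float = 60.0,
                 max_gap_intervals: int = 2) -> None:
        self.last_confirmed_at = started_at
        self.threshold_s = interval_s * max_gap_intervals

    def record(self, confirmed: bool, at: float) -> bool:
        """Record one probe result; return True when the alert should fire."""
        if confirmed:
            self.last_confirmed_at = at
        return (at - self.last_confirmed_at) > self.threshold_s
```

The point of the design is that `probe_once` ignores whatever the write call returns and trusts only the listing, so a write that reports success without persisting still registers as a gap.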

# Loki alert: deny-wins on PutObject for any service identity
# The json stage flattens nested keys with underscores, so
# requestClaims.accessKey becomes the label requestClaims_accessKey.
- alert: MinioPutObjectDenied
  expr: |
    sum by (requestClaims_accessKey) (
      rate({job="minio-audit"}
        | json
        | api_name = "PutObject"
        | api_statusCode = "403"
        [5m])
    ) > 0
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "MinIO denying PutObject for {{ $labels.requestClaims_accessKey }}"
    runbook: "Check bucket policy and user policy for explicit Deny statements."

The alert that would have paged the on-call within five minutes of the hardening PR rolling out.
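
The pre-merge check for NotPrincipal combined with Deny is simple enough to sketch in a few lines of Python. This is an illustrative version under assumptions, not the check we actually run: `flag_not_principal_denies` is a hypothetical name, it expects the policy document as a JSON string, and it only reports statements without enforcing the reviewer sign-off.

```python
import json
from typing import List


def flag_not_principal_denies(policy_json: str) -> List[str]:
    """Return an identifier for every Deny statement that uses NotPrincipal.

    The identifier is the statement's Sid when present, else its position.
    """
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # Policy grammar allows a single statement object instead of a list.
        statements = [statements]
    flagged = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") == "Deny" and "NotPrincipal" in statement:
            flagged.append(statement.get("Sid", f"statement[{index}]"))
    return flagged
```

Run against the policy from the hardening PR, this returns `["RestrictWritesToAnalyticsReader"]`. Run against the rewritten policy, it returns an empty list, so a CI wrapper could fail the build or request sign-off whenever the list is non-empty.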

If your bucket notifications drive downstream business logic, you have the same shape of risk we did. The notification path and the persistence path are not the same path, and the IAM evaluation sits between them. Assume nothing about server persistence based on SDK return codes. Read the audit log.

If your object store is quietly lying to your monitors

When a hardening PR silently revokes write access in production

This class of incident is hard for a specific reason: every monitoring surface a normal team has built reports healthy, because every normal monitoring surface reads from the layer above the failure. The teams we work with that have hit this pattern were not careless. They had dashboards, they had alerts, they had error budgets. None of those instruments were positioned to see a server-side deny that the SDK swallowed. The fix is a small synthetic and an audit log alert, and they take an afternoon to build. Getting to the point of knowing you need them usually takes one bad incident.

We run object-storage and IAM recovery engagements often enough that this exact shape, a hardening PR introducing a deny-wins conflict against a service account, has come up three times this year on three different stacks (MinIO, Ceph RGW, and AWS S3 with an SCP). The mechanics are the same in all three. If your team is staring at green dashboards and broken user reports, the gap between SDK success and ground-truth persistence is the first place to look. If you want a second set of eyes on a hardening rollout before it lands, or you are inside one of these incidents right now, book an infrastructure review with our team and we will be on a bridge with you the same day. We also document the audit-log and synthetic patterns in more depth on the infrastructure audit readiness page if you want to read ahead.
