Auditing RBAC Changes Across Tenants

Every grant, revoke, and role edit in a multi-tenant system must produce a tamper-evident record that is scoped to exactly one tenant and reconstructable on demand. This page sits under Role-Based Access Control Per Tenant, which defines how roles are assigned in the first place; here we focus on recording every mutation to those assignments so that "who granted whom which role, when, and why" has a single authoritative answer per tenant.

Problem Framing

RBAC mutations are the highest-leverage events in a SaaS application. A single ROLE_GRANTED row can hand a user the ability to read another customer's data, change billing, or export the entire database. Auditors treat RBAC changes as a primary control, and incident responders reach for the role-change log first. If that log is wrong, missing, or contaminated with another tenant's events, you cannot answer the only question that matters during a breach: did this privilege change actually happen, who caused it, and was it authorized.

Three failure shapes recur. Silent loss: the role mutation commits but the audit write fails (queue down, async worker crashed), so operational state and audit state diverge. Cross-tenant bleed: an audit query or compliance export omits a tenant_id predicate and returns rows belonging to other customers, which is itself a reportable incident. Tampering: an attacker with database access edits or deletes a row to hide an escalation, and nothing detects the gap. The defense against all three is the same architecture: capture the change inside the same transaction that mutates the role, store it append-only, and chain each event to its predecessor so deletions and edits become detectable.

The auditing problem is harder in a multi-tenant system than in a single-tenant one for two reasons. First, the blast radius of a wrong record is larger: a leaked audit query exposes one customer's administrative activity to another, and that is exactly the trust boundary the product sells. Second, the volume is uneven. A handful of large tenants will generate the bulk of role churn, so a naive single-table design suffers hot-partition contention and slow per-tenant queries during the very compliance export that a large customer demands. Partitioning by tenant_id and chaining per tenant solves both at once: each tenant gets an independent, prunable, independently verifiable chain. The cost is that the chain head must be read and advanced under a per-tenant lock, which is cheap when amortized over the relay but fatal if you skip it.

A second design choice is what to record in the diff. Recording only "role X granted" is insufficient for forensics; you cannot tell whether the grant was idempotent or genuinely new, nor what the user could do before and after. Capture the full before/after set of effective roles (or permissions, if your model is permission-granular) so that replay reconstructs exact state without consulting any other table. This makes the audit store self-contained, which matters when the operational user_roles table has itself been mutated many times since the event.

The flow below shows the path from a role mutation to a verifiable, tenant-partitioned record.

The role mutation and its audit record commit atomically; a relay then hashes and chains each event into a tenant-partitioned, append-only store.

The shape of each event depends on how you model permissions; if your roles carry composite or hierarchical scopes, read designing tenant-scoped permission models before fixing the diff format, because the audit diff should mirror the permission structure you actually grant.

Step-by-Step Guide

1. Define an append-only event table partitioned by tenant

Use a composite uniqueness constraint on (tenant_id, id) and partition or apply row-level security on tenant_id. The prev_hash column anchors the tamper-evidence chain.

CREATE TABLE rbac_audit_event (
  id          UUID        NOT NULL DEFAULT gen_random_uuid(),
  tenant_id   TEXT        NOT NULL,
  actor_id    TEXT        NOT NULL,
  action      TEXT        NOT NULL CHECK (action IN ('ROLE_GRANTED','ROLE_REVOKED','ROLE_MODIFIED')),
  target_user TEXT        NOT NULL,
  diff        JSONB       NOT NULL,
  prev_hash   TEXT,
  this_hash   TEXT        NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (tenant_id, id)
) PARTITION BY LIST (tenant_id);

CREATE INDEX rbac_audit_tenant_time ON rbac_audit_event (tenant_id, created_at);

2. Capture the mutation in the same transaction via an outbox

Write the role change and the audit row inside one transaction. If the transaction rolls back, neither persists; operational and audit state can never diverge.

// services/grantRole.ts
import { PrismaClient } from "@prisma/client";
const prisma = new PrismaClient();

export async function grantRole(tenantId: string, actorId: string, targetUser: string, role: string) {
  return prisma.$transaction(async (tx) => {
    const before = await tx.userRole.findMany({ where: { tenantId, userId: targetUser } });
    await tx.userRole.create({ data: { tenantId, userId: targetUser, role } });
    const after = await tx.userRole.findMany({ where: { tenantId, userId: targetUser } });

    await tx.auditOutbox.create({
      data: {
        tenantId,
        actorId,
        action: "ROLE_GRANTED",
        targetUser,
        diff: { before: before.map((r) => r.role), after: after.map((r) => r.role) },
      },
    });
  });
}

3. Relay the outbox and chain each event

A separate worker drains the outbox, computes the hash from the previous event in the same tenant, and writes the immutable record. This keeps the request path fast and the chain per-tenant.

# relay.py
import hashlib, json

def chain_hash(event: dict, prev_hash: str | None) -> str:
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    base = hashlib.sha256(body).hexdigest()
    return hashlib.sha256(f"{prev_hash or ''}{base}".encode()).hexdigest()

def relay(row, db):
    prev = db.fetchone(
        "SELECT this_hash FROM rbac_audit_event "
        "WHERE tenant_id = %s ORDER BY created_at DESC LIMIT 1",
        (row["tenant_id"],),
    )
    prev_hash = prev["this_hash"] if prev else None
    this_hash = chain_hash(row["payload"], prev_hash)
    db.execute(
        "INSERT INTO rbac_audit_event "
        "(tenant_id, actor_id, action, target_user, diff, prev_hash, this_hash) "
        "VALUES (%(tenant_id)s, %(actor_id)s, %(action)s, %(target_user)s, "
        "%(diff)s, %(prev_hash)s, %(this_hash)s)",
        {**row, "prev_hash": prev_hash, "this_hash": this_hash},
    )
    db.execute("DELETE FROM audit_outbox WHERE id = %s", (row["id"],))

4. Enforce tenant scoping at the database, not the query

Application-level WHERE tenant_id = ... clauses get forgotten. Add row-level security so a forgotten predicate cannot leak across tenants, even from a reporting tool.

ALTER TABLE rbac_audit_event ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON rbac_audit_event
  USING (tenant_id = current_setting('app.tenant_id', true));

5. Stream changes for alerting

Emit a structured record to your SIEM on each relay so bulk grants or off-hours escalations trigger alerts. The detailed retention and export side of this belongs to the broader tenant audit logging architecture; the role-change feed is one input into it.

{"ts":"2026-06-21T14:02:11Z","tenant_id":"acme","actor_id":"u_91","action":"ROLE_GRANTED","target_user":"u_44","role":"billing_admin"}

Verification

Two things must hold: every mutation produced exactly one event, and the chain is intact. Verify the chain by recomputing each this_hash from its predecessor and the stored payload, scoped to one tenant.

# verify_chain.py
import hashlib, json

def chain_hash(event, prev_hash):
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    base = hashlib.sha256(body).hexdigest()
    return hashlib.sha256(f"{prev_hash or ''}{base}".encode()).hexdigest()

def verify(events):  # events ordered by created_at, single tenant
    prev = None
    for e in events:
        expected = chain_hash(e["payload"], prev)
        assert e["this_hash"] == expected, f"chain broken at event {e['id']}"
        prev = e["this_hash"]
    return True

Run the chain verifier as a scheduled job, not only on demand. A drift between the last verified position and the current head that grows without the job completing is itself a signal: either a tenant is producing events faster than the verifier can keep up (raise concurrency, verify per tenant in parallel) or the verifier is crashing on a genuine break. Treat a verification failure as a security event and freeze further role mutations for the affected tenant until the break is explained, because a broken chain means you can no longer trust any later event in that tenant.

A SQL spot check confirms no event references a missing predecessor within its tenant:

SELECT e.tenant_id, e.id
FROM rbac_audit_event e
LEFT JOIN rbac_audit_event p
  ON p.this_hash = e.prev_hash AND p.tenant_id = e.tenant_id
WHERE e.prev_hash IS NOT NULL AND p.id IS NULL;

An empty result set means every chained event has a valid parent in the same tenant partition.

Failure Modes & Gotchas

Null tenant_id on async-relayed events. Symptom: events land with empty or undefined tenant. Root cause: the worker reads tenant from request-local state that no longer exists. Fix: carry tenant_id explicitly in the outbox row payload, never from ambient context.
Chain forks under concurrent grants. Symptom: two events share the same prev_hash. Root cause: the relay reads the latest hash without serializing per tenant. Fix: take a per-tenant advisory lock (pg_advisory_xact_lock(hashtext(tenant_id))) during relay.
Audit row missing after a successful grant. Symptom: operational state changed but no event exists. Root cause: audit written outside the mutation transaction. Fix: insert into the outbox inside the same $transaction as the role change.
Cross-tenant rows in a compliance export. Symptom: an export contains another customer's actors. Root cause: report ran without setting app.tenant_id, so RLS allowed everything. Fix: set the session variable per request and make it mandatory; never run exports as a superuser that bypasses RLS.

FAQ

Can this audit trail satisfy SOC2 Type II? Yes, when each event captures actor identity, tenant scope, timestamp, and the exact permission diff, and the records are immutable and retained for 12+ months. The hash chain provides the tamper-evidence that auditors expect for a change-management control.

Why an outbox instead of logging directly to a queue? A direct queue write can succeed or fail independently of the role mutation, which reintroduces divergence. The outbox couples the audit row to the mutation in one transaction, and the relay delivers it afterward with at-least-once semantics.

How do I reconstruct a user's permissions at a past point in time? Sort that tenant's events by created_at, apply each diff in order up to the target timestamp, and snapshot periodically so replay only covers deltas since the last snapshot.