Role-Based Access Control Per Tenant

Per-tenant RBAC is the discipline of resolving and enforcing role-to-permission grants that are scoped to a single tenant, so that an administrator in one tenant can never act on another tenant's resources; it operates within the broader Auth Isolation & Cross-Tenant Access Control framework that governs how identity, sessions, and tokens stay separated across tenants.

The hard part is not defining roles — it is making sure that every permission check answers two questions at once: what can this role do and which tenant is it allowed to do it in. A role named admin means nothing on its own. It only becomes a grant when it is resolved against a specific tenant_id, and that resolution has to happen on every request, in every service, without a single code path that forgets the tenant dimension. Get the resolution model wrong and you do not get a bug — you get a cross-tenant breach.

This page walks the full implementation: the prerequisites, a step-by-step build of the resolution and enforcement layers, the query-scoping and connection mechanics that keep checks honest, the security layer that closes escalation paths, and the operational metrics that tell you whether your cache and policy engine are keeping up. Two decisions deserve their own deep dives — how you shape the role and permission schema itself, covered in designing tenant-scoped permission models, and how you prove what changed and when, covered in auditing RBAC changes across tenants.

Prerequisites

Before wiring per-tenant RBAC into a live system, confirm the following are in place. Each one is load-bearing; skipping any of them pushes enforcement to a layer that cannot see the tenant boundary.

Step-by-Step Implementation

The flow has five ordered stages. Each runs before the next; a failure at any stage is a hard stop, never a fall-through to a default.

1. Resolve the tenant from a trusted claim

Tenant resolution comes first because every later check is scoped by it. Prefer the signed token claim over any routing hint. If you also read a subdomain or X-Tenant-ID header for routing, treat them as untrusted and require them to match the token claim — a mismatch is an attack signal, not a recoverable state.

import { Request, Response, NextFunction } from 'express';
import { AsyncLocalStorage } from 'node:async_hooks';

export const tenantContext = new AsyncLocalStorage<{ tenantId: string; userId: string }>();

export function resolveTenant(req: Request, res: Response, next: NextFunction) {
  // req.auth is assumed to be populated by upstream JWT-verification middleware.
  const claimTenant = req.auth?.tenantId;          // verified JWT claim
  const userId = req.auth?.userId;
  // Untrusted routing hints. Collect every hint that is present so each one is
  // checked, rather than letting the header mask a mismatching subdomain.
  const header = req.headers['x-tenant-id'];
  const hints = [...(Array.isArray(header) ? header : [header]), req.subdomains[0]]
    .filter((h): h is string => Boolean(h));

  if (!claimTenant || !userId) {
    return res.status(401).json({ error: 'No tenant claim in token' });
  }
  if (hints.some((hint) => hint !== claimTenant)) {
    // Someone is asking to act in a tenant their token does not authorize.
    return res.status(403).json({ error: 'Tenant routing/claim mismatch' });
  }

  tenantContext.run({ tenantId: claimTenant, userId }, () => next());
}

2. Load the user's role assignments for that tenant

Roles are looked up by the composite key. A user may hold different roles in different tenants, so the query must filter on both tenant_id and user_id. Never cache role assignments without the tenant in the cache key.

SELECT role
FROM   tenant_role_assignments
WHERE  tenant_id = $1
  AND  user_id   = $2;
-- Backing index:
-- CREATE INDEX idx_tra_lookup ON tenant_role_assignments (tenant_id, user_id);
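If the lookup is cached, the tenant has to be part of the cache key. The TypeScript sketch below is one minimal way to do that, assuming a hypothetical `loadRoles` function that runs the query above; `RoleCache`, the 60-second TTL, and the injectable clock are illustrative choices, not part of any library.

```typescript
// Hypothetical loader: runs the SELECT above and returns the role names.
type RoleLoader = (tenantId: string, userId: string) => Promise<string[]>;

interface RoleEntry {
  roles: string[];
  expiresAt: number;
}

class RoleCache {
  private entries = new Map<string, RoleEntry>();
  private loadRoles: RoleLoader;
  private ttlMs: number;
  private now: () => number;

  constructor(loadRoles: RoleLoader, ttlMs = 60_000, now: () => number = Date.now) {
    this.loadRoles = loadRoles;
    this.ttlMs = ttlMs;   // illustrative TTL
    this.now = now;       // injectable clock, handy in tests
  }

  // JSON-encoding the pair avoids the collisions a plain "a:b" join allows
  // when an id itself contains the separator.
  private key(tenantId: string, userId: string): string {
    return JSON.stringify([tenantId, userId]);
  }

  async rolesFor(tenantId: string, userId: string): Promise<string[]> {
    const k = this.key(tenantId, userId);
    const hit = this.entries.get(k);
    if (hit && hit.expiresAt > this.now()) return hit.roles;
    const roles = await this.loadRoles(tenantId, userId);
    this.entries.set(k, { roles, expiresAt: this.now() + this.ttlMs });
    return roles;
  }

  // Call when an assignment changes so the next request reloads it.
  invalidate(tenantId: string, userId: string): void {
    this.entries.delete(this.key(tenantId, userId));
  }
}
```

Because the key always carries both ids, the same user can hold admin in one tenant and viewer in another without either entry shadowing the other.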

3. Compile the role into a permission set

Roles are indirection; permissions are what you actually check. Compile each role into a flat, hashable permission set so that the runtime check is a constant-time membership test rather than a graph walk. A permission is a resource:action pair (for example invoice:read), scoped implicitly by the tenant context already on the request.

type Permission = `${string}:${string}`; // e.g. "invoice:read"

const ROLE_PERMISSIONS: Record<string, Permission[]> = {
  admin:  ['invoice:read', 'invoice:write', 'member:invite', 'member:remove'],
  editor: ['invoice:read', 'invoice:write'],
  viewer: ['invoice:read'],
};

export function compilePermissions(roles: string[]): Set<Permission> {
  const set = new Set<Permission>();
  for (const role of roles) {
    for (const perm of ROLE_PERMISSIONS[role] ?? []) set.add(perm);
  }
  return set; // O(1) membership checks downstream
}

4. Evaluate the access decision

The evaluation step takes the compiled set, the action being attempted, and the resource type. The default branch must be deny — any unknown action, missing role, or parse error returns false. The example uses an in-process evaluator; for declarative policy across many services, route the same inputs through OPA.

def evaluate_access(permissions: set[str], action: str, resource: str) -> bool:
    """Constant-time check. Unknown inputs deny by construction."""
    return f"{resource}:{action}" in permissions

For policy-as-code, the equivalent Rego keeps the tenant explicit in the input document and defaults to deny:

# policy.rego
package rbac

import rego.v1  # enables the `if` keyword; redundant but harmless on OPA 1.0+

default allow := false

allow if {
    input.tenant_id == input.resource.tenant_id        # never cross the boundary
    perm := sprintf("%s:%s", [input.resource.kind, input.action])
    input.permissions[perm]
}
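To make the input document concrete, here is a TypeScript sketch of its shape together with a local mirror of the allow rule. The field names follow the policy above; the `PolicyInput` type and the `allow` function are illustrative, and a mirror like this is only useful for unit-testing inputs, not as a replacement for the policy engine.

```typescript
// Shape of the input document that the Rego rule reads.
interface PolicyInput {
  tenant_id: string;
  action: string;
  resource: { kind: string; tenant_id: string };
  permissions: Record<string, boolean>;
}

// Local mirror of the rule. Anything missing or malformed denies.
function allow(input: PolicyInput): boolean {
  try {
    // The tenant on the request must equal the tenant that owns the resource.
    if (!input.tenant_id || input.tenant_id !== input.resource.tenant_id) return false;
    const perm = `${input.resource.kind}:${input.action}`;
    // Stricter than Rego truthiness: only an explicit `true` grants.
    return input.permissions[perm] === true;
  } catch {
    return false; // malformed input: deny by default
  }
}
```

Note that the tenant comparison comes first, so a valid permission string never matters if the resource belongs to a different tenant.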

5. Enforce, then audit the decision

The guard wraps a route or RPC handler. Every decision — allow and deny alike — is written to the audit sink with the tenant, user, action, and outcome. Denied checks are the early-warning system for probing; do not drop them.

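export function requirePermission(perm: Permission) {
  return (req: Request, res: Response, next: NextFunction) => {
    // req.permissions is assumed to hold the Set compiled in step 3, and `audit`
    // is the application's own audit sink.
    const store = tenantContext.getStore();
    // A missing context or permission set means an earlier stage was skipped: deny.
    const allowed = Boolean(store && req.permissions?.has(perm));
    audit.emit({ tenantId: store?.tenantId, userId: store?.userId, action: perm, allowed, at: Date.now() });
    if (!allowed) return res.status(403).json({ error: 'Forbidden' });
    next();
  };
}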

The figure below shows how a single request threads these stages, and where the tenant boundary is enforced at each hop.

Choosing an Evaluation Model

Three approaches dominate per-tenant RBAC. The right one depends on how many services need to share policy and how dynamic your rules are.

| Model | Where it runs | Tenant scoping | Latency | Best fit |
| --- | --- | --- | --- | --- |
| In-process Set lookup | Inside each service | Implicit via request context | Sub-microsecond | Monoliths, single hot path, few rules |
| OPA (Rego) | Sidecar or library | Explicit in input document | 1–3 ms (cached bundles) | Many services sharing one policy |
| AWS Cedar | Library / Verified Permissions | Explicit in entity store | 1–2 ms | Hierarchical resources, formal analysis |

For most teams the pragmatic path is in-process Set lookups behind a cache for the hot read path, with OPA or Cedar reserved for policies that must be authored once and enforced across many services. Mixing them is fine as long as the deny-by-default contract is identical in both.

There is a second axis that the table does not capture: how dynamic the grants are. Static role-to-permission maps (admin, editor, viewer) compile cleanly into Sets and rarely change, so an in-process lookup with a long cache TTL is ideal. Attribute-based rules, such as a user who may approve an invoice only below their own spend limit, or only during business hours in the tenant's region, cannot be flattened into a Set ahead of time because the decision depends on request-time attributes. Those belong in OPA or Cedar, where the policy receives the full input document and can reason over it. The mistake is forcing attribute logic into the Set model with an explosion of synthetic permission strings: every attribute value multiplies the number of strings to maintain, and the audit trail becomes unreadable. Keep coarse role grants in the fast path and push genuinely conditional logic into the policy engine.
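As a rough illustration of that split, the TypeScript sketch below keeps the coarse grant as a Set lookup and evaluates the spend-limit condition separately at request time. In a multi-service deployment the conditional half would more likely live in OPA or Cedar; the function name, the field names, and the invoice:approve permission are all hypothetical.

```typescript
type Permission = `${string}:${string}`;

interface ApprovalRequest {
  permissions: Set<Permission>; // compiled from roles, as in step 3
  spendLimitCents: number;      // hypothetical per-user attribute
  invoiceAmountCents: number;   // request-time attribute
}

function canApproveInvoice(req: ApprovalRequest): boolean {
  // Fast path: coarse role grant. Absent permission denies immediately.
  if (!req.permissions.has("invoice:approve")) return false;
  // Conditional path: depends on request-time data, so it cannot be precompiled.
  // Non-finite or negative amounts deny rather than slip through a comparison.
  return (
    Number.isFinite(req.invoiceAmountCents) &&
    req.invoiceAmountCents >= 0 &&
    req.invoiceAmountCents < req.spendLimitCents
  );
}
```

The permission set stays small (one string for the action) while the limit itself remains data, which is the property that a synthetic string per limit value would give up.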

Whatever model you pick, the role and permission schema underneath it is the decision that ages worst if rushed — role hierarchies, permission granularity, and how to avoid a combinatorial matrix all need to be settled before the first grant is issued.

Dynamic Query Scoping & Connection Handling

A permission check that passes the application layer but lets a query read another tenant's rows is worthless. Enforcement has to reach the data layer, and the cleanest way is to inject the tenant filter where queries are built rather than trusting every caller to add a WHERE tenant_id = ... clause. This is the same principle that governs the broader tenant-aware data routing & query scoping layer.

A Prisma client extension reads the active tenant from context and refuses to run any query that lacks it — the absence of a tenant is a thrown error, never a silent full-table scan.

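import { PrismaClient } from '@prisma/client';
import { tenantContext } from './tenant-context';

const base = new PrismaClient();

// Assumes every model has a scalar tenantId column. Nested writes and relation
// includes are not rewritten here; RLS is the backstop for those.
export const prisma = base.$extends({
  query: {
    $allModels: {
      async $allOperations({ operation, args, query }) {
        const tenantId = tenantContext.getStore()?.tenantId;
        if (!tenantId) throw new Error('Refusing unscoped query: no tenant context');
        const a = (args ?? {}) as any;
        if (operation === 'create') {
          // create has no `where`; stamp the tenant onto the new row instead.
          a.data = { ...a.data, tenantId };
        } else if (operation === 'createMany' || operation === 'createManyAndReturn') {
          a.data = [a.data].flat().map((row: any) => ({ ...row, tenantId }));
        } else {
          // Reads, updates, and deletes: force the tenant filter, overriding any caller value.
          a.where = { ...a.where, tenantId };
          if (operation === 'upsert') a.create = { ...a.create, tenantId };
        }
        return query(a);
      },
    },
  },
});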

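For defense in depth, pair the application filter with PostgreSQL row-level security so the database rejects a cross-tenant read even if the application layer is bypassed. The policy reads the tenant from a custom setting, app.tenant_id, that the application sets for each transaction:

ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS unless it is forced; add this if the app connects as the owner.
ALTER TABLE invoices FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON invoices
  -- An unset or empty app.tenant_id makes this expression raise an error rather than match rows.
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

Connection handling matters here. With transaction-mode pooling (PgBouncer), set app.tenant_id with SET LOCAL inside the transaction, or with set_config('app.tenant_id', value, true) when the value has to be passed as a bind parameter, so that the setting is discarded at commit or rollback and cannot leak to the next borrower of the pooled connection. A session-level SET on a pooled connection is a classic cross-tenant bug: the next request inherits the previous tenant's scope.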
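A minimal TypeScript sketch of that per-transaction scoping follows. It assumes a client whose query method is shaped like node-postgres's `client.query(text, values)`; the `SqlClient` interface and the `withTenant` helper are hypothetical names introduced here. It uses PostgreSQL's `set_config(name, value, is_local)` with `is_local = true`, which behaves like SET LOCAL but accepts a bind parameter.

```typescript
// Minimal client shape, modeled on a query(text, values) style driver.
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  work: (c: SqlClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    // Transaction-local: the value is discarded at COMMIT or ROLLBACK,
    // so the next borrower of this connection never sees it.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Every query issued inside `work` then runs under the RLS policy for that tenant, and nothing outside the transaction inherits the setting.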

Security Enforcement & Access Control

Per-tenant RBAC fails open in subtle ways. The controls below close the most common escalation paths.

| Layer | Control | What it stops |
| --- | --- | --- |
| Token | Tenant claim signed and verified | Header-spoofed tenant switching |
| Routing | Reject claim/route mismatch | Acting in an unauthorized tenant |
| Role lookup | Composite (tenant_id, user_id) key | Inheriting another tenant's roles |
| Evaluation | Default deny on any error | Implicit grant from a parse failure |
| Data | RLS + SET LOCAL per transaction | Cross-tenant reads via pooled connections |
| Audit | Log allow and deny | Undetected probing and escalation |

Two rules deserve emphasis. First, the token claim is authoritative — any routing hint that disagrees with it is rejected, never reconciled. Second, revocation must be immediate. When a role is removed, the cached permission set has to be invalidated before the next request can use it; a stale grant is an open door. Tie revocation into session handling so that a role change also forces re-evaluation of active sessions, as described in the session isolation & state management layer. Map external identity-provider groups to internal roles through SSO mapping & identity federation, and keep that mapping tenant-scoped so an IdP group never grants a role outside its own tenant.

The cache layer is where revocation usually goes wrong. Compiled permission sets live in Redis under a tenant-and-role key with a bounded TTL, and a grant or revoke publishes an invalidation message that every node consumes.

import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
TTL = 900  # 15 minutes; the ceiling on staleness, not the norm

def get_permissions(tenant_id: str, role: str) -> dict:
    key = f"rbac:{tenant_id}:{role}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    # fetch_matrix_from_db is the application's own loader (not shown here).
    matrix = fetch_matrix_from_db(tenant_id, role)
    r.setex(key, TTL, json.dumps(matrix))
    return matrix

def revoke(tenant_id: str, role: str) -> None:
    key = f"rbac:{tenant_id}:{role}"
    r.delete(key)
    r.publish("rbac:invalidate", key)  # other nodes drop their local copy
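On the consuming side, each node needs a handler that drops its local copy when a message arrives. The TypeScript sketch below shows one possible shape, under two assumptions that go beyond the Python above: that the message carries a publish timestamp alongside the key, and that a pub/sub subscriber (not shown) calls `onInvalidate` for every message. `LocalPermissionCache` and `Invalidation` are hypothetical names.

```typescript
// Assumed message shape. The Python above publishes only the key; adding the
// publisher's timestamp is what makes invalidation lag measurable.
interface Invalidation {
  key: string;
  publishedAtMs: number;
}

class LocalPermissionCache {
  private sets = new Map<string, Set<string>>();
  lastLagMs: number | null = null;

  static keyFor(tenantId: string, role: string): string {
    return `rbac:${tenantId}:${role}`; // same layout as the Redis key
  }

  put(tenantId: string, role: string, perms: Iterable<string>): void {
    this.sets.set(LocalPermissionCache.keyFor(tenantId, role), new Set(perms));
  }

  get(tenantId: string, role: string): Set<string> | undefined {
    return this.sets.get(LocalPermissionCache.keyFor(tenantId, role));
  }

  // Call from the pub/sub subscriber for every message on the channel.
  onInvalidate(msg: Invalidation, nowMs: number = Date.now()): void {
    this.sets.delete(msg.key);
    this.lastLagMs = nowMs - msg.publishedAtMs; // publish-to-drop gap
  }
}
```

The recorded gap compares clocks on two machines, so it is only as accurate as their clock synchronization; treat small values as noise and alarm on sustained growth.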

Operational Overhead & Scaling Metrics

Per-tenant RBAC is cheap when cached and expensive when it stampedes. Track these metrics and act at the thresholds.

| Metric | Healthy threshold | Mitigation when breached |
| --- | --- | --- |
| Permission-check p99 latency | < 2 ms | Move evaluation in-process; precompile sets |
| Cache hit ratio | > 95% | Raise TTL or warm hot tenant/role pairs |
| Invalidation lag (publish to drop) | < 100 ms | Co-locate Redis; use a dedicated pub/sub channel |
| DB role-lookup QPS | < cache backstop capacity | Add jittered TTL to prevent synchronized expiry |
| Denied-check rate per tenant | Baseline + alert on spike | Treat spikes as probing; rate-limit the principal |

The dominant cost in a microservice fleet is not the check itself but propagating tenant context across service hops; budget for it in gRPC or HTTP metadata and version that schema deliberately. Cache invalidation storms are the other scaling cliff: stagger TTLs with jitter so thousands of keys do not expire in the same second and stampede the database.
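TTL jitter can be as small as the TypeScript sketch below. It assumes the base TTL is a ceiling on staleness, so it only ever shortens the TTL; the function name, the 10% default spread, and the injectable random source are illustrative choices.

```typescript
// Returns a TTL in (base * (1 - spread), base], never above the base.
function jitteredTtlSeconds(
  baseSeconds: number,
  spread = 0.1,                         // fraction of the base to spread over
  random: () => number = Math.random,   // injectable for deterministic tests
): number {
  // random() is in [0, 1), so the factor is in (1 - spread, 1].
  const factor = 1 - spread * random();
  return Math.max(1, Math.round(baseSeconds * factor));
}
```

With a 900-second base and the default spread, keys written in the same instant expire across roughly a 90-second window instead of in the same second.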

Two failure shapes are worth instrumenting explicitly because they masquerade as healthy systems. The first is the silent stale grant: invalidation lag creeps above its threshold, a revoked role keeps working for tens of seconds, and nothing in your dashboards flags it because every check still returns a clean allow. Measure invalidation lag directly — timestamp the publish, timestamp the local drop, and alarm on the gap — rather than inferring it from cache hit ratio. The second is the denied-check spike: a sudden rise in 403s for one tenant is rarely a bug in your code. It is almost always a principal probing for permissions it does not have, often after a partial credential compromise. Route denied-check counts per tenant and per principal into your alerting, and wire a spike to a rate limit on the offending principal so probing is throttled rather than merely logged.
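For the denied-check spike, a sliding-window counter per tenant and principal is enough to get started. The TypeScript sketch below is a simplification: it uses a fixed threshold rather than a learned baseline, keeps state in process memory, and all names in it are hypothetical.

```typescript
class DenialSpikeDetector {
  private hits = new Map<string, number[]>(); // denial timestamps per principal
  private windowMs: number;
  private threshold: number;

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Record one denied check. Returns true when the principal has exceeded
  // the threshold inside the window and should be rate-limited.
  recordDenial(tenantId: string, userId: string, nowMs: number = Date.now()): boolean {
    const key = JSON.stringify([tenantId, userId]); // tenant is part of the key
    const cutoff = nowMs - this.windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.hits.set(key, recent);
    return recent.length > this.threshold;
  }
}
```

A production version would also evict idle principals and share counts across nodes; the sketch only shows the per-tenant, per-principal keying and the window arithmetic.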

Cost scaling is close to linear in the number of services that must perform a check, because each adds one context-propagation hop and one cache lookup. It is sub-linear in tenant count as long as the cache key includes the tenant and hot tenants dominate traffic — a handful of large tenants will hold most of the working set, so warming their role pairs at deploy time keeps the cold-start tail short without preloading the entire estate into memory.

Pitfalls & Anti-Patterns

Frequently Asked Questions

How do I stop a tenant admin from acting on another tenant's resources? Bind the tenant to a signed token claim, reject any request where a routing hint disagrees with the claim, and key every role lookup and cache entry on (tenant_id, ...). The boundary then travels with the request and no code path can resolve a role outside its own tenant.

Should I use OPA or an in-process check for permission evaluation? Use an in-process Set lookup on the hot path for lowest latency, and reach for OPA or Cedar when the same policy must be authored once and enforced across many services. The deny-by-default contract must be identical whichever you choose.

How fast must role revocation take effect? By the next request. Delete the compiled permission set from the cache and publish an invalidation so every node drops its local copy; tie this into session re-evaluation so an active session cannot keep using a removed role.

Does row-level security replace application-layer checks? No — it backs them up. RLS catches a cross-tenant read if the application filter is bypassed, but it does not express action-level permissions like invoice:write. Run both, and use SET LOCAL so the tenant setting never leaks across pooled connections.