Using Redis for Tenant Session Isolation
Redis is a common session store for high-throughput SaaS, but a single shared instance can leak one tenant's sessions into another unless you enforce isolation at both the key-naming and the protocol layer. This guide sits within Session Isolation & State Management and shows how to scope keys, lock them down with Redis 6+ ACLs, and run an atomic session lifecycle that keeps working through cluster resharding.
Problem Framing
A session store holds the authenticated identity for every request that follows login. If a lookup for tenant A can ever return tenant B's session, you have a cross-tenant authentication bypass — the highest-severity failure in a multi-tenant system. Three things break in practice.
First, naive key naming. A flat key like sess:{session_id} carries no tenant boundary, so any code path that guesses or replays a session ID reads the session regardless of which tenant issued it. Second, trusting the application layer alone. Prefixing keys with sess:{tenant_id}:{session_id} is necessary but not sufficient: a compromised service account or a buggy SDK can still issue GET sess:other-tenant:<session_id>, or enumerate another tenant's keys with SCAN and MATCH sess:other-tenant:*. Third, non-atomic lifecycle operations. Reading a session, checking its TTL, and extending it in three separate round-trips opens a TOCTOU window: the request can act on data from a session that was revoked mid-flight, and a renewal that rewrites the key recreates it.
The request path that keeps these boundaries intact works as follows: the gateway resolves the tenant before any Redis connection is acquired, and the connection it acquires is bound to an ACL user whose key pattern does not cover any other tenant's namespace.
Step-by-Step Guide
1. Define a deterministic, collision-resistant key schema
Use sess:{tenant_id}:{session_id}. The tenant ID is the first variable segment so ACL patterns can target it directly, and because each session is addressed by its full key, lookups stay O(1) and keep working when a cluster reshards. Do not use logical databases (SELECT 0, SELECT 1) for isolation: Redis Cluster supports only database 0, ACL key patterns cannot scope a user to a database, and per-database layouts complicate backup and migration tooling.
export function sessionKey(tenantId: string, sessionId: string): string {
  if (!/^[a-z0-9-]{1,64}$/.test(tenantId)) {
    throw new Error(`invalid tenant_id: ${tenantId}`);
  }
  return `sess:${tenantId}:${sessionId}`;
}
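The sessionKey helper above validates only the tenant segment. A possible extension, sketched here, also constrains the session ID and adds an inverse parser, so that code handling raw keys (for example the output of SCAN) can check which tenant a key belongs to. The names buildSessionKey and parseSessionKey and the session ID format (16 to 128 URL-safe characters) are assumptions for illustration, not part of the original schema.

```typescript
// Tenant IDs follow the same rule as sessionKey in the guide.
const TENANT_ID = /^[a-z0-9-]{1,64}$/;
// Assumption: session IDs are URL-safe tokens of 16 to 128 characters.
const SESSION_ID = /^[A-Za-z0-9_-]{16,128}$/;

export interface SessionKeyParts {
  tenantId: string;
  sessionId: string;
}

// Build a key, rejecting any segment that could add a ":" or a glob character.
export function buildSessionKey(tenantId: string, sessionId: string): string {
  if (!TENANT_ID.test(tenantId)) throw new Error("invalid tenant_id");
  if (!SESSION_ID.test(sessionId)) throw new Error("invalid session_id");
  return `sess:${tenantId}:${sessionId}`;
}

// Inverse of buildSessionKey; returns null for anything that is not a well-formed key.
export function parseSessionKey(key: string): SessionKeyParts | null {
  const parts = key.split(":");
  if (parts.length !== 3 || parts[0] !== "sess") return null;
  const tenantId = parts[1] ?? "";
  const sessionId = parts[2] ?? "";
  if (!TENANT_ID.test(tenantId) || !SESSION_ID.test(sessionId)) return null;
  return { tenantId, sessionId };
}
```

Because neither segment may contain a colon, every valid key has exactly three segments and the tenant is always the second one.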
2. Create sessions atomically with NX and PX
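Create-if-absent uses SET key value PX <ttl> NX. NX rejects the write if the key already exists, so a retried or colliding create cannot overwrite a live session. PX sets the expiration in the same command, so no key is ever stored without a TTL and no background cleanup job is needed.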
const key = sessionKey(tenantId, sessionId);
const payload = JSON.stringify({ userId, roles, iat: Date.now() });
const ttlMs = 3_600_000; // 1 hour
// NX: only create if absent. PX: TTL in milliseconds set atomically.
const result = await redis.set(key, payload, "PX", ttlMs, "NX");
if (result !== "OK") {
  // A null reply means NX refused the write because the key already exists.
  throw new Error("session id collision: key already exists");
}
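The creation step can also be wrapped so that callers receive a typed result instead of an exception when the key already exists. This is a minimal sketch: SetNxClient is a hypothetical interface that mirrors the argument order used in the snippet above, and createSession and CreateResult are illustrative names.

```typescript
// Hypothetical minimal client surface; the argument order follows the
// SET key value PX <ttl> NX call shown in the guide.
export interface SetNxClient {
  set(key: string, value: string, px: "PX", ttlMs: number, nx: "NX"): Promise<"OK" | null>;
}

export type CreateResult =
  | { created: true }
  | { created: false; reason: "exists" };

export async function createSession(
  client: SetNxClient,
  key: string,
  payload: Record<string, unknown>,
  ttlMs: number,
): Promise<CreateResult> {
  // Refuse TTLs that Redis would reject or that would create an already-expired key.
  if (!Number.isInteger(ttlMs) || ttlMs <= 0) {
    throw new RangeError("ttlMs must be a positive integer");
  }
  const reply = await client.set(key, JSON.stringify(payload), "PX", ttlMs, "NX");
  // A null reply means the key already existed and nothing was written.
  return reply === "OK" ? { created: true } : { created: false, reason: "exists" };
}
```

Returning a result type lets the caller decide whether an existing key means "retry with a new session ID" or "treat as an error", rather than baking that choice into the helper.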
3. Enforce per-tenant ACLs at the protocol layer
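Application prefixing is bypassable by any code that builds its own key; Redis 6+ ACLs are enforced by the server on every command. Define one user per tenant, scope it to that tenant's key pattern, and allow only the commands sessions need. Block KEYS, FLUSHDB, and the @dangerous category. Also disable or password-protect the built-in default user, which otherwise accepts unauthenticated connections with full access.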
# /etc/redis/users.acl (loaded via: aclfile /etc/redis/users.acl)
user tenant_acme on >s3cret-acme ~sess:acme:* +get +set +expire +ttl +eval -@dangerous -keys
user tenant_globex on >s3cret-globex ~sess:globex:* +get +set +expire +ttl +eval -@dangerous -keys
user admin on >admin-pass ~* +@all
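Hand-writing one line per tenant becomes error-prone as the tenant count grows. A sketch of generating the same rule programmatically follows; it stores the password as a SHA-256 hash using the ACL `#<hash>` form rather than the plaintext `>password` form. aclRuleForTenant is a hypothetical helper, and the command list simply mirrors the rules above.

```typescript
import { createHash } from "node:crypto";

const TENANT_ID = /^[a-z0-9-]{1,64}$/;

// Returns one line suitable for an ACL file, scoped to a single tenant's keys.
export function aclRuleForTenant(tenantId: string, password: string): string {
  // Validate first: an unchecked ID could inject extra ACL rules into the line.
  if (!TENANT_ID.test(tenantId)) throw new Error("invalid tenant_id");
  // Redis accepts "#" followed by the lowercase hex SHA-256 of the password.
  const hash = createHash("sha256").update(password, "utf8").digest("hex");
  return [
    `user tenant_${tenantId}`,
    "on",
    `#${hash}`,
    `~sess:${tenantId}:*`,
    "+get", "+set", "+expire", "+ttl", "+eval",
    "-@dangerous", "-keys",
  ].join(" ");
}
```

Validating the tenant ID before interpolation matters here for the same reason it does in the key builder: the ID ends up inside a security-relevant pattern.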
4. Bind the right ACL user at connection checkout
The pool must authenticate as the resolved tenant's user before returning a connection. The gateway supplies tenant_id; the pool maps it to credentials. A connection authenticated as tenant_acme cannot read sess:globex:* even if the application asks it to.
async function acquire(tenantId: string): Promise<Redis> {
  // Resolve credentials first so an unknown tenant never strands a checked-out connection.
  const creds = aclCredentials.get(tenantId);
  if (!creds) throw new Error(`no ACL user for tenant ${tenantId}`);
  const conn = await pool.checkout();
  await conn.auth(creds.username, creds.password); // AUTH <user> <pass>
  return conn;
}
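The checkout step above leaves it to the caller to hand the connection back. One way to make that harder to forget is a scoped helper that authenticates, runs a callback, and always releases. This is a sketch under assumptions: ConnPool, its release and destroy methods, and withTenantConnection are hypothetical names, since the guide does not specify the pool's API beyond checkout.

```typescript
export interface TenantConn {
  auth(username: string, password: string): Promise<unknown>;
}

// Hypothetical pool surface; only checkout() appears in the guide.
export interface ConnPool<C extends TenantConn> {
  checkout(): Promise<C>;
  release(conn: C): void;
  destroy(conn: C): void;
}

export interface AclCreds {
  username: string;
  password: string;
}

export async function withTenantConnection<C extends TenantConn, T>(
  pool: ConnPool<C>,
  creds: ReadonlyMap<string, AclCreds>,
  tenantId: string,
  fn: (conn: C) => Promise<T>,
): Promise<T> {
  const c = creds.get(tenantId);
  if (!c) throw new Error(`no ACL user for tenant ${tenantId}`);
  const conn = await pool.checkout();
  try {
    await conn.auth(c.username, c.password);
  } catch (err) {
    // The connection's identity is unknown after a failed AUTH; do not reuse it.
    pool.destroy(conn);
    throw err;
  }
  try {
    return await fn(conn);
  } finally {
    pool.release(conn);
  }
}
```

A released connection stays authenticated as the previous tenant until the next checkout re-authenticates it, which is why AUTH runs before the callback ever sees the connection.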
5. Validate and refresh in a single atomic round-trip
Reading, checking, and extending a session in separate commands creates a TOCTOU gap. A Lua script runs the whole operation atomically inside Redis, so the data it returns and the TTL it extends always belong to the same live key, and a session revoked before the script runs is reported as missing rather than renewed.
-- KEYS[1] = session key, ARGV[1] = sliding-window TTL floor (seconds)
local data = redis.call('GET', KEYS[1])
if not data then return {0, false} end
local ttl = redis.call('TTL', KEYS[1])
local floor = tonumber(ARGV[1])
if ttl > 0 and ttl < floor then
  redis.call('EXPIRE', KEYS[1], floor)
end
return {1, data}
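On the application side, the script's two-element reply still has to be decoded. The sketch below treats anything other than a well-formed hit as "no session", so a malformed payload fails closed. parseRenewReply and SessionPayload are illustrative names, and the payload shape (string userId, string roles, numeric iat) is an assumption based on the creation example in step 2.

```typescript
// Assumed payload shape, based on the fields written at session creation.
export interface SessionPayload {
  userId: string;
  roles: string[];
  iat: number;
}

// Decode the {flag, data} reply of the renew script.
// Returns null when the session is missing or its payload is not usable.
export function parseRenewReply(reply: unknown): SessionPayload | null {
  if (!Array.isArray(reply) || reply.length < 1) {
    throw new TypeError("unexpected script reply shape");
  }
  const ok: unknown = reply[0];
  const data: unknown = reply[1];
  // The script returns 0 with a nil second element when the key is absent.
  if (ok !== 1 || typeof data !== "string") return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(data);
  } catch {
    return null; // corrupt payload: fail closed
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const p = parsed as Record<string, unknown>;
  if (typeof p.userId !== "string" || typeof p.iat !== "number") return null;
  if (!Array.isArray(p.roles) || !p.roles.every((r) => typeof r === "string")) return null;
  return { userId: p.userId, roles: p.roles as string[], iat: p.iat };
}
```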
6. Invalidate a tenant's sessions with SCAN, never KEYS
Bulk revocation — for example when a tenant rotates credentials or a role changes — must iterate with SCAN. KEYS * blocks the single-threaded event loop and spikes latency for every tenant on the node. Coordinating bulk invalidation with role changes is covered in detail in invalidating tenant sessions on role change.
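// On Redis Cluster, run this against each master node and delete keys one at a
// time: SCAN covers only the node it is sent to, and a multi-key DEL that spans
// hash slots fails with CROSSSLOT.
async function invalidateTenant(redis: Redis, tenantId: string): Promise<number> {
  let cursor = "0";
  let removed = 0;
  do {
    // COUNT is a hint for work per call, not a limit on keys returned.
    const [next, keys] = await redis.scan(
      cursor, "MATCH", `sess:${tenantId}:*`, "COUNT", 200,
    );
    cursor = next;
    if (keys.length) removed += await redis.del(...keys);
  } while (cursor !== "0");
  return removed;
}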
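The function above assumes a single node. On Redis Cluster, SCAN only iterates the node it is sent to, and a multi-key DEL whose keys hash to different slots is rejected, so a cluster variant has to visit every master and remove keys individually. The sketch below shows that shape against a hypothetical ScanNode interface; with ioredis, such a list of per-node clients might come from cluster.nodes("master"), which is an assumption about the client in use.

```typescript
// Hypothetical per-node client surface (one instance per master node).
export interface ScanNode {
  scan(
    cursor: string, match: "MATCH", pattern: string, count: "COUNT", n: number,
  ): Promise<[string, string[]]>;
  unlink(key: string): Promise<number>;
}

export async function invalidateTenantOnNodes(
  nodes: readonly ScanNode[],
  tenantId: string,
): Promise<number> {
  if (!/^[a-z0-9-]{1,64}$/.test(tenantId)) throw new Error("invalid tenant_id");
  let removed = 0;
  for (const node of nodes) {
    let cursor = "0";
    do {
      const [next, keys] = await node.scan(
        cursor, "MATCH", `sess:${tenantId}:*`, "COUNT", 200,
      );
      cursor = next;
      // One key per command avoids cross-slot errors; UNLINK frees memory lazily.
      for (const key of keys) removed += await node.unlink(key);
    } while (cursor !== "0");
  }
  return removed;
}
```

SCAN only guarantees to return keys that exist for the whole iteration, so sessions created while the sweep is running may survive it; block new logins for the tenant first if that matters.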
Verification
Prove the boundary holds before shipping. Authenticate as one tenant's ACL user and confirm that reading another tenant's namespace is refused at the protocol level: a correct isolation setup returns a NOPERM error, not a value. The message text after NOPERM differs between Redis versions, so match on the prefix.
# Should succeed: tenant_acme reading its own namespace
redis-cli --user tenant_acme --pass s3cret-acme GET sess:acme:abc123
# (nil) <- key absent, but command was permitted
# Should be REFUSED: tenant_acme reaching into globex
redis-cli --user tenant_acme --pass s3cret-acme GET sess:globex:abc123
# (error) NOPERM this user has no permissions to access one of the keys
# Should be REFUSED: KEYS is blocked entirely
redis-cli --user tenant_acme --pass s3cret-acme KEYS '*'
# (error) NOPERM this user has no permissions to run the 'keys' command
In application tests, assert the same boundary and the atomic-renew contract:
test("acme connection cannot read globex sessions", async () => {
  const conn = await acquire("acme");
  await expect(conn.get("sess:globex:abc123")).rejects.toThrow(/NOPERM/);
});
test("renew is rejected after revocation", async () => {
  // Create the session first so the test fails if invalidation is a no-op.
  await redis.set("sess:acme:abc123", "{}", "PX", 60_000);
  await invalidateTenant(redis, "acme");
  const [ok] = (await redis.eval(renewScript, 1, "sess:acme:abc123", "900")) as [number, string | null];
  expect(ok).toBe(0); // session gone; not resurrected
});
Failure Modes & Gotchas
- Cross-tenant reads succeed. Symptom: tenant A retrieves tenant B's session. Root cause: the ACL user uses ~* instead of ~sess:{tid}:*, or the pool never re-authenticates per tenant. Fix: scope every ACL pattern to one tenant and call AUTH at checkout.
- Sessions expire early or never. Symptom: random forced logouts or sessions that outlive logout. Root cause: a SET path missing the PX flag, or EXPIRE math drift across unsynchronized nodes. Fix: set TTL at creation and run NTP on every node.
- Eviction storm of 401s. Symptom: a sudden burst of unauthorized responses. Root cause: allkeys-lru evicts live sessions under memory pressure. Fix: set maxmemory-policy volatile-lru so only keys with an explicit TTL are evicted. Sessions carry a TTL, so they remain eviction candidates under that policy; also size maxmemory for peak session volume and alert on the evicted_keys counter.
- Validation failures during scaling. Symptom: MOVED/ASK errors and dropped sessions while resharding. Root cause: a non-cluster-aware client. Fix: use a Redis Cluster client with retry/backoff and fall back to DB-backed validation during slot migration.
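The eviction failure mode in the list above is easier to catch early than to diagnose after the fact. A minimal sketch of one approach: poll INFO stats and compare the evicted_keys counter between samples. The helper names parseInfo and evictionDelta are hypothetical; evicted_keys is the counter Redis reports in the stats section of INFO.

```typescript
// Parse the "key:value" lines of an INFO reply, skipping "# Section" headers.
export function parseInfo(info: string): Map<string, string> {
  const out = new Map<string, string>();
  for (const line of info.split(/\r?\n/)) {
    if (line === "" || line.startsWith("#")) continue;
    const idx = line.indexOf(":");
    if (idx > 0) out.set(line.slice(0, idx), line.slice(idx + 1));
  }
  return out;
}

// Number of keys evicted between two INFO samples.
export function evictionDelta(prevInfo: string, currInfo: string): number {
  const read = (info: string): number => {
    const n = Number(parseInfo(info).get("evicted_keys") ?? "0");
    return Number.isFinite(n) ? n : 0;
  };
  const prev = read(prevInfo);
  const curr = read(currInfo);
  // The counter restarts from zero when the server restarts.
  return curr >= prev ? curr - prev : curr;
}
```

Any non-zero delta on a node that holds only sessions means authenticated users are being logged out by memory pressure, which is usually worth an alert.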
FAQ
Can I use Redis logical databases instead of key prefixes for tenants?
No. Logical databases are unsupported in Redis Cluster, cannot be restricted by ACL key patterns, and complicate backup and migration tooling. Use the sess:{tenant_id}:{session_id} prefix scoped by a per-tenant ACL pattern instead.
What is the safest eviction policy for a multi-tenant session store?
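volatile-lru or volatile-ttl, because they only evict keys that carry an explicit TTL. This protects non-expiring keys such as tenant config and rate-limit counters while still shedding the least recently used or soonest-expiring sessions under pressure. Because every session has a TTL, live sessions can still be evicted, so treat the policy as a safety net and provision memory for peak load.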
How do I debug session routing without hurting production?
Use ACL LOG (which records denied commands and key accesses), CLIENT LIST (which shows the ACL user each connection is authenticated as), INFO COMMANDSTATS, SLOWLOG GET, and structured logs that record the resolved key prefix and latency. Avoid MONITOR in production: it streams every command from every client and imposes severe CPU and latency overhead.
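For the structured-log part of that answer, a small wrapper around each session operation is usually enough. The sketch below records the tenant, the resolved key prefix, the command name, and the latency, and deliberately leaves the session ID out of the log. timedOp and SessionOpLog are hypothetical names, and the log sink is whatever function the application already uses.

```typescript
export interface SessionOpLog {
  tenantId: string;
  keyPrefix: string; // prefix only; the session ID is a credential and is not logged
  command: string;
  latencyMs: number;
  ok: boolean;
}

export async function timedOp<T>(
  log: (entry: SessionOpLog) => void,
  tenantId: string,
  command: string,
  op: () => Promise<T>,
  now: () => number = Date.now, // injectable clock for tests
): Promise<T> {
  const start = now();
  let ok = false;
  try {
    const result = await op();
    ok = true;
    return result;
  } finally {
    // Runs on success and on failure, so errors show up with their latency too.
    log({ tenantId, keyPrefix: `sess:${tenantId}:`, command, latencyMs: now() - start, ok });
  }
}
```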