Enforcing Plan Limits with Tenant Quotas

A quota is only real if the code refuses the request that exceeds it before the work is done, not after a nightly job notices. This guide sits within Subscription & Plan Enforcement and shows how to hold hard and soft limits per tenant with atomic counters, distributed rate limiting, and a coherent overage-or-block policy that the UI and API can both read.

Problem Framing

A tenant quota maps an entitlement — "Pro tier gets 100,000 API calls per month and 25 seats" — onto a counter the hot path checks on every consuming request. The decision matters because the two naive implementations both fail. Reading a count, deciding, then incrementing in three steps opens a check-then-act race: two concurrent requests both read 99,999, both pass, and the tenant lands at 100,001. Counting after the fact — summing usage rows on a schedule — lets a tenant blow through a hard limit for minutes before enforcement catches up, which is a revenue leak on a paid metric and a noisy-neighbor risk on a capacity metric.
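
The race can be made concrete with a deterministic, in-memory sketch. It is not Redis: `InMemoryCounter`, `racyPair`, and `atomicPair` are hypothetical names introduced here for illustration, and the interleaving is forced by hand rather than produced by real concurrency.

```typescript
// Hypothetical in-memory stand-in for a counter store (illustration only).
class InMemoryCounter {
  private value: number;

  constructor(initial: number) {
    this.value = initial;
  }

  get(): number {
    return this.value;
  }

  incrBy(n: number): number {
    this.value += n;
    return this.value;
  }

  // Compare and increment in one step: a refusal leaves the counter untouched.
  reserve(cost: number, limit: number): boolean {
    if (this.value + cost > limit) return false;
    this.value += cost;
    return true;
  }
}

// Two requests that both read before either one increments.
function racyPair(counter: InMemoryCounter, limit: number): number {
  const seenByA = counter.get();
  const seenByB = counter.get();
  let allowed = 0;
  for (const seen of [seenByA, seenByB]) {
    if (seen + 1 <= limit) {
      counter.incrBy(1);
      allowed += 1;
    }
  }
  return allowed;
}

// The same two requests, each deciding and incrementing in one step.
function atomicPair(counter: InMemoryCounter, limit: number): number {
  let allowed = 0;
  for (let i = 0; i < 2; i += 1) {
    if (counter.reserve(1, limit)) allowed += 1;
  }
  return allowed;
}

const racy = new InMemoryCounter(99999);
console.log(racyPair(racy, 100000), racy.get()); // both pass; counter overshoots the limit

const atomic = new InMemoryCounter(99999);
console.log(atomicPair(atomic, 100000), atomic.get()); // one passes; counter stops at the limit
```

The only difference between the two functions is whether the decision and the increment can be separated by another request's read.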

Three things break in practice. First, non-atomic reservation. Anything that separates the read from the write under concurrency overcounts or undercounts. Second, conflating two limit shapes. A cumulative quota (monthly API calls, total storage) needs a counter with a billing-period reset; a rate limit (requests per second) needs a sliding or token-bucket window. Using one mechanism for both produces either a quota that never resets or a rate limiter that leaks across the month. Third, no policy split. Treating every limit as a hard block frustrates paying customers on metrics you would rather meter and bill as overage, while treating every limit as soft turns abuse vectors into unbounded cost.

The correct shape is reserve-before-use: atomically increment the counter and decide in the same operation, so the counter is the single source of truth and the decision can never disagree with it.

Step-by-Step Guide

1. Model the limit as data, not branches

Resolve each tenant's limits from an entitlements record keyed by plan, with per-tenant overrides for custom contracts. Each metric declares its shape (cumulative or rate), its limit, and its policy (block or overage). Keep this out of request code so a sales-negotiated override is a row change, not a deploy.

type LimitPolicy = "block" | "overage";
type LimitShape = "cumulative" | "rate";

interface Limit {
  metric: string;        // "api_calls", "seats", "storage_bytes"
  shape: LimitShape;
  limit: number;         // hard ceiling for block; soft threshold for overage
  policy: LimitPolicy;
  windowSeconds?: number; // required for rate limits
}

// Resolved per request: plan defaults merged with per-tenant overrides.
function resolveLimit(plan: Plan, overrides: Limit[], metric: string): Limit {
  return overrides.find((o) => o.metric === metric)
    ?? plan.limits.find((l) => l.metric === metric)
    ?? { metric, shape: "cumulative", limit: 0, policy: "block" };
}
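
As a usage sketch, the resolution order above can be exercised with a plan and one override. The `Plan` shape, the `proPlan` and `acmeOverrides` values, and the policy chosen for each metric are assumptions made for this example; the guide references `Plan` without defining it. The resolver is restated so the block runs on its own.

```typescript
type LimitPolicy = "block" | "overage";
type LimitShape = "cumulative" | "rate";

interface Limit {
  metric: string;
  shape: LimitShape;
  limit: number;
  policy: LimitPolicy;
  windowSeconds?: number;
}

// Assumed shape for illustration only.
interface Plan {
  name: string;
  limits: Limit[];
}

function resolveLimit(plan: Plan, overrides: Limit[], metric: string): Limit {
  // Unknown metrics fall through to a zero-limit block, i.e. deny by default.
  const denyByDefault: Limit = { metric, shape: "cumulative", limit: 0, policy: "block" };
  return overrides.find((o) => o.metric === metric)
    ?? plan.limits.find((l) => l.metric === metric)
    ?? denyByDefault;
}

const proPlan: Plan = {
  name: "pro",
  limits: [
    { metric: "api_calls", shape: "cumulative", limit: 100000, policy: "overage" },
    { metric: "seats", shape: "cumulative", limit: 25, policy: "block" },
  ],
};

// A negotiated override for one tenant: a data change, not a deploy.
const acmeOverrides: Limit[] = [
  { metric: "api_calls", shape: "cumulative", limit: 250000, policy: "overage" },
];

console.log(resolveLimit(proPlan, acmeOverrides, "api_calls").limit);     // override wins
console.log(resolveLimit(proPlan, acmeOverrides, "seats").limit);         // plan default
console.log(resolveLimit(proPlan, acmeOverrides, "storage_bytes").limit); // deny-by-default fallback
```

The fallback matters as much as the override: a metric nobody configured resolves to a zero-limit block rather than to unlimited use.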

2. Reserve atomically against a Redis counter

The counter key carries the tenant and the metric so it is independently addressable: q:{tenant_id}:{metric}:{period}. A Lua script increments and compares in one server-side step, eliminating the check-then-act race. For an overage metric it never refuses; it only reports whether the threshold was crossed so billing can meter the excess.

-- KEYS[1]=counter  ARGV[1]=limit  ARGV[2]=cost  ARGV[3]=ttl_s  ARGV[4]=policy
local limit  = tonumber(ARGV[1])
local cost   = tonumber(ARGV[2])
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
if ARGV[4] == 'block' and current + cost > limit then
  return {0, current}            -- refused: do NOT increment
end
local total = redis.call('INCRBY', KEYS[1], cost)
if redis.call('TTL', KEYS[1]) < 0 then
  redis.call('EXPIRE', KEYS[1], tonumber(ARGV[3]))
end
local over = (total > limit) and 1 or 0
return {1, total, over}          -- allowed; over=1 means metered overage

3. Wire reservation into the request middleware

Call the script before the work runs. A block verdict short-circuits with HTTP 429 and a Retry-After; an allowed-with-overage verdict proceeds but emits a usage event so the excess is billed. The same event stream that drives invoicing is described in Usage Metering Event Pipelines — enforcement and metering must agree, so emit one event from the path that already made the decision.

async function enforce(req, res, next) {
  const limit = resolveLimit(req.tenant.plan, req.tenant.overrides, "api_calls");
  const key = `q:${req.tenant.id}:api_calls:${currentPeriod()}`;
  const [ok, total, over] = await redis.eval(
    reserveScript, 1, key,
    String(limit.limit), "1", String(secondsUntilPeriodEnd()), limit.policy,
  );
  if (!ok) {
    res.set("Retry-After", String(secondsUntilPeriodEnd()));
    return res.status(429).json({ error: "quota_exceeded", metric: "api_calls", limit: limit.limit });
  }
  if (over) emitUsageEvent(req.tenant.id, "api_calls_overage", 1);
  next();
}
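
The middleware calls `currentPeriod()` and `secondsUntilPeriodEnd()` without showing them. One possible sketch follows, assuming billing periods are UTC calendar months; a tenant-anchored billing cycle would need the anchor date passed in instead. The optional `now` parameter is an addition for testability, and `periodEndEpoch()` is written to match the name used for the reset header later in this guide.

```typescript
// Assumption: one billing period per UTC calendar month, labelled "YYYY-MM".
function currentPeriod(now: Date = new Date()): string {
  const year = now.getUTCFullYear();
  const month = String(now.getUTCMonth() + 1).padStart(2, "0");
  return `${year}-${month}`;
}

// Unix seconds at which the current period ends (first instant of next month).
function periodEndEpoch(now: Date = new Date()): number {
  // Date.UTC normalises a month index of 12 to January of the following year.
  return Math.floor(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() + 1, 1) / 1000);
}

// Seconds left in the period, floored at 1 because EXPIRE with a
// non-positive timeout deletes the key instead of keeping it.
function secondsUntilPeriodEnd(now: Date = new Date()): number {
  const nowSeconds = Math.floor(now.getTime() / 1000);
  return Math.max(1, periodEndEpoch(now) - nowSeconds);
}

console.log(currentPeriod(), secondsUntilPeriodEnd(), periodEndEpoch());
```

Because the period label is part of the key, a new month starts on a fresh counter even if the old key's TTL has not fired yet.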

4. Use a token bucket for rate limits

Cumulative counters with a period reset are wrong for per-second protection. For rate limits, run a token-bucket script that refills by elapsed time. This bounds burst and steady-state independently and is what protects shared capacity from a single noisy tenant.

-- KEYS[1]=bucket  ARGV[1]=rate/s  ARGV[2]=burst  ARGV[3]=now_ms  ARGV[4]=cost
local rate, burst = tonumber(ARGV[1]), tonumber(ARGV[2])
local now, cost   = tonumber(ARGV[3]), tonumber(ARGV[4])
local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or burst
local last   = tonumber(state[2]) or now
tokens = math.min(burst, tokens + (now - last) / 1000 * rate)
if tokens < cost then return {0, tokens} end
tokens = tokens - cost
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)  -- HSET takes multiple fields since Redis 4.0; HMSET is deprecated
redis.call('PEXPIRE', KEYS[1], math.ceil(burst / rate * 1000))
return {1, tokens}
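
To see how burst and steady-state rate interact, the refill arithmetic can be mirrored in plain TypeScript. This is an in-memory sketch of the same math, not a replacement for the server-side script; `Bucket` and `takeToken` are hypothetical names, and the rate of 2 per second with a burst of 5 is an arbitrary example.

```typescript
interface Bucket {
  tokens: number;
  ts: number; // milliseconds, same unit as now_ms in the script
}

interface TakeResult {
  allowed: boolean;
  bucket: Bucket;
}

function takeToken(
  bucket: Bucket | undefined,
  rate: number,
  burst: number,
  nowMs: number,
  cost: number = 1,
): TakeResult {
  // A missing bucket starts full, as in the script.
  const prev: Bucket = bucket ?? { tokens: burst, ts: nowMs };
  const refilled = Math.min(burst, prev.tokens + ((nowMs - prev.ts) / 1000) * rate);
  // On refusal the stored state is left as it was, so refill keeps accruing from prev.ts.
  if (refilled < cost) return { allowed: false, bucket: prev };
  return { allowed: true, bucket: { tokens: refilled - cost, ts: nowMs } };
}

// Fire `count` requests at one instant and report how many pass.
function fire(start: Bucket | undefined, count: number, nowMs: number): { passed: number; bucket: Bucket | undefined } {
  let bucket = start;
  let passed = 0;
  for (let i = 0; i < count; i += 1) {
    const result = takeToken(bucket, 2, 5, nowMs);
    bucket = result.bucket;
    if (result.allowed) passed += 1;
  }
  return { passed, bucket };
}

const atZero = fire(undefined, 7, 0);
console.log(atZero.passed); // the burst of 5 is spent, 2 are refused

const oneSecondLater = fire(atZero.bucket, 3, 1000);
console.log(oneSecondLater.passed); // one second at 2 tokens/s refills 2
```

The burst parameter caps how much a quiet tenant can save up; the rate parameter caps what it can sustain.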

5. Reconcile capacity limits, not just request counters

Some limits guard a real resource — open database connections, seats, active workers — and the counter must track the resource, not the request. Release on completion so a crashed request does not strand a reservation forever; pair every reserve with a TTL fallback. Connection slots in particular are a shared, tier-allocated resource; how to size them per plan is covered in sizing connection pools per tenant tier.

import contextlib

import redis

r = redis.Redis()

@contextlib.contextmanager
def reserve_slot(tenant_id: str, metric: str, limit: int, ttl_s: int = 60):
    key = f"q:{tenant_id}:{metric}:live"
    count = r.incr(key)
    r.expire(key, ttl_s)            # self-heal if release never runs
    if count > limit:
        r.decr(key)
        raise RuntimeError(f"{metric} concurrency limit reached for {tenant_id}")
    try:
        yield
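    finally:
        # Release on success or failure. If the TTL expired mid-request, DECR
        # recreates the key at -1, so reset it to zero (best effort, not atomic).
        if r.decr(key) < 0:
            r.set(key, 0, ex=ttl_s)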

6. Surface limit state to the UI and API

A quota the client cannot see produces surprise 429s and support tickets. Return the conventional X-RateLimit-* headers on every response and expose a usage endpoint the dashboard polls, so the product can warn at 80% and the integrator can back off before being blocked.

res.set({
  "X-RateLimit-Limit": String(limit.limit),
  "X-RateLimit-Remaining": String(Math.max(0, limit.limit - total)),
  "X-RateLimit-Reset": String(periodEndEpoch()),
});

// GET /v1/usage  -> dashboard + integrator read the same numbers as the enforcer
app.get("/v1/usage", async (req, res) => {
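  const metrics = await Promise.all(req.tenant.plan.limits.map(async ({ metric }) => {
    // Resolve through the same path as the enforcer so per-tenant overrides are reflected.
    const l = resolveLimit(req.tenant.plan, req.tenant.overrides, metric);
    const used = Number(await redis.get(`q:${req.tenant.id}:${metric}:${currentPeriod()}`)) || 0;
    return { metric, used, limit: l.limit, policy: l.policy };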
  }));
  res.json({ tenant: req.tenant.id, period: currentPeriod(), metrics });
});
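
On the consuming side, the dashboard needs to turn `used`, `limit`, and `policy` into a state it can render. A minimal sketch follows, assuming the 80% warning threshold mentioned above; `UsageState`, `usageState`, and the `warnPercent` parameter are hypothetical names, not part of any API described in this guide.

```typescript
type UsageState = "ok" | "warning" | "blocked" | "overage";

function usageState(
  used: number,
  limit: number,
  policy: "block" | "overage",
  warnPercent: number = 80,
): UsageState {
  // A block metric at its ceiling will refuse the next unit of work.
  if (policy === "block" && used >= limit) return "blocked";
  // An overage metric past its threshold is being metered, not refused.
  if (policy === "overage" && used > limit) return "overage";
  // Integer comparison avoids floating-point rounding at the boundary.
  if (used * 100 >= limit * warnPercent) return "warning";
  return "ok";
}

console.log(usageState(79999, 100000, "block"));    // below the warning line
console.log(usageState(80000, 100000, "block"));    // at the warning line
console.log(usageState(100000, 100000, "block"));   // ceiling reached
console.log(usageState(100001, 100000, "overage")); // metered excess
```

Keeping this mapping in one function lets the banner in the UI and any warning email agree on where the thresholds sit.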

Verification

Prove the atomic path holds under concurrency before shipping. Fire more parallel requests than the limit and assert that exactly the limit is allowed — no overcount, no undercount.

test("concurrent reservations never exceed a hard limit", async () => {
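  const tenant = "acme", limit = 100;
  await redis.del(`q:${tenant}:api_calls:test`);     // start from a clean counter so reruns are deterministic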
  const attempts = Array.from({ length: 250 }, () =>
    redis.eval(reserveScript, 1, `q:${tenant}:api_calls:test`,
      String(limit), "1", "60", "block"));
  const results = await Promise.all(attempts);
  const allowed = results.filter(([ok]) => ok === 1).length;
  expect(allowed).toBe(limit);                       // exactly 100, never 101
  const stored = Number(await redis.get(`q:${tenant}:api_calls:test`));
  expect(stored).toBe(limit);                        // counter matches verdicts
});

A refused request should also be observable on the wire as a 429 carrying the limit metadata:

curl -i -H "Authorization: Bearer $TOKEN" https://api.example.com/v1/widgets
# HTTP/1.1 429 Too Many Requests
# Retry-After: 1411200
# X-RateLimit-Limit: 100000
# X-RateLimit-Remaining: 0
# {"error":"quota_exceeded","metric":"api_calls","limit":100000}
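
An integrator has to decide what to do with that Retry-After value. For a monthly quota it can be days away, so sleeping on it blindly is rarely right. The sketch below is one hedged way to handle it: `planRetry`, `RetryPlan`, and the `maxWaitMs` cutoff are hypothetical names and choices, and the parser accepts both forms the header may take under RFC 9110 (delay in seconds or an HTTP-date).

```typescript
type RetryPlan =
  | { kind: "retry"; delayMs: number }
  | { kind: "give_up"; retryAtMs: number | null };

function planRetry(retryAfter: string | null, nowMs: number, maxWaitMs: number): RetryPlan {
  if (retryAfter === null) return { kind: "give_up", retryAtMs: null };
  const value = retryAfter.trim();
  // Digits only means delay-seconds; anything else is tried as an HTTP-date.
  const retryAtMs = /^\d+$/.test(value) ? nowMs + Number(value) * 1000 : Date.parse(value);
  if (Number.isNaN(retryAtMs)) return { kind: "give_up", retryAtMs: null };
  const delayMs = Math.max(0, retryAtMs - nowMs);
  // Short waits are worth retrying; long ones should surface to the caller instead.
  return delayMs <= maxWaitMs
    ? { kind: "retry", delayMs }
    : { kind: "give_up", retryAtMs };
}

console.log(planRetry("2", Date.now(), 60000));       // short wait: retry
console.log(planRetry("1411200", Date.now(), 60000)); // quota resets in days: give up and report
```

A rate-limit refusal usually lands in the first branch and a cumulative-quota refusal in the second, which is a reasonable cue for the client to show an upgrade prompt rather than a spinner.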

Failure Modes & Gotchas

FAQ

Should I enforce quotas at the API gateway or in the application?

Both, at different granularities. The gateway should run coarse per-tenant rate limiting to shed abusive traffic cheaply, but cumulative billing quotas and capacity reservations belong in the application where the resolved entitlement, the cost of the operation, and the metered event all live in one place.

How do I choose between block and overage for a given metric?

Block metrics that protect shared capacity or represent abuse vectors (requests per second, concurrent connections, free-tier ceilings), because exceeding them harms other tenants or your costs. Use overage for billable consumption you would rather sell than refuse, where crossing the threshold meters extra revenue instead of returning an error.

Does the atomic counter become a bottleneck at high request rates?

A single Redis key is one hot shard. For very high-rate tenants, shard the counter into N sub-keys per tenant and sum them on read, or pre-allocate batches of reservations to each app instance and reconcile periodically, trading a small accuracy window for throughput.
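
The sharding option can be sketched with an in-memory stand-in, where each Map entry plays the role of one Redis key. `ShardedCounter`, the `:s{n}` key suffix, and the random shard choice are assumptions for illustration; this shows only the write-spreading and sum-on-read shape, not how a hard limit would be enforced across shards.

```typescript
// Hypothetical in-memory stand-in for a sharded Redis counter.
class ShardedCounter {
  private store = new Map<string, number>();
  private shards: number;

  constructor(shards: number) {
    this.shards = shards;
  }

  private key(base: string, shard: number): string {
    return `${base}:s${shard}`;
  }

  // Writes land on one sub-key; callers may pin the shard for deterministic tests.
  incrBy(base: string, cost: number, shard: number = Math.floor(Math.random() * this.shards)): number {
    const k = this.key(base, shard % this.shards);
    const next = (this.store.get(k) ?? 0) + cost;
    this.store.set(k, next);
    return next;
  }

  // Reads add up every sub-key.
  total(base: string): number {
    let sum = 0;
    for (let s = 0; s < this.shards; s += 1) {
      sum += this.store.get(this.key(base, s)) ?? 0;
    }
    return sum;
  }
}

const counter = new ShardedCounter(4);
for (let i = 0; i < 1000; i += 1) counter.incrBy("q:acme:api_calls:2024-01", 1);
console.log(counter.total("q:acme:api_calls:2024-01")); // every increment is counted, wherever it landed
```

Against a real store the sum is not taken atomically with any single increment, so a read can lag writes that are in flight on other shards. That lag is the accuracy window being traded for throughput.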