Benchmarking Shared vs Isolated Database Costs

Spreadsheet estimates of multi-tenant database cost are almost always wrong, because the dominant cost driver shifts as tenant count grows and only a controlled benchmark reveals where. This page gives you a reproducible method to measure the real total cost of ownership (TCO) of a shared database with row-level security against schema-per-tenant and database-per-tenant, and to locate the break-even tenant count for your workload. It sits inside the broader Cost vs Security Tradeoff Analysis for selecting an isolation model.

Problem Framing

Cost comparisons fail for three structural reasons. First, the cheapest model changes with scale: shared RLS wins at low tenant counts because one cluster amortizes fixed overhead, but its per-query policy evaluation and connection ceiling become the bottleneck as tenants multiply. Second, the headline number on a cloud bill omits the expensive parts — backup orchestration, snapshot storage, provisioning automation, and the engineering hours each model consumes per incident. Third, synthetic load with a uniform tenant distribution hides the two effects that actually drive shared-database cost: index bloat from cross-tenant scans and noisy-neighbor CPU contention.

The result is that teams pick an architecture from a back-of-envelope per-tenant figure, then discover the curve crosses far earlier or later than assumed. A benchmark that measures compute, storage, backup, and operational labor under production-shaped load is the only way to draw the curve. The shape below is what you are trying to reproduce and quantify with real numbers for your own query mix.

The benchmark exists to find the tenant count where the gold and white curves cross for your workload — that single point decides the architecture.

Step-by-Step Guide

1. Pin an identical baseline across both topologies

Provision the same instance class, storage type, and provisioned IOPS for the shared cluster and for the isolated cluster nodes, so the only variable is the isolation model. Record the fixed monthly cost of each topology before any load. Seed both with the same dataset shaped to production skew — an 80/20 Pareto distribution of rows per tenant, not a uniform split that would hide noisy-neighbor effects.

# Same class on both sides; capture the on-paper monthly rate up front.
aws rds create-db-instance \
  --db-instance-identifier bench-shared \
  --db-instance-class db.r6g.2xlarge \
  --engine postgres --allocated-storage 500 --iops 3000 --storage-type io1
aws pricing get-products --service-code AmazonRDS \
  --filters "Type=TERM_MATCH,Field=instanceType,Value=db.r6g.2xlarge" \
  --region us-east-1 --query 'PriceList[0]' --output text

2. Define one query mix and replay it identically

Lock a single workload definition — 70% point reads, 20% writes, 10% analytical scans — and drive it through pgbench against both topologies at matched concurrency. Replaying the same script removes workload variance from the cost delta.

-- bench_mix.sql, weighted by \set in the driver
\set tid random(1, 100000)
SELECT * FROM orders WHERE tenant_id = :tid AND created_at > now() - interval '30 days';
UPDATE orders SET status = 'shipped' WHERE id = :tid;
SELECT date_trunc('day', created_at), count(*) FROM orders
  WHERE tenant_id = :tid GROUP BY 1;

pgbench -h bench-shared -U app -f bench_mix.sql -c 200 -j 8 -T 600 -P 30 appdb

3. Measure shared-RLS compute overhead directly

The cost premium of the shared model is the extra CPU spent evaluating the row-security policy on every plan plus the buffer churn from cross-tenant scans. Capture both with pg_stat_statements, comparing row_security on versus off.

SELECT query, calls, total_exec_time,
       shared_blks_hit, shared_blks_read,
       round(total_exec_time / calls, 3) AS ms_per_call
FROM pg_stat_statements
WHERE query LIKE '%orders%'
ORDER BY total_exec_time DESC
LIMIT 5;

4. Measure isolated multiplication and connection cost

Isolation trades policy compute for infrastructure sprawl and connection fragmentation: every tenant pool reserves connections whether idle or not. Database-per-tenant also duplicates system catalogs and parallelizes backups. The connection math is the part teams underestimate, and it is covered in depth in database-per-tenant vs schema-per-tenant connection pool cost.

; pgbouncer.ini — front isolated tenants so reserved conns don't OOM the cluster
[pgbouncer]
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 20
reserve_pool_size = 5
idle_transaction_timeout = 10

5. Price the backup and operational tail

Pull actual snapshot storage and per-snapshot job counts; isolated topologies run one backup job per database, which inflates both storage and the labor of restore drills. Convert recurring engineering toil — migrations, exports, restore rehearsals — into dollars with an internal labor rate.

# Backup + labor differ per model; fold them into the curve, not a footnote.
def monthly_tco(tenants, fixed_infra, per_query_cpu_usd, backup_usd_per_db,
                ops_hours_per_100, labor_rate=180.0, isolated=False):
    compute = per_query_cpu_usd * tenants
    backup = backup_usd_per_db * (tenants if isolated else 1)
    labor = (tenants / 100) * ops_hours_per_100 * labor_rate
    return round(fixed_infra + compute + backup + labor, 2)

6. Sweep tenant counts and find the crossover

Run steps 2–5 at several tenant counts (for example 250, 1000, 4000) and store one row per model per point. The break-even is the count where shared TCO first exceeds isolated TCO.

points = [250, 1000, 4000]
for n in points:
    shared = monthly_tco(n, 1200, 0.40, 90, 1.5, isolated=False)
    iso    = monthly_tco(n, 1800, 0.05, 90, 0.4, isolated=True)
    print(n, shared, iso, "ISOLATED WINS" if iso < shared else "SHARED WINS")

Below is the kind of table the sweep produces. Numbers are illustrative — yours come from your own runs.

Tenant count	Shared TCO/mo	Isolated TCO/mo	Dominant driver
0–500	$1,300	$2,000	RLS policy compute
500–2,000	$4,800	$6,400	connection pool limits
2,000+	$13,000	$11,200	backup/restore toil

Verification

Confirm the benchmark itself is sound before trusting the curve. The shared run must show RLS actually engaged and an index seek, not a sequential scan; a missing tenant_id-leading index silently turns the comparison into noise.

SET row_security = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE tenant_id = 't_123'
  AND created_at > now() - interval '30 days';
-- PASS criteria:
--   plan node contains "Index Scan using idx_orders_tenant_created"
--   "Filter: ... (current_setting('app.tenant_id'))" appears (policy active)
--   shared_blks_read close to 0 on a warm cache (no thrash)

If the plan shows Seq Scan or shared_blks_read dominates shared_blks_hit, the shared numbers are inflated by a benchmarking defect, not by RLS, and the result is invalid.

Failure Modes & Gotchas

Uniform synthetic data. Symptom: shared TCO looks artificially flat and cheap at high tenant counts. Root cause: even row distribution suppresses index bloat and noisy-neighbor contention. Fix: replay anonymized production logs or seed an 80/20 Pareto distribution.
Counting only the compute line. Symptom: isolated looks permanently more expensive. Root cause: backup storage, snapshot jobs, and per-incident labor are excluded. Fix: add backup_usd_per_db and an ops-labor term to both sides of the model.
Session pooling on the isolated side. Symptom: connection-refused storms and OOM as tenant pools multiply. Root cause: session mode pins a server connection per client. Fix: set pool_mode = transaction and cap default_pool_size per tenant tier.
Cold cache skew between runs. Symptom: the first topology measured looks slower and costlier. Root cause: buffer cache not warmed equally. Fix: run a discard-results warm-up pass, then measure the steady-state window only.

FAQ

At what tenant count does an isolated database usually become cheaper than shared RLS? For typical OLTP SaaS workloads the curves cross somewhere between 1,500 and 3,000 tenants, where shared connection-pool saturation and backup toil overtake the linear isolated base — but the only reliable answer comes from sweeping your own query mix as in step 6.

Can I benchmark this without provisioning real isolated infrastructure? You can model the isolated side from one representative node and multiply, but connection fragmentation and parallel-backup behavior do not scale linearly, so validate at one mid-range tenant count with real multi-node provisioning before trusting the extrapolation.

Why measure backup cost separately instead of folding it into storage? Backup cost grows with the number of databases, not total bytes, so database-per-tenant pays a per-tenant snapshot-job and restore-drill penalty that a single storage figure hides entirely.