Benchmarks
Per-environment numbers backing the Overview. All measurements use Python 3.11 with the GIL (the production default). For the free-threaded Python 3.14t comparison, see Free-Threaded Python.
Where the gap lives
Latency advantage of aerospike-py vs official (Python 3.11 + GIL; mean for A and B, p95 for C)
A) Pure DB client ████████████████████ −80% (108 → 22 ms) 🔥
B) uvicorn ASGI █████ −21% (290 → 228 ms)
C) uvicorn + DLRM ███████████ −42% p95 (324→189 ms) 🔥
p95 advantage holds even when mean compresses (C): the GIL serializes
official-client result conversion at the tail.
- A — Pure DB client
- B — uvicorn ASGI only
- C — uvicorn + DLRM (real serving)
Environment A — Pure DB Client
What this isolates. No FastAPI, no model inference, no HTTP loadgen. A Python loop drives batch_read directly — the surrounding stack is as thin as possible, so aerospike-py's advantage is largest here.
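A minimal sketch of what such a driver loop could look like, not the benchmark's actual harness. `fake_batch_read` is a hypothetical stand-in that sleeps instead of calling a real client, and reading "concurrency / iterations" as "workers × calls per worker" is an assumption.

```python
import asyncio
import statistics
import time


async def fake_batch_read(keys):
    """Hypothetical stand-in for a client's batch read; sleeps instead of doing I/O."""
    await asyncio.sleep(0.001)
    return [None] * len(keys)


async def worker(keys, iterations, latencies_ms):
    # Each worker issues its calls back to back and records wall time per call.
    for _ in range(iterations):
        start = time.perf_counter()
        await fake_batch_read(keys)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)


async def run_benchmark(batch_size=50, concurrency=10, iterations=30):
    keys = [f"key_{i}" for i in range(batch_size)]
    latencies_ms = []
    wall_start = time.perf_counter()
    await asyncio.gather(
        *(worker(keys, iterations, latencies_ms) for _ in range(concurrency))
    )
    wall_s = time.perf_counter() - wall_start
    # statistics.quantiles with n=100 returns 99 cut points; index 98 is p99.
    p99 = statistics.quantiles(latencies_ms, n=100)[98]
    return {
        "calls": len(latencies_ms),
        "mean_ms": statistics.fmean(latencies_ms),
        "p99_ms": p99,
        "tps": len(latencies_ms) / wall_s,
    }


if __name__ == "__main__":
    print(asyncio.run(run_benchmark()))
```

Swapping `fake_batch_read` for a real client call is the only change needed to measure an actual client; the timing and percentile logic stay the same.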
Setup
| Item | Value |
|---|---|
| Source | benchmark/results/20260416_134243/report.md |
| Clients | official (sync C), official-async (sync wrapped via executor), py-async (aerospike-py) |
| Sets / batch sizes | 9 / 50, 200 |
| Concurrency / iterations | 10 / 30 |
Aggregate result (avg over 9 sets × 2 batch sizes)
| Client | avg mean (ms) | avg p99 (ms) | avg TPS |
|---|---|---|---|
| official | 107.56 | 195.34 | 138.2 |
| official-async | 110.64 | 211.33 | 125.7 |
| py-async (aerospike-py) | 22.45 | 120.67 | 373.7 |
| py-async vs official | 4.8× faster latency | 1.6× | 2.7× higher TPS |
| py-async vs official-async | 4.9× faster | 1.8× | 3.0× |
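The two comparison rows follow directly from the three rows above them. The sketch below reproduces that arithmetic; the dictionary and function names are illustrative, and the figures are copied from the table.

```python
# Aggregate figures copied from the table above: (mean ms, p99 ms, TPS).
AGGREGATE = {
    "official": (107.56, 195.34, 138.2),
    "official-async": (110.64, 211.33, 125.7),
    "py-async": (22.45, 120.67, 373.7),
}


def compare(baseline, candidate="py-async"):
    """Return (mean speedup, p99 speedup, TPS gain) of candidate over baseline."""
    b_mean, b_p99, b_tps = AGGREGATE[baseline]
    c_mean, c_p99, c_tps = AGGREGATE[candidate]
    # Latency: baseline / candidate (lower is better).
    # Throughput: candidate / baseline (higher is better).
    return (
        round(b_mean / c_mean, 1),
        round(b_p99 / c_p99, 1),
        round(c_tps / b_tps, 1),
    )


for name in ("official", "official-async"):
    print(name, compare(name))
```

Latency ratios divide the baseline by aerospike-py, while the TPS ratio is inverted, so every figure above 1.0 favors aerospike-py.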
Per-set speedup distribution (aerospike-py vs official)
Batch 50: mean speedup across 8 sets (set_8 excluded) 4.4× ──────── 7.8× (median ~6.0×)
Batch 200: mean speedup across 8 sets (set_8 excluded) 3.1× ──────── 6.6× (median ~4.5×)
Outlier: set_8 (0% found rate) → fast not-found path, not real read latency.
set_8 is the outlier (0% found rate): the official client returns not-found errors faster than it completes successful reads, so these rows reflect a fast not-found path rather than real read latency. All other sets show a 3.1–7.8× mean speedup.
official-async (the sync C client wrapped with loop.run_in_executor) is slightly slower than the bare sync client across every set — each request pays a thread-pool hop. aerospike-py's gap over official-async is therefore consistently larger than its gap over official.
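For readers unfamiliar with the pattern, here is a minimal sketch of wrapping a blocking call with `loop.run_in_executor`. `blocking_batch_read` is a hypothetical stand-in that sleeps; it is not the official client's API, and the benchmark's own wrapper may differ.

```python
import asyncio
import time


def blocking_batch_read(keys):
    """Hypothetical stand-in for a synchronous client call."""
    time.sleep(0.01)  # simulates a blocking network round trip
    return {key: None for key in keys}


async def batch_read_via_executor(keys):
    # Hand the blocking call to the default thread pool so the event loop
    # stays responsive; the result comes back through a future.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_batch_read, keys)


async def main():
    keys = [f"key_{i}" for i in range(50)]
    # Ten concurrent requests, each paying one thread-pool hop.
    results = await asyncio.gather(
        *(batch_read_via_executor(keys) for _ in range(10))
    )
    return len(results)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Every request crosses from the event loop to a worker thread and back, which is the per-request hop the paragraph above refers to.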
Full per-set table (18 rows)
| set | batch | official mean (ms) | official p99 (ms) | aerospike-py mean (ms) | aerospike-py p99 (ms) | mean speedup |
|---|---|---|---|---|---|---|
| set_1 | 50 | 110.23 | 200.14 | 19.74 | 105.46 | 5.6× |
| set_1 | 200 | 127.06 | 206.19 | 30.53 | 194.05 | 4.2× |
| set_2 | 50 | 121.76 | 210.24 | 15.57 | 34.68 | 7.8× |
| set_2 | 200 | 110.86 | 194.82 | 24.98 | 124.86 | 4.4× |
| set_3 | 50 | 108.00 | 184.32 | 18.53 | 103.22 | 5.8× |
| set_3 | 200 | 128.85 | 220.15 | 23.35 | 110.90 | 5.5× |
| set_4 | 50 | 109.15 | 206.11 | 18.18 | 116.92 | 6.0× |
| set_4 | 200 | 118.69 | 195.77 | 23.12 | 109.32 | 5.1× |
| set_5 | 50 | 113.21 | 195.27 | 25.57 | 301.28 | 4.4× |
| set_5 | 200 | 122.26 | 197.89 | 26.85 | 148.11 | 4.6× |
| set_6 | 50 | 115.30 | 210.74 | 18.87 | 103.98 | 6.1× |
| set_6 | 200 | 123.93 | 261.41 | 30.47 | 122.36 | 4.1× |
| set_7 | 50 | 115.34 | 190.68 | 17.83 | 44.86 | 6.5× |
| set_7 | 200 | 126.81 | 215.87 | 41.17 | 133.61 | 3.1× |
| set_8 | 50 | 13.92 | 96.27 | 15.20 | 97.95 | 0.9× ⚠ |
| set_8 | 200 | 19.95 | 102.12 | 16.27 | 116.03 | 1.2× ⚠ |
| set_9 | 50 | 111.80 | 217.14 | 16.92 | 95.59 | 6.6× |
| set_9 | 200 | 139.05 | 210.95 | 20.95 | 108.81 | 6.6× |
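The mean speedup column is the official mean divided by the aerospike-py mean. The sketch below recomputes it from the two mean columns and reports the range per batch size with set_8 left out; the names are illustrative.

```python
# (set, batch size, official mean ms, aerospike-py mean ms) from the table above.
ROWS = [
    ("set_1", 50, 110.23, 19.74), ("set_1", 200, 127.06, 30.53),
    ("set_2", 50, 121.76, 15.57), ("set_2", 200, 110.86, 24.98),
    ("set_3", 50, 108.00, 18.53), ("set_3", 200, 128.85, 23.35),
    ("set_4", 50, 109.15, 18.18), ("set_4", 200, 118.69, 23.12),
    ("set_5", 50, 113.21, 25.57), ("set_5", 200, 122.26, 26.85),
    ("set_6", 50, 115.30, 18.87), ("set_6", 200, 123.93, 30.47),
    ("set_7", 50, 115.34, 17.83), ("set_7", 200, 126.81, 41.17),
    ("set_8", 50, 13.92, 15.20), ("set_8", 200, 19.95, 16.27),
    ("set_9", 50, 111.80, 16.92), ("set_9", 200, 139.05, 20.95),
]


def speedups(batch, exclude=("set_8",)):
    """Mean speedup per set for one batch size, skipping excluded sets."""
    return [
        round(official / aerospike_py, 1)
        for name, size, official, aerospike_py in ROWS
        if size == batch and name not in exclude
    ]


for batch in (50, 200):
    values = speedups(batch)
    print(f"batch {batch}: {min(values)}x to {max(values)}x")
```

Passing `exclude=()` brings set_8 back in, which makes the effect of the fast not-found path on the range easy to see.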
Environment B — uvicorn ASGI Only
What this isolates. Adds FastAPI and uvicorn around batch_read, with no model inference: the "REST API in front of a key-value lookup" shape.
Setup
| Item | Value |
|---|---|
| Source | benchmark/results/asgi_20260416_134730/asgi-report.md |
| Concurrency / iterations | 5 / 50 |
Pipeline breakdown (latency in ms)
| Client | total mean | p50 | p90 | p95 | p99 | aerospike step | inference | TPS |
|---|---|---|---|---|---|---|---|---|
| official | 289.56 | 289.36 | 429.74 | 461.15 | 473.60 | 280.00 | 1.55 | 16.6 |
| aerospike-py | 228.49 | 189.56 | 457.42 | 468.03 | 497.09 | 221.12 | 1.46 | 19.4 |
- Total mean: −21% (290 → 228 ms)
- Aerospike step: −21% (280 → 221 ms) — most wall time is the DB call, so the client gain transfers to E2E
- TPS: +17% (16.6 → 19.4 req/s)
- p99 is roughly equivalent (474 ms official vs 497 ms aerospike-py): at concurrency=5 there is not yet enough GIL contention for the Tokio model to widen the gap at the tail
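The percentages in this list are plain relative changes against the official client. A short sketch of the arithmetic, using the figures from the pipeline table (the helper name is illustrative):

```python
def pct_change(before, after):
    """Signed percentage change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


print(pct_change(289.56, 228.49))  # total mean latency
print(pct_change(280.00, 221.12))  # aerospike step
print(pct_change(16.6, 19.4))      # TPS
```

Negative values are latency reductions and the positive value is a throughput gain, so all three point in aerospike-py's favor.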
Higher concurrency exposes backpressure
A second run at concurrency=10, 200 iter (benchmark/results/asgi_20260416_135234/asgi-report.md):
| Client | total mean (ms) | p95 (ms) | TPS | errors |
|---|---|---|---|---|
| official | 465.81 | 885.64 | 20.9 | 0 |
| aerospike-py | 580.70 | 986.93 | 13.9 | 56 |
This run saturated the test harness (56 errors from request rejections / timeouts under load that the official client absorbed differently). It's not apples-to-apples — it's evidence that without proper backpressure tuning, native async clients can issue more in-flight work than the server pool can absorb. See Bottleneck Analysis for the gather→single recipe that resolves this.
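One generic way to cap in-flight work in an async client is a semaphore around each call. The sketch below illustrates that idea only; it is not the gather→single recipe from Bottleneck Analysis. `fake_batch_read`, `BoundedReader`, and the limit of 8 are hypothetical.

```python
import asyncio


async def fake_batch_read(keys):
    """Hypothetical stand-in for an async client's batch read."""
    await asyncio.sleep(0.005)
    return len(keys)


class BoundedReader:
    """Caps the number of reads in flight with a semaphore."""

    def __init__(self, max_in_flight):
        self._sem = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0
        self.peak = 0

    async def read(self, keys):
        async with self._sem:  # waits here once the cap is reached
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await fake_batch_read(keys)
            finally:
                self.in_flight -= 1


async def main(requests=40, max_in_flight=8):
    reader = BoundedReader(max_in_flight)
    keys = [f"key_{i}" for i in range(200)]
    results = await asyncio.gather(*(reader.read(keys) for _ in range(requests)))
    return len(results), reader.peak


if __name__ == "__main__":
    print(asyncio.run(main()))
```

All requests still complete, but `peak` never exceeds the cap, so excess load queues in the application instead of reaching the server as rejected or timed-out requests.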
Environment C — uvicorn + DLRM (Real Serving)
What this isolates. Full production-style pipeline:
HTTP request → key extraction → batch_read(9 sets × 200 keys)
→ feature build → DLRM inference (PyTorch CPU) → response
Closest measurement to a real recsys serving pod.
Setup
| Item | Value |
|---|---|
| Source | benchmark/results/k6-runtime-client-comparison.md (config "3.11 + GIL, stage OFF") |
| Image | aerospike-benchmark:latest (Python 3.11.14) |
| Loadgen | k6 10 VUs × 60s per scenario, 4 scenarios |
| Pods | 2 replicas, 4 CPU / 4 GiB request, 8 CPU / 8 GiB limit |
Both endpoints (/predict/official/sample and /predict/py-async/sample) live in the same pod, called alternately by k6 — server state and network are identical.
k6 client-side latency
single mode (10 VUs × 60s)
p95 aerospike-py ███████████████ 189 ms
official █████████████████████████ 324 ms −42% 🔥
p90 aerospike-py █████████████ 173 ms
official ██████████████████████ 293 ms −41% 🔥
avg aerospike-py █████████ 118 ms
official ████████████ 146 ms −19%
| Mode | aerospike-py p95 (ms) | official p95 (ms) | aerospike-py advantage |
|---|---|---|---|
| single | 189 | 324 | −42% 🔥 |
| gather (9× fan-out) | 234 | 266 | −12% |
| merge_gather (aero-py only) | 202 | — | — |
| stress (0→20→50 VUs ramp) | 592 | — | — |
The gather gap shrinks because both clients hit GIL serialization on the result conversion step — see Bottleneck Analysis.
Server-side Prometheus (same run, single mode)
| Metric | aerospike-py | official |
|---|---|---|
| predict_duration_seconds p95 (FastAPI E2E) | 202 ms | 274 ms |
| aerospike_batch_read_all_duration_seconds p95 | 137 ms | 252 ms |
Server- and client-side numbers move in the same direction — the ~13 ms gap between them is network RTT + gateway + connection setup.
Pattern across environments
layers added → ratio compresses, tail still wins
───────────── ───────────────── ───────────────
A) Pure DB (no HTTP/ML) mean 4.8× p99 1.6×
B) uvicorn ASGI mean 1.27× tail ≈ noise at concurrency=5
C) uvicorn + DLRM mean 1.24× p95 −42% 🔥
As layers are added around the DB call, the mean ratio compresses (4.8× → 1.24×) but the upper-percentile advantage holds in A and C (B is within noise at concurrency=5): aerospike-py keeps the GIL released during I/O, while the official client serializes GIL acquisition through run_in_executor and spikes at the tail.
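A toy model shows why the mean ratio compresses. Assume, purely for illustration, that every added layer costs both clients the same fixed time per request. The overhead values below are hypothetical and the model does not reproduce the measured B or C numbers; it only shows the direction of the effect.

```python
# Environment A aggregate means (ms), taken from the table earlier on this page.
OFFICIAL_DB_MS = 107.56
AEROSPIKE_PY_DB_MS = 22.45


def mean_ratio(shared_overhead_ms):
    """Speedup ratio when both clients pay the same extra per-request cost."""
    return (OFFICIAL_DB_MS + shared_overhead_ms) / (
        AEROSPIKE_PY_DB_MS + shared_overhead_ms
    )


# Hypothetical overheads, chosen only to show the trend.
for overhead in (0, 50, 100, 200):
    print(overhead, round(mean_ratio(overhead), 2))
```

The absolute saving stays constant in this model while the ratio drifts toward 1.0, which is why percentile comparisons in milliseconds are more informative than ratios once HTTP and inference layers are added.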
What's next:
- Free-Threaded Python: what happens when the GIL is removed entirely
- Bottleneck Analysis: the gather→single recipe (another −33%)