Version: 0.10.6

Benchmarks

This page contains the full per-environment numbers backing the Overview. All measurements use Python 3.11 with the GIL (the production default). For the free-threaded Python 3.14t comparison, see Free-Threaded Python.

Environment A — Pure DB Client

What this isolates. No FastAPI, no model inference, no HTTP loadgen. A Python loop drives batch_read directly so the surrounding stack is as thin as possible — this is where aerospike-py's advantage is largest.

Setup

| Item | Value |
| --- | --- |
| Source | benchmark/results/20260416_134243/report.md |
| Clients | official (sync C), official-async (sync wrapped via executor), py-async (aerospike-py) |
| Sets | 9 |
| Batch sizes | 50, 200 |
| Concurrency | 10 |
| Iterations | 30 |
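As a rough illustration of the shape of such a run, the sketch below drives a batch read from a plain asyncio loop and reports mean, p99, and TPS. It is not the project's benchmark script: `fake_batch_read` is a hypothetical stand-in for a client call, and treating "Iterations" as a per-worker count is an assumption.

```python
import asyncio
import statistics
import time


async def fake_batch_read(keys):
    """Hypothetical stand-in for a client's batch read (not the real API)."""
    await asyncio.sleep(0.001)  # simulate a short network round trip
    return [{"key": k} for k in keys]


async def run_benchmark(batch_size=50, concurrency=10, iterations=30):
    keys = list(range(batch_size))
    latencies_ms = []

    async def worker():
        # Assumption: each of the `concurrency` workers runs `iterations` reads.
        for _ in range(iterations):
            start = time.perf_counter()
            await fake_batch_read(keys)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)

    wall_start = time.perf_counter()
    await asyncio.gather(*(worker() for _ in range(concurrency)))
    wall_s = time.perf_counter() - wall_start

    latencies_ms.sort()
    # Nearest-rank style p99 on the sorted samples.
    p99_index = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    return {
        "requests": len(latencies_ms),
        "mean_ms": statistics.fmean(latencies_ms),
        "p99_ms": latencies_ms[p99_index],
        "tps": len(latencies_ms) / wall_s,
    }


result = asyncio.run(run_benchmark())
print(result)
```

Swapping `fake_batch_read` for a real client call would be the only change needed to compare clients under the same loop, which is the point of keeping the surrounding stack this thin.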

Aggregate result

| Client | avg mean (ms) | avg p99 (ms) | avg TPS |
| --- | --- | --- | --- |
| official | 107.56 | 195.34 | 138.2 |
| official-async | 110.64 | 211.33 | 125.7 |
| py-async (aerospike-py) | 22.45 | 120.67 | 373.7 |
| py-async vs official | 4.8× faster latency | 1.6× | 2.7× higher TPS |
| py-async vs official-async | 4.9× faster | 1.8× | 3.0× |
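The ratio rows can be reproduced from the three measured rows. The sketch below assumes latency speedup is baseline divided by py-async and TPS speedup is py-async divided by baseline; the dictionary layout and function name are illustrative only.

```python
# Aggregate figures copied from the measured rows of the aggregate result.
results = {
    "official": {"mean_ms": 107.56, "p99_ms": 195.34, "tps": 138.2},
    "official-async": {"mean_ms": 110.64, "p99_ms": 211.33, "tps": 125.7},
    "py-async": {"mean_ms": 22.45, "p99_ms": 120.67, "tps": 373.7},
}


def speedups(baseline, candidate):
    """Latency ratios are baseline/candidate; the TPS ratio is candidate/baseline."""
    return {
        "mean": baseline["mean_ms"] / candidate["mean_ms"],
        "p99": baseline["p99_ms"] / candidate["p99_ms"],
        "tps": candidate["tps"] / baseline["tps"],
    }


summary = {
    name: speedups(results[name], results["py-async"])
    for name in ("official", "official-async")
}

for name, s in summary.items():
    print(f"py-async vs {name}: mean {s['mean']:.1f}x, p99 {s['p99']:.1f}x, TPS {s['tps']:.1f}x")
# py-async vs official: mean 4.8x, p99 1.6x, TPS 2.7x
# py-async vs official-async: mean 4.9x, p99 1.8x, TPS 3.0x
```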

Per-set comparison (official vs aerospike-py)

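| set | batch | official mean (ms) | official p99 (ms) | aerospike-py mean (ms) | aerospike-py p99 (ms) | mean speedup | TPS speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| set_1 | 50 | 110.23 | 200.14 | 19.74 | 105.46 | 5.6× | 4.3× |
| set_1 | 200 | 127.06 | 206.19 | 30.53 | 194.05 | 4.2× | 3.4× |
| set_2 | 50 | 121.76 | 210.24 | 15.57 | 34.68 | 7.8× | 5.8× |
| set_2 | 200 | 110.86 | 194.82 | 24.98 | 124.86 | 4.4× | 3.2× |
| set_3 | 50 | 108.00 | 184.32 | 18.53 | 103.22 | 5.8× | 5.3× |
| set_3 | 200 | 128.85 | 220.15 | 23.35 | 110.90 | 5.5× | 4.0× |
| set_4 | 50 | 109.15 | 206.11 | 18.18 | 116.92 | 6.0× | 5.3× |
| set_4 | 200 | 118.69 | 195.77 | 23.12 | 109.32 | 5.1× | 3.5× |
| set_5 | 50 | 113.21 | 195.27 | 25.57 | 301.28 | 4.4× | 3.0× |
| set_5 | 200 | 122.26 | 197.89 | 26.85 | 148.11 | 4.6× | 3.8× |
| set_6 | 50 | 115.30 | 210.74 | 18.87 | 103.98 | 6.1× | 5.2× |
| set_6 | 200 | 123.93 | 261.41 | 30.47 | 122.36 | 4.1× | 3.6× |
| set_7 | 50 | 115.34 | 190.68 | 17.83 | 44.86 | 6.5× | 5.0× |
| set_7 | 200 | 126.81 | 215.87 | 41.17 | 133.61 | 3.1× | 2.1× |
| set_8 | 50 | 13.92 | 96.27 | 15.20 | 97.95 | 0.9× | 0.9× |
| set_8 | 200 | 19.95 | 102.12 | 16.27 | 116.03 | 1.2× | 1.0× |
| set_9 | 50 | 111.80 | 217.14 | 16.92 | 95.59 | 6.6× | 5.6× |
| set_9 | 200 | 139.05 | 210.95 | 20.95 | 108.81 | 6.6× | 4.8× |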
set_8 is empty

set_8 has a 0% found rate. The official client returns not-found errors faster than it completes successful reads, so the set_8 rows reflect "fast not-found" rather than real read latency. Every other row shows a mean speedup between 3.1× and 7.8×.

official-async (the sync C client wrapped with loop.run_in_executor) is slightly slower than the bare sync client across every set — each request pays a thread-pool hop. aerospike-py's gap over official-async is therefore consistently larger than its gap over official (mean speedup 6.0–7.1× at batch=50, vs 4.4–6.6× over official).
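The wrapping pattern described here can be sketched as follows. `blocking_batch_read` is a hypothetical stand-in for a synchronous client call, and the pool size of 10 is an assumption chosen to match the concurrency in this environment; only `loop.run_in_executor` itself is the standard asyncio API.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_batch_read(keys):
    """Hypothetical stand-in for a synchronous client call (not the real API)."""
    time.sleep(0.005)  # simulate blocking I/O inside a sync client
    return {k: {"found": True} for k in keys}


async def batch_read_via_executor(executor, keys):
    loop = asyncio.get_running_loop()
    # The call is handed to a worker thread, so the event loop stays free,
    # but every request pays for the submit and wake-up hop.
    return await loop.run_in_executor(executor, blocking_batch_read, keys)


async def main():
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Ten concurrent batches of 50 keys each.
        return await asyncio.gather(
            *(batch_read_via_executor(executor, range(i, i + 50))
              for i in range(0, 500, 50))
        )


batches = asyncio.run(main())
print(len(batches), len(batches[0]))
```

A native async client avoids this hop because the await goes straight to the I/O layer instead of through a thread pool.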

Environment B — uvicorn ASGI Only

What this isolates. Add FastAPI and uvicorn around batch_read but no model inference. This is the "REST API in front of a key-value lookup" shape.

Setup

| Item | Value |
| --- | --- |
| Source | benchmark/results/asgi_20260416_134730/asgi-report.md |
| Concurrency | 5 |
| Iterations | 50 |

Pipeline breakdown (latency in ms)

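| Client | total mean | p50 | p90 | p95 | p99 | aerospike step | inference | http step | TPS | errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| official | 289.56 | 289.36 | 429.74 | 461.15 | 473.60 | 280.00 | 1.55 | 300.69 | 16.6 | 0 |
| aerospike-py | 228.49 | 189.56 | 457.42 | 468.03 | 497.09 | 221.12 | 1.46 | 258.10 | 19.4 | 1 |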
  • Total mean: −21% (290 → 228 ms)
  • Aerospike step: −21% (280 → 221 ms) — most of the wall time is the DB call, so the client gain transfers to E2E
  • TPS: +17% (16.6 → 19.4 req/s)
  • p99 is roughly equivalent — at concurrency=5, the Tokio model isn't yet stressing the GIL hard enough to widen the tail
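The percentages in the list above are plain relative changes against the official client. A small sketch, using the values from the pipeline breakdown (the helper name is illustrative):

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (official, aerospike-py) pairs taken from the pipeline breakdown.
deltas = {
    "total mean": pct_change(289.56, 228.49),
    "aerospike step": pct_change(280.00, 221.12),
    "TPS": pct_change(16.6, 19.4),
    "p99": pct_change(473.60, 497.09),
}

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.0f}%")
# total mean: -21%
# aerospike step: -21%
# TPS: +17%
# p99: +5%
```

The p99 line shows what "roughly equivalent" means in this run: aerospike-py is about 5% higher at p99, while the mean is about 21% lower.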

Higher-concurrency variant (concurrency=10, 200 iter)

A second run at higher concurrency surfaced a different picture (benchmark/results/asgi_20260416_135234/asgi-report.md):

| Client | total mean | p95 | TPS | errors |
| --- | --- | --- | --- | --- |
| official | 465.81 | 885.64 | 20.9 | 0 |
| aerospike-py | 580.70 | 986.93 | 13.9 | 56 |
Backpressure under saturation

This run saturated the test harness: the aerospike-py side recorded 56 errors from request rejections and timeouts under load, which the official client absorbed differently. It is not a clean apples-to-apples comparison. It is evidence that, without proper backpressure tuning, a native async client can issue more in-flight work than the server pool can absorb. See Bottleneck Analysis for the gather → single recipe that resolves this.
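One generic way to bound in-flight work is a semaphore around the client call. This is not the recipe from Bottleneck Analysis, only a minimal sketch of the backpressure idea; `fake_batch_read`, `BoundedReader`, and the limit of 8 are all hypothetical.

```python
import asyncio


async def fake_batch_read(keys):
    """Hypothetical stand-in for an async client call (not the real API)."""
    await asyncio.sleep(0.001)
    return len(keys)


class BoundedReader:
    """Caps the number of concurrent batch reads with an asyncio.Semaphore."""

    def __init__(self, max_in_flight):
        self._sem = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def batch_read(self, keys):
        # Callers beyond the limit wait here instead of piling onto the server.
        async with self._sem:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                return await fake_batch_read(keys)
            finally:
                self.in_flight -= 1


async def main():
    reader = BoundedReader(max_in_flight=8)  # assumed limit, tune per deployment
    results = await asyncio.gather(
        *(reader.batch_read(list(range(200))) for _ in range(100))
    )
    return reader.peak_in_flight, len(results)


peak, completed = asyncio.run(main())
print(peak, completed)
```

All 100 requests complete, but no more than 8 are ever outstanding at once, so excess load queues in the application rather than surfacing as rejections or timeouts.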

Environment C — uvicorn + DLRM (Real Serving)

What this isolates. Full production-style pipeline: HTTP request → key extraction → batch_read(9 sets × 200 keys) → feature build → DLRM inference (PyTorch CPU) → response. This is the closest measurement to what a real recsys serving pod looks like.

Setup

| Item | Value |
| --- | --- |
| Source | benchmark/results/k6-runtime-client-comparison.md (config "3.11 + GIL, stage OFF") |
| Image | aerospike-benchmark:latest (Python 3.11.14) |
| Loadgen | k6, 10 VUs × 60s per scenario, 4 scenarios |
| Pods | 2 replicas, 4 CPU / 4 GiB request, 8 CPU / 8 GiB limit |

Both endpoints (/predict/official/sample and /predict/py-async/sample) live in the same pod, called alternately by k6 — server state and network are identical.

k6 client-side latency (single mode, 10 VUs × 60s)

| Metric | aerospike-py | official | aerospike-py advantage |
| --- | --- | --- | --- |
| avg | 118 ms | 146 ms | −19% |
| median | 134 ms | 148 ms | −10% |
| p90 | 173 ms | 293 ms | −41% 🔥 |
| p95 | 189 ms | 324 ms | −42% 🔥 |

k6 client-side latency (gather mode — 9 set fan-out)

| Metric | aerospike-py | official | advantage |
| --- | --- | --- | --- |
| p95 | 234 ms | 266 ms | −12% |

The gather mode runs 9 batch_read calls in asyncio.gather(...). The gap shrinks because both clients now hit GIL serialization on the result conversion step — see Bottleneck Analysis.
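A minimal sketch of this fan-out shape is shown below. `fake_batch_read` is a hypothetical stand-in for the client call, and the set names `set_1` to `set_9` are an assumption borrowed from Environment A.

```python
import asyncio


async def fake_batch_read(set_name, keys):
    """Hypothetical stand-in for one per-set batch read (not the real API)."""
    await asyncio.sleep(0.001)  # the I/O wait overlaps across the 9 calls
    return {key: {"set": set_name} for key in keys}


async def read_all_sets(keys):
    set_names = [f"set_{i}" for i in range(1, 10)]  # assumed names, 9 sets
    # Fan out: all 9 reads are in flight at the same time.
    per_set = await asyncio.gather(
        *(fake_batch_read(name, keys) for name in set_names)
    )
    # Building Python objects from the results is ordinary Python work,
    # so this step runs under the GIL no matter which client did the I/O.
    return dict(zip(set_names, per_set))


features = asyncio.run(read_all_sets(list(range(200))))
print(len(features), len(features["set_1"]))
```

The I/O waits overlap, but the conversion of 9 × 200 records into Python objects does not, which is consistent with the smaller gap reported for this mode.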

Server-side Prometheus metrics (same run)

predict_duration_seconds p95 (FastAPI E2E, measured inside the pod):

| Client | p95 |
| --- | --- |
| aerospike-py | 202 ms |
| official | 274 ms |

aerospike_batch_read_all_duration_seconds p95 (app-level — batch_read + as_dict + demux):

| Client | p95 |
| --- | --- |
| aerospike-py | 137 ms |
| official | 252 ms |

Server-side and client-side numbers move in the same direction — the difference between them (~13 ms) is just network RTT + gateway + connection setup.

Per-mode summary (all p95 in ms)

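| Mode | aerospike-py | official | aerospike-py advantage |
| --- | --- | --- | --- |
| single | 189 | 324 | −42% |
| gather | 234 | 266 | −12% |

Modes with a single reported p95: merge_gather (aerospike-py only) at 202, and stress (0→20→50 VUs ramp) at 592.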

Pattern across environments

As layers are added around the DB call, the ratio compresses (4.8× → 1.24× mean) but upper-percentile advantage holds — aerospike-py keeps the GIL released during I/O, while the official client serializes GIL acquisition through run_in_executor and spikes at the tail. That's why p95 still wins −42% in environment C even though mean is only −19%.

What's left to investigate: see Free-Threaded Python for what happens when the GIL is removed entirely, and Bottleneck Analysis for the gather → single recipe that gives another −33%.