Version: 0.10.6

Benchmarks

This page contains the full per-environment numbers backing the Overview. All measurements use Python 3.11 with the GIL (the production default). For the free-threaded Python 3.14t comparison, see Free-Threaded Python.

Environment A — Pure DB Client

What this isolates. No FastAPI, no model inference, no HTTP loadgen. A Python loop drives batch_read directly so the surrounding stack is as thin as possible — this is where aerospike-py's advantage is largest.

Setup

Item	Value
Source	`benchmark/results/20260416_134243/report.md`
Clients	`official` (sync C), `official-async` (sync wrapped via executor), `py-async` (aerospike-py)
Sets	9
Batch sizes	50, 200
Concurrency	10
Iterations	30
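The setup above can be sketched as a small asyncio harness. This is a minimal illustration under stated assumptions, not the benchmark script behind the report: `fake_batch_read` is a hypothetical stand-in that sleeps instead of talking to a database, `run_benchmark` is a name invented here, and TPS is assumed to mean completed calls divided by wall time.

```python
import asyncio
import statistics
import time


async def fake_batch_read(keys):
    """Hypothetical stand-in for a client's batch read; sleeps instead of doing I/O."""
    await asyncio.sleep(0.001)
    return [None] * len(keys)


async def run_benchmark(batch_read, batch_size, concurrency, iterations):
    """Run `iterations` rounds of `concurrency` parallel calls and summarize latency."""
    keys = [f"key_{i}" for i in range(batch_size)]
    latencies_ms = []

    async def timed_call():
        start = time.perf_counter()
        await batch_read(keys)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    wall_start = time.perf_counter()
    for _ in range(iterations):
        await asyncio.gather(*(timed_call() for _ in range(concurrency)))
    wall_s = time.perf_counter() - wall_start

    # statistics.quantiles with n=100 returns 99 cut points; index 98 is p99.
    p99 = statistics.quantiles(latencies_ms, n=100)[98]
    return {
        "calls": len(latencies_ms),
        "mean_ms": statistics.fmean(latencies_ms),
        "p99_ms": p99,
        # Assumption: TPS is completed calls per second of wall time.
        "tps": len(latencies_ms) / wall_s,
    }


if __name__ == "__main__":
    summary = asyncio.run(
        run_benchmark(fake_batch_read, batch_size=50, concurrency=10, iterations=30)
    )
    print(summary)
```

Swapping `fake_batch_read` for a real client call would be the only change needed to point the same loop at a database; everything else is standard library.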

Aggregate result

Client	avg mean (ms)	avg p99 (ms)	avg TPS
official	107.56	195.34	138.2
official-async	110.64	211.33	125.7
py-async (aerospike-py)	22.45	120.67	373.7
py-async vs official	4.8× faster latency	1.6×	2.7× higher TPS
py-async vs official-async	4.9× faster	1.8×	3.0×
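The ratio rows can be reproduced from the three client rows. The short sketch below assumes latency speedup is baseline divided by candidate and TPS speedup is candidate divided by baseline; the helper name `speedups` is introduced here for illustration only.

```python
# Aggregate numbers copied from the table above.
aggregate = {
    "official": {"mean_ms": 107.56, "p99_ms": 195.34, "tps": 138.2},
    "official-async": {"mean_ms": 110.64, "p99_ms": 211.33, "tps": 125.7},
    "py-async": {"mean_ms": 22.45, "p99_ms": 120.67, "tps": 373.7},
}


def speedups(baseline, candidate):
    """Latency ratios are baseline/candidate; the TPS ratio is candidate/baseline."""
    return {
        "mean": round(baseline["mean_ms"] / candidate["mean_ms"], 1),
        "p99": round(baseline["p99_ms"] / candidate["p99_ms"], 1),
        "tps": round(candidate["tps"] / baseline["tps"], 1),
    }


vs_official = speedups(aggregate["official"], aggregate["py-async"])
vs_official_async = speedups(aggregate["official-async"], aggregate["py-async"])
print(vs_official)        # mean 4.8, p99 1.6, tps 2.7
print(vs_official_async)  # mean 4.9, p99 1.8, tps 3.0
```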

Per-set comparison (official vs aerospike-py)

set	batch	official mean (ms)	official p99	aerospike-py mean	aerospike-py p99	mean speedup	TPS speedup
set_1	50	110.23	200.14	19.74	105.46	5.6×	4.3×
set_1	200	127.06	206.19	30.53	194.05	4.2×	3.4×
set_2	50	121.76	210.24	15.57	34.68	7.8×	5.8×
set_2	200	110.86	194.82	24.98	124.86	4.4×	3.2×
set_3	50	108.00	184.32	18.53	103.22	5.8×	5.3×
set_3	200	128.85	220.15	23.35	110.90	5.5×	4.0×
set_4	50	109.15	206.11	18.18	116.92	6.0×	5.3×
set_4	200	118.69	195.77	23.12	109.32	5.1×	3.5×
set_5	50	113.21	195.27	25.57	301.28	4.4×	3.0×
set_5	200	122.26	197.89	26.85	148.11	4.6×	3.8×
set_6	50	115.30	210.74	18.87	103.98	6.1×	5.2×
set_6	200	123.93	261.41	30.47	122.36	4.1×	3.6×
set_7	50	115.34	190.68	17.83	44.86	6.5×	5.0×
set_7	200	126.81	215.87	41.17	133.61	3.1×	2.1×
set_8	50	13.92	96.27	15.20	97.95	0.9×	0.9×
set_8	200	19.95	102.12	16.27	116.03	1.2×	1.0×
set_9	50	111.80	217.14	16.92	95.59	6.6×	5.6×
set_9	200	139.05	210.95	20.95	108.81	6.6×	4.8×

set_8 is empty

set_8 has a 0% found rate. The official client returns not-found errors faster than it completes successful reads, so these rows reflect "fast not-found" rather than real read latency. All other sets show a 3.1–7.8× mean speedup.

official-async (the sync C client wrapped with loop.run_in_executor) is slightly slower than the bare sync client across every set — each request pays a thread-pool hop. aerospike-py's gap over official-async is therefore consistently larger than its gap over official (mean speedup 6.0–7.1× at batch=50, vs 4.4–6.6× over official).
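The wrapping pattern described above looks roughly like the following. It is a sketch under assumptions rather than the benchmark's actual code: `blocking_batch_read` is a hypothetical synchronous call that uses `time.sleep` in place of real I/O, and only `asyncio.get_running_loop()` and `loop.run_in_executor()` are real APIs.

```python
import asyncio
import time


def blocking_batch_read(keys):
    """Hypothetical synchronous client call; time.sleep stands in for blocking I/O."""
    time.sleep(0.005)
    return {key: None for key in keys}


async def batch_read_via_executor(keys):
    """Make the blocking call awaitable, at the cost of a thread-pool hop per request."""
    loop = asyncio.get_running_loop()
    # Passing None selects the event loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, blocking_batch_read, keys)


async def main():
    keys = [f"key_{i}" for i in range(50)]
    # Ten concurrent requests, each occupying one worker thread while it blocks.
    results = await asyncio.gather(*(batch_read_via_executor(keys) for _ in range(10)))
    return len(results), len(results[0])


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each awaited call is handed to a worker thread and its result is passed back to the event loop, which is the extra hop the paragraph above refers to.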

Environment B — uvicorn ASGI Only

What this isolates. Add FastAPI and uvicorn around batch_read but no model inference. This is the "REST API in front of a key-value lookup" shape.

Setup

Item	Value
Source	`benchmark/results/asgi_20260416_134730/asgi-report.md`
Concurrency	5
Iterations	50

Pipeline breakdown (latency in ms)

Client	total mean	p50	p90	p95	p99	aerospike step	inference	http step	TPS	errors
official	289.56	289.36	429.74	461.15	473.60	280.00	1.55	300.69	16.6	0
aerospike-py	228.49	189.56	457.42	468.03	497.09	221.12	1.46	258.10	19.4	1

Total mean: −21% (290 → 228 ms)
Aerospike step: −21% (280 → 221 ms) — most of the wall time is the DB call, so the client gain transfers to E2E
TPS: +17% (16.6 → 19.4 req/s)
p99 is roughly equivalent (473.60 vs 497.09 ms): at concurrency=5, the Tokio model isn't yet stressing the GIL hard enough to widen the tail
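The percentages in the list above follow directly from the pipeline table. A small worked check, where `pct_change` is a helper name chosen here for illustration:

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in whole percent."""
    return round((after - before) / before * 100)


# Values copied from the pipeline breakdown table (official -> aerospike-py).
total_mean = pct_change(289.56, 228.49)
aerospike_step = pct_change(280.00, 221.12)
tps = pct_change(16.6, 19.4)

print(total_mean, aerospike_step, tps)  # -21 -21 17
```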

Higher-concurrency variant (concurrency=10, 200 iter)

A second run at higher concurrency surfaced a different picture (`benchmark/results/asgi_20260416_135234/asgi-report.md`):

Client	total mean	p95	TPS	errors
official	465.81	885.64	20.9	0
aerospike-py	580.70	986.93	13.9	56

Backpressure under saturation

This run saturated the test harness (56 errors on the aerospike-py side from request rejections / timeouts under load that the official client absorbed differently). It's not a clean apples-to-apples comparison — it's evidence that without proper backpressure tuning, native async clients can issue more in-flight work than the server pool can absorb. See Bottleneck Analysis for the gather→single recipe that resolves this.
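One generic way to cap in-flight work is an `asyncio.Semaphore` around the read call. The sketch below is an illustration of that general technique only; it is not the gather→single recipe from Bottleneck Analysis, and `BoundedReader`, `fake_batch_read`, and the limit of 4 are all hypothetical choices.

```python
import asyncio


async def fake_batch_read(keys):
    """Hypothetical stand-in for a native async batch read."""
    await asyncio.sleep(0.001)
    return [None] * len(keys)


class BoundedReader:
    """Limits how many reads may be in flight at once (illustrative helper)."""

    def __init__(self, batch_read, max_in_flight):
        self._batch_read = batch_read
        self._semaphore = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def read(self, keys):
        # Callers beyond the limit wait here instead of piling onto the server.
        async with self._semaphore:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                return await self._batch_read(keys)
            finally:
                self.in_flight -= 1


async def main():
    reader = BoundedReader(fake_batch_read, max_in_flight=4)
    keys = [f"key_{i}" for i in range(200)]
    # 50 requests are submitted at once, but at most 4 reach the read call together.
    await asyncio.gather(*(reader.read(keys) for _ in range(50)))
    return reader.peak_in_flight


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The right limit depends on the server and connection pool in question; the value here is arbitrary.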

Environment C — uvicorn + DLRM (Real Serving)

What this isolates. Full production-style pipeline: HTTP request → key extraction → batch_read(9 sets × 200 keys) → feature build → DLRM inference (PyTorch CPU) → response. This is the closest measurement to what a real recsys serving pod looks like.

Setup

Item	Value
Source	`benchmark/results/k6-runtime-client-comparison.md` (config "3.11 + GIL, stage OFF")
Image	`aerospike-benchmark:latest` (Python 3.11.14)
Loadgen	k6 10 VUs × 60s per scenario, 4 scenarios
Pods	2 replicas, 4 CPU / 4 GiB request, 8 CPU / 8 GiB limit

Both endpoints (/predict/official/sample and /predict/py-async/sample) live in the same pod, called alternately by k6 — server state and network are identical.

k6 client-side latency (single mode, 10 VUs × 60s)

Metric	aerospike-py	official	aerospike-py advantage
avg	118 ms	146 ms	−19%
median	134 ms	148 ms	−10%
p90	173 ms	293 ms	−41% 🔥
p95	189 ms	324 ms	−42% 🔥

k6 client-side latency (gather mode — 9 set fan-out)

Metric	aerospike-py	official	advantage
p95	234 ms	266 ms	−12%

The gather mode runs 9 batch_read calls in asyncio.gather(...). The gap shrinks because both clients now hit GIL serialization on the result conversion step — see Bottleneck Analysis.
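The fan-out shape of gather mode can be sketched as follows, assuming one batch read per set with 200 keys each. `fake_batch_read` and `read_all_sets_gather` are hypothetical names, and the stub sleeps rather than performing a real read.

```python
import asyncio

SET_NAMES = [f"set_{i}" for i in range(1, 10)]  # 9 sets
KEYS_PER_SET = 200


async def fake_batch_read(set_name, keys):
    """Hypothetical stand-in for one per-set batch read."""
    await asyncio.sleep(0.001)
    return [(set_name, key, None) for key in keys]


async def read_all_sets_gather(keys_by_set):
    """Fan out one batch read per set and await them together."""
    per_set = await asyncio.gather(
        *(fake_batch_read(name, keys) for name, keys in keys_by_set.items())
    )
    # asyncio.gather preserves input order, so results line up with the set names.
    return dict(zip(keys_by_set.keys(), per_set))


async def main():
    keys_by_set = {
        name: [f"user_{i}" for i in range(KEYS_PER_SET)] for name in SET_NAMES
    }
    results = await read_all_sets_gather(keys_by_set)
    return {name: len(records) for name, records in results.items()}


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In a real client, each of the nine results would still need converting to Python objects under the GIL, which is the serialization point the paragraph above describes.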

Server-side Prometheus metrics (same run)

predict_duration_seconds p95 (FastAPI E2E, measured inside the pod):

Client	p95
aerospike-py	202 ms
official	274 ms

aerospike_batch_read_all_duration_seconds p95 (app-level — batch_read + as_dict + demux):

Client	p95
aerospike-py	137 ms
official	252 ms

Server-side and client-side numbers move in the same direction — the difference between them (~13 ms) is just network RTT + gateway + connection setup.

Per-mode summary (all p95 in ms)

Mode	aerospike-py	official	aerospike-py advantage
single	189	324	−42%
gather	234	266	−12%
merge_gather (aerospike-py only)	202	—	—
stress (0→20→50 VUs ramp)	592	—	—

Pattern across environments

As layers are added around the DB call, the ratio compresses (4.8× → 1.24× mean) but upper-percentile advantage holds — aerospike-py keeps the GIL released during I/O, while the official client serializes GIL acquisition through run_in_executor and spikes at the tail. That's why p95 still wins −42% in environment C even though mean is only −19%.

What's left to investigate: see Free-Threaded Python for what happens when the GIL is removed entirely, and Bottleneck Analysis for the gather→single recipe that gives another −33%.
