Version: main

Bottleneck Analysis & Optimization

What makes batch_read slow under load, how aerospike-py exposes per-stage timings to find out, and which optimizations measurably help.

Sources: benchmark/results/bottleneck-analysis.md, benchmark/results/internal-stage-toggle-comparison.md, benchmark/results/k6-runtime-client-comparison.md.

The initial mystery

Rust internal batch_read I/O (db_client_operation_duration_seconds):    0.9 ms
Python-level per-set read (aerospike_batch_read_set_duration):         51.0 ms
                                                                     ────────
Unaccounted overhead — 99% of the wall time happens outside Rust:    50.1 ms

To find where the 50 ms went, aerospike-py added per-stage histograms covering both Rust internals and the PyO3 ↔ asyncio boundary.

Internal stage profiling

Each batch_read is split into 10 stages. Histograms (db_client_internal_stage_seconds) record latency for each stage individually.

Stage	Where	GIL state
`key_parse`	Python tuples → Rust `Vec<Key>`	held
`limiter_wait`	Backpressure semaphore acquire	released
`tokio_schedule_delay`	Tokio task scheduling	released
`io`	Aerospike network round-trip	released
`into_pyobject`	Rust `BatchRecord` → `Py<...>` (Arc wrap, `spawn_blocking`)	held
`spawn_blocking_delay`	Wall time `spawn_blocking` task spent waiting in queue	—
`event_loop_resume_delay`	Event-loop tick from future-ready to coroutine resume	—
`to_dict`	`LazyBatchRecords.to_dict()` → `dict[Key, Record]`	held
`to_numpy`	`LazyBatchRecords.to_numpy(dtype)` → `NumpyBatchRecords` framing — the per-record buffer fill itself runs with the GIL released via `py.detach`	held (framing only)
`merge_to_dict`	Multi-set merge into typed dict	held
`future_into_py_setup`	PyO3 future ↔ asyncio bridge setup	held

Activating stage profiling

Stage profiling is off by default with zero overhead. Activate it three ways:

Env var (process-wide)
Runtime toggle
Scoped context manager

export AEROSPIKE_PY_INTERNAL_METRICS=1
# read once at process start

import aerospike_py
aerospike_py.set_internal_stage_metrics_enabled(True)

with aerospike_py.internal_stage_profiling():
    await client.batch_read(keys)

When disabled, Instant::now() syscalls and histogram updates are elided — only one AtomicBool load (~1 ns) per stage entry remains.

What the stages revealed (Python 3.11 + GIL)

Stage                    avg            Comment
────────────────────────────────────────────────────────────────────
into_pyobject            3.96 μs        Arc wrap is essentially free
limiter_wait             3.56 μs        Backpressure not active
future_into_py_setup     46.5 μs        PyO3 ↔ asyncio bridge is cheap
tokio_schedule_delay     83.1 μs        Tokio scheduling fine
to_dict                  832 μs         Per-call dict materialization OK
key_parse                967 μs         200-key tuple parsing OK
merge_to_dict            4.48 ms        9-set merge moderate
                                          (historical: this stage was
                                           named `merge_as_dict` before
                                           the alias drop in PR-374)
io                       7.51 ms        Network (inflated by GIL during parse)
event_loop_resume_delay  39.7 ms   🚨   Coroutines waiting their turn for GIL
spawn_blocking_delay     234 ms    🚨   spawn_blocking queue waiting for GIL

Conclusion. aerospike-py's Rust code is fast (~5 ms total). The 46 ms of overhead is real, but caused by GIL serialization at the PyO3 ↔ asyncio boundary, not by any fixable hot loop in Rust.

This finding directly motivated the Free-Threaded Python experiment, which collapsed both stages to near-zero.

Optimization 1: `gather(N)` → single `batch_read(mixed keys)`

The 9-set serving pipeline originally fanned out via asyncio.gather. Under GIL, this creates 9 simultaneous PyO3 → asyncio handoffs, and key_parse + to_dict stages serialize against each other.

Before (gather, 9 calls)
After (single, 1 call)
Why it works (timeline)

# 9 concurrent batch_read calls — GIL serializes key_parse + to_dict
await asyncio.gather(*[
    client.batch_read(set_i_keys) for i in range(9)
])

# 1800 keys in one call; demux by set in Python (cheap)
all_keys = []
for set_name, keys in per_set_keys.items():
    all_keys.extend(("ns", set_name, k) for k in keys)

result = await client.batch_read(all_keys)

per_set_results = defaultdict(dict)
for key, record in result.to_dict().items():
    per_set_results[key[1]][key[2]] = record

gather mode (9 sets × 200 keys):

T=0ms     asyncio.gather starts
          ├─ task 1: key_parse (1.46 ms, GIL held)
          ├─ task 2: key_parse waits (GIL)
          └─ task 9: key_parse waits (GIL)
T=~13ms   9 key_parse done (sequential, 9 × 1.46 ms)
          └─ 9 Rust async futures execute in parallel on Tokio
T=~15ms   9 I/O complete
T=15-51ms event loop resumes 9 coroutines sequentially:
          ├─ task 1 → to_dict (0.79 ms) → done
          ├─ (event loop tick: 3-5 ms)
          └─ task 9 → to_dict (0.79 ms) → done
T=~51ms   gather returns

single mode: only ONE key_parse, ONE to_dict, ONE future_into_py.

Measured impact (k6 10 VUs × 60s, FastAPI + DLRM)

aerospike-py p95
   gather        ████████████████████  434 ms
   single        ██████████████        302 ms   −30%

aerospike-py avg
   gather        █████████             189 ms
   single        ██████                126 ms   −33%

Client	Mode	avg	median	p90	p95
aerospike-py	gather (9×)	189	176	393	434
aerospike-py	single (mixed keys)	126	101	284	302
Improvement		−33%	−43%	−28%	−30%

Client	Mode	avg	p95
official	gather	251	502
official	single	188	396
aerospike-py vs official (single)		1.49×	1.31×

The single-call pattern wins for both clients (always doing less work), but the gain is larger for aerospike-py because removing 8 of the 9 PyO3 ↔ asyncio handoffs eliminates 8× the GIL serialization cost.

After enabling this as the default, a follow-up run (different test conditions, lighter overall load) showed avg 60.8 ms / p95 154.6 ms for aerospike-py single mode — a sustained absolute floor.

Optimization 2: Keep stage profiling always on

The toggle exists so that stage profiling can be left on in production if the overhead is small enough.

Rust internal batch_read op_duration (controlled load)
                          OFF (baseline)  →  ON (debug)
  avg                     2.16 ms            3.22 ms      +49%
  p95                     4.67 ms            7.61 ms      +63%
  p99                     4.93 ms            9.51 ms      +93%

That looks bad in isolation. But Rust internal time is only 5–10% of E2E.

E2E k6 single p95 (the one users feel)
  3.11 + GIL, stage OFF                       189 ms
  3.11 + GIL, stage ON                        143 ms ← lower (within noise)

Stage ON measured lower end-to-end p95 than stage OFF — within noise, the toggle has no measurable E2E impact. The Rust-level overhead disappears into the much larger asyncio + GIL cost outside Rust.

Recommendation: keep AEROSPIKE_PY_INTERNAL_METRICS=1 on in production. You get instant per-stage attribution when an incident happens. When the GIL is removed (3.14t), even the Rust-internal overhead drops to negligible.

Optimization 3: Switch to Python 3.14t

This is the largest single lever — see the Free-Threaded Python page for the full breakdown. Summary:

Item	3.11 + GIL	3.14t	Δ
Same Rust binary	—	—	—
`spawn_blocking_delay`	234 ms	0.12 ms	−99.95% 🔥
`event_loop_resume_delay`	39.7 ms	≈ 0	≈ −100%
`io`	7.51 ms	1.27 ms	−83%
E2E p95	189 ms	97 ms	−49%
TPS	41.6	61.2 iter/s	+47%

For the cumulative effect across all three optimizations, see the Overview.

Reproducibility note: local saturation testing

A separate saturation reproducer run (benchmark/multiworker_saturation/results/saturation_20260511/) attempted to reproduce a production CPU/RPS gap reported in issue #347 on a local macOS + Podman environment.

The right comparison metric across two different hardware envelopes is CPU per RPS (efficiency), not raw CPU% or raw RPS in isolation. Different machines have different capacity ceilings, so a flat RPS / CPU% ratio normalizes both.

                            legacy            aerospike-py     aspy vs legacy
                            ──────────────    ─────────────    ──────────────
Production (VU 100, prod hardware)
  raw RPS                   206               187              −9%
  raw CPU%                  62%               100%             +38%p
  CPU per RPS               0.301 %/req       0.535 %/req      +78%   ⚠ (large)

Local (VU 100, macOS + Podman, 3.11)
  raw RPS                   390.6             369.8            −5.3%
  raw CPU%                  125.9%            127.9%           +2.0%p
  CPU per RPS               0.322 %/req       0.346 %/req      +7.4%  (noise)

Same direction, very different magnitude. Both environments show aerospike-py being slightly less CPU-efficient than the legacy C extension, but the production gap (+78% CPU/RPS) is ~10× the local gap (+7.4% CPU/RPS, within run-to-run noise). The gate fails because the production magnitude does not reproduce locally.

Useful observation at low concurrency (VU 10) — note this comes from the higher-fidelity cell3/cell4 dataset (12 CPU + BURN_MS=30, the "most trustworthy" run per the verdict), not the same isolated run used for the VU 100 CPU table above:

	aerospike-py	legacy C ext
p50	67.6 ms	39.3 ms
p95	71.1 ms	73.6 ms
RSS max	696 MB	516 MB

At low concurrency, aerospike-py shows a measurable per-call overhead penalty (p50 +72%) consistent with the analyzed future_into_py_setup + tokio_schedule_delay path. The gap disappears under load, but is a known target for further optimization.

Why prod reproduction failed locally (best guesses): no DLRM PyTorch inference (our burn was pure-Python loop), single Aerospike node (vs prod 5-node cluster — Aerospike-side bottleneck masks client-side differences), and macOS Podman VM abstraction of cgroup scheduling.

Future optimization candidates

Identified but not yet implemented:

Idea	Estimated effect	Difficulty
`key_parse` interning (`py.intern()` for bin names)	−10–15%	Medium
`future_into_py` improvements	Theoretical max for this stage	High — needs upstream PyO3 PR
Promote module to `#[pymodule(gil_used = false)]` after audit	Removes any latent GIL re-acquisition cost on 3.14t	Medium
`batch_read_ordered` / `batch_read_many`	Drop dedup String overhead	Medium

The initial mystery​

Internal stage profiling​

Activating stage profiling​

What the stages revealed (Python 3.11 + GIL)​

Optimization 1: gather(N) → single batch_read(mixed keys)​

Measured impact (k6 10 VUs × 60s, FastAPI + DLRM)​

Optimization 2: Keep stage profiling always on​

Optimization 3: Switch to Python 3.14t​

Reproducibility note: local saturation testing​

Future optimization candidates​