본문으로 건너뛰기
버전: 0.10.6

Bottleneck Analysis & Optimization

This page explains what makes batch_read slow under load, how aerospike-py exposes per-stage timings to find out, and which optimizations measurably help.

Sources: benchmark/results/bottleneck-analysis.md, benchmark/results/internal-stage-toggle-comparison.md, benchmark/results/k6-runtime-client-comparison.md.

The Initial Mystery

Production observability showed Rust internal batch_read I/O taking ~0.9 ms, but the Python-level metric for a per-set read was ~51 ms. 99% of the wall time was happening outside Rust.

Rust metrics (db_client_operation_duration_seconds):    0.9 ms
Python metrics (aerospike_batch_read_set_duration): 51.0 ms
─────────────────────────────────────────────────────────────
Unaccounted overhead: 50.1 ms (98.2%)

To find where the 50 ms went, aerospike-py added per-stage histograms covering both Rust internals and the PyO3 ↔ asyncio boundary.

Internal Stage Profiling

Each batch_read is split into 10 stages. Histograms (db_client_internal_stage_seconds) record latency for each stage individually.

StageWhereGIL state
key_parsePython tuples → Rust Vec<Key>held
limiter_waitBackpressure semaphore acquirereleased
tokio_schedule_delayTokio task schedulingreleased
ioAerospike network round-tripreleased
into_pyobjectRust BatchRecordPy<...> (Arc wrap, run on spawn_blocking)held
spawn_blocking_delayWall time spawn_blocking task spent waiting in queue
event_loop_resume_delayEvent-loop tick from future-ready to coroutine resume
as_dictBatchReadHandle.as_dict()dict[Key, Record]held
merge_as_dictMulti-set merge into typed dictheld
future_into_py_setupPyO3 future ↔ asyncio bridge setupheld

Activating Stage Profiling

Stage profiling is off by default with zero overhead. Activate it three ways:

# Process-wide via env var (read at process start)
# AEROSPIKE_PY_INTERNAL_METRICS=1

# Runtime toggle
import aerospike_py
aerospike_py.set_internal_stage_metrics_enabled(True)

# Scoped context manager
with aerospike_py.internal_stage_profiling():
await client.batch_read(keys)

When disabled, the Instant::now() syscalls and histogram updates are elided — only one AtomicBool load (~1 ns) per stage entry remains.

What the Stages Revealed (Python 3.11 + GIL)

Running with profiling on under k6 load:

StageavgWhat this means
into_pyobject3.96 μsArc wrap is essentially free
limiter_wait3.56 μsBackpressure not active (no contention)
future_into_py_setup46.5 μsPyO3 ↔ asyncio bridge is cheap
tokio_schedule_delay83.1 μsTokio scheduling is fine
as_dict832 μsPer-call dict materialization OK
key_parse967 μs200-key tuple parsing OK
merge_as_dict4.48 ms9-set merge is moderate
io7.51 msNetwork — but inflated by GIL contention during response parsing
event_loop_resume_delay39.7 ms🚨 Coroutines waiting their turn for the GIL
spawn_blocking_delay234 ms🚨 spawn_blocking queue waiting for GIL

Conclusion: aerospike-py's Rust code is fast (~5 ms total for the actual batch_read work). The 46 ms of unaccounted overhead is real, but it's caused by GIL serialization at the PyO3 ↔ asyncio boundary, not by any fixable hot loop in Rust.

This finding directly motivated the Free-Threaded Python experiment, which collapsed both stages to near-zero.

Optimization 1: gather(N) → Single batch_read(mixed keys)

The 9-set serving pipeline originally fanned out via asyncio.gather:

# 9 concurrent batch_read calls
await asyncio.gather(*[
client.batch_read(set_i_keys) for i in range(9)
])

Under GIL, this creates 9 simultaneous PyO3 → asyncio handoffs. Each key_parse and as_dict stage acquires the GIL, and they serialize:

Timeline (gather mode, 9 sets × 200 keys):
T=0ms asyncio.gather starts
├─ task 1: key_parse (1.46 ms, GIL held)
├─ task 2: key_parse waits (GIL)
├─ ...
└─ task 9: key_parse waits (GIL)
T=~13ms 9 key_parse done (sequential, 9 × 1.46 ms)
└─ 9 Rust async futures execute in parallel on Tokio
T=~15ms 9 I/O complete
T=15-51ms event loop resumes 9 coroutines sequentially:
├─ task 1 → as_dict (0.79 ms) → done
├─ (event loop tick: 3-5 ms)
└─ task 9 → as_dict (0.79 ms) → done
T=~51ms gather returns

The fix: batch_read already supports mixed-set keys. Send all 1800 keys (9 × 200) in a single call and demux the result in Python:

all_keys = []
for set_name, keys in per_set_keys.items():
all_keys.extend(("ns", set_name, k) for k in keys)

result = await client.batch_read(all_keys)

# Demux by set in Python (cheap)
per_set_results = defaultdict(dict)
for key, record in result.as_dict().items():
per_set_results[key[1]][key[2]] = record

Measured impact (k6 10 VUs × 60s, FastAPI + DLRM)

ClientModeavgmedianp90p95
aerospike-pygather (9×)189176393434
aerospike-pysingle (mixed keys)126101284302
Improvement−33%−43%−28%−30%
ClientModeavgp95
officialgather251502
officialsingle188396
aerospike-py vs official (single mode)1.49×1.31×

The single-call pattern wins for both clients (it's always doing less work), but the gain is larger for aerospike-py because removing 8 of the 9 PyO3 ↔ asyncio handoffs eliminates 8× the GIL serialization cost.

After enabling this as the default, a follow-up run (different test conditions, lighter overall load) showed avg 60.8 ms / p95 154.6 ms for aerospike-py single mode — a sustained absolute floor.

Optimization 2: Always-On Stage Profiling

The toggle exists so that stage profiling can be left on in production if the overhead is small enough. The measurement:

Metric (Rust internal batch_read op_duration)OFF (baseline)ON (debug)Δ
avg2.16 ms3.22 ms+1.06 ms (+49%)
p954.67 ms7.61 ms+2.94 ms (+63%)
p994.93 ms9.51 ms+4.58 ms (+93%)

That looks bad in isolation. But Rust internal time is only 5–10% of E2E latency. At the FastAPI E2E level (the one users actually feel):

Configk6 single p95
3.11 + GIL, stage OFF189 ms
3.11 + GIL, stage ON143 ms

Stage ON measured lower end-to-end p95 than stage OFF — within noise, the toggle has no measurable E2E impact. The Rust-level overhead disappears into the much larger asyncio + GIL cost outside Rust.

Recommendation: keep AEROSPIKE_PY_INTERNAL_METRICS=1 on in production. You get instant per-stage attribution when an incident happens. When the GIL is removed (3.14t), even the Rust-internal overhead drops to negligible.

Optimization 3: Switch to Python 3.14t

This is the largest single lever — see the Free-Threaded Python page for the full breakdown. Summary:

  • Same Rust binary
  • spawn_blocking_delay: 234 ms → 0.12 ms
  • event_loop_resume_delay: 39.7 ms → ≈ 0
  • io: 7.51 ms → 1.27 ms
  • E2E p95: 189 ms → 97 ms (−49%)
  • TPS: 41.6 → 61.2 iter/s (+47%)

For the cumulative effect across all three optimizations and the recommended order to apply them, see the Overview.

Future Optimization Candidates

Identified but not yet implemented at the time of measurement:

IdeaEstimated effectDifficulty
key_parse interning (py.intern() for bin names)−10–15%Medium
future_into_py improvementsTheoretical max for this stageHigh — needs upstream PyO3 PR
Promote module to #[pymodule(gil_used = false)] after auditRemoves any latent GIL re-acquisition cost on 3.14tMedium