Version: 0.10.10

Free-Threaded Python (3.14t)

Python 3.14t is the free-threaded build of CPython — the GIL is disabled (Py_GIL_DISABLED=1). This page reports what changes for aerospike-py and the official aerospike C client when the runtime is swapped, with no Rust or C source changes.

Sources: benchmark/results/python-3.14t-benchmark.md, benchmark/results/k6-runtime-client-comparison.md.

TL;DR

p95 single-mode (k6 10 VUs × 60s, FastAPI + DLRM)

aerospike-py
3.11 + GIL ███████████████ 189 ms
3.14t ████████ 97 ms −49% 🔥

official (C extension)
3.11 + GIL █████████████████████████ 324 ms
3.14t ██████████ 128 ms −60% 🔥

Throughput (aerospike-py): 41.6 → 61.2 iter/s +47%

The GIL was a shared bottleneck for both clients. aerospike-py's gain came without touching Rust.

What gets faster (and why)

Internal stage timings under load. The two GIL-bound stages collapse to near-zero.

Stage                     3.11 + GIL              3.14t       Change
────────────────────────────────────────────────────────────────────
spawn_blocking_delay      234 ms    ████████      0.12 ms     −99.95% 🔥
event_loop_resume_delay   39.7 ms   ██            ≈ 0         ≈ −100%
io (Aerospike network)    7.51 ms   ▌             1.27 ms     −83%
merge_as_dict             4.48 ms   ▎             3.54 ms     −21%
key_parse                 967 μs    ·             1.06 ms     +10% (noise)
tokio_schedule_delay      83.1 μs                 49.5 μs     −40%
limiter_wait              3.56 μs                 0.96 μs     −73%

Two stages dominate the gain:

  • spawn_blocking_delay drops from 234 ms to 0.12 ms. Under GIL, when a Rust async future completes and needs to convert results into Py<...> objects, it dispatches that work to a spawn_blocking worker. That worker has to acquire the GIL — and under contention from the asyncio event loop and other workers, the queue stretches into hundreds of milliseconds. With no GIL, the worker runs immediately.
  • event_loop_resume_delay drops to effectively zero. Under GIL, after the future resolves and the event loop is woken up, the coroutine still has to wait its turn for the GIL before resuming. Free-threaded mode lets multiple coroutines resume in parallel.

io shrinking roughly 6× (7.51 ms → 1.27 ms) is a second-order effect: Tokio workers can now parse Aerospike protocol responses without contending with Python code for the GIL.
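The queueing effect behind spawn_blocking_delay can be approximated in pure Python. The sketch below is an analogy only, not the aerospike-py code path: it measures how long a thread-pool worker takes to start a task while other threads run CPU-bound Python. The names busy, dispatch_delay, and measure are hypothetical helpers introduced for illustration. On a GIL build the delay should grow with the number of busy threads; on a free-threaded build with spare cores it should stay small.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def busy(stop: threading.Event) -> None:
    """Pure-Python CPU loop; on a GIL build it competes for the GIL."""
    while not stop.is_set():
        sum(i * i for i in range(1000))


def dispatch_delay(pool: ThreadPoolExecutor) -> float:
    """Seconds between submitting a task and the worker starting to run it."""
    submitted = time.perf_counter()
    started = pool.submit(time.perf_counter).result()
    return started - submitted


def measure(n_busy_threads: int, samples: int = 20) -> float:
    """Worst observed dispatch delay while n_busy_threads burn CPU."""
    stop = threading.Event()
    burners = [
        threading.Thread(target=busy, args=(stop,)) for _ in range(n_busy_threads)
    ]
    for t in burners:
        t.start()
    try:
        with ThreadPoolExecutor(max_workers=1) as pool:
            return max(dispatch_delay(pool) for _ in range(samples))
    finally:
        # Always stop the CPU burners, even if a sample raised
        stop.set()
        for t in burners:
            t.join()


if __name__ == "__main__":
    for n in (0, 2, 4):
        print(f"{n} busy threads: worst delay {measure(n) * 1000:.2f} ms")
```

Running the same script under a GIL build and under 3.14t gives a rough feel for the contention described above; absolute numbers depend on the machine and will not match the benchmark figures.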

Same-workload comparison on 3.14t

On 3.14t, when both clients hit the same server load (alternating endpoints in the same pod, k6 10 VUs):

Client               p95 (single mode)   Difference
───────────────────────────────────────────────────
aerospike-py         126 ms              baseline
official aerospike   128 ms              +2 ms (~1.5%, noise)

The 42% p95 gap that existed under 3.11 + GIL (189 ms vs 324 ms) collapses to ~2 ms under 3.14t.

This is the cleanest evidence that GIL contention — not architectural difference — was responsible for the bulk of the original gap.
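As a quick check of the arithmetic, the gap can be recomputed from the p95 figures quoted on this page. This sketch assumes the gap is expressed relative to the official client's p95; relative_gap is a hypothetical helper, not part of either client.

```python
def relative_gap(faster_ms: float, slower_ms: float) -> float:
    """Gap as a percentage of the slower client's p95."""
    return (slower_ms - faster_ms) / slower_ms * 100.0


# p95 figures taken from this page
gil_gap = relative_gap(189, 324)     # 3.11 + GIL, TL;DR section
nogil_gap = relative_gap(126, 128)   # 3.14t, same-workload table

print(f"3.11 + GIL gap: {gil_gap:.1f}%")
print(f"3.14t gap:      {nogil_gap:.1f}%")
```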

Throughput (TPS)

Latency improvements translate directly into throughput. Two TPS views — k6 client iterations/s and server-side predict_requests_total rate — agree.

3.11 + GIL, stage OFF             ████████████          41.6  baseline
3.11 + GIL, stage ON              █████████████         44.1  +6% (noise)
3.14t, aerospike-py only          ███████████████████   61.2  +47% 🔥
3.14t, both clients (split load)  ██████████████        47.3  +14%

Config                            iterations/s   http_reqs/s
────────────────────────────────────────────────────────────
3.11 + GIL, stage OFF             41.6           50.8
3.11 + GIL, stage ON              44.1           52.9
3.14t, aerospike-py only          61.2           80.0
3.14t, both clients (split load)  47.3           59.8

Why "both clients" drops to 47.3 iter/s

When the two endpoints are loaded simultaneously, the same Aerospike server and FastAPI pod CPU are split between clients. Per-client throughput naturally halves. The 61.2 iter/s peak is the actual aerospike-py ceiling when it runs solo.

But aerospike-py still wins under real load

The "126 vs 128 ms" tie comes from a configuration where each client only sees ~5 effective VUs (10 VUs split across two endpoints). When the load is concentrated on one client, the gap reopens.

Solo-load comparison on 3.14t (each client owns the server)

Metric                                             aerospike-py solo   official solo   aerospike-py advantage
─────────────────────────────────────────────────────────────────────────────────────────────────────────────
k6 single p95                                      97 ms               134 ms          −28%
k6 gather p95 (9× fan-out)                         107 ms              253 ms          −58% 🔥
Server predict_duration_seconds p95                100 ms              138 ms          −28%
Server batch_read_all p95                          64 ms               67 ms           −4%
Theoretical capacity (Little's Law: 10 VUs / p95)  ~103 req/s          ~75 req/s       +37%

k6 throughput numbers can mislead

The two solo runs used different k6 scripts. k6_benchmark_official_only.js sends exactly 1 request per iteration; k6_benchmark.js (used for aerospike-py solo) splits 10 VUs across 4 scenarios. Raw iterations/s is not apples-to-apples — but per-request latency is, because both runs used the same VU count and the same server.
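The "Theoretical capacity" row in the solo-load table follows from Little's Law (concurrency = throughput × latency), rearranged to throughput = VUs / latency. The sketch below reproduces that row from the p95 figures above. Little's Law is normally stated with mean latency, so using p95 gives a conservative estimate; capacity is a hypothetical helper name.

```python
def capacity(vus: int, latency_seconds: float) -> float:
    """Little's Law rearranged: requests/s sustained by `vus` closed-loop clients."""
    return vus / latency_seconds


VUS = 10
aerospike_py = capacity(VUS, 0.097)   # k6 single p95 = 97 ms
official = capacity(VUS, 0.134)       # k6 single p95 = 134 ms
advantage = (aerospike_py / official - 1) * 100

print(f"aerospike-py: ~{aerospike_py:.0f} req/s")
print(f"official:     ~{official:.0f} req/s")
print(f"advantage:    +{advantage:.0f}%")
```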

Why aerospike-py still leads under solo load on 3.14t:

  1. Native async vs threadpool wrapping. Even without GIL contention, run_in_executor adds a thread-pool hop per request. aerospike-py awaits directly on the asyncio loop.
  2. Lazy dict conversion. batch_read() returns a BatchReadHandle (Arc-wrapped, ~10 μs). The Python-dict materialization happens lazily when the caller invokes .as_dict(). The official client builds the full dict on I/O completion.
  3. Single FFI boundary crossing. The aerospike-py Rust code completes a full batch_read inside one PyO3 call. The C extension crosses the Python ↔ C boundary multiple times per call.
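Point 1 above can be illustrated with a minimal asyncio sketch. It assumes a blocking client call wrapped with run_in_executor on one side and a natively awaitable call on the other. blocking_get, native_get, and wrapped_get are hypothetical stand-ins that simulate a network round trip with sleeps; they are not the API of either client.

```python
import asyncio
import time


def blocking_get(key: str) -> dict:
    """Hypothetical blocking client call (stand-in for a synchronous get)."""
    time.sleep(0.01)  # simulated network round trip
    return {"key": key, "bins": {"v": 1}}


async def native_get(key: str) -> dict:
    """Hypothetical natively async client call."""
    await asyncio.sleep(0.01)  # simulated network round trip
    return {"key": key, "bins": {"v": 1}}


async def wrapped_get(key: str) -> dict:
    loop = asyncio.get_running_loop()
    # Extra hop: hand the call to the default thread pool, then wake the loop
    return await loop.run_in_executor(None, blocking_get, key)


async def fan_out(getter, n: int) -> list:
    """Issue n concurrent reads, mirroring a gather-style fan-out."""
    return await asyncio.gather(*(getter(f"k{i}") for i in range(n)))


async def main() -> tuple:
    wrapped = await fan_out(wrapped_get, 9)   # each call occupies a pool thread
    native = await fan_out(native_get, 9)     # all calls stay on the event loop
    return wrapped, native


if __name__ == "__main__":
    wrapped, native = asyncio.run(main())
    print(wrapped == native)
```

Both paths return the same data; the difference is that the wrapped path needs a free pool thread per in-flight request and a cross-thread wake-up to deliver each result.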

These advantages compound as concurrency rises (the gather number — 107 vs 253 ms — is the clearest example).
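The lazy conversion described in point 2 can be sketched as follows. LazyBatchHandle is a hypothetical Python stand-in that only illustrates the idea of deferring dict construction until as_dict() is called; it does not reproduce the real BatchReadHandle, which lives in Rust.

```python
class LazyBatchHandle:
    """Hypothetical stand-in for a batch handle; not the aerospike-py class."""

    def __init__(self, raw_records: list) -> None:
        self._raw = raw_records   # cheap to hold: no per-record dicts built yet
        self.conversions = 0      # counts how often the dict cost was paid

    def as_dict(self) -> dict:
        """Materialize Python dicts only when the caller asks for them."""
        self.conversions += 1
        return {key: dict(bins) for key, bins in self._raw}


handle = LazyBatchHandle(
    [("user:1", [("age", 30)]), ("user:2", [("age", 41)])]
)
# Nothing has been converted yet; the cost is paid on the first as_dict() call.
```

A caller that only forwards the handle, or that needs a subset of records, never pays the full conversion cost up front.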

Next steps

  1. Add a 3.14t row to the CI matrix. Run unit + integration tests under python:3.14.2t-slim.
  2. Audit unsafe and shared mutable state in Rust. aerospike-py is mostly thread-safe but should be audited before declaring gil_used = false.
  3. Promote to #[pymodule(gil_used = false)]. Once the audit is clean.
  4. Wait for or build official client wheels. PyPI does not yet ship cp314t wheels for the official aerospike package. Source build works (see notes below).

Notes

Side-effect: inference also got faster. DLRM inference (PyTorch CPU, control variable) dropped from 43.5 ms → 20.7 ms (−52%) on 3.14t. Unrelated to aerospike-py — a free side benefit for any GIL-bound inference path running alongside async I/O.

GIL state verified. With Py_GIL_DISABLED=1, the interpreter did not re-enable the GIL after import aerospike_py. The Rust module currently declares #[pymodule(gil_used = true)], but the underlying code is already mostly thread-safe (ArcSwapOption for client, Arc<Vec<BatchRecord>> for batch handles, Mutex for metric registry). Promoting to gil_used = false after a full audit is a follow-up.
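One way to run this kind of check is with the standard library alone. The sketch below reports whether the interpreter is a free-threaded build and whether the GIL is currently active; gil_report is a hypothetical helper name. sys._is_gil_enabled() exists on Python 3.13 and later, so the code falls back to "GIL enabled" on older versions.

```python
import sys
import sysconfig


def gil_report() -> dict:
    """Report build-time and run-time GIL state for the current interpreter."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled was added in Python 3.13
    is_enabled = getattr(sys, "_is_gil_enabled", None)
    gil_enabled_now = True if is_enabled is None else bool(is_enabled())
    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled_now": gil_enabled_now,
    }


if __name__ == "__main__":
    print(gil_report())
    # To check an extension module, call gil_report() again after importing it
    # and compare the "gil_enabled_now" values.
```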

Build images. aerospike-benchmark:314t ships only aerospike-py (uses cp314t wheels from benchmark/deploy/wheels-314t/). aerospike-benchmark:314t-with-official adds the official C client built from source — required apt deps: build-essential libssl-dev libuv1-dev liblua5.1-0-dev libyaml-dev pkg-config zlib1g-dev (libyaml-dev is easy to miss). Build time ~10 min.