Version: 0.10.10

Free-Threaded Python (3.14t)

Python 3.14t is the free-threaded build of CPython — the GIL is disabled (Py_GIL_DISABLED=1). This page reports what changes for aerospike-py and the official aerospike C client when the runtime is swapped, with no Rust or C source changes.

Sources: benchmark/results/python-3.14t-benchmark.md, benchmark/results/k6-runtime-client-comparison.md.

TL;DR

p95 single-mode (k6 10 VUs × 60s, FastAPI + DLRM)

aerospike-py
  3.11 + GIL    ███████████████              189 ms
  3.14t         ████████                      97 ms   −49% 🔥

official (C extension)
  3.11 + GIL    █████████████████████████    324 ms
  3.14t         ██████████                   128 ms   −60% 🔥

Throughput (aerospike-py): 41.6 → 61.2 iter/s   +47%

The GIL was a shared bottleneck for both clients. aerospike-py's gain came without touching Rust.

What gets faster (and why)

aerospike-py: stage breakdown

Internal stage timings under load. The two GIL-bound stages collapse to near-zero.

Stage                     3.11 + GIL              3.14t       Change
────────────────────────────────────────────────────────────────────
spawn_blocking_delay      234 ms     ████████    0.12 ms      −99.95% 🔥
event_loop_resume_delay   39.7 ms    ██           ≈ 0         ≈ −100%
io (Aerospike network)    7.51 ms    ▌           1.27 ms      −83%
merge_as_dict             4.48 ms    ▎           3.54 ms      −21%
key_parse                 967 μs     ·           1.06 ms      +10% (noise)
tokio_schedule_delay      83.1 μs                49.5 μs      −40%
limiter_wait              3.56 μs                0.96 μs      −73%

Two stages dominate the gain:

spawn_blocking_delay drops from 234 ms to 0.12 ms. Under GIL, when a Rust async future completes and needs to convert results into Py<...> objects, it dispatches that work to a spawn_blocking worker. That worker has to acquire the GIL — and under contention from the asyncio event loop and other workers, the queue stretches into hundreds of milliseconds. With no GIL, the worker runs immediately.
event_loop_resume_delay drops to effectively zero. Under GIL, after the future resolves and the event loop is woken up, the coroutine still has to wait its turn for the GIL before resuming. Free-threaded mode lets multiple coroutines resume in parallel.

io shrinking roughly 6× (7.51 → 1.27 ms) is a second-order effect: Tokio workers can now parse Aerospike protocol responses without contending with Python code for the GIL.
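The queueing effect described above can be observed with a pure-Python analogue. This is a minimal sketch, not the aerospike-py code path: `cpu_bound` and `timed_run` are hypothetical helpers, and the thread pool stands in for workers that all need the interpreter at the same time. On a GIL build the four-thread run should take roughly four times as long as the one-thread run; on a free-threaded build with spare cores it should stay much closer to the one-thread time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound(n: int) -> int:
    """Pure-Python work that holds the interpreter for its whole duration."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_run(workers: int, n: int) -> tuple[float, list[int]]:
    """Run `workers` copies of cpu_bound(n) in threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(cpu_bound, [n] * workers))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    n = 300_000  # illustrative size, small enough to finish quickly
    one, _ = timed_run(1, n)
    four, _ = timed_run(4, n)
    # A ratio near 4 means the threads were serialized; near 1 means they ran in parallel.
    print(f"1 thread: {one:.3f}s  4 threads: {four:.3f}s  ratio: {four / one:.2f}x")
```

Running the same script under 3.11 and 3.14t gives a quick sanity check of the runtime before attributing any latency change to the client.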

official C client: even bigger ratio

The official aerospike client is a synchronous C extension; applications wrap it with loop.run_in_executor(ThreadPoolExecutor, ...). Each request goes through four steps:

1. Thread-pool worker dispatch
2. C extension acquires GIL to parse arguments     ← serialized under GIL
3. Network call (releases GIL)
4. C extension reacquires GIL to build Python dict ← serialized under GIL

Under GIL contention, steps 2 and 4 serialize across all in-flight requests.

Removing the GIL parallelizes them. The official client's p95 falls 60% (324 → 128 ms), a larger drop than aerospike-py's 49% (189 → 97 ms) in both relative and absolute terms, because it had more to lose.
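The wrapping pattern described for the synchronous client can be sketched as follows. This is an illustration under stated assumptions: `blocking_get` is a hypothetical stand-in that sleeps instead of calling a real client, and `get_async` and `main` are names invented for this example. Only `loop.run_in_executor` and `ThreadPoolExecutor` come from the text above.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_get(key: str) -> dict:
    """Hypothetical synchronous read; sleep stands in for the network call."""
    time.sleep(0.01)
    return {"key": key, "bins": {"value": len(key)}}


async def get_async(pool: ThreadPoolExecutor, key: str) -> dict:
    """Hand the blocking call to a worker thread and await its result."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, blocking_get, key)


async def main(keys: list[str]) -> list[dict]:
    # Each request pays one thread-pool hop before any client code runs.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return await asyncio.gather(*(get_async(pool, k) for k in keys))


if __name__ == "__main__":
    for record in asyncio.run(main(["user:1", "user:22"])):
        print(record)
```

In a real handler the argument parsing and dict construction inside the C extension are the parts that contend for the GIL; here they are collapsed into the stand-in function.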

Same-workload comparison on 3.14t

On 3.14t, when both clients hit the same server load (alternating endpoints in the same pod, k6 10 VUs):

Client	p95 (single mode)	Difference
aerospike-py	126 ms	baseline
official aerospike	128 ms	+2 ms (~1.5%, noise)

Under 3.11 + GIL, aerospike-py's p95 was 42% lower than the official client's (189 vs 324 ms). Under 3.14t the gap collapses to ~2 ms.

This is the cleanest evidence that GIL contention — not architectural difference — was responsible for the bulk of the original gap.

Throughput (TPS)

Latency improvements translate directly into throughput. Two TPS views — k6 client iterations/s and server-side predict_requests_total rate — agree.

k6 iterations/s (full 5m 30s run)

3.11 + GIL, stage OFF             ████████████        41.6  baseline
3.11 + GIL, stage ON              █████████████       44.1  +6% (noise)
3.14t, aerospike-py only          ███████████████████ 61.2  +47% 🔥
3.14t, both clients (split load)  ██████████████      47.3  +14%

Config	iterations/s	http_reqs/s
3.11 + GIL, stage OFF	41.6	50.8
3.11 + GIL, stage ON	44.1	52.9
3.14t, aerospike-py only	61.2	80.0
3.14t, both clients (split load)	47.3	59.8

Server-side req/s

Config	aerospike-py	official aerospike
3.11 + GIL	40.9 req/s	~24 req/s¹
3.14t, aerospike-py only	42.5 req/s	n/a (503)²
3.14t, both clients	32.9 req/s	34.4 req/s

¹ 3.11 + GIL warmup window inflates the official client's first sample; steady-state rate is comparable.
² The :314t image ships only aerospike-py; official endpoints return 503 (no cp314t PyPI wheel).

Why "both clients" drops to 47.3 iter/s

When the two endpoints are loaded simultaneously, the same Aerospike server and FastAPI pod CPU are split between clients, so per-client throughput drops (aerospike-py's server-side rate falls from 42.5 to 32.9 req/s). The 61.2 iter/s peak is the actual aerospike-py ceiling when it runs solo.

But aerospike-py still wins under real load

The "126 vs 128 ms" tie comes from a configuration where each client only sees ~5 effective VUs (10 VUs split across two endpoints). When the load is concentrated on one client, the gap reopens.

Solo-load comparison on 3.14t (each client owns the server)

Metric	aerospike-py solo	official solo	aerospike-py advantage
k6 single p95	97 ms	134 ms	−28%
k6 gather p95 (9× fan-out)	107 ms	253 ms	−58% 🔥
Server `predict_duration_seconds` p95	100 ms	138 ms	−28%
Server `batch_read_all` p95	64 ms	67 ms	−4%
Theoretical capacity (Little's Law: 10 VUs / p95)	~103 req/s	~75 req/s	+38%
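The theoretical-capacity row can be reproduced with a few lines of arithmetic. This is a sketch of the calculation only; `capacity_req_per_s` is a helper name chosen for this example. Little's Law is normally stated with mean latency, so plugging in p95 as the table does gives a conservative estimate rather than an exact ceiling. Computed from the unrounded quotients the advantage comes out near 38%, while dividing the rounded figures (103 / 75) gives 37%.

```python
def capacity_req_per_s(concurrency: int, latency_ms: float) -> float:
    """Little's Law rearranged: throughput = concurrency / time in system."""
    return concurrency / (latency_ms / 1000.0)


# Inputs taken from the solo-load table: 10 VUs, k6 single p95 per client.
aerospike_py = capacity_req_per_s(10, 97)
official = capacity_req_per_s(10, 134)
advantage_pct = (aerospike_py / official - 1) * 100

print(f"aerospike-py: {aerospike_py:.1f} req/s")
print(f"official:     {official:.1f} req/s")
print(f"advantage:    {advantage_pct:+.1f}%")
```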

k6 throughput numbers can mislead

The two solo runs used different k6 scripts. k6_benchmark_official_only.js sends exactly 1 request per iteration; k6_benchmark.js (used for aerospike-py solo) splits 10 VUs across 4 scenarios. Raw iterations/s is not apples-to-apples — but per-request latency is, because both runs used the same VU count and the same server.

Why aerospike-py still leads under solo load on 3.14t:

Native async vs threadpool wrapping. Even without GIL contention, run_in_executor adds a thread-pool hop per request. aerospike-py awaits directly on the asyncio loop.
Lazy dict conversion. batch_read() returns a BatchReadHandle (Arc-wrapped, ~10 μs). The Python-dict materialization happens lazily when the caller invokes .as_dict(). The official client builds the full dict on I/O completion.
Single FFI boundary crossing. The aerospike-py Rust code completes a full batch_read inside one PyO3 call. The C extension crosses the Python ↔ C boundary multiple times per call.

These advantages compound as concurrency rises (the gather number — 107 vs 253 ms — is the clearest example).
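The lazy-conversion idea from the list above can be illustrated in pure Python. This is a hypothetical analogue, not aerospike-py's implementation: `LazyBatch`, its `conversions` counter, and the caching behaviour are assumptions made for the example. Only the general shape (a cheap handle plus an explicit `.as_dict()` call) comes from the text.

```python
class LazyBatch:
    """Hypothetical handle that defers building Python dicts until asked."""

    def __init__(self, raw_records: list[tuple[str, dict]]) -> None:
        self._raw = raw_records   # cheap to hold: no per-record conversion yet
        self._cache = None        # filled on first as_dict() call (assumption)
        self.conversions = 0      # counts how often the expensive step ran

    def as_dict(self) -> dict:
        """Materialize the key -> bins mapping on first use, then reuse it."""
        if self._cache is None:
            self.conversions += 1
            self._cache = {key: dict(bins) for key, bins in self._raw}
        return self._cache


if __name__ == "__main__":
    batch = LazyBatch([("user:1", {"score": 0.7}), ("user:2", {"score": 0.2})])
    print("conversions before access:", batch.conversions)
    print(batch.as_dict())
    print("conversions after access:", batch.conversions)
```

The point of the pattern is that a caller who never inspects the records never pays for the conversion, and a caller who does pays for it on its own schedule rather than at I/O completion.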

Recommended migration path

1. Add a 3.14t row to the CI matrix. Run unit + integration tests under python:3.14.2t-slim.
2. Audit unsafe and shared mutable state in Rust. aerospike-py is mostly thread-safe but should be audited before declaring gil_used = false.
3. Promote to #[pymodule(gil_used = false)] once the audit is clean.
4. Wait for or build official client wheels. PyPI does not yet ship cp314t wheels for the official aerospike package. Source build works (see notes below).

Notes

Side-effect: inference also got faster. DLRM inference (PyTorch CPU, control variable) dropped from 43.5 ms → 20.7 ms (−52%) on 3.14t. Unrelated to aerospike-py — a free side benefit for any GIL-bound inference path running alongside async I/O.

GIL state verified. With Py_GIL_DISABLED=1, the interpreter did not re-enable the GIL after import aerospike_py. The Rust module currently declares #[pymodule(gil_used = true)], but the underlying code is already mostly thread-safe (ArcSwapOption for the client, Arc<Vec<BatchRecord>> for batch handles, Mutex for the metric registry). Promoting to gil_used = false after a full audit is a follow-up.
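A check of this kind can be scripted with the standard library. This is a minimal sketch: `free_threaded_build` and `gil_currently_enabled` are helper names chosen for this example, and the fallback for interpreters that lack `sys._is_gil_enabled` (added in Python 3.13) is an assumption that such interpreters always run with the GIL.

```python
import sys
import sysconfig


def free_threaded_build() -> bool:
    """True when the interpreter was compiled with Py_GIL_DISABLED."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_currently_enabled() -> bool:
    """True when the GIL is active right now in this process."""
    probe = getattr(sys, "_is_gil_enabled", None)
    # Older interpreters have no probe; assume the GIL is on.
    return True if probe is None else bool(probe())


if __name__ == "__main__":
    print("free-threaded build:", free_threaded_build())
    print("GIL enabled at startup:", gil_currently_enabled())
    # To repeat the check from the note, import the extension module here
    # (for example `import aerospike_py`, if installed) and call
    # gil_currently_enabled() again to see whether the import changed it.
```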

Build images. aerospike-benchmark:314t ships only aerospike-py (uses cp314t wheels from benchmark/deploy/wheels-314t/). aerospike-benchmark:314t-with-official adds the official C client built from source — required apt deps: build-essential libssl-dev libuv1-dev liblua5.1-0-dev libyaml-dev pkg-config zlib1g-dev (libyaml-dev is easy to miss). Build time ~10 min.
