Version: 0.10.6

Free-Threaded Python (3.14t)

Python 3.14t is the free-threaded build of CPython — the GIL is disabled (Py_GIL_DISABLED=1). This page reports what changes for aerospike-py and the official aerospike C client when the runtime is swapped, with no Rust or C source changes.

Source: benchmark/results/python-3.14t-benchmark.md and benchmark/results/k6-runtime-client-comparison.md.

TL;DR

Client	p95 (3.11 + GIL)	p95 (3.14t)	Improvement
aerospike-py	189 ms	97 ms	−49% 🔥
official aerospike (C client, source-built)	324 ms	128 ms	−60% 🔥

Throughput (aerospike-py): 41.6 → 61.2 iter/s (+47%) at 10 VUs single mode.

The GIL was a shared bottleneck for both clients. aerospike-py's gain came without touching Rust.
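The percentages in the TL;DR can be rechecked from the raw figures. The snippet below is a small sketch for doing so; `pct_change` and the `rows` mapping are illustrative names, not part of the benchmark tooling.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the TL;DR table and the throughput line above.
rows = {
    "aerospike-py p95 (ms)": (189, 97),
    "official aerospike p95 (ms)": (324, 128),
    "aerospike-py throughput (iter/s)": (41.6, 61.2),
}

for label, (before, after) in rows.items():
    print(f"{label}: {before} -> {after} ({pct_change(before, after):+.0f}%)")
```

Negative values mean lower latency; the positive value on the last row is the throughput gain.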

Why GIL Removal Helps Aerospike-py

The biggest internal stage costs in aerospike-py under 3.11 + GIL are not network I/O — they're stages where Rust async work waits for the Python interpreter:

Stage	3.11 + GIL avg	3.14t avg	Change
`spawn_blocking_delay`	234 ms	0.12 ms	−99.95% 🔥
`event_loop_resume_delay`	39.7 ms	≈ 0	≈ −100%
`io` (Aerospike network)	7.51 ms	1.27 ms	−83%
`merge_as_dict`	4.48 ms	3.54 ms	−21%
`key_parse`	967 μs	1.06 ms	+10% (noise)
`tokio_schedule_delay`	83.1 μs	49.5 μs	−40%
`limiter_wait`	3.56 μs	0.96 μs	−73%

Two stages dominate the gain:

spawn_blocking_delay drops from 234 ms to 0.12 ms. Under GIL, when a Rust async future completes and needs to convert results into Py<...> objects, it dispatches that work to a spawn_blocking worker. That worker has to acquire the GIL — and under contention from the asyncio event loop and other workers, the queue stretches into hundreds of milliseconds. With no GIL, the worker runs immediately.
event_loop_resume_delay drops to effectively zero. Under GIL, after the future resolves and the event loop is woken up, the coroutine still has to wait its turn for the GIL before resuming. Free-threaded mode lets multiple coroutines resume in parallel.

The roughly 6× drop in io (7.51 ms to 1.27 ms) is a second-order effect: Tokio workers can now parse Aerospike protocol responses without contending with Python code for the GIL.
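To get a feel for what a resume delay is, the gap between off-loop work finishing and the awaiting coroutine running again can be measured in pure Python. This is only an analogue of the event_loop_resume_delay stage described above: the real stage timers live inside aerospike-py's Rust code, and `blocking_work` and `measure_resume_delay` are hypothetical names introduced for this sketch.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_work() -> float:
    """Stand-in for off-loop work; returns the timestamp at which it finished."""
    time.sleep(0.01)  # pretend to do I/O or result conversion
    return time.perf_counter()


async def measure_resume_delay(executor: ThreadPoolExecutor) -> float:
    """Seconds between the worker finishing and this coroutine resuming."""
    loop = asyncio.get_running_loop()
    finished_at = await loop.run_in_executor(executor, blocking_work)
    return time.perf_counter() - finished_at


async def main(n: int = 20) -> list:
    with ThreadPoolExecutor(max_workers=8) as executor:
        tasks = (measure_resume_delay(executor) for _ in range(n))
        return await asyncio.gather(*tasks)


if __name__ == "__main__":
    delays = asyncio.run(main())
    print(f"max resume delay: {max(delays) * 1e3:.3f} ms")
```

Running the same script on a GIL build and on a free-threaded build, ideally with some CPU-bound Python threads in the background, should show the kind of difference the table reports, although the absolute numbers will not match the benchmark.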

Why GIL Removal Helps the Official C Client (Even More, in Ratio)

The official aerospike client is a synchronous C extension; applications wrap it with loop.run_in_executor(ThreadPoolExecutor, ...). Each request:

1. Hops onto a thread-pool worker
2. The C extension acquires the GIL to parse arguments
3. Runs the network call (releasing the GIL)
4. Reacquires the GIL to build the Python result

Under GIL contention, steps 2 and 4 serialize across all in-flight requests. Removing the GIL parallelizes them, which is why the official client's p95 falls 60% (324 → 128 ms), a larger relative drop than aerospike-py's 49%.
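The wrapping pattern described above can be sketched as follows. `FakeSyncClient` is a hypothetical stand-in so the example runs without a server; it is not the official aerospike API, and the comments only mark where each of the four steps would occur in a real C extension.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


class FakeSyncClient:
    """Hypothetical synchronous client used only for illustration."""

    def get(self, key: str) -> dict:
        # Step 2 would happen here: argument parsing under the GIL.
        time.sleep(0.005)  # Step 3: blocking network call (GIL released)
        # Step 4: build the Python result (GIL held again on a GIL build).
        return {"key": key, "bins": {"value": len(key)}}


async def async_get(loop, executor, client: FakeSyncClient, key: str) -> dict:
    # Step 1: hop onto a thread-pool worker so the event loop is not blocked.
    return await loop.run_in_executor(executor, client.get, key)


async def main() -> list:
    loop = asyncio.get_running_loop()
    client = FakeSyncClient()
    keys = [f"user:{i}" for i in range(8)]
    with ThreadPoolExecutor(max_workers=4) as executor:
        calls = (async_get(loop, executor, client, k) for k in keys)
        return await asyncio.gather(*calls)


if __name__ == "__main__":
    print(asyncio.run(main())[0])
```

Every request pays for one executor hop regardless of the runtime; what changes on 3.14t is that the steps needing the interpreter no longer queue behind each other.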

Same-Workload Comparison on 3.14t

On 3.14t, when both clients hit the same server load (alternating endpoints in the same pod, k6 10 VUs):

Client	p95 (single mode)	Difference
aerospike-py	126 ms	baseline
official aerospike	128 ms	+2 ms (~1.5%, noise)

The 42% gap that existed under 3.11 + GIL (189 ms vs 324 ms p95) collapses to ~2 ms under 3.14t.

This is the cleanest evidence that GIL contention — not architectural difference — was responsible for the bulk of the original gap.

Throughput (TPS)

Latency improvements translate directly into throughput. Two TPS views — k6 client iterations/s and server-side predict_requests_total rate — agree.

k6 iterations/s (full 5m 30s run)

Config	iterations/s	http_reqs/s	vs 3.11 baseline
3.11 + GIL, stage OFF	41.6	50.8	baseline
3.11 + GIL, stage ON	44.1	52.9	+6% (noise)
3.14t, aerospike-py only	61.2	80.0	+47% 🔥
3.14t, both clients (split load)	47.3	59.8	+14%

Server-side `predict_requests_total` rate (single mode)

Config	aerospike-py	official aerospike
3.11 + GIL	40.9 req/s	~24 req/s¹
3.14t, aerospike-py only	42.5 req/s	n/a (503)²
3.14t, both clients	32.9 req/s	34.4 req/s

¹ 3.11 + GIL warmup window inflates the official client's first sample; steady-state rate is comparable to aerospike-py.
² The :314t image ships only aerospike-py; official endpoints return 503 (no cp314t PyPI wheel).

Why "both clients" drops to 47.3 iter/s

When the two endpoints are loaded simultaneously, the same Aerospike server and FastAPI pod CPU are split between clients. Per-client throughput naturally halves, so combined iterations/s reflects shared server capacity rather than per-client ceiling. The 61.2 iter/s peak is the actual aerospike-py ceiling when it runs solo.

But Aerospike-py Still Wins Under Real Load

The "126 vs 128 ms" tie comes from a configuration where each client only sees ~5 effective VUs (10 VUs split across two endpoints). When the load is concentrated on one client, the gap reopens.

Solo-load comparison on 3.14t (each client owns the server)

Metric	aerospike-py solo	official solo	aerospike-py advantage
k6 single p95	97 ms	134 ms	−28%
k6 gather p95 (9× fan-out)	107 ms	253 ms	−58% 🔥
Server `predict_duration_seconds` p95	100 ms	138 ms	−28%
Server `batch_read_all` p95	64 ms	67 ms	−4%
Theoretical capacity (Little's Law: 10 VUs / p95)	~103 req/s	~75 req/s	+37%
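The theoretical-capacity row follows from Little's Law (concurrency = throughput × latency), rearranged to throughput = concurrency / latency. The sketch below reproduces it; `littles_law_capacity` is an illustrative helper, and plugging in p95 rather than mean latency makes the result a conservative estimate.

```python
def littles_law_capacity(concurrency: int, latency_ms: float) -> float:
    """Throughput in req/s implied by Little's Law: concurrency / latency."""
    return concurrency / (latency_ms / 1000.0)


VUS = 10
py_capacity = littles_law_capacity(VUS, 97)         # aerospike-py solo p95
official_capacity = littles_law_capacity(VUS, 134)  # official solo p95

print(f"aerospike-py: ~{py_capacity:.0f} req/s")
print(f"official:     ~{official_capacity:.0f} req/s")
print(f"advantage:    {py_capacity / official_capacity - 1:+.0%}")
```

Computed from the unrounded capacities the advantage comes out near +38%; the table's +37% corresponds to dividing the rounded figures (103 / 75).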

k6 throughput numbers can mislead

The two solo runs used different k6 scripts. k6_benchmark_official_only.js sends exactly 1 request per iteration; k6_benchmark.js (used for aerospike-py solo) splits 10 VUs across 4 scenarios. Raw iterations/s is not apples-to-apples — but per-request latency is, because both runs used the same VU count and the same server.

Why aerospike-py still leads under solo load on 3.14t:

Native async vs threadpool wrapping. Even without GIL contention, run_in_executor adds a thread-pool hop per request. aerospike-py awaits directly on the asyncio loop.
Lazy dict conversion. batch_read() returns a BatchReadHandle (Arc-wrapped, ~10 μs). The Python-dict materialization happens lazily when the caller invokes .as_dict(), avoiding eager work. The official client builds the full dict on I/O completion.
Single FFI boundary crossing. The aerospike-py Rust code completes a full batch_read inside one PyO3 call. The C extension crosses the Python ↔ C boundary multiple times per call.

These advantages compound as concurrency rises (the gather number — 107 vs 253 ms — is the clearest example).
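The lazy-conversion idea can be illustrated with a pure-Python analogue. `LazyBatchHandle` below is a hypothetical class written for this example; it is not aerospike-py's BatchReadHandle, and caching the materialized dict is an assumption of the sketch rather than documented behavior.

```python
from typing import Optional


class LazyBatchHandle:
    """Hypothetical analogue of a lazily materialized batch result."""

    def __init__(self, raw_records: list):
        self._raw = raw_records  # cheap: only a reference is stored
        self._cache: Optional[dict] = None
        self.materializations = 0  # how many times the dict was built

    def as_dict(self) -> dict:
        # The per-record conversion cost is paid here, on demand,
        # instead of at I/O completion.
        if self._cache is None:
            self._cache = {key: dict(bins) for key, bins in self._raw}
            self.materializations += 1
        return self._cache


handle = LazyBatchHandle([("k1", {"a": 1}), ("k2", {"a": 2})])
print(handle.materializations)  # nothing has been converted yet
print(handle.as_dict()["k2"])
```

A caller that only forwards the handle, or that never reads the records, skips the conversion entirely; an eager client pays for it on every call.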

Recommended Migration Path

1. Add a 3.14t row to the CI matrix. Run unit + integration tests under python:3.14.2t-slim.
2. Audit unsafe and shared mutable state in Rust. aerospike-py is mostly thread-safe but should be audited before declaring gil_used = false.
3. Promote to #[pymodule(gil_used = false)] once the audit is clean.
4. Wait for or build official client wheels. PyPI does not yet ship cp314t wheels for the official aerospike package. Source build works (see notes below).

Notes

Side-effect: inference also got faster. DLRM inference (PyTorch CPU, control variable) dropped from 43.5 ms → 20.7 ms (−52%) on 3.14t. Unrelated to aerospike-py — a free side benefit for any GIL-bound inference path running alongside async I/O.

GIL state verified. With Py_GIL_DISABLED=1, the interpreter did not re-enable GIL after import aerospike_py — the Rust module currently declares #[pymodule(gil_used = true)] but the underlying code is already mostly thread-safe (ArcSwapOption for client, Arc<Vec<BatchRecord>> for batch handles, Mutex for metric registry). Promoting to gil_used = false after a full audit is a follow-up.
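A check along these lines can be scripted. The sketch below reports the build flag and the runtime GIL state after importing a module; `gil_report` is an illustrative helper, and the `getattr` fallback is there because `sys._is_gil_enabled()` only exists on CPython 3.13 and later.

```python
import importlib
import sys
import sysconfig


def gil_report(module_name: str) -> dict:
    """Import `module_name`, then report the build flag and runtime GIL state."""
    importlib.import_module(module_name)
    # Older interpreters have no such function and always run with a GIL.
    is_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
    return {
        "free_threaded_build": bool(sysconfig.get_config_var("Py_GIL_DISABLED")),
        "gil_enabled_after_import": bool(is_enabled),
    }


if __name__ == "__main__":
    # "json" is only a placeholder; pass "aerospike_py" where it is installed.
    print(gil_report("json"))
```

On a free-threaded build where the GIL stays off, the expected report is `free_threaded_build` true and `gil_enabled_after_import` false.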

Build images. aerospike-benchmark:314t ships only aerospike-py (uses cp314t wheels from benchmark/deploy/wheels-314t/). aerospike-benchmark:314t-with-official adds the official C client built from source — required apt deps: build-essential libssl-dev libuv1-dev liblua5.1-0-dev libyaml-dev pkg-config zlib1g-dev (libyaml-dev is easy to miss). Build time ~10 min.
