Add Python 3.14 support via _Py_DebugOffsets #831

Draft
charles-dyfis-net wants to merge 1 commit into benfred:master from charles-dyfis-net:py314
@charles-dyfis-net

Instead of generating version-specific bindgen struct bindings for 3.14, read interpreter state using field offsets from CPython's _Py_DebugOffsets metadata table (embedded at the start of _PyRuntime). This table, present since 3.13, provides offsets for all core interpreter structs -- PyInterpreterState, PyThreadState, _PyInterpreterFrame, PyCodeObject, and common Python object types -- allowing a profiler to read them without compile-time knowledge of their layouts.
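As a rough illustration of what "reading by offset" means here, the sketch below decodes a pointer-sized field from a locally copied struct using an offset known only at runtime. The `FrameOffsets` struct and its field names are hypothetical stand-ins for values copied out of the target's `_Py_DebugOffsets` table; they are not the PR's actual types.

```rust
/// Hypothetical subset of offsets copied out of the target process's
/// `_Py_DebugOffsets` table. Field names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct FrameOffsets {
    pub size: usize,       // total bytes to copy for one frame struct
    pub previous: usize,   // offset of the "previous frame" pointer
    pub executable: usize, // offset of the code-object reference
}

/// Decode a native-endian, pointer-sized value at `offset` in a local copy
/// of a remote struct. Returns None if the read would leave the buffer.
pub fn read_usize_at(buf: &[u8], offset: usize) -> Option<usize> {
    const W: usize = std::mem::size_of::<usize>();
    let end = offset.checked_add(W)?;
    let bytes = buf.get(offset..end)?;
    let mut raw = [0u8; W];
    raw.copy_from_slice(bytes);
    Some(usize::from_ne_bytes(raw))
}
```

Because nothing in this code depends on a compiled-in struct layout, the same binary could in principle follow a layout change as long as the table in the target process reports the new offsets.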

The offset-based approach handles 3.14's key structural changes:

  • _PyStackRef tagged pointers replacing PyObject* in frame fields (extraction: bits & !1; null: bits == 1; tagged int: bits & 3 == 3)
  • FRAME_OWNED_BY_CSTACK renumbered from 3 to 4
  • Free-threaded PyObject header enlargement (16 -> 32 bytes)
  • Free-threaded PyASCIIObject.state.interned changed from a 2-bit bitfield to a full unsigned char (for atomic access), shifting kind/compact/ascii into the next byte
  • Thread-local bytecode (TLBC) in free-threaded builds, where instr_ptr may point into a per-thread copy rather than co_code_adaptive
  • Dict entry format change (dk_kind==0 now uses a 3-field KeyWithHash entry instead of the legacy PyDictKeyEntry)
  • Inline values for object attributes (Thread._name is stored inline rather than in a materialized __dict__)
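
The `_PyStackRef` bit rules and the `FRAME_OWNED_BY_CSTACK` renumbering listed above can be sketched as follows. The enum and function names are hypothetical, chosen for illustration; only the bit tests and the 3 vs. 4 values come from the description.

```rust
/// Decoded form of a 3.14 `_PyStackRef` word, following the bit rules
/// in the list above. Names are illustrative, not the PR's own types.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum StackRef {
    /// bits == 1
    Null,
    /// (bits & 3) == 3: a tagged integer, not a pointer
    TaggedInt,
    /// Anything else: the object address with the low tag bit cleared
    Object(u64),
}

pub fn decode_stackref(bits: u64) -> StackRef {
    if bits == 1 {
        StackRef::Null
    } else if (bits & 3) == 3 {
        StackRef::TaggedInt
    } else {
        StackRef::Object(bits & !1)
    }
}

/// Value of FRAME_OWNED_BY_CSTACK for a 3.x minor version.
/// Only 3.13 (3) and 3.14 (4) are covered by the description; treating
/// later minors like 3.14 is an assumption of this sketch.
pub fn frame_owned_by_cstack(minor: u32) -> u8 {
    if minor >= 14 { 4 } else { 3 }
}
```

Checking the null pattern before the tagged-int pattern keeps the three cases disjoint: `1 & 3` is `1`, so a null reference never matches the tagged-int test.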

Feature parity with the trait-based path:

  • Stack traces with filenames, function names, and line numbers
  • GIL detection (including free-threaded builds where GIL is disabled)
  • Thread names (via offset-based dict iteration and inline value reading)
  • Local variable inspection (--dump-locals), including dict, list, tuple, string, int, float, and numpy scalar formatting
  • Native stack merging, active/idle detection, short filenames

Python 3.13 also uses the offset-based path (validated via oracle comparison tests against the trait-based path on identical frozen process state).

Performance: frame and thread state structs are bulk-read in single process_vm_readv / mach_vm_read calls via StructBuf, matching the trait-based path's existing behavior of ~2 struct reads per frame (frame + code object), plus string/linetable reads that both paths require.
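
To make the read-count claim concrete, here is a minimal sketch of the bulk-read idea. `FakeProcess` is an in-memory stand-in for a reader that would wrap `process_vm_readv` or `mach_vm_read`, and this `StructBuf` is a hypothetical analogue of the PR's type of the same name, not its actual definition.

```rust
use std::cell::Cell;

/// In-memory stand-in for a remote-process reader. It counts how many
/// reads were issued so the cost of each access pattern is visible.
pub struct FakeProcess {
    pub base: usize,
    pub memory: Vec<u8>,
    pub reads: Cell<usize>,
}

impl FakeProcess {
    pub fn read(&self, addr: usize, len: usize) -> Option<Vec<u8>> {
        self.reads.set(self.reads.get() + 1);
        let start = addr.checked_sub(self.base)?;
        let end = start.checked_add(len)?;
        self.memory.get(start..end).map(|s| s.to_vec())
    }
}

/// One bulk copy of a remote struct; fields are then decoded locally
/// by offset without touching the target process again.
pub struct StructBuf {
    bytes: Vec<u8>,
}

impl StructBuf {
    pub fn fetch(process: &FakeProcess, addr: usize, size: usize) -> Option<Self> {
        Some(StructBuf { bytes: process.read(addr, size)? })
    }

    pub fn u64_at(&self, offset: usize) -> Option<u64> {
        let end = offset.checked_add(8)?;
        let mut raw = [0u8; 8];
        raw.copy_from_slice(self.bytes.get(offset..end)?);
        Some(u64::from_ne_bytes(raw))
    }
}
```

With this shape, decoding any number of fields from one frame costs a single remote read, which is what keeps the offset-based path at roughly the same read count per frame as the trait-based path.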

The existing trait-based code path is unchanged for Python <=3.12. The 3.13 bindgen types (v3_13_0) are reused in a few places where _Py_DebugOffsets doesn't yet provide sufficient information:

  • PyDictKeysObject layout (no PyObject header, stable across builds)
  • ht_cached_keys offset in PyHeapTypeObject (derived from the 3.13 bindgen offset, adjusted by the PyObject header size delta for free-threaded builds)
  • tp_dictoffset within PyTypeObject (discovered at runtime by scanning a module type for a known value)

These workarounds are documented and will be unnecessary if python/cpython#146462 is resolved (adding tp_dictoffset, tp_basicsize, and ht_cached_keys to _Py_DebugOffsets).
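
The two derived offsets described above might be computed along these lines. This is a sketch under assumptions: the function names are hypothetical, and the PR's actual scanning logic is not shown in this conversation, so the scan below only illustrates the general idea of locating a field by looking for a word with a known value.

```rust
/// Shift an offset taken from default-build bindings by the growth of the
/// `PyObject` header (for example 16 -> 32 bytes in a free-threaded build).
/// Assumes the field lies after the header, so the result cannot underflow.
pub fn adjust_for_header(
    default_offset: usize,
    default_header: usize,
    actual_header: usize,
) -> usize {
    default_offset + actual_header - default_header
}

/// Return every pointer-aligned offset in a local copy of a type object
/// whose signed word equals `expected`. A caller would likely accept the
/// result only when exactly one candidate is found.
pub fn find_word_offsets(type_bytes: &[u8], expected: isize) -> Vec<usize> {
    const W: usize = std::mem::size_of::<isize>();
    let mut hits = Vec::new();
    for (index, chunk) in type_bytes.chunks_exact(W).enumerate() {
        let mut raw = [0u8; W];
        raw.copy_from_slice(chunk);
        if isize::from_ne_bytes(raw) == expected {
            hits.push(index * W);
        }
    }
    hits
}
```

Returning all candidates rather than the first one leaves the ambiguity decision to the caller, which matters because an unrelated field can hold the same value by coincidence.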

Known limitations:

  • 32-bit platforms: inline value reading (ht_cached_keys) is not supported; thread names will be unavailable for 3.14 on 32-bit.
  • The _dictvalues header size (4 bytes padded to pointer alignment) and MANAGED_DICT_OFFSET (-1 or -3 * ptr_size depending on Py_GIL_DISABLED) are derived from CPython internals not exposed via _Py_DebugOffsets.
  • PyCompactUnicodeObject size is derived as asciiobject_size + 2 * ptr_size (for the utf8_length and utf8 fields); _Py_DebugOffsets provides asciiobject_size but not PyCompactUnicodeObject size directly.
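
The derived constants in the last two items reduce to simple arithmetic, sketched below with hypothetical function names. One assumption to flag: the description only says the multiplier is -1 or -3 depending on `Py_GIL_DISABLED`; mapping -1 to free-threaded builds is my reading of CPython's internal headers and should be verified against the target version.

```rust
/// Round `n` up to the next multiple of `align` (align must be non-zero).
pub fn align_up(n: usize, align: usize) -> usize {
    (n + align - 1) / align * align
}

/// `_dictvalues` header: 4 bytes, padded to pointer alignment.
pub fn dictvalues_header_size(ptr_size: usize) -> usize {
    align_up(4, ptr_size)
}

/// MANAGED_DICT_OFFSET in bytes, relative to the object address.
/// ASSUMPTION: -1 slot when the GIL is disabled, -3 otherwise.
pub fn managed_dict_offset(ptr_size: usize, gil_disabled: bool) -> isize {
    let slots: isize = if gil_disabled { -1 } else { -3 };
    slots * ptr_size as isize
}

/// PyCompactUnicodeObject size: the ASCII object plus the `utf8_length`
/// and `utf8` fields, each one pointer wide.
pub fn compact_unicode_size(asciiobject_size: usize, ptr_size: usize) -> usize {
    asciiobject_size + 2 * ptr_size
}
```

Taking `ptr_size` as a parameter instead of using the profiler's own pointer width keeps the arithmetic correct if the profiler and the target ever differ in word size.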

@charles-dyfis-net
Author

Adding Python 3.15 support should be straightforward after python/cpython#146462 lands. (Granted, Python 3.15 has its own high-performance sampling profiler -- but there are some features, such as native frame unwinding, that I don't know it to target, and a single Rust library that can be used to build tools monitoring a wide range of Python interpreter releases has value all its own.)

@charles-dyfis-net
Author

charles-dyfis-net commented Mar 28, 2026

BTW, I quite appreciate the prior work done in #819 by @czardoz. The value of going the route used here is in futureproofing and avoiding the need for tens of thousands of lines of generated code to be added for every new interpreter release.

Keeping this in draft for now pending further testing -- it's been developed and smoke-tested on aarch64 (in a tree that also has e04be07 applied); I'll want to spend some time with Intel platforms before calling it good.

@santagada

santagada commented Apr 14, 2026

For what it's worth, I built this change, and on Windows with Python 3.14 it never finds any stacks.

target\release\py-spy.exe record --pid 35860 -o test
py-spy> Sampling process 100 times a second. Press Control-C to exit.


py-spy> Stopped sampling because Control-C pressed
[2026-04-14T08:37:29.787074900Z ERROR inferno::flamegraph] No stack counts found
Error: Failed to write flamegraph: No stack counts found

@charles-dyfis-net
Author

Interesting. Windows is not a platform I generally target -- will need to make an effort to get set up for development there.

@charles-dyfis-net
Author

Since there's upstream support for 3.14, the current plan is to wait for CPython to merge the extra requested offsets into the 3.15 tree, and to target only that.
