Add Python 3.14 support via _Py_DebugOffsets #831
Instead of generating version-specific bindgen struct bindings for 3.14, read interpreter state using field offsets from CPython's `_Py_DebugOffsets` metadata table (embedded at the start of `_PyRuntime`). This table, present since 3.13, provides offsets for all core interpreter structs -- `PyInterpreterState`, `PyThreadState`, `_PyInterpreterFrame`, `PyCodeObject`, and common Python object types -- allowing a profiler to read them without compile-time knowledge of their layouts.

The offset-based approach handles 3.14's key structural changes:

- `_PyStackRef` tagged pointers replacing `PyObject*` in frame fields (extraction: `bits & !1`; null: `bits == 1`; tagged int: `bits & 3 == 3`)
- `FRAME_OWNED_BY_CSTACK` renumbered from 3 to 4
- Free-threaded `PyObject` header enlargement (16 -> 32 bytes)
- Free-threaded `PyASCIIObject.state.interned` changed from a 2-bit bitfield to a full `unsigned char` (for atomic access), shifting kind/compact/ascii into the next byte
- Thread-local bytecode (TLBC) in free-threaded builds, where `instr_ptr` may point into a per-thread copy rather than `co_code_adaptive`
- Dict entry format change (`dk_kind==0` now uses a 3-field `KeyWithHash` entry instead of the legacy `PyDictKeyEntry`)
- Inline values for object attributes (`Thread._name` is stored inline rather than in a materialized `__dict__`)

Feature parity with the trait-based path:

- Stack traces with filenames, function names, and line numbers
- GIL detection (including free-threaded builds where GIL is disabled)
- Thread names (via offset-based dict iteration and inline value reading)
- Local variable inspection (`--dump-locals`), including dict, list, tuple, string, int, float, and numpy scalar formatting
- Native stack merging, active/idle detection, short filenames

Python 3.13 also uses the offset-based path (validated via oracle comparison tests against the trait-based path on identical frozen process state).
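
To make the `_PyStackRef` tag rules above concrete, here is a minimal sketch of decoding one frame field after it has been bulk-read from the target process. It is not the code in this PR: `StackRef`, `decode_stackref`, and `read_word_at` are hypothetical names, the offset is assumed to come from the `_Py_DebugOffsets` table at runtime, and the sketch assumes a 64-bit little-endian target.

```rust
/// Decoded form of one `_PyStackRef` word (hypothetical type, for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum StackRef {
    /// `bits == 1`: the null reference.
    Null,
    /// `bits & 3 == 3`: a tagged int; the raw bits are kept, the payload is not interpreted.
    TaggedInt(u64),
    /// Anything else: a `PyObject*` with the low tag bit cleared.
    Object(u64),
}

/// Apply the tag rules from the list above to a raw 64-bit word.
pub fn decode_stackref(bits: u64) -> StackRef {
    if bits == 1 {
        StackRef::Null
    } else if bits & 3 == 3 {
        StackRef::TaggedInt(bits)
    } else {
        StackRef::Object(bits & !1)
    }
}

/// Read a pointer-sized word at a runtime-supplied offset from a struct
/// that was already copied out of the target process.
/// Assumption: 64-bit little-endian target. Returns `None` if out of bounds.
pub fn read_word_at(buf: &[u8], offset: usize) -> Option<u64> {
    let end = offset.checked_add(8)?;
    let bytes = buf.get(offset..end)?;
    let mut word = [0u8; 8];
    word.copy_from_slice(bytes);
    Some(u64::from_le_bytes(word))
}
```

A caller would combine the two, for example `read_word_at(&frame_bytes, executable_offset).map(decode_stackref)`, where `frame_bytes` and `executable_offset` are placeholders for the bulk-read frame buffer and an offset taken from the table.
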

Performance: frame and thread state structs are bulk-read in single `process_vm_readv` / `mach_vm_read` calls via `StructBuf`, matching the trait-based path's existing behavior of ~2 struct reads per frame (frame + code object), plus string/linetable reads that both paths require.

The existing trait-based code path is unchanged for Python <=3.12. The 3.13 bindgen types (`v3_13_0`) are reused in a few places where `_Py_DebugOffsets` doesn't yet provide sufficient information:

- `PyDictKeysObject` layout (no `PyObject` header, stable across builds)
- `ht_cached_keys` offset in `PyHeapTypeObject` (derived from the 3.13 bindgen offset, adjusted by the `PyObject` header size delta for free-threaded builds)
- `tp_dictoffset` within `PyTypeObject` (discovered at runtime by scanning a module type for a known value)

These workarounds are documented and will be unnecessary if python/cpython#146462 is resolved (adding `tp_dictoffset`, `tp_basicsize`, and `ht_cached_keys` to `_Py_DebugOffsets`).

Known limitations:

- 32-bit platforms: inline value reading (`ht_cached_keys`) is not supported; thread names will be unavailable for 3.14 on 32-bit.
- The `_dictvalues` header size (4 bytes padded to pointer alignment) and `MANAGED_DICT_OFFSET` (-1 or -3 * ptr_size depending on `Py_GIL_DISABLED`) are derived from CPython internals not exposed via `_Py_DebugOffsets`.
- `PyCompactUnicodeObject` size is derived as `asciiobject_size + 2 * ptr_size` (for the `utf8_length` and `utf8` fields); `_Py_DebugOffsets` provides `asciiobject_size` but not `PyCompactUnicodeObject` size directly.
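
The derived values in the limitations above amount to a few lines of arithmetic. The sketch below is illustrative only: `DerivedLayout` and `derive_layout` are hypothetical names, not identifiers from this PR. One assumption goes beyond the text above: it maps `-1 * ptr_size` to free-threaded builds and `-3 * ptr_size` to default builds, which is my reading of CPython's internal headers and should be checked against them.

```rust
/// Layout values that `_Py_DebugOffsets` does not provide directly
/// (hypothetical struct, for illustration).
#[derive(Debug, PartialEq, Eq)]
pub struct DerivedLayout {
    /// `_dictvalues` header: 4 bytes padded to pointer alignment.
    pub dictvalues_header_size: usize,
    /// Byte offset of the managed dict slot relative to the object pointer.
    pub managed_dict_offset: isize,
    /// `asciiobject_size` plus the `utf8_length` and `utf8` fields.
    pub compact_unicode_size: usize,
}

/// Round `n` up to the next multiple of `align` (`align` must be non-zero).
fn align_up(n: usize, align: usize) -> usize {
    (n + align - 1) / align * align
}

/// Derive the values from the target's pointer size, build flavor, and the
/// `asciiobject_size` reported by the offsets table.
pub fn derive_layout(ptr_size: usize, gil_disabled: bool, asciiobject_size: usize) -> DerivedLayout {
    // Assumption: -1 slot when Py_GIL_DISABLED is set, -3 slots otherwise.
    let slots: isize = if gil_disabled { -1 } else { -3 };
    DerivedLayout {
        dictvalues_header_size: align_up(4, ptr_size),
        managed_dict_offset: slots * ptr_size as isize,
        compact_unicode_size: asciiobject_size + 2 * ptr_size,
    }
}
```

With an illustrative `asciiobject_size` of 40 on a 64-bit default build, `derive_layout(8, false, 40)` gives a header size of 8, a managed dict offset of -24, and a compact unicode size of 56.
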

Adding Python 3.15 support should be straightforward after python/cpython#146462 lands. (Granted, Python 3.15 has its own high-performance sampling profiler -- but there are some features, such as native frame unwinding, that I don't know it to target, and having a single Rust library that can be used to build tools that monitor a wide range of Python interpreter releases has value all its own.)

BTW, I quite appreciate the prior work done in #819 by @czardoz. The value of going the route used here is in future-proofing and avoiding the need for tens of thousands of lines of generated code to be added for every new interpreter release. Keeping this in draft for now pending further testing -- it's been developed and smoke-tested on aarch64 (in a tree also having e04be07 applied); I'll want to spend some time with Intel platforms before calling it good.

For what it's worth, I built this change, and on Windows with Python 3.14 it never finds any stacks.

Interesting. Windows is not a platform I generally target -- will need to make an effort to get set up for development there.

Since there's upstream support for 3.14, the current plan is to wait for CPython to merge the extra requested offsets into the 3.15 tree, and target that only.