
gh-115999: Implement thread-local bytecode and enable specialization for BINARY_OP #123926

Open
mpage wants to merge 80 commits into main from gh-115999-thread-local-bytecode
Changes from 71 commits
Commits (80)
776a1e1
Assign threads indices into bytecode copies
mpage Aug 15, 2024
2b40870
Replace most usage of PyCode_CODE
mpage Aug 27, 2024
344d7ad
Get bytecode copying working
mpage Aug 20, 2024
f203d00
Refactor remove_tools
mpage Aug 30, 2024
82b456a
Refactor remove_line_tools
mpage Aug 30, 2024
b021704
Instrument thread-local bytecode
mpage Sep 1, 2024
aea69c5
Use locks for instrumentation
mpage Sep 3, 2024
552277d
Add ifdef guards for each specialization family
mpage Sep 3, 2024
50a6089
Specialize BINARY_OP
mpage Sep 4, 2024
3f1d941
Limit the amount of memory consumed by bytecode copies
mpage Sep 6, 2024
7d2eb27
Make thread-local bytecode limits user configurable
mpage Sep 7, 2024
d5476b9
Fix a few data races when (de)instrumenting opcodes
mpage Sep 8, 2024
e3b367a
Make branch taken recording thread-safe
mpage Sep 8, 2024
b2375bf
Lock thread-local bytecode when specializing
mpage Sep 9, 2024
2707f8e
Load bytecode on RESUME_CHECK
mpage Sep 9, 2024
3fdcb28
Load tlbc on generator.throw()
mpage Sep 9, 2024
4a55ce5
Use tlbc instead of thread_local_bytecode
mpage Sep 9, 2024
8b3ff60
Use tlbc everywhere
mpage Sep 9, 2024
862afa1
Explicitly manage tlbc state
mpage Sep 9, 2024
0b4d952
Refactor API for fetching tlbc
mpage Sep 9, 2024
7795e99
Add unit tests
mpage Sep 10, 2024
693a4cc
Fix initconfig in default build
mpage Sep 10, 2024
b43531e
Fix instrumentation in default build
mpage Sep 10, 2024
9025f43
Synchronize bytecode modifications between specialization and instrum…
mpage Sep 10, 2024
c44c7d9
Add a high-level comment
mpage Sep 10, 2024
e2a6656
Fix unused variable warning in default build
mpage Sep 10, 2024
e6513d1
Fix test_config in free-threaded builds
mpage Sep 10, 2024
a18396f
Fix formatting
mpage Sep 10, 2024
81fe1a2
Remove comment
mpage Sep 10, 2024
837645e
Fix data race in _PyInstruction_GetLength
mpage Sep 10, 2024
f13e132
Fix tier2 optimizer
mpage Sep 11, 2024
942f628
Use __VA_ARGS__ for macros
mpage Sep 11, 2024
66cb24d
Update vcxproj files to include newly added files
mpage Sep 11, 2024
ad12bd4
Mark unused params
mpage Sep 11, 2024
1bbbbbc
Keep tier2 and the JIT disabled in free-threaded builds
mpage Sep 12, 2024
e63e403
Only allow enabling/disabling tlbc
mpage Sep 13, 2024
8b97771
Update libpython for gdb
mpage Sep 13, 2024
d34adeb
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Sep 13, 2024
6d4fe73
Handle out of memory errors
mpage Sep 13, 2024
c2d8693
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Sep 17, 2024
b104782
Fix warnings on windows
mpage Sep 17, 2024
deb5216
Fix another warning
mpage Sep 18, 2024
2f11cc7
Ugh actually fix it
mpage Sep 18, 2024
04f1ac3
Add high-level comment about index pools
mpage Sep 25, 2024
aa330b1
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Sep 25, 2024
7dfd1ca
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Sep 26, 2024
7c9da24
Exclude tlbc from refleak counts
mpage Sep 27, 2024
dd144d0
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Sep 28, 2024
ad180d1
Regen files
mpage Sep 28, 2024
95d2264
Move `get_tlbc_blocks` into the sys module
mpage Sep 30, 2024
b6380de
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Sep 30, 2024
adb59ef
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Oct 5, 2024
39c947d
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Oct 10, 2024
2cc5830
Work around `this_instr` now being const
mpage Oct 11, 2024
96ec126
Make RESUME_CHECK cheaper
mpage Oct 11, 2024
5ecebd9
Pass tstate to _PyCode_GetTLBCFast
mpage Oct 11, 2024
815b2fe
Rename test_tlbc.py to test_thread_local_bytecode.py
mpage Oct 11, 2024
fb90d23
Remove per-family defines for specialization
mpage Oct 11, 2024
4e42414
Replace bytecode pointer with tlbc_index
mpage Oct 13, 2024
814e4ca
Add a test verifying that we clean up tlbc when the code object is de…
mpage Oct 14, 2024
ba3930a
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Oct 14, 2024
cb8a774
Fix indentation
mpage Oct 14, 2024
0f8a55b
Clarify comment
mpage Oct 14, 2024
70ce0fe
Fix TSAN
mpage Oct 14, 2024
f512353
Add test for cleaning up tlbc in correct place, not old emacs buffer
mpage Oct 14, 2024
4be2b1f
Remove test_tlbc.py
mpage Oct 14, 2024
61c7aa9
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Oct 17, 2024
ab6222c
Use int32_t instead of Py_ssize_t for tlbc indices
mpage Oct 17, 2024
6bbb220
Use _PyCode_CODE instead of PyFrame_GetBytecode in super_init_without…
mpage Oct 17, 2024
4580e3c
Update comment
mpage Oct 17, 2024
b992f44
Consolidate _PyCode_{Quicken,DisableSpecialization} into _PyCode_Init…
mpage Oct 17, 2024
4c040d3
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Oct 18, 2024
5b7658c
Fix incorrect types
mpage Oct 18, 2024
bec5bce
Add command-line tests for enabling TLBC
mpage Oct 18, 2024
c9054b7
Update libpython.py for tlbc_index
mpage Oct 18, 2024
1a48ab2
Avoid special casing in _PyEval_GetExecutableCode
mpage Oct 19, 2024
b16ae5f
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Oct 19, 2024
176b24e
Merge branch 'main' into gh-115999-thread-local-bytecode
mpage Oct 23, 2024
c107495
Clear TLBC when other caches are cleared
mpage Oct 23, 2024
07f9140
Remove _get_tlbc_blocks
mpage Oct 24, 2024
19 changes: 19 additions & 0 deletions Include/cpython/code.h
@@ -72,6 +72,24 @@ typedef struct {
uint8_t *per_instruction_tools;
} _PyCoMonitoringData;

#ifdef Py_GIL_DISABLED

/* Each thread specializes a thread-local copy of the bytecode in free-threaded
* builds. These copies are stored on the code object in a `_PyCodeArray`. The
* first entry in the array always points to the "main" copy of the bytecode
* that is stored at the end of the code object.
*/
typedef struct {
Py_ssize_t size;
char *entries[1];
} _PyCodeArray;

#define _PyCode_DEF_THREAD_LOCAL_BYTECODE() \
_PyCodeArray *co_tlbc;
#else
#define _PyCode_DEF_THREAD_LOCAL_BYTECODE()
#endif

// To avoid repeating ourselves in deepfreeze.py, all PyCodeObject members are
// defined in this macro:
#define _PyCode_DEF(SIZE) { \
@@ -138,6 +156,7 @@ typedef struct {
Type is a void* to keep the format private in codeobject.c to force \
people to go through the proper APIs. */ \
void *co_extra; \
_PyCode_DEF_THREAD_LOCAL_BYTECODE() \
char co_code_adaptive[(SIZE)]; \
}

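The comment above spells out the contract for `_PyCodeArray`: entry 0 always refers to the main copy of the bytecode stored at the end of the code object, and per-thread copies fill later slots. A minimal sketch of how entry 0 could be wired up (illustrative only; `example_create_tlbc_array` is a made-up name, not the allocation routine this PR adds):

static _PyCodeArray *
example_create_tlbc_array(PyCodeObject *co)
{
    // sizeof(_PyCodeArray) already includes room for entries[0]
    _PyCodeArray *arr = PyMem_Calloc(1, sizeof(_PyCodeArray));
    if (arr == NULL) {
        return NULL;
    }
    arr->size = 1;
    // Entry 0 aliases the "main" bytecode at the end of the code object.
    arr->entries[0] = (char *)co->co_code_adaptive;
    return arr;
}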
1 change: 1 addition & 0 deletions Include/cpython/initconfig.h
@@ -183,6 +183,7 @@ typedef struct PyConfig {
int cpu_count;
#ifdef Py_GIL_DISABLED
int enable_gil;
int tlbc_enabled;
#endif

/* --- Path configuration inputs ------------ */
15 changes: 15 additions & 0 deletions Include/internal/pycore_ceval.h
@@ -177,6 +177,21 @@ _PyEval_IsGILEnabled(PyThreadState *tstate)
extern int _PyEval_EnableGILTransient(PyThreadState *tstate);
extern int _PyEval_EnableGILPermanent(PyThreadState *tstate);
extern int _PyEval_DisableGIL(PyThreadState *state);


static inline _Py_CODEUNIT *
_PyEval_GetExecutableCode(PyThreadState *tstate, PyCodeObject *co)
{
_Py_CODEUNIT *bc = _PyCode_GetTLBCFast(tstate, co);
if (bc != NULL) {
return bc;
}
if (!_PyInterpreterState_GET()->config.tlbc_enabled) {
return _PyCode_CODE(co);
}
return _PyCode_GetTLBC(co);
}

#endif

extern void _PyEval_DeactivateOpCache(void);
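`_PyEval_GetExecutableCode` above encodes the lookup order: the thread's existing copy wins, the shared bytecode is used when thread-local bytecode is disabled, and only then is a per-thread copy created. A hedged sketch of a caller that also records which tlbc index the result corresponds to (illustrative; the helper name and the pairing with an index are assumptions, not code from this PR):

static _Py_CODEUNIT *
example_get_bytecode_and_index(PyThreadState *tstate, PyCodeObject *co,
                               int32_t *index)
{
    if (!_PyInterpreterState_GET()->config.tlbc_enabled) {
        *index = 0;  // every thread shares the main copy
        return _PyCode_CODE(co);
    }
    _Py_CODEUNIT *bc = _PyCode_GetTLBCFast(tstate, co);
    if (bc == NULL) {
        bc = _PyCode_GetTLBC(co);  // may allocate; NULL on memory error
        if (bc == NULL) {
            return NULL;
        }
    }
    *index = ((_PyThreadStateImpl *)tstate)->tlbc_index;
    return bc;
}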
36 changes: 36 additions & 0 deletions Include/internal/pycore_code.h
@@ -11,6 +11,7 @@ extern "C" {
#include "pycore_stackref.h" // _PyStackRef
#include "pycore_lock.h" // PyMutex
#include "pycore_backoff.h" // _Py_BackoffCounter
#include "pycore_tstate.h" // _PyThreadStateImpl


/* Each instruction in a code object is a fixed-width value,
@@ -313,11 +314,17 @@ extern int _PyLineTable_PreviousAddressRange(PyCodeAddressRange *range);
/** API for executors */
extern void _PyCode_Clear_Executors(PyCodeObject *code);


#ifdef Py_GIL_DISABLED
// gh-115999 tracks progress on addressing this.
#define ENABLE_SPECIALIZATION 0
// Use this to enable specialization families once they are thread-safe. All
// uses will be replaced with ENABLE_SPECIALIZATION once all families are
// thread-safe.
#define ENABLE_SPECIALIZATION_FT 1
#else
#define ENABLE_SPECIALIZATION 1
#define ENABLE_SPECIALIZATION_FT ENABLE_SPECIALIZATION
#endif

/* Specialization functions */
@@ -600,6 +607,35 @@ struct _PyCode8 _PyCode_DEF(8);

PyAPI_DATA(const struct _PyCode8) _Py_InitCleanup;

#ifdef Py_GIL_DISABLED

// Return a pointer to the thread-local bytecode for the current thread, if it
// exists.
static inline _Py_CODEUNIT *
_PyCode_GetTLBCFast(PyThreadState *tstate, PyCodeObject *co)
{
_PyCodeArray *code = _Py_atomic_load_ptr_acquire(&co->co_tlbc);
int32_t idx = ((_PyThreadStateImpl*) tstate)->tlbc_index;
if (idx < code->size && code->entries[idx] != NULL) {
return (_Py_CODEUNIT *) code->entries[idx];
}
return NULL;
}

// Return a pointer to the thread-local bytecode for the current thread,
// creating it if necessary.
extern _Py_CODEUNIT *_PyCode_GetTLBC(PyCodeObject *co);

// Reserve an index for the current thread into thread-local bytecode
// arrays
//
// Returns the reserved index or -1 on error.
extern int32_t _Py_ReserveTLBCIndex(PyInterpreterState *interp);

// Release the current thread's index into thread-local bytecode arrays
extern void _Py_ClearTLBCIndex(_PyThreadStateImpl *tstate);
#endif

#ifdef __cplusplus
}
#endif
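`ENABLE_SPECIALIZATION_FT` is the per-family switch: a family that has been audited for thread safety changes its guard from `ENABLE_SPECIALIZATION` to `ENABLE_SPECIALIZATION_FT` and keeps specializing in free-threaded builds, while unconverted families stay disabled there. A rough sketch of the shape of such a guard (illustrative only; the real change lives in the generated interpreter and specializer, and the write below stands in for the family's actual re-specialization logic):

static void
example_respecialize_binary_op(_Py_CODEUNIT *instr)
{
#if ENABLE_SPECIALIZATION_FT
    // In a free-threaded build this write must land in the current thread's
    // bytecode copy and, per the later commits, happen while holding the
    // per-code-object lock.
    instr->op.code = BINARY_OP_ADD_INT;  // hypothetical specialized form
#else
    (void)instr;
#endif
}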
56 changes: 52 additions & 4 deletions Include/internal/pycore_frame.h
@@ -68,6 +68,10 @@ typedef struct _PyInterpreterFrame {
PyObject *f_locals; /* Strong reference, may be NULL. Only valid if not on C stack */
PyFrameObject *frame_obj; /* Strong reference, may be NULL. Only valid if not on C stack */
_Py_CODEUNIT *instr_ptr; /* Instruction currently executing (or about to begin) */
#ifdef Py_GIL_DISABLED
/* Index of thread-local bytecode containing instr_ptr. */
int32_t tlbc_index;
#endif
_PyStackRef *stackpointer;
uint16_t return_offset; /* Only relevant during a function call */
char owner;
@@ -76,14 +80,27 @@ typedef struct _PyInterpreterFrame {
} _PyInterpreterFrame;

#define _PyInterpreterFrame_LASTI(IF) \
((int)((IF)->instr_ptr - _PyCode_CODE(_PyFrame_GetCode(IF))))
((int)((IF)->instr_ptr - _PyFrame_GetBytecode((IF))))

static inline PyCodeObject *_PyFrame_GetCode(_PyInterpreterFrame *f) {
PyObject *executable = PyStackRef_AsPyObjectBorrow(f->f_executable);
assert(PyCode_Check(executable));
return (PyCodeObject *)executable;
}

static inline _Py_CODEUNIT *
_PyFrame_GetBytecode(_PyInterpreterFrame *f)
Reviewer (Member) commented:

You were storing the bytecode in the frame directly before, IIRC.
This looks more expensive, and is used on at least one fast path:
https://github.com/python/cpython/pull/123926/files#diff-729a985b0cb8b431cb291f1edb561bbbfea22e3f8c262451cd83328a0936a342R4821

Does it make things faster overall, or is it just more compact?

mpage (Contributor, Author) replied on Oct 17, 2024:

You were storing the bytecode in the frame directly before, IIRC.

Yep. You suggested storing tlbc_index instead since it was smaller. I think this is better for a couple of reasons:

  1. It's smaller, as you said.
  2. It simplifies and speeds up the implementation of RESUME_CHECK. Previously, we would have to load the bytecode pointer for the current thread and deopt if it didn't match what was in the frame. Now we only have to compare tlbc indices. This is a cost shift, however, since now the callers of _PyFrame_GetBytecode have to do the more expensive load of the bytecode. I think the size reduction + simplification of RESUME_CHECK probably outweighs the higher cost of _PyFrame_GetBytecode. tier2 is also still disabled in free-threaded builds, so it's a bit hard to evaluate the relative cost of slower trace exits vs faster RESUME_CHECKs. Once we get tier2 enabled we can reevaluate.

{
#ifdef Py_GIL_DISABLED
PyCodeObject *co = _PyFrame_GetCode(f);
_PyCodeArray *tlbc = _Py_atomic_load_ptr_acquire(&co->co_tlbc);
assert(f->tlbc_index >= 0 && f->tlbc_index < tlbc->size);
return (_Py_CODEUNIT *)tlbc->entries[f->tlbc_index];
#else
return _PyCode_CODE(_PyFrame_GetCode(f));
#endif
}
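/* Sketch of the trade-off discussed above (illustrative only, not the
 * generated RESUME_CHECK code): because the frame stores a small tlbc_index
 * rather than a bytecode pointer, the per-RESUME check is an integer compare,
 * while _PyFrame_GetBytecode pays for the extra loads. */
static inline int
example_resume_needs_tlbc_reload(PyThreadState *tstate,
                                 _PyInterpreterFrame *frame)
{
    int32_t this_thread = ((_PyThreadStateImpl *)tstate)->tlbc_index;
    // A mismatch means the frame is still running the main copy (or another
    // thread's copy) and must be re-pointed before continuing.
    return frame->tlbc_index != this_thread;
}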

static inline PyFunctionObject *_PyFrame_GetFunction(_PyInterpreterFrame *f) {
PyObject *func = PyStackRef_AsPyObjectBorrow(f->f_funcobj);
assert(PyFunction_Check(func));
@@ -144,13 +161,33 @@ static inline void _PyFrame_Copy(_PyInterpreterFrame *src, _PyInterpreterFrame *
#endif
}

#ifdef Py_GIL_DISABLED
static inline void
_PyFrame_InitializeTLBC(PyThreadState *tstate, _PyInterpreterFrame *frame,
PyCodeObject *code)
{
_Py_CODEUNIT *tlbc = _PyCode_GetTLBCFast(tstate, code);
if (tlbc == NULL) {
// No thread-local bytecode exists for this thread yet; use the main
// thread's copy, deferring thread-local bytecode creation to the
// execution of RESUME.
frame->instr_ptr = _PyCode_CODE(code);
frame->tlbc_index = 0;
}
else {
frame->instr_ptr = tlbc;
frame->tlbc_index = ((_PyThreadStateImpl *)tstate)->tlbc_index;
}
}
#endif

/* Consumes reference to func and locals.
Does not initialize frame->previous, which happens
when frame is linked into the frame stack.
*/
static inline void
_PyFrame_Initialize(
_PyInterpreterFrame *frame, _PyStackRef func,
PyThreadState *tstate, _PyInterpreterFrame *frame, _PyStackRef func,
Reviewer (Member) commented:

The only purpose of passing the thread state is to initialize the tlbc index, IIUC.
Most callers will already have a tlbc index, so could we pass that instead?

mpage (Contributor, Author) replied:

Most callers will already have a tlbc index, so could we pass that instead?

The tlbc index is only present in free-threaded builds. To do this I think we'd need to either have separate versions of _PyFrame_Initialize for free-threaded/default builds or introduce the notion of thread-local bytecode into the default build. Passing the PyThreadState as the first parameter seems simpler and more maintainable.

PyObject *locals, PyCodeObject *code, int null_locals_from, _PyInterpreterFrame *previous)
{
frame->previous = previous;
@@ -162,7 +199,12 @@ _PyFrame_Initialize(
frame->f_locals = locals;
frame->stackpointer = frame->localsplus + code->co_nlocalsplus;
frame->frame_obj = NULL;
#ifdef Py_GIL_DISABLED
_PyFrame_InitializeTLBC(tstate, frame, code);
#else
(void)tstate;
frame->instr_ptr = _PyCode_CODE(code);
#endif
frame->return_offset = 0;
frame->owner = FRAME_OWNED_BY_THREAD;

@@ -224,7 +266,8 @@ _PyFrame_IsIncomplete(_PyInterpreterFrame *frame)
return true;
}
return frame->owner != FRAME_OWNED_BY_GENERATOR &&
frame->instr_ptr < _PyCode_CODE(_PyFrame_GetCode(frame)) + _PyFrame_GetCode(frame)->_co_firsttraceable;
frame->instr_ptr < _PyFrame_GetBytecode(frame) +
_PyFrame_GetCode(frame)->_co_firsttraceable;
}

static inline _PyInterpreterFrame *
@@ -315,7 +358,8 @@ _PyFrame_PushUnchecked(PyThreadState *tstate, _PyStackRef func, int null_locals_
_PyInterpreterFrame *new_frame = (_PyInterpreterFrame *)tstate->datastack_top;
tstate->datastack_top += code->co_framesize;
assert(tstate->datastack_top < tstate->datastack_limit);
_PyFrame_Initialize(new_frame, func, NULL, code, null_locals_from, previous);
_PyFrame_Initialize(tstate, new_frame, func, NULL, code, null_locals_from,
previous);
return new_frame;
}

@@ -339,7 +383,11 @@ _PyFrame_PushTrampolineUnchecked(PyThreadState *tstate, PyCodeObject *code, int
assert(stackdepth <= code->co_stacksize);
frame->stackpointer = frame->localsplus + code->co_nlocalsplus + stackdepth;
frame->frame_obj = NULL;
#ifdef Py_GIL_DISABLED
_PyFrame_InitializeTLBC(tstate, frame, code);
#else
frame->instr_ptr = _PyCode_CODE(code);
#endif
frame->owner = FRAME_OWNED_BY_THREAD;
frame->return_offset = 0;

56 changes: 56 additions & 0 deletions Include/internal/pycore_index_pool.h
@@ -0,0 +1,56 @@
#ifndef Py_INTERNAL_INDEX_POOL_H
#define Py_INTERNAL_INDEX_POOL_H

#include "Python.h"

#ifdef __cplusplus
extern "C" {
#endif

#ifndef Py_BUILD_CORE
# error "this header requires Py_BUILD_CORE define"
#endif

#ifdef Py_GIL_DISABLED

// This contains code for allocating unique indices in an array. It is used by
// the free-threaded build to assign each thread a globally unique index into
// each code object's thread-local bytecode array.

// A min-heap of indices
typedef struct _PyIndexHeap {
int32_t *values;

// Number of items stored in values
Py_ssize_t size;

// Maximum number of items that can be stored in values
Py_ssize_t capacity;
} _PyIndexHeap;

// An unbounded pool of indices. Indices are allocated starting from 0. They
// may be released back to the pool once they are no longer in use.
typedef struct _PyIndexPool {
PyMutex mutex;

// Min heap of indices available for allocation
_PyIndexHeap free_indices;

// Next index to allocate if no free indices are available
int32_t next_index;
} _PyIndexPool;

// Allocate the smallest available index. Returns -1 on error.
extern int32_t _PyIndexPool_AllocIndex(_PyIndexPool *indices);

// Release `index` back to the pool
extern void _PyIndexPool_FreeIndex(_PyIndexPool *indices, int32_t index);

extern void _PyIndexPool_Fini(_PyIndexPool *indices);

#endif // Py_GIL_DISABLED

#ifdef __cplusplus
}
#endif
#endif // !Py_INTERNAL_INDEX_POOL_H
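Per the contract above, the pool always hands back the smallest index not currently in use, which keeps `co_tlbc` arrays dense as threads come and go. A small usage sketch under that contract (illustrative; the expected values in the comments follow from the "smallest available index" rule, and zero-initialization standing in for an empty pool is an assumption):

static void
example_index_pool_usage(void)
{
    _PyIndexPool pool = {0};                      // assumed empty-pool initializer
    int32_t a = _PyIndexPool_AllocIndex(&pool);   // 0
    int32_t b = _PyIndexPool_AllocIndex(&pool);   // 1
    _PyIndexPool_FreeIndex(&pool, a);             // 0 returns to the min-heap
    int32_t c = _PyIndexPool_AllocIndex(&pool);   // 0 again, not 2
    (void)b;
    (void)c;
    _PyIndexPool_Fini(&pool);
}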
2 changes: 2 additions & 0 deletions Include/internal/pycore_interp.h
@@ -26,6 +26,7 @@ extern "C" {
#include "pycore_genobject.h" // _PyGen_FetchStopIterationValue
#include "pycore_global_objects.h"// struct _Py_interp_cached_objects
#include "pycore_import.h" // struct _import_state
#include "pycore_index_pool.h" // _PyIndexPool
#include "pycore_instruments.h" // _PY_MONITORING_EVENTS
#include "pycore_list.h" // struct _Py_list_state
#include "pycore_mimalloc.h" // struct _mimalloc_interp_state
@@ -222,6 +223,7 @@ struct _is {
struct _brc_state brc; // biased reference counting state
struct _Py_unique_id_pool unique_ids; // object ids for per-thread refcounts
PyMutex weakref_locks[NUM_WEAKREF_LIST_LOCKS];
_PyIndexPool tlbc_indices;
#endif

// Per-interpreter state for the obmalloc allocator. For the main
4 changes: 3 additions & 1 deletion Include/internal/pycore_tstate.h
@@ -41,6 +41,9 @@ typedef struct _PyThreadStateImpl {
// If set, don't use per-thread refcounts
int is_finalized;
} refcounts;

// Index to use to retrieve thread-local bytecode for this thread
int32_t tlbc_index;
#endif

#if defined(Py_REF_DEBUG) && defined(Py_GIL_DISABLED)
@@ -49,7 +52,6 @@ typedef struct _PyThreadStateImpl {

} _PyThreadStateImpl;


#ifdef __cplusplus
}
#endif
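The new `tlbc_index` field ties a thread state to its slot in every code object's `co_tlbc` array: the index is reserved from the interpreter's `tlbc_indices` pool when the thread state is created and released when it is torn down, so a later thread can reuse the slot. A hedged sketch of that lifecycle (illustrative; the hook names are made up — the real call sites are in the thread-state creation and teardown paths):

static int
example_on_threadstate_create(PyInterpreterState *interp, _PyThreadStateImpl *ts)
{
    ts->tlbc_index = _Py_ReserveTLBCIndex(interp);
    if (ts->tlbc_index < 0) {
        return -1;  // reserving an index failed (out of memory)
    }
    return 0;
}

static void
example_on_threadstate_destroy(_PyThreadStateImpl *ts)
{
    _Py_ClearTLBCIndex(ts);  // returns the index to interp->tlbc_indices
}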