gh-115999: Implement thread-local bytecode and enable specialization for `BINARY_OP` #123926

mpage · 2024-09-10T22:53:56Z

This PR implements the foundational work necessary for making the specializing interpreter thread-safe in free-threaded builds and enables specialization for BINARY_OP as an end-to-end example. To enable future incremental work, specialization can now be toggled on a per-family basis. Subsequent PRs will enable specialization in free-threaded builds for the remaining families.

In free-threaded builds, each thread specializes a thread-local copy of the bytecode, created on the first RESUME. All copies of the bytecode for a code object are stored in the co_tlbc array on the code object. Each thread reserves a globally unique index at thread creation and releases it at thread destruction; the index identifies that thread's copy of the bytecode in every co_tlbc array. The first entry in every co_tlbc array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads.
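The scheme described above can be modelled in a few lines of Python. This is only an illustrative sketch, not the C implementation in this PR: `IndexPool`, `CodeModel`, and `bytecode_for` are hypothetical names, and handing out the smallest free index first is an assumption made for the example.

```python
import heapq


class IndexPool:
    """Hypothetical model: hands out the smallest free index to each thread."""

    def __init__(self):
        self._free = []   # min-heap of indices released by finished threads
        self._next = 0    # next index that has never been handed out

    def reserve(self):
        # Reuse a released index if one exists, otherwise mint a new one.
        if self._free:
            return heapq.heappop(self._free)
        index = self._next
        self._next += 1
        return index

    def release(self, index):
        heapq.heappush(self._free, index)


class CodeModel:
    """Hypothetical stand-in for a code object with a co_tlbc-like array."""

    def __init__(self, bytecode):
        self.main = bytearray(bytecode)
        self.tlbc = [self.main]  # entry 0 always refers to the main copy

    def bytecode_for(self, index):
        # Grow the array on demand, then copy lazily on first use.
        while len(self.tlbc) <= index:
            self.tlbc.append(None)
        if self.tlbc[index] is None:
            self.tlbc[index] = bytearray(self.main)
        return self.tlbc[index]
```

In this model the first thread to reserve an index gets 0 and therefore runs the main copy directly, which mirrors the claim that single-threaded programs never copy bytecode.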

Thread-local bytecode can be disabled at runtime by providing either -X tlbc=0 or PYTHON_TLBC=0. Disabling thread-local bytecode also disables specialization.
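As a usage sketch, the two switches can be passed to a child interpreter as shown below. The helper names `tlbc_disabled_invocation` and `run_without_tlbc` are hypothetical; only the `-X tlbc=0` option and the `PYTHON_TLBC=0` environment variable come from the description above, and they are only meaningful on free-threaded builds.

```python
import os
import subprocess
import sys


def tlbc_disabled_invocation(code, use_env=False):
    """Build (argv, env) for running `code` with thread-local bytecode disabled."""
    argv = [sys.executable]
    env = dict(os.environ)
    if use_env:
        env["PYTHON_TLBC"] = "0"      # environment-variable form
    else:
        argv += ["-X", "tlbc=0"]      # command-line form
    argv += ["-c", code]
    return argv, env


def run_without_tlbc(code, use_env=False):
    """Run `code` in a child interpreter and return its standard output."""
    argv, env = tlbc_disabled_invocation(code, use_env)
    result = subprocess.run(argv, env=env, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_without_tlbc("print(1 + 1)")` would execute the snippet with specialization disabled as a side effect of disabling thread-local bytecode.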

Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
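One way to picture "not overwriting an instruction that was instrumented concurrently" is a compare-and-exchange on the opcode. The sketch below is a Python model only: it assumes a compare-exchange style update (the description does not say which atomic operations are used), emulates atomicity with a lock, and uses the hypothetical names `AtomicOpcode` and `specialize`.

```python
import threading


class AtomicOpcode:
    """Hypothetical model of one instruction slot with atomic access."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomics

    def load(self):
        with self._lock:
            return self._value

    def store(self, value):
        with self._lock:
            self._value = value

    def compare_exchange(self, expected, new):
        # Replace the value only if it still equals `expected`.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def specialize(slot, generic, specialized):
    """Install `specialized` only if the slot still holds `generic`."""
    return slot.compare_exchange(generic, specialized)
```

If instrumentation stores an instrumented opcode between the specializer reading the slot and writing it, the exchange fails and the instrumented opcode is left in place.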

Issue: Make the specializing interpreter thread-safe in --disable-gil builds #115999

- Fix a few places where we were not using atomics to (de)instrument opcodes.
- Fix a few places where we weren't using atomics to reset adaptive counters.
- Remove some redundant non-atomic resets of adaptive counters that presumably snuck in as merge artifacts of python#118064 and python#117144 landing close together.


mpage · 2024-10-22T00:02:52Z

@markshannon - Would you take a look at this, please?

markshannon · 2024-10-23T16:02:32Z

I'm still concerned about not counting the tlbc memory blocks in the refleaks test.

Maybe you could count them separately, and still check that there aren't too many leaked, but be a bit more relaxed about the counts for tlbc than for other blocks?

mpage · 2024-10-24T04:49:06Z

!buildbot nogil refleak

bedevere-bot · 2024-10-24T04:49:09Z

🤖 New build scheduled with the buildbot fleet by @mpage for commit 07f9140 🤖

The command will test the builders whose names match following regular expression: nogil refleak

The builders matched are:

AMD64 CentOS9 NoGIL Refleaks PR
AMD64 Fedora Rawhide NoGIL refleaks PR
aarch64 Fedora Rawhide NoGIL refleaks PR
PPC64LE Fedora Rawhide NoGIL refleaks PR

mpage · 2024-10-24T05:26:55Z

> I'm still concerned about not counting the tlbc memory blocks in the refleaks test.
>
> Maybe you could count them separately, and still check that there aren't too many leaked, but be a bit more relaxed about the counts for tlbc than for other blocks?

@markshannon - That would work, but I opted for clearing the cached TLBC for threads that aren't currently in use when we clear other internal caches. This should still catch leaks, doesn't require modifying refleaks.py, and is the same approach we use for tier2. Please have a look.
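The idea of dropping cached copies for indices that no live thread holds can be sketched as follows. This is a toy model, not the code in this PR: `clear_unused_tlbc` is a hypothetical name, and the list-of-bytearrays layout is an assumption chosen for illustration.

```python
def clear_unused_tlbc(tlbc, in_use):
    """Drop cached copies whose index is not held by a live thread.

    `tlbc` is a list whose entry 0 is the main copy; `in_use` is the set of
    indices currently reserved by threads. Returns the number of copies freed.
    """
    freed = 0
    for index in range(1, len(tlbc)):  # entry 0 (the main copy) is never freed
        if index in in_use or tlbc[index] is None:
            continue
        tlbc[index] = None
        freed += 1
    return freed
```

Because a leaked copy would survive this pass only if its index were still marked in use, a refleak run that clears caches first should still notice genuine leaks.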

markshannon · 2024-10-29T12:25:53Z

Lib/test/test_sys.py

+            # code objects is a large fraction of the total number of
+            # references, this can cause the total number of allocated
+            # blocks to exceed the total number of references.
+            if not support.Py_GIL_DISABLED:


Now that we can free the unused tlbcs, can we replace this with sys._clear_internal_caches()?

mpage

Unfortunately, no. The assertion seems to be very sensitive to which kinds of objects are on the heap as well as the number of non-reference-counted allocations (blocks) per object. With the introduction of TLBC there is at least one additional block allocated per code object that is not reference counted, the _PyCodeArray, which is present even if we free the unused TLBCs. Its presence is enough to trigger the assertion.

This assertion feels pretty brittle and I'd be in favor of removing it, but that's probably worth doing in a separate PR.

markshannon

Maybe replace it with a more meaningful test rather than removing it, but in another PR.
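For readers unfamiliar with the check being discussed, the comparison it makes can be approximated as below. This is a sketch based on the diff comment above (allocated blocks versus total references), not the test's actual code; `allocated_blocks_and_refs` and `blocks_below_refs` are hypothetical helpers, and `sys.gettotalrefcount` is only available on debug builds of CPython.

```python
import sys


def allocated_blocks_and_refs():
    """Return (allocated blocks, total refcount or None if unavailable)."""
    blocks = sys.getallocatedblocks()
    get_total_refs = getattr(sys, "gettotalrefcount", None)  # debug builds only
    total_refs = get_total_refs() if get_total_refs is not None else None
    return blocks, total_refs


def blocks_below_refs():
    """True/False when both counts are available, None otherwise."""
    blocks, total_refs = allocated_blocks_and_refs()
    if total_refs is None:
        return None
    return blocks < total_refs
```

Any allocation that is not reference counted, such as the per-code-object array mentioned above, raises the block count without raising the reference count, which is why the comparison is fragile.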

markshannon

Looks good.

One question. Can we prefix the test for leaking blocks with sys._clear_internal_caches() instead of making it conditional on not using free-threading?

mpage · 2024-10-29T16:46:03Z

> One question. Can we prefix the test for leaking blocks with sys._clear_internal_caches() instead of making it conditional on not using free-threading?

@markshannon - Unfortunately that doesn't help. See my reply inline.

markshannon

I still have concerns about memory use, but we can iterate on that in subsequent PRs.

mpage added 30 commits September 10, 2024 13:24

Assign threads indices into bytecode copies

776a1e1

Replace most usage of PyCode_CODE

2b40870

Get bytecode copying working

344d7ad

Refactor remove_tools

f203d00

Refactor remove_line_tools

82b456a

Instrument thread-local bytecode

b021704

Use locks for instrumentation

aea69c5

Add ifdef guards for each specialization family

552277d

Specialize BINARY_OP

50a6089

Limit the amount of memory consumed by bytecode copies

3f1d941

Make thread-local bytecode limits user configurable

7d2eb27

Make branch taken recording thread-safe

e3b367a

Lock thread-local bytecode when specializing

b2375bf

Load bytecode on RESUME_CHECK

2707f8e

Load tlbc on generator.throw()

3fdcb28

Use tlbc instead of thread_local_bytecode

4a55ce5

Use tlbc everywhere

8b3ff60

Explicitly manage tlbc state

862afa1

Refactor API for fetching tlbc

0b4d952

Add unit tests

7795e99

Fix initconfig in default build

693a4cc

Fix instrumentation in default build

b43531e

Synchronize bytecode modifications between specialization and instrumentation using atomics

9025f43

Add a high-level comment

c44c7d9

Fix unused variable warning in default build

e2a6656

Fix test_config in free-threaded builds

e6513d1

Fix formatting

a18396f

Remove comment

81fe1a2

Fix data race in _PyInstruction_GetLength

837645e

Read the opcode atomically, the interpreter may be specializing it

mpage added 4 commits October 17, 2024 13:40

Use int32_t instead of Py_ssize_t for tlbc indices

ab6222c

Use _PyCode_CODE instead of PyFrame_GetBytecode in super_init_without_args

6bbb220

Update comment

4580e3c

Consolidate _PyCode_{Quicken,DisableSpecialization} into _PyCode_InitCounters

b992f44

Yhg1s reviewed Oct 18, 2024

Include/cpython/code.h

Include/internal/pycore_frame.h

Lib/test/test_cmd_line.py

Python/ceval_macros.h

Python/index_pool.c (outdated)

mpage added 3 commits October 18, 2024 09:35

Merge branch 'main' into pythongh-115999-thread-local-bytecode

4c040d3

Fix incorrect types

5b7658c

Add command-line tests for enabling TLBC

bec5bce

mpage requested review from markshannon and Yhg1s October 18, 2024 17:48

mpage added 3 commits October 18, 2024 15:03

Update libpython.py for tlbc_index

c9054b7

Avoid special casing in _PyEval_GetExecutableCode

1a48ab2

It's cleaner to assign all threads the index of the main copy of the bytecode when tlbc is disabled rather than adding a special case in _PyEval_GetExecutableCode.

Merge branch 'main' into pythongh-115999-thread-local-bytecode

b16ae5f

Yhg1s approved these changes Oct 19, 2024

Include/cpython/code.h

Python/ceval_macros.h

bedevere-app bot added awaiting merge and removed awaiting change review labels Oct 19, 2024

mpage added 3 commits October 23, 2024 11:36

Merge branch 'main' into pythongh-115999-thread-local-bytecode

176b24e

Clear TLBC when other caches are cleared

c107495

Remove _get_tlbc_blocks

07f9140

Yhg1s reviewed Oct 25, 2024

Lib/test/test_sys.py

markshannon reviewed Oct 29, 2024

markshannon self-requested a review October 29, 2024 16:54

markshannon approved these changes Oct 29, 2024
