Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace the yield with an isb on Arm. #17

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

AGSaidi
Copy link
Contributor

@AGSaidi AGSaidi commented Jan 26, 2023

The yield instruction is treated as a nop on Arm processors which is very different than the x86 pause instruction that stalls execution for ~40 cycles.

An ISB serializes the pipeline and has been shown to be roughly analogous to the pause delays and is used is other databases for spinloops and adaptive spin loops where not hammering the cache line is important.

The yield instruction is treated as a nop on Arm processors which is very
different than the x86 pause instruction that stalls execution for ~40 cycles.

An ISB serializes the pipeline and has been shown to be roughly analogous
to the pause delays and is used is other databases for spinloops and adaptive
spin loops where not hammering the cache line is important.
@BrianNichols
Copy link
Member

@AGSaidi We are evaluating this pull request. I did run a multi-threaded spinlock test on macOS m1, and both "yield" and "isb" instructions resulted in nearly 100% cpu usage on each of the blocked threads. I realize cpu usage and power consumption are not exactly correlated, so this test might be misleading or Apple silicon might have a "pause like" implementation of "yield".

Can you point to a spinlock test and platform that demonstrates the advantage of "isb" on power consumption?

@AGSaidi
Copy link
Contributor Author

AGSaidi commented Jan 27, 2023

@BrianNichols you'll still see 100% cpu utilization, the application is still using the core completely, but the key point here is it's going to iterate around the spinloop fewer timer. These fewer times mean less loads for the memory location that is being spun on into the memory system and that generally saves power and improves performance. The Performance improvement comes from two angles. 1. If there are any adaptive spin loops that have been tuned for 'pause' on intel this will make teh same tuning apply for Arm as opposed to ending early. 2. The fact that the memory system isn't saturated with loads from the fast loop means that the unlock is observed more quickly.

@BrianNichols
Copy link
Member

@AGSaidi Sounds reasonable. We should have a decision by next week.

@AGSaidi
Copy link
Contributor Author

AGSaidi commented Jan 27, 2023

Great. Please run your performance tests on Graviton and see if there are tests that improve. We've seen substantial improvements for lock contention workloads in other databases.

@BrianNichols
Copy link
Member

The pull request has been accepted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants