Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-115704: Improve DJBX33A hash algorithm #115705

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

PeterYang12
Copy link

@PeterYang12 PeterYang12 commented Feb 20, 2024

Accelerating python hash algorithm by "unoptimizing" it when using DJBX33A as hash algorithm.
See Daniel Lemire's blog post:
https://lemire.me/blog/2016/07/21/accelerating-php-hashing-by-unoptimizing-it/
This idea has already been implemented in the PHP interpreter.

Copy link

cpython-cla-bot bot commented Feb 20, 2024

The following commit authors need to sign the Contributor License Agreement:

Click the button to sign:
CLA not signed

@bedevere-app
Copy link

bedevere-app bot commented Feb 20, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

1 similar comment
@bedevere-app
Copy link

bedevere-app bot commented Feb 20, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@bedevere-app
Copy link

bedevere-app bot commented Feb 27, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

1 similar comment
@bedevere-app
Copy link

bedevere-app bot commented Aug 25, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@gpshead gpshead removed their assignment Aug 25, 2024
Copy link
Contributor

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like you to address Serhiy's concern on the issue as well. And could we have the actual assembly code being generated actually (and it's comparison)?

Python/pyhash.c Outdated
Comment on lines 172 to 176
hash = hash * 33 * 33 * 33 * 33 +
p[0] * 33 * 33 * 33 +
p[1] * 33 * 33 +
p[2] * 33 +
p[3];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can directly compute those values (namely the powers of 33), although the compiler should be able to optimize it as well. You could however add a small comment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review. Done.

Python/pyhash.c Outdated Show resolved Hide resolved
Python/pyhash.c Outdated
Comment on lines 180 to 190
if (len >= 2) {
if (len > 2) {
hash = hash * 33 * 33 * 33 +
p[0] * 33 * 33 +
p[1] * 33 +
p[2];
}
else {
hash = hash * 33 * 33 + p[0] * 33 + p[1];
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the compiler behaves differently if you use

if (len > 2) { ... }
else if (len == 2) { ... }
else if (len) { ... }

instead?

@cfbolz
Copy link
Contributor

cfbolz commented Aug 25, 2024

this PR is quite fundamentally different from the Lemire post/the PHP change. In PHP, the change was done in the implementation of the hash function for arbitrary lengths. In this PR, only the code for computing the hash of bytes with length <= 7 was changed. The hash function of arbitrary lengths is already using a direct multiplication in CPython, and is operating on several characters at a time.

This might still be a worthwhile change, but there should be really strong benchmarking results that show this.

@jxu
Copy link

jxu commented Aug 26, 2024

I support it only to clean up a little by getting rid of the ugly (IMO) C fall-through code and replacing it with a simple loop. I don't think the compiler will generate anything significantly different.

Accelerating python hash algorithm by "unoptimizing" it when using
DJBX33A as hash algorithm. See Daniel Lemire's blog post:
https://lemire.me/blog/2016/07/21/accelerating-php-hashing-by-unoptimizing-it/

Signed-off-by: PeterYang12 <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Aug 27, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants