Use non-capturing groups for IPv4 address detection. #323

Merged: 2 commits, Mar 8, 2024

elliotwutingfeng
Contributor

Since we are not using the captured groups, we can replace them with non-capturing groups as a micro-optimization. Benchmarks indicate roughly a 10-20% improvement in parsing speed for positive matches.
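As a quick sanity check of the idea, the sketch below compares the two patterns from the benchmark in this comment. The names `CAPTURING`, `NON_CAPTURING`, `SAMPLES`, and `accepts` are illustrative only and are not part of tldextract. It assumes nothing beyond the standard `re` module and only checks that both patterns accept the same strings while the non-capturing one records no groups.

```python
import re

# Pattern before the change: every parenthesized part is a capturing group.
CAPTURING = re.compile(
    r"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.)"
    r"{3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$"
)

# Pattern after the change: (?:...) groups without recording submatches.
NON_CAPTURING = re.compile(
    r"^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.)"
    r"{3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$"
)

# Illustrative inputs: two valid IPv4 addresses and three non-addresses.
SAMPLES = [
    "1.1.1.1",
    "255.255.255.255",
    "256.1.1.1",
    "255.255.255.com",
    "a.b.c.e.co.uk",
]


def accepts(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


# Both patterns should agree on every sample.
for sample in SAMPLES:
    assert accepts(CAPTURING, sample) == accepts(NON_CAPTURING, sample)

# Number of capturing groups each compiled pattern tracks.
print(CAPTURING.groups, NON_CAPTURING.groups)
```

Because the calling code only needs a yes/no answer, dropping the captures should not change behavior; any speed difference comes from the regex engine doing less bookkeeping.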

Using the re.ASCII flag appears to give a slight speed boost on CPython, but makes things slower on PyPy for certain inputs.
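For context, re.ASCII should only affect speed here, not which strings are accepted: the flag changes the meaning of shorthand classes such as `\d`, while this pattern uses explicit `[0-9]` ranges. The sketch below illustrates that under the assumption that the pattern is the non-capturing one from the benchmark; `UNICODE_RE`, `ASCII_RE`, and `FULLWIDTH` are names made up for this example.

```python
import re

PATTERN = (
    r"^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.)"
    r"{3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$"
)
UNICODE_RE = re.compile(PATTERN)          # default (Unicode) matching
ASCII_RE = re.compile(PATTERN, re.ASCII)  # ASCII-only matching

# "1.1.1.1" written with fullwidth digit ones (U+FF11), a non-ASCII digit.
FULLWIDTH = "\uff11.\uff11.\uff11.\uff11"

SAMPLES = ["1.1.1.1", "255.255.255.com", FULLWIDTH]

# Pairs of (matches without the flag, matches with the flag) per sample.
results = [
    (UNICODE_RE.fullmatch(s) is not None, ASCII_RE.fullmatch(s) is not None)
    for s in SAMPLES
]

# Explicit [0-9] ranges never match non-ASCII digits, flag or no flag.
assert all(without_flag == with_flag for without_flag, with_flag in results)

# By contrast, \d is sensitive to the flag.
assert re.compile(r"\d").fullmatch("\uff11") is not None
assert re.compile(r"\d", re.ASCII).fullmatch("\uff11") is None
```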

Results are mixed for bracketless IPv6 addresses.

import timeit

stmts = []
stmts.append(
"""
IP_RE.fullmatch("1.1.1.1")
"""
)
stmts.append(
"""
IP_RE.fullmatch("aBcD:ef01:2345:6789:aBcD:ef01:127.0.0.1")
"""
)
stmts.append(
"""
IP_RE.fullmatch("255.255.255.com")
"""
)
stmts.append(
"""
IP_RE.fullmatch("a.b.c.e.co.uk")
"""
)

setups = []
setups.append(
"""
import re
IP_RE = re.compile(
    r"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.)"
    r"{3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$"
)
"""
)
setups.append(
"""
import re
IP_RE = re.compile(
    r"^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.)"
    r"{3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$"
)
"""
)
setups.append(
"""
import re
IP_RE = re.compile(
    r"^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.)"
    r"{3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$", re.ASCII
)
"""
)

messages = ["original", "non-capturing groups", "non-capturing groups + re.ASCII"]

for (setup, message) in zip(setups, messages):
    print(message)
    for stmt in stmts:
        timer = timeit.Timer(stmt=stmt, setup=setup)
        print(timer.timeit(number=10_000_000))

CPython 3.11

original
2.1482726680005726
1.1063692410007206
3.2284626149994438
1.1412295229993106
non-capturing groups
1.8773814500000299
1.148869920000834
2.7006006390001858
1.1598876739999469
non-capturing groups + re.ASCII
1.7322393260001263
1.1551464869999108
2.7154349570000704
1.1148897359998955

PyPy 3.9

original
0.42392663299960986
0.21313040499990166
0.48231843799931085
0.21305666600073891
non-capturing groups
0.3258821990002616
0.23411077200034924
0.41092471499996464
0.2272209719994862
non-capturing groups + re.ASCII
0.34160923099989304
0.21923503299967706
0.42936557000030007
0.21578892799971072
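
To relate the raw timings above to the percentages quoted in the description, the following sketch computes the relative change between "original" and "non-capturing groups" for each input. The timings are copied from the output above and rounded to four decimal places; `RESULTS`, `INPUTS`, and `percent_faster` are names invented for this example.

```python
# Seconds per 10,000,000 calls, rounded from the benchmark output above.
RESULTS = {
    "CPython 3.11": {
        "original": [2.1483, 1.1064, 3.2285, 1.1412],
        "non-capturing groups": [1.8774, 1.1489, 2.7006, 1.1599],
    },
    "PyPy 3.9": {
        "original": [0.4239, 0.2131, 0.4823, 0.2131],
        "non-capturing groups": [0.3259, 0.2341, 0.4109, 0.2272],
    },
}

# Inputs in the same order as the benchmark statements.
INPUTS = [
    "1.1.1.1",
    "aBcD:ef01:2345:6789:aBcD:ef01:127.0.0.1",
    "255.255.255.com",
    "a.b.c.e.co.uk",
]


def percent_faster(before, after):
    """Relative time saved, in percent; negative means slower."""
    return (before - after) / before * 100.0


for runtime, timings in RESULTS.items():
    print(runtime)
    pairs = zip(INPUTS, timings["original"], timings["non-capturing groups"])
    for text, before, after in pairs:
        print(f"  {text}: {percent_faster(before, after):+.1f}%")
```

On these figures, the two inputs that begin with valid octets ("1.1.1.1" and "255.255.255.com") come out roughly 13-23% faster, while the other two inputs are a few percent slower, which may be within run-to-run noise for a single measurement.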

@john-kurkowski
Owner

Love it! Thanks for the thorough stats!

@john-kurkowski john-kurkowski merged commit 9f16a0c into john-kurkowski:master Mar 8, 2024
27 checks passed
@elliotwutingfeng elliotwutingfeng deleted the ip branch March 9, 2024 00:07
bmwiedemann pushed a commit to bmwiedemann/openSUSE that referenced this pull request Mar 29, 2024
https://build.opensuse.org/request/show/1163368
by user mia + anag+factory
- Update to 5.1.2:
  * Remove socket.inet_pton, to fix platform-dependent IP parsing
    #gh/john-kurkowski/tldextract#318
  * Use non-capturing groups for IPv4 address detection, for a
    slight speed boost
    #gh/john-kurkowski/tldextract#323