-
Notifications
You must be signed in to change notification settings - Fork 152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
30% faster reads by removing unnecessary file.tell()
calls
#579
base: main
Are you sure you want to change the base?
Conversation
This seems like a really good idea. It might be worthwhile to revisit how mixed-line endings are currently handled. If they could be standardized on For the issue around io.TextIOWrapper.tell() being an opaque number for io.TextIOWrapper.seek(), it feels like we could accept this risk and either: hope new versions of Python will always handle it in some proper way, or update Lasio code at such a time that Python changes the io.TextIOWrapper.tell()/seek() behavior. Does that seem like the right approach to you? |
@emilykl Thank you! This is wonderful. I'm happy to merge if we can work out a way to provide for a sensible exception and a backup way to read files with mixed line endings (surely they are uncommon) - such as a keyword argument. @dcslagel I'm on board with all your ideas. What do you think, Should we all get this PR to a merge state and release it first while considering other improvements for later work? 30% is worth risking changes with python in the future in my opinion. |
Thanks for the feedback! Very appreciated.
What we could potentially do is disable newline standardization by passing the argument The only issue I see there is that the In any case, I think it probably makes sense to make this "fast" method a configurable option -- either opt-in or opt-out. We can document the caveats, and perhaps even add a hard restriction that if the "fast" method is enabled, |
Sounds good!
I'd prefer the "fast" option to be on by default. Also perhaps it makes sense to remove the support for file_ref to be an open file object at the same time. If I understand correctly, this method should still work then in the two scenarios of file_ref being a path or a string containing the LAS file contents, correct? |
So a file that has a header line with a different line-ending from the first line in the file, plus a test case for it. So that it is a visible issue.
|
Rather than calling
file.tell()
for every line while searching for section headers, call it only for lines containing a section header. This results in a significant read performance boost of at least 30%.(This particular benchmark shows an even greater improvement but I was usually seeing values closer to 30% during local testing.)
This change is less straightforward than it seems, because we need to know the file position of the beginning of the header line; but calling
file.tell()
after reading that line gives us the file position of the beginning of the following line.This approach reverse-engineers the correct file position by observing that
file.tell()
seems to correspond to subtracting the length of the current line, plus an adjustment depending on the file's line ending type (CRLF vs. LF).Since the performance improvement is so large, I think this change is worth pursuing. However, there are two caveats:
This approach does not work reliably on files with mixed line endings.
file_obj.newlines
only reveals which line endings have been encountered so far in the file.test_open_url_different_newlines()
happens to pass anyway because the header line has the same line ending as the first line in the file. Header lines which have a different line ending from the first line in the file will fail.And the more serious issue...
io.TextIOWrapper.tell()
officially returns "an opaque number" which is guaranteed only to work as input toio.TextIOWrapper.seek()
. So although it seems to correspond to character count in practice, and works well enough to pass all the test cases for this project, that behavior is not guaranteed and could also change in future Python releases, making this a fragile approach.Very open to comments and ideas! If there is a safer way to get the correct file position, without calling
file.tell()
for every line, I would love to implement it.