-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dependencies._column_loc
: files parameter has a mismatch between typing and implementation
#407
Comments
In the context of benchmarking polars methods I have found quite large differences I have found quite large differences. typically a spedup factor of 600 or so. I fear that the #370 implementation is lost as the commits got squared, right? This is a second independent run, most times unvectorized is in the 2 seconds ballbark again.
|
I could check out the branch locally, and pushed it again. So, I think you should be able to fetch it now as well. Otherwise, the changes are visible at https://github.com/audeering/audb/pull/370/files.
The unvectorized results are for the implementation presented here? For comparison, results from #370 results from you results from current Important is the pyarrow column. There your vectorized results are the fastest. Also my implementation from #370 shows faster results for the vectorized solution. The main reason, I decided against #370 was that it does not translate into any real world advantages when loading a database, as briefly mentioned in #375 (comment), |
I have now tried out your version that uses In this case I would be inline with you and do away the type hinting for sequences (and have only single file input). |
Problem
Typing of
Dependencies._column_loc
seem not to be matching the implementation.files
seems to be implemented for scalar input, whereas the type hint suggests that list input should be possible.For scalar input
df.at[files, column]
would crash, whereasdf.loc[files, column].values.tolist()
would return a list.Possible solutions
Additional points
Example
Further Background
Benchmarking in benchmark-dependencies-methods.py uses list comprehensions of the form:
When evaluating polars this becomes tremendously slow (
57
sec.).However when I implement this based on a version of
Dependencies._column_loc
that can process iterable input and call it withdeps.bit_depth(_files)
this is fast (0.009280681610107422
sec.) which is a factor or more than 6000 times faster.Should this issue maybe resolved before #385 can be meaningfully added to the benchmarks?
Saying that I would of course favor implementing the vectorized version of
_column_loc
- but I do not know how common the use of the list comprehension indeed is.What would your take on this be @hagenw ?
The text was updated successfully, but these errors were encountered: