-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Polars benchmarks methods #424
Conversation
65c1d08
to
2c6143a
Compare
The implementation with |
Regarding |
b26f17a
to
9e12393
Compare
These two of your comments belong together. I will comment on all the changes in a separate thread summarizing all changes that I have made yesterday. The ghist of it is that the lack if speed for single elements has to do with the fact that only pandas has indices.
|
Treatment Index variablethe previous previous version of this MR assumed that The migration guide, or in this blogpost for a more detailed fashion discuss about polars not implementing indices. This in essence means that when doing random access of a single element, access cannot be fast per se as the whole data has to be searched: From my basic understanding I think that taking the value of a given index would be I am currently maintaining the index as a variable use
|
method | pandas | polars | winner | factor |
---|---|---|---|---|
Dependencies.call() | 0.000 | 0.000 | polars | 2.667 |
Dependencies.contains(10000 files) | 0.003 | 0.002 | polars | 2.005 |
Dependencies.get_item(10000 files) | 0.648 | 0.013 | polars | 50.382 |
Dependencies.len() | 0.000 | 0.000 | pandas | 1.300 |
Dependencies.str() | 0.004 | 0.000 | polars | 24.677 |
Dependencies._add_attachment() | 0.171 | 0.104 | polars | 1.645 |
Dependencies._add_media(10000 files) | 0.073 | 0.008 | polars | 9.589 |
Dependencies._add_meta() | 0.127 | 0.100 | polars | 1.260 |
Dependencies._drop() | 0.118 | 0.021 | polars | 5.628 |
Dependencies._remove() | 0.067 | 0.002 | polars | 39.324 |
Dependencies._update_media() | 0.142 | 0.066 | polars | 2.148 |
Dependencies._update_media_version(10000 files) | 0.021 | 0.016 | polars | 1.341 |
Dependencies.archive(10000 files) | 0.045 | 0.014 | polars | 3.250 |
Dependencies.archives | 0.145 | 0.151 | pandas | 1.045 |
Dependencies.attachment_ids | 0.018 | 0.008 | polars | 2.375 |
Dependencies.attachments | 0.017 | 0.008 | polars | 2.194 |
Dependencies.bit_depth(10000 files) | 0.029 | 0.014 | polars | 2.031 |
Dependencies.channels(10000 files) | 0.030 | 0.013 | polars | 2.224 |
Dependencies.checksum(10000 files) | 0.030 | 0.014 | polars | 2.201 |
Dependencies.duration(10000 files) | 0.028 | 0.014 | polars | 2.066 |
Dependencies.files | 0.012 | 0.011 | polars | 1.040 |
Dependencies.format(10000 files) | 0.033 | 0.014 | polars | 2.345 |
Dependencies.media | 0.068 | 0.040 | polars | 1.702 |
Dependencies.removed(10000 files) | 0.029 | 0.014 | polars | 2.118 |
Dependencies.removed_media | 0.068 | 0.038 | polars | 1.809 |
Dependencies.sampling_rate(10000 files) | 0.029 | 0.014 | polars | 2.102 |
Dependencies.table_ids | 0.025 | 0.013 | polars | 1.927 |
Dependencies.tables | 0.017 | 0.008 | polars | 2.166 |
Dependencies.type(10000 files) | 0.028 | 0.014 | polars | 2.063 |
Dependencies.version(10000 files) | 0.032 | 0.013 | polars | 2.372 |
Great, thanks for your effort, now we can directly compare There are a few points that need to be considered when switching to
I would propose, to not consider switching to I think it would make sense to merge this into the |
I also think that this is quite ambitious for now: it would necessitate to refactor all tests, so this is a larger decision.
I have updated the requirements and the README. I also committed the script that I used to compare the analysis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ready to merge.
a4275f2
to
9b3b748
Compare
After rebasing onto main (with no conflicts) I run into a failing test:
which resutls in a test failure:
Will I have to obtain some unmerged stuff from one of these?:
Or is there a different reason that I am not seing? |
The test was fixed with #445, which is merged now. |
Co-authored-by: Hagen Wierstorf <[email protected]>
Co-authored-by: Hagen Wierstorf <[email protected]>
Co-authored-by: Hagen Wierstorf <[email protected]>
9b3b748
to
38ef12d
Compare
closes #385
Benchmarking Polars (methods)
The current merge requests monkeypatches the dependencies module and replaces it with a polars version. At the current stage I am unsure whether it makes sense to include it into the code base, or wether a ghist with the files doing such an evaluation would be the better change. This is due to the currently rather mixed results in the benchmarking table, and due to the fact that possibly we do not want to maintain a polars dependenies module given that there are future changes to the module.
Polars uses arrow memory automatically. Furthermore, a cast to type "Object" is impossible. Therefore the comparison is limited to a comparison between "Pandas/string" and Polars. Possibly one could compare against pyarrow if this is more meaningful, the polars results would stay identical.
It probably makes sense to alter the scope of the issue: while it is useful to do the method benchmarking using polars, for benchmarking file i/o it probably would make more sense to test whether lance is faster than other formats. Somehow this is a different question though
Comments:
"pandas casting" means that I have not been bothered with improving the implementation a lot. So polars convers df from pandas and back on return
concat(faster than in place):
An inplace version of this particular function would look like that
Interestingly it was slower than the concat version that assigns new memory for the returned table.
there are a few methods that use concat. Note that this for polars alters the sort order of the dataframe. As access is always per (pandas) index this should not matter, does it?
approx 14.14: is not run as the polars version is extremely slow for now.
both unvectorized: slow for polars, but an improvement is probably possible.
Results and Interpretation:
in the current scenario I can see no benefit of replacing pandas with the polars dataframe engine. What would make sense however is to streamlne the code to include #407 and streamline the code such that all methods using
Dependencies._column_loc
would use vectorized code. I found a few instances where the current implementation iterates over files and callsDependencies._column_loc
on these single files.Further direction: in order to be on the safe side, one should probably extend the polars method benchmarking to include different data sizes: it might be that using polars shines more on larger datasets (given that it is advertized using better threading). So the questions would be: How would a fortunate setting of
num_rows
andn_files
look like, witouth making it unrealistically bigI am tentatively requesting a review, despite the fact that I know that this might need to be changed and/or extended.