feat: Add column format option to iter rows #3681
CodSpeed Performance Report
Merging #3681 will not alter performance

Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3681 +/- ##
==========================================
- Coverage 78.06% 77.91% -0.16%
==========================================
Files 728 727 -1
Lines 89967 91219 +1252
==========================================
+ Hits 70236 71074 +838
- Misses 19731 20145 +414
From offline discussion: allow only the Python and Arrow column formats, since NumPy is only for numeric data. If users want NumPy, they should be able to convert the data themselves.
43ea5cc to 35adce9
Example from the linked issue, which is much faster now with
Feel free to re-add me as a reviewer once we've worked in yesterday's offline discussions around using Arrow!
Ready for another pass, thanks!
Any quick numbers for speedup to expect here?
daft/dataframe/dataframe.py
Outdated
) -> Iterator[Dict[str, Any]]:
"""Return an iterator of rows for this dataframe.

Each row will be a Python dictionary of the form { "key" : value, ... }. If you are instead looking to iterate over
entire partitions of data, see: :meth:`df.iter_partitions() <daft.DataFrame.iter_partitions>`.

By default, Daft will convert the columns to Python lists for easy consumption. However, for nested data such as List or Struct arrays, this can be expensive.
Maybe add some comments here also about how we determine the appropriate Python types?
I think for example tensor type gets converted to numpy arrays. Not sure if there is special handling for other logical types.
Sounds good!
Arrow to numpy: 0.0013179779052734375 seconds
Addresses: #3634
Add an option to iter_rows to decide the format of columns during iteration: either Python (default), Arrow, or NumPy.
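
To illustrate why skipping the conversion to Python lists can help for nested data, here is a minimal, standard-library-only sketch. It is not Daft's implementation: `iter_rows_sketch`, the `"view"` format name, and the `(values, offsets)` column layout are all hypothetical stand-ins, with a `memoryview` slice playing the role of an unconverted Arrow value.

```python
from array import array
from typing import Any, Dict, Iterator, Tuple

# Hypothetical Arrow-like list column: one flat values buffer plus offsets.
# Row i holds values[offsets[i]:offsets[i + 1]].
ListColumn = Tuple[array, array]  # (values, offsets)


def iter_rows_sketch(
    columns: Dict[str, ListColumn],
    column_format: str = "python",
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per row from list-typed columns (illustrative only)."""
    if column_format not in ("python", "view"):
        raise ValueError(f"unknown column_format: {column_format!r}")
    num_rows = min(
        (len(offsets) - 1 for _, offsets in columns.values()), default=0
    )
    # Wrap each values buffer once so per-row slices are zero-copy views.
    views = {name: memoryview(values) for name, (values, _) in columns.items()}
    for i in range(num_rows):
        row: Dict[str, Any] = {}
        for name, (_, offsets) in columns.items():
            cell = views[name][offsets[i]:offsets[i + 1]]
            # "python" copies every element into a new list; "view" does not.
            row[name] = cell.tolist() if column_format == "python" else cell
        yield row


# Three rows of nested data: [1.0, 2.0], [], [3.0, 4.0, 5.0].
scores = (array("d", [1.0, 2.0, 3.0, 4.0, 5.0]), array("q", [0, 2, 2, 5]))
python_rows = list(iter_rows_sketch({"scores": scores}))
view_rows = list(iter_rows_sketch({"scores": scores}, column_format="view"))
```

In the `"python"` path every cell is copied into a fresh list, while the `"view"` path only slices an existing buffer, which is the kind of per-row work a non-default column format lets callers avoid.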