-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Adding Narwhals as compatibility layer between libraries #339
base: main
Are you sure you want to change the base?
Conversation
Thank you for making this pull request. Did you know? You can try it on Binder: . Also, the version of ITables developed in this PR can be installed with
(this requires |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, thanks for starting this 🙏 !
I could eliminate from lines 88 to 114 from file datatables_format.py and handle all with Narwhals. But the results would be different for some cases in test_databases_format.py. I can elaborate...
sure, please do!
src/itables/datatables_format.py
Outdated
# Polars DataFrame | ||
df = nw.from_native(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably use eager_only=True
if we're calling rows
src/itables/datatables_format.py
Outdated
# Polars DataFrame | ||
df = nw.from_native(df) | ||
data = list(df.iter_rows()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive-by - this would be more efficient as data = df.rows()
(it produces the same thing, just faster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the PR for this. Thanks for the feedback.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #339 +/- ##
==========================================
- Coverage 94.76% 93.20% -1.56%
==========================================
Files 24 24
Lines 1280 1281 +1
==========================================
- Hits 1213 1194 -19
- Misses 67 87 +20 ☔ View full report in Codecov by Sentry. |
Thank you @DeaMariaLeon for you PR, and thanks @MarcoGorelli for reviewing it. I am eager to test it! (today or tomorrow hopefully) A few comments:
Thanks again for working on this! |
@mwouts, thanks for the feedback.
Ok. I'll share how it would look like if a path for pandas is not needed.
You don't have to be sorry. 🙂 In case the "elaboration" is still useful, (#339 (review)), here it is: When running So before adding Narwhals:
|
Just to be clear @mwouts: if we add Narwhals to handle all the libraries (without keeping the pandas' path separately), all the test results will be like Polars'. |
Yes, that would make sense. The reason why you see only 6 digits for Pandas DataFrames is that ITables does use the pandas formatters. That's an attempt to give a rendering that is similar to what the users get when they don't use ITables. Of course I could think of other ways, e.g. apply the formatting on the Javascript side directly (but then passing numbers with a higher precision would also make the notebooks heavier). Maybe it's less impactful for now to leave the Pandas part as is? We might have enough to do for now with the estimated size and the downsampling function, plus testing all the various table types. Are you able to run the documentation notebooks? See this paragraph and the next one. |
It seems that all the Jupyter notebook examples run correctly, @mwouts. I run the ones in the link "paragraphs" you mentioned, and the ones in the Advance parameters. But I don't know how to share them here - export to pdf didn't work and Jupiter files can't be attached. |
Hi @DeaMariaLeon , yes sure, it's more for you to see how changes you make might change how the dataframes are rendered - no need to share the notebooks, I can checkout your branch and rerun them locally. Re your PR, I would love to see it work for actual modin dataframes, see my PR #341 . It currently works for the dataframes that don't need to be downsampled. The downsampling is very important for ITables because the table data gets embedded in the notebook (I don't have anything like a server mode at the moment), and if we are not careful about this we can make the notebook huge and crash the web browser. To make the downsampling work (https://github.com/mwouts/itables/blob/main/src/itables/downsample.py) I would need to figure out to do this with
Let me know if you have any hints re the above! Thanks! |
Hi @mwouts, we'll be adding "estimated_size" to Narwhals in order to implement downsampling. |
Hi @mwouts, we have now I just saw on the PR you experimented with Modin (the one on top of this one), you use series. I wonder if we should add that too. Our new function works for dataframes only. Should I open a separate PR to (try to) modify |
Thank you @DeaMariaLeon , that's awesome! That's in the last version of
Yes ITables user can pass a Series to
Yes thanks for offering I think I will do try on my end. I am looking at the narwhals documentation and I have found already |
Yes
Very good - you'll be much faster! |
Related to issue #325
If we add the proposed code, nothing changes when using a pandas dataframe. And the rest of the libraries can be handled with Narwhals. That's needed because Narwhals and Polars don't allow duplicated column names, so the pandas path has to be kept.
We also have to do this because on iTables (currently), the returned results from the function
datatables_row
are different when using Polars vs. pandas in some cases.So, with the proposed changes, this function would now work for:
Or this:
I could eliminate from lines 88 to 114 from file
datatables_format.py
and handle all with Narwhals. But the results would be different for some cases intest_databases_format.py
. I can elaborate...I would like to have some feedback before going further please. Thanks.
Friendly ping @MarcoGorelli