Further improve DataFrame Column Statistics (see PyCharm) #3600

Open
Julian-J-S opened this issue Jan 28, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@Julian-J-S

Description

It is super helpful to quickly see meaningful column statistics in the "DataFrame output",
such as min, max, mean, median, percentiles, distinct values, null count, frequency, top values, distribution, ...

The (annoying) alternative is to have multiple (temporary) cells for looking at different column stats.
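
As a rough illustration of the kind of per-column summary meant here, the following sketch uses only the Python standard library. The function name `column_stats`, its `top_n` parameter, and the sample columns are made up for this example; it is not how marimo or PyCharm compute their statistics.

```python
from collections import Counter
from statistics import mean, median, quantiles


def column_stats(values, top_n=3):
    """Summarize one column given as a list; None marks a missing value."""
    present = [v for v in values if v is not None]
    stats = {
        "count": len(values),
        "null_count": len(values) - len(present),
        "distinct": len(set(present)),
        # Most frequent values together with their counts.
        "top": Counter(present).most_common(top_n),
    }
    # Numeric columns additionally get range, center, and quartiles.
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        stats.update(
            min=min(present),
            max=max(present),
            mean=mean(present),
            median=median(present),
        )
        if len(present) >= 2:
            # statistics.quantiles() needs at least two data points.
            stats["quartiles"] = quantiles(present, n=4)
    return stats


# Hypothetical sample columns, one numeric and one categorical.
price = [3.5, 1.0, None, 2.0, 3.5]
city = ["Berlin", "Paris", "Berlin", None, "Rome"]
price_stats = column_stats(price)
city_stats = column_stats(city)
```

The numeric column gets min/max/mean/median/quartiles on top of the counts, while the text column only gets counts and top values, which mirrors the idea of showing different statistics depending on the data type.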

Suggested solution

Marimo already has some column statistics available which is great(!) but there is some room for improvement.
Currently int/float columns only show a graph, while all others show only the null/unique count.

PyCharm does this very well imo! (see for example: https://blog.jetbrains.com/pycharm/2024/10/data-exploration-with-pandas/)

They have many different statistics depending on the data type.
They also have the option to choose the detail level.

Alternative

No response

Additional context

No response

@Julian-J-S Julian-J-S added the enhancement New feature or request label Jan 28, 2025
@mscolnick
Contributor

mscolnick commented Jan 28, 2025

Thanks for sharing. We calculate this information already (and I am pretty sure we do send it to the front end). We can surface this.

Does the column stats mode ('off', 'compact', 'detailed') apply to all dataframes? Is that essentially a user setting or a dataframe setting? When you close and reopen the notebook, is that setting persisted?

@Julian-J-S
Author

Thanks for listening :)

Just tested, and the mode is set individually per cell. Also, whenever a cell is rerun it is set back to "off" (which is bad design in my opinion).
I would prefer a global option for all statistics and maybe a cell level in addition if necessary which preserves the state when rerunning 😆
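
To make that preference concrete, here is a minimal sketch of a global mode with optional per-cell overrides that survive reruns. Everything in it (`StatsSettings`, `set_cell_mode`, `mode_for`, `rerun`, the cell ids) is hypothetical and only illustrates the lookup order; it is not an existing marimo or PyCharm API.

```python
from dataclasses import dataclass, field

MODES = ("off", "compact", "detailed")


@dataclass
class StatsSettings:
    """Hypothetical settings store: one global mode plus per-cell overrides."""

    global_mode: str = "compact"
    cell_overrides: dict = field(default_factory=dict)

    def set_cell_mode(self, cell_id, mode):
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        self.cell_overrides[cell_id] = mode

    def mode_for(self, cell_id):
        # A cell-level choice wins; otherwise fall back to the global mode.
        return self.cell_overrides.get(cell_id, self.global_mode)


settings = StatsSettings(global_mode="compact")
settings.set_cell_mode("cell-1", "detailed")


def rerun(cell_id):
    # Rerunning a cell only re-reads the stored mode; it never resets it.
    return settings.mode_for(cell_id)
```

With this layout, rerunning `cell-1` keeps "detailed", while cells without an override simply follow whatever the global mode currently is.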

Here is a simple screenshot of what it looks like in PyCharm Professional:

Image

Aside from the statistics, PyCharm and Databricks notebooks also have amazing built-in visualization support (Databricks also has tabs to create multiple visuals per DataFrame in a compact and easy way).
If you are interested, I can open another issue and explain in more detail what I mean and would love to see 😉

@mscolnick
Contributor

Thanks for the additional info.

> If you are interested, I can open another issue and explain in more detail what I mean and would love to see 😉

Yes, that would be great, thank you.
