DISCUSSION: Propose topics for pandas tutorials #140

Open
datapythonista opened this issue Aug 19, 2019 · 3 comments

@datapythonista
Member

In the pandas documentation, we would like to add tutorials that cover end-to-end, real-world use cases of pandas. This should make things much easier for first-time users trying to address a specific problem with pandas.

Based on my personal experience, these are the kinds of problems I usually address:

  • Exploratory analysis of a dataset to answer specific questions
  • Build a pipeline transforming one or more data sources to generate an output (for example to train a machine learning model)
  • Forecasting of time series (for example stock market data)
  • Preprocessing of textual data (for most NLP problems there are surely better tools like nltk, but pandas can be more flexible for some cases)

I'm sure people are doing other cool things with pandas. It would be great to brainstorm and find more use cases that are worth a tutorial.

@galuhsahid
Member

I'm thinking of something along the lines of data cleansing: we can start with a real-world example of a messy dataset (with duplicated rows, missing values, unnecessary columns/rows...) and end up with a tidy one. I think this could be useful for people who are using pandas to clean their dataset, especially when the data gets too large for their software to handle and ends up slowing down their process.

However I can imagine that there are many ways to define what a messy dataset is, and since we're looking to address a specific problem, we might end up trying to solve too many problems at once.

I did run a workshop on this topic (notebook here, though it's in Indonesian), and we covered duplicated rows, missing values, removing columns/rows, and renaming columns on one real-world dataset.
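
To make the scope concrete, here is a minimal sketch of those four steps on a tiny made-up table. The data, the column names, and the choice to drop (rather than fill) missing values are all assumptions for illustration; they are not taken from the workshop notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: one duplicated row, one missing value,
# one column that carries no information, and unclear column names.
raw = pd.DataFrame(
    {
        "Nama Kota": ["Jakarta", "Jakarta", "Bandung", "Surabaya"],
        "Populasi": [10.5, 10.5, np.nan, 2.9],
        "Catatan": ["-", "-", "-", "-"],
    }
)

tidy = (
    raw.drop_duplicates()                     # remove fully duplicated rows
    .drop(columns=["Catatan"])                # remove an unnecessary column
    .rename(columns={"Nama Kota": "city", "Populasi": "population_millions"})
    .dropna(subset=["population_millions"])   # drop rows with a missing value
    .reset_index(drop=True)                   # renumber the remaining rows
)

print(tidy)
```

A real tutorial would probably need to discuss when filling missing values (for example with `fillna`) is a better choice than dropping rows.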

Would love to hear all your thoughts on whether this use case is worth having a tutorial or not. Looking forward to discussing all other use cases as well.

@WuraolaOyewusi
Contributor

@datapythonista In text preprocessing, pandas plays a big role in giving some structure to the data. It's blissful to simply apply functions along columns.
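
As a rough illustration of applying functions along a column, the sketch below cleans a text column with the `.str` accessor and then applies a plain Python function with `Series.apply`. The sample sentences, the column names, and the `count_tokens` helper are hypothetical, and the cleaning rules are deliberately simplistic.

```python
import pandas as pd

# Hypothetical sample texts, made up for illustration.
docs = pd.DataFrame(
    {"text": ["  Hello, World!  ", "pandas IS great...", "Text   preprocessing?"]}
)


def count_tokens(sentence: str) -> int:
    """Count whitespace-separated tokens (a hypothetical helper)."""
    return len(sentence.split())


docs["clean"] = (
    docs["text"]
    .str.lower()                                # normalize case
    .str.replace(r"[^a-z\s]", "", regex=True)   # keep only letters and whitespace
    .str.split()                                # split on runs of whitespace
    .str.join(" ")                              # rejoin with single spaces
)

# Apply an ordinary Python function to every value in the column.
docs["n_tokens"] = docs["clean"].apply(count_tokens)

print(docs[["clean", "n_tokens"]])
```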

@galuhsahid I think it's a good idea to use a real-world dataset, and the use case is worth it from my perspective.

@sara-02
Member

sara-02 commented Aug 20, 2019

I agree that an end-to-end tutorial is always better.
Some examples of end-to-end tutorials that I have presented to college students:
1: https://github.com/sara-02/pradarshan/blob/master/FWD_17_intro_to_pandas.ipynb
2: https://github.com/sara-02/pradarshan/blob/master/pandas_basic/py6.ipynb

Also, as mentioned by @WuraolaOyewusi, showing pandas on text preprocessing would be another good use case. Most pandas tutorials we see cover numerical analysis, so a text analysis tutorial would be a plus.
