Summarize categorical columns #23

Open
iaindillingham opened this issue May 27, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@iaindillingham
Member

From @andrewscolm. Thanks, Colm 🙂

If we implemented #22, then we would struggle to summarize counts for each category, because categorical columns differ in how many categories they have. However, we could summarize the number of unique values and the number of missing values. As @wjchulme says about the number of unique values:

Usually because if it's 1, you know something has probably gone wrong. But just in general if it's lower/higher than expected

Do we also need to summarize counts for each category?
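
For illustration, the two summaries mentioned above (number of unique values and number of missing values) could be sketched roughly as follows. This is plain Python, not the project's actual code; the function name `summarize_categorical`, the example column, and the use of `None` to mark a missing value are all assumptions made for the sketch.

```python
def summarize_categorical(values):
    """Count unique non-missing values and missing values in one column.

    Assumption: a missing value is represented by None.
    """
    present = [v for v in values if v is not None]
    return {
        "n_unique": len(set(present)),
        "n_missing": len(values) - len(present),
    }


# Hypothetical column, for illustration only
sex = ["F", "M", None, "F", "M", "F"]
summary = summarize_categorical(sex)
print(summary)
```

An `n_unique` of 1 here would be the "something has probably gone wrong" signal described above.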

@andrewscolm

Thanks @iaindillingham, would you be able to implement something like the 'top_counts' column in Will's example? It summarizes the counts for a maximum of 4 categories. This would be really helpful to see if there have been any major mistakes.

If that isn't possible then, as Will stated, the number of unique values is still a useful insight.
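
A `top_counts`-style summary like the one requested above might look something like this minimal sketch. It is not taken from Will's example; the function name, the tie-breaking rule (alphabetical by label), the example column, and the use of `None` for missing values are assumptions. Only the limit of 4 categories comes from the comment above.

```python
from collections import Counter


def top_counts(values, max_categories=4):
    """Return the most frequent categories as (category, count) pairs.

    Missing values (assumed to be None) are ignored. Ties are broken
    alphabetically by label so that the output is deterministic.
    """
    counts = Counter(v for v in values if v is not None)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return ranked[:max_categories]


# Hypothetical column, for illustration only
region = [
    "London", "East", "London", None, "North West",
    "East", "London", "South West", "Yorkshire",
]
top = top_counts(region)
print(", ".join(f"{category}: {count}" for category, count in top))
```

Because every column yields at most 4 pairs, the result fits in a single cell of a one-row-per-variable table regardless of how many categories the column has.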

@wjchulme

wjchulme commented Jun 8, 2022

Not sure if this was clear before, but for categoricals a table of counts, a la cohort-report, is often still really useful.

So both a single-row-per-variable format to have an overview of the entire dataset (split by variable type) and count tabulation for relevant categorical variables would be useful. Could also simplify things by just tabulating all variables with fewer than ~20 unique values, to avoid eg STPs or MSOAs being tabulated and to ensure categorical-as-int variables are still included. These tables would live in a separate document.

Obv with redaction!
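
The tabulation rule described above (tabulate every variable with fewer than ~20 unique values, and redact) could be sketched along these lines. Everything here is hypothetical: the function name, the example columns, and especially the redaction rule, which simply hides counts below an assumed threshold of 6. The thread does not specify a redaction policy, and a real one would likely need more than this.

```python
from collections import Counter

MAX_UNIQUE = 20    # "fewer than ~20 unique values", per the comment above
REDACT_BELOW = 6   # ASSUMPTION: hypothetical small-number threshold


def tabulate_columns(columns, max_unique=MAX_UNIQUE, redact_below=REDACT_BELOW):
    """Build a count table for each column with fewer than max_unique values.

    Columns are selected by cardinality rather than by type, so
    categorical-as-int columns are included and high-cardinality
    columns are skipped. Missing values are assumed to be None.
    """
    tables = {}
    for name, values in columns.items():
        counts = Counter(v for v in values if v is not None)
        if len(counts) >= max_unique:
            continue  # too many categories to tabulate usefully
        tables[name] = {
            category: (count if count >= redact_below else "[REDACTED]")
            for category, count in sorted(counts.items(), key=lambda kv: str(kv[0]))
        }
    return tables


# Hypothetical data, for illustration only
columns = {
    "sex": ["F"] * 8 + ["M"] * 7 + ["I"] * 2,
    "imd": [1, 2, 3, 4, 5] * 6,                     # categorical-as-int
    "msoa": [f"E{n:08d}" for n in range(30)],       # placeholder codes
}
tables = tabulate_columns(columns)
for name, table in tables.items():
    print(name, table)
```

In this sketch `msoa` is skipped because it has 30 unique values, `imd` is tabulated even though it is stored as integers, and the small `I` count in `sex` is redacted.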

@iaindillingham added the enhancement label Jan 4, 2024