Cross-validation for imbalanced label case #4

Open
jjc2718 opened this issue Aug 4, 2020 · 2 comments
Labels
bug Something isn't working

Comments

@jjc2718
Contributor

jjc2718 commented Aug 4, 2020

If labels are highly imbalanced (for example, TP53 in ovarian cancer), the ROC calculation can break because some cross-validation splits will contain only one class.

Maybe using StratifiedKFold instead of standard k-fold CV is the best solution here?
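To illustrate the difference, here is a minimal sketch using scikit-learn's `KFold` and `StratifiedKFold`. The labels are a made-up toy vector (4 positives out of 100 samples), not the actual TP53/OV data, and `positives_per_test_fold` is a hypothetical helper written only for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels: 4 positives out of 100 samples (not real TCGA data).
y = np.array([1] * 4 + [0] * 96)
X = np.zeros((len(y), 1))  # features do not affect how the folds are drawn


def positives_per_test_fold(splitter):
    """Count the positive labels in each test fold produced by `splitter`."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


# Standard k-fold: positives land wherever the shuffle puts them,
# so a test fold may end up with zero positives.
plain = positives_per_test_fold(KFold(n_splits=4, shuffle=True, random_state=0))

# Stratified k-fold: each class is spread as evenly as possible across folds.
stratified = positives_per_test_fold(
    StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
)

print("KFold positives per test fold:          ", plain)
print("StratifiedKFold positives per test fold:", stratified)
```

With 4 positives and 4 folds, the stratified splitter puts exactly one positive in every test fold, so both classes are always present. With plain `KFold` that is not guaranteed, and any test fold with no positives leaves the ROC undefined.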

@jjc2718
Contributor Author

jjc2718 commented Sep 23, 2020

After thinking about this, I'm not planning to stratify cross-validation folds by label. If there are so few positive labels that some splits contain only one class by chance (e.g. the TP53/OV case described above), I don't think we'll be able to train effective models anyway because of the extreme label imbalance.

In general, I think there are downsides to stratifying by label (see, e.g. this CrossValidated post or this one). I want to make sure our cross-validation is as representative of external datasets as it can be (some of which may have different label proportions than TCGA), and generating CV folds randomly many times seems like a better way to evaluate generalization than forcing every test dataset to have the same label proportion.
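As a rough way to see how often random folds end up with a single class, here is a small standard-library simulation. It is only a sketch: the function name `single_class_fold_rate`, the sample counts, and the fold count are assumptions chosen for illustration, not values from this project.

```python
import random


def single_class_fold_rate(n_samples, n_positive, n_splits=4, n_repeats=1000, seed=0):
    """Estimate the fraction of random CV folds that contain only one class.

    Labels are reshuffled `n_repeats` times; each shuffle is cut into
    `n_splits` near-equal folds by taking every n_splits-th element.
    """
    rng = random.Random(seed)
    labels = [1] * n_positive + [0] * (n_samples - n_positive)
    single_class = 0
    for _ in range(n_repeats):
        rng.shuffle(labels)
        for k in range(n_splits):
            fold = labels[k::n_splits]
            if len(set(fold)) == 1:  # only positives or only negatives
                single_class += 1
    return single_class / (n_repeats * n_splits)


# Hypothetical scenarios: 100 samples with 2, 10, or 50 positives.
for n_pos in (2, 10, 50):
    rate = single_class_fold_rate(n_samples=100, n_positive=n_pos)
    print(f"{n_pos:>2} positives: {rate:.3f} of folds have a single class")
```

With only 2 positives and 4 folds, at least half of the folds must lack a positive in every repeat, whereas a balanced label set almost never produces a single-class fold.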

I may revisit this in the future, but closing for now.

@jjc2718
Contributor Author

jjc2718 commented Oct 19, 2020

Reopening this in light of #31 (comment). I think stratification by label may be the best solution to the issue described there, but I need to think about it a bit.

@jjc2718 jjc2718 reopened this Oct 19, 2020