Could `make_strata()` warn (or remove the strata attribute) when only returning a single stratum? Or message when pooling at all?
#441
Labels: feature (a feature request or enhancement)
I was reminded about #438 by the GitHub lock bot, an issue where a user was surprised that `vfold_cv()` (and eventually `make_strata()`) "didn't stratify" (or rather, treated the data as only having one stratum) when the stratification variable only had one class above the pooling threshold.

I think rsample is doing the right thing here, and behaving as documented, but this behavior is still a bit surprising. Would it be possible for `make_strata()` to warn when it only returns a single stratum? I imagine this is almost always unintentional, as users wouldn't specify a stratification variable if they thought it would go unused.

Another consideration here is that, even if only one stratum is created, the rset objects still contain a `strata` attribute. As a result, when printed these objects claim that they were created "using stratification":

Created on 2023-07-27 with reprex v2.0.2

This might be a bit misleading, as the sampling here didn't depend on the `y` value at all. Would it make sense to drop the `strata` attribute if only one stratum is created?

Finally, would it make sense for the categorical branch of `make_strata()` to provide a message listing the categories that get "pooled" together, and which stratum they were pooled into? This might help users catch processing mistakes, if they weren't expecting to have any rare classes that would get automatically pooled. This might be too noisy though, and not as useful as warning about "single stratum" cases.
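
To make the two suggestions in this issue concrete (a warning for the single-stratum case and a message listing pooled classes), here is a minimal base R sketch. `check_strata_pooling()` is a hypothetical helper, not an rsample function. It assumes that classes whose proportion falls below `pool` are the ones that get pooled, and it does not try to reproduce how rsample reassigns those rows to strata.

```r
# Hypothetical helper (not part of rsample): report how a categorical
# stratification variable would be affected by a pooling threshold.
check_strata_pooling <- function(x, pool = 0.1) {
  x <- factor(x)
  # Proportion of rows in each class
  pcts <- table(x) / length(x)
  pooled <- names(pcts)[pcts < pool]
  kept <- names(pcts)[pcts >= pool]

  # Suggested message: list the classes that fall below the threshold
  if (length(pooled) > 0) {
    message(
      "Classes below the pooling threshold (", pool, "): ",
      paste(pooled, collapse = ", ")
    )
  }
  # Suggested warning: stratification would have no effect
  if (length(kept) <= 1) {
    warning(
      "At most one class is above the pooling threshold; ",
      "resampling would effectively be unstratified.",
      call. = FALSE
    )
  }
  invisible(list(kept = kept, pooled = pooled))
}

# Example: one dominant class and two rare ones (made-up data)
y <- rep(c("a", "b", "c"), times = c(95, 3, 2))
res <- check_strata_pooling(y)
```

With this made-up `y`, only class `a` clears the default threshold, so the call emits both the message and the warning. A balanced variable such as `rep(c("a", "b"), 50)` would trigger neither.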