Using group_initial_split() with small group will fail even if adjusting the prop parameter?
#534
Comments
From the docs:
while you are
Since groups as a whole get allotted to training or testing, they can't all be represented in the test set, otherwise there would be no observations left for the training set.

Stratification (as opposed to grouped resampling) aims to ensure that the proportion of each group is the same in the training and testing set as it is in the full dataset. So if you have a small group and want a training and testing set which both contain all groups, including that small group, stratification is typically what you want to use. This can be done with the `strata` argument.

Does this help?

```r
library(rsample)
set.seed(123)
dat <- data.frame(
  group = sample(LETTERS[1:4], prob = c(0.3, 0.3, 0.3, 0.1), replace = TRUE, size = 1000),
  x = rnorm(1000)
)

# proportion of each group in the data
table(dat$group) / nrow(dat)
#>
#>     A     B     C     D
#> 0.296 0.301 0.311 0.092

dat_split <- initial_split(dat, strata = "group", prop = 0.75)
dat_train <- training(dat_split)
dat_test <- testing(dat_split)

# preserved proportions
table(dat_train$group) / nrow(dat_train)
#>
#>          A          B          C          D
#> 0.29906542 0.29773031 0.30841121 0.09479306
table(dat_test$group) / nrow(dat_test)
#>
#>          A          B          C          D
#> 0.28685259 0.31075697 0.31872510 0.08366534

# what the prop argument does
nrow(dat_train) / nrow(dat)
#> [1] 0.749
```

Created on 2024-09-12 with reprex v2.1.0
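To see why a grouped split cannot hit an arbitrary `prop` with this data, it can help to list the training proportions that are reachable at all when whole groups go to one side. The following is a minimal base R sketch, not rsample code. The group counts are read off the proportions printed for the full data (n = 1000), and the names `group_sizes`, `subsets`, and `train_prop` are made up for this illustration.

```r
# Group counts implied by the printed proportions (n = 1000).
group_sizes <- c(A = 296, B = 301, C = 311, D = 92)
n <- sum(group_sizes)

# Enumerate every way to put whole groups into the training set,
# leaving out the two degenerate cases (no groups / all groups).
groups <- names(group_sizes)
subsets <- unlist(
  lapply(1:(length(groups) - 1), function(k) combn(groups, k, simplify = FALSE)),
  recursive = FALSE
)

# Training proportion for each combination of groups.
train_prop <- vapply(subsets, function(s) sum(group_sizes[s]) / n, numeric(1))
names(train_prop) <- vapply(subsets, paste, character(1), collapse = "+")

# All reachable training proportions, smallest to largest.
sort(train_prop)
```

With these counts there are only 14 reachable values. The largest is A+B+C at 0.908, and the one closest to 0.75 is B+C+D at 0.704, so a grouped split can only approximate the requested proportion.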
Hi @hfrick, thanks a lot for the answer. Sorry, that last statement was a bit misleading (I meant that by running K times, I want to have one group in the test sample each time), so I removed that part. The main question remains: how come, having one group with frequency 0.1, setting

Thanks!
Ah, I see. Thanks for clarifying! I would say this could be loosely answered with "the error happens because we are sampling, not optimizing". In your example, we have 4 groups with one group about the size of the test set. So a grouped split with

If you increase the number of attempts in your last illustration, you should be able to see it move towards 0.75.
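The "sampling, not optimizing" point can be illustrated with a small base R simulation. This is a sketch under a loudly stated assumption: groups are shuffled at random and whole groups are added to the training set until the training proportion reaches `prop`. That scheme is a simplification chosen for illustration and is not taken from rsample's source. The helper `test_set_is_empty()` and the group counts (read off the example proportions, n = 1000) are likewise hypothetical.

```r
set.seed(1)

# Group counts implied by the example proportions (n = 1000).
group_sizes <- c(A = 296, B = 301, C = 311, D = 92)
n <- sum(group_sizes)
prop <- 0.9

# ASSUMPTION: simplified scheme for illustration, not rsample's implementation.
# Shuffle the groups, then add whole groups to the training set until the
# training proportion reaches `prop`. Returns TRUE when every group was needed,
# i.e. nothing is left over for the test set.
test_set_is_empty <- function() {
  shuffled <- sample(group_sizes)
  cum_prop <- cumsum(shuffled) / n
  n_train_groups <- which(cum_prop >= prop)[1]
  n_train_groups == length(group_sizes)
}

# Share of random draws that leave the test set empty.
empty_rate <- mean(replicate(10000, test_set_is_empty()))
empty_rate
```

Under this assumed scheme, the test set is non-empty only when the small group D happens to be drawn last, which is 1 of 4 equally likely positions, so about three quarters of the draws leave nothing for testing. A random draw does not search for the one assignment (A+B+C for training, D for testing) that matches `prop = 0.9`.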
The problem
Summary: `group_initial_split()` fails often with small-frequency groups even if adjusting `prop` to reflect the small-frequency group?

I'm using `group_initial_split()` with a small number (4) of groups. As I have one group with low frequency (10%), my intuition was that by setting `prop=0.9`, this group would be selected within the training sample. However, I very often (around 70% of the time) get error messages such as:

How come this happens even if I adjusted `prop`? This fails even if I use the exact proportion of the group (1-freq(small_group))!? Am I misunderstanding the `prop` argument?

Thanks!
Reproducible example
Created on 2024-09-08 with reprex v2.1.1