Hi Max,
Some colleagues observed that caret::train() with method = "cubist" errors when factor values in the predictors contain certain special characters; the error traces back to Cubist::cubist().
I really like Cubist because of its speed and its straightforward way of interpreting results. Thanks a lot for the energy you invested in this nice and clean R port!
I thought I'd have a look into the issue to figure out a possible solution. Below is some testing of standard ASCII characters, some of which have special roles in Rulequest Cubist, and of non-ASCII umlauts, to diagnose the errors, followed by a suggestion for resolving part of the issue:
################################################################################
## Description: Special characters are not correctly escaped for values in
## Cubist data file
################################################################################
library("mlbench")
library("Cubist")
#> Warning: package 'Cubist' was built under R version 3.4.4
#> Loading required package: lattice
library("tidyverse")
#> ── Attaching packages ──────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.8
#> ✔ tidyr   0.8.2     ✔ stringr 1.3.1
#> ✔ readr   1.1.1     ✔ forcats 0.2.0
#> Warning: package 'ggplot2' was built under R version 3.4.4
#> Warning: package 'tibble' was built under R version 3.4.3
#> Warning: package 'tidyr' was built under R version 3.4.4
#> Warning: package 'purrr' was built under R version 3.4.4
#> Warning: package 'dplyr' was built under R version 3.4.4
#> Warning: package 'stringr' was built under R version 3.4.4
#> ── Conflicts ─────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
# Example data set
data(BostonHousing)
# Test with only 2 factorial predictors
boston_housing <- as.tibble(BostonHousing[, c("crim", "zn", "medv")])
# Convert numeric `crim` and `zn` to factors
boston_housing <- boston_housing %>%
  mutate(
    zn = as.factor(zn),
    crim = as.factor(crim)
  )
#> Warning: package 'bindrcpp' was built under R version 3.4.4
## See https://www.rulequest.com/cubist-unix.html for exceptions:
# "What's in a name?
## Special characters (comma, colon, period, vertical bar `|') can appear in
## names and values, but must be prefixed by the escape character `\'.
## For example, the name "Filch, Grabbit, and Co." would be written as `Filch\,
## Grabbit\, and Co\.'. (However, it is not necessary to escape colons in times
## and periods in numbers.)"
# Test (1) ASCII, no umlaut ----------------------------------------------------
# Recode factor levels
(boston_housing_chars <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$", .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$    a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Fine, works
(mod_housing_chars <-
  cubist(x = boston_housing_chars[, -c(3)], y = boston_housing_chars$medv,
    committees = 10))
#>
#> Call:
#> cubist.default(x = boston_housing_chars[, -c(3)], y
#>  = boston_housing_chars$medv, committees = 10)
#>
#> Number of samples: 506
#> Number of predictors: 2
#>
#> Number of committees: 10
#> Number of rules per committee: 31, 28, 28, 26, 27, 25, 27, 24, 30, 24
# Test (2) Umlaut "ä" ----------------------------------------------------------
# Recode factor levels
(boston_housing_umlaut <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$ä", .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$ä   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Error
mod_housing_umlaut <-
  cubist(x = boston_housing_umlaut[, -c(3)], y = boston_housing_umlaut$medv,
    committees = 10)
#> cubist code called exit with value 1
#> Error in strsplit(tmp, "\"")[[1]]: subscript out of bounds
# Test (3) comma "," -----------------------------------------------------------
# Recode factor levels
(boston_housing_comma <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$,", .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$,   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Error
(mod_housing_comma <-
  cubist(x = boston_housing_comma[, -c(3)], y = boston_housing_comma$medv,
    committees = 10))
#> cubist code called exit with value 1
#> Error in strsplit(tmp, "\"")[[1]]: subscript out of bounds
# Test (4) period "." ----------------------------------------------------------
# Recode factor levels
(boston_housing_period <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$.", .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$.   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Works
(mod_housing_period <-
  cubist(x = boston_housing_period[, -c(3)], y = boston_housing_period$medv,
    committees = 10))
#>
#> Call:
#> cubist.default(x = boston_housing_period[, -c(3)], y
#>  = boston_housing_period$medv, committees = 10)
#>
#> Number of samples: 506
#> Number of predictors: 2
#>
#> Number of committees: 10
#> Number of rules per committee: 31, 28, 28, 26, 27, 25, 27, 24, 30, 24
# Test (5) colon ":" -----------------------------------------------------------
# Recode factor levels
(boston_housing_colon <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$:", .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$:   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Error
(mod_housing_colon <-
  cubist(x = boston_housing_colon[, -c(3)], y = boston_housing_colon$medv,
    committees = 10))
#> cubist code called exit with value 1
#> Error in strsplit(tmp, "\"")[[1]]: subscript out of bounds
# Test (6) vertical bar "|" ----------------------------------------------------
# Recode factor levels
(boston_housing_bar <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$|", .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$|   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Error
(mod_housing_bar <-
  cubist(x = boston_housing_bar[, -c(3)], y = boston_housing_bar$medv,
    committees = 10))
#> cubist code called exit with value 1
#> Error in strsplit(tmp, "\"")[[1]]: subscript out of bounds
# Test (7) semicolon ";" -------------------------------------------------------
# Recode factor levels
(boston_housing_semicol <- boston_housing %>%
  mutate(
    # Recode respective level of first value
    crim = recode(crim, `0.00632` = "a@_$;", .default = levels(crim)),
    zn = recode(zn, `18` = "a@_$?", .default = levels(zn))
  ) %>%
  rename(`zn|d` = zn))
#> # A tibble: 506 x 3
#>    crim    `zn|d`  medv
#>    <fct>   <fct>  <dbl>
#>  1 a@_$;   a@_$?   24.0
#>  2 0.02731 0       21.6
#>  3 0.02729 0       34.7
#>  4 0.03237 0       33.4
#>  5 0.06905 0       36.2
#>  6 0.02985 0       28.7
#>  7 0.08829 12.5    22.9
#>  8 0.14455 12.5    27.1
#>  9 0.21124 12.5    16.5
#> 10 0.17004 12.5    18.9
#> # ... with 496 more rows
# Error
(mod_housing_semicol <-
  cubist(x = boston_housing_semicol[, -c(3)], y = boston_housing_semicol$medv,
    committees = 10))
#> cubist code called exit with value 1
#> Error in strsplit(tmp, "\"")[[1]]: subscript out of bounds
Based on the errors above, escaping does not work for the following characters: ",", ":", ";", "|", and "ä". However, "." works fine. I was quite surprised, because according to the Rulequest info page, escaping should work for comma, colon, period, and vertical bar. Here is the output from the current escaping helper:
My guess is that "." is not a problem because C Cubist parses the values in the data file correctly due to separation by comma, and escaping has no effect.
I made a commit in my fork that fixes part of the issue. The new escapes() helper only escapes ",", ":", and "|". This resolves the problem with umlaut parsing (no fixed = TRUE in gsub()), so Cubist now works when factor variables contain umlauts. The change also lets Cubist::cubist() run successfully when factor values contain a semicolon ";", but unfortunately not for the remaining special characters: I was not able to figure out how to get escaping of ",", ":", and "|" working.
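To make the idea concrete, here is a minimal sketch of what escaping only ",", ":", and "|" on the R side could look like. The function name `escape_cubist_chars()` is hypothetical; this is not the code from the Cubist package or from the fork, only an illustration of prefixing the reserved characters with a backslash as the Rulequest page describes. As the tests above show, producing the escaped strings does not by itself guarantee that the C code accepts them.

```r
# Hypothetical helper for illustration only (not the package's escapes()).
# Prefixes each reserved character with a backslash.
escape_cubist_chars <- function(x, chars = c(",", ":", "|")) {
  for (ch in chars) {
    # fixed = TRUE matches `ch` literally, so "|" is not treated as
    # regex alternation; the replacement is also inserted literally.
    x <- gsub(ch, paste0("\\", ch), x, fixed = TRUE)
  }
  x
}

# Characters outside `chars` (e.g. ";" or the umlaut) are left untouched.
escape_cubist_chars(c("Filch, Grabbit, and Co", "a@_$|", "a@_$;", "a@_$\u00e4"))
```

Looping over single characters with fixed = TRUE avoids having to regex-escape "|" or ".", at the cost of one pass over the vector per character.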
I have no experience in C (yet). Do you have any ideas why escaping fails for these reserved Cubist characters? Or is it an issue in the original Rulequest source code and escaping for values is not supported, despite being mentioned in the Rulequest overview webpage? Maybe this is also locale specific and depends on the encoding conversion between C Cubist files and R objects.
Would be great to fully support escaping, because these special characters are quite common. If there is no easy solution, I think it would be helpful to include checks in Cubist::cubist() and let it error with an informative message when these characters are in factors or character columns of the predictor data frame.
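Such a check could look roughly like the sketch below. The name `check_reserved_chars()` and the default character set are my assumptions, not anything that exists in Cubist; the set would need to match whatever the C code really cannot handle.

```r
# Hypothetical pre-flight check (not part of Cubist): stop with an
# informative message if factor or character predictors contain
# characters that the Cubist data file format reserves.
check_reserved_chars <- function(x, chars = c(",", ":", "|")) {
  stopifnot(is.data.frame(x))
  bad_cols <- character(0)
  for (nm in names(x)) {
    col <- x[[nm]]
    if (!is.factor(col) && !is.character(col)) next
    # For factors it is enough to inspect the levels.
    values <- if (is.factor(col)) levels(col) else unique(col)
    has_hit <- vapply(
      chars,
      function(ch) any(grepl(ch, values, fixed = TRUE)),
      logical(1)
    )
    if (any(has_hit)) bad_cols <- c(bad_cols, nm)
  }
  if (length(bad_cols) > 0) {
    stop(
      "Predictor column(s) ", paste0("'", bad_cols, "'", collapse = ", "),
      " contain values with reserved characters (",
      paste(chars, collapse = " "), "); please recode them first.",
      call. = FALSE
    )
  }
  invisible(x)
}
```

Calling it on the predictors before fitting, e.g. `check_reserved_chars(boston_housing_comma[, -c(3)])`, would then fail early with the offending column names instead of the `strsplit()` subscript error.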
Thanks for your help, looking forward to your insight into this issue.
Cheers,
Philipp
Or is it an issue in the original Rulequest source code and escaping for values is not supported, despite being mentioned in the Rulequest overview webpage?
Yeah, it's their original limitation.
Do you want to PR or should I just make the change?
Hi @topepo, sorry, I only just saw this; it's been a long time. If you have time to just make the change, that would be great! Otherwise, if this is an original limitation, I guess a note in the README would also suffice. Thanks for making this package. Cheers