H2O version, Operating System and Environment
Windows 10. R version 4.4.1. h2o R package version 3.44.0.3
Actual behavior
When I use h2o.shap_summary_plot to plot the output of a random forest model with only numeric variables, the normalized values of the numeric variables that are binary (0, 1) are not 0 and 1; they come out as ~0.5 and 1 (and so are plotted as purple and pink rather than blue and pink). But if I include a factor variable in my model, they get normalized to 0 and 1 (and are plotted as blue and pink). Numeric variables seem to be normalized differently depending on whether there are factor variables in the model.
Expected behavior
I would expect the binary variables to be treated the same regardless of what other variables are in the model.
Steps to reproduce
library(h2o)
h2o.init()
example <- data.frame(
NumericVar = rnorm(100, mean = 50, sd = 10), # Numeric variable (normal distribution)
BinaryVar = sample(c(0, 1), 100, replace = TRUE), # Binary variable (0, 1)
BinaryVar2 = sample(c(0, 1), 100, replace = TRUE) # Binary variable (0, 1)
)
example$CorrelatedVar = example$NumericVar * 0.8 + rnorm(100, mean = 0, sd = 5) # Add response variable
# run and plot model that contains numeric values only. The binary numeric variables don't get normalized to 0 and 1.
regressionMatrix <- as.h2o(example)
rfModel <- h2o.randomForest(training_frame = regressionMatrix,
y = "CorrelatedVar",
ntrees = 500,
mtries = 3,
sample_rate = 0.632,
min_rows = 2,
seed = 42,
max_depth = 20)
p1=h2o.shap_summary_plot(
model = rfModel,
newdata = regressionMatrix
)
p1
# change one variable to a factor, then run and plot the model. The remaining binary numeric variable does get normalized to 0 and 1
example$BinaryVar2=as.factor(example$BinaryVar2) # change one of the binary variables to a factor
regressionMatrix <- as.h2o(example)
rfModel <- h2o.randomForest(training_frame = regressionMatrix,
y = "CorrelatedVar",
ntrees = 500,
mtries = 3,
sample_rate = 0.632,
min_rows = 2,
seed = 42,
max_depth = 20)
p2=h2o.shap_summary_plot(
model = rfModel,
newdata = regressionMatrix
)
p2
Screenshots
"p1" plot - only numeric variables in the model. Both binary variables are not normalized to 0 and 1, more like 0.5 and 1
"p2" plot - "BinaryVar2" has been changed to a factor. Now the remaining numeric binary variable is normalized to 0 and 1
Why is this happening? How can I get the plot to properly normalize binary variables to be 0 and 1 even when I don't have features that are factors?
This looks like a bug. Thank you for reporting it!
Why is this happening?
We try to show the values of individual columns using a single color scheme. To make the plot more robust to outliers, we use the quantiles of the points instead of their actual values. This should be relatively robust for continuous values (a single outlier won't squeeze all the other points into one color). Another advantage is that you can, to some extent, compare values across columns: the same quantile gets the same color regardless of the actual value.
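To illustrate the quantile effect described above, here is a minimal base R sketch. It assumes the coloring is driven by the empirical CDF of each column, as in the `.uniformize` code later in this thread; the vector `x` is made up for illustration. For a binary column with equal numbers of 0s and 1s, the ECDF maps 0 to the share of zeros (0.5) and 1 to 1, whereas a plain min-max scaling maps them to 0 and 1.

```r
# Hypothetical binary column: five zeros and five ones
x <- c(0, 0, 1, 1, 0, 1, 0, 1, 1, 0)

# Quantile (ECDF) normalization: each value is replaced by the
# fraction of observations that are less than or equal to it
q <- stats::ecdf(x)(x)
q_zero <- unique(q[x == 0])  # 0.5 (half of the values are <= 0)
q_one  <- unique(q[x == 1])  # 1   (all of the values are <= 1)

# Min-max normalization for comparison: maps 0 -> 0 and 1 -> 1
mm <- (x - min(x)) / (max(x) - min(x))

print(c(q_zero = q_zero, q_one = q_one))
print(range(mm))
```

With an unbalanced column the lower value moves accordingly, for example to 0.3 if 30% of the rows are 0.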
How can I get the plot to properly normalize binary variables to be 0 and 1 even when I don't have features that are factors?
I would suggest using factors as the models might benefit from the information that the column contains discrete values.
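As a rough sketch of that suggestion, the helper below converts every numeric 0/1 column of a data frame to a factor before the data is passed to `as.h2o()`. The name `binary_to_factor` is hypothetical and not part of h2o; it assumes binary columns are coded exactly as 0 and 1.

```r
# Hypothetical helper (not part of h2o): turn numeric 0/1 columns into factors
binary_to_factor <- function(df) {
  is_binary <- vapply(
    df,
    function(col) {
      is.numeric(col) && !all(is.na(col)) && all(col %in% c(0, 1, NA))
    },
    logical(1)
  )
  # Convert one column at a time so the case with no binary columns is a no-op
  for (name in names(df)[is_binary]) {
    df[[name]] <- factor(df[[name]], levels = c(0, 1))
  }
  df
}

# Small made-up data frame to show the effect
demo <- data.frame(
  Flag  = c(0, 1, 1, 0),
  Score = c(1.5, 2.0, 3.2, 4.8)
)
converted <- binary_to_factor(demo)
str(converted)
```

In the reproduction script this would be applied as `regressionMatrix <- as.h2o(binary_to_factor(example))`, leaving the continuous columns untouched.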
But if you want to change how the values are normalized, you can use the following code. I changed it so that it doesn't use quantiles for columns with fewer than 32 unique values.
.uniformize <- function(col) {
  if (is.factor(col)) {
    return(.min_max(as.numeric(col) / nlevels(col)))
  }
  if (is.character(col) || all(is.na(col))) {
    if (is.character(col) && !all(is.na(col))) {
      fct <- as.factor(col)
      return(.min_max(as.numeric(fct) / nlevels(fct)))
    }
    return(rep_len(0, length(col)))
  }
  res <- col
  if (length(unique(col)) >= 32) # don't uniformize for low number of unique values
    res <- stats::ecdf(col)(col)
  res[is.na(res)] <- 0
  return(res)
}
assignInNamespace(".uniformize", .uniformize, "h2o")
Thanks for your quick response. Another hack I found to change how the values are normalized was to add a fake character variable. This variable then got automatically deleted when running the model, but the normalizing still worked in the way I wanted. But thanks for the code, that is much better.