…d correlation check to be general to whatever variable prefix is supplied
…s with working method
@jds485 I got this error running on Tallgrass. Looks like an error getting the NHDPlus attributes from ScienceBase. Any thoughts on what is going on here? It's in target p1_sb_data_g2_csv.
Did you try restarting it? I have seen errors like this occur seemingly at random. Maybe we need to implement a tryCatch for these downloads, too.
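Something like the following retry wrapper would capture the tryCatch idea (a sketch only; `download_fn` is a placeholder for whatever ScienceBase download call the target actually makes, and the retry counts are arbitrary):

```r
# Hypothetical retry wrapper for flaky downloads: not from the repo,
# just a sketch of the suggested tryCatch approach.
retry_download <- function(download_fn, max_tries = 3, wait_sec = 10) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(
      download_fn(),
      error = function(e) {
        message("Download attempt ", i, " failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) return(result)
    Sys.sleep(wait_sec)  # brief pause before retrying
  }
  stop("Download failed after ", max_tries, " attempts")
}
```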
…for the snowmelt dominated region
So I was able to run this on Tallgrass finally. I copied your regional-hydrologic-forcings-ml folder into my directory and re-ran it there. It kept all the previously run targets, so I could skip over them. I don't know if that is a good long-term solution, because if you had changed something before I copied the folder, it would not have matched the PR, but I think it worked OK for now.
I erased the p6 targets and re-ran them to check that everything ran OK, and it did. Then I stepped through some of the functions in an RStudio session to see if I could follow what was going on. I got a little confused by some things in the train_models_grid function, but I generally understand what is happening.
The plots look good, but I noted a couple of places where the metric should be added as the y-axis label or title on the plot, in addition to the file name. There are so many plots that it would be easy to look at the wrong one without the metric name displayed in an obvious way. I think it's OK to merge once you add those two small things.
6_predict/src/plot_diagnostics.R (Outdated)
```r
png(filename = fileout, width = 4, height = 4, units = 'in', res = 200)
boxplot(data_split$training[[metric]],
        data_split$testing[[metric]],
        names = c('Training', 'Testing'))
```
Even though it's in the file name, it would probably be good to have `ylab = metric` in the boxplot call to label the y-axis on the plot.
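For concreteness, the suggested call would be something like:

```r
# label the y-axis with the metric name so each plot is self-describing
boxplot(data_split$training[[metric]],
        data_split$testing[[metric]],
        names = c('Training', 'Testing'),
        ylab = metric)
```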
```r
plot(df_pred_obs$obs, df_pred_obs$.pred,
     xlim = c(0, plt_lim), ylim = c(0, plt_lim),
     xlab = 'Observed', ylab = 'Predicted', cex = 0.4, pch = 16)
lines(c(0, plt_lim), c(0, plt_lim), col = 'red')
```
Add a title with the metric name? `title(metric)`
Thanks for pointing out the missing metric names in titles. I'll add metric names to the titles on all of the figures.
…bute for CONUS models
This PR adds a set of targets that develop ranger random forest models to predict streamflow metrics.
Phase 5: Select Features

Two feature selections are applied globally (to the full GAGES dataset); a sketch of this logic follows the list:

- The `refine_features` function is used to remove features that have the same value for all gages, and to drop features that are specified in `drop_columns`. The dropped features either had NAs, were perfectly correlated with another feature, or were highly correlated with another feature (>0.9).
- The `drop_high_corr_ACCTOT` function is used to drop all TOT features that are highly correlated with non-TOT features. This step resulted in dropping all remaining TOT features in favor of using divergence-routed ACC features.
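For orientation, here is a minimal sketch of this kind of global refinement. `refine_features_sketch` is a hypothetical stand-in, not the repo's `refine_features`, and it assumes an all-numeric feature data frame:

```r
# Hypothetical sketch of the Phase 5 refinement logic (not the repo's code).
# Assumes df contains only numeric feature columns.
refine_features_sketch <- function(df, drop_columns, cor_threshold = 0.9) {
  # drop the explicitly specified columns
  df <- df[, setdiff(names(df), drop_columns), drop = FALSE]

  # remove features that have the same value for all gages
  constant <- vapply(df, function(x) length(unique(x)) <= 1, logical(1))
  df <- df[, !constant, drop = FALSE]

  # for each highly correlated pair, drop the earlier feature
  cor_mat <- abs(cor(df, use = "pairwise.complete.obs"))
  cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0
  high_corr <- apply(cor_mat, 2, function(x) any(x > cor_threshold))
  df[, !high_corr, drop = FALSE]
}
```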
Phase 6: Predict

The `screen_Boruta` function is used to further screen features. First, the features within the modeling region are scanned, and any with the same value for all gages in the region are removed. Then, the `split_data` function is used to create training and testing datasets. Finally, the Boruta algorithm is used to select "all relevant" features based on the training dataset. The Boruta algorithm is applied three times: using all features, without CAT features, and without ACC features. The union of the identified features is selected for prediction for that model region and flow metric. This function returns the train/test split with screened features (a sketch follows).
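A sketch of the three-pass screen using the Boruta package; this is illustrative, not the repo's `screen_Boruta`, and `train_x`, `train_y`, `cat_cols`, and `acc_cols` are assumed names:

```r
library(Boruta)

# Run Boruta and return the selected ("all relevant") feature names.
# Tentative features are resolved with TentativeRoughFix.
run_boruta <- function(x, y) {
  getSelectedAttributes(TentativeRoughFix(Boruta(x = x, y = y)))
}

sel_all    <- run_boruta(train_x, train_y)
sel_no_cat <- run_boruta(train_x[, setdiff(names(train_x), cat_cols)], train_y)
sel_no_acc <- run_boruta(train_x[, setdiff(names(train_x), acc_cols)], train_y)

# union of the features identified across the three passes
selected_features <- Reduce(union, list(sel_all, sel_no_cat, sel_no_acc))
```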
The `train_models_grid` function is used to tune the parameters of the RF model using a grid search with a space-filling design. A 5-fold cross validation is used to estimate prediction performance (mean over the 5 folds). The function returns the summary results for each model evaluated and the model whose parameters gave the best RMSE. This step takes about 8 minutes on 35 cores; I have not tried to further reduce the computation time. A tidymodels-style sketch of this step follows.
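In tidymodels terms, the step might look roughly like this; the parameter ranges, grid size, and the `metric_value ~ .` formula are assumptions, not the repo's settings:

```r
library(tidymodels)

# Sketch of grid tuning with a space-filling (Latin hypercube) design
# and 5-fold CV; ranges and grid size are illustrative.
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(metric_value ~ .)  # metric_value is a placeholder response

folds <- vfold_cv(train_df, v = 5)

rf_grid <- grid_latin_hypercube(
  mtry(range = c(2L, 40L)),
  min_n(range = c(2L, 20L)),
  trees(range = c(200L, 2000L)),
  size = 30
)

grid_res <- tune_grid(rf_wf, resamples = folds, grid = rf_grid,
                      metrics = metric_set(rmse))

# refit the workflow with the best-RMSE parameters on the full training set
best_rf <- finalize_workflow(rf_wf, select_best(grid_res, metric = "rmse")) %>%
  fit(data = train_df)
```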
Plots

- The result of Boruta screening applied to all features is plotted with the `plot_Boruta` function.
- The performance metrics as a function of the grid-search parameters are plotted with the `plot_hyperparam_opt_results_RF` function. Note: the number of trees is the 3rd variable and is not symbolized here; point size could be varied to show the number of trees.
- The variable importance plot for the best model is plotted with the `plot_vip` function (a generic sketch follows this list).
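As a generic reference (not the repo's `plot_vip`), variable importance for a fitted ranger workflow can be drawn with the vip package, assuming `best_rf` from the tuning sketch above:

```r
library(vip)

# Pull the underlying ranger fit out of the finalized workflow and
# plot the top features (requires importance set in the engine).
best_rf %>%
  extract_fit_engine() %>%
  vip(num_features = 20)
```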
How to run

```
sbatch --ntasks-per-node=35 ./slurm/launch-rstudio-container.slurm regional-hydrologic-forcings-container_FHWA_cc55e9da.sif
```
Notes: Created issues for these

- `split_data` currently uses a random split instead of using the nested gages matrix to inform splits (Create train, test, and cross validation splits with nestedness matrix, #112).
- `train_models_grid` is currently based on a grid search. Refining hyperparameters with Bayesian optimization is very slow when using the results of the grid search to provide initial solutions. We should investigate why and report to the `tune_bayes` GitHub repo if we have trouble (Investigate slowness of tune_bayes, #114). A sketch of the warm-started call follows this list.
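For reference, the warm-started call would look something like this, reusing `rf_wf`, `folds`, and `grid_res` from the tuning sketch above; the `iter` value is arbitrary:

```r
# Sketch: seed Bayesian optimization with the grid-search results.
bayes_res <- tune_bayes(
  rf_wf,
  resamples = folds,
  initial   = grid_res,  # warm start from the tune_grid results
  iter      = 20,
  metrics   = metric_set(rmse)
)
```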
Closes #97, #111