This repository has been archived by the owner on Jun 2, 2023. It is now read-only.

Initial prediction model workflow #110

Merged: 41 commits, May 10, 2022

Conversation

@jds485 (Member) commented May 4, 2022

This PR adds a set of targets that develop ranger random forest models to predict streamflow metrics.

Phase 5: Select Features
Two feature-selection steps are applied globally (to the full GAGES dataset):

  • The refine_features function is used to remove features that have the same value for all gages and to drop features specified in drop_columns. The dropped features either contained NAs, were perfectly correlated with another feature, or were highly correlated with another feature (correlation > 0.9).
  • The drop_high_corr_ACCTOT function is used to drop all TOT features that are highly correlated with non-TOT features. This step resulted in dropping all remaining TOT features in favor of the divergence-routed ACC features.
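As a minimal sketch (not the project's refine_features implementation), the two global screens above, removing constant columns and then dropping one member of each highly correlated pair, could look like this. The function name screen_features and the 0.9 default are illustrative:

```r
# Sketch of global feature screening: drop constant columns, then drop
# the later member of each feature pair with |correlation| > threshold.
# Assumes all columns of df are numeric.
screen_features <- function(df, cor_threshold = 0.9) {
  # Remove features with the same value for all gages
  constant <- vapply(df, function(x) length(unique(x)) <= 1, logical(1))
  df <- df[, !constant, drop = FALSE]

  # Zero out the lower triangle and diagonal so each pair is checked once;
  # a column is dropped if it is highly correlated with an earlier column
  cmat <- abs(cor(df, use = "pairwise.complete.obs"))
  cmat[lower.tri(cmat, diag = TRUE)] <- 0
  drop_cols <- colnames(cmat)[apply(cmat, 2, function(x) any(x > cor_threshold))]

  df[, setdiff(colnames(df), drop_cols), drop = FALSE]
}
```

Here column b would be dropped as perfectly correlated with a, and a constant column would be dropped in the first step.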

Phase 6: Predict

  • For each model region and metric to be predicted, the screen_Boruta function is used to further screen features. First, the features within the modeling region are scanned and removed if they have the same value for all gages in the modeling region. Then, the split_data function is used to create training and testing datasets. Finally, the Boruta algorithm is used to select "all relevant" features based on the training dataset. The Boruta algorithm is applied three times: using all features, without CAT features, and without ACC features. The union of identified features is selected for prediction for that model region and flow metric. This function returns the train/test split with screened features.
  • The train_models_grid function is used to tune the parameters of the RF model with a grid search over a space-filling design. 5-fold cross validation is used to estimate prediction performance (mean over the 5 folds). The function returns the summary results for each model evaluated and the model whose parameters gave the best RMSE. This step takes about 8 minutes on 35 cores; I have not tried to further reduce the computation time.
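The three-pass screening in screen_Boruta can be sketched as follows. This is a simplified stand-in: screen_union, cor_select, and the column names are hypothetical, and a plain correlation filter replaces the actual Boruta algorithm (which requires the Boruta package); only the union-of-three-runs logic matches the description above:

```r
# Run a feature selector three times (all features, without CAT features,
# without ACC features) and take the union of the selected feature names.
screen_union <- function(x, y, select_fn) {
  runs <- list(
    all    = colnames(x),
    no_CAT = grep("^CAT", colnames(x), value = TRUE, invert = TRUE),
    no_ACC = grep("^ACC", colnames(x), value = TRUE, invert = TRUE)
  )
  selected <- lapply(runs, function(cols) select_fn(x[, cols, drop = FALSE], y))
  sort(unique(unlist(selected, use.names = FALSE)))
}

# Stand-in selector (the real pipeline uses Boruta): keep features whose
# absolute correlation with the response exceeds 0.5
cor_select <- function(x, y) {
  colnames(x)[abs(cor(x, y)) > 0.5]
}
```

With a toy dataset, a feature selected in any of the three passes ends up in the final set, which is the "union of identified features" behavior described above.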

Plots
The result of Boruta screening applied to all features is plotted with the plot_Boruta function.
[figure: Boruta_vhfdc1_q0 9_rain]

The performance metrics as a function of the grid-search parameters are plotted with the plot_hyperparam_opt_results_RF function. Note: the number of trees is a third variable not symbolized here; point size could be varied to show it.
[figure: hyperparam_diagnostic_vhfdc1_q0 9_rain]

The variable importance plot for the best model is plotted with the plot_vip function.
[figure: vip_vhfdc1_q0 9_rain]

How to run

  • submitted with sbatch --ntasks-per-node=35 ./slurm/launch-rstudio-container.slurm regional-hydrologic-forcings-container_FHWA_cc55e9da.sif
  • Container version: regional-hydrologic-forcings-container_FHWA_cc55e9da.sif
  • number of cores: 35
  • I set up the parallelization to work with tar_make(). I haven't tested it with tar_make_clustermq(); I initially had issues with the clustermq approach and switched to a different method with doParallel.
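As a hedged illustration of the within-node parallelism described above, here is the analogous cluster setup using base R's parallel package (the PR itself uses doParallel, and this is not the project's exact setup):

```r
# Sketch: create a worker cluster sized to the allocated cores and run
# work across it, as the grid-search tuning would do inside tar_make().
library(parallel)

n_cores <- 2  # kept small here; set to 35 to match --ntasks-per-node=35
cl <- makeCluster(n_cores)

# Stand-in for the parallel work done during model tuning
res <- parLapply(cl, 1:8, function(i) i^2)

stopCluster(cl)
```

With doParallel, the equivalent step is registering the cluster so foreach-based code picks it up; the cluster lifecycle (create, use, stop) is the same.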

Notes: Created issues for these

Closes #97, #111

jds485 added 26 commits April 25, 2022 13:35
@jds485 jds485 requested a review from slevin75 May 4, 2022 20:42
@slevin75 (Collaborator) commented May 4, 2022

@jds485 I got this error running on Tallgrass. It looks like an error getting the NHDPlus attributes from ScienceBase. Any thoughts on what is going on here? It's in target p1_sb_data_g2_csv.

[error screenshot]

@jds485 (Member, Author) commented May 5, 2022

> I got this error running on Tallgrass. It looks like an error getting the NHDPlus attributes from ScienceBase. Any thoughts on what is going on here? It's in target p1_sb_data_g2_csv.

Did you try restarting it? I have seen errors like this occur seemingly at random. Maybe we need to implement a tryCatch for these downloads, too.

@jds485 jds485 linked an issue May 10, 2022 that may be closed by this pull request
@slevin75 (Collaborator) left a comment

So I was able to run this on Tallgrass finally. I copied your regional-hydrologic-forcings-ml folder into my directory and re-ran it there. It kept all the previously run targets, so I could skip over them. I don't know if that is a good long-term solution, because if you had changed something before I copied it, it would not have matched the PR, but I think it worked OK for now.

I erased the p6 targets and re-ran them to check that everything ran OK, and it did. Then I stepped through some of the functions in an RStudio session to see if I could follow what was going on. I got a little confused by some things in the train_models_grid function, but I generally understand what is happening.

The plots look good, but I noted a couple of places where I think the metric should be added as the y-axis label or title on the plot, in addition to the file name. There are so many plots that it would be easy to look at the wrong one without the metric name shown in an obvious way. I think it's OK to merge if you just add those two little things.

png(filename = fileout, width = 4, height = 4, units = 'in', res = 200)
boxplot(data_split$training[[metric]],
        data_split$testing[[metric]],
        names = c('Training', 'Testing'))
@slevin75 (Collaborator) commented:

Even though it's in the file name, it would probably be good to have ylab = metric in the boxplot call to label the y-axis on the plot.

plot(df_pred_obs$obs, df_pred_obs$.pred,
     xlim = c(0, plt_lim), ylim = c(0, plt_lim),
     xlab = 'Observed', ylab = 'Predicted', cex = 0.4, pch = 16)
lines(c(0, plt_lim), c(0, plt_lim), col = 'red')
@slevin75 (Collaborator) commented:

Add a title with the metric name? title(metric)

@jds485 (Member, Author) commented:

Thanks for pointing out the missing metric names in the titles. I'll add metric names to the titles on all of the figures.


Successfully merging this pull request may close these issues:

  • Create predictions for other model regions
  • Create initial prediction models