This repository has been archived by the owner on Jun 2, 2023. It is now read-only.

Initial prediction model workflow #110

Merged: 41 commits, May 10, 2022

Conversation

@jds485 (Member) commented May 4, 2022

This PR adds a set of targets that develop ranger random forest models to predict streamflow metrics.

Phase 5: Select Features
Two feature-selection steps are applied globally (to the full GAGES dataset):

  • The refine_features function is used to remove features that have the same value for all gages and to drop features specified in drop_columns. The dropped features either contained NAs, were perfectly correlated with another feature, or were highly correlated with another feature (correlation > 0.9).
  • The drop_high_corr_ACCTOT function is used to drop all TOT features that are highly correlated with non-TOT features. This step resulted in dropping all remaining TOT features in favor of the divergence-routed ACC features.
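As a minimal sketch (not the project's refine_features implementation), the two global screens above, removing constant columns and then dropping one member of each highly correlated pair, could look like this. The function name screen_features and the 0.9 default are illustrative:

```r
# Sketch of global feature screening: drop constant columns, then drop
# the later member of each feature pair with |correlation| > threshold.
# Assumes all columns of df are numeric.
screen_features <- function(df, cor_threshold = 0.9) {
  # Remove features with the same value for all gages
  constant <- vapply(df, function(x) length(unique(x)) <= 1, logical(1))
  df <- df[, !constant, drop = FALSE]

  # Zero out the lower triangle and diagonal so each pair is checked once;
  # a column is dropped if it is highly correlated with an earlier column
  cmat <- abs(cor(df, use = "pairwise.complete.obs"))
  cmat[lower.tri(cmat, diag = TRUE)] <- 0
  drop_cols <- colnames(cmat)[apply(cmat, 2, function(x) any(x > cor_threshold))]

  df[, setdiff(colnames(df), drop_cols), drop = FALSE]
}
```

Here column b would be dropped as perfectly correlated with a, and a constant column would be dropped in the first step.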

Phase 6: Predict

  • For each model region and metric to be predicted, the screen_Boruta function is used to further screen features. First, the features within the modeling region are scanned and removed if they have the same value for all gages in the modeling region. Then, the split_data function is used to create training and testing datasets. Finally, the Boruta algorithm is used to select "all relevant" features based on the training dataset. The Boruta algorithm is applied three times: using all features, without CAT features, and without ACC features. The union of identified features is selected for prediction for that model region and flow metric. This function returns the train/test split with screened features.
  • The train_models_grid function is used to tune the parameters of the RF model with a grid search over a space-filling design. 5-fold cross validation is used to estimate prediction performance (mean over the 5 folds). The function returns the summary results for each model evaluated and the model whose parameters gave the best RMSE. This step takes about 8 minutes on 35 cores; I have not tried to further reduce the computation time.
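The three-pass screening in screen_Boruta can be sketched as follows. This is a simplified stand-in: screen_union, cor_select, and the column names are hypothetical, and a plain correlation filter replaces the actual Boruta algorithm (which requires the Boruta package); only the union-of-three-runs logic matches the description above:

```r
# Run a feature selector three times (all features, without CAT features,
# without ACC features) and take the union of the selected feature names.
screen_union <- function(x, y, select_fn) {
  runs <- list(
    all    = colnames(x),
    no_CAT = grep("^CAT", colnames(x), value = TRUE, invert = TRUE),
    no_ACC = grep("^ACC", colnames(x), value = TRUE, invert = TRUE)
  )
  selected <- lapply(runs, function(cols) select_fn(x[, cols, drop = FALSE], y))
  sort(unique(unlist(selected, use.names = FALSE)))
}

# Stand-in selector (the real pipeline uses Boruta): keep features whose
# absolute correlation with the response exceeds 0.5
cor_select <- function(x, y) {
  colnames(x)[abs(cor(x, y)) > 0.5]
}
```

With a toy dataset, a feature selected in any of the three passes ends up in the final set, which is the "union of identified features" behavior described above.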

Plots
The result of Boruta screening applied to all features is plotted with the plot_Boruta function.
[figure: Boruta_vhfdc1_q0 9_rain]

The performance metrics as a function of the grid-search parameters are plotted with the plot_hyperparam_opt_results_RF function. Note: the number of trees is a third variable not symbolized here; point size could be varied to show it.
[figure: hyperparam_diagnostic_vhfdc1_q0 9_rain]

The variable importance plot for the best model is plotted with the plot_vip function.
[figure: vip_vhfdc1_q0 9_rain]

How to run

  • submitted with sbatch --ntasks-per-node=35 ./slurm/launch-rstudio-container.slurm regional-hydrologic-forcings-container_FHWA_cc55e9da.sif
  • Container version: regional-hydrologic-forcings-container_FHWA_cc55e9da.sif
  • number of cores: 35
  • I set up the parallelization to work with tar_make(). I haven't tested it with tar_make_clustermq(); I initially had issues with the clustermq approach and switched to a different method with doParallel.
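As a hedged illustration of the within-node parallelism described above, here is the analogous cluster setup using base R's parallel package (the PR itself uses doParallel, and this is not the project's exact setup):

```r
# Sketch: create a worker cluster sized to the allocated cores and run
# work across it, as the grid-search tuning would do inside tar_make().
library(parallel)

n_cores <- 2  # kept small here; set to 35 to match --ntasks-per-node=35
cl <- makeCluster(n_cores)

# Stand-in for the parallel work done during model tuning
res <- parLapply(cl, 1:8, function(i) i^2)

stopCluster(cl)
```

With doParallel, the equivalent step is registering the cluster so foreach-based code picks it up; the cluster lifecycle (create, use, stop) is the same.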

Notes: Created issues for these

Closes #97, #111

jds485 added 26 commits April 25, 2022 13:35
@jds485 jds485 requested a review from slevin75 May 4, 2022 20:42
@slevin75 (Collaborator) commented May 4, 2022

@jds485 I got this error running on Tallgrass. It looks like an error getting the NHDPlus attributes from ScienceBase. Any thoughts on what is going on here? It's in target p1_sb_data_g2_csv.

[error screenshot]

@jds485 (Member, Author) commented May 5, 2022

> I got this error running on Tallgrass. It looks like an error getting the NHDPlus attributes from ScienceBase. Any thoughts on what is going on here? It's in target p1_sb_data_g2_csv.

Did you try restarting it? I have seen errors like this occur seemingly at random. Maybe we need to implement a tryCatch for these downloads, too.

@jds485 jds485 linked an issue May 10, 2022 that may be closed by this pull request
@slevin75 (Collaborator) left a comment

So I was able to run this on Tallgrass finally. I copied your regional-hydrologic-forcings-ml folder into my directory and re-ran it there. It kept all the previously run targets, so I could skip over them. I don't know if that is a good long-term solution, because if you had changed something before I copied it, it would not have matched the PR, but I think it worked OK for now.

I erased the p6 targets and re-ran them to check that everything ran OK, and it did. Then I stepped through some of the functions in an RStudio session to see if I could follow what was going on. I got a little confused by some things in the train_models_grid function, but I generally understand what is happening.

The plots look good, but I noted a couple of places where I think the metric should be added as the y-axis label or title on the plot, in addition to the file name. There are so many plots that it would be easy to look at the wrong one without the metric name shown in an obvious way. I think it's OK to merge if you just add those two little things.

png(filename = fileout, width = 4, height = 4, units = 'in', res = 200)
boxplot(data_split$training[[metric]],
        data_split$testing[[metric]],
        names = c('Training', 'Testing'))
@slevin75 (Collaborator) commented:

Even though it's in the file name, it would probably be good to have ylab = metric in the boxplot call to label the y-axis on the plot.

plot(df_pred_obs$obs, df_pred_obs$.pred,
     xlim = c(0, plt_lim), ylim = c(0, plt_lim),
     xlab = 'Observed', ylab = 'Predicted', cex = 0.4, pch = 16)
lines(c(0, plt_lim), c(0, plt_lim), col = 'red')
@slevin75 (Collaborator) commented:

Add a title with the metric name? title(metric)

@jds485 (Member, Author) commented:

Thanks for pointing out the missing metric names in the titles. I'll add metric names to the titles on all of the figures.


Successfully merging this pull request may close these issues:

  • Create predictions for other model regions
  • Create initial prediction models