This repository has been archived by the owner on Jun 2, 2023. It is now read-only.
Edits to spatial train/val/test, additional performance metrics #211
I agree that there is an omission in the original code and that the following code should be added to correct that:
If I'm understanding them correctly, I think the suggested edits change the functionality of this section. As I read the original code, it appears that train_sites (as well as test_sites and val_sites) were sites that only appeared in that partition, but they weren't necessarily the only sites in that partition. In the revised code, it appears that if train_sites is specified, it will use only those sites in the training evaluations (and remove those sites from the test and validation partition).
If the intention is to change the functionality of train_sites, etc., then it's probably good to have a broader conversation. I'm not using that option in my projects currently, but I don't know whether others are.
Yes, I agree. I wasn't sure if this was intentional. I could add another `explicit_spatial_split` parameter here to allow for the previous functionality when False and this new functionality when True. I'll hold off on that until receiving feedback from others.
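The flag described above could look something like this. A minimal sketch, assuming a flat list of site IDs; the function name and signature are hypothetical, not the repo's API:

```python
def assign_train_sites(all_sites, train_sites, val_sites, test_sites,
                       explicit_spatial_split=False):
    """Hypothetical sketch of the proposed flag (not the repo's API).

    explicit_spatial_split=True: the train partition is exactly the
    listed train_sites (the new behavior).
    explicit_spatial_split=False: train_sites are guaranteed to be in
    the train partition, but any site not listed for val/test lands
    there too (the previous behavior).
    """
    if explicit_spatial_split:
        wanted = set(train_sites)
        return [s for s in all_sites if s in wanted]
    excluded = set(val_sites) | set(test_sites)
    return [s for s in all_sites if s not in excluded]
```

With sites A-D and train_sites=["A"], val_sites=["C"], test_sites=["D"], the flag set to True yields only ["A"], while False also sweeps in the unlisted site "B".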
Sounds good.
Oh wow. Yeah. This was definitely a bug! 😮 Thanks for catching this.
@janetrbarclay: in a comment below, Jeff suggests we assume that sites within the train/val/test are the only sites in those partitions. That's also what I would expect. Do you know of anyone who is relying on the previous method wherein sites that are not within train_sites/val_sites/test_sites could be in the train/val/test partitions?
This is getting to be quite verbose. I suggest we split the `group` argument into two, maybe `group_spatially` and `group_temporally`. The `group_spatially` argument would be just a boolean; `group_temporally` would be a `str`. Then the function could be something like:
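One possible shape for that function, sketched under the assumption that the data is a pandas DataFrame with a (spatial, datetime) MultiIndex; all names here are illustrative, not the repo's actual API:

```python
import pandas as pd

def group_obs(df, spatial_idx_name="seg_id_nat", time_idx_name="date",
              group_spatially=False, group_temporally=None):
    """Sketch of the proposed two-argument grouping (names illustrative).

    Assumes df has a (spatial, time) MultiIndex with a datetime time
    level. group_temporally is a pandas offset alias such as "MS"
    (monthly) or "YS" (yearly); None means no temporal grouping.
    """
    groupers = []
    if group_spatially:
        # group by reach
        groupers.append(pd.Grouper(level=spatial_idx_name))
    if group_temporally:
        # group by time bin at the requested frequency
        groupers.append(pd.Grouper(level=time_idx_name, freq=group_temporally))
    if not groupers:
        raise ValueError("no grouping requested")
    return df.groupby(groupers)
```

For example, `group_obs(df, group_spatially=True, group_temporally="MS").mean()` would give per-reach, per-month groups.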
I think that should work. We'd just have to document how the `group_temporally` argument needs to work.
Definitely a bigger change, but I think it would be worth trying. It would also require propagating the change all the way up including any Snakefiles that are using this function.
Would also resolve #212
Thanks, that is much cleaner. I think one more argument would be needed to specify how to do the temporal aggregation. I think what you've programmed would compute metrics for the native timestep only (daily). I used a sum of the daily data to get biweekly, monthly, and yearly timesteps and computed metrics for those. Let me try out this edit, because it will make the code much cleaner.
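The summing described above could be sketched like this (a hypothetical helper, again assuming a (spatial, datetime) MultiIndex; `min_count=1` is one way to keep "only days with observations" semantics):

```python
import pandas as pd

def aggregate_obs(df, timestep, spatial_idx_name="seg_id_nat",
                  time_idx_name="date"):
    """Sum daily values up to a coarser timestep, per reach (sketch).

    timestep is a pandas offset alias, e.g. "2W" (biweekly),
    "MS" (monthly), "YS" (yearly). NaN days are skipped, and
    min_count=1 keeps periods with no observations as NaN rather
    than summing them to 0.
    """
    return df.groupby(
        [pd.Grouper(level=spatial_idx_name),
         pd.Grouper(level=time_idx_name, freq=timestep)]
    ).sum(min_count=1)
```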
I addressed this change in the most recent commit. I needed 4 group args to replicate the previous functionality. I think it's now more generic with respect to different aggregation timesteps, but it's more difficult to understand how to apply the 4 args to compute the desired metrics. For reference, here is the function I used in the snakemake file to assign the 4 new args to this function to compute different metrics for different groups. See the description of the 4 args in the function docstring.
I think the addition of biweekly and yearly options for the metrics is great.
If I'm reading this correctly, "biweekly", "monthly", and "yearly" all also use "seg_id_nat". For consistency with the other grouped metrics, it seems good to have that included in the group list (so group = ['seg_id_nat', 'biweekly']).
(and as an aside, I'm noticing we should remove the hardcoded reference to seg_id_nat and replace it with spatial_idx_name. I think it's just the 3 references in this section. Would you want to fix that in this PR since you're already editing this section?)
Also, without running (which I haven't done) I'm not sure how monthly and yearly are different from ['seg_id_nat','month'] and ['seg_id_nat','year'] since they are both grouping on the same things.
The biweekly, monthly and yearly options are resampling the daily timeseries to those time steps by taking the sum of the data within those time periods (only for the days with observations). I'm not sure that sum is the best option and am open to other suggestions.
The resulting performance metrics are computed over all reaches, not by reach as with the ['seg_id_nat','time'] options, so I can add a group option that reports these metrics by reach.
I agree with removing the "seg_id_nat" comparison to be more generic, but that will affect snakemake workflows. For example, the workflow examples all have a function that defines `group` using seg_id_nat. It might be better to address this problem in a separate issue.
I'm not super familiar with the pandas Grouper (so maybe that's the source of my confusion), but both monthly and yearly use two pandas Groupers, one on time_idx_name and one on spatial_idx_name, right? So are you summing by reach and then calculating metrics across all the reaches?
Yes, that's right.
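The confirmed flow could be sketched like this, with hypothetical column names (`obs`, `pred`): sum by reach and month using the two Groupers, then compute a single metric over all resulting reach-month pairs rather than per reach:

```python
import numpy as np
import pandas as pd

def monthly_rmse_all_reaches(df, spatial_idx_name="seg_id_nat",
                             time_idx_name="date"):
    """Sum obs and pred by reach and month (two pandas Groupers),
    then compute one RMSE over all reach-month pairs rather than
    one RMSE per reach. Column names are hypothetical."""
    monthly = df.groupby(
        [pd.Grouper(level=spatial_idx_name),
         pd.Grouper(level=time_idx_name, freq="MS")]
    ).sum()
    err = monthly["pred"] - monthly["obs"]
    return float(np.sqrt((err ** 2).mean()))
```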
Edit: I added metrics for reach-biweekly, reach-monthly and reach-yearly timeseries.
We could also have reach-biweekly-month (summarize the biweekly timeseries by month), reach-biweekly-year, and reach-monthly-year. reach-biweekly-time would require an additional function to define a biweekly index for Python datetime objects.
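Such a helper might look like the following, a hypothetical sketch that labels each date with a 14-day bin within its year, analogous to the existing month/year groupings; the last bin absorbs the one or two leftover days at year end:

```python
import numpy as np
import pandas as pd

def biweekly_index(dates):
    """Hypothetical helper: assign each date a biweekly bin (1-26)
    within its year. Day-of-year is binned into 14-day windows, and
    day 365/366 is clipped into bin 26 so every date gets a label."""
    dates = pd.DatetimeIndex(dates)
    bins = (np.asarray(dates.dayofyear) - 1) // 14 + 1
    return np.minimum(bins, 26)
```

This would let a reach-biweekly-time grouping work the way the month and year groupings do, by grouping on the returned bin alongside the spatial index.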