[GEN-1468] overwrite tier1 variable #156
…erwrite_tier1_variable merge changes from GEN-1516-table_update_cohort_specific branch
scripts/table_updates/config.json
Outdated
@@ -19,7 +19,7 @@
"CRC2": "syn52943210",
"RENAL": "syn59474249"
},
-"main_genie_release_version": "16.6-consortium",
+"main_genie_release_version": "17.4-consortium",
@Chelsea-Na do we want to keep this at 17.2?
Good question! If we are ready to test the output, we should test it on 17.4-consortium. We will eventually also need to test on 17.6-consortium once it's out.
Do we know what happens if a main GENIE value is missing? Or if the sample/patient is missing? Or if it adds to a log if there is a mismatch between the upload and the replaced value?
@Chelsea-Na This is a hard replacement, so I didn't check for missing GENIE values. I checked with Rixing, and based on the BPC project description we think all BPC samples/patients should be in Main GENIE. I'm happy to add a check for sample/patient missingness. Also, it doesn't log discrepancies between uploaded and replaced values. I'm happy to add a function for that if it's helpful.
I forgot to address your question: "Do we know what happens if a main GENIE value is missing? Or if the sample/patient is missing?"
If the main GENIE sample/patient is missing, the values will be filled with NaN.
If the GENIE value is missing, then the final dataframe shows missingness, since we use all main GENIE values to hard replace BPC. However, this may not occur in practice, because no missingness has been found in the target columns of the sample clinical and patient clinical tables.
@Chelsea-Na Two functions have been added to check for sample/patient missingness and log the discrepancies between uploaded and replaced values.
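For readers following along, the two checks described in this thread could look roughly like the sketch below. This is not the code added in the PR: the function names `find_missing_ids` and `log_discrepancies` are hypothetical, and the sketch assumes the uploaded and replaced values are pandas Series sharing the same index. Only the column names `cpt_genie_sample_id` and `SAMPLE_ID` come from the diff.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def find_missing_ids(bpc: pd.DataFrame, main: pd.DataFrame) -> list:
    """Return BPC sample IDs that have no match in main GENIE (hypothetical helper)."""
    merged = bpc.merge(
        main[["SAMPLE_ID"]],
        how="left",
        left_on="cpt_genie_sample_id",
        right_on="SAMPLE_ID",
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    unmatched = merged["_merge"] == "left_only"
    return merged.loc[unmatched, "cpt_genie_sample_id"].tolist()


def log_discrepancies(uploaded: pd.Series, replaced: pd.Series) -> int:
    """Log rows where the uploaded value differs from its replacement; return the count.

    Assumes both Series share the same index (hypothetical helper).
    """
    # Treat NaN vs NaN as a match so only real changes are reported.
    same = (uploaded == replaced) | (uploaded.isna() & replaced.isna())
    differs = ~same
    for idx in uploaded.index[differs]:
        logger.warning(
            "row %s: uploaded=%r replaced=%r", idx, uploaded[idx], replaced[idx]
        )
    return int(differs.sum())
```

With `how="left"`, BPC rows without a main GENIE match are kept and their main GENIE columns are filled with NaN, which is why an explicit missingness check is useful before a hard replacement.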
Let's push the version to 17.6-consortium
Just did a first pass, had some comments
mock_logger.assert_not_called()


@pytest.mark.parametrize(
nit: there are a lot of parameters here, which makes it really hard to read when using pytest.mark.parametrize (usually once I have more than 3 parameters, I'd use a different method). I'd recommend something like this
Done.
scripts/table_updates/utilities.py
Outdated
"""
# check the validity of bpc_column_list
valid_col = column_mapping_table.loc[column_mapping_table["prissmm_form"] == form,].prissmm_element.tolist()
Why are we using prissmm_form? Could the function docstring describe why we are pulling our column list for both bpc and main genie from here?
The reason is that prissmm_form matches the form column in the Data Table information table. I can update the docstring.
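As a rough illustration of the validity check discussed here, the sketch below filters a mapping table by `prissmm_form` and reports any requested columns that are not listed under `prissmm_element`. The function name `invalid_columns` and its signature are assumptions made for this example, not the PR's implementation.

```python
import pandas as pd


def invalid_columns(mapping: pd.DataFrame, form: str, columns: list) -> list:
    """Return requested columns not listed for `form` in the mapping table.

    Hypothetical helper: `mapping` is assumed to have the columns
    "prissmm_form" and "prissmm_element", as in the diff above.
    """
    valid = set(mapping.loc[mapping["prissmm_form"] == form, "prissmm_element"])
    return [col for col in columns if col not in valid]
```

An empty return value means every requested column is valid for that form.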
scripts/table_updates/utilities.py
Outdated
)
else:
main_genie_table = main_genie_table[main_genie_column_list + ["SAMPLE_ID"]]
I'm not sure what the original code looked like but was there ever a handling of potential duplicates before (when we first query by cohort) and after merging here?
The original code doesn't handle duplicates.
scripts/table_updates/utilities.py
Outdated
how="left",
left_on="cpt_genie_sample_id",
right_on="SAMPLE_ID",
I see Chelsea's concern here: what happens if there isn't a 1:1 merge between bpc and main genie (BPC has sample/patients not present in clinical). How did the code previously handle the merge?
The original code doesn't handle duplicates. See here: https://github.com/Sage-Bionetworks/genie-bpc-pipeline/blob/1bc58ec5c7415ba5b989dbb5a0de39b4839a1b0b/scripts/table_updates/update_data_table.py#L346C1-L357C6
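One possible way to make duplicate handling explicit, offered as a sketch rather than what this PR or the original code does, is the `validate` argument of pandas `merge`. With `validate="many_to_one"`, pandas raises `pandas.errors.MergeError` if the right-hand key (`SAMPLE_ID` here) is not unique. The wrapper name `merge_with_main_genie` is hypothetical.

```python
import pandas as pd


def merge_with_main_genie(bpc: pd.DataFrame, main: pd.DataFrame) -> pd.DataFrame:
    """Left-merge BPC onto main GENIE, failing loudly on duplicate SAMPLE_IDs.

    Hypothetical wrapper; the merge keys mirror the diff above.
    """
    return bpc.merge(
        main,
        how="left",
        left_on="cpt_genie_sample_id",
        right_on="SAMPLE_ID",
        # Raises pandas.errors.MergeError if SAMPLE_ID repeats in `main`.
        validate="many_to_one",
    )
```

BPC rows with no match in main GENIE still survive the left merge with NaN in the main GENIE columns, so this check covers duplicates only and would need to be paired with a separate missingness check.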
🔥 Great work here! I'll defer to @rxu17 for a final review!
"OVARIAN": "syn64042773"
},
"main_genie_release_version": "16.6-consortium",
"tier1a_replacement_mapping":{
"patient_characteristics_tier1a_replacement_mapping_table": "syn64331052",
May I ask what these tables are for?
The patient_characteristics_tier1a_replacement_mapping_table and cancer_panel_test_tier1a_replacement_mapping_table track how the original BPC codes are mapped to NAACCR codes. They were added to address Chelsea's request: "adds to a log if there is a mismatch between the upload and the replaced value".
LGTM overall! Some comments/questions
Problem:
The tier1a variables (race, sex, ethnicity, sample_type, seq_date) in BPC tables need to be replaced with values extracted from Main GENIE.
Solution:
Testing:
Unit tests have been added.