[GEN-1468] overwrite tier1 variable #156

Merged: 13 commits, Jan 11, 2025

@danlu1 danlu1 commented Oct 18, 2024

Problem:

The tier1a variables (race, sex, ethnicity, sample_type, seq_date) in BPC tables need to be replaced with values extracted from Main GENIE.

Solution:

  1. Made get_main_genie_clinical_sample_file more generic so that it can pull both patient and sample release files.
  2. Added update_tier1a and overwrite_tier1a functions to update BPC tables, either for all cohorts or for a specific cohort.

Testing:

Unit tests have been added.

@danlu1 danlu1 marked this pull request as draft October 18, 2024 19:42
…erwrite_tier1_variable

merge changes from GEN-1516-table_update_cohort_specific branch
@danlu1 danlu1 marked this pull request as ready for review October 21, 2024 16:59
@@ -19,7 +19,7 @@
"CRC2": "syn52943210",
"RENAL": "syn59474249"
},
-"main_genie_release_version": "16.6-consortium",
+"main_genie_release_version": "17.4-consortium",
Member

@Chelsea-Na do we want to keep this at 17.2?

Contributor

@Chelsea-Na Chelsea-Na Oct 23, 2024

Good question! If we are ready to test the output, we should test it on 17.4-consortium. We will eventually also need to test on 17.6-consortium once it's out.

Do we know what happens if a main GENIE value is missing? Or if the sample/patient is missing? Or if it adds to a log if there is a mismatch between the upload and the replaced value?

Contributor Author

@danlu1 danlu1 Oct 23, 2024

@Chelsea-Na This is a hard replacement, so I didn't check for missing GENIE values. I checked with Rixing, and based on the BPC project description we think all BPC samples/patients should be in Main GENIE. I'm happy to add a check for sample/patient missingness. It also doesn't log discrepancies between uploaded and replaced values; I'm happy to add a function for that if it's helpful.

Contributor Author

@danlu1 danlu1 Dec 4, 2024

I forgot to address your question: "Do we know what happens if a main GENIE value is missing? Or if the sample/patient is missing?"
If the sample/patient is missing from main GENIE, the values will be filled with NaN.
If the main GENIE value itself is missing, the final dataframe will show that missingness, since we use all main GENIE values to hard replace the BPC values. However, this is unlikely in practice, because no missingness has been found in the target columns of the sample clinical table or the patient clinical table.
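
The behaviour described above can be illustrated with a minimal pandas sketch. The two small tables and the `SAMPLE_TYPE` column values are made up for illustration; only the join keys `cpt_genie_sample_id` and `SAMPLE_ID` and the left join come from this PR. This is not the code in `utilities.py`.

```python
import pandas as pd

# Hypothetical BPC table: one row per sample, keyed by cpt_genie_sample_id.
bpc = pd.DataFrame(
    {
        "cpt_genie_sample_id": ["S1", "S2", "S3"],
        "sample_type": ["1", "2", "3"],
    }
)

# Hypothetical main GENIE sample table: S3 is absent, S2 has a missing value.
main_genie = pd.DataFrame(
    {
        "SAMPLE_ID": ["S1", "S2"],
        "SAMPLE_TYPE": ["4", None],
    }
)

merged = bpc.merge(
    main_genie,
    how="left",
    left_on="cpt_genie_sample_id",
    right_on="SAMPLE_ID",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Hard replacement: the BPC column takes whatever main GENIE provided,
# including missing values.
merged["sample_type"] = merged["SAMPLE_TYPE"]

# Samples present in BPC but absent from main GENIE.
missing_in_main = merged.loc[
    merged["_merge"] == "left_only", "cpt_genie_sample_id"
].tolist()
```

Here `S2` ends up missing because main GENIE has no value for it, and `S3` ends up missing because the sample is not in main GENIE at all; the `_merge` column is one way to tell the two cases apart.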

Contributor Author

@Chelsea-Na Two functions have been added to check for sample/patient missingness and log the discrepancies between uploaded and replaced values.
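
As a rough sketch of what such checks could look like (the function names `find_missing_ids` and `log_discrepancies` are hypothetical and are not the ones added in this PR), assuming both series share the same index:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def find_missing_ids(bpc_ids: pd.Series, main_genie_ids: pd.Series) -> list:
    """Return BPC IDs that do not appear in main GENIE (hypothetical helper)."""
    return sorted(set(bpc_ids.dropna()) - set(main_genie_ids.dropna()))


def log_discrepancies(original: pd.Series, replaced: pd.Series, column: str) -> int:
    """Log and count rows whose value changed after replacement (hypothetical helper).

    Both series are assumed to share the same index.
    """
    # Rows where both values are missing count as "no change",
    # so only real differences are reported.
    unchanged = (original == replaced) | (original.isna() & replaced.isna())
    changed = ~unchanged
    for idx in original.index[changed]:
        logger.warning(
            "%s row %s: uploaded=%r replaced=%r",
            column,
            idx,
            original.loc[idx],
            replaced.loc[idx],
        )
    return int(changed.sum())
```

Returning the count alongside the log messages makes the helper easy to assert on in unit tests.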

Member

Let's push the version to 17.6-consortium

@danlu1 danlu1 requested a review from a team as a code owner October 23, 2024 19:12
Contributor

@rxu17 rxu17 left a comment

Just did a first pass, had some comments

scripts/table_updates/utilities.py (outdated, resolved)
scripts/table_updates/utilities.py (outdated, resolved)
scripts/table_updates/tests/test_utilities.py (outdated, resolved)
scripts/table_updates/utilities.py (outdated, resolved)
scripts/table_updates/tests/test_utilities.py (resolved)
mock_logger.assert_not_called()


@pytest.mark.parametrize(
Contributor

nit: there are a lot of parameters here, which makes this really hard to read with pytest.mark.parametrize (once I have more than 3 parameters, I usually use a different method). I'd recommend something like this
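
The reviewer's snippet is not preserved in this conversation. One common pattern for heavily parametrized tests, shown here only as a sketch and not as the reviewer's actual suggestion, is to bundle each case into a single dict wrapped in `pytest.param` with a readable `id`. The function `add_prefix` is a made-up stand-in for the utility under test.

```python
import pytest


def add_prefix(value: str, prefix: str, sep: str, upper: bool) -> str:
    """Toy function under test (hypothetical stand-in for the real utility)."""
    result = f"{prefix}{sep}{value}"
    return result.upper() if upper else result


# Each case is one dict, so the test takes a single "case" argument
# instead of a long positional tuple of parameters.
CASES = [
    pytest.param(
        {"value": "a", "prefix": "x", "sep": "-", "upper": False, "expected": "x-a"},
        id="lowercase_kept",
    ),
    pytest.param(
        {"value": "a", "prefix": "x", "sep": "_", "upper": True, "expected": "X_A"},
        id="uppercased",
    ),
]


@pytest.mark.parametrize("case", CASES)
def test_add_prefix(case):
    # Split the inputs from the expected value without mutating the case.
    kwargs = {k: v for k, v in case.items() if k != "expected"}
    assert add_prefix(**kwargs) == case["expected"]
```

Named keys keep each case self-describing, and the `id` shows up in the pytest report instead of a long list of positional values.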

Contributor Author

Done.

"""
# check the validity of bpc_column_list
valid_col = column_mapping_table.loc[column_mapping_table["prissmm_form"] == form,].prissmm_element.tolist()
Contributor

Why are we using prissmm_form? Could the function docstring describe why we pull the column list for both BPC and main GENIE from here?

Contributor Author

The reason is that prissmm_form matches the form column in the Data Table information table. I can update the docstring.
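
A minimal sketch of the validity check under discussion, assuming a mapping table with `prissmm_form` and `prissmm_element` columns as in the diff. The helper names and the example rows are hypothetical.

```python
import pandas as pd


def get_valid_columns(column_mapping_table: pd.DataFrame, form: str) -> list:
    """Return the prissmm_element values registered for one form."""
    mask = column_mapping_table["prissmm_form"] == form
    return column_mapping_table.loc[mask, "prissmm_element"].tolist()


def find_invalid_columns(bpc_column_list: list, valid_col: list) -> list:
    """Return requested columns that are not valid for the form (hypothetical helper)."""
    return [col for col in bpc_column_list if col not in valid_col]


# Hypothetical mapping table; the real form and element names may differ.
column_mapping_table = pd.DataFrame(
    {
        "prissmm_form": ["patient_characteristics", "cancer_panel_test", "cancer_panel_test"],
        "prissmm_element": ["race", "sample_type", "seq_date"],
    }
)

valid_col = get_valid_columns(column_mapping_table, "cancer_panel_test")
invalid = find_invalid_columns(["sample_type", "race"], valid_col)
```

Because the form name in the mapping table matches the form column of the Data Table information table, the same `form` value can drive both lookups.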

)
else:
main_genie_table = main_genie_table[main_genie_column_list + ["SAMPLE_ID"]]
Contributor

@rxu17 rxu17 Oct 24, 2024

I'm not sure what the original code looked like, but was there ever any handling of potential duplicates, both before (when we first query by cohort) and after merging here?

Contributor Author

The original code doesn't handle duplicates.
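
One possible way to guard against duplicates, sketched here with made-up data and not taken from this PR, is to check the main GENIE key before merging and let pandas enforce the expected cardinality with the `validate` argument of `merge`.

```python
import pandas as pd

# Hypothetical inputs: main GENIE contains a duplicated SAMPLE_ID (S2).
bpc = pd.DataFrame({"cpt_genie_sample_id": ["S1", "S2"]})
main_genie = pd.DataFrame(
    {
        "SAMPLE_ID": ["S1", "S2", "S2"],
        "SAMPLE_TYPE": ["4", "1", "1"],
    }
)

# Report which keys are duplicated before doing anything else.
dup_mask = main_genie["SAMPLE_ID"].duplicated(keep=False)
dup_ids = main_genie.loc[dup_mask, "SAMPLE_ID"].unique().tolist()

# Keeping the first occurrence is an assumption; the right rule depends on the data.
deduped = main_genie.drop_duplicates(subset="SAMPLE_ID", keep="first")

# validate="many_to_one" raises pandas.errors.MergeError if the right-hand
# keys are not unique, so a left merge can never add rows silently.
merged = bpc.merge(
    deduped,
    how="left",
    left_on="cpt_genie_sample_id",
    right_on="SAMPLE_ID",
    validate="many_to_one",
)
```

With `validate` in place, merging against the table that still contains duplicates fails loudly instead of inflating the BPC row count.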

how="left",
left_on="cpt_genie_sample_id",
right_on="SAMPLE_ID",
Contributor

I see Chelsea's concern here: what happens if there isn't a 1:1 merge between BPC and main GENIE (e.g., BPC has samples/patients not present in clinical)? How did the code previously handle the merge?

@danlu1 danlu1 requested a review from rxu17 January 2, 2025 19:22
Member

@thomasyu888 thomasyu888 left a comment

🔥 Great work here! I'll defer to @rxu17 for a final review!

"OVARIAN": "syn64042773"
},
"main_genie_release_version": "16.6-consortium",
"tier1a_replacement_mapping":{
"patient_characteristics_tier1a_replacement_mapping_table": "syn64331052",
Member

May I ask what these tables are for?

Contributor Author

@danlu1 danlu1 Jan 7, 2025

The patient_characteristics_tier1a_replacement_mapping_table and cancer_panel_test_tier1a_replacement_mapping_table track how the original BPC codes are mapped to NAACCR codes. They were added to address Chelsea's request: "adds to a log if there is a mismatch between the upload and the replaced value".

Contributor

@rxu17 rxu17 left a comment

LGTM overall! Some comments/questions

scripts/table_updates/update_data_table.py (resolved)
scripts/table_updates/update_data_table.py (resolved)
scripts/table_updates/update_data_table.py (outdated, resolved)
scripts/table_updates/update_data_table.py (resolved)
@danlu1 danlu1 merged commit 95e96c4 into develop Jan 11, 2025
4 checks passed
4 participants