✨ Context
This PR implements the merging of features across studylocus-to-gene entries in L2G. It addresses the high rate of missingness as well as the class imbalance between Goldstandard positives and negatives.
🛠 What does this PR implement
New function in the L2GFeatureMatrix class: "merge_features_in_efo"
For a given studylocus-to-gene entry in the L2G feature matrix, say studyLocus_1 to gene_1:
Identify all other studies that share the same EFO/trait.
From these studies, identify the other studylocus-to-gene entries for gene_1 that lie within a set window (default 500 kb) of studyLocus_1.
Merge all features from these entries into a single row, taking the maximum where several entries provide a value for the same feature.
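The merging steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the implementation in this PR: the row layout (`study`, `trait`, `gene`, `chromosome`, `position`, `features`), the feature names, the treatment of `None` as a missing value, and the standalone function signature are all hypothetical.

```python
from typing import Any

WINDOW_BP = 500_000  # default window size from the description (500 kb)


def merge_features_in_efo(
    rows: list[dict[str, Any]], window: int = WINDOW_BP
) -> list[dict[str, Any]]:
    """Merge features across entries of the same trait and gene (illustrative sketch)."""
    merged = []
    for target in rows:
        # The entry itself, plus entries from other studies with the same
        # trait and gene whose locus lies within the window.
        group = [
            r
            for r in rows
            if r is target
            or (
                r["trait"] == target["trait"]
                and r["gene"] == target["gene"]
                and r["study"] != target["study"]
                and r["chromosome"] == target["chromosome"]
                and abs(r["position"] - target["position"]) <= window
            )
        ]
        names = {name for r in group for name in r["features"]}
        features = {}
        for name in sorted(names):
            # Ignore missing values (None); take the maximum of the rest.
            values = [r["features"].get(name) for r in group]
            values = [v for v in values if v is not None]
            features[name] = max(values) if values else None
        merged.append({**target, "features": features})
    return merged


# Hypothetical toy data: S1 and S2 share a trait and lie 300 kb apart.
rows = [
    {"study": "S1", "trait": "EFO_1", "gene": "gene_1", "chromosome": "1",
     "position": 1_000_000, "features": {"distance": 0.2, "coloc": None}},
    {"study": "S2", "trait": "EFO_1", "gene": "gene_1", "chromosome": "1",
     "position": 1_300_000, "features": {"distance": 0.1, "coloc": 0.9}},
    {"study": "S3", "trait": "EFO_1", "gene": "gene_1", "chromosome": "1",
     "position": 5_000_000, "features": {"distance": 0.8, "coloc": 0.95}},
    {"study": "S4", "trait": "EFO_2", "gene": "gene_1", "chromosome": "1",
     "position": 1_000_100, "features": {"distance": 0.99, "coloc": 0.99}},
]
merged = merge_features_in_efo(rows)
```

In this toy input the missing `coloc` value for the S1 entry is filled from S2, while S3 (too far away) and S4 (different trait) contribute nothing to it.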
New function in the L2GGoldstandards class: "balance_classes"
Randomly downsamples the Goldstandard negative set (GSN) so that the GSN:GSP ratio does not exceed a predefined upper limit (defaults to 2).
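A minimal sketch of the downsampling idea, assuming each gold-standard row carries a hypothetical `label` field with the values `"positive"` or `"negative"`. The function signature, the `seed` parameter, and the rounding of the negative limit are assumptions made for illustration and may differ from the method added in this PR.

```python
import random
from typing import Any


def balance_classes(
    rows: list[dict[str, Any]], max_ratio: float = 2.0, seed: int = 42
) -> list[dict[str, Any]]:
    """Randomly downsample negatives so that GSN:GSP <= max_ratio (illustrative sketch)."""
    positives = [r for r in rows if r["label"] == "positive"]
    negatives = [r for r in rows if r["label"] == "negative"]
    # Largest number of negatives allowed for the observed number of positives.
    limit = int(max_ratio * len(positives))
    if len(negatives) > limit:
        # Sample without replacement; a fixed seed keeps the result reproducible.
        negatives = random.Random(seed).sample(negatives, limit)
    return positives + negatives


# Hypothetical toy data: 3 positives and 10 negatives.
gold = [{"id": i, "label": "positive"} for i in range(3)]
gold += [{"id": 100 + i, "label": "negative"} for i in range(10)]
balanced = balance_classes(gold, max_ratio=2.0)
```

With 3 positives and a ratio limit of 2, at most 6 negatives are kept. A set that is already within the limit is returned unchanged.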
🚦 Before submitting
dev branch?
Tests pass (make test)?
Pre-commit checks pass (poetry run pre-commit run --all-files)?