feat: add interval logic for l2g features #812
base: dev
# feature will be the same for any gene associated with a studyLocus)
local_max.withColumn(
"regional_maximum",
f.max(local_feature_name).over(Window.partitionBy("studyLocusId")),
Why is it the maximum? According to the table and what we discussed, it should be the mean.
https://docs.google.com/spreadsheets/d/1wUs1AprRCCGItZmgDhc1fF5BtwCSosdzFv4NQ8V6Dtg/edit?gid=452826388#gid=452826388
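To make the difference raised in this comment concrete, here is a minimal, dependency-free sketch of a per-locus aggregate computed with the maximum versus the mean. The rows and the helper `regional_aggregate` are hypothetical and only mimic the semantics; in the PySpark code under review the equivalent change would presumably be swapping `f.max(...)` for `f.mean(...)` over the same `Window.partitionBy("studyLocusId")`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (studyLocusId, geneId, local feature value).
rows = [
    ("locus1", "geneA", 1.0),
    ("locus1", "geneB", 0.5),
    ("locus2", "geneC", 0.25),
]


def regional_aggregate(rows, aggregate):
    """Attach to every (locus, gene) pair an aggregate computed over its locus.

    This mirrors a window function partitioned by studyLocusId: every gene
    in the same locus receives the same regional value.
    """
    values_by_locus = defaultdict(list)
    for locus, _gene, value in rows:
        values_by_locus[locus].append(value)
    regional = {locus: aggregate(vals) for locus, vals in values_by_locus.items()}
    return {(locus, gene): regional[locus] for locus, gene, _value in rows}


regional_max = regional_aggregate(rows, max)
regional_mean = regional_aggregate(rows, mean)

print(regional_max[("locus1", "geneB")])   # regional maximum for locus1
print(regional_mean[("locus1", "geneB")])  # regional mean for locus1
```

The two aggregates only coincide for loci with a single gene, so the choice changes the feature values for most loci.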
Thank you for the changes Jack!!!
The logic to build the features looks good! Please see my comments, but they are more along the lines of how we process the interval data in the L2G step.
I suggested processing all interval sources together to make the process simpler. However, since the code is already set up to take source names and paths individually and changing that would be messy, it's also fine to leave it as is, as long as the interval_paths parameter is correctly configured.
The implemented changes wouldn't run, because an Intervals dataset is created with a mismatching schema. I would encourage you to:
- Add any features you add to the test_l2g_feature_matrix.py suite, to make sure that the code doesn't crash
- In the same file, add a semantic test for the common logic
- Update the documentation pages
- Pull the dev branch to bring in the changes to the feature matrix step
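As an illustration of the first point in the list above, a smoke test can check that every interval feature named in the configuration is also registered in the factory, so a renamed feature fails in the test suite rather than at run time. This is only a sketch: `FEATURE_REGISTRY`, `INTERVAL_FEATURES`, and `missing_features` are hypothetical stand-ins, not the actual objects in `test_l2g_feature_matrix.py` or the feature factory.

```python
# Hypothetical list of interval features requested in the step configuration.
INTERVAL_FEATURES = [
    "pchicMean",
    "pchicMeanNeighbourhood",
    "enhTssMean",
    "enhTssMeanNeighbourhood",
    "dhsPmtrMean",
    "dhsPmtrMeanNeighbourhood",
]

# Hypothetical stand-in for the factory's name -> feature class mapping.
# Plain `object` is used as a placeholder for the real feature classes.
FEATURE_REGISTRY = {name: object for name in INTERVAL_FEATURES}


def missing_features(requested, registry):
    """Return the requested feature names that the registry does not know."""
    return [name for name in requested if name not in registry]


def test_all_interval_features_are_registered():
    """Fail loudly if a configured feature has no registered implementation."""
    assert missing_features(INTERVAL_FEATURES, FEATURE_REGISTRY) == []


test_all_interval_features_are_registered()
```

A check like this is cheap and would also catch the case where feature names are renamed in the configuration but not in the factory.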
# intervals
"pchicMean",
"pchicMeanNeighbourhood",
"enhTssMean",
I'd like to have more descriptive feature names
- "enhTssMean",
+ "enhancerTssCorrelationMean",
"pchicMean",
"pchicMeanNeighbourhood",
"enhTssMean",
"enhTssMeanNeighbourhood",
- "enhTssMeanNeighbourhood",
+ "enhancerTssCorrelationMeanNeighbourhood",
"pchicMeanNeighbourhood",
"enhTssMean",
"enhTssMeanNeighbourhood",
"dhsPmtrMean",
- "dhsPmtrMean",
+ "dhsPromoterCorrelationMean",
"enhTssMean",
"enhTssMeanNeighbourhood",
"dhsPmtrMean",
"dhsPmtrMeanNeighbourhood",
- "dhsPmtrMeanNeighbourhood",
+ "dhsPromoterCorrelationMeanNeighbourhood",
@@ -282,6 +289,11 @@ class LocusToGeneConfig(StepConfig):
wandb_run_name: str | None = None
hf_hub_repo_id: str | None = "opentargets/locus_to_gene"
download_from_hub: bool = True
# interval_sources: dict[str, str] | None = {
I would remove this
lambda x, y: x.unionByName(y, allowMissingColumns=True),
# create interval instances by parsing each source
[
Intervals.from_source(
See my comment in config.py. I wouldn't split the logic into different sources of data, so you don't have to iterate and then perform the union.
I'd make this simpler. We are no longer that interested in adjusting which interval sources we might want. Because we only use them for L2G, I think the process is simpler if we compute all the interval data and then pick what we want to include based on the features.
This way you only need to provide one path for the intervals (pointing to the folder that contains them all), compute everything, and then let the list of features decide what is ingested.
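The idea of letting the feature list decide what is ingested could look roughly like the sketch below. Everything here is an assumption made for illustration: the prefix-to-type mapping, the `intervalType` field, and the helper names are hypothetical and are not part of the existing code.

```python
# Hypothetical mapping from a feature-name prefix to the kind of interval
# data that feature needs. The real mapping would live next to the features.
FEATURE_PREFIX_TO_INTERVAL_TYPE = {
    "pchic": "pchic",
    "enhTss": "enhTss",
    "dhsPmtr": "dhsPmtr",
}


def required_interval_types(features_list):
    """Derive which interval types are needed from the requested features."""
    needed = set()
    for feature in features_list:
        for prefix, interval_type in FEATURE_PREFIX_TO_INTERVAL_TYPE.items():
            if feature.startswith(prefix):
                needed.add(interval_type)
    return needed


def select_intervals(all_intervals, features_list):
    """Keep only the rows of the combined interval data that some feature uses."""
    needed = required_interval_types(features_list)
    return [row for row in all_intervals if row["intervalType"] in needed]


# Hypothetical combined interval data, as if read from a single folder.
all_intervals = [
    {"intervalType": "pchic", "geneId": "geneA", "score": 0.8},
    {"intervalType": "enhTss", "geneId": "geneA", "score": 0.4},
    {"intervalType": "dhsPmtr", "geneId": "geneB", "score": 0.6},
]

selected = select_intervals(all_intervals, ["pchicMean", "enhTssMeanNeighbourhood"])
```

With this shape the configuration only needs a single intervals path, and adding or dropping an interval feature automatically changes which data is used.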
With the processed interval dataset, we still want to update it with the latest gene index every release, right?
But this join is part of the interval processing step, and it is done differently for each interval source: some sources already have gene names attached, while others require an overlap of genomic regions.
So maybe we can bring back the v2g step in a DAG (but only for intervals, as an "intervals" step)?
Otherwise we process it in this list format, which I agree looks very ugly and messy.
src/gentropy/l2g.py
Outdated
how="inner",
)
.drop("start", "end", "vi_chromosome", "position"),
_schema=Intervals.get_schema(),
I don't think this works. You're converting the interval dataset into a variant-to-gene format, so the schema has changed.
What I would do: this logic converts the raw intervals into variant/gene relationships. I would create a method in the Intervals dataset (with a name different to v2g) to compute this. This could be useful later on, so having it inside L2G is not great.
Yes, so I moved this into a function in the interval dataset that overlaps it with the variant index, and I added "variantId" to the interval schema. Is this the easiest way to address this without making a new v2g-style dataset?
I don't think it is ideal, as the unit of this dataset would be a variant instead of an interval. Would collecting the variants into a list work?
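A rough sketch of what collecting the variants into a list could mean: each interval stays a single row and carries the IDs of the variants that fall inside it. The data, the field name `variantIds`, the helper, and the inclusive bounds are all assumptions for illustration. In PySpark this would presumably be a join followed by a `groupBy` on the interval columns with `f.collect_list("variantId")`.

```python
# Hypothetical interval rows: (chromosome, start, end, geneId, score).
intervals = [
    ("1", 100, 200, "geneA", 0.8),
    ("1", 500, 600, "geneB", 0.4),
]

# Hypothetical variant index rows: (variantId, chromosome, position).
variants = [
    ("1_150_A_T", "1", 150),
    ("1_180_G_C", "1", 180),
    ("1_900_C_G", "1", 900),
]


def collect_overlapping_variants(intervals, variants):
    """Return one record per interval with the list of variants it contains.

    Assumes inclusive interval bounds. Intervals without any overlapping
    variant are kept with an empty list, so the unit stays the interval.
    """
    records = []
    for chromosome, start, end, gene_id, score in intervals:
        variant_ids = [
            variant_id
            for variant_id, variant_chromosome, position in variants
            if variant_chromosome == chromosome and start <= position <= end
        ]
        records.append(
            {
                "chromosome": chromosome,
                "start": start,
                "end": end,
                "geneId": gene_id,
                "score": score,
                "variantIds": variant_ids,
            }
        )
    return records


interval_records = collect_overlapping_variants(intervals, variants)
```

Note that an inner join followed by a group-by would drop intervals with no overlapping variant, whereas this sketch keeps them; which behaviour is wanted is a design decision for the dataset.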
@@ -1,6 +1,6 @@
"""Factory that computes features based on an input list."""

- from __future__ import annotations
+ from __future__ import annotations  # noqa: I001
Why is this?
@@ -127,6 +135,12 @@ class FeatureFactory:
"vepMeanNeighbourhood": VepMeanNeighbourhoodFeature,
"vepMaximum": VepMaximumFeature,
"vepMaximumNeighbourhood": VepMaximumNeighbourhoodFeature,
"pchicMean": PchicMeanFeature,
"pchicMeanNeighbourhood": PchicMeanNeighbourhoodFeature,
"enhTssMean": EnhTssMeanFeature,
update feature names
…1_l2g_intervals
…1_l2g_intervals
…d test for interval features
…1_l2g_intervals
…1_l2g_intervals
✨ Context
Adding interval based features to the l2g model, based on the feature list (opentargets/issues#3521).
opentargets/issues#3512
🛠 What does this PR implement
🙈 Missing
More features from the Andersson and Thurman interval datasets.
🚦 Before submitting
- `dev` branch?
- Local tests (`make test`)?
- Pre-commit rules (`poetry run pre-commit run --all-files`)?