feat: add interval logic for l2g features #812

xyg123 · 2024-10-03T13:02:08Z

✨ Context

Adds interval-based features to the L2G model, following the feature list (opentargets/issues#3521).
opentargets/issues#3512

🛠 What does this PR implement

Implements PCHIC-based interval features for the L2G gene prediction model.
Adds the interval processing steps back into the L2G feature generation step.

🙈 Missing

More features from the Andersson and Thurman interval sources.

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

addramir · 2024-10-07T11:30:13Z

src/gentropy/dataset/l2g_features/intervals.py

+        # feature will be the same for any gene associated with a studyLocus)
+        local_max.withColumn(
+            "regional_maximum",
+            f.max(local_feature_name).over(Window.partitionBy("studyLocusId")),


Why is this the maximum? According to the table and what we discussed, it should be the mean.
https://docs.google.com/spreadsheets/d/1wUs1AprRCCGItZmgDhc1fF5BtwCSosdzFv4NQ8V6Dtg/edit?gid=452826388#gid=452826388
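To make the difference raised in this comment concrete, here is a minimal plain-Python sketch (not the PR's PySpark code) that computes both the regional maximum and the regional mean of a per-gene score within one study locus. The row layout, the `score` field, and the function name are assumptions made for illustration. In PySpark, the analogous change would presumably be swapping `f.max` for `f.mean` over the same `studyLocusId` window.

```python
from collections import defaultdict
from statistics import mean


def regional_aggregates(rows):
    """Return {studyLocusId: (regional_max, regional_mean)} over per-gene scores."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["studyLocusId"]].append(row["score"])
    return {locus: (max(values), mean(values)) for locus, values in scores.items()}


# Hypothetical per-gene local feature values for a single study locus.
rows = [
    {"studyLocusId": "sl1", "geneId": "g1", "score": 0.8},
    {"studyLocusId": "sl1", "geneId": "g2", "score": 0.2},
    {"studyLocusId": "sl1", "geneId": "g3", "score": 0.2},
]

regional_max, regional_mean = regional_aggregates(rows)["sl1"]
```

With one strong gene and two weak ones, the maximum is driven entirely by the strongest gene, while the mean reflects the whole neighbourhood, so the two choices produce different neighbourhood features.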

ireneisdoomed

Thank you for the changes Jack!!!

The logic to build the features looks good! Please see my comments; they mostly concern how we process the interval data in the L2G step.
I suggested processing all interval sources together to make the process simpler, but since the code is already set up to take source names and paths individually and changing that would be messy, it's also fine to leave it as is, as long as the interval_paths parameter is correctly configured.

The implemented changes wouldn't run, because they create an Intervals dataset with a mismatching schema. I would encourage you to:

Add any new features to the test_l2g_feature_matrix.py suite, to make sure that the code doesn't crash
In the same file, add a semantic test for the common logic
Update the documentation pages
Pull the dev branch to bring in the changes to the feature matrix step

ireneisdoomed · 2024-10-11T11:20:42Z

src/gentropy/config.py

+            # intervals
+            "pchicMean",
+            "pchicMeanNeighbourhood",
+            "enhTssMean",


I'd like to have more descriptive feature names.

Suggested change

-            "enhTssMean",
+            "enhancerTssCorrelationMean",

ireneisdoomed · 2024-10-11T11:20:58Z

src/gentropy/config.py

+            "pchicMean",
+            "pchicMeanNeighbourhood",
+            "enhTssMean",
+            "enhTssMeanNeighbourhood",


Suggested change

-            "enhTssMeanNeighbourhood",
+            "enhancerTssCorrelationMeanNeighbourhood",

ireneisdoomed · 2024-10-11T11:24:46Z

src/gentropy/config.py

+            "pchicMeanNeighbourhood",
+            "enhTssMean",
+            "enhTssMeanNeighbourhood",
+            "dhsPmtrMean",


Suggested change

-            "dhsPmtrMean",
+            "dhsPromoterCorrelationMean",

ireneisdoomed · 2024-10-11T11:24:59Z

src/gentropy/config.py

+            "enhTssMean",
+            "enhTssMeanNeighbourhood",
+            "dhsPmtrMean",
+            "dhsPmtrMeanNeighbourhood",


Suggested change

-            "dhsPmtrMeanNeighbourhood",
+            "dhsPromoterCorrelationMeanNeighbourhood",

ireneisdoomed · 2024-10-11T11:25:25Z

src/gentropy/config.py

@@ -282,6 +289,11 @@ class LocusToGeneConfig(StepConfig):
    wandb_run_name: str | None = None
    hf_hub_repo_id: str | None = "opentargets/locus_to_gene"
    download_from_hub: bool = True
+    # interval_sources: dict[str, str] | None = {


I would remove this commented-out parameter.

ireneisdoomed · 2024-10-11T12:43:41Z

src/gentropy/l2g.py

+                    lambda x, y: x.unionByName(y, allowMissingColumns=True),
+                    # create interval instances by parsing each source
+                    [
+                        Intervals.from_source(


See my comment in config.py. I wouldn't split the logic by data source; that way you don't have to iterate over the sources and then perform the union.

ireneisdoomed · 2024-10-11T12:46:23Z

src/gentropy/config.py

I'd make this simpler. We are no longer that interested in choosing which interval sources to include. Because we only use intervals for L2G, I think the process is simpler if we compute all the interval data and then pick what we want to include based on the features.
This way you only need to provide one path for the intervals (pointing to the folder that contains them all), compute everything, and then let the list of features decide what is ingested.

xyg123 · 2024-10-16

With the processed interval dataset, we still want to update it with the latest gene index every release, right?
But this join is part of the interval processing step, and it is done differently for each interval source: some sources already have gene names attached, while others require an overlap of genomic regions.

So maybe we can bring back the v2g step in a DAG (but only for intervals, as an "intervals" step)?

Otherwise we process it in this list format, which I agree looks very ugly and messy.
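The single-path idea proposed in this thread (compute all interval data once, then let the feature list decide what is used) could be sketched as follows. This is a plain-Python illustration under stated assumptions, not gentropy code: the `datatype` labels, the `FEATURE_TO_DATATYPE` mapping, and `compute_features` are hypothetical names, and the per-gene mean stands in for whatever aggregation the real features apply.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from feature name to the interval data type it consumes.
FEATURE_TO_DATATYPE = {
    "pchicMean": "pchic",
    "enhTssMean": "enhancer_tss",
    "dhsPmtrMean": "dhs_promoter",
}

# Hypothetical pooled interval records, as if read from one folder with all sources.
intervals = [
    {"datatype": "pchic", "geneId": "g1", "score": 0.4},
    {"datatype": "pchic", "geneId": "g1", "score": 0.6},
    {"datatype": "pchic", "geneId": "g2", "score": 0.2},
    {"datatype": "enhancer_tss", "geneId": "g1", "score": 0.9},
]


def compute_features(intervals, features_list):
    """Compute only the requested features from the pooled interval records."""
    result = {}
    for feature in features_list:
        datatype = FEATURE_TO_DATATYPE[feature]
        per_gene = defaultdict(list)
        for record in intervals:
            if record["datatype"] == datatype:
                per_gene[record["geneId"]].append(record["score"])
        result[feature] = {gene: mean(scores) for gene, scores in per_gene.items()}
    return result


selected = compute_features(intervals, ["pchicMean"])
```

Here the configuration carries only one interval path plus the feature list; records from sources that no requested feature needs are simply never aggregated.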

ireneisdoomed · 2024-10-11T13:19:57Z

src/gentropy/l2g.py

+                    how="inner",
+                )
+                .drop("start", "end", "vi_chromosome", "position"),
+                _schema=Intervals.get_schema(),


I don't think this works. You're converting the interval dataset into a variant-to-gene format, so the schema has changed.
What I would do: this logic converts the raw intervals into variant/gene relationships, so I would create a method in the Intervals dataset (with a name other than v2g) to compute it. This could be useful later on, so having it inside L2G is not great.

xyg123 · 2024-10-16

Yes, so I moved this into a function that overlaps the interval dataset with the variant index, and I added "variantId" to the interval schema. Is this the easiest way to address this without making a new v2g-style dataset?

ireneisdoomed · 2024-10-16

I don't think it is ideal, as the unit of this dataset would then be a variant instead of an interval. Would collecting the variants into a list work?
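The suggestion to collect the variants into a list, so that each row still represents one interval, might look like the sketch below. It is plain Python rather than the PR's PySpark code, and the record fields, the `variantIds` column name, and the function name are assumptions for illustration only.

```python
def collect_variants_per_interval(intervals, variants):
    """Attach to each interval the list of variant IDs that fall inside it.

    The output keeps one row per interval, so the interval stays the unit
    of the dataset; intervals without overlapping variants get an empty list.
    """
    annotated = []
    for interval in intervals:
        variant_ids = [
            variant["variantId"]
            for variant in variants
            if variant["chromosome"] == interval["chromosome"]
            and interval["start"] <= variant["position"] <= interval["end"]
        ]
        annotated.append({**interval, "variantIds": variant_ids})
    return annotated


# Hypothetical inputs: two intervals and three variants.
intervals = [
    {"chromosome": "1", "start": 100, "end": 200, "geneId": "g1", "score": 0.7},
    {"chromosome": "2", "start": 100, "end": 200, "geneId": "g2", "score": 0.3},
]
variants = [
    {"variantId": "1_150_A_G", "chromosome": "1", "position": 150},
    {"variantId": "1_180_C_T", "chromosome": "1", "position": 180},
    {"variantId": "1_500_G_A", "chromosome": "1", "position": 500},
]

annotated = collect_variants_per_interval(intervals, variants)
```

Compared with adding a single "variantId" column, this keeps the row count equal to the number of intervals, at the cost of an array-typed column that has to be exploded again before joining to per-variant data.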

ireneisdoomed · 2024-10-11T13:20:15Z

src/gentropy/method/l2g/feature_factory.py

@@ -1,6 +1,6 @@
 """Factory that computes features based on an input list."""

-from __future__ import annotations
+from __future__ import annotations  # noqa: I001


Why is the # noqa: I001 needed here?

ireneisdoomed · 2024-10-11T13:39:29Z

src/gentropy/method/l2g/feature_factory.py

@@ -127,6 +135,12 @@ class FeatureFactory:
        "vepMeanNeighbourhood": VepMeanNeighbourhoodFeature,
        "vepMaximum": VepMaximumFeature,
        "vepMaximumNeighbourhood": VepMaximumNeighbourhoodFeature,
+        "pchicMean": PchicMeanFeature,
+        "pchicMeanNeighbourhood": PchicMeanNeighbourhoodFeature,
+        "enhTssMean": EnhTssMeanFeature,


Update the feature names here as well.

feat: add interval logic for l2g features

9c31f43

github-actions bot added size-M Method Dataset Step Feature labels Oct 3, 2024

xyg123 added 3 commits October 3, 2024 14:29

chore: fix docstrings

330b79e

chore: fix attribute errors

183c827

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

500bae8

…1_l2g_intervals

addramir requested a review from ireneisdoomed October 5, 2024 07:21

xyg123 added 2 commits October 7, 2024 11:14

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

7cb4b5f

…1_l2g_intervals

fix: multiple input lines from merge

2035a52

addramir reviewed Oct 7, 2024

xyg123 added 3 commits October 7, 2024 16:02

fix: change to mean comparison, add additional interval features

985a901

fix: change to mean comparison, add additional interval features

b01b4e8

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

688c73a

…1_l2g_intervals

ireneisdoomed requested changes Oct 11, 2024

xyg123 added 4 commits October 15, 2024 10:21

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

6837df3

…1_l2g_intervals

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

f194098

…1_l2g_intervals

fix: change interval schema, reorganise interval processing, begin ad…

a9c0f6b

…d test for interval features

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

63d6db6

…1_l2g_intervals

github-actions bot added size-L and removed size-M labels Oct 17, 2024

Merge branch 'dev' of https://github.com/opentargets/gentropy into xg…

374a7c3

…1_l2g_intervals
