Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up clonotype distance calculation #470

Merged
merged 63 commits into from
Sep 8, 2024

Conversation

felixpetschko
Copy link
Collaborator

@felixpetschko felixpetschko commented Jan 7, 2024

Close #368
Close #507

compute_distances new version added

@grst grst changed the title compute_distances new version added Speed up clonotype distance calculation Jan 9, 2024
@grst
Copy link
Collaborator

grst commented Jan 9, 2024

Hi @felixpetschko,

I just had a look at the failing tests. Most of them are indeed direct or indirect failures due to the new distance calculator. Some of them are testing the API of the DoubleLookupNeighborFinder and are expected to fail because you changed the API (for good reasons). Those we'll have to adapt or remove eventually.

At first, I suggest you take a look at the tests in test_clonotypes -- it seems they cover some edge cases where your implementation throws an error.

image

LMK if anything is unclear!

Gregor

@felixpetschko
Copy link
Collaborator Author

@grst Now I could make all tests pass without changing them - I didn't break the API. There are still some documentation and other refinements needed.
The next step would be to implement j_gene matching which I already did for a version that @MKanetscheider uses for his analysis by just copying the v_gene related parts of the code. Should I also do it like that here or should we introduce a more general concept, eg by just passing the column suffixes (eg ["v_call", "j_call",...]) instead of the booleans (same_v_gene, same_j_gene). I don't know if it's biologically possible to have more than those 2.

@grst
Copy link
Collaborator

grst commented Aug 30, 2024

I dropped support for python 3.9 in the main branch which changed some of the formatting rules (most notably Union[A, B] is now simply A | B.

I resolved the (few) merge conflicts and hopefully didn't break anything.

@felixpetschko
Copy link
Collaborator Author

Hi @grst , now I added some documentation and did some variable renaming for clarification. I also added testing for the j_gene matching.

    if partitions == "leiden":
        part = g.community_leiden(
            objective_function="modularity",
            resolution_parameter=resolution,
            n_iterations=n_iterations,
        )
    elif partitions == "fastgreedy": # <--
        part = g.community_fastgreedy().as_clustering()
    else:
        part = g.clusters(mode="weak")

I also allowed to use the "fastgreedy" algorithm for the graph partitioning since it's used for the BCR analysis. However I have been told that choosing a different partitioning method doesn't make a big difference in most cases.

There still seems to be a test case that has some problems with the python versions (conda / test (ubuntu-latest, 3.12) (pull_request)) Could you maybe take a look at it?

Besides from that I am done with my implementations so far. If everything is fine for you I would make some final test runs on the cluster tomorrow. Then we can finally finish this PR :)

Copy link
Collaborator

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM now!

I'll merge it and make a release on Monday just in time for the conference.

The conda test failure is not your fault. I just retriggered it, if it doesn't fix itself I'll look into it or we can ignore it for now.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@grst grst merged commit cbc01d1 into scverse:main Sep 8, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clone definition purely using CDR3 sequence speed up define_clonotypes
2 participants