Speed up anndata writing speed after define_clonotype_clusters #556

felixpetschko · 2024-09-18T17:27:51Z

Writing the anndata object after running define_clonotype_clusters is too slow and is currently one of the biggest bottlenecks in scirpy's analysis pipeline. The cause is the clonotype_id -> cell_ids mapping (dict[str, np.ndarray[str]]) that is stored in the anndata.uns attribute. I implemented a version in which this mapping has the type dict[str, list[str]] and is converted to a JSON object before it is stored. This is quite fast and close enough to the previous implementation that only minor changes to the rest of the code were necessary.
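To illustrate the idea, here is a minimal sketch of the JSON round trip, not the actual scirpy code. The helper names `encode_cell_indices` and `decode_cell_indices` and the example barcodes are hypothetical. The point is that the whole mapping becomes a single string, so anndata only has to write one value instead of one small array per clonotype.

```python
import json


def encode_cell_indices(cell_indices):
    """Serialize a clonotype_id -> cell_ids mapping to one JSON string.

    Hypothetical helper: values may be lists, tuples, or NumPy string
    arrays; each is converted to a plain list of str first, because the
    json module cannot serialize NumPy arrays directly.
    """
    as_lists = {str(k): [str(c) for c in v] for k, v in cell_indices.items()}
    return json.dumps(as_lists)


def decode_cell_indices(encoded):
    """Restore the dict[str, list[str]] mapping from its JSON string."""
    return json.loads(encoded)


# Example mapping with made-up clonotype ids and cell barcodes.
cell_indices = {"0": ["AAACCTG-1", "AAAGATG-1"], "1": ["AACTCAG-1"]}

encoded = encode_cell_indices(cell_indices)  # a single str, cheap to store
restored = decode_cell_indices(encoded)      # back to dict[str, list[str]]
```

In this sketch, `encoded` is what would be placed in anndata.uns before writing, and code that reads the result would call the decode step first. Values come back as lists rather than arrays, which is why some minor changes to downstream code are needed.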


felixpetschko · 2024-09-18T17:57:49Z

I made some measurements on my laptop:

In the first image we can see why we have a problem. Storing the anndata object takes ~28 times longer than the define_clonotype_clusters function itself.

With the JSON approach the storage time is reduced drastically: for around 100k cells, writing takes about as long as the function itself.

Here we see the speedup of the new approach.


grst · 2024-09-20T08:42:22Z

Maybe this can be fixed in anndata directly. There's no good reason for this to be slow. I'll open a ticket.

If they can't or don't want to fix it, then we can go with this workaround.

grst · 2024-09-21T12:55:16Z

scverse/anndata#1684

felixpetschko and others added 3 commits September 17, 2024 18:38

convert cell_indices str->array dict to a csr matrix before storing the anndata result

f1c40fa

save cell_indices as json format

97e13b7

[pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)

78124f1

felixpetschko added 2 commits September 18, 2024 20:15

removed unused conversion function

bca630e

Merge branch 'result_storage' of https://github.com/felixpetschko/scirpy into result_storage

6166305
