Scalability to >1M cells #370

grst · 2022-10-09T14:47:04Z

Description of feature

I have been playing with omniscope's COVID dataset, which provides 8M T-cell receptors (TCRs). In doing so, I identified several bottlenecks that make working with >1M cells in scirpy painful or impossible.

This meta issue gives an overview of the progress on improving scirpy's scalability.

graph TB
    subgraph legend
         legend1(could be faster -- minutes)
         OK(OK -- seconds)
         legend2(prohibitively slow -- hours)
         legend3(not profiled yet)
         style legend1 stroke:#ff7f00
         style OK stroke:#4daf4a
         style legend2 stroke:#e41a1c
    end

graph TB
    subgraph preprocessing
      IO --> index_chains
      index_chains --> QC
      QC --> dist_id[ir_dist identity]
      QC --> dist_levenshtein[ir_dist levenshtein]
      QC --> dist_alignment[ir_dist alignment]
      dist_id --> define_clonotypes
      dist_levenshtein --> define_clonotypes
      dist_alignment --> define_clonotypes
      define_clonotypes --> clonotypes
      QC -.-> autoencoder
      autoencoder -.-> clonotypes
      autoencoder -.-> define_clonotypes

      clonotypes[(CLONOTYPES)]
      
      style IO stroke:#ff7f00
      style index_chains stroke:#ff7f00
      style QC stroke:#4daf4a
      style dist_id stroke:#4daf4a
      style define_clonotypes stroke:#e41a1c
      style dist_levenshtein stroke:#e41a1c
      style dist_alignment stroke:#e41a1c
      style clonotypes stroke:white
   end
   
   subgraph downstream
      clonotypes --> clonotype_network
      clonotypes --> other[other tools]
   end

Action items

data structure (Implement scverse datastucture #356). The foundation for the other changes. Might also speed up saving the AnnData object.
reading data (Speed up read_airr #367). User experience can be improved, but this is not a top priority at the moment.
index_chains (Speed up index_chains #386). Could be faster.
ir_dist (Speed up ir_dist #304). Needs more scalable methods for computing sequence distances.
define_clonotypes (speed up define_clonotypes #368). At the very least, needs better parallelization. Maybe there's room for some jax/numba.
autoencoder-based embedding (Autoencoder-based sequence embedding #369). Possible alternative to ir_dist. Maybe it even makes sense to combine ir_dist and define_clonotypes into a single step.
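
To illustrate what "more scalable methods for computing sequence distances" could mean for the ir_dist item, here is a minimal, stdlib-only sketch. It is not scirpy's implementation, and the function names (`levenshtein`, `neighbors_within_cutoff`) are hypothetical. It assumes we only care about sequence pairs within a small edit-distance cutoff, and relies on three generic tricks: deduplicate sequences first, skip pairs whose length difference already exceeds the cutoff (the length difference is a lower bound on the Levenshtein distance), and abort the dynamic program as soon as the cutoff is provably exceeded.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str, cutoff: int) -> int:
    """Edit distance between a and b, capped at cutoff + 1.

    Returns cutoff + 1 as soon as the distance is known to exceed the cutoff.
    """
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > cutoff:
        return cutoff + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        # Row minima never decrease, so we can stop early.
        if min(cur) > cutoff:
            return cutoff + 1
        prev = cur
    return min(prev[-1], cutoff + 1)


def neighbors_within_cutoff(sequences, cutoff=2):
    """Return {(seq_a, seq_b): distance} for unique pairs with distance <= cutoff."""
    # Duplicates have distance 0 by definition, so only unique sequences are compared.
    by_length = defaultdict(list)
    for seq in sorted(set(sequences)):
        by_length[len(seq)].append(seq)

    result = {}
    for length, bucket in by_length.items():
        # Pairs of equal length.
        for a, b in combinations(bucket, 2):
            d = levenshtein(a, b, cutoff)
            if d <= cutoff:
                result[(a, b)] = d
        # Pairs where b is at most `cutoff` residues longer than a.
        for delta in range(1, cutoff + 1):
            longer = by_length.get(length + delta, ())
            for a in bucket:
                for b in longer:
                    d = levenshtein(a, b, cutoff)
                    if d <= cutoff:
                        result[(a, b)] = d
    return result
```

Calling `neighbors_within_cutoff(cdr3_sequences, cutoff=2)` on a list of CDR3 strings returns a sparse dictionary of close pairs rather than a dense all-vs-all matrix. This still scales quadratically within each length bucket, so it only reduces the constant factor; it does not replace the dedicated work tracked in #304.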


grst mentioned this issue Apr 18, 2023

scverse datastructure for AIRR data #327

Closed

1 task

grst mentioned this issue Jan 11, 2024

Large dataset tutorial #479

Open
