covsonar 2 runtime (and memory usage?) #98

Closed
6 tasks
matthuska opened this issue Aug 31, 2023 · 2 comments

Comments

@matthuska
Contributor

It would be great if covsonar 2 were faster than covsonar 1, but we don't expect that to be the case because covsonar 2 is much more flexible than covsonar 1. Nevertheless, covsonar 2 has to be fast enough to be useful for us.

The following commands have to run in a reasonable amount of time* (and with a reasonable amount of memory?):

  • extract all metadata and mutation profiles from a large database (~16M sequences in GISAID global 2023-08-31)
  • add a large number of new sequences and metadata to a new database
  • add a large number of sequences and metadata to a database, most of which are already present in the database
  • extract (or count) sequences that match a given genomic profile with a set of mutations
  • extract (or count) sequences that match a given lineage and all sublineages
  • delete a small number of sequences from a large database

* where "reasonable" means less than 1.5x the runtime of covsonar 1, or within a fixed amount of time that is deemed reasonable
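
The 1.5x criterion could be checked with a small timing harness. The following is only a minimal sketch: `time_call` and `is_reasonable` are hypothetical helper names, not part of covsonar, and the callables passed in would have to wrap the actual covsonar 1 and covsonar 2 commands (for example via `subprocess.run`).

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def is_reasonable(new_seconds: float, baseline_seconds: float,
                  max_ratio: float = 1.5) -> bool:
    """True if the new runtime is strictly below max_ratio x the baseline."""
    return new_seconds < max_ratio * baseline_seconds


if __name__ == "__main__":
    # Placeholder workloads; real use would wrap the two tools' commands.
    baseline = time_call(lambda: sum(range(100_000)))
    candidate = time_call(lambda: sum(range(120_000)))
    print(f"ratio: {candidate / baseline:.2f}, "
          f"reasonable: {is_reasonable(candidate, baseline)}")
```

Taking the best of several runs reduces noise from disk caching and other processes. For the multi-hour operations on ~16M sequences, a single run per tool is probably all that is practical.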

@matthuska added this to the covsonar 2.0.0 milestone on Aug 31, 2023

@matthuska
Contributor Author

In case it's useful in the future, I profiled the addition of 10 sequences to the current covsonar 2 version using pyinstrument. Nothing to do here, just wanted to keep it somewhere in case we need to optimize this process at some point. It looks like alignment takes ~26 seconds out of 42 seconds total, with most of the remaining time split roughly equally between lift_vars (~7 seconds) and parse_cigar (~6 seconds):

Program: sonar import --threads 1 --db output/covsonar2.db --fasta seqs-10.fasta --no-progress

41.884 <module>  sonar:2
├─ 40.941 main  covsonar/sonar.py:1100
│  └─ 40.934 execute_commands  covsonar/sonar.py:1058
│     └─ 40.929 handle_import  covsonar/sonar.py:718
│        └─ 40.929 import_data  covsonar/utils.py:549
│           └─ 40.914 _import_fasta  covsonar/utils.py:748
│              └─ 40.693 sonarAligner.process_cached_sample  covsonar/align.py:260
│                 ├─ 25.939 sonarAligner.align  covsonar/align.py:56
│                 │  └─ 25.872 sg_trace_striped_32  parasail/bindings_v2.py:3429
│                 ├─ 7.267 <listcomp>  covsonar/align.py:303
│                 │  └─ 7.265 sonarAligner.lift_vars  covsonar/align.py:403
│                 │     └─ 7.119 sonarAligner.update_nuc_positions  covsonar/align.py:343
│                 │        ├─ 4.205 Series.between  pandas/core/series.py:5411
│                 │        │     [14 frames hidden]  pandas
│                 │        ├─ 2.400 _LocIndexer.__setitem__  pandas/core/indexing.py:831
│                 │        │     [10 frames hidden]  pandas
│                 │        └─ 0.438 DataFrame.__getitem__  pandas/core/frame.py:3713
│                 ├─ 6.027 sonarAligner.parse_cigar  covsonar/align.py:83
│                 │  └─ 6.013 handle_deletion  covsonar/align.py:176
│                 │     └─ 6.013 is_frameshift_del  covsonar/align.py:119
│                 │        └─ 5.835 DataFrame.groupby  pandas/core/frame.py:8130
│                 │              [29 frames hidden]  pandas
│                 └─ 1.442 Result.__del__  parasail/bindings_v2.py:273
└─ 0.911 <module>  covsonar/sonar.py:5
   └─ 0.497 <module>  covsonar/cache.py:5
      └─ 0.441 <module>  pandas/__init__.py:1
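
To make the breakdown easier to read, the inclusive times from the profile above can be turned into shares of the total runtime. This is just arithmetic on the numbers already shown; the helper names `shares` and `max_speedup` are made up for illustration. The second helper gives a rough upper bound on the speedup if the pandas-heavy steps (lift_vars and parse_cigar) cost nothing at all, assuming everything else stays unchanged.

```python
# Inclusive times in seconds, copied from the pyinstrument output above.
TOTAL = 41.884
FRAMES = {
    "sonarAligner.align": 25.939,
    "sonarAligner.lift_vars": 7.265,
    "sonarAligner.parse_cigar": 6.027,
    "Result.__del__": 1.442,
    "module imports": 0.911,
}


def shares(frames: dict, total: float) -> dict:
    """Percentage of total runtime spent in each frame, to one decimal."""
    return {name: round(100 * t / total, 1) for name, t in frames.items()}


def max_speedup(total: float, removable: float) -> float:
    """Upper bound on speedup if `removable` seconds were eliminated entirely."""
    return total / (total - removable)


if __name__ == "__main__":
    for name, pct in shares(FRAMES, TOTAL).items():
        print(f"{name:26s} {pct:5.1f}%")
    pandas_heavy = FRAMES["sonarAligner.lift_vars"] + FRAMES["sonarAligner.parse_cigar"]
    print(f"best case without lift_vars/parse_cigar: "
          f"{max_speedup(TOTAL, pandas_heavy):.2f}x")
```

Alignment accounts for roughly 62% of the run, so even removing lift_vars and parse_cigar completely would give at most about a 1.46x speedup for this 10-sequence import. Larger gains would have to come from the alignment step itself.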

@matthuska
Contributor Author

Closed because we do not plan to continue covsonar 2 development.

In summary, the performance was much worse than covsonar 1. Some work was put into improving that situation (see #110), but it was abandoned in favor of a different solution using PostgreSQL.

@matthuska closed this as not planned on Oct 31, 2023