
[QUESTION] Recommendations to run large number of tokens on A100 40GB #259

Closed
ccoulombe opened this issue Jan 14, 2025 · 4 comments
Labels: question (Further information is requested)

Comments

@ccoulombe

Hello,

disclaimer: I'm not a user of AF3 myself, but I support users who are.

  1. What would be the recommended way to run a large number of tokens, say >6000 or >7000, on an A100 40 GB?
    A current HPC job has been running for a week now, with unified memory.

  2. What are the implications of the different values of pair_transition_shard_spec?
    After reading the performance section, I am not sure I understand the values in pair_transition_shard_spec, and thus their implications for performance on an A100 40 GB.
    For instance, what does the tuple (2048, None) represent, and what does None imply?

Also interested in seeing an answer to #236

Thanks,

@Augustin-Zidek added the question label (Further information is requested) on Jan 15, 2025
@Augustin-Zidek
Collaborator

Hello,

  1. What would be the recommended way to run a large number of tokens, say >6000 or >7000, on an A100 40 GB?
  • One option is to chop the protein into overlapping chunks (e.g. 0--2500, 2000--5500, 5000--7500), run each chunk separately, then stitch the predictions back together. However, this might reduce prediction accuracy.
  • Is getting an A100/H100 with 80 GB of RAM an option?
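The overlapping-chunk idea above can be sketched as a small helper. This is a hypothetical illustration of how the windows could be generated, not code from AF3; the function name, default chunk size, and overlap are assumptions:

```python
def overlapping_chunks(num_tokens, chunk_size=2500, overlap=500):
    """Return (start, end) windows covering [0, num_tokens) with the given
    overlap between consecutive windows.

    Hypothetical helper sketching the chunking idea; not part of AlphaFold 3.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    windows = []
    start = 0
    while True:
        end = min(start + chunk_size, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            break
        # Step back by `overlap` tokens so adjacent windows share context.
        start = end - overlap
    return windows

# For a 7500-token protein this yields four overlapping windows:
# (0, 2500), (2000, 4500), (4000, 6500), (6000, 7500)
print(overlapping_chunks(7500))
```

Each window would then be predicted independently, with the overlap regions used to align and merge the per-chunk structures.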

What are the implications of the different values of pair_transition_shard_spec?

The format is (num_tokens_upper_bound, shard_size). None means there is no upper bound (relevant code).

For instance:

  • (2048, None) - for sequences up to 2048 tokens, do not shard
  • (4096, 1024) - for sequences up to 4096 tokens, shard in chunks of 1024
  • (None, 512) - for all longer sequences, shard in chunks of 512

IIRC the impact on memory requirements is small (but enough to be useful). A tiny shard_size won't save you for very long sequences, as there are other memory bottlenecks.
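The lookup described above (first entry whose upper bound covers the sequence length wins, with None meaning "no upper bound") can be sketched as follows. The function name is illustrative, not the actual AlphaFold 3 implementation:

```python
# Spec entries are (num_tokens_upper_bound, shard_size); None as the bound
# means "no upper bound", None as the shard size means "do not shard".
PAIR_TRANSITION_SHARD_SPEC = [
    (2048, None),   # up to 2048 tokens: no sharding
    (4096, 1024),   # up to 4096 tokens: shard in chunks of 1024
    (None, 512),    # anything larger: shard in chunks of 512
]

def resolve_shard_size(num_tokens, spec=PAIR_TRANSITION_SHARD_SPEC):
    """Return the shard size for a sequence length (illustrative sketch)."""
    for upper_bound, shard_size in spec:
        if upper_bound is None or num_tokens <= upper_bound:
            return shard_size
    raise ValueError("No matching entry in shard spec")

print(resolve_shard_size(1500))   # None: no sharding
print(resolve_shard_size(3000))   # 1024
print(resolve_shard_size(8000))   # 512
```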

@ccoulombe
Author

Thanks for the insights and clarifications!

Currently, only A100s 40GB are available, but new systems with H100 80GB will be available this year.

Thanks for the help!

@ccoulombe
Author

ccoulombe commented Jan 15, 2025

Actually, one more question, which was also asked in #236:

Would allocating more CPU memory and setting pair_transition_shard_spec to (None, 256) help predict larger inputs (e.g. >6000 tokens)?

@ccoulombe ccoulombe reopened this Jan 15, 2025
ccoulombe added a commit to ccoulombe/alphafold3 that referenced this issue Jan 16, 2025
@Augustin-Zidek
Collaborator

Would allocating more CPU memory and setting pair_transition_shard_spec to (None, 256) help predict larger inputs (e.g. >6000 tokens)?

Sorry, I am not sure, but worth trying.
