
[QUESTION] Recommendations to run large number of tokens on A100 40GB #259

Closed
ccoulombe opened this issue Jan 14, 2025 · 4 comments
Labels: question (Further information is requested)

Comments

@ccoulombe

Hello,

disclaimer: I'm not a user of AF3 myself, but I support users who are.

  1. What would be the recommended way to run a large number of tokens, say >6000 or >7000, on an A100 40 GB?
    A current HPC job has been running for a week now, with unified memory.

  2. What are the implications of the different values of pair_transition_shard_spec?
    After reading the performance section, I am not sure I understand the values in pair_transition_shard_spec, and thus their implications for performance on an A100 40 GB.
    For instance, what does the tuple (2048, None) represent, and what does None imply?

Also interested in seeing an answer to #236

Thanks,

@Augustin-Zidek added the question label (Further information is requested) on Jan 15, 2025
@Augustin-Zidek
Collaborator

Hello,

  1. What would be the recommended way to run a large number of tokens, say >6000 or >7000, on an A100 40 GB?
  • One option is to chop the protein into overlapping chunks (e.g. 0--2500, 2000--5500, 5000--7500), run each chunk separately, then stitch the predictions back together. However, this might reduce prediction accuracy.
  • Is getting an A100/H100 with 80 GB of RAM an option?
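The overlapping-chunk idea above can be sketched as a small helper. This is a hypothetical illustration of how the windows could be generated, not code from AF3; the function name, default chunk size, and overlap are assumptions:

```python
def overlapping_chunks(num_tokens, chunk_size=2500, overlap=500):
    """Return (start, end) windows covering [0, num_tokens) with the given
    overlap between consecutive windows.

    Hypothetical helper sketching the chunking idea; not part of AlphaFold 3.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    windows = []
    start = 0
    while True:
        end = min(start + chunk_size, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            break
        # Step back by `overlap` tokens so adjacent windows share context.
        start = end - overlap
    return windows

# For a 7500-token protein this yields four overlapping windows:
# (0, 2500), (2000, 4500), (4000, 6500), (6000, 7500)
print(overlapping_chunks(7500))
```

Each window would then be predicted independently, with the overlap regions used to align and merge the per-chunk structures.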

What are the implications of the different values of pair_transition_shard_spec?

The format is (num_tokens_upper_bound, shard_size). None means there is no upper bound (relevant code).

For instance:

  • (2048, None) - for sequences up to 2048 tokens, do not shard
  • (4096, 1024) - for sequences up to 4096 tokens, shard in chunks of 1024
  • (None, 512) - for all longer sequences, shard in chunks of 512

IIRC the impact on memory requirements is small (but enough to be useful). A tiny shard_size won't save you for very long sequences, as there are other memory bottlenecks.
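The lookup described above (first entry whose upper bound covers the sequence length wins, with None meaning "no upper bound") can be sketched as follows. The function name is illustrative, not the actual AlphaFold 3 implementation:

```python
# Spec entries are (num_tokens_upper_bound, shard_size); None as the bound
# means "no upper bound", None as the shard size means "do not shard".
PAIR_TRANSITION_SHARD_SPEC = [
    (2048, None),   # up to 2048 tokens: no sharding
    (4096, 1024),   # up to 4096 tokens: shard in chunks of 1024
    (None, 512),    # anything larger: shard in chunks of 512
]

def resolve_shard_size(num_tokens, spec=PAIR_TRANSITION_SHARD_SPEC):
    """Return the shard size for a sequence length (illustrative sketch)."""
    for upper_bound, shard_size in spec:
        if upper_bound is None or num_tokens <= upper_bound:
            return shard_size
    raise ValueError("No matching entry in shard spec")

print(resolve_shard_size(1500))   # None: no sharding
print(resolve_shard_size(3000))   # 1024
print(resolve_shard_size(8000))   # 512
```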

@ccoulombe
Author

Thanks for the insights and clarifications!

Currently, only A100s 40GB are available, but new systems with H100 80GB will be available this year.

Thanks for the help!

@ccoulombe
Author

ccoulombe commented Jan 15, 2025

Actually, one more question, which was also asked in #236:

Would allocating more CPU memory and setting pair_transition_shard_spec to (None, 256) help predict larger inputs (e.g. >6000 tokens)?

@ccoulombe ccoulombe reopened this Jan 15, 2025
ccoulombe added a commit to ccoulombe/alphafold3 that referenced this issue Jan 16, 2025
@Augustin-Zidek
Collaborator

Would allocating more CPU memory and setting pair_transition_shard_spec to (None, 256) help predict larger inputs (e.g. >6000 tokens)?

Sorry, I am not sure, but worth trying.
