
[Feature] Automatically Free pipeline of Prompts #3095

Open
richardjonker2000 opened this issue Jan 27, 2025 · 1 comment
richardjonker2000 commented Jan 27, 2025

Motivation

Memory Issue When Running Multiple Batches of Prompts

Description

Similar to the known issue regarding memory freeing (PR #3069), I am encountering out-of-memory (OOM) problems when processing multiple batches of prompts.

My goal is to process a large number of prompts (on the order of thousands) while saving outputs every 500 prompts. To achieve this, I split the prompts into batches. However, due to memory not being freed properly, running subsequent batches leads to OOM errors.

Additionally, while the upcoming pipe.close() feature appears to free the model from VRAM, it would require reloading the model for each batch, which may not be optimal in this scenario.
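For context, a rough sketch of the reload-per-batch workaround this request is trying to avoid, assuming pipe.close() behaves as described above (the model path and prompts are placeholders):

from lmdeploy import pipeline

model = 'the/path/of/internlm2/model'
batch_size = 500
batches = [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

for batch in batches:
    pipe = pipeline(model)   # full model load on every batch
    response = pipe(batch)
    # ... save the outputs of this batch ...
    pipe.close()             # frees VRAM, but the next iteration pays the load cost again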

Reproduction

from lmdeploy import pipeline

model = 'the/path/of/internlm2/model'

batch_size = 500
batches = [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]  # Split into batches of 500
pipe = pipeline(model)

for batch in batches:
    response = pipe(batch)
    # Save/output the results of this batch

Expected Behavior

  • Memory should be properly freed after processing each batch to allow continuous execution.
  • Ideally, there should be a way to clear intermediate memory usage without fully unloading the model (as pipe.close() might do).

Related resources

No response

Additional context

No response

lzhangzz (Collaborator) commented Feb 1, 2025

Memory allocated during inference is reused by the engine, so no additional deallocation is needed. However, if a later batch requires more memory, the engine will try to allocate more, which may trigger the OOM exception.

There are two cases:

  1. A later batch makes the engine try to allocate more memory than is currently available on the system. In this case, try decreasing memory-related parameters such as cache_max_entry_count and max_prefill_token_num.
  2. After generation of a batch completes, other PyTorch functions you call allocate GPU memory; that memory is cached by PyTorch and is not reusable by the engine. In this case, call torch.cuda.empty_cache() to empty the cache before the next batch starts (see the sketch below).
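A minimal sketch of both suggestions, assuming the parameters above are set through TurbomindEngineConfig and reusing the batches list from the reproduction snippet; the values 0.5 and 4096 are illustrative, and save_outputs is a hypothetical helper that may allocate GPU memory via PyTorch:

import torch
from lmdeploy import pipeline, TurbomindEngineConfig

# Lower the k/v-cache memory ratio and the prefill chunk size so a larger
# later batch is less likely to push the engine past the available memory.
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.5,   # illustrative value
                                      max_prefill_token_num=4096)  # illustrative value
pipe = pipeline('the/path/of/internlm2/model', backend_config=engine_config)

for batch in batches:
    responses = pipe(batch)
    save_outputs(responses)       # hypothetical helper; may allocate GPU memory via PyTorch
    torch.cuda.empty_cache()      # release PyTorch's cached blocks before the next batch starts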
