
[Feature] Automatically Free pipeline of Prompts #3095

Open
richardjonker2000 opened this issue Jan 27, 2025 · 1 comment
richardjonker2000 commented Jan 27, 2025

Motivation

Memory Issue When Running Multiple Batches of Prompts

Description

Similar to the known issue regarding memory freeing (PR #3069), I am encountering out-of-memory (OOM) problems when processing multiple batches of prompts.

My goal is to process a large number of prompts (on the order of thousands) while saving outputs every 500 prompts. To achieve this, I split the prompts into batches. However, due to memory not being freed properly, running subsequent batches leads to OOM errors.

Additionally, while the upcoming pipe.close() feature appears to free the model from VRAM, it would require reloading the model for each batch, which may not be optimal in this scenario.
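For context, a rough sketch of the reload-per-batch workaround this request is trying to avoid, assuming pipe.close() behaves as described above (the model path and prompts are placeholders):

from lmdeploy import pipeline

model = 'the/path/of/internlm2/model'
batch_size = 500
batches = [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

for batch in batches:
    pipe = pipeline(model)   # full model load on every batch
    response = pipe(batch)
    # ... save the outputs of this batch ...
    pipe.close()             # frees VRAM, but the next iteration pays the load cost again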

Reproduction

from lmdeploy import pipeline

model = 'the/path/of/internlm2/model'

batch_size = 500
batches = [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]  # Split into batches of 500
pipe = pipeline(model)

for batch in batches:
    response = pipe(batch)
    # Save/output the results of this batch

Expected Behavior

  • Memory should be properly freed after processing each batch to allow continuous execution.
  • Ideally, there should be a way to clear intermediate memory usage without fully unloading the model (as pipe.close() might do).

Related resources

No response

Additional context

No response

lzhangzz (Collaborator) commented Feb 1, 2025

Memory allocated during inference is reused by the engine, so no additional deallocation is needed. However, if a later batch requires more memory, the engine will try to allocate more, which may trigger the OOM exception.

There are two cases:

  1. A later batch makes the engine try to allocate more memory than is currently available on the system. In this case, try decreasing memory-related parameters such as cache_max_entry_count and max_prefill_token_num.
  2. After generation of a batch completes, other PyTorch functions you call allocate GPU memory; that memory is cached by PyTorch and is not reusable by the engine. In this case, call torch.cuda.empty_cache() to empty the cache before the next batch starts (see the sketch below).
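A minimal sketch of both suggestions, assuming the parameters above are set through TurbomindEngineConfig and reusing the batches list from the reproduction snippet; the values 0.5 and 4096 are illustrative, and save_outputs is a hypothetical helper that may allocate GPU memory via PyTorch:

import torch
from lmdeploy import pipeline, TurbomindEngineConfig

# Lower the k/v-cache memory ratio and the prefill chunk size so a larger
# later batch is less likely to push the engine past the available memory.
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.5,   # illustrative value
                                      max_prefill_token_num=4096)  # illustrative value
pipe = pipeline('the/path/of/internlm2/model', backend_config=engine_config)

for batch in batches:
    responses = pipe(batch)
    save_outputs(responses)       # hypothetical helper; may allocate GPU memory via PyTorch
    torch.cuda.empty_cache()      # release PyTorch's cached blocks before the next batch starts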
