
Removed functions iterating over tensors from torch compilation process #224

Open: wants to merge 1 commit into habana-main

Conversation


@jczaja jczaja commented Sep 25, 2024

Problem:
Recent changes in the dependencies of the tgi-gaudi project eliminated some torch.compile graph breaks. This made some torch-compiled graphs much larger and more memory-consuming, which for some models could lead to device out-of-memory (OOM) errors.

Solution:
The torch-compiled graphs that caused the device OOM behaviour contained loops processing large numbers of tensors. The functions containing those loops were excluded from the torch compilation process.

@@ -92,7 +92,6 @@ def biggest_single_chunk(offset):
        return 0


-@torch_compile_for_eager
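The diff above drops the compile decorator from a function that loops over tensors. A minimal sketch of the effect, where `torch_compile_for_eager` is a hypothetical stand-in for the project's decorator (the real one lives in tgi-gaudi) and `sum_chunks` is an illustrative example, not the actual code:

```python
import torch

# Hypothetical stand-in for the project's decorator; assumed to wrap
# torch.compile (the "eager" debug backend keeps this sketch cheap).
def torch_compile_for_eager(fn):
    return torch.compile(fn, backend="eager")

# Before this PR the function carried the decorator; Dynamo unrolls the
# Python loop, so graph size grows with len(tensors). With the decorator
# removed (as in the PR), the function simply runs eagerly.
# @torch_compile_for_eager   # removed, as in the PR
def sum_chunks(tensors):
    total = torch.zeros_like(tensors[0])
    for t in tensors:  # plain eager loop, no large captured graph
        total = total + t
    return total

chunks = [torch.ones(4) for _ in range(8)]
result = sum_chunks(chunks)
```

The trade-off is that the excluded ops no longer benefit from compilation at all, which is exactly what the review discussion below probes.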


Why not use torch._dynamo.graph_break() in place of mark_step, instead of removing compilation of the graph entirely?
The same question applies to all the cases below.
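The reviewer's alternative would keep the function compiled but split the capture at an explicit break point. A hedged sketch under the same assumptions as above (`process` and the loop body are illustrative, not tgi-gaudi code):

```python
import torch

@torch.compile(backend="eager")  # "eager" debug backend keeps the sketch cheap
def process(tensors):
    out = torch.zeros_like(tensors[0])
    # Explicit break: Dynamo ends the current graph here, so the loop
    # below is traced into separate, smaller graphs rather than being
    # unrolled into one huge graph with the preceding ops.
    torch._dynamo.graph_break()
    for t in tensors:
        out = out + t
    return out

result = process([torch.ones(3), torch.ones(3)])
```

Whether this actually bounds graph size depends on what Dynamo captures after the break; the PR instead opts out of compilation for these functions altogether.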


Does the change cause the ops within the function to be executed eagerly?

Author

@jczaja jczaja Sep 26, 2024


> Does the change cause the ops within the function to be executed eagerly?

I assume (I have not run logging of fallback-to-eager events) that the functions excluded from torch.compile regions (as done in this PR) now run eagerly, i.e. PyTorch ops in code whose torch.compile decorator was removed are executed eagerly.

Author


Some internal testing revealed that excluding these functions from the torch.compile region (as in this PR) had no impact on performance or accuracy.

@mandy-li mandy-li requested review from schoi-habana and removed request for bsochack October 1, 2024 17:59
Collaborator

@schoi-habana schoi-habana left a comment


lgtm
