
Support in-place Concat when input is a captured value #399

Closed

robertknight opened this issue Nov 2, 2024 · 3 comments
Labels
performance Issues that affect model inference or loading performance

Comments

robertknight (Owner) commented Nov 2, 2024

#239 added an optimization that reduces the cost of KV-cache extension by a Concat op from O(sequence_length) to O(1). This optimization currently does not work if the input KV-cache buffer is a captured value in a subgraph. This is because captured values are exposed to subgraphs as immutable views, even if the value was passed to Model::run as an owned value. For the optimization to work, Concat needs to be run as an in-place operation which requires the first input to be an owned value.
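The O(1) extension relies on the cache buffer having spare capacity so new entries can be appended without copying the existing contents. A minimal sketch of the idea, using a plain `Vec<f32>` as a stand-in for the cache tensor (the `KvCache` type and its methods here are illustrative, not the actual rten API):

```rust
/// Illustrative stand-in for a KV-cache buffer with spare capacity
/// reserved for in-place growth.
struct KvCache {
    data: Vec<f32>,
}

impl KvCache {
    /// Reserve `cap` elements up front so later appends need not reallocate.
    fn with_capacity(cap: usize) -> Self {
        KvCache { data: Vec::with_capacity(cap) }
    }

    /// Append new key/value entries. While spare capacity remains, this is
    /// O(new.len()): the existing contents are not copied, i.e. the cost is
    /// O(1) in the current sequence length. This is only possible because
    /// we own the buffer mutably -- an immutable view (as captured values
    /// currently are in subgraphs) would force a full copy instead.
    fn extend(&mut self, new: &[f32]) {
        self.data.extend_from_slice(new);
    }
}
```

The Concat optimization follows the same pattern: if the first input is an owned buffer with room to grow, the op appends in place rather than allocating and copying.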

Solving the general problem that captured values cannot be used in in-place operations would help in other ways too. For example, some models try to reshape captured values, and this reshape currently copies the data unnecessarily.

Conceptually, what I think is needed is the ability for subgraph capture environments (CaptureEnv) to capture values by value rather than by reference when those values will not be used outside the subgraph.
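A minimal sketch of that idea: a capture environment whose entries are either borrowed views or owned values, where only owned entries can be taken out for in-place use. The types and method names below are hypothetical simplifications, not rten's actual `CaptureEnv`:

```rust
use std::collections::HashMap;

/// A captured value is either a borrowed view or an owned buffer.
enum Capture<'a> {
    /// Immutable view into the parent graph; cannot be used in-place.
    ByRef(&'a Vec<f32>),
    /// Owned value; eligible for in-place ops (e.g. Concat) in the subgraph.
    ByValue(Vec<f32>),
}

struct CaptureEnv<'a> {
    captures: HashMap<String, Capture<'a>>,
}

impl<'a> CaptureEnv<'a> {
    /// Take ownership of a capture if it was captured by value, allowing
    /// the subgraph to run in-place operations on it. By-ref captures are
    /// left in place and `None` is returned.
    fn take(&mut self, name: &str) -> Option<Vec<f32>> {
        match self.captures.remove(name)? {
            Capture::ByValue(v) => Some(v),
            Capture::ByRef(r) => {
                // Only borrowed: put the view back and report no ownership.
                self.captures.insert(name.to_string(), Capture::ByRef(r));
                None
            }
        }
    }
}
```

Under this scheme, the planner would choose `ByValue` only when no operator outside the subgraph still needs the value.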

@robertknight robertknight added the performance Issues that affect model inference or loading performance label Nov 2, 2024
robertknight (Owner) commented
While working on this I ran into an issue where in-place extension of the KV-cache fails for some caches in the Whisper example, because a KV-cache tensor (e.g. past_key_values.0.decoder.key) is an input to both a Concat op and a Shape op, and the Concat op gets scheduled first. The buffer can't be used in-place because the Shape op still needs it later. See also #98.

robertknight added a commit that referenced this issue Nov 12, 2024
After creating the initial execution plan by traversing the graph backwards from
the requested outputs, re-order it by traversing the plan forwards and choosing
a preferred operator to run from the frontier of runnable operators at each
step. The initial logic for picking the next operator is to choose the first
non-in-place operator if there is one (e.g. Shape), or the first operator
otherwise.

This re-ordering increases the likelihood that an operator which can run
in-place is able to do so, because its in-place input is less likely to be
required by a non in-place op that runs later.

Fixes #98
See also #399 (comment).
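The selection heuristic from the commit message above can be sketched as a small helper over the frontier of runnable operators. The `Op` type and `pick_next` function are illustrative assumptions, not rten's internals:

```rust
/// Minimal stand-in for a planned operator: all we need here is whether
/// it is able to run in-place on its first input.
#[derive(Clone, Copy, Debug)]
struct Op {
    id: usize,
    can_run_in_place: bool,
}

/// From the frontier of currently-runnable operators, pick the first
/// operator that cannot run in-place (e.g. Shape), or the first operator
/// otherwise. Running non-in-place readers first makes it more likely that
/// an in-place op's input buffer is no longer needed when it runs.
fn pick_next(frontier: &[Op]) -> Option<usize> {
    if frontier.is_empty() {
        return None;
    }
    Some(
        frontier
            .iter()
            .position(|op| !op.can_run_in_place)
            .unwrap_or(0),
    )
}
```

In the Whisper case above, this scheduling picks Shape ahead of Concat, so the Concat op is then free to extend the cache buffer in place.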
robertknight added a commit that referenced this issue Nov 13, 2024 (same message as the Nov 12 commit above)
robertknight (Owner) commented
Another complication with Hugging Face models exported by Optimum: the KV-cache buffers initially reserved by rten-generate are not used when running the decoder_model_merged model. The first run of the model takes an alternate branch which does not use the past_key_values.{layer}.decoder.{key, value} inputs; instead, new tensors are allocated without spare capacity reserved for in-place growth.

robertknight added a commit that referenced this issue Nov 14, 2024
When preparing the capture environment to pass to an operator with subgraphs,
if a captured value is not going to be needed by subsequent operators in the
current graph then it can be passed by value rather than by reference. In
the subgraph, this allows the value to be (potentially) used as an in-place
input.

 - Pass `CaptureEnv` to `Graph::run_subgraph` by value rather than by reference

 - Add map of by-value captures to `CaptureEnv`

 - Add `CaptureEnv::child` method which creates a child environment with no
   captures of its own. This is not currently used but will be needed by loop
   operators which run a subgraph more than once and need to obtain a fresh
   capture environment for each iteration.

Part of #399.
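The decision described in the commit message, pass a capture by value only when no later operator in the current graph still consumes it, can be sketched as a tiny helper. The names (`CaptureMode`, `capture_mode`, `later_consumers`) are hypothetical, used only to illustrate the rule:

```rust
#[derive(Debug, PartialEq)]
enum CaptureMode {
    /// The value is still needed in the current graph: the subgraph only
    /// gets an immutable view.
    ByRef,
    /// No later operator needs the value: ownership can move into the
    /// subgraph, making it eligible for in-place ops like Concat.
    ByValue,
}

/// Decide how a value should be captured for a subgraph, given the IDs of
/// values that operators scheduled after this point still consume.
fn capture_mode(value_id: usize, later_consumers: &[usize]) -> CaptureMode {
    if later_consumers.contains(&value_id) {
        CaptureMode::ByRef
    } else {
        CaptureMode::ByValue
    }
}
```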
robertknight added a commit that referenced this issue Nov 14, 2024 (same message as the previous commit)
robertknight (Owner) commented
Resolved in #407. Some additional changes were required in rten-generate (#408) to take advantage of this in the Hugging Face models (see the Whisper and TrOCR examples) where the issue originally came up.
