Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Graph] Add initial draft of malloc/free nodes ... #352

Draft
wants to merge 4 commits into
base: sycl
Choose a base branch
from
Draft
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
108 changes: 71 additions & 37 deletions sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -1691,45 +1691,79 @@ Exceptions:

|===

=== Features Still in Development

==== Memory Allocation Nodes

There is no provided interface for users to define a USM allocation/free
operation belonging to the scope of the graph. It would be error prone and
non-performant to allocate or free memory as a node executed during graph
submission. Instead, such a memory allocation API needs to provide a way to
return a pointer which won't be valid until the allocation is made on graph
finalization, as allocating at finalization is the only way to benefit from
the known graph scope for optimal memory allocation, and even optimize to
eliminate some allocations entirely.

Such a deferred allocation strategy presents challenges however, and as a result
we recommend instead that prior to graph construction users perform core SYCL
USM allocations to be used in the graph submission. Before to coming to this
recommendation we considered the following explicit graph building interfaces
for adding a memory allocation owned by the graph:

1. Allocation function returning a reference to the raw pointer, i.e. `void*&`,
which will be instantiated on graph finalization with the location of the
allocated USM memory.

2. Allocation function returning a handle to the allocation. Applications use
the handle in node command-group functions to access memory when allocated.

3. Allocation function returning a pointer to a virtual allocation, only backed
with an actual allocation when graph is finalized or submitted.

Design 1) has the drawback of forcing users to keep the user pointer variable
alive so that the reference is valid, which is unintuitive and is likely to
result in bugs.

Design 2) introduces a handle object which has the advantages of being a less
error prone way to provide the pointer to the deferred allocation. However, it
requires kernel changes and introduces an overhead above the raw pointers that
are the advantage of USM.

Design 3) needs specific backend support for deferred allocation.
Support depends on the availablity of backend support for deferred allocation:
link:../experimental/sycl_ext_oneapi_virtual_mem.asciidoc[sycl_ext_oneapi_virtual_mem]

The following interfaces enables users to define a memory allocation/free operation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there another plan/draft to record the memory allocation/free with record&replay APIs?

The current functions such as sycl::malloc_device are not recorded into a graph with record&replay APIs, we may need new async functions such as sycl::malloc_device_async

belonging to the scope of the graph. It would be error prone and non-performant
to allocate or free memory as a node executed during graph submission. Instead,
such a memory allocation API needs to provide a way to return a pointer which
won't be valid until the allocation is made on graph finalization, as allocating
at finalization is the only way to benefit from the known graph scope for optimal
memory allocation, and even optimize to eliminate some allocations entirely.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think people will have the question "if I have two executable graphs created from the same modifiable graph, will that node be different memory allocations?", so we should state something about that.

Table {counter: tableNumber}. Member functions of the `command_graph` class (memory allocation).
[cols="2a,a"]
|===
|Member function|Description

|
[source, c++]
----
std::pair<void*,node>
EwanC marked this conversation as resolved.
Show resolved Hide resolved
add_malloc_device(size_t num_bytes, const property_list& propList = {});
Comment on lines +1734 to +1735
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have this templated so that users can call this similarly to sycl::malloc_device<T>? Just saves the users having to do a cast after the call.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will add. My only reason to start with a single version of this API was not having to update too many functions, in case we change the basic design of this interface.

----


|
Returns a pair of a pointer to memory and a node. The pointer is allocated on the `device`
that is associated with current graph by first execution of the `command_graph`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first execution of the command_graph

Question about whether we want to tighten this more to say something like "allocated during the finalization of command-graph to the returned executable graph".

The current wording leaves open the suggestion that if you never execute the graph nothing will be allocated, but we probably will do the allocation on finalization - which means less latency when submitting the graph.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about "may be allocated during the finalization ... "? That way we can keep our options, depending on the backend.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we align with line 1723 here?

in line 1723, it says allocating at finalization is the only way

All nodes that depend on this node and are thereby executed after have access to the allocated memory.
The allocation size is specified in bytes.

Preconditions:

* This member function is only available when the `command_graph` state is
`graph_state::modifiable`.

Parameters:

* `num_bytes` - allocation size in bytes.

* `propList` - Zero or more properties can be provided to the constructed node
via an instance of `property_list`. The `property::node::depends_on` property
can be passed here with a list of nodes to create dependency edges on.

Exceptions:

* Throws synchronously with error code `feature_not_supported` if any device associated
with the command graph does not have `aspect::usm_device_allocations`.
|

[source, c++]
----
node
add_free(void* ptr, const property_list& propList = {});
----


|
Returns a free node that has been added to the graph. Accesses of nodes that depend of this node
(predecessors) to the allocated memory are undefined behavior.

Parameters:

* `ptr` - memory pointed to by. Must be allocated by `add_malloc_device`.

* `propList` - Zero or more properties can be provided to the constructed node
via an instance of `property_list`. The `property::node::depends_on` property
can be passed here with a list of nodes to create dependency edges on.

|===

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up that once intel#12366 merges i think we'll need to extend the node_type enum to include malloc_device and free

=== Features Still in Development

reble marked this conversation as resolved.
Show resolved Hide resolved
==== Device Specific Graph

Expand Down
Loading