Add tensor lazy allocation and recycle strategy to comp replay #176
base: main
Conversation
…pComms to function generate_io_tensors
… and invoke replaySingle to replay every node in function replayTrace
@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks for adding support for lazy tensor allocation and storage tensor recycling. Please check my inline comments.
I suggest skipping this feature for now, since it may need more work to make it robust. I did not run into out-of-memory issues with the current version.
@@ -368,7 +368,7 @@ def build_torchscript_func(n):
    if (
        n.op_schema == ""
        or n.name == "aten::record_stream"
        or n.name.startswith("aten::_foreach")
        #or n.name.startswith("aten::_foreach")
@TaekyungHeo, why is this line commented out?
# The tensor with the same storage id may located on different devices.
self.tensor_storage_map: Dict[int, []] = defaultdict(set)
self.tensor_storage_map: Dict[int, Dict[torch.device, torch.Tensor]] = {}
self.tensor_alloc_set = set()
Looks like tensor_alloc_set is populated but never referenced. Shall we clean it up?
@@ -492,9 +487,64 @@ def add_unique_tensor(node_name, node_id, t_id, shape, input, device=-1):
        if t_id in self.input_tensor_ids:
            output_set.add(self.tensors_mapping[(node.id, t_id, False)])

    def allocate_tensors(self):
        start_ns = time.time_ns()
    def get_tensor_from_storage(self, node, storage_id, data_offset, elem_bytes, device, shape, data_type):
The tensor stride is missing. Please refer to the code in the main branch to fix it.
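For reference, a strided view over a flat storage buffer can be built roughly like this (a minimal sketch, not the main-branch code; the helper name and the uint8-buffer assumption are illustrative):

import torch

def view_from_storage(storage_buf: torch.Tensor, shape, stride, storage_offset, dtype):
    # storage_buf is assumed to be a 1-D uint8 buffer backing one logical storage.
    # Reinterpret the raw bytes as the target dtype, then build the strided view
    # at the recorded element offset so non-contiguous inputs are reproduced.
    typed = storage_buf.view(dtype)
    return typed.as_strided(shape, stride, storage_offset)

# Example: a 4x8 float32 view at element offset 0 inside a 1 KiB buffer.
buf = torch.zeros(1024, dtype=torch.uint8)
t = view_from_storage(buf, (4, 8), (8, 1), 0, torch.float32)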
@@ -519,11 +569,16 @@ def allocate_tensors(self):
            )
            tensor_strides = node.get_input_tensor_strides()
            for idx, (data_type, t_id, shape) in enumerate(get_input_tensors(node)):
                device = self.device
                tensor_id, storage_id, storage_offset, element_num, item_size, device_str = t_id
This may fail; t_id may not have device_str.
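One way to guard against that (a sketch only; the length check and fallback are assumptions about the trace format, not the PR's code):

def unpack_tensor_id(t_id):
    # Older traces may encode only five fields; newer ones append a device string.
    if len(t_id) >= 6:
        tensor_id, storage_id, storage_offset, element_num, item_size, device_str = t_id[:6]
    else:
        tensor_id, storage_id, storage_offset, element_num, item_size = t_id
        device_str = None  # fall back to the replay default device
    return tensor_id, storage_id, storage_offset, element_num, item_size, device_str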
self.tensor_storage_sizes[storage_id] = max(storage_offset + element_num * item_size, self.tensor_storage_sizes[storage_id])

self.tensor_registry_permanent[replay_t_id] = ("lazy_alloc", (storage_id, device),
This code hard-codes the tensor allocation to "lazy_alloc". I suggest making it optional, since lazy_alloc may affect performance. If tensors are allocated ahead of time, we usually do not include allocation time in the benchmark. With lazy_alloc, however, the tensor allocation time is always included in the benchmark time.
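Something along these lines would keep both modes available (a minimal sketch with illustrative names; register_input_tensor and the registry layout here are not the PR's actual API):

import torch

def register_input_tensor(registry, replay_t_id, storage_id, device, shape, dtype, lazy_alloc):
    if lazy_alloc:
        # Deferred: store a recipe that is materialized on first access,
        # so allocation cost shows up inside the measured replay time.
        registry[replay_t_id] = ("lazy_alloc", (storage_id, device),
                                 lambda node=None: torch.empty(shape, dtype=dtype, device=device))
    else:
        # Eager: allocate up front, before the timed replay loop starts.
        registry[replay_t_id] = torch.empty(shape, dtype=dtype, device=device)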
def lazy_alloc_tensors(self, inputs, node):
    for i in range(len(inputs)):
        if isinstance(inputs[i], tuple) and inputs[i][0] == "lazy_alloc":
            inputs[i] = inputs[i][2](node)
lazy_alloc should only run once, and self.tensor_registry should be updated to use the allocated tensor next time. This code seems to allocate the tensor on each access.
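For example, something like this would materialize the placeholder once and reuse it afterwards (sketch only; it assumes the placeholder also carries its replay_t_id, which is not how the PR currently stores it):

def lazy_alloc_tensors(tensor_registry, inputs, node):
    # Assumed placeholder layout: ("lazy_alloc", replay_t_id, alloc_fn).
    for i, entry in enumerate(inputs):
        if isinstance(entry, tuple) and len(entry) == 3 and entry[0] == "lazy_alloc":
            _, replay_t_id, alloc_fn = entry
            cached = tensor_registry.get(replay_t_id)
            if cached is None or isinstance(cached, tuple):
                # First access: allocate once and publish the real tensor.
                cached = alloc_fn(node)
                tensor_registry[replay_t_id] = cached
            inputs[i] = cached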
                del self.tensor_registry[replay_t_id]
                need_del_replay_t_ids_in_input.add(replay_t_id)
            elif replay_t_id in self.instantiate and device is not None:
                self.recycle_instantiate_tensors(node.id, storage_id, device)
I think reference counting is overkill for recycling the tensors in self.instantiate. Why not just reuse the same idea as self.replay_tensor_id_to_last_node_id_map? We can add a storage_tensor_id_to_last_node_id_map and populate it in add_unique_tensor.
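Roughly what I have in mind (a sketch; the map name follows the suggestion above, but the helper functions and surrounding structure are assumed, not the PR's code):

def record_last_use(storage_tensor_id_to_last_node_id_map, storage_id, device, node_id):
    # Populated from add_unique_tensor: remember the last node that touches
    # each (storage_id, device) pair.
    key = (storage_id, device)
    prev = storage_tensor_id_to_last_node_id_map.get(key, -1)
    storage_tensor_id_to_last_node_id_map[key] = max(prev, node_id)

def maybe_recycle_storage(storage_tensor_id_to_last_node_id_map, storage_registry, storage_id, device, node_id):
    # After replaying node_id, drop the backing storage if no later node needs it.
    key = (storage_id, device)
    if storage_tensor_id_to_last_node_id_map.get(key) == node_id:
        storage_registry.pop(key, None)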
@@ -1579,6 +1618,12 @@ def readComputeArgs(self, check_args: bool = True):
            default=True,
            help="when a et_id is being replayed multiple times, setting this to false will use temsors from previous runs.",
        )
        parser.add_argument(
            "--recycle_storages",
Maybe change it to "recycle_tensor_storage"? Also, if this is true, we need to allocate the storage tensors again when the next iteration starts.
We also need another flag, "--lazy-alloc-tensor".
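For example (a sketch of the proposed flags; the names and defaults are this suggestion, not what the PR implements):

import argparse

parser = argparse.ArgumentParser()  # stands in for the parser used in readComputeArgs
parser.add_argument(
    "--recycle_tensor_storage",
    action="store_true",
    default=False,
    help="free storage tensors after their last use in an iteration; "
         "they are re-allocated when the next iteration starts.",
)
parser.add_argument(
    "--lazy-alloc-tensor",
    action="store_true",
    default=False,
    help="allocate input tensors on first use instead of ahead of time "
         "(allocation time is then counted in the measured replay time).",
)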
Summary
Add tensor lazy allocation and recycle strategy to comp replay.
Test Plan
mpirun -np 2 et_replay.par --trace-path param_bench/fb/integration_tests/resnet-2gpu