
add tensor lazy allocate and recycle strategy to comp replay #176

Open
Wants to merge 4 commits into main

Conversation

shengfukevin (Contributor)

Summary

add tensor lazy allocate and recycle strategy to comp replay

Test Plan

mpirun -np 2 et_replay.par --trace-path param_bench/fb/integration_tests/resnet-2gpu

facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Aug 28, 2024
facebook-github-bot (Contributor):

@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

shengfukevin (Contributor, Author) left a comment:

Thanks for adding support for lazy tensor allocation and storage tensor recycling. Please check my inline comments.

I suggest skipping this feature for now, since it may need more work to make it robust. I did not run into out-of-memory issues with the current version.

@@ -368,7 +368,7 @@ def build_torchscript_func(n):
if (
n.op_schema == ""
or n.name == "aten::record_stream"
or n.name.startswith("aten::_foreach")
#or n.name.startswith("aten::_foreach")
shengfukevin (Author):

@TaekyungHeo, why is this line commented out?

# The tensor with the same storage id may be located on different devices.
self.tensor_storage_map: Dict[int, []] = defaultdict(set)
self.tensor_storage_map: Dict[int, Dict[torch.device, torch.Tensor]] = {}
self.tensor_alloc_set = set()
shengfukevin (Author):

Looks like tensor_alloc_set is populated but never referenced. Shall we clean it up?

@@ -492,9 +487,64 @@ def add_unique_tensor(node_name, node_id, t_id, shape, input, device=-1):
if t_id in self.input_tensor_ids:
output_set.add(self.tensors_mapping[(node.id, t_id, False)])

def allocate_tensors(self):
start_ns = time.time_ns()
def get_tensor_from_storage(self, node, storage_id, data_offset, elem_bytes, device, shape, data_type):
shengfukevin (Author):

The tensor stride is missing. Please refer to the code in the main branch to fix it.
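For reference, a minimal sketch of one way to apply the recorded stride when materializing a tensor from its storage. The names storage_buf, shape, strides, and data_offset are placeholders for values get_tensor_from_storage already has, and data_offset is assumed to be expressed in elements:

import torch

def strided_view(storage_buf: torch.Tensor, shape, strides, data_offset: int) -> torch.Tensor:
    # Without the stride, a non-contiguous input would be replayed as if it
    # were contiguous; as_strided reproduces the original memory layout.
    return torch.as_strided(storage_buf, size=shape, stride=strides, storage_offset=data_offset)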

@@ -519,11 +569,16 @@ def allocate_tensors(self):
)
tensor_strides = node.get_input_tensor_strides()
for idx, (data_type, t_id, shape) in enumerate(get_input_tensors(node)):
device = self.device
tensor_id, storage_id, storage_offset, element_num, item_size, device_str = t_id
shengfukevin (Author):

This may fail; t_id may not include device_str.
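One possible way to tolerate both layouts (a sketch only; the 6-field tuple order is taken from the line above, and parse_t_id is a hypothetical helper):

import torch

def parse_t_id(t_id, default_device: torch.device):
    # Unpack a trace tensor id that may or may not carry a device string.
    if len(t_id) >= 6:
        tensor_id, storage_id, storage_offset, element_num, item_size, device_str = t_id[:6]
        device = torch.device(device_str)
    else:
        tensor_id, storage_id, storage_offset, element_num, item_size = t_id[:5]
        device = default_device  # older traces omit the device string
    return tensor_id, storage_id, storage_offset, element_num, item_size, device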


self.tensor_storage_sizes[storage_id] = max(storage_offset + element_num * item_size, self.tensor_storage_sizes[storage_id])

self.tensor_registry_permanent[replay_t_id] = ("lazy_alloc", (storage_id, device),
shengfukevin (Author):

This code hard-codes the tensor allocation to "lazy_alloc". I suggest making it optional, since lazy_alloc may affect performance: if tensors are allocated ahead of time, we usually do not include allocation time in the benchmark, but with lazy_alloc the allocation time is always included in the measured benchmark time.
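A rough sketch of gating it behind a flag; self.args.lazy_alloc_tensor is hypothetical here (it corresponds to the flag suggested further down), and the eager branch simply reuses get_tensor_from_storage outside the timed loop:

if self.args.lazy_alloc_tensor:
    # Defer materialization: store a thunk that builds the tensor on first use.
    # Default arguments pin the current loop variables to avoid late binding.
    self.tensor_registry_permanent[replay_t_id] = (
        "lazy_alloc",
        (storage_id, device),
        lambda node, s=storage_id, o=storage_offset, n=item_size, d=device, sh=shape, dt=data_type:
            self.get_tensor_from_storage(node, s, o, n, d, sh, dt),
    )
else:
    # Eager path: allocate now, so allocation time stays out of the benchmark window.
    self.tensor_registry_permanent[replay_t_id] = self.get_tensor_from_storage(
        node, storage_id, storage_offset, item_size, device, shape, data_type
    )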

def lazy_alloc_tensors(self, inputs, node):
for i in range(len(inputs)):
if isinstance(inputs[i], tuple) and inputs[i][0] == "lazy_alloc":
inputs[i] = inputs[i][2](node)
shengfukevin (Author):

lazy_alloc should only run once, and self.tensor_registry should be updated to reuse the allocated tensor on subsequent accesses. This code seems to re-allocate the tensor on every access.
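Something along these lines could materialize each lazy tensor only once (a sketch; replay_t_ids is a hypothetical argument mapping each input position to its registry key):

def lazy_alloc_tensors(self, inputs, node, replay_t_ids):
    for i in range(len(inputs)):
        if isinstance(inputs[i], tuple) and inputs[i][0] == "lazy_alloc":
            tensor = inputs[i][2](node)                     # allocate on first access only
            inputs[i] = tensor
            self.tensor_registry[replay_t_ids[i]] = tensor  # later lookups reuse it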

del self.tensor_registry[replay_t_id]
need_del_replay_t_ids_in_input.add(replay_t_id)
elif replay_t_id in self.instantiate and device is not None:
self.recycle_instantiate_tensors(node.id, storage_id, device)
shengfukevin (Author):

I think reference counting is overkill for recycling the tensors in self.instantiate. Why not use the same idea as self.replay_tensor_id_to_last_node_id_map? We can add a storage_tensor_id_to_last_node_id_map and populate it in add_unique_tensor.
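A hedged sketch of that simpler bookkeeping (the names mirror the existing replay_tensor_id_to_last_node_id_map pattern but are otherwise assumptions, not code from this PR):

# In add_unique_tensor: remember the last node id that touches each storage.
storage_id = t_id[1]
self.storage_tensor_id_to_last_node_id_map[storage_id] = max(
    self.storage_tensor_id_to_last_node_id_map.get(storage_id, -1), node_id
)

# After replaying a node: free the storage if that node was its last user.
if self.storage_tensor_id_to_last_node_id_map.get(storage_id) == node.id:
    self.tensor_storage_map[storage_id].pop(device, None)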

@@ -1579,6 +1618,12 @@ def readComputeArgs(self, check_args: bool = True):
default=True,
help="when a et_id is being replayed multiple times, setting this to false will use temsors from previous runs.",
)
parser.add_argument(
"--recycle_storages",
shengfukevin (Author):

Maybe rename it to "recycle_tensor_storage"? Also, if this is true, we need to allocate the storage tensors again when the next iteration starts.

We also need another flag, "--lazy-alloc-tensor".
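For example (the flag names, defaults, and help text below are suggestions, not what the PR currently implements):

parser.add_argument(
    "--recycle_tensor_storage",
    action="store_true",
    default=False,
    help="free a tensor's backing storage after its last use in an iteration; "
    "storages are re-allocated at the start of the next iteration.",
)
parser.add_argument(
    "--lazy-alloc-tensor",
    action="store_true",
    default=False,
    help="allocate input tensors lazily on first access instead of ahead of time "
    "(allocation time is then included in the measured replay time).",
)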
