Profiler tf native training #420

Open
wants to merge 104 commits into
base: master
104 commits
c674741
Merge pull request #1 from awslabs/master
sophiayue1116 Jan 20, 2021
37b70f0
add changes for enabling profiling for native tf2 training
sophiayue1116 Jan 5, 2021
bb67bb0
add changes for enabling profiling in the native tf2 training
sophiayue1116 Jan 5, 2021
32cb5fd
add tests
sophiayue1116 Jan 11, 2021
8e1d6f2
add python profiler as attr for kerashook
sophiayue1116 Jan 12, 2021
7f8b51e
add tests
sophiayue1116 Jan 13, 2021
ac60d3a
update profiler for native tf training
sophiayue1116 Jan 18, 2021
d40de5e
Modify distributed_training_utils.py import for TF 2.4 (#422)
NihalHarish Jan 13, 2021
7cde99e
remove print statement
sophiayue1116 Jan 19, 2021
3cb92b9
add changes for enabling profiling for native tf2 training
sophiayue1116 Jan 5, 2021
f1ee1e4
add changes for enabling profiling in the native tf2 training
sophiayue1116 Jan 5, 2021
bc0388d
add tests
sophiayue1116 Jan 11, 2021
d5d5835
add python profiler as attr for kerashook
sophiayue1116 Jan 12, 2021
2b44d54
add tests
sophiayue1116 Jan 13, 2021
d77b338
update profiler for native tf training
sophiayue1116 Jan 18, 2021
428cdd9
Modify distributed_training_utils.py import for TF 2.4 (#422)
NihalHarish Jan 13, 2021
f421f1d
Cache TF Versions (#421)
NihalHarish Jan 13, 2021
21c4b22
remove print statement
sophiayue1116 Jan 19, 2021
d11c4ac
clean up the code
sophiayue1116 Jan 20, 2021
4044d3a
clean up code
sophiayue1116 Jan 20, 2021
14f0275
update format
sophiayue1116 Jan 20, 2021
9494b98
Merge remote-tracking branch 'upstream/master'
sophiayue1116 Jan 22, 2021
794126f
add changes for enabling profiling for native tf2 training
sophiayue1116 Jan 5, 2021
4d1ddc0
add changes for enabling profiling in the native tf2 training
sophiayue1116 Jan 5, 2021
64ce546
add tests
sophiayue1116 Jan 11, 2021
7e9c830
add python profiler as attr for kerashook
sophiayue1116 Jan 12, 2021
5e2172a
add tests
sophiayue1116 Jan 13, 2021
d45c586
update profiler for native tf training
sophiayue1116 Jan 18, 2021
1cd9c7e
Modify distributed_training_utils.py import for TF 2.4 (#422)
NihalHarish Jan 13, 2021
e0da3aa
remove print statement
sophiayue1116 Jan 19, 2021
dbdf4a1
add changes for enabling profiling for native tf2 training
sophiayue1116 Jan 5, 2021
4e1fd39
add changes for enabling profiling in the native tf2 training
sophiayue1116 Jan 5, 2021
cf29f10
add tests
sophiayue1116 Jan 11, 2021
7e9af0d
add python profiler as attr for kerashook
sophiayue1116 Jan 12, 2021
1acfdb5
add tests
sophiayue1116 Jan 13, 2021
68f6838
update profiler for native tf training
sophiayue1116 Jan 18, 2021
63ea8d3
Modify distributed_training_utils.py import for TF 2.4 (#422)
NihalHarish Jan 13, 2021
c07b169
Cache TF Versions (#421)
NihalHarish Jan 13, 2021
11af410
remove print statement
sophiayue1116 Jan 19, 2021
aa67756
clean up the code
sophiayue1116 Jan 20, 2021
6caa2b8
clean up code
sophiayue1116 Jan 20, 2021
ce8c450
update format
sophiayue1116 Jan 20, 2021
861ce68
revise on PR
sophiayue1116 Jan 22, 2021
f8a0a0b
revise on PR
sophiayue1116 Jan 22, 2021
6ea117c
update _on_any_mode_end() func for the posthookclose python profiling
sophiayue1116 Jan 22, 2021
a59d82b
rename the debugger native training flag and update the path join in …
sophiayue1116 Jan 22, 2021
94ea4f3
update format
sophiayue1116 Jan 25, 2021
28a7882
update the comments
sophiayue1116 Jan 25, 2021
015aca1
update comments
sophiayue1116 Jan 25, 2021
a35fae3
add docstring, update helper function names and improve the unit tests
sophiayue1116 Jan 26, 2021
d72cd1e
update docstring and function name
sophiayue1116 Jan 27, 2021
34a74e5
Merge branch 'master' into master
NihalHarish Jan 27, 2021
03a4d68
Merge remote-tracking branch 'upstream/master'
sophiayue1116 Jan 29, 2021
0fd68ac
Merge branch 'master' of https://github.com/sophiayue1116/sagemaker-d…
sophiayue1116 Jan 29, 2021
4a7b74b
add changes for enabling profiling for native tf2 training
sophiayue1116 Jan 5, 2021
440f0b1
add changes for enabling profiling in the native tf2 training
sophiayue1116 Jan 5, 2021
eae6a87
add tests
sophiayue1116 Jan 11, 2021
cc76995
add python profiler as attr for kerashook
sophiayue1116 Jan 12, 2021
80ff23a
add tests
sophiayue1116 Jan 13, 2021
544b3e7
update profiler for native tf training
sophiayue1116 Jan 18, 2021
31dfb8d
Modify distributed_training_utils.py import for TF 2.4 (#422)
NihalHarish Jan 13, 2021
0a06450
remove print statement
sophiayue1116 Jan 19, 2021
1f2a2d3
add changes for enabling profiling for native tf2 training
sophiayue1116 Jan 5, 2021
7fb9c5b
add changes for enabling profiling in the native tf2 training
sophiayue1116 Jan 5, 2021
21699de
add tests
sophiayue1116 Jan 11, 2021
8241846
add python profiler as attr for kerashook
sophiayue1116 Jan 12, 2021
92c5749
add tests
sophiayue1116 Jan 13, 2021
28509f4
update profiler for native tf training
sophiayue1116 Jan 18, 2021
b9bf0c2
Modify distributed_training_utils.py import for TF 2.4 (#422)
NihalHarish Jan 13, 2021
19e2a9c
Cache TF Versions (#421)
NihalHarish Jan 13, 2021
a636c38
remove print statement
sophiayue1116 Jan 19, 2021
1781732
clean up the code
sophiayue1116 Jan 20, 2021
4b81d12
clean up code
sophiayue1116 Jan 20, 2021
375ad1d
update format
sophiayue1116 Jan 20, 2021
7dcd401
revise on PR
sophiayue1116 Jan 22, 2021
ae97c0d
add changes for enabling profiling for native tf2 training
sophiayue1116 Jan 5, 2021
3a35cfb
add changes for enabling profiling in the native tf2 training
sophiayue1116 Jan 5, 2021
81234de
add tests
sophiayue1116 Jan 11, 2021
77bb197
add python profiler as attr for kerashook
sophiayue1116 Jan 12, 2021
72b04e3
add tests
sophiayue1116 Jan 13, 2021
eb6477d
update profiler for native tf training
sophiayue1116 Jan 18, 2021
4b6450b
Modify distributed_training_utils.py import for TF 2.4 (#422)
NihalHarish Jan 13, 2021
4b7644d
remove print statement
sophiayue1116 Jan 19, 2021
f1d6486
add changes for enabling profiling for native tf2 training
sophiayue1116 Jan 5, 2021
865a740
add changes for enabling profiling in the native tf2 training
sophiayue1116 Jan 5, 2021
cc8d4e7
add tests
sophiayue1116 Jan 11, 2021
d7c4a10
add python profiler as attr for kerashook
sophiayue1116 Jan 12, 2021
24704ca
add tests
sophiayue1116 Jan 13, 2021
e3fbf91
update profiler for native tf training
sophiayue1116 Jan 18, 2021
1a1029c
Modify distributed_training_utils.py import for TF 2.4 (#422)
NihalHarish Jan 13, 2021
7fbf934
Cache TF Versions (#421)
NihalHarish Jan 13, 2021
61d0fa3
remove print statement
sophiayue1116 Jan 19, 2021
797de1b
clean up the code
sophiayue1116 Jan 20, 2021
2462ec8
clean up code
sophiayue1116 Jan 20, 2021
ed694bd
update format
sophiayue1116 Jan 20, 2021
c8d47c9
update _on_any_mode_end() func for the posthookclose python profiling
sophiayue1116 Jan 22, 2021
e0df694
rename the debugger native training flag and update the path join in …
sophiayue1116 Jan 22, 2021
defa97f
update format
sophiayue1116 Jan 25, 2021
ff382d3
update the comments
sophiayue1116 Jan 25, 2021
d86e040
update comments
sophiayue1116 Jan 25, 2021
d842f8a
add docstring, update helper function names and improve the unit tests
sophiayue1116 Jan 26, 2021
499d1a3
update docstring and function name
sophiayue1116 Jan 27, 2021
e4aa05e
update unit tests
sophiayue1116 Jan 29, 2021
f704e48
update tests
sophiayue1116 Jan 29, 2021
9 changes: 5 additions & 4 deletions smdebug/core/hook.py
@@ -558,18 +558,19 @@ def _cleanup(self):
         if self.first_process is True:
             remove_claim_file(self.out_dir)
 
-    def _increment_step(self):
+    def _increment_step(self, write_state=True):
         # Update the last_state to the last step number that was saved or seen
-        self._write_state()
+        if write_state:
+            self._write_state()
+            self.written_tensor_name_for_step.clear()
+            self._collections_to_save_for_step = None
 
         self.step += 1
         self.mode_steps[self.mode] += 1
-        self.written_tensor_name_for_step.clear()
 
         # Increment Global step number irrespective of what mode it is
         if self.mode != ModeKeys.GLOBAL:
             self.mode_steps[ModeKeys.GLOBAL] = self.step
-        self._collections_to_save_for_step = None
 
     # Called in the internal AWS codebase to determine
     # if a particular tensor value should be saved
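
To illustrate the effect of the new `write_state` parameter on `_increment_step`, here is a minimal sketch. `StepCounter`, its `state_writes` counter, and the two-member `ModeKeys` enum are hypothetical stand-ins written for this example, not the real smdebug classes, and the enum values are arbitrary. It only models what the diff makes certain: `_write_state()` is called when `write_state` is true, while the step counters advance either way.

```python
from enum import Enum


class ModeKeys(Enum):
    # Stand-in for smdebug's ModeKeys; only the members used below.
    TRAIN = 1
    GLOBAL = 4


class StepCounter:
    """Hypothetical stand-in for the hook, reduced to step bookkeeping."""

    def __init__(self):
        self.step = 0
        self.mode = ModeKeys.TRAIN
        self.mode_steps = {ModeKeys.TRAIN: 0, ModeKeys.GLOBAL: 0}
        self.state_writes = 0  # number of times _write_state() was called

    def _write_state(self):
        # The real hook persists the last saved step; here we only count calls.
        self.state_writes += 1

    def _increment_step(self, write_state=True):
        # Persist state only when the caller asks for it.
        if write_state:
            self._write_state()

        # The counters advance regardless of write_state.
        self.step += 1
        self.mode_steps[self.mode] += 1

        # Keep the global step in sync with the overall step count.
        if self.mode != ModeKeys.GLOBAL:
            self.mode_steps[ModeKeys.GLOBAL] = self.step


counter = StepCounter()
counter._increment_step()                   # default: state is written
counter._increment_step(write_state=False)  # step advances, no state write
print(counter.step, counter.state_writes)
```

Because the parameter defaults to `True`, existing callers that pass no argument still write state on every step; only callers that explicitly pass `write_state=False` skip the write.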