Profiler tf native training #420

Open
wants to merge 104 commits into
base: master

@sophiayue1116 sophiayue1116 commented Jan 11, 2021

Description of changes:

This PR enables the profiler in TF2 native training (design doc: https://quip-amazon.com/v0MwAkTizZl9/Profiler-for-TensorFlow2-native-training). The corresponding integration tests for TF 2.2 and TF 2.3 passed.
TF2.2 integration test: https://console.aws.amazon.com/codesuite/codebuild/072677473360/projects/smprofiler_tf2_integration_tests/build/smprofiler_tf2_integration_tests%3A2bff3f63-b797-4c5e-9992-0fdf17f13bec?region=us-east-1
TF2.3 integration test: https://console.aws.amazon.com/codesuite/codebuild/072677473360/projects/smprofiler_tf_2_3_integration_tests/build/smprofiler_tf_2_3_integration_tests%3A801a9a02-30b5-4af6-9835-9b01a6ed6ce4/?region=us-east-1

The changes include:

  1. Added profiling_start_batch(), profiling_end_batch() and profiling_end() to keras.py to enable profiler functionality in the native training loop.
  2. Added python_profiler as an attribute of KerasHook, which is cleaner and makes Python profiling easier to test.
  3. Added is_profiler_native_training (default False) as an attribute of KerasHook to indicate that the profiler is enabled in TensorFlow 2 native training. It is used to distinguish the three use cases: debugger only, profiler only, and both debugger and profiler.
  4. Added _decrement_step() to decrease the step number when both the profiler and the debugger are enabled. In this case, the step number is first increased by 1 inside profiling_start_batch() and then decreased by 1 inside wrap_tape() before _wrap_tape_push() is called, so that the debugger code inside _wrap_tape_push() stays unchanged.
  5. Added _handle_start_python_profiling(), _handle_end_python_profiling(), _handle_start_detailed_profiling(), _handle_end_detailed_profiling(), _handle_start_dataloader_profiling() and _handle_end_dataloader_profiling() methods to keras.py to reduce code duplication.
  6. Updated _increment_step() in hook.py so that increasing the step and writing state can be done separately.
  7. Added unit tests for the profiler-only and profiler + debugger use cases.
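
To make the step bookkeeping in items 3, 4 and 6 easier to follow, here is a minimal, self-contained sketch. It is not the smdebug code: `ToyHook`, `run_steps`, the `debugger_enabled` flag, the `write_state` parameter and the `events` list are hypothetical names made up for illustration, and `wrap_tape()` here takes no tape argument. Only the method names mentioned in the list above are borrowed from the PR.

```python
class ToyHook:
    """Hypothetical stand-in for KerasHook; NOT the smdebug implementation."""

    def __init__(self, debugger_enabled, is_profiler_native_training=False):
        self.debugger_enabled = debugger_enabled  # assumed flag, for illustration
        self.is_profiler_native_training = is_profiler_native_training
        self.step = -1
        self.events = []  # (event name, step) pairs, recorded for inspection

    def _increment_step(self, write_state=True):
        # Assumed signature: step increase and state write can be separated.
        self.step += 1
        if write_state:
            self.events.append(("write_state", self.step))

    def _decrement_step(self):
        self.step -= 1

    def profiling_start_batch(self):
        # The profiler advances the step without writing debugger state.
        self._increment_step(write_state=False)
        self.events.append(("profile_start", self.step))

    def profiling_end_batch(self):
        self.events.append(("profile_end", self.step))

    def _wrap_tape_push(self):
        # Debugger path, left unchanged: it always increments the step itself.
        self._increment_step()

    def wrap_tape(self):
        # Undo the profiler's increment so the debugger's own increment
        # lands on the same step number instead of skipping one.
        if self.is_profiler_native_training:
            self._decrement_step()
        self._wrap_tape_push()


def run_steps(hook, num_steps):
    """Simulate a native training loop and return the final step number."""
    for _ in range(num_steps):
        hook.profiling_start_batch()
        if hook.debugger_enabled:
            hook.wrap_tape()
        # ... forward pass, gradients and optimizer update would go here ...
        hook.profiling_end_batch()
    return hook.step


both = ToyHook(debugger_enabled=True, is_profiler_native_training=True)
profiler_only = ToyHook(debugger_enabled=False, is_profiler_native_training=True)
final_both = run_steps(both, 3)
final_profiler_only = run_steps(profiler_only, 3)
```

In this toy model both configurations end on the same step number after three iterations, because the decrement in `wrap_tape()` cancels the extra increment from `profiling_start_batch()`. Without it, the profiler + debugger case would advance the step twice per iteration.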

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


codecov-io commented Jan 18, 2021

Codecov Report

Merging #420 (f704e48) into master (6788e32) will decrease coverage by 14.18%.
The diff coverage is 5.68%.

@@             Coverage Diff             @@
##           master     #420       +/-   ##
===========================================
- Coverage   76.91%   62.72%   -14.19%     
===========================================
  Files         113      113               
  Lines       10195    10237       +42     
===========================================
- Hits         7841     6421     -1420     
- Misses       2354     3816     +1462     
| Impacted Files | Coverage Δ | |
|---|---|---|
| smdebug/tensorflow/keras.py | 0.00% <0.00%> (-90.10%) | ⬇️ |
| smdebug/core/hook.py | 89.33% <100.00%> (-4.56%) | ⬇️ |
| smdebug/tensorflow/__init__.py | 0.00% <0.00%> (-100.00%) | ⬇️ |
| smdebug/tensorflow/constants.py | 0.00% <0.00%> (-100.00%) | ⬇️ |
| smdebug/tensorflow/singleton_utils.py | 0.00% <0.00%> (-100.00%) | ⬇️ |
| smdebug/tensorflow/collection.py | 0.00% <0.00%> (-95.88%) | ⬇️ |
| smdebug/tensorflow/session.py | 0.00% <0.00%> (-91.83%) | ⬇️ |
| smdebug/tensorflow/tensor_ref.py | 0.00% <0.00%> (-88.71%) | ⬇️ |
| smdebug/tensorflow/utils.py | 0.00% <0.00%> (-87.62%) | ⬇️ |
| ... and 30 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sophiayue1116 sophiayue1116 marked this pull request as ready for review January 20, 2021 07:47
@sophiayue1116 sophiayue1116 marked this pull request as draft January 20, 2021 07:53
@sophiayue1116 sophiayue1116 marked this pull request as ready for review January 20, 2021 08:29

@ndodda-amazon ndodda-amazon left a comment

Changes look good; my comments are mostly about reducing code redundancy in your implementation and the tests.

smdebug/tensorflow/keras.py (outdated, resolved)
smdebug/tensorflow/keras.py (resolved)
smdebug/tensorflow/keras.py (outdated, resolved)
smdebug/tensorflow/keras.py (outdated, resolved)
smdebug/tensorflow/keras.py (resolved)
tests/profiler/tensorflow2/test_native_tf2_profiler.py (outdated, resolved)
tests/profiler/tensorflow2/test_native_tf2_profiler.py (outdated, resolved)
tests/profiler/tensorflow2/test_native_tf2_profiler.py (outdated, resolved)
tests/profiler/tensorflow2/test_native_tf2_profiler.py (outdated, resolved)
4 participants