optimize clip_by_norm #183

Open
wants to merge 1 commit into develop
Conversation

zhangting2020
Contributor

Optimizations:

  • mul + set_value: set_value triggers a large number of memcpy operations; replaced it with an in-place operation.
  • Redundant clip_norm computation: added a need_grad_norm flag so the global norm is computed only when TensorBoard needs to observe it; otherwise a large number of norm ops is introduced.
  • Redundant casts: in the original PR code, printing paddle_dtype yields "float32", but tensor.dtype actually returns paddle.float32, so the comparison fails and a pointless cast is inserted. In addition, under O2 the gradients are fp16, and the original code converted clip_coef_clamped to fp16 for every gradient; it only needs to be computed once and then reused by the remaining gradients, so a clip_coef_clamped_low_precison variable was added.
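The three fixes above can be sketched together in a minimal NumPy stand-in (the actual PR targets Paddle; the helper below and its signature are hypothetical, though the names need_grad_norm and clip_coef_clamped_low_precison mirror the description):

```python
import numpy as np

def clip_grads_by_global_norm(grads, clip_norm, need_grad_norm=False):
    """Hypothetical sketch of the optimized clipping scheme:
    - gradients are scaled in place (no mul + set_value round-trip)
    - the global norm is computed once, and returned only on request
    - the fp16 clip coefficient is cast once and reused for all fp16 grads
    """
    # Global norm over all gradients, computed a single time.
    global_norm = np.sqrt(sum(float(np.sum(g.astype(np.float64) ** 2)) for g in grads))
    clip_coef_clamped = min(clip_norm / (global_norm + 1e-6), 1.0)
    # One-time low-precision copy, reused for every fp16 gradient.
    clip_coef_clamped_low_precison = np.float16(clip_coef_clamped)
    for g in grads:
        # Compare dtype objects, not their string forms -- comparing against
        # the string "float16" would always fail, forcing a useless cast.
        coef = clip_coef_clamped_low_precison if g.dtype == np.float16 else clip_coef_clamped
        g *= coef  # in-place scaling, no extra buffer
    # Only expose the norm when the caller (e.g. TensorBoard logging) asks for it.
    return global_norm if need_grad_norm else None
```

With mixed fp32/fp16 gradients, the clipped global norm lands at clip_norm (up to fp16 rounding), and each gradient keeps its original dtype.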

Effects:

  • O1: 0.739 steps/s -> 2.194 steps/s
  • O2: 1.067 steps/s -> 2.655 steps/s

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


[email protected] does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Have you signed the CLA already but the status is still pending? Let us recheck it.
