Hello, I just read the paper today, and there are still two points that remain unclear to me.
I looked at the code to try to understand it better, but it is still not clear to me.
The first point:
In model.py, the feature functions that transform the input state into feature space are defined in nipsHead, universeHead, etc.
In these definitions and their usage, I see no trace of normalization (something like an L2 normalize).
I would expect to see normalization here, because it seems very easy for the network to cheat: if it wants to maximize the intrinsic reward, it only has to scale the features up (and scale them back down in the inverse model to avoid being penalized). See the sketch just below.
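To make the concern concrete, here is a minimal sketch of what I mean (my own code, not from the repository), assuming TensorFlow: the head's output would be projected onto the unit sphere with tf.nn.l2_normalize before being fed to the inverse and forward models, so the forward-model error could no longer be inflated just by scaling the embedding.

```python
import tensorflow as tf

def normalized_features(phi):
    # Hypothetical sketch: constrain the feature embedding to the unit sphere.
    # With ||phi|| fixed at 1, the forward model's prediction error (and hence
    # the intrinsic reward) cannot be increased simply by scaling phi up.
    return tf.nn.l2_normalize(phi, axis=-1)
```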
The second point:
It seems to me that every time the parameters of the feature function are modified, the intrinsic rewards, and therefore the rewards for the whole episode, are modified as well. So we would need to recompute the generalized advantages for the whole episode. Does this mean we must process episodes in their entirety? How does this play with experience replay? Is there an approximation that avoids recomputing the advantages after every update? A sketch of the full recomputation I have in mind follows.
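For reference, this is the kind of full recomputation I mean: a minimal NumPy sketch (again my own, not from the repository) that rebuilds the combined rewards from freshly recomputed intrinsic rewards and then recomputes GAE over the entire stored trajectory. All names here (intrinsic_rewards, values, etc.) are placeholders.

```python
import numpy as np

def recompute_gae(ext_rewards, intrinsic_rewards, values, gamma=0.99, lam=0.95):
    # After the feature/ICM parameters change, any stored intrinsic rewards are
    # stale, so the combined rewards and the generalized advantages have to be
    # rebuilt for the whole episode.
    rewards = np.asarray(ext_rewards) + np.asarray(intrinsic_rewards)
    values = np.asarray(values)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # Bootstrap with the next state's value, or 0 at the end of the episode.
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```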
Thanks.