Create two separate networks that compete to explore the environment (together they form 1 agent)
Idea is to have a reinforcement learning setup where:
The prediction network learns an unsupervised representation of the environment, and predicts what will happen next
We could use adversarial techniques for unsupervised learning, or we could use something less fancy like denoising autoencoders
The exploration network controls the actions of the agent, and gets a reward proportional to the MSE between the prediction network's prediction and what actually happens
This is an artificial reward signal, not tied to the true environment reward
The exploration network has no backprop into the weights of the prediction network, so it can't induce degenerate representations (e.g. the predictor learning to output random noise to maximize surprise).
Influence is solely through the actions of the exploration network causing mispredictions, i.e. reality always sits between the exploration network and the prediction network
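To make the setup above concrete, here is a minimal tabular sketch, not a proposed implementation. Everything in it is an assumption made for illustration: the `Predictor` and `Explorer` classes, the two-action `env_step` environment, the learning rates, and the use of a gradient-bandit (REINFORCE-style) update for the explorer. The point it tries to show is the separation: the explorer only ever receives the predictor's squared error as a plain float, so it has no gradient path into the predictor's parameters.

```python
import math
import random


class Predictor:
    """Tabular stand-in for the prediction network (hypothetical)."""

    def __init__(self, n_actions, lr=0.1):
        self.pred = [0.0] * n_actions  # predicted next observation per action
        self.lr = lr

    def predict(self, action):
        return self.pred[action]

    def update(self, action, observed):
        # Only the predictor's own error changes these parameters.
        err = observed - self.pred[action]
        self.pred[action] += self.lr * err
        return err * err  # squared error of the pre-update prediction


class Explorer:
    """Softmax policy over actions, trained with a gradient-bandit update."""

    def __init__(self, n_actions, lr=0.1):
        self.prefs = [0.0] * n_actions
        self.lr = lr
        self.baseline = 0.0

    def probs(self):
        m = max(self.prefs)
        exps = [math.exp(p - m) for p in self.prefs]
        z = sum(exps)
        return [e / z for e in exps]

    def act(self, rng):
        return rng.choices(range(len(self.prefs)), weights=self.probs())[0]

    def update(self, action, reward):
        # The reward is a plain float: nothing here touches the predictor.
        p = self.probs()
        adv = reward - self.baseline
        for a in range(len(self.prefs)):
            grad = (1.0 if a == action else 0.0) - p[a]
            self.prefs[a] += self.lr * adv * grad
        self.baseline += 0.05 * (reward - self.baseline)


def env_step(action, rng):
    # Assumed toy environment: action 0 is deterministic, action 1 is noisy.
    return 1.0 if action == 0 else rng.gauss(0.0, 1.0)


def run(steps=2000, seed=0):
    rng = random.Random(seed)
    predictor, explorer = Predictor(2), Explorer(2)
    for _ in range(steps):
        action = explorer.act(rng)
        observed = env_step(action, rng)
        reward = predictor.update(action, observed)  # artificial reward
        explorer.update(action, reward)
    return predictor, explorer
```

In this toy environment the noisy action never becomes predictable, so its reward never decays the way the deterministic action's does; whether that is desirable exploration or a distraction is a design question this sketch does not answer.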
Considerations:
The exploration network needs to adapt quickly to changing dynamics (model this like a multi-armed bandit that periodically changes the payout probabilities of the arms). Things like RL^2 are probably a good idea here.
The exploration network might need the raw observations as input, plus some memory such as an LSTM
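The changing-bandit picture from the considerations above can be sketched as follows. This is only an illustration under stated assumptions: `ShiftingBandit`, `run_agent`, the payout probabilities, the shuffle period, and the epsilon-greedy agent are all hypothetical, and the constant step size is a much simpler stand-in for RL^2, not an implementation of it. A constant step size weights recent rewards more heavily, which is one basic way to keep tracking arms whose payouts move.

```python
import random


class ShiftingBandit:
    """Bernoulli bandit whose payout probabilities are reshuffled periodically."""

    def __init__(self, probs, period, rng):
        self.probs = list(probs)
        self.period = period
        self.rng = rng
        self.t = 0

    def pull(self, arm):
        self.t += 1
        reward = 1.0 if self.rng.random() < self.probs[arm] else 0.0
        if self.t % self.period == 0:
            self.rng.shuffle(self.probs)  # the dynamics change
        return reward


def run_agent(step_size, steps=5000, eps=0.1, seed=0):
    """Epsilon-greedy agent; step_size=None means plain sample averaging."""
    rng = random.Random(seed)
    bandit = ShiftingBandit([0.1, 0.5, 0.9], period=500, rng=rng)
    q = [0.0] * 3  # value estimate per arm
    n = [0] * 3    # pull count per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(3)  # explore
        else:
            arm = max(range(3), key=lambda a: q[a])  # exploit
        reward = bandit.pull(arm)
        n[arm] += 1
        alpha = step_size if step_size is not None else 1.0 / n[arm]
        q[arm] += alpha * (reward - q[arm])
        total += reward
    return total / steps
```

Calling `run_agent(0.1)` and `run_agent(None)` with the same seed gives a rough comparison between a recency-weighted estimate and a plain sample average on the same shifting bandit.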