This project attempts to use Reinforcement Learning to train a model to perform various control tasks on a manipulator, and to train the model to generalise to random target locations.
- The first task for the manipulator is to reach a fixed target location.
- The next task is to learn a generalised policy for the arm to reach any random target location.
- The final task is to follow a trajectory between two locations.
- Developed a gym environment for Reinforcement Learning on manipulator control using PyBullet Physics Simulator.
- The arm's URDF file and the Physics Simulator setup are defined under `/ents/arm.py`.
- The environments are defined under `/envs/`. The action space, the observation space, and the `step` function are defined here.
- In the first environment, the agent starts in a fixed state and is supposed to reach a fixed target location.
- In the second environment, the agent starts from a fixed state and is supposed to reach a target location whose position is decided at the start of every episode.
- The area where the target spawns can be limited.
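As a rough illustration, the sketch below shows how such a Gym environment might be laid out; the class name, goal sampling bounds, and reset/step bodies are assumptions for illustration, not the project's actual code.

```python
import gym
import numpy as np
from gym import spaces


class ReacherEnv(gym.Env):
    """Hypothetical sketch of a PyBullet reacher environment; names are illustrative."""

    def __init__(self, random_target=False):
        super().__init__()
        self.random_target = random_target
        # One position target per joint (7 joints, matching the observation table below)
        self.action_space = spaces.Box(low=-np.pi, high=np.pi, shape=(7,), dtype=np.float32)
        # 13-dim observation for the fixed-target task (see the table below)
        self.observation_space = spaces.Box(low=-10.0, high=10.0, shape=(13,), dtype=np.float32)
        self.goal = np.array([0.5, 0.3, 0.4])  # placeholder fixed goal

    def reset(self):
        if self.random_target:
            # Limit the area where the target spawns (bounds are placeholders)
            self.goal = np.random.uniform(low=[0.3, -0.3, 0.2], high=[0.6, 0.3, 0.5])
        # ... reset the PyBullet simulation here and return the initial observation ...
        return np.zeros(13, dtype=np.float32)

    def step(self, action):
        # ... apply the action, step the simulation, and build the next observation ...
        obs = np.zeros(13, dtype=np.float32)
        reward = 0.0
        done = False
        return obs, reward, done, {}
```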
- Episode Length: 6500 timesteps
- Torque Sensitivity: 200
- Velocity Sensitivity: 10
- Position Sensitivity: 1
- Maximum Force: 300
- Arm Base Position: (0, 0, 0.005)
There are 3 different Action Types:
- Torque Control (Maximum Torque=200)
- Velocity Control (Maximum Velocity = 10)
- Position Control (Maximum Force = 300)
Keeping the hardware implementation of the project in mind, we have used Position Control as the action space. We have also trained policies for the Torque Control action space.
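For illustration, the snippet below sketches how Position Control with the maximum force above could be applied through PyBullet; the robot id, joint count, and helper name are placeholders.

```python
import pybullet as p

NUM_JOINTS = 7     # assumption: one actuated joint per joint-position entry in the observation
MAX_FORCE = 300    # Maximum Force used for Position Control


def apply_position_control(robot_id, target_joint_positions):
    """Drive every joint towards its target angle with PyBullet's built-in position controller."""
    p.setJointMotorControlArray(
        bodyUniqueId=robot_id,
        jointIndices=list(range(NUM_JOINTS)),
        controlMode=p.POSITION_CONTROL,
        targetPositions=target_joint_positions,
        forces=[MAX_FORCE] * NUM_JOINTS,
    )
```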
The observation space (dim = 13) consists of:

Index | Observation | Range |
---|---|---|
0-6 | Joint Position (angular) | - |
7-9 | End Effector location relative to target (cartesian) | -10 to 10 |
10-12 | Velocity of End Effector (cartesian) | -10 to 10 |
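The sketch below shows one way this 13-dimensional observation might be assembled from PyBullet state queries; the end-effector link index and the helper name are assumptions.

```python
import numpy as np
import pybullet as p

EE_LINK_INDEX = 6  # assumption: index of the end-effector link in the URDF


def build_observation(robot_id, goal):
    # Indices 0-6: joint positions (angular)
    joint_states = p.getJointStates(robot_id, list(range(7)))
    joint_positions = [state[0] for state in joint_states]

    # Query the end-effector world position and cartesian velocity
    link_state = p.getLinkState(robot_id, EE_LINK_INDEX, computeLinkVelocity=1)
    ee_pos = np.array(link_state[0])   # linkWorldPosition
    ee_vel = np.array(link_state[6])   # worldLinkLinearVelocity

    # Indices 7-9: end-effector location relative to the target
    rel_pos = ee_pos - np.array(goal)

    # Indices 10-12: cartesian velocity of the end effector
    return np.concatenate([joint_positions, rel_pos, ee_vel]).astype(np.float32)
```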
The models used in this project for Torque Control are from Stable Baselines3, a set of reliable implementations of reinforcement learning algorithms using PyTorch. We have used the Proximal Policy Optimization (PPO) algorithm to train our manipulator agent.
Model HyperParameters
- Network Architecture:
  - Policy Network: [512, 512, 256, 128]
  - Value Function Approximator: [512, 512, 256, 128]
- Activation Function: ReLU
- Learning Rate: 0.0001
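The hyperparameters above roughly translate into the following Stable Baselines3 call; the helper name and the exact `policy_kwargs` layout are assumptions (the `net_arch` dict form matches recent SB3 versions).

```python
import torch
from stable_baselines3 import PPO


def make_ppo_agent(env):
    """Build a PPO agent with the hyperparameters listed above (sketch only)."""
    policy_kwargs = dict(
        activation_fn=torch.nn.ReLU,
        # separate [512, 512, 256, 128] MLPs for the policy and the value function
        net_arch=dict(pi=[512, 512, 256, 128], vf=[512, 512, 256, 128]),
    )
    return PPO(
        "MlpPolicy",
        env,
        learning_rate=1e-4,
        policy_kwargs=policy_kwargs,
        verbose=1,
    )
```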
The reward function is defined as the negative of the Euclidean Distance between the target location and the end effector.
reward = -1*np.linalg.norm(np.array(eeloc)-np.array(self.goal))
A cube of side 0.1 units is defined around the goal location, inside which the agent obtains 0 reward.
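Putting the distance reward and the zero-reward cube together, a minimal sketch of the reward computation could look as follows; treating the 0.1-unit side as ±0.05 per axis around the goal is an assumption.

```python
import numpy as np

CUBE_SIDE = 0.1  # side length of the zero-reward cube around the goal


def compute_reward(eeloc, goal):
    """Negative Euclidean distance, clamped to 0 inside a small cube around the goal."""
    delta = np.array(eeloc) - np.array(goal)
    if np.all(np.abs(delta) <= CUBE_SIDE / 2):
        return 0.0  # end effector is inside the cube around the goal
    return -1 * np.linalg.norm(delta)
```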
The observation space (dim = 17) consists of:

Index | Observation | Range |
---|---|---|
0-6 | Joint Position (angular) | - |
7-13 | Joint Angular Velocity | -10 to 10 |
14-16 | Target Location (cartesian) | -2 to 2 |
The models used in this project are from Stable Baselines3, a set of reliable implementations of reinforcement learning algorithms using PyTorch. We have used the Proximal Policy Optimization (PPO) algorithm to train our manipulator agent.
Model HyperParameters
- Network Architecture:
  - Policy Network: [512, 512, 256, 128]
  - Value Function Approximator: [512, 512, 256, 128]
- Activation Function: ReLU
- Learning Rate: 0.0001
Firstly, the reward function is defined as the negative of the Euclidean Distance between the target location and the end effector.
reward1 = -1*np.linalg.norm(np.array(eeloc)-np.array(self.goal))
Rendering has a delay of 0.1 sec after every action for better visualisation
The vibrations observed above after reaching the target location were decreased by penalizing the agent based on the value of the angular velocity of every joint of the arm.
# Penalize the agent whenever any joint's angular velocity exceeds the threshold (=3)
for ang_vel in ang_vel_joints:
    if ang_vel > 3:
        reward2 = -1 * ed_dis
The Euclidean Distance based reward as defined above was untouched.
Rendering has a delay of 0.1 sec after every action for better visualisation
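For clarity, the sketch below shows one way the two reward terms could be combined inside the step function; the write-up does not spell out the combination, so the simple sum, the constant name, and the helper signature are assumptions.

```python
import numpy as np

ANG_VEL_THRESHOLD = 3  # threshold on each joint's angular velocity


def shaped_reward(eeloc, goal, ang_vel_joints):
    """Distance-based reward plus an angular-velocity penalty (summing the terms is assumed)."""
    ed_dis = np.linalg.norm(np.array(eeloc) - np.array(goal))
    reward1 = -1 * ed_dis               # Euclidean-Distance reward, unchanged
    reward2 = 0.0
    for ang_vel in ang_vel_joints:
        if ang_vel > ANG_VEL_THRESHOLD:
            reward2 = -1 * ed_dis       # penalty proportional to the distance term
    return reward1 + reward2
```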
- The vibrations that were observed after reaching the specific target location in the first simulation were possibly because there were no restrictions on the angular velocity of any joint of the manipulator. Thus, even after reaching its target, the joints had significant angular velocity, hence the shaky movement of the arm.
- We overcame this issue by defining another reward function alongside the Euclidean Distance one, which penalizes the agent if the angular velocity of any of its joints is greater than a certain threshold (=3). To keep this reward proportional to the first reward function, the penalty is equal to the negative of the Euclidean Distance between the end effector and the target.
- As we can observe in the second simulation, the vibrations were reduced considerably, and the manipulator was quite stable after reaching the target location.
- Euclidean Distance of end effector with respect to target location for first case
- Euclidean Distance of end effector with respect to target location for second case, where the agent is penalized on its angular velocity