openai · datajanko · Sep 16, 2022 · Sep 16, 2022
diff --git a/docs/spinningup/rl_intro3.rst b/docs/spinningup/rl_intro3.rst
@@ -338,9 +338,8 @@ is called the **reward-to-go** from that point, and this policy gradient express
 
     **But how is this better?** A key problem with policy gradients is how many sample trajectories are needed to get a low-variance sample estimate for them. The formula we started with included terms for reinforcing actions proportional to past rewards, all of which had zero mean, but nonzero variance: as a result, they would just add noise to sample estimates of the policy gradient. By removing them, we reduce the number of sample trajectories needed.
 
-An (optional) proof of this claim can be found `here`_, and it ultimately depends on the EGLP lemma.
+An (optional) proof of this claim can be found `here <../spinningup/extra_pg_proof1.html>`_, and it ultimately depends on the EGLP lemma.
 
-.. _`here`: ../spinningup/extra_pg_proof1.html
 
 Implementing Reward-to-Go Policy Gradient
 =========================================
@@ -474,4 +473,4 @@ In this chapter, we described the basic theory of policy gradient methods and co
 .. _`advantage of an action`: ../spinningup/rl_intro.html#advantage-functions
 .. _`this page`: ../spinningup/extra_pg_proof2.html
 .. _`Generalized Advantage Estimation`: https://arxiv.org/abs/1506.02438
-.. _`Vanilla Policy Gradient`: ../algorithms/vpg.html
+.. _`Vanilla Policy Gradient`: ../algorithms/vpg.html
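
Reviewer note on the paragraph kept as context in the first hunk: the reward-to-go idea it describes can be illustrated with a minimal sketch. This is not code from this diff. The function name `reward_to_go` and the plain-list input are assumptions made for illustration, and discounting is left out. Each action is weighted only by the rewards collected at or after its timestep, so the zero-mean terms for past rewards never enter the estimate.

```python
def reward_to_go(rewards):
    """Return, for each timestep t, the sum of rewards from t to the episode end.

    Hypothetical helper for illustration; assumes an undiscounted,
    finite-horizon episode given as a list of per-step rewards.
    """
    rtgs = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum already accumulated after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtgs[t] = running
    return rtgs


if __name__ == "__main__":
    # Three-step episode: the first action is weighted by all three rewards,
    # the last action only by the final reward.
    print(reward_to_go([1.0, 2.0, 3.0]))
```

The backward pass keeps the computation linear in episode length, rather than re-summing the tail of the episode at every timestep.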