Blog Post··6 min read
Policy Gradients: The Math Behind RLHF
The policy gradient theorem lets you differentiate through a reward signal you can't backprop through. Here's the derivation and why it works.
The policy gradient theorem lets you differentiate through a reward signal you can't backprop through. Here's the derivation and why it works.