Tag: policy-gradient

Blog Post·2024-06-19·6 min read

Policy Gradients: The Math Behind RLHF

The policy gradient theorem lets you differentiate through a reward signal you can't backprop through. Here's the derivation and why it works.