On-Policy Learning

Fact

A reinforcement learning approach in which a policy is updated using only data collected by that same, current policy.

On-policy learning is like a basketball player watching their own game tape. They fix their own air ball, not a shot taken by the kid on the next court.

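The agent acts with its current policy, then uses that experience to update the same policy. You meet it in Policy Gradient methods and PPO. It is steady, but it needs many practice runs because data gathered under an older policy generally cannot be reused.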
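
The collect-then-update cycle can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a toy two-armed bandit, a softmax policy, and a simple policy-gradient update with a batch-mean baseline. The names `train_on_policy`, `arm_means`, `batch_size`, and `lr` are hypothetical and chosen only for this example.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_on_policy(arm_means=(0.2, 0.8), iterations=200, batch_size=20,
                    lr=0.5, seed=0):
    """Toy on-policy policy-gradient loop on a Bernoulli bandit (assumed setup)."""
    rng = random.Random(seed)
    prefs = [0.0 for _ in arm_means]  # policy parameters, one per arm
    n_arms = len(prefs)

    for _ in range(iterations):
        probs = softmax(prefs)

        # 1. Collect a fresh batch by acting with the CURRENT policy.
        batch = []
        for _ in range(batch_size):
            action = rng.choices(range(n_arms), weights=probs)[0]
            reward = 1.0 if rng.random() < arm_means[action] else 0.0
            batch.append((action, reward))

        # 2. Update that same policy from the batch it just produced.
        baseline = sum(r for _, r in batch) / batch_size
        grads = [0.0] * n_arms
        for action, reward in batch:
            for a in range(n_arms):
                indicator = 1.0 if a == action else 0.0
                grads[a] += (reward - baseline) * (indicator - probs[a])
        prefs = [p + lr * g / batch_size for p, g in zip(prefs, grads)]

        # 3. The batch is not kept: the next iteration samples new data
        #    from the updated policy.

    return softmax(prefs)
```

Calling `train_on_policy()` returns the final action probabilities. Because every batch is thrown away after one update, the loop has to keep generating new experience, which is where the "many practice runs" cost comes from.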

RL
On-policy learning is one way to update a policy in RL.

Off-Policy Learning
The main difference is whether the data came from the current policy.
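
One way to picture that difference is to tag each stored sample with the policy version that produced it. The sketch below is an assumption-laden illustration: `Transition`, `policy_version`, `usable_on_policy`, and `usable_off_policy` are hypothetical names, and real off-policy algorithms also differ in how they correct for the mismatch between old and current policies, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    state: int
    action: int
    reward: float
    policy_version: int  # hypothetical tag: which policy version made this sample


def usable_on_policy(buffer, current_version):
    """On-policy: keep only samples generated by the current policy version."""
    return [t for t in buffer if t.policy_version == current_version]


def usable_off_policy(buffer, current_version):
    """Off-policy: samples from any policy version are eligible for learning."""
    return list(buffer)


buffer = [
    Transition(state=0, action=1, reward=0.0, policy_version=0),
    Transition(state=1, action=0, reward=1.0, policy_version=0),
    Transition(state=2, action=1, reward=1.0, policy_version=1),
]

on_policy_data = usable_on_policy(buffer, current_version=1)
off_policy_data = usable_off_policy(buffer, current_version=1)
```

Here only the last transition qualifies for an on-policy update at version 1, while all three remain eligible under the off-policy rule.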

Policy Gradient
Many Policy Gradient methods use samples from the current policy.

PPO
PPO is a classic on-policy training method.