ON-POLICY DISTILLATION NOTES

living note / rl for llms primer

0. note structure

this page is now organized as a living opd / gkd notebook.

i’m keeping the intuitive blog-style notes and the iclr 2024 paper notes in one place, but separated by purpose:

intuition first — why on-policy distillation is different from supervised kd / seqkd
formal paper notes — gkd objective, algorithm, divergence choices, and paper results
implementation notes — how i map the paper into an actual training loop
my experiments — small-scale gsm8k sanity checks and what the results actually mean
paper add-on template — where future papers should go so this does not become messy again

1. intuition: why on-policy distillation matters

my chess analogy still feels like the cleanest way to understand opd.

on-policy rl is like playing games yourself and getting a win/loss signal after the game. useful, but the feedback is sparse because you do not know exactly which move caused the result.
off-policy / supervised distillation is like watching a grandmaster play. you see strong moves, but those positions may not be the positions you personally reach.
on-policy distillation combines both: you play your own game, and a stronger coach grades the moves you actually made.

so for llm post-training, opd means:

the student samples its own trajectories, and a stronger teacher gives feedback on the exact token states the student visits.

this is why the phrase "learning from self-generated mistakes" is accurate. the student is not only copying clean teacher outputs. it exposes its own failure modes by sampling, and the teacher corrects it at exactly the states it visited.
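
to make that data flow concrete, here is a minimal toy sketch in python. everything in it is an assumption for illustration: the `STUDENT` and `TEACHER` tables are hypothetical stand-ins for real models (next-token distributions keyed only by the previous token), and per-state reverse kl is just one possible grading signal, not necessarily the divergence any particular paper uses. the only point is the loop shape: the student samples, and the teacher scores the states the student actually reached.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]

# Hypothetical toy "models": next-token distributions over VOCAB,
# keyed by the previous token. Real models condition on the full prefix.
STUDENT = {
    "<bos>": [0.5, 0.4, 0.1],
    "a": [0.3, 0.5, 0.2],
    "b": [0.6, 0.2, 0.2],
}
TEACHER = {
    "<bos>": [0.8, 0.1, 0.1],
    "a": [0.1, 0.7, 0.2],
    "b": [0.1, 0.1, 0.8],
}


def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) at one state (an assumed choice of divergence)."""
    return sum(
        ps * math.log(ps / pt)
        for ps, pt in zip(p_student, p_teacher)
        if ps > 0
    )


def sample_student_trajectory(rng, max_len=8):
    """The student plays its own game: tokens are sampled from STUDENT only."""
    state, tokens = "<bos>", []
    for _ in range(max_len):
        tok = rng.choices(VOCAB, weights=STUDENT[state])[0]
        tokens.append(tok)
        if tok == "<eos>":
            break
        state = tok
    return tokens


def grade_trajectory(tokens):
    """The teacher grades every state the student visited, one grade per token."""
    state, grades = "<bos>", []
    for tok in tokens:
        grades.append((state, tok, reverse_kl(STUDENT[state], TEACHER[state])))
        state = tok
    return grades


rng = random.Random(0)
trajectory = sample_student_trajectory(rng)
grades = grade_trajectory(trajectory)
for state, tok, kl in grades:
    print(f"state={state!r:8} sampled={tok!r:8} reverse_kl={kl:.3f}")
```

each sampled token gets its own grade, in contrast to a single win/loss at the end of the game. in a real training loop the grade would be a differentiable loss over the student's logits. here it is only computed, not optimized.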

[insert image here: chess.com move grading analogy from the blog]

2. where opd fits in llm training