0. note structure
this page is now organized as a living notebook on on-policy distillation (opd) and generalized knowledge distillation (gkd).
i’m keeping the intuitive blog-style notes and the iclr 2024 paper notes in one place, but separated by purpose:
- intuition first — why on-policy distillation is different from supervised kd / seqkd
- formal paper notes — gkd objective, algorithm, divergence choices, and paper results
- implementation notes — how i map the paper into an actual training loop
- my experiments — small-scale gsm8k sanity checks and what the results actually mean
- paper add-on template — where future papers should go so this does not become messy again
1. intuition: why on-policy distillation matters
my chess analogy still feels like the cleanest way to understand opd.
- on-policy rl is like playing games yourself and getting a win/loss signal after the game. useful, but the feedback is sparse: you get one outcome for the whole game, so you do not know exactly which move caused the result.
- off-policy / supervised distillation is like watching a grandmaster play. you see strong moves, but those positions may not be the positions you personally reach.
- on-policy distillation combines both: you play your own game, and a stronger coach grades the moves you actually made.
so for llm post-training, opd means:
the student samples its own trajectories, and a stronger teacher gives feedback on the exact token states the student visits.
this is why the phrase "learning from self-generated mistakes" is accurate. the student is not just copying clean teacher outputs. it exposes its own failure modes by sampling, and the teacher corrects it at exactly the states it visited.
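a minimal toy sketch of the idea above, to make "the teacher grades the states the student visits" concrete. everything here is an assumption for illustration: the `STUDENT` and `TEACHER` tables are hypothetical stand-ins for real models (next-token distributions keyed only by the previous token), and per-token reverse kl is just one possible choice of grading signal. no gradient step is shown, only where the data comes from and where the feedback lands.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]

# hypothetical toy "models": next-token probabilities keyed by the previous token
STUDENT = {
    "<s>": [0.5, 0.4, 0.1],
    "a":   [0.3, 0.3, 0.4],
    "b":   [0.6, 0.2, 0.2],
}
TEACHER = {
    "<s>": [0.9, 0.05, 0.05],
    "a":   [0.05, 0.15, 0.8],
    "b":   [0.7, 0.1, 0.2],
}

def sample_from_student(rng, max_len=8):
    """roll out the student's own trajectory (the on-policy part).

    returns the visited states (here: the previous token) and the sampled tokens.
    """
    prev, states, tokens = "<s>", [], []
    for _ in range(max_len):
        tok = rng.choices(VOCAB, weights=STUDENT[prev])[0]
        states.append(prev)
        tokens.append(tok)
        if tok == "<eos>":
            break
        prev = tok
    return states, tokens

def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) over the vocabulary at a single state."""
    return sum(ps * math.log(ps / pt)
               for ps, pt in zip(p_student, p_teacher) if ps > 0)

def grade_trajectory(states):
    """the teacher grades every state the student actually visited (the distillation part)."""
    return [reverse_kl(STUDENT[s], TEACHER[s]) for s in states]

rng = random.Random(0)
states, tokens = sample_from_student(rng)
per_token_loss = grade_trajectory(states)

for state, tok, loss in zip(states, tokens, per_token_loss):
    print(f"state={state!r:8} sampled={tok!r:8} reverse_kl={loss:.3f}")
```

the contrast with the other two regimes shows up in the shapes: outcome-based rl would give one scalar for the whole rollout, while supervised kd would compute the same kind of per-token signal on states taken from teacher-written sequences rather than from `sample_from_student`.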
[insert image here: chess.com move grading analogy from the blog]
2. where opd fits in llm training