0. note structure

this page is now organized as a living notebook on on-policy distillation (opd) and generalized knowledge distillation (gkd).

i’m keeping the intuitive blog-style notes and the iclr 2024 paper notes in one place, but separated by purpose:

1. intuition: why on-policy distillation matters

my chess analogy still feels like the cleanest way to understand opd.

so for llm post-training, opd means:

the student samples its own trajectories, and a stronger teacher gives feedback on the exact token states the student visits.

this is why the phrase "learning from self-generated mistakes" is accurate. the student is not only copying clean teacher outputs. it is exposing its own failure modes, and the teacher is correcting it on exactly those visited states.
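to make the loop concrete, here is a minimal toy sketch in python. everything in it is an assumption for illustration: the three-token vocabulary, the hypothetical `student_probs` and `teacher_probs` functions, and the use of reverse kl as the per-token feedback signal (one possible divergence, not necessarily the one a given opd setup uses). the point is only the shape of the loop: the student generates, and the teacher is queried on the prefixes the student actually reached.

```python
import math
import random

# hypothetical three-token vocabulary, for illustration only
VOCAB = ["a", "b", "<eos>"]


def student_probs(prefix):
    # toy student (assumption): same next-token distribution at every state
    return [0.5, 0.3, 0.2]


def teacher_probs(prefix):
    # toy teacher (assumption): prefers "b" right after "a", otherwise prefers "a"
    if prefix and prefix[-1] == "a":
        return [0.1, 0.7, 0.2]
    return [0.6, 0.2, 0.2]


def sample_student_trajectory(rng, max_len=5):
    # the student samples its own tokens; we record every prefix (state) it visits
    prefix, states = [], []
    for _ in range(max_len):
        states.append(tuple(prefix))
        token = rng.choices(VOCAB, weights=student_probs(prefix))[0]
        prefix.append(token)
        if token == "<eos>":
            break
    return prefix, states


def reverse_kl(p_student, p_teacher):
    # KL(student || teacher) over the next-token distribution at one state
    return sum(
        ps * math.log(ps / pt)
        for ps, pt in zip(p_student, p_teacher)
        if ps > 0
    )


def per_token_feedback(states):
    # the teacher is evaluated only on states the student itself reached
    return [reverse_kl(student_probs(s), teacher_probs(s)) for s in states]


rng = random.Random(0)
tokens, states = sample_student_trajectory(rng)
losses = per_token_feedback(states)
for state, token, loss in zip(states, tokens, losses):
    print(f"state={state} sampled={token!r} feedback={loss:.4f}")
```

in a real setup the per-token values would be averaged into a loss and backpropagated through the student only. the sketch skips the gradient step and just shows where the feedback comes from.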

[insert image here: chess.com move grading analogy from the blog]

2. where opd fits in llm training