A mastery model that survived a real curriculum
How we replaced a checklist of standards with a mastery model that teachers could act on — and that statistics could defend.
Confidentiality — Institution, datasets, and proprietary tooling are abstracted. What follows is the transferable decision logic, not internal exposure.
01 — Context
A mathematics program had adopted a mastery framing on paper, but operated as a coverage checklist in practice. Teachers marked standards “done”; the system reported green; and nobody could say what a green cell actually entitled you to believe about a student.
The brief looked like a reporting problem. It was a measurement problem wearing a reporting costume.
02 — The real decision
The decision was not “which dashboard?” It was: what is the unit of mastery, and what evidence licenses the claim that a student holds it? Everything downstream — item design, reporting, intervention — is determined by that one answer.
A mastery model is a claim about a learner. If you can’t say what would make the claim false, you are not measuring — you are decorating.
03 — My role
I led the curricular strategy and owned the measurement logic end to end — defining the constructs, designing and validating items, and specifying how the model would be read by teachers and rendered by the product. I worked between the classroom, the psychometrics, and the engineering, and translated in all three directions.
04 — Constraints
- Minutes, not hours: teachers had to interpret a result at a glance.
- Finite item volume: we could not test every sub-skill directly.
- Audit-ready: reporting had to survive a skeptical head of department.
05 — The logic used
We modeled each mastery target as a latent construct with an explicit evidence model, calibrated with Item Response Theory so that difficulty and discrimination were estimated properties of items, not opinions. Where targets were sequential, Bayesian Knowledge Tracing carried belief forward from one assessment to the next instead of resetting it at every assessment.
construct → evidence model → item bank
response → IRT calibration → ability estimate
prior × evidence → BKT posterior → mastery claim
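The last line of that pipeline can be sketched in a few lines of Python. This is a minimal illustration of the textbook four-parameter Bayesian Knowledge Tracing update, not the production model: the names `BKTParams`, `bkt_update`, and `trace` are hypothetical, and the parameter values are placeholders rather than calibrated estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BKTParams:
    # Placeholder values for illustration only, not calibrated estimates.
    p_init: float = 0.2    # prior probability the skill is already held
    p_learn: float = 0.15  # chance of acquiring the skill between opportunities
    p_slip: float = 0.1    # chance of a wrong answer despite holding the skill
    p_guess: float = 0.2   # chance of a right answer without holding the skill


def bkt_update(p_known: float, correct: bool, params: BKTParams) -> float:
    """Return the belief after one response: Bayes step, then learning step."""
    if correct:
        num = p_known * (1 - params.p_slip)
        den = num + (1 - p_known) * params.p_guess
    else:
        num = p_known * params.p_slip
        den = num + (1 - p_known) * (1 - params.p_guess)
    posterior = num / den
    # Belief carries forward, with some chance the skill was learned meanwhile.
    return posterior + (1 - posterior) * params.p_learn


def trace(responses, params: BKTParams):
    """Carry belief across a sequence of responses instead of resetting it."""
    p = params.p_init
    history = []
    for correct in responses:
        p = bkt_update(p, correct, params)
        history.append(p)
    return history


if __name__ == "__main__":
    for step, p in enumerate(trace([True, False, True, True], BKTParams()), 1):
        print(f"after response {step}: P(mastery) = {p:.3f}")
```

The point of the design is visible in `trace`: each assessment starts from the belief the previous one left behind, so a single wrong answer lowers the estimate without erasing earlier evidence.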
06 — Alternatives considered
A raw percent-correct cutoff was simplest but conflated easy and hard evidence. A pure machine-learned classifier predicted well but couldn’t be explained to a teacher or defended in an audit. We chose the model we could argue for, accepting a small cost in raw fit for a large gain in legibility and accountability.
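To make the first objection concrete, here is a small sketch of how percent-correct hides item difficulty while an IRT ability estimate does not. It assumes a two-parameter logistic (2PL) model and a simple grid-search maximum-likelihood estimate. The item parameters and response patterns are invented for illustration, and the function names are hypothetical.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a response pattern at ability theta."""
    total = 0.0
    for (a, b), correct in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total


def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of theta (illustrative only)."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Invented item parameters as (discrimination, difficulty) pairs.
easy_items = [(1.0, -2.0), (1.0, -1.5), (1.0, -1.0), (1.0, -0.5)]
hard_items = [(1.0, 0.5), (1.0, 1.0), (1.0, 1.5), (1.0, 2.0)]

# Two students, each with three of four correct, on different item sets.
responses = [1, 1, 1, 0]
percent_correct = sum(responses) / len(responses)

theta_easy = estimate_ability(easy_items, responses)
theta_hard = estimate_ability(hard_items, responses)

if __name__ == "__main__":
    print(f"percent correct for both students: {percent_correct:.0%}")
    print(f"ability estimate on easy items: {theta_easy:.2f}")
    print(f"ability estimate on hard items: {theta_hard:.2f}")
```

Both students score 75%, so a raw cutoff treats them identically. The ability estimates differ because the second student's evidence came from harder items.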
07 — The system designed
The output was not a dashboard but a small, honest object: a mastery claim, the evidence behind it, and a stated confidence — designed so a teacher could disagree with it intelligently.
This student can model a linear relationship from a table — not just complete the worksheet that contained one.
- Claim: states the construct, the conditions, and the “again”.
- Evidence: four items across two difficulties, plus one transfer task.
- Confidence: High (held out, not self-confirmed).
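One way to picture that object is as a small immutable record. The sketch below is an assumption about shape only: the class and field names (`MasteryClaim`, `EvidenceItem`, `held_out`) are hypothetical, and the evidence values are invented to match the "four items plus one transfer task" pattern described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceItem:
    item_id: str       # hypothetical identifier
    difficulty: str    # "lower" or "higher"
    is_transfer: bool  # True for the transfer task
    correct: bool


@dataclass(frozen=True)
class MasteryClaim:
    construct: str                      # what the student can do
    conditions: str                     # under which conditions
    evidence: tuple[EvidenceItem, ...]  # the items that support the claim
    confidence: str                     # e.g. "high", "moderate", "low"
    held_out: bool                      # checked on items not used to form it

    def summary(self) -> str:
        """One line a teacher can read at a glance and disagree with."""
        n_correct = sum(e.correct for e in self.evidence)
        transfer_ok = all(e.correct for e in self.evidence if e.is_transfer)
        return (
            f"{self.construct} ({self.conditions}): "
            f"{n_correct}/{len(self.evidence)} items correct, "
            f"transfer {'met' if transfer_ok else 'not met'}, "
            f"confidence {self.confidence}"
        )


# Invented example data.
claim = MasteryClaim(
    construct="models a linear relationship",
    conditions="from a table",
    evidence=(
        EvidenceItem("item-1", "lower", False, True),
        EvidenceItem("item-2", "lower", False, True),
        EvidenceItem("item-3", "higher", False, True),
        EvidenceItem("item-4", "higher", False, False),
        EvidenceItem("transfer-1", "higher", True, True),
    ),
    confidence="high",
    held_out=True,
)

if __name__ == "__main__":
    print(claim.summary())
```

Keeping the evidence attached to the claim is what lets a teacher challenge it: the record shows which items supported the statement and whether the transfer task was among them.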
08 — Abstracted artifacts
09 — Validation & quality criteria
- Items passed item-fit checks and were reviewed for construct relevance, not just difficulty.
- Mastery claims were checked against held-out performance, not against themselves.
- A claim a teacher couldn’t act on was treated as a defect, not a feature.
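The held-out check in the second criterion can be sketched as a simple cross-tabulation. This is an illustration under stated assumptions, not the validation procedure actually used: the function names, the pass mark of 0.75, and the student data are all invented.

```python
from collections import Counter


def compare_to_held_out(claims, held_out_scores, pass_mark=0.75):
    """Cross-tabulate mastery claims against performance on held-out items.

    claims: student id -> True if the model claimed mastery.
    held_out_scores: student id -> fraction correct on items that were
    not used to form the claim.
    """
    table = Counter()
    for student, claimed in claims.items():
        passed = held_out_scores[student] >= pass_mark
        table[(claimed, passed)] += 1
    return table


def claim_precision(table):
    """Share of mastery claims that the held-out items confirmed."""
    claimed = table[(True, True)] + table[(True, False)]
    return table[(True, True)] / claimed if claimed else None


# Invented example data.
claims = {"s1": True, "s2": True, "s3": False, "s4": True}
held_out_scores = {"s1": 1.0, "s2": 0.5, "s3": 0.25, "s4": 0.75}

table = compare_to_held_out(claims, held_out_scores)

if __name__ == "__main__":
    print(f"claims confirmed on held-out items: {claim_precision(table):.2f}")
    print(f"claims not confirmed: {table[(True, False)]}")
```

The unconfirmed cell, `table[(True, False)]`, is the one that matters for an audit: it counts claims the model made that fresh evidence did not support.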
10 — Reflections
The hardest work was deciding what not to measure. A smaller set of well-evidenced claims beat a complete map of guesses, and that discipline is the part that transfers to every measurement problem I have touched since.