Calibration cadence

Quarterly. After each cycle the scoring weights, per-dimension point allocations, grade boundaries, and hard-fail triggers are re-evaluated against the empirical distribution of scanned models. Each cycle produces a methodology changelog post on the index page (Lighthouse-style "what changed in v2 and why").

What we calibrate against: the empirical distribution of real-world models in our scan history. Unlike traditional scoring tools that calibrate against a labelled corpus (this model is "trustworthy", that one is not), Dokima has no such oracle — no one can authoritatively declare a model trustworthy. Instead, calibration ensures the scoring rubric produces a useful spread of grades across real-world models, that no dimension is so saturated as to be uninformative, that adversarial gaming gets caught and tightened against, and that grade boundaries align with meaningful tiers of model quality.

Minimum-viable quarterly scope

The standing quarterly recalibration runs on Dimensions 1, 4, and 6 only (serialisation safety, namespace provenance, regulatory alignment). These three are the highest-gameability dimensions: format choice, account-history pattern, and regulatory boilerplate are the surfaces where adversarial drift compounds fastest. Dimensions 2, 3, and 5 are skipped from the standing quarterly cadence and are re-evaluated on the triggers below.

Why minimum-viable rather than full-sweep-every-quarter: Dokima is built and operated by a single person on bounded weekly hours. A full seven-dimension recalibration involves pulling scan history, running statistical analysis per dimension, proposing weight changes, signing off, shipping, and version-bumping the methodology. Promising the full sweep every quarter and then missing it would be worse for credibility than committing to the minimum-viable scope and meeting it. The minimum-viable scope concentrates effort where adversarial drift is most likely while the drift-alert mechanism below provides safety-net coverage for the other four dimensions.

Automated drift alerts

Per-dimension distribution drift exceeding a documented threshold (set after the first quarterly review based on observed baseline noise) triggers manual review of that dimension regardless of whether it sits in the standing quarterly scope. The drift alert is the safety net under the minimum-viable scope above — it catches gameability drift in Dimensions 2, 3, 5 between full sweeps.

The threshold sits in the Baselines page once locked. Until the first quarterly review, the threshold is the operator's running judgment.

Full seven-dimension sweep triggers

A full sweep across all seven dimensions fires when ANY of the following becomes true:

a drift alert on any dimension fires;
OR scaling milestones make full-sweep cycles sustainable on the operator's available time.

Each full sweep produces a methodology changelog entry per the index page.

What calibration does NOT include

No labelled trusted-model corpus (none exists; Dokima exists because none exists).
No download-and-test pipeline (still pure metadata).
No machine-learning model training; no active learning. Pure descriptive statistics over scan history.

Methodology version stamping

Existing scores under prior methodology versions remain valid at their original score with a methodology version stamp (methodology_version: v1 etc. in the API response). When users view a score under an older methodology version, the report page surfaces "scored under methodology v{N}; current is v{M}" and offers a one-click rescan.

The version stamp + the weights SHA256 stamp are the load-bearing inputs for the auditability invariant: any historical score is independently re-derivable given the methodology version it was produced under.