Competitive positioning

The AI-model security space is crowded with names that all gesture at the same problem ("is this model safe to use?") but split into very different markets in practice. This page lays out where Dokima sits relative to the closest tools, so a reader can decide whether the scoring rubric on this site is the right reference for their question — and where to look if it is not.

Short version, in one paragraph: Dokima is the open-rubric per-model security score for Hugging Face models. Every weight, threshold, and grade boundary on this site is public; every score is independently reproducible from public Hugging Face metadata; coverage is the full Hugging Face corpus (~2 million models and growing) rather than a curated few dozen. The closest peer products either keep their methodology closed, restrict scope to a small curated set, focus on the supply chain rather than the model itself, or target governance-process compliance rather than per-model security. Dokima is the score; the peers are mostly different products that happen to share an adjacent vocabulary.

How Dokima compares to the closest peers

The table below positions Dokima against the four products most often mentioned in the same conversation. Each row is a deliberate trade-off, not a "Dokima wins everywhere" pitch — different buyers should pick different tools.

Product	What it actually does	Coverage	Methodology	Where Dokima differs
Dokima (this site)	Per-model trustworthiness score across seven dimensions, scored from Hugging Face metadata; embeddable badge; free public scanner; paid API.	Full Hugging Face corpus (~2M+ models, scored on demand and via the recon corpus)	Fully public; every weight and threshold lives on this page set	—
Snyk for AI	Software composition analysis extended to AI artefacts; finds vulnerable dependencies, license risk, and known-bad models in your codebase. Snyk's strength is supply-chain and dependency reachability.	Vulnerability database (known CVEs), plus a list of some known-malicious models	Closed (vendor scoring)	Snyk lives in the developer SCA workflow; Dokima lives at the model card layer. Snyk tells you what is broken in code that uses a model; Dokima tells you whether the model itself is documented, licensed, evaluated, and provenance-checked.
RiskRubric.ai	Curated per-model risk rubric run on a small set of leading commercial and open models.	~54 named models as of the last published set	Partial; methodology paper public, full evaluation harness less so	RiskRubric is hand-curated and reads more like a quarterly research note than a continuous scoring service. Dokima scores the entire long tail (where most of the supply-chain risk lives) on demand, with reproducibility guarantees.
Endor Labs	Reachability-aware SCA for traditional software, extending into AI-model dependencies and "AI-bill-of-materials" concepts. Endor's published AI-model evaluation reads unstructured artefacts (model cards, README prose) through an LLM-driven pass to extract risk signals.	Code dependency graphs the customer scans	Closed	Endor's question is "in your application, which AI components are actually reached at runtime?" Dokima's question is "for a specific model identifier, what is its standalone trustworthiness?" Complementary tools — one decides which models to investigate, the other decides whether to use a specific model. Dokima additionally commits to byte-identical reproducibility across reruns of the same commit SHA, which an LLM-driven pipeline cannot guarantee (see next section).
VerifyWise	AI-governance and AI-compliance workflow tool (questionnaire-driven, policy-driven, audit-trail-driven).	Customer-defined inventory	Customer-driven; the workflow IS the methodology	VerifyWise organises your governance process. Dokima produces a score input you can feed into that process. They sit at different layers of the stack.

The honest summary: none of these tools answers the same question Dokima answers. They each solve an adjacent problem well. A mature security programme uses several of them.

Why scan-to-scan reproducibility matters

A trust score is only useful if a reader can rely on the number. If the same model scanned twice produces two different scores, the score is noise rather than signal, and a buyer asked to defend a procurement decision against a "but it scored differently last week" challenge has nothing to point to.

Dokima commits to byte-identical reproducibility on rerun against the same commit SHA. Two scans of author/model@<sha> produce the same seven-dimension vector, the same total, the same grade, the same remediation list. We achieve this by parsing structured metadata fields with deterministic typed parsers — the model card YAML frontmatter, the file listing, the namespace overview endpoint, the commits endpoint. Where the metadata is unstructured (free-text prose in the model card body), we apply pattern-based extractors with explicit fall-throughs, never a generative or sampled component.
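The approach described above can be illustrated with a minimal sketch. It is not Dokima's implementation: the function names, the pattern table, and the output shape are all hypothetical, and only the `license` frontmatter key is taken from the Hugging Face model card convention. The sketch shows the three ingredients the paragraph names: a structured field read first, an ordered pattern-based fallback with an explicit final fall-through, and a canonical serialisation so that two scans of the same input yield identical bytes.

```python
import hashlib
import json
import re

# Hypothetical pattern table for illustration only. The order is fixed and the
# first match wins, so the same text always yields the same answer.
LICENCE_PATTERNS = [
    ("apache-2.0", re.compile(r"\bapache[- ]2\.0\b", re.IGNORECASE)),
    ("mit", re.compile(r"\bmit\s+licen[cs]e\b", re.IGNORECASE)),
]


def extract_licence(frontmatter: dict, body: str) -> str:
    """Prefer the structured field, then fixed patterns, then 'unknown'."""
    declared = frontmatter.get("license")
    if isinstance(declared, str) and declared.strip():
        return declared.strip().lower()
    for name, pattern in LICENCE_PATTERNS:
        if pattern.search(body):
            return name
    return "unknown"  # explicit fall-through, never a guess


def scan(model_ref: str, frontmatter: dict, body: str) -> bytes:
    """Return a canonical byte string for one (hypothetical) scan result."""
    result = {"model": model_ref, "licence": extract_licence(frontmatter, body)}
    # Sorted keys and fixed separators keep the serialisation stable.
    return json.dumps(result, sort_keys=True, separators=(",", ":")).encode("utf-8")


def fingerprint(payload: bytes) -> str:
    """SHA-256 of the canonical bytes, usable to compare two scans."""
    return hashlib.sha256(payload).hexdigest()
```

Running `scan` twice on the same reference, frontmatter, and body returns equal byte strings, so comparing `fingerprint` values is enough to confirm that a rerun matched. Nothing in the path is sampled or time-dependent, which is the property the reproducibility commitment relies on.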

Peer products that route unstructured artefacts through an LLM-driven evaluation pass face a structural problem here. LLM output is usually produced by sampling, so it varies from run to run unless decoding is pinned (a fixed seed or greedy decoding), and even pinned passes drift across model-version upgrades. A scoring service whose pipeline includes "LLM reads the README and decides" cannot promise a reader that the score they got today is the score they will get tomorrow, even with no upstream change to the model itself. This is not a criticism of the LLM-driven approach as a research methodology; it is a load-bearing constraint on what such an approach can promise as a stable scoring layer.

The practical consequence: when a model author embeds the Dokima badge in their model card, the score the badge displays is verifiable against this site. When a procurement reviewer cites a Dokima grade in an audit document, that grade survives independent re-derivation. When the scoring rubric itself changes (we publish the changelog), the score change is attributable to a specific weight or threshold revision rather than an opaque drift in a generative component. Reproducibility is the load-bearing property under all three of those use cases.

None of this displaces the peer tools. Dokima works in the same broad problem space as several of them, and its signal complements theirs rather than competing with it.

What Hugging Face itself does NOT score

A common implicit assumption when discussing "AI model safety" is that Hugging Face's own safety scanners cover most of the surface. They do not. Hugging Face runs a few specific automated checks (the unsafe tag, Pickle imports, Picklescan-style scanning of common file formats) and a small number of curated programmes (Protect AI / Guardian integrations, malware tagging). What HF does not produce a verdict on, by design:

Biosecurity — whether a model meaningfully advances bioweapons design or pathogen synthesis. No automated probe in the HF stack.
Election manipulation — disinformation generation capacity, deepfake potency, narrative-cascade behaviour. Not a HF surface.
Dangerous capabilities more broadly — autonomous-weapons-relevant capabilities, mass-targeting capabilities, advanced cyberattack tooling. Researcher-led evaluations exist; the platform does not gate on them.
Jailbreak resilience — how well safety training holds up under adversarial prompting. The model card may quote evaluations; HF does not run them.
Hallucination rate and calibration — base rates and confidence calibration on factual recall. Researcher domain; not platform-scored.

Dokima does not score these either. Doing so credibly requires running the model — which we deliberately do not do for free-tier scans (cost, legal exposure, and reproducibility against a static commit SHA all become harder). The Dokima scope is what is decidable from metadata: serialisation safety, documentation completeness, licence clarity, namespace provenance, presence and freshness of safety/bias evaluation citations, regulatory-alignment signals, and ecosystem context.
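As a rough sketch of how seven metadata-derived dimensions could roll up into one total, consider the following. The dimension keys paraphrase the list above, but the weights are placeholders invented for this example; the real weights and thresholds are those published in the rubric, not these. The function name `total_score` and the 0 to 100 per-dimension scale are also assumptions. Integer arithmetic is used so the result does not depend on floating-point rounding.

```python
from typing import Dict

# PLACEHOLDER weights in whole points summing to 100. These are assumptions
# for illustration and are not the published rubric values.
WEIGHTS: Dict[str, int] = {
    "serialisation_safety": 25,
    "documentation_completeness": 15,
    "licence_clarity": 15,
    "namespace_provenance": 15,
    "evaluation_citations": 10,
    "regulatory_alignment": 10,
    "ecosystem_context": 10,
}


def total_score(dimensions: Dict[str, int]) -> int:
    """Weighted total on a 0-100 scale from seven 0-100 dimension scores."""
    if set(dimensions) != set(WEIGHTS):
        raise ValueError("expected exactly the seven rubric dimensions")
    for name, value in dimensions.items():
        if not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError(f"{name} must be an integer in 0..100")
    # Iterate in the fixed order of WEIGHTS; integer maths only.
    weighted = sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)
    return weighted // 100
```

Because every input is a value decidable from metadata and every step is integer arithmetic over a fixed table, the same dimension vector always maps to the same total. A change in the total for an unchanged vector can then only come from a change to the weight table itself.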

The point of this section is to be honest about the gap: a high Dokima score is necessary-but-not-sufficient evidence that a model is safe to deploy in a specific application. A "Strong" or "Exemplary" grade tells you the artefact is well-formed, well-documented, and well-provenanced. It does not tell you the model is appropriate for your use case, your regulatory posture, or your risk tolerance. That assessment requires capability evaluations the platform layer cannot perform.

This is the same shape as a Lighthouse performance score for a website: a 100 means the bundle is well-structured and fast, not that the content is good. The Dokima score is the audit floor that frees a buyer to focus their evaluation budget on the questions that do require human or capability-based judgement.

Why publish this comparison

Two reasons. First, buyers ask: "how is this different from Snyk-for-AI?" — and a credible answer requires distinguishing the layers in the stack rather than pretending the products overlap more than they do. Second, the security-scoring market is going to consolidate over the next 18 to 36 months and the question of "which scoring services share a methodology basis?" will matter for procurement. Publishing the comparison up front lets the discussion happen on accurate ground rather than via vendor marketing.

The competitive landscape table will be republished when peer products materially change scope; the "what HF does not score" callout is expected to be stable through at least one major Hugging Face platform release.