Competitive positioning

The AI-model security space is crowded with names that all gesture at the same problem ("is this model safe to use?") but split into very different markets in practice. This page lays out where Dokima sits relative to the closest tools, so a reader can decide whether the scoring rubric on this site is the right reference for their question — and where to look if it is not.

Short version, in one paragraph: Dokima is the open-rubric per-model security score for Hugging Face models. Every weight, threshold, and grade boundary on this site is public; every score is independently reproducible from public Hugging Face metadata; coverage is the full Hugging Face corpus (~2 million models and growing) rather than a curated set of a few dozen. The closest peer products either keep their methodology closed, restrict scope to a small curated set, focus on the supply chain rather than the model itself, or target governance-process compliance rather than per-model security. Dokima is the score; the peers are mostly different products that happen to share an adjacent vocabulary.


How Dokima compares to the closest peers

The table below positions Dokima against the four products most often mentioned in the same conversation. Each row is a deliberate trade-off, not a "Dokima wins everywhere" pitch — different buyers should pick different tools.

| Product | What it actually does | Coverage | Methodology | Where Dokima differs |
|---|---|---|---|---|
| Dokima (this site) | Per-model trustworthiness score across seven dimensions, scored from Hugging Face metadata; embeddable badge; free public scanner; paid API. | Full Hugging Face corpus (~2M+ models, scored on demand and via the recon corpus) | Fully public; every weight and threshold lives on this page set | (n/a) |
| Snyk for AI | Software composition analysis extended to AI artefacts; finds vulnerable dependencies, license risk, and known-bad models in your codebase. The Snyk strength is supply-chain and dependency reachability. | Vulnerability database (known CVEs), some known-malicious model list | Closed (vendor scoring) | Snyk lives in the developer SCA workflow; Dokima lives at the model card layer. Snyk tells you what is broken in code that uses a model; Dokima tells you whether the model itself is documented, licensed, evaluated, and provenance-checked. |
| RiskRubric.ai | Curated per-model risk rubric run on a small set of leading commercial and open models. | ~54 named models last published | Partial; methodology paper public, full evaluation harness less so | RiskRubric is hand-curated and reads more like a quarterly research note than a continuous scoring service. Dokima scores the entire long tail (where most of the supply-chain risk lives) on demand, with reproducibility guarantees. |
| Endor Labs | Reachability-aware SCA for traditional software, extending into AI-model dependencies and "AI-bill-of-materials" concepts. | Code dependency graphs the customer scans | Closed | Endor's question is "in your application, which AI components are actually reached at runtime?" Dokima's question is "for a specific model identifier, what is its standalone trustworthiness?" Complementary tools: one decides which models to investigate, the other decides whether to use a specific model. |
| VerifyWise | AI-governance and AI-compliance workflow tool (questionnaire-driven, policy-driven, audit-trail-driven). | Customer-defined inventory | Customer-driven; the workflow IS the methodology | VerifyWise organises your governance process. Dokima produces a score input you can feed into that process. They sit at different layers of the stack. |

The honest summary: none of these tools answer the same question Dokima answers. They each solve an adjacent problem well. A mature security programme uses several of them.


What Hugging Face itself does NOT score

A common implicit assumption when discussing "AI model safety" is that Hugging Face's own safety scanners cover most of the surface. They do not. Hugging Face runs a few specific automated checks (the unsafe tag, Pickle imports, Picklescan-style scanning of common file formats) and a small number of curated programmes (Protect AI / Guardian integrations, malware tagging). What HF does not produce a verdict on, by design:

  • Biosecurity: whether a model meaningfully advances bioweapons design or pathogen synthesis. No automated probe in the HF stack.
  • Election manipulation: disinformation generation capacity, deepfake potency, narrative-cascade behaviour. Not an HF surface.
  • Dangerous capabilities more broadly: autonomous-weapons-relevant capabilities, mass-targeting capabilities, advanced cyberattack tooling. Researcher-led evaluations exist; the platform does not gate on them.
  • Jailbreak resilience: how well safety training holds up under adversarial prompting. The model card may quote evaluations; HF does not run them.
  • Hallucination rate and calibration: base rates and confidence calibration on factual recall. Researcher domain; not platform-scored.

Dokima does not score these either. Doing so credibly requires running the model — which we deliberately do not do for free-tier scans (cost, legal exposure, and reproducibility against a static commit SHA all become harder). The Dokima scope is what is decidable from metadata: serialisation safety, documentation completeness, licence clarity, namespace provenance, presence and freshness of safety/bias evaluation citations, regulatory-alignment signals, and ecosystem context.
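
To make "decidable from metadata" concrete, here is a minimal sketch of one such check: classifying a repository's file listing by extension, without downloading or executing anything. The function name, the extension sets, and the output keys are assumptions for illustration only; they are not Dokima's published rules, and an extension is only a hint about the underlying format.

```python
from pathlib import PurePosixPath

# Assumption: illustrative extension sets, not Dokima's published lists.
# These formats are commonly (not always) pickle-based.
POSSIBLY_PICKLE = {".pkl", ".pickle", ".pt", ".pth", ".bin", ".ckpt", ".joblib"}
# safetensors is designed to store tensors without executable payloads.
NON_PICKLE = {".safetensors"}


def classify_weight_files(filenames):
    """Split a repository file listing into possibly-pickle and non-pickle weight files.

    Files with any other extension (README.md, config.json, ...) are ignored.
    """
    possibly_pickle, non_pickle = [], []
    for name in filenames:
        ext = PurePosixPath(name).suffix.lower()
        if ext in POSSIBLY_PICKLE:
            possibly_pickle.append(name)
        elif ext in NON_PICKLE:
            non_pickle.append(name)
    return {
        "possibly_pickle": sorted(possibly_pickle),
        "non_pickle": sorted(non_pickle),
    }


if __name__ == "__main__":
    listing = ["README.md", "model.safetensors", "pytorch_model.bin"]
    print(classify_weight_files(listing))
```

The input is just a list of file names, so the same listing at the same commit always produces the same result, which is the property that makes a metadata-only check reproducible.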

The point of this section is to be honest about the gap: a high Dokima score is necessary-but-not-sufficient evidence that a model is safe to deploy in a specific application. A "Strong" or "Exemplary" grade tells you the artefact is well-formed, well-documented, and well-provenanced. It does not tell you the model is appropriate for your use case, your regulatory posture, or your risk tolerance. That assessment requires capability evaluations the platform layer cannot perform.

This is the same shape as a Lighthouse performance score for a website: a 100 means the bundle is well-structured and fast, not that the content is good. The Dokima score is the audit floor that frees a buyer to focus their evaluation budget on the questions that do require human or capability-based judgement.
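
A short sketch can show why a grade cannot say anything about capability: if the overall score is a weighted sum of the seven metadata-derived dimensions, nothing else enters it. The weights and grade floors below are hypothetical placeholders, not the published Dokima values; the dimension keys paraphrase the scope list above, and only the "Strong" and "Exemplary" labels come from this page.

```python
# Hypothetical weights in percent (they sum to 100). NOT Dokima's published values.
WEIGHTS = {
    "serialisation_safety": 25,
    "documentation_completeness": 15,
    "licence_clarity": 15,
    "namespace_provenance": 15,
    "evaluation_citations": 10,
    "regulatory_alignment": 10,
    "ecosystem_context": 10,
}

# Hypothetical grade floors, highest first. Only the labels come from this page.
GRADE_FLOORS = [(90, "Exemplary"), (75, "Strong")]


def overall_score(dimension_scores):
    """Weighted average of per-dimension scores, each expected in the range 0..100."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS) / 100


def grade(score):
    """Map a score to the first grade whose floor it meets."""
    for floor, label in GRADE_FLOORS:
        if score >= floor:
            return label
    return "below Strong"  # placeholder label for everything under the lowest floor


if __name__ == "__main__":
    scores = {dimension: 80 for dimension in WEIGHTS}
    total = overall_score(scores)
    print(total, grade(total))
```

Every input to `overall_score` is a metadata-derived dimension, so a model can reach the top grade in this sketch regardless of how it behaves when run. That is the sense in which the score is a floor rather than a deployment approval.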


Why publish this comparison

Two reasons. First, buyers ask: "how is this different from Snyk-for-AI?" — and a credible answer requires distinguishing the layers in the stack rather than pretending the products overlap more than they do. Second, the security-scoring market is going to consolidate over the next 18 to 36 months and the question of "which scoring services share a methodology basis?" will matter for procurement. Publishing the comparison up front lets the discussion happen on accurate ground rather than via vendor marketing.

The competitive landscape table will be republished when peer products materially change scope; the "what HF does not score" callout is expected to be stable through at least one major Hugging Face platform release.