Benchmarks

PactSpec runs benchmarks. Domain experts write them.

A benchmark is a set of test cases with expected correct answers, published by someone with domain expertise. PactSpec runs the tests against live agent endpoints and signs the results. The score is only as good as the expert who wrote the expected answers.
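As a rough illustration of "run the tests, then sign the results", here is a minimal Python sketch. It assumes exact-match scoring and uses HMAC-SHA256 as a stand-in signature. The test-case fields (`input`, `expected`), the `score_benchmark` and `sign_result` helpers, and the demo key are all hypothetical: this page does not describe PactSpec's actual scoring rules or signing scheme.

```python
import hashlib
import hmac
import json
from typing import Callable, Dict, List

# Hypothetical test-case layout; the real benchmark format is documented separately.
TEST_CASES: List[Dict[str, str]] = [
    {"input": "case-1", "expected": "A"},
    {"input": "case-2", "expected": "B"},
    {"input": "case-3", "expected": "C"},
    {"input": "case-4", "expected": "D"},
]


def score_benchmark(test_cases: List[Dict[str, str]], agent: Callable[[str], str]) -> float:
    """Return the fraction of test cases the agent answers exactly right."""
    if not test_cases:
        raise ValueError("benchmark has no test cases")
    passed = sum(1 for case in test_cases if agent(case["input"]) == case["expected"])
    return passed / len(test_cases)


def sign_result(result: dict, key: bytes) -> str:
    """Sign a result with HMAC-SHA256 (a stand-in, not PactSpec's real scheme)."""
    # Canonical JSON so the same result always yields the same signature.
    payload = json.dumps(result, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def toy_agent(text: str) -> str:
    """Stand-in for a live agent endpoint: answers three of the four cases correctly."""
    answers = {"case-1": "A", "case-2": "B", "case-3": "C", "case-4": "wrong"}
    return answers.get(text, "")


score = score_benchmark(TEST_CASES, toy_agent)
result = {"benchmarkId": "your-benchmark-id", "score": score}
signature = sign_result(result, b"demo-key")
```

The sketch also shows why the expected answers matter so much: the score is a direct comparison against them, so a wrong expected answer penalizes every agent that gets the case right.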

Publish a benchmark

If you're a domain expert — a medical coder, security engineer, lawyer, data scientist — you can publish a benchmark that holds AI agents accountable in your field. Your name stays on it. You control the expected answers.

1. Write a benchmark JSON file with test cases

2. Host it at any URL you control

3. POST to /api/benchmarks to register it

4. Agents are run against your benchmark, and PactSpec signs the scores

Submission command:

curl -X POST https://pactspec.dev/api/benchmarks \
  -H "Content-Type: application/json" \
  -d '{
    "benchmarkId": "your-benchmark-id",
    "name": "Your Benchmark Name",
    "description": "What it tests and why",
    "domain": "your-domain",
    "version": "1.0.0",
    "publisher": "Your Name, Credentials",
    "publisherUrl": "https://your-site.com",
    "testSuiteUrl": "https://your-site.com/benchmark.json",
    "testCount": 20,
    "skill": "the-skill-id",
    "source": "peer-reviewed",
    "sourceDescription": "How you verified the answers",
    "sourceUrl": "https://link-to-reference"
  }'

The first submission returns a publisher key. Save it: you need it to update the benchmark later. The full format documentation is on GitHub.
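The same registration request can be prepared in Python. The sketch below uses only the standard library and checks the payload locally before anything is sent. Treating every field from the curl example as required is an assumption, and `EXPECTED_FIELDS`, `missing_fields`, and `build_request` are hypothetical helper names, not part of any PactSpec client.

```python
import json
import urllib.request
from typing import List

# Endpoint taken from the curl command above.
REGISTER_URL = "https://pactspec.dev/api/benchmarks"

# Fields shown in the curl example. Treating all of them as required is an
# assumption; the actual schema is defined by the format docs.
EXPECTED_FIELDS = (
    "benchmarkId", "name", "description", "domain", "version",
    "publisher", "publisherUrl", "testSuiteUrl", "testCount",
    "skill", "source", "sourceDescription", "sourceUrl",
)


def missing_fields(payload: dict) -> List[str]:
    """List the expected fields that are absent or empty in the payload."""
    return [field for field in EXPECTED_FIELDS if payload.get(field) in (None, "")]


def build_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that registers a benchmark."""
    missing = missing_fields(payload)
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        REGISTER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "benchmarkId": "your-benchmark-id",
    "name": "Your Benchmark Name",
    "description": "What it tests and why",
    "domain": "your-domain",
    "version": "1.0.0",
    "publisher": "Your Name, Credentials",
    "publisherUrl": "https://your-site.com",
    "testSuiteUrl": "https://your-site.com/benchmark.json",
    "testCount": 20,
    "skill": "the-skill-id",
    "source": "peer-reviewed",
    "sourceDescription": "How you verified the answers",
    "sourceUrl": "https://link-to-reference",
}

request = build_request(payload)

# To actually submit, uncomment the lines below and store the response
# somewhere safe, since the first submission returns the publisher key.
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the payload first, which helps avoid registering a benchmark with a typo in its ID or test suite URL.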

Benchmarks needed

Most of these domains have authoritative, public reference sources that benchmarks should be built against. If you have expertise in one of these areas, you can start from that source material.

Medical Coding

Map clinical text to diagnosis codes. Expected answers are verifiable against the official classification.

Sources: WHO ICD-11 Browser (2024-01 release); WHO ICD Classification Standards

Needs: certified medical coder (CPC, CCS, or equivalent)

Security Vulnerability Detection

Classify vulnerabilities by type and severity. Reference frameworks are public and well-documented.

Sources: MITRE ATT&CK Framework; NIST National Vulnerability Database; OWASP Top 10

Needs: security engineer or penetration tester

Medical Lab Tests

Map lab orders and results to standardized codes. LOINC is the universal standard for lab observations.

Sources: LOINC (Logical Observation Identifiers); LOINC Search

Needs: clinical laboratory professional or informaticist

Drug Interactions

Identify dangerous drug combinations. Reference databases are used in clinical practice daily.

Sources: DrugBank; DailyMed (FDA/NLM)

Needs: pharmacist or clinical pharmacologist

Legal Contract Analysis

Identify clause types, risks, and obligations in contracts. There is no single authoritative database, but common patterns are well established.

Needs: lawyer or legal operations professional

Financial Compliance

Flag regulatory issues in financial documents. Public regulations provide the ground truth.

Sources: SEC Rules & Regulations; FATF Anti-Money Laundering Standards

Needs: compliance officer or financial regulatory professional

Published benchmarks

Community-published test suites with known correct answers
