Benchmarks
PactSpec runs benchmarks. Domain experts write them.
A benchmark is a set of test cases with expected correct answers, published by someone with domain expertise. PactSpec runs the tests against live agent endpoints and signs the results. The score is only as good as the expert who wrote the expected answers.
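As a rough illustration of the model described above, the sketch below scores an agent against a list of test cases with expected answers. The `BenchmarkCase` shape, the `score` function, and the exact-match comparison are assumptions made for illustration; PactSpec's actual test-suite format and scoring rules may differ, and result signing is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchmarkCase:
    """Hypothetical test case: an input plus the expert's expected answer."""
    input: str
    expected: str


def score(agent: Callable[[str], str], cases: list[BenchmarkCase]) -> float:
    """Return the fraction of cases where the agent's answer matches exactly."""
    if not cases:
        raise ValueError("a benchmark needs at least one test case")
    passed = sum(1 for case in cases if agent(case.input).strip() == case.expected)
    return passed / len(cases)


def toy_agent(text: str) -> str:
    """Toy stand-in for a live agent endpoint: upper-cases its input."""
    return text.upper()


# Placeholder cases, not real benchmark data.
cases = [
    BenchmarkCase("a", "A"),
    BenchmarkCase("b", "B"),
    BenchmarkCase("c", "x"),  # deliberately wrong for the toy agent
]
print(score(toy_agent, cases))
```

Whatever comparison is used, the score only measures agreement with the expected answers, which is why the quality of those answers matters more than the runner.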
Publish a benchmark
If you're a domain expert (a medical coder, security engineer, lawyer, or data scientist, for example), you can publish a benchmark that holds AI agents accountable in your field. Your name stays on it. You control the expected answers.
curl -X POST https://pactspec.dev/api/benchmarks \
  -H "Content-Type: application/json" \
  -d '{
    "benchmarkId": "your-benchmark-id",
    "name": "Your Benchmark Name",
    "description": "What it tests and why",
    "domain": "your-domain",
    "version": "1.0.0",
    "publisher": "Your Name, Credentials",
    "publisherUrl": "https://your-site.com",
    "testSuiteUrl": "https://your-site.com/benchmark.json",
    "testCount": 20,
    "skill": "the-skill-id",
    "source": "peer-reviewed",
    "sourceDescription": "How you verified the answers",
    "sourceUrl": "https://link-to-reference"
  }'
The first submission returns a publisher key. Save it: you need it to update the benchmark later. Full format docs are on GitHub.
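The same submission can be prepared from a script. The sketch below uses only the Python standard library to build the POST request from the curl example, checking that every field shown there is present before anything is sent. The `build_request` helper and the rule that all fields are required are assumptions for illustration. The response format, including how the publisher key is returned, is not described here, so the sketch stops short of sending the request or parsing a reply.

```python
import json
import urllib.request

API_URL = "https://pactspec.dev/api/benchmarks"  # endpoint from the curl example

# Field names are taken from the curl example. Treating all of them as
# required is an assumption, not a documented rule.
FIELDS = (
    "benchmarkId", "name", "description", "domain", "version",
    "publisher", "publisherUrl", "testSuiteUrl", "testCount",
    "skill", "source", "sourceDescription", "sourceUrl",
)


def build_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) the benchmark submission request."""
    missing = [field for field in FIELDS if field not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values copied from the curl example.
payload = {
    "benchmarkId": "your-benchmark-id",
    "name": "Your Benchmark Name",
    "description": "What it tests and why",
    "domain": "your-domain",
    "version": "1.0.0",
    "publisher": "Your Name, Credentials",
    "publisherUrl": "https://your-site.com",
    "testSuiteUrl": "https://your-site.com/benchmark.json",
    "testCount": 20,
    "skill": "the-skill-id",
    "source": "peer-reviewed",
    "sourceDescription": "How you verified the answers",
    "sourceUrl": "https://link-to-reference",
}
request = build_request(payload)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Checking the payload locally catches a forgotten field before the first submission, which matters because that first response is where the publisher key is issued.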
Benchmarks needed
Most of these domains have authoritative public reference sources that benchmarks should be built against. If you have expertise in any of these areas, you can start from that source material.
Map clinical text to diagnosis codes. Expected answers verifiable against the official classification.
Needs: certified medical coder (CPC, CCS, or equivalent)
Classify vulnerabilities by type and severity. Reference frameworks are public and well-documented.
Needs: security engineer or penetration tester
Map lab orders and results to standardized codes. LOINC is the universal standard for lab observations.
Needs: clinical laboratory professional or informaticist
Identify dangerous drug combinations. Reference databases are used in clinical practice daily.
Needs: pharmacist or clinical pharmacologist
Identify clause types, risks, and obligations in contracts. No single authoritative database, but common patterns are well-established.
Needs: lawyer or legal operations professional
Flag regulatory issues in financial documents. Public regulations provide the ground truth.
Needs: compliance officer or financial regulatory professional
Published benchmarks
Community-published test suites with known correct answers