Claude Code Skills Benchmark 2026 (v1): What 5 Production Skills Tell Us
Ampliflow
Advanced AI frontier lab and business growth agency. Helping UK businesses deploy agentic AI systems.

The Claude Code skills benchmark found an 80% pass rate across 20 v1 evaluations, with production hardening, React best practices, and surgical coding guidance scoring highest. The pilot tested 5 skills against 4 practical tasks using deterministic scorers. The useful signal was not just which skills passed. It was where a skill's discipline helped, where it overreached, and where task fit mattered more than confidence.
Last updated: May 2026 · v1 pilot: 5 skills x 4 tasks = 20 evaluations · Methodology fully published; v2 with native CLI execution + 20 skills x 8 tasks coming June 2026
TL;DR
- v1 scored 16 passes from 20 evaluations: an 80% pass rate.
- Total quality score was 54/60, or 2.7 out of 3 on average.
- `harden`, `react-best-practices`, and `karpathy-guidelines` each scored 4/4 passes.
- `tdd` produced strong code and tests, but missed format-specific requirements on documentation and security review tasks.
- v1 outputs were generated by Codex 5.5 with medium reasoning because sandbox `EPERM` errors prevented native `claude.exe` execution. The scorer logic is still the deterministic scaffold logic.
Why We Built This Benchmark
Most writing about Claude Code skills is still listicle-shaped.
It tells you which skills sound useful.
It rarely tells you whether a skill helps on a real task.
That gap matters for UK businesses adopting agentic coding. Skills are attractive because they promise reusable engineering judgement. A team can encode how it wants tests written, how it handles production risks, how it reviews React code, or how it writes documentation.
But a skill is not magic.
It is a steering layer.
The question is whether that steering layer improves the output for a specific job.
So we built a small benchmark with deliberately ordinary tasks:
- refactor a TypeScript utility
- write a Vitest test
- document a Next.js API route
- spot a security flaw
These are not showpiece tasks. They are the kind of small jobs that fill real delivery weeks.
The benchmark also includes mismatches on purpose.
For example, m25-content-rules is an editorial and SEO discipline. It should help API documentation more than it helps a unit test. If a benchmark only tests perfect skill-task matches, it teaches the wrong lesson. Real teams will try reusable skills in messy places.
This pilot asks a narrower question:
Does the discipline encoded in a skill help the task in front of it?
That is different from asking whether the model is smart enough to solve the task with no skill. It is also different from asking whether every skill should be used everywhere.
What We Tested
The scaffold lives at tests/data/skills-benchmark/.
It uses 5 skills and 4 tasks.
| Skill | Refactor TypeScript | Write Vitest Test | Document API Route | Spot Security Flaw |
|---|---|---|---|---|
| `tdd` | Tests whether red-green-refactor discipline preserves observable behaviour while cleaning types. | Natural fit: asks for a failing-behaviour mindset and assertions. | Stretch fit: TDD may capture behaviour but not polished docs. | Stretch fit: TDD may suggest regression tests but can miss review format. |
| `harden` | Good fit: typed inputs and edge cases are resilience work. | Good fit: tests should cover risky boundaries. | Good fit: production docs should name success and error cases. | Natural fit: unsafe input, secrets, and explicit mitigation are core hardening concerns. |
| `react-best-practices` | Partial fit: TypeScript discipline transfers even without a component. | Partial fit: test quality transfers, but the task is not React-specific. | Good fit: Next.js route documentation is close to its domain. | Good fit: route handler review is close to production Next.js work. |
| `karpathy-guidelines` | Good fit: smallest correct typed change. | Good fit: minimal verifiable assertions. | Good fit: concise docs with success criteria. | Good fit: direct flaw, direct risk, direct fix. |
| `m25-content-rules` | Weak fit: editorial clarity can advise, but not reliably produce corrected code. | Weak fit: scenario naming helps, executable test generation does not follow from the discipline. | Natural fit: clear API explanation is content work. | Partial fit: it can explain trust and leak risk, but it is not a deep security skill. |
The fixtures were intentionally small.
Small fixtures make the scoring clearer. They also reduce the chance that a large codebase context hides what the skill contributed.
Methodology
Each task had a fixture file.
The refactor task used refactor-input.ts. The pass criteria were:
- remove unused `path`, `fs`, and `react` imports
- type the `user` and `users` parameters
- avoid `any`
- preserve `formatUser` and `activeUsers` behaviour
The test task used discount.ts. The pass criteria were:
- return a Vitest test
- include at least 2 assertions
- cover the VIP plus `SAVE10` case
- cover zero-or-negative subtotal behaviour
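As a rough illustration of what pattern-based checks for these four criteria might look like, here is a minimal sketch. The regexes, the `Check` type, and `missedChecks` are assumptions made for this example and are not the contents of `run-benchmark.mjs`. The candidate output, including the `applyDiscount` name and signature, is also invented, since the `discount.ts` fixture is not reproduced here.

```typescript
// Hypothetical pattern checks for the "write a Vitest test" task.
// The regexes are illustrative assumptions, not the real scorer rules.
type Check = { name: string; matches: (output: string) => boolean };

const vitestChecks: Check[] = [
  // Looks like a Vitest file: imports from "vitest".
  { name: "is-vitest-test", matches: (o) => /from\s+["']vitest["']/.test(o) },
  // Contains at least two expect(...) calls.
  { name: "two-assertions", matches: (o) => (o.match(/\bexpect\s*\(/g) ?? []).length >= 2 },
  // Mentions both the VIP case and the SAVE10 code.
  { name: "vip-save10", matches: (o) => /\bVIP\b/i.test(o) && /\bSAVE10\b/.test(o) },
  // Mentions a zero or negative subtotal.
  { name: "zero-or-negative", matches: (o) => /\b(?:zero|negative)\b/i.test(o) },
];

// Returns the names of the checks an output failed to satisfy.
function missedChecks(output: string, checks: Check[]): string[] {
  return checks.filter((c) => !c.matches(output)).map((c) => c.name);
}

// Hypothetical candidate output; applyDiscount and its signature are invented.
const candidate = `
import { describe, expect, it } from "vitest";
import { applyDiscount } from "./discount";

describe("applyDiscount", () => {
  it("applies SAVE10 for a VIP customer", () => {
    expect(applyDiscount(100, "SAVE10", true)).toBeLessThan(100);
  });
  it("handles a zero or negative subtotal", () => {
    expect(applyDiscount(-5, "SAVE10", true)).toBe(0);
  });
});
`;

const missedByCandidate = missedChecks(candidate, vitestChecks);
const missedByProse = missedChecks(
  "Name each scenario clearly before writing assertions.",
  vitestChecks,
);
```

Under these assumed patterns the candidate file misses nothing, while the plain prose sentence misses all four checks. Returning the names of missed checks, rather than only a count, makes it easier to see why an output fell short.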
The API documentation task used newsletter-status-route.ts. The pass criteria were:
- name `/api/newsletter/status`
- name the `GET` method
- document the `email` parameter
- include a success response
- include a `400` error
The security task used report-route-insecure.ts. The pass criteria were:
- flag line 5, the hardcoded API key, or line 6, the command injection issue
- explain the risk
- recommend a concrete fix
Scoring came from the deterministic auto-scorers in tests/data/skills-benchmark/scripts/run-benchmark.mjs.
Those scorers are regex and pattern based. Each output receives:
- `pass`: true or false
- `quality`: 0, 1, 2, or 3
A quality score of 3 means the output matched every scorer check. A score of 2 means it missed one check. A score of 1 means it missed more than one check but still matched at least two. A score of 0 means it was mostly the wrong shape for that task.
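The rubric above can be sketched as a small mapping from check counts to a score. This is one reading of the rubric, not the published scorer: `scoreFromChecks` is a hypothetical name, the rules are applied in order from 3 down to 0, and `pass` is assumed to require every check, which matches the v1 results, where every passing cell is q3.

```typescript
// A minimal sketch of the quality rubric, under stated assumptions.
// scoreFromChecks is a hypothetical helper, not part of run-benchmark.mjs.
type Score = { pass: boolean; quality: 0 | 1 | 2 | 3 };

function scoreFromChecks(matched: number, total: number): Score {
  // Assumption: each task has at least three checks, as in the v1 criteria lists.
  if (total < 3 || matched < 0 || matched > total) {
    throw new RangeError("expected 0 <= matched <= total and total >= 3");
  }
  const missed = total - matched;
  let quality: 0 | 1 | 2 | 3;
  if (missed === 0) {
    quality = 3; // matched every check
  } else if (missed === 1) {
    quality = 2; // missed exactly one check
  } else if (matched >= 2) {
    quality = 1; // missed more than one, but matched at least two
  } else {
    quality = 0; // mostly the wrong shape for the task
  }
  // Assumption: a pass requires every check to match.
  return { pass: missed === 0, quality };
}

const allMatched = scoreFromChecks(5, 5); // quality 3, pass
const oneMissed = scoreFromChecks(4, 5); // quality 2, fail
const twoMatched = scoreFromChecks(2, 5); // quality 1, fail
const oneMatched = scoreFromChecks(1, 5); // quality 0, fail
```

Checking the "missed one" rule before the "matched at least two" rule matters for three-check tasks, where matching two checks is the same as missing one and should score 2 rather than 1.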
The important caveat:
v1 did not execute native Claude Code CLI runs.
The scaffold attempted to spawn claude.exe, but the sandbox returned EPERM errors around process hooks and shell spawning. Rather than present those failed rows as benchmark data, the outputs were generated by Codex 5.5 with medium reasoning.
The scorer logic stayed the same.
This makes v1 a skill-discipline pilot, not a native CLI execution benchmark.
That is still useful.
It isolates whether a discipline points the work in the right direction. v2 will add native CLI execution, 20 skills, 8 tasks, and a human review layer for quality factors that regex cannot measure.
One more choice matters.
We kept the mismatch outputs honest.
When a skill did not fit the task, the output was allowed to be partial instead of forced into success. That is why the editorial skill produced scenario guidance for the Vitest task rather than a polished test file. This keeps the benchmark closer to how teams actually use skills during adoption. The useful question is not whether every prompt can be coerced into passing. The useful question is whether the selected discipline is the right one to apply.
Results
The headline result:
16 of 20 evaluations passed.
Total quality score was 54 out of 60.
Average quality was 2.7 out of 3.
| Skill | Refactor TypeScript | Write Vitest Test | Document API Route | Spot Security Flaw |
|---|---|---|---|---|
| `tdd` | ✅ q3 | ✅ q3 | ❌ q2 | ❌ q2 |
| `harden` | ✅ q3 | ✅ q3 | ✅ q3 | ✅ q3 |
| `react-best-practices` | ✅ q3 | ✅ q3 | ✅ q3 | ✅ q3 |
| `karpathy-guidelines` | ✅ q3 | ✅ q3 | ✅ q3 | ✅ q3 |
| `m25-content-rules` | ❌ q2 | ❌ q0 | ✅ q3 | ✅ q3 |
Per-skill summary:
| Skill | Passes | Pass Rate | Quality Points | Average Quality |
|---|---|---|---|---|
| `tdd` | 2/4 | 50% | 10/12 | 2.5 |
| `harden` | 4/4 | 100% | 12/12 | 3.0 |
| `react-best-practices` | 4/4 | 100% | 12/12 | 3.0 |
| `karpathy-guidelines` | 4/4 | 100% | 12/12 | 3.0 |
| `m25-content-rules` | 2/4 | 50% | 8/12 | 2.0 |
Per-task summary:
| Task | Passes | Quality Points |
|---|---|---|
| Refactor TypeScript | 4/5 | 14/15 |
| Write Vitest Test | 4/5 | 12/15 |
| Document API Route | 4/5 | 14/15 |
| Spot Security Flaw | 4/5 | 14/15 |
The task-level pass rate was evenly distributed.
Each task passed 4 times from 5.
The difference was not task difficulty in this pilot. It was skill-task fit.
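The headline figures can be reproduced from the results grid with a few lines of arithmetic. The sketch below copies the pass and quality values from the v1 table; the variable names are illustrative only.

```typescript
// Results grid copied from the v1 table: [pass, quality] per task, in column
// order (refactor, Vitest test, API docs, security review).
type Cell = readonly [boolean, number];

const grid: Record<string, readonly Cell[]> = {
  "tdd": [[true, 3], [true, 3], [false, 2], [false, 2]],
  "harden": [[true, 3], [true, 3], [true, 3], [true, 3]],
  "react-best-practices": [[true, 3], [true, 3], [true, 3], [true, 3]],
  "karpathy-guidelines": [[true, 3], [true, 3], [true, 3], [true, 3]],
  "m25-content-rules": [[false, 2], [false, 0], [true, 3], [true, 3]],
};

let evaluations = 0;
let passes = 0;
let qualityTotal = 0;
const taskPasses = [0, 0, 0, 0];
const taskQuality = [0, 0, 0, 0];

for (const skill of Object.keys(grid)) {
  grid[skill].forEach(([pass, quality], taskIndex) => {
    evaluations += 1;
    qualityTotal += quality;
    taskQuality[taskIndex] += quality;
    if (pass) {
      passes += 1;
      taskPasses[taskIndex] += 1;
    }
  });
}

const passRate = passes / evaluations; // 16 / 20 = 0.8
const averageQuality = qualityTotal / evaluations; // 54 / 20 = 2.7
const maxQuality = evaluations * 3; // 60
```

The per-task arrays come out as 4 passes each and 14, 12, 14, 14 quality points, matching the per-task summary table.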
Surprising Findings
1. The best general skill was not the loudest one
karpathy-guidelines scored 4/4.
That skill is simple: make the smallest correct change, keep success criteria visible, avoid premature abstraction, and verify the result.
In this benchmark, that was enough.
It passed the refactor, test, documentation, and security tasks with q3 each.
The lesson for businesses is practical. A skill does not need to encode a huge process to be useful. A short preference for surgical, verifiable work can perform well across varied tasks.
2. Hardening transferred cleanly across all four tasks
harden also scored 4/4.
That was expected on the security task. The more useful finding was that the same discipline helped the other three tasks.
For refactoring, it encouraged explicit types and no loose input handling.
For testing, it pushed boundary cases.
For docs, it made the error response visible instead of treating documentation as a happy-path summary.
Production resilience is not only a security habit. It is a way of seeing missing cases.
3. TDD was strong, but format-sensitive
tdd scored 10/12 quality points.
It passed the refactor and Vitest tasks.
It failed the documentation task because the output captured the endpoint behaviour but did not frame email as a query parameter. It failed the security task because it described the regression target and fix, but omitted the required line number.
That is not a bad TDD result.
It is a useful reminder. TDD is excellent when the target is observable behaviour. It can be weaker when the task is a formal review or documentation artefact with required labels.
4. Editorial rules worked where writing was the job
m25-content-rules passed API documentation with q3.
It also passed the security task by naming the hardcoded API key, explaining credential leak risk, and recommending environment variables.
But it failed the refactor and Vitest tasks.
The refactor output was useful advice, not corrected TypeScript. The test output was scenario naming guidance, not an executable Vitest file.
That is the benchmark doing its job.
An editorial skill can make technical communication clearer. It should not be used as the main steering layer for code generation.
5. Regex quality can over-reward near misses
The m25-content-rules refactor received q2 even though it did not return corrected code.
Why?
It mentioned enough concepts to satisfy 4 of 5 scorer checks.
This is a limitation of deterministic scoring. It is useful for repeatability, but it cannot fully judge whether the artefact is operationally useful.
That is why v2 needs human review in addition to regex scoring.
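One way a near miss like this can arise, sketched under assumptions: checks that look for the absence of a bad pattern are trivially satisfied by prose that contains no code at all. The five checks below are hypothetical and may differ from the real rules in `run-benchmark.mjs`. The `User` type in the code sample is invented for illustration.

```typescript
// Hypothetical refactor checks, illustrating how advice-only prose can
// satisfy most pattern checks. These are not the real scorer rules.
const refactorChecks: Array<(output: string) => boolean> = [
  // Absence check: no import from path, fs, or react remains.
  (o) => !/import\s+.*from\s+["'](?:path|fs|react)["']/.test(o),
  // Absence check: no ": any" annotation remains.
  (o) => !/:\s*any\b/.test(o),
  // Presence checks: both function names still appear.
  (o) => /\bformatUser\b/.test(o),
  (o) => /\bactiveUsers\b/.test(o),
  // Presence check: the user parameter carries a type annotation.
  (o) => /\(\s*user\s*:\s*\w+/.test(o),
];

function countMatched(output: string): number {
  return refactorChecks.filter((check) => check(output)).length;
}

// Advice-only output: useful guidance, but no corrected code.
const advice =
  "Remove the unused imports, give user and users explicit types, " +
  "and keep formatUser and activeUsers behaving as before.";

// Hypothetical corrected code; the User type is invented for this example.
const code = [
  "type User = { name: string; active: boolean };",
  "export function formatUser(user: User): string { return user.name; }",
  "export function activeUsers(users: User[]): User[] {",
  "  return users.filter((u) => u.active);",
  "}",
].join("\n");

const adviceMatched = countMatched(advice);
const codeMatched = countMatched(code);
```

Under these assumed checks the advice matches 4 of 5 and the code matches 5 of 5, even though only the second is a usable artefact. A human reviewer, or a check that requires the output to compile, would separate the two more clearly.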
Limitations
This is a v1 pilot.
It should be read as first-party benchmark data, not a final claim about all Claude Code skills.
Main limitations:
- Outputs were generated by Codex 5.5 with medium reasoning, not native Claude Code CLI execution.
- The native CLI scaffold was blocked by sandbox `EPERM` errors.
- The scorers are regex and pattern based, so they are coarse.
- Quality is rule-based, not human-graded.
- The sample is small: 5 skills x 4 tasks.
- The fixtures are small and do not measure large-repo navigation.
- The benchmark does not measure latency, cost, or repeated-run variance.
Those limits are not footnotes.
They shape what the data can and cannot say.
v1 can say which skill disciplines matched these four task shapes.
v1 cannot say how native Claude Code skill execution will perform across a large codebase under production constraints.
v2 will address that with native CLI execution, more skills, more tasks, and human scoring alongside deterministic checks.
What This Means For UK Businesses
The practical takeaway is simple:
Do not build a giant skill library first.
Start with the tasks your team repeats every week.
Then write skills for the behaviours you want repeated inside those tasks.
If your team ships API routes, a hardening skill is likely worth creating early. It pushes validation, explicit errors, unsafe input handling, and edge cases into ordinary work.
If your team does a lot of React and Next.js, a React best-practices skill can help standardise route and component decisions. In this pilot, it transferred well beyond component code because the fixture was still close to the Next.js domain.
If your team struggles with bloated changes, a surgical-guidelines skill may be the highest-leverage first skill. It is easier to govern than a huge process document, and it nudges the model toward small verifiable work.
If your team writes a lot of documentation, an editorial skill is useful.
But do not ask that same editorial skill to be your test engineer.
Skills should be routed.
That means you need a small taxonomy:
- coding discipline skills for implementation work
- hardening skills for production risk
- framework skills for stack-specific decisions
- testing skills for behavioural coverage
- editorial skills for documentation and content
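As a sketch of what routing could look like in practice, the taxonomy above can be written as a small lookup table. The category and task-family names, the `skillsFor` helper, and the assignment of the five v1 skills to categories are our assumptions for illustration, not part of the benchmark scaffold.

```typescript
// A hypothetical routing table based on the taxonomy above.
type SkillCategory =
  | "coding-discipline"
  | "hardening"
  | "framework"
  | "testing"
  | "editorial";

type TaskFamily =
  | "implementation"
  | "production-risk"
  | "stack-decisions"
  | "behavioural-coverage"
  | "documentation";

// Which category of skill should steer each family of task.
const categoryForTask: Record<TaskFamily, SkillCategory> = {
  "implementation": "coding-discipline",
  "production-risk": "hardening",
  "stack-decisions": "framework",
  "behavioural-coverage": "testing",
  "documentation": "editorial",
};

// Assumed categorisation of the five v1 skills.
const skillCategory: Record<string, SkillCategory> = {
  "karpathy-guidelines": "coding-discipline",
  "harden": "hardening",
  "react-best-practices": "framework",
  "tdd": "testing",
  "m25-content-rules": "editorial",
};

// Returns the skills whose category matches the task family.
function skillsFor(task: TaskFamily): string[] {
  const wanted = categoryForTask[task];
  return Object.keys(skillCategory).filter((name) => skillCategory[name] === wanted);
}

const docsSkills = skillsFor("documentation");
const riskSkills = skillsFor("production-risk");
```

Keeping the table explicit makes a mismatch, such as sending a unit-test task to an editorial skill, visible in review rather than discovered in output.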
The governance point matters too.
Skills make preferences portable. That is valuable for an organisation because it reduces the amount of repeated prompting each developer has to do.
But portability can create false confidence.
A skill that performs well in one task family may fail quietly in another. This benchmark shows why teams should test skill-task fit before rolling skills into normal delivery.
For a UK business, the right first move is not to ask, "Which Claude Code skills are best?"
The better question is:
"Which of our recurring tasks would improve if the same discipline were applied every time?"
That question produces a smaller, more useful skill set.
FAQ
What is the Claude Code skills benchmark?
It is a first-party Ampliflow benchmark testing whether Claude Code skill disciplines help practical coding and documentation tasks. v1 tested 5 skills across 4 tasks, creating 20 scored evaluations.
What was the headline result?
The benchmark produced 16 passes from 20 evaluations, an 80% pass rate. Total quality score was 54/60, with an average quality score of 2.7 out of 3.
Which skills scored best?
harden, react-best-practices, and karpathy-guidelines each scored 4/4 passes and 12/12 quality points in this pilot.
Did TDD fail?
No. tdd passed the refactor and Vitest tasks, and scored 10/12 quality points overall. Its misses came from format-specific documentation and security review requirements, not from poor behavioural reasoning.
Why did an editorial skill pass a security task?
The security fixture included a hardcoded API key on line 5. m25-content-rules flagged the trust and credential leak risk, then recommended environment variables. That satisfied the deterministic scorer, even though it is not a full security engineering skill.
Why was v1 generated by Codex instead of native Claude Code CLI?
The scaffold attempted native claude.exe runs, but the sandbox blocked them with EPERM errors. v1 therefore uses Codex 5.5 medium-reasoning outputs while keeping the same deterministic scorer logic.
Can I replicate the benchmark?
Yes. The fixtures, outputs, results, and methodology note are published in the benchmark scaffold. The scoring rules live in tests/data/skills-benchmark/scripts/run-benchmark.mjs.
What changes in v2?
v2 will add native CLI execution, expand to 20 skills x 8 tasks, keep deterministic scoring, and add human review for quality dimensions that regex cannot judge.
What Should You Do Next?
If your team is exploring Claude Code for operational or engineering automation, start with the repeatable work.
Pick one task family.
Write one skill.
Benchmark it against real examples from your codebase.
Then decide whether it deserves to become part of your delivery process.
For implementation support, start with automation services.
For a diagnostic view of where agentic workflows could help your current operation, book an audit.