The AI Agent Skills Ecosystem:
Installability, Security, and User Pain

By Mehul Bhardwaj · Vessel

Abstract

We study the AI-agent skills ecosystem from three angles: registry installability, scanner-detected security risk, and user-reported pain. As of April 2026, 97.0% of OpenClaw skills install on Linux, Snyk flags 36.8% of skills for latent code flaws, and 55.5% of user complaints filed on GitHub describe wrong, missing, or broken output. Each number has a different denominator; the gap between them is the finding. Registries catch what fails to install. Scanners catch latent risk. No public tool watches for the skill that installs cleanly, passes audit, and silently produces wrong output, the failure users actually hit.

  • Registry: 97.0% · Linux-installable after one apt line.
  • Scanner: 36.8% · Snyk scanner-flagged skills.
  • GitHub pain: 55.5% · operational correctness complaints.

Definition

By operational correctness we mean failures where a skill installs and runs, but the output is wrong, incomplete, silently empty, schema-invalid, or broken by environment drift. 55.5% of GitHub user complaints describe failures of this kind. Neither public scanners nor registries measure it directly.

Blind spot 1 · Registry

The registry shows 97.0% of OpenClaw skills are apt-installable on a standard Linux host: the package resolves, the check passes, the number looks clean. But installable means the archive arrived. It says nothing about the browser binary at a specific path, the OAuth flow that must complete before first use, or the API key that must be in the environment. “Passed the scan, installed cleanly, and still doesn’t work right” is not a rare edge case. It is the dominant complaint.

Blind spot 2 · Supply chain

Supply-chain attacks like ClawHavoc (a malicious-skill campaign, January 2026) produce no crash, no error, and no GitHub issue. Public ClawHavoc reports describe hundreds of skills distributing AMOS, a macOS credential-harvesting trojan, that installed cleanly, passed scanners, ran silently, and exfiltrated without the user knowing. Casual skill installers don’t appear in GitHub issue trackers because they don’t know they were hit.

Methods

We scanned all 6,993 public OpenClaw skills for Linux installability and dependencies, synthesized published security research (Snyk ToxicSkills and public ClawHavoc reporting), and classified 16,635 user reports across three ecosystems into 12 pain categories spanning install, correctness, security, and discovery (κ = 0.71). Security pain on GitHub is 1.4% [1.18%, 1.64%] (95% Wilson CI). Both datasets and scripts are open: doi.org/10.5281/zenodo.19691714.

The four failure modes, by who detects them

Failure modes by detector: registry, scanner, and user reports each catch a different slice.
Failure mode · Registry catches? · Scanner catches? · Users report?

Latent code flaw · OAuth over-scope, injection patterns, credential mishandling.
  Registry: No. Installs the archive; does not analyse the code.
  Scanner: Yes. Snyk ToxicSkills flags 36.8% of skills.
  Users: Rarely. Surfaces only after exploitation; latent risks are invisible.

Supply-chain compromise · Skill installs cleanly, exfiltrates silently. Public ClawHavoc reports range from 341 to 824 skills.
  Registry: No. No build-time provenance check at the registry layer.
  Scanner: Post-campaign. Koi catches via reputation / fingerprints, not pre-publish.
  Users: Never. No crash, no error, no GitHub issue. Survivorship absence.

Operational correctness · Skill ran, returned plausible-looking output, was wrong.
  Registry: No. “Installable” is silent on whether the output is correct.
  Scanner: No. Static scanners cannot evaluate runtime output.
  Users: Yes, 55.5%. Quality + silent-failure + compat + install on GitHub.

Discovery · Hard to find, compare, or evaluate a skill before installing.
  Registry: Search only. Listing exists; ranking and evaluation signals are weak.
  Scanner: No. Discovery is not a security or installability concern.
  Users: Yes (floor). Non-developers post on Reddit; rarely file issues. The 1.7% GitHub figure is a lower bound.
Each row is the same gap viewed from a different angle. The largest miss, operational correctness, is the row where users see the failure but registries and static scanners do not measure it.

Study design and data

  • View A: LLM-classified, validated at κ = 0.71. 16,635 public reports classified by Claude Haiku 4.5, validated against Sonnet 4.6 on a 61-item sample. Zero operational↔security confusions were observed in the sample. Findings are cross-ecosystem: OpenClaw, Vercel skills.sh, and Anthropic’s Claude Code (the latter two appear only here, as they use different formats incompatible with the catalog scan).
  • View B: cited as-published reference data. Snyk ToxicSkills and public ClawHavoc reporting; we quote published numbers with attribution and did not rerun either audit. Numbers are directionally consistent; some exact figures are not independently verifiable.
  • View C: first-party, reproducible. Full scan of the openclaw/skills monorepo (6,993 skills, April 2026). Each skill ships with a SKILL.md file containing metadata, declared binary dependencies, and a system prompt; the catalog scan reads these directly.

Datasets

Catalog: 6,993 skills, 23 columns · User reports: 16,635 classified mentions. CC BY 4.0.

DOI: 10.5281/zenodo.19691714

Scripts

Reproducible build. Clone the repo, run the script, diff the CSV. MIT.

Cite this study (BibTeX)
@misc{vessel_agent_skills_ecosystem_2026_04,
  author       = {Bhardwaj, Mehul},
  title        = {The AI Agent Skills Ecosystem: Installability, Security, and User Pain},
  year         = {2026},
  month        = {4},
  publisher    = {Vessel},
  doi          = {10.5281/zenodo.19691714},
  url          = {https://doi.org/10.5281/zenodo.19691714},
  howpublished = {\url{https://vesselofone.com/research/ai-agent-skills-ecosystem}},
  license      = {CC-BY-4.0}
}

View A · User-reported pain

The measurement nobody publishes

First-party, LLM-classified: 16,635 of 16,840 public user reports from GitHub Issues (n = 10,256), Hacker News (n = 5,097), and Reddit (n = 1,282), classified into 12 analytical categories using Claude Haiku 4.5 with a 12-type enum schema. These are collected-report proportions, not ecosystem prevalence estimates. The full methodology, classifier prompt, and validation data are in the Methodology section below.

What each source reveals differently

Silent failure · GitHub 18.4% vs HN 0.5%. Developers reproduce wrong output and file a ticket. HN readers don’t know the output was wrong; they only see that the skill ran.

Security · HN 5.1% vs GitHub 1.4%. Security is discussed on HN, not filed as bugs. OAuth over-scoping and credential exposure are invisible to users until exploited.

Discovery · Reddit 8.3% vs GitHub 1.7%. Non-developers post “I can’t find a skill for X” on Reddit. They don’t file GitHub issues. The GitHub discovery figure is a floor.

The asymmetry defines the monitoring surface

The 18.4% vs 0.5% divergence isn’t about different user populations. Both GitHub and HN skew towards developers. When a developer finds wrong output, they open a GitHub issue. The same developer doesn’t post about it on HN; HN is where security risks and ecosystem concerns get discussed, not where reproducible bugs get filed. Silent failure only surfaces when the person running the skill actively verifies the output. The checks that do that (schema validation, expected-structure assertions, confidence signals) need to be in the system, not left to whoever happens to check.
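What such a check looks like at the install site: a minimal sketch that turns a silent failure into a loud one. The function name, the expected fields, and the JSON shape are illustrative stand-ins, not any skill’s real contract.

python
import json

EXPECTED_KEYS = {"title", "url", "summary"}  # hypothetical contract for a search-style skill

def verify_skill_output(raw: str) -> list[dict]:
    """Fail loudly if the output doesn't parse, is empty, or misses expected fields."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as e:
        raise RuntimeError(f"skill output is not valid JSON: {e}") from e
    if not isinstance(items, list) or not items:
        raise RuntimeError("skill returned an empty or non-list result")
    for i, item in enumerate(items):
        if not isinstance(item, dict) or EXPECTED_KEYS - item.keys():
            raise RuntimeError(f"result {i} does not match the expected shape")
    return items

The specific assertions matter less than where they live: the check runs on every call, not only when a developer happens to look.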

Bug reports by cluster × source

The 10 signal categories collapse into 4 clusters. Each row below is a cluster; within each cluster, the three sources show what share of their reports landed there. GitHub leads operational correctness at 55.5%; HN leads security at 5.1%; Reddit leads docs & discovery at 15.0%. Same data, three vantage points.

Cluster share by source: each row is a cluster; percentages are within-source.

Cluster · GitHub (n = 10,256) · Hacker News (n = 5,097) · Reddit (n = 1,282)
  • Operational (quality · silent-failure · compat · install): 55.5% · 8.8% · 16.8%
  • Docs & discovery (docs · discovery · registry): 9.9% · 5.4% · 15.0%
  • Maintenance (abandoned · cross-skill conflicts): 3.1% · 1.3% · 3.1%
  • Security (runtime consent violations): 1.4% · 5.1% · 3.8%

Percentages are each cluster’s share of that source’s reports. Rows do not sum to 100% because other and noise are excluded. GitHub security CI: [1.18%, 1.64%]; discovery CI: [1.49%, 2.00%]*.

Pain × source · full breakdown

All 10 signal categories drilled down by source, grouped by cluster, with raw counts and within-source percent. The bottom two rows (other, noise) are residual buckets, shown for transparency rather than as findings. HN: 54.7% noise filtered · Reddit: 32.0% noise filtered.

Pain type (cluster) · GitHub n = 10,256 · Hacker News n = 5,097 · Reddit n = 1,282
  • quality (operational) · 20.72% (2,125) · 6.96% (355) · 7.57% (97)
  • silent-failure (operational) · 18.44% (1,891) · 0.49% (25) · 1.79% (23)
  • compat (operational) · 9.04% (927) · 0.47% (24) · 2.18% (28)
  • install (operational) · 7.26% (745) · 0.82% (42) · 5.30% (68)
  • docs (docs & discovery) · 5.35% (549) · 2.10% (107) · 6.16% (79)
  • discovery (docs & discovery) · 1.73% (177) · 2.53% (129) · 8.27% (106)
  • registry-meta (docs & discovery) · 2.87% (294) · 0.73% (37) · 0.55% (7)
  • maintenance (maintenance) · 2.36% (242) · 0.80% (41) · 1.95% (25)
  • cross-skill (maintenance) · 0.69% (71) · 0.47% (24) · 1.17% (15)
  • security (security) · 1.39% (143) · 5.06% (258) · 3.82% (49)
  • other (residual) · 22.82% (2,340) · 24.88% (1,268) · 29.25% (375)
  • noise (residual) · 7.33% (752) · 54.68% (2,787) · 31.98% (410)

* The discovery figure (1.7%) is a lower bound, not a prevalence estimate. GitHub Issues are filed by developers who can reproduce and articulate a bug; non-developer users who can’t find a skill abandon silently or post on Reddit instead. See Limitation 8.

View B · Scanner and audit findings

What security scanners catch, and what they miss

First-party scan: all 6,993 skills via SKILL.md pattern analysis, ClawHub signals, and GHSA cross-reference. Reproduce: python scripts/security-scan.py in the openclaw-skills repo. Cited as-published: Snyk ToxicSkills and public ClawHavoc reporting. Their corpora are not fully disclosed, so we treat those numbers as reference data, not precision comparisons.

  • Dangerous (first-party): 9.2% · 642 of 6,993 skills · active threat signals.
  • Caution (first-party): 43.4% · 3,036 of 6,993 skills · risk patterns present.
  • Snyk-flagged (cited): 36.8% · Snyk ToxicSkills · 3,984 skills scanned.

Three lenses on the install-time threat surface, not three readings of the same number. The two first-party columns are reproducible; the Snyk column is cited as-published. The runtime security figure in View A (1.4% of GitHub reports) measures a different surface and cannot be summed with these. The three failure types below explain why.

How each security failure type surfaces

Security failure types, by scanner detection and user visibility:

Latent code flaws · OAuth over-scoping, injection, credential mishandling.
  Scanner detection: High. Static patterns; this is what Snyk’s 36.8% measures.
  User-visible: After exploit only. No surface event until the flaw is triggered.

Supply chain compromise · ClawHavoc-class campaigns: malware via legitimate-looking skills (public reports range from 341 to 824 flagged skills).
  Scanner detection: Post-campaign only. Reputation methods catch known campaigns; novel ones evade.
  User-visible: Never. Installs cleanly, exfiltrates silently. No crash, no error, no issue.

Runtime consent violations · Agents acting visibly outside intended scope.
  Scanner detection: None. No public scanner covers this class today.
  User-visible: ~1% of reports. The only failure mode users can see, and even there the signal is faint.

Snyk’s findings, ranked

OAuth over-scoping fires on 70.1% of audited skills, but Snyk classifies it as informational — rule-detectable, rarely exploitable on its own. The threshold that matters is “at least one non-informational finding,” which holds for 36.8%. Critical-severity findings (RCE, unauthenticated sinks) hit 13.4%, roughly 534 of the 3,984 audited skills.

Snyk ToxicSkills audit, published findings (n = 3,984 skills audited; per-skill list not disclosed):
  • OAuth scope wider than task · skill requests permissions it demonstrably does not use · 70.1% · high
  • Command-injection pattern · shell-exec in SKILL.md body with user-controlled input · 43.4% · high
  • At least one security flaw · any non-informational finding across all rules · 36.82% · medium
  • Critical severity · RCE, creds-in-repo, unauthenticated sink · 13.4% · critical

Prevalence is within Snyk’s 3,984-skill corpus; the severity rating is independent of prevalence.

Scanner coverage matrix

The four scanners in the matrix don’t overlap on much. Snyk catches static patterns. Koi catches live campaigns via reputation. Our first-party scan found 9.2% dangerous and 43.4% caution across all 6,993 skills. None of them catch silent wrong-output bugs or environment-drift breakage; those require runtime evaluation, which no public scanner does today.

Scanner coverage matrix: what each published audit measures.

Credentials hard-coded in SKILL.md
  Snyk ToxicSkills: caught · pattern: `token\s*=`, known API formats.
  Koi ClawHavoc: not measured · not a primary focus.
  Vessel skill-check: caught · entropy + known-prefix regex.

OAuth scopes wider than declared task
  Snyk ToxicSkills: caught · requested scopes vs. task description.
  Koi ClawHavoc: not measured.
  Vessel skill-check: caught · intent-vs-scope classifier.

Command injection in SKILL.md shell blocks
  Snyk ToxicSkills: caught · AST grep for shell-exec patterns.
  Koi ClawHavoc: not measured.
  Vessel skill-check: caught · same AST pattern set.

Adversarial prompt instructions (prompt injection)
  Snyk ToxicSkills: not measured · natural language, no fixed pattern.
  Koi ClawHavoc: caught · known-campaign instruction fingerprints.
  Vessel skill-check: partial · trained patterns only; novel phrasings slip past.

Malicious skills in active campaigns
  Snyk ToxicSkills: not measured · point-in-time scan, no reputation.
  Koi ClawHavoc: caught · live campaign tracking (ClawHavoc).
  Vessel skill-check: not measured · single-skill audit, not reputation.

The gap · not covered by any published scanner
  Silent wrong-output bugs: not caught by any · requires runtime evaluation.
  Environment-drift / install-breaks-for-me: not caught by any · not in scope.
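For concreteness, a sketch of the “entropy + known-prefix regex” technique named in the Vessel column. The prefix list and the 3.5-bit threshold are assumptions for illustration; a production scanner would carry a maintained rule set.

python
import math
import re

# Known credential prefixes (illustrative subset): Anthropic, GitHub, AWS.
KNOWN_PREFIXES = re.compile(r"\b(?:sk-ant-|ghp_|AKIA)[A-Za-z0-9_\-]{8,}")
# Long opaque value assigned to a secret-looking name, per the `token\s*=` pattern above.
ASSIGNMENT = re.compile(r"""(?:token|key|secret|password)\s*=\s*["']?([^\s"']{16,})""", re.I)

def shannon_entropy(s: str) -> float:
    """Bits per character; random API keys score noticeably higher than prose."""
    return -sum(s.count(c) / len(s) * math.log2(s.count(c) / len(s)) for c in set(s))

def scan_skill_md(text: str) -> list[str]:
    findings = [m.group(0) for m in KNOWN_PREFIXES.finditer(text)]
    for m in ASSIGNMENT.finditer(text):
        if shannon_entropy(m.group(1)) > 3.5:  # illustrative threshold
            findings.append(m.group(0))
    return findings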

The community moved on the supply chain gap before the formal audits landed. Our scrape of OpenClaw skill repos from January through April 2026 captures multiple independently built scanners — SkillScan, Aguara, ClawSec Monitor, ClawSecure among them — each launched in direct response to the absence of registry-level trust signals. Their published prevalence numbers vary by methodology, but every one of them put dangerous-or-malicious prevalence in the double digits.

Named incidents in the same window confirm the supply chain risk is not theoretical: top-downloaded ClawHub skills distributing malware, plain-text malicious payloads visible in SKILL.md files, flagged weeks before the formal scanner reports were published.

What scanners fundamentally miss

A SKILL.md body is adversarial natural language, not an executable you can hash. A static scanner finds the patterns it was trained on; it does not catch a skill that installs cleanly, passes audit, and still emits silently wrong output. That gap is what we measured in View A.

Supply chain attacks are the deeper blind spot. The ~1% security figure in View A is not evidence that risk is low; it is a survivorship artifact. The users most exposed to ClawHavoc-style attacks are casual installers who don’t know they were compromised, so they are structurally absent from issue trackers.

Cited sources

Snyk ToxicSkills (2026)

Static code scan for credential leaks, OAuth scope width, and injection patterns. Published corpus: 3,984 skills from ClawHub and skills.sh. Read the Snyk write-up.

ClawHavoc reporting (Jan-Feb 2026)

Public reports describe a live malicious-skill campaign with counts ranging from 341 early identified skills to 824 later flagged skills. Reputation + campaign fingerprint methodology. ClawTank early report · LaunchMyOpenClaw scale report.

View C · Registry installability

The registry: 97.0% installable on Linux

First-party, reproducible: full scan of 6,993 public OpenClaw skills. We parsed every SKILL.md, resolved declared binaries against a curated install map, and publish the full funnel, dep taxonomy, and blocker list. This is a dependency-resolution analysis, not an execution test.

If you self-host a Linux box, this is the only install step you need. One apt line covers 6,784 of 6,993 public OpenClaw skills. No Docker, no uv, no language managers.

One command · covers 97.0% of skills

apt install python3 python3-pip nodejs npm curl jq git ffmpeg sqlite3 openssl ripgrep

Satisfies every declared binary dependency for 6,784 of 6,993 skills. No Homebrew, no uv, no Docker, no language managers.

How the 97.0% breaks down

Five buckets, in order. Most skills need nothing at all — pure prompts plus an API key. The apt line absorbs almost everything else. What’s left is a small tail: a pip/curl-installable batch and a handful tied to the Mac desktop.

  • Skills scanned (every skill in openclaw/skills): 6,993
  • Declare their deps in SKILL.md (97.8% transparent): 6,841
  • Run as-is, zero install (82.6% of scanned): 5,777
  • Covered by the apt line (+1,007 resolved via apt): 6,784
  • Outside the apt line (59 brew-only · 150 pip/curl/macOS-desktop): 209
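The funnel can be recomputed from the published catalog CSV. A sketch, assuming hypothetical column names (declared_bins, install_path); the real 23-column schema is documented in the dataset README.

python
import pandas as pd

df = pd.read_csv("coverage-output/catalog.csv")  # path illustrative

total = len(df)                                              # expect 6,993
zero_dep = (df["declared_bins"].fillna("") == "").sum()      # expect 5,777: run as-is
apt_ok = df["install_path"].isin(["zero-dep", "apt"]).sum()  # expect 6,784
print(total, zero_dep, apt_ok, total - apt_ok)               # last term: the 209-skill tail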

What the apt line unlocks

5,777 of the 6,993 skills ship with zero binary dependencies. The chart below is the long tail: the 1,216 skills that do declare binaries, and the 15 most-declared bins among them. Almost every bin at the head is either preinstalled on Ubuntu or one apt away.

  • python3 · 536 skills · apt install python3
  • curl · 313 skills · apt install curl
  • node · 178 skills · apt install nodejs
  • jq · 78 skills · apt install jq
  • python · 45 skills · apt install python3
  • npx · 32 skills · apt install nodejs
  • bash · 30 skills · preinstalled
  • uv · 28 skills · curl install script
  • npm · 28 skills · apt install npm
  • mcporter · 24 skills · direct download
  • pip · 23 skills · apt install python3-pip
  • git · 21 skills · apt install git
  • ffmpeg · 19 skills · apt install ffmpeg
  • pip3 · 13 skills · apt install python3-pip
  • yt-dlp · 9 skills · pip install yt-dlp

Beyond the apt line

About 3.0% of the monorepo declares a binary the apt command doesn’t cover. Most are one pip install or a curl script away: yt-dlp, uv, bun, Foundry’s cast. Named bins below.

  • cast · apt-missing · 8 skills
  • pbpaste · apt-missing · 6 skills
  • osascript · apt-missing · 4 skills
  • google-chrome-stable · apt-missing · 4 skills
  • xvfb-run · apt-missing · 4 skills
  • ssh · apt-missing · 3 skills
  • awk · apt-missing · 3 skills
  • dbus-launch · apt-missing · 3 skills
  • vdirsyncer · brew-only · 2 skills
  • khal · brew-only · 2 skills

apt-missing = not packaged for apt-get. brew-only = macOS-only Homebrew formula.

Excludes 35 self-references (openclaw/claude itself) and 12 pip/npm-installable tools.

How the most-installed skills fare

The top of the distribution is the test that matters for most self-hosters: do the skills people actually install run on a server? Nine of the top 10 run as-is after the apt line. caldav-calendar is the only genuine macOS blocker in the top tier.

Rank · Skill (slug · author) · Downloads · Deps · Linux
  1. Humanizer (humanizer · biostartechnology) · 92,489 · none · runs as-is
  2. Proactive Agent Lite (proactive-agent-lite · bestrocky) · 33,577 · none · runs as-is
  3. Xiaohongshu (小红书) Automation (xiaohongshu-mcp · borye) · 31,510 · none · runs as-is
  4. Tavily AI Search (tavily · bert-builder) · 30,814 · none · runs as-is
  5. Pdf (pdf · awspace) · 30,151 · none · runs as-is
  6. Docker Essentials (docker-essentials · arnarsson) · 27,684 · docker · apt install
  7. AgentMail (agentmail · adboio) · 27,495 · none · runs as-is
  8. Web Search (web-search · billyutw) · 26,571 · none · runs as-is
  9. Humanizer (ai-humanizer · brandonwise) · 26,545 · none · runs as-is
  10. Caldav Calendar (caldav-calendar · asleep123) · 25,777 · vdirsyncer, khal · macOS only

What you can’t get on Linux

After the pip/curl tail, the residual is small. 10 skills depend on macOS-only binaries for their primary capability. Four control a desktop app (AirPlay speakers, Adobe Photoshop, the Trae IDE, the macOS wallpaper daemon) and genuinely don’t port. The other six read text from the Mac clipboard. The transformation itself is general, so the capability transfers to Linux by swapping pbpaste for file input or xclip/wl-paste. Closest catalog matches below.

Skill · Downloads · Mac bin · Linux replacement in catalog

  • Airfoil · 2,096 · osascript · no catalog equivalent; desktop-app capability.
    Control AirPlay speakers via Airfoil from the command line. Connect, disconnect, set volume, and manage multi-room audio with simple CLI commands.
  • Photoshop Automator · 1,123 · osascript · no catalog equivalent; desktop-app capability.
    Automate Adobe Photoshop on Windows via ExtendScript to run scripts, update text layers, create layers, apply filters, play actions, and export images.
  • Nerve Bridge Skill · 591 · osascript · no catalog equivalent; desktop-app capability.
    Bi-directional control of Trae via macOS AppleScript with built-in feedback mechanism. Use when needing to execute code/commands in Trae IDE and wait for com...
  • reply-coach · 280 · pbpaste · copy-editing (file-based text rewrite)
  • reviewer-rebuttal-coach · 255 · pbpaste · copy-editing (file-based rewrite, swap pbpaste for file input)
  • collab-offer-polisher · 253 · pbpaste · medical-email-polisher (closest catalog match for polishing business messages)
  • wallpaper-auto-switch-pro-executable · 249 · osascript · no catalog equivalent; desktop-app capability.
  • policy-to-checklist · 248 · pbpaste · afrexai-qa-test-plan (document-to-checklist generator)
  • claim-risk-auditor · 244 · pbpaste · verify-claims (direct claim / fact-check equivalent)
  • rubric-gap-analyzer · 244 · pbpaste · afrexai-interview-architect (rubric-based scoring, file input)

Decision framework

A skill can install cleanly, pass every scanner, and still produce silently wrong output. Quality (20.7%) and silent-failure (18.4%) are the top two user-reported failure modes, and neither is what registries or scanners are built to catch. The minimum steps to close the gap, by role.

If you write skills

publishing to a registry

  1. Write at least one correctness test. (18.4% silent-failure rate.) A behavioral check, not a lint: one golden input → expected output. Catches the “ran and returned garbage” class of bug that exits 0 and looks fine. A minimal sketch follows this list.
  2. Fail loudly when uncertain. (20.7% quality complaints.) Return an explicit error or confidence signal. The gap between quality and silent-failure is whether the user knows something went wrong.
  3. Pin binary versions and declare every env var. (11.4% maintenance + compat.) Most skills break 3–6 months after a working install when something upstream changes. SKILL.md is the contract.
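A minimal sketch of item 1, assuming a hypothetical run_skill entry point; the golden fixture and the assertions are illustrative, not a prescribed harness.

python
# Golden input → expected output. `run_skill` is a stand-in for your
# skill's actual invocation.
GOLDEN_INPUT = "The apt line covers 6,784 of 6,993 skills, measured April 2026."

def test_golden_summary():
    result = run_skill("summarize", text=GOLDEN_INPUT)
    # Behavioral checks, not lint: the process exits 0 either way,
    # so assert on the output itself.
    assert result["summary"].strip(), "ran but returned an empty summary"
    assert "6,784" in result["summary"], "dropped the load-bearing number"
    assert len(result["summary"]) <= len(GOLDEN_INPUT), "summary grew the input"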

If you install skills

deploying into a workflow

  1. Run a scanner before installing. (36.8% have a scanner-flagged issue.) Static scan is cheap insurance for the latent risks users can’t see by reading SKILL.md.
  2. Test with your own representative inputs. (55.5% of reported pain is operational.) “Installs cleanly + passes scanner” is not a production filter. Neither registries nor scanners measure correctness on your data.
  3. Monitor for wrong output, not just errors. (A 0% error rate ≠ correctness.) Silent failure means users hit wrong-output bugs before crash bugs. Spot-check outputs or assert against an expected schema in production.
  4. Schedule a re-review at 6 months. (11.4% of pain surfaces post-install.) APIs change, deps drift, model behavior shifts. The bug that wasn’t there at install will be there in two quarters.

If you package skills into a product

embedding into something other people consume

  1. Build correctness monitoring into your runtime. (55.5% of pain lands on you, not the author.) Customers will hit operational bugs first and escalate to you, not the upstream skill author. Public registries and scanners don’t cover this layer — it’s yours to build.
  2. Surface confidence and “I’m uncertain” signals. (18.4% silent-failure → known unknown.) The dominant failure mode is a skill that ran and was silently wrong. A confidence signal converts a silent failure into something users can act on.
  3. Track correctness independently of security. (36.8% scanner-flagged, yet scanner-clean ≠ correct.) Scanner-clean skills still generate the majority of support load. Correctness is a separate measurement from safety and needs its own dashboard.

Implications

Four predictions for 2026–2028, each with an explicit falsification criterion. Confident enough to be wrong about. Each is a direct read on what the three-view gap would have to produce if the structural diagnosis here is right.

  1. Skill-level runtime correctness telemetry emerges as a paid layer.

    Agent-level evaluation is already crowded. LangSmith, Braintrust, Patronus, Langfuse, and Phoenix all measure end-to-end agent behaviour. The gap is the install-site layer below: per-skill output-shape monitoring, schema assertions, and confidence signals attached to the individual tool the agent just called. The 55.5% operational pain figure says someone has to fill it.

    Falsified if not: at least one VC-backed startup ships per-skill output validity monitoring at the install site (not agent-level), by 2027-04.
  2. At least one more ClawHub-class supply-chain incident lands in the next 12 months.

    ClawHavoc was not a one-off. It was the first incident large enough to be named. The structural conditions that made it possible (no pre-publish review, no provenance check, install-time trust) have not changed. The economics for an attacker (hundreds of flagged skills in public reporting, no crash, no error) only get better as the install base grows.

    Falsified if not: ≥1 publicly named malicious-skill campaign affecting >100 skills, reported by a recognised security vendor or news outlet, by 2027-04.
  3. Enterprise procurement starts asking for correctness SLAs, not just security attestations.

    Today’s agent-vendor RFPs ask SOC2-style security questions. The next round will ask the question View A actually exposed: what is your wrong-output rate, and what do you do when a skill returns plausible-but-bad output? Procurement catches up with the failure mode roughly two years after the field documents it.

    Falsified if not: a widely circulated agent-vendor RFP template (Gartner, IAPP, or a Fortune 500 sample) explicitly asks for skill-level correctness measurements, by 2028-04.
  4. Skill registries add some form of test/eval requirement for marketplace inclusion.

    Registries today gate on installability and (sometimes) static security scan. Neither catches the dominant failure mode. The first registry to publish a “verified skill” tier with an automated correctness check wins the trust premium, and the rest follow within a release cycle.

    Falsified if not: ClawHub, skills.sh, or another major registry ships a verified-skill tier conditioned on at least one automated correctness check (golden input → expected output, schema assertion, or runtime invariant), by 2027-10.

Predictions will be re-scored alongside each refresh of this study. The changelog records hits and misses.

Methodology

View A · User-reported pain

Sources. GitHub Issues API (three repos, paginated to GitHub’s 100-page cap). Hacker News via Algolia (11 keywords: "openclaw", "SKILL.md", "claude skill", "claude code skill", "clawhub", "claw skill", "openclaw/skills", "skill registry", "agent marketplace", "skill broken", "skill doesn’t work"). Reddit public JSON across 12 subreddits. Total scraped: 16,840. Classified: 16,635 (98.8%; 205 unclassified due to rate-limit exhaustion during the run).

Signal quality by source.

On-topic share within each scrape, after the classifier filters noise (keyword false positives, off-topic English uses of "skill"). GitHub is the cleanest channel; Hacker News is the noisiest. Headline percentages in View A use the on-topic counts as their denominator.

GitHub Issues: 9,504 on-topic · 752 off-topic · 92.7% on-topic. 3 repos: vercel-labs/skills, anthropics/claude-code, anthropics/claude-agent-sdk-typescript. Reproducible, specific bugs.

Reddit: 872 on-topic · 410 off-topic · 68.0% on-topic. 12 subreddits. Some operators troubleshoot in public; lots of adjacent LLM chatter.

Hacker News: 2,310 on-topic · 2,787 off-topic · 45.3% on-topic. Algolia keyword matches. The majority are unrelated English uses of "skill".

Classifier. Claude Haiku 4.5 (claude-haiku-4-5-20251001) via the Anthropic API, tool-use structured output with an enum-constrained schema. Single-label classification with a short (<50 char) free-text reason. Ephemeral prompt cache on the system message (12-type taxonomy definitions) for a ~90% input-token discount after first hit.

Taxonomy. 12 types: compat, silent-failure, maintenance, security, quality, docs, cross-skill, discovery, registry-meta, install, other, noise. Full definitions in the dataset README.
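A sketch of the classification call, assuming the current Anthropic Python SDK. The tool schema mirrors the 12-type enum; the system prompt here is an abbreviated stand-in for the published taxonomy definitions.

python
import anthropic

TYPES = ["compat", "silent-failure", "maintenance", "security", "quality", "docs",
         "cross-skill", "discovery", "registry-meta", "install", "other", "noise"]
TAXONOMY_DEFINITIONS = "…full 12-type definitions as published in the dataset README…"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def classify(report_text: str) -> dict:
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        # cache_control on the system block yields the ~90% input-token
        # discount after the first hit.
        system=[{"type": "text", "text": TAXONOMY_DEFINITIONS,
                 "cache_control": {"type": "ephemeral"}}],
        tools=[{
            "name": "classify_report",
            "description": "Assign exactly one pain type to a user report.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "pain_type": {"type": "string", "enum": TYPES},
                    "reason": {"type": "string", "maxLength": 50},
                },
                "required": ["pain_type", "reason"],
            },
        }],
        tool_choice={"type": "tool", "name": "classify_report"},  # force the enum output
        messages=[{"role": "user", "content": report_text}],
    )
    return msg.content[0].input  # e.g. {"pain_type": "silent-failure", "reason": "..."}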

View B · Scanner and audit findings

Cited as-published: Snyk ToxicSkills (2026) and public ClawHavoc reporting (January-February 2026). We did not rerun these external audits; we quote their headline numbers with attribution and treat them as reference data, not precision-comparable measurements.

View C · Registry installability

Full-catalog scan of openclaw/skills: parse every SKILL.md, extract declared binary dependencies, resolve each against a curated Linux install map (apt / direct download / runtime fetch / brew-only / no-Linux-path). ClawHub metadata and last-commit activity joined per skill. Script: catalog-coverage.py.
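In outline, the resolution step looks like the sketch below; the map entries are a small illustrative subset of the curated install map that ships with catalog-coverage.py.

python
# Declared binary → install path; the worst dependency determines the bucket.
INSTALL_MAP = {
    "python3": "apt", "curl": "apt", "node": "apt", "jq": "apt",
    "git": "apt", "ffmpeg": "apt", "pip": "apt",
    "uv": "runtime-fetch", "yt-dlp": "pip",
    "vdirsyncer": "brew-only", "khal": "brew-only",
    "pbpaste": "no-linux-path", "osascript": "no-linux-path",
}
SEVERITY = ["no-linux-path", "brew-only", "unknown", "runtime-fetch", "pip", "apt"]

def resolve(declared_bins: list[str]) -> str:
    if not declared_bins:
        return "zero-dep"  # runs as-is
    paths = {INSTALL_MAP.get(b, "unknown") for b in declared_bins}
    return next(p for p in SEVERITY if p in paths)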

Statistical reporting

Headline percentages are reported with 95% Wilson score intervals. Wilson is preferred over the normal approximation at small p or small n: it stays inside [0, 1] and gives sensible bounds even when the count is in the single digits (matters for the 1.39% security and 1.73% discovery figures).
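The interval in code; the 143-of-10,256 check reproduces the published GitHub security CI.

python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(143, 10_256)  # GitHub security reports
print(f"{lo:.2%} – {hi:.2%}")    # 1.18% – 1.64%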

Confusion matrix

Sonnet 4.6 was rerun on a 61-item stratified random sample from the Haiku output, blind to Haiku’s labels. This is model-model agreement, not human-adjudicated ground truth. The 5-cluster collapse below shows where the two models agree (diagonal) and where they disagree (off-diagonal). The “op” cluster here includes operational correctness (quality, silent-failure, compat, install) plus ecosystem health types (maintenance, cross-skill, registry-meta) — a superset of the headline stat, which counts only the four correctness types. The single load-bearing axis of this study, operational vs. security, has zero observed op↔security cells in either direction in this validation sample. Adjacent-boundary disagreements (quality ↔ silent-failure, both operational) account for the bulk of the 16 off-diagonal pairs.

  • Exact-label agreement: 73.8% (45/61 sample items)
  • Cohen’s κ: 0.71 · substantial agreement (Landis & Koch)
  • Op↔security confusions: 0 · on the load-bearing axis of the study

Sonnet 4.6 vs Haiku 4.5 · 5-cluster collapse

diagonal = agreement · off-diagonal = confusion

Haiku ↓ / Sonnet → · operational · security · discovery · docs · residual
  operational · 22 · 0 · 2 · 3 · 1
  security · 0 · 3 · 0 · 0 · 1
  discovery · 0 · 0 · 5 · 1 · 0
  docs · 3 · 0 · 0 · 7 · 0
  residual · 3 · 1 · 1 · 0 · 8

Diagonal cells are same-cluster agreement; off-diagonal cells are cross-cluster confusions. The op↔security intersections are zero in both directions in this validation sample, which is the load-bearing axis of the study.

Snapshot from aggregate validation notes. Per-row labels will be published as validation-sample.csv in the dataset repo; this view will then be computed from data.

Classifier validation

On a 61-item stratified random sample, Claude Sonnet 4.6 agreed with the Haiku 4.5 classifications at 73.8% (κ = 0.71, substantial agreement). Zero operational↔security confusions were observed in the sample. With 0 observed events, the rule-of-three upper bound is roughly 4.9% (3/61) for this validation sample. All 16 disagreements fell at adjacent-category boundaries (e.g. quality vs. silent-failure, both operational; compat vs. docs). 9 blank Hacker News comments were excluded as unclassifiable. Raw sample and per-row labels are in the dataset repo.

Sensitivity analysis. The operational:security ratio at baseline is 40:1 (5,688 vs. 143 GitHub reports). For the ratio to drop below 5:1, the classifier would need to have misclassified more than 30% of all operational reports as security, more than 100× the observed 0% operational↔security confusion rate. The finding is structural, not marginal.

Reproducibility

View C reruns end-to-end in about 5 minutes against a public-read GITHUB_TOKEN. View A is a multi-step pipeline (collectors → classifier → aggregation) that needs an ANTHROPIC_API_KEY; a full rerun is roughly 2 hours of wall time and $10–30 of Haiku inference.

bash
# clone and install (~30s)
$ git clone https://github.com/vesselofone/openclaw-skills.git
$ cd openclaw-skills && pip install -r requirements.txt

# View C · catalog scan (~5 min, public-read token)
$ export GITHUB_TOKEN=ghp_your_public_read_token
$ python3 scripts/catalog-coverage.py --all \
    --with-metadata --with-commits \
    --output coverage-output --work-dir coverage-output/repos

# View A · ecosystem classification (~2h, ~$10-30 Haiku)
$ export ANTHROPIC_API_KEY=sk-ant-...
# see scripts/ for collectors and classifier entrypoints

Data availability and ethics

  • Dataset license: CC BY 4.0
  • Scripts: MIT
  • Snapshot: April 2026 point-in-time scrape
  • Classifier: Claude Haiku 4.5 with enum-constrained output
  • Public data handling: the study uses public GitHub, Hacker News, and Reddit text for aggregate classification. The page reports category counts and methodology examples, not individual user profiles.

Limitations

  1. Static analysis only. View C does not execute skills; it resolves declared dependencies. SKILL.md completeness is voluntary. Interpret the installability number as dependency reachability, not runtime success.
  2. Single-label classification. A mention that describes both bad output and a compat bug must be assigned one bucket. Edge cases are forced, so category percentages are analytical approximations rather than adjudicated facts.
  3. GitHub pagination cap. Basic pagination tops out at 100 pages per repo. The anthropics/claude-code corpus tail is not covered. Treat GitHub percentages as proportions of collected reports, not full ecosystem prevalence.
  4. HN noise. Keyword matching on “openclaw”/"skill" yields 54.68% unrelated hits. The noise tag filters them; do not use the HN subset without filtering.
  5. Reddit rate-limit floor. Unauthenticated Reddit reads cap out. 1,282 rows is a floor, not a true count, so Reddit is directional evidence for discovery pain rather than a prevalence estimate.
  6. Classifier precision. Validation measures model-model agreement, not human-labeled ground truth. On a 61-item validation sample, Sonnet 4.6 agreed with Haiku at 73.8% (κ = 0.71), with zero observed operational↔security confusions. The sensitivity analysis shows the operational:security ratio would require >30% systematic misclassification to drop below 5:1.
  7. Moderation status is not security. ClawHub flags reflect staff moderation, not a security audit. They are registry metadata, not an independent safety label.
  8. User sample is developer-skewed. 61.7% of classified reports come from GitHub Issues, which are filed predominantly by developers integrating skills, a population that is unusually good at finding things. Pain modes that are visible to that population (silent wrong output, compat, install failures in their own environment) are probably faithfully represented. Pain modes that are more acute for non-developers (discovery, decision-fatigue across similar skills, onboarding friction) are likely under-counted. The 1.73% discovery figure should be read as a lower bound among developer reporters, not as a prevalence estimate across all skill users.

Frequently asked

Common questions from readers and reviewers. If yours isn’t here, open an issue on the dataset repo.

Why juxtapose registry, scanner, and user views at all?

Because each view answers a different question, and policy conversations about the skills ecosystem routinely confuse them. “Is the ecosystem healthy?” depends on whether you mean installable, safe, or operationally correct. The three answers point in different directions: high installability, a materially lower scanner-clean surface, and a GitHub pain profile dominated by operational correctness. That spread changes what you’d build next.

Why is security around 1% of user-reported pain but 37% of scanner findings?

The survivorship argument is in the introduction. In short: ClawHavoc-style attacks can produce no crash, no error, and no GitHub issue. The 1.39% measures conscious, reportable pain from technical users. It does not measure silent compromise.

Why Claude Haiku 4.5 instead of Opus for classification?

Cost and throughput. 16,000 single-label classifications with ephemeral prompt-caching on the taxonomy run in ~2 hours on Haiku at roughly $10–30 all-in. Opus would push the run into hundreds of dollars without meaningfully changing the distribution for a 12-class, enum-constrained schema. We publish the classifier prompt; anyone can rerun on a different model to verify.

Why is Hacker News so noisy?

54.68% of HN matches are unrelated keyword collisions (the word "skill" has strong non-AI-agent English usage). The noise tag filters them. If you cite an HN subset, filter first.

Do OpenClaw skills work on Linux?

Yes. 97.0% of the 6,993 public skills run on Linux after one apt install line. A small tail (~209) needs a pip install or a curl script, and about 10 skills depend on macOS-only binaries for their primary capability.

What is the one apt install line?
sudo apt install python3 python3-pip nodejs npm curl jq git ffmpeg sqlite3 openssl ripgrep

Satisfies every declared binary dependency for 6,784 of 6,993 public skills on Ubuntu 22.04/24.04 or Debian 12.

Is this a security audit?

No. View C is static compatibility analysis, not a security audit. View B cites external audits. View A classifies what users report, not what is objectively true. Per-skill security scoring is a separate pass; a free per-slug auditor is at vesselofone.com/tools/skill-check.

Will this study be refreshed?

View C is reproducible: clone the repo, re-run the script, diff the CSV. View A is point-in-time (April 2026); a future refresh will ship as a GitHub release with an immutable Zenodo DOI. The changelog lives on this page.

Changelog

  • 2026-04-20 · v1.0. Initial publication. Catalog scan (6,993 skills, one apt line). Ecosystem classification (16,635 mentions, 12-type pain taxonomy).

Your expertise deserves its own machine.

Private VMs. No shared infrastructure. Your agents, your data, your rules.