The AI Agent Skills Ecosystem:
Installability, Security, and User Pain

By Mehul Bhardwaj · Vessel

Abstract

We study the AI-agent skills ecosystem from three angles: registry installability, scanner-detected security risk, and user-reported pain. As of April 2026, 97.0% of OpenClaw skills install on Linux, Snyk flags 36.8% of skills for latent code flaws, and 55.5% of user complaints filed on GitHub describe wrong, missing, or broken output. Each number has a different denominator; the gap between them is the finding. Registries catch what fails to install. Scanners catch latent risk. No public tool watches for the skill that installs cleanly, passes audit, and silently produces wrong output, the failure users actually hit.

  • Registry: 97.0% · Linux-installable after one apt line.
  • Scanner: 36.8% · Snyk scanner-flagged skills.
  • GitHub pain: 55.5% · operational correctness complaints.

Definition

By operational correctness we mean failures where a skill installs and runs, but the output is wrong, incomplete, silently empty, schema-invalid, or broken by environment drift. 55.5% of GitHub user complaints describe failures of this kind. Neither public scanners nor registries measure it directly.

Blind spot 1 · Registry

The registry shows 97.0% of OpenClaw skills are apt-installable on a standard Linux host: the package resolves, the check passes, the number looks clean. But installable means the archive arrived. It says nothing about the browser binary at a specific path, the OAuth flow that must complete before first use, or the API key that must be in the environment. “Passed the scan, installed cleanly, and still doesn’t work right” is not a rare edge case. It is the dominant complaint.

Blind spot 2 · Supply chain

Supply-chain attacks like ClawHavoc (a malicious-skill campaign, January 2026) produce no crash, no error, and no GitHub issue. Public ClawHavoc reports describe hundreds of skills distributing AMOS, a macOS credential-harvesting trojan, that installed cleanly, passed scanners, ran silently, and exfiltrated without the user knowing. Casual skill installers don’t appear in GitHub issue trackers because they don’t know they were hit.

Methods

We scanned all 6,993 public OpenClaw skills for Linux installability and dependencies, synthesized published security research (Snyk ToxicSkills and public ClawHavoc reporting), and classified 16,635 user reports across three ecosystems into 12 pain categories spanning install, correctness, security, and discovery (κ = 0.71). Security pain on GitHub is 1.4% [1.18%, 1.64%] (95% Wilson CI). Both datasets and scripts are open: doi.org/10.5281/zenodo.19691714.

The four failure modes, by who detects them

Failure modes by detector: registry, scanner, and user reports each catch a different slice.
Failure mode · Registry catches? · Scanner catches? · Users report?

Latent code flaw · OAuth over-scope, injection patterns, credential mishandling.
  Registry: No. Installs the archive; does not analyse the code.
  Scanner: Yes. Snyk ToxicSkills flags 36.8% of skills.
  Users: Rarely. Surfaces only after exploitation; latent risks are invisible.

Supply-chain compromise · Skill installs cleanly, exfiltrates silently. Public ClawHavoc reports range from 341 to 824 skills.
  Registry: No. No build-time provenance check at the registry layer.
  Scanner: Post-campaign. Koi catches via reputation / fingerprints, not pre-publish.
  Users: Never. No crash, no error, no GitHub issue. Survivorship absence.

Operational correctness · Skill ran, returned plausible-looking output, was wrong.
  Registry: No. “Installable” is silent on whether the output is correct.
  Scanner: No. Static scanners cannot evaluate runtime output.
  Users: Yes, 55.5%. Quality + silent-failure + compat + install on GitHub.

Discovery · Hard to find, compare, or evaluate a skill before installing.
  Registry: Search only. Listing exists; ranking and evaluation signals are weak.
  Scanner: No. Discovery is not a security or installability concern.
  Users: Yes (floor). Non-developers post on Reddit; rarely file issues. The 1.7% GitHub figure is a lower bound.
Each row is the same gap viewed from a different angle. The largest miss, operational correctness, is the row where users see the failure but registries and static scanners do not measure it.

Study design and data

  • View A: LLM-classified, validated at κ = 0.71. 16,635 public reports classified by Claude Haiku 4.5, validated against Sonnet 4.6 on a 61-item sample. Zero operational↔security confusions were observed in the sample. Findings are cross-ecosystem: OpenClaw, Vercel skills.sh, and Anthropic’s Claude Code (the latter two appear only here, as they use different formats incompatible with the catalog scan).
  • View B: cited as-published reference data. Snyk ToxicSkills and public ClawHavoc reporting; we quote published numbers with attribution and did not rerun either audit. Numbers are directionally consistent; some exact figures are not independently verifiable.
  • View C: first-party, reproducible. Full scan of the openclaw/skills monorepo (6,993 skills, April 2026). Each skill ships with a SKILL.md file containing metadata, declared binary dependencies, and a system prompt; the catalog scan reads these directly.

Datasets

Catalog: 6,993 skills, 23 columns · User reports: 16,635 classified mentions. CC BY 4.0.

DOI: 10.5281/zenodo.19691714

Scripts

Reproducible build. Clone the repo, run the script, diff the CSV. MIT.

Cite this study (BibTeX)
@misc{vessel_agent_skills_ecosystem_2026_04,
  author       = {Bhardwaj, Mehul},
  title        = {The AI Agent Skills Ecosystem: Installability, Security, and User Pain},
  year         = {2026},
  month        = {4},
  publisher    = {Vessel},
  doi          = {10.5281/zenodo.19691714},
  url          = {https://doi.org/10.5281/zenodo.19691714},
  howpublished = {\url{https://vesselofone.com/research/ai-agent-skills-ecosystem}},
  license      = {CC-BY-4.0}
}

View A · User-reported pain

The measurement nobody publishes

First-party, LLM-classified: 16,635 of 16,840 public user reports from GitHub Issues (n = 10,256), Hacker News (n = 5,097), and Reddit (n = 1,282), classified into 12 analytical categories using Claude Haiku 4.5 with a 12-type enum schema. These are collected-report proportions, not ecosystem prevalence estimates. The full methodology, classifier prompt, and validation data are in the Methodology section below.

What each source reveals differently

Silent failure · GitHub 18.4% vs HN 0.5%. Developers reproduce wrong output and file a ticket. HN readers don’t know the output was wrong; they only see that the skill ran.

Security · HN 5.1% vs GitHub 1.4%. Security is discussed on HN, not filed as bugs. OAuth over-scoping and credential exposure are invisible to users until exploited.

Discovery · Reddit 8.3% vs GitHub 1.7%. Non-developers post “I can’t find a skill for X” on Reddit. They don’t file GitHub issues. The GitHub discovery figure is a floor.

The asymmetry defines the monitoring surface

The 18.4% vs 0.5% divergence isn’t about different user populations. Both GitHub and HN skew towards developers. When a developer finds wrong output, they open a GitHub issue. The same developer doesn’t post about it on HN; HN is where security risks and ecosystem concerns get discussed, not where reproducible bugs get filed. Silent failure only surfaces when the person running the skill actively verifies the output. The checks that do that (schema validation, expected-structure assertions, confidence signals) need to be in the system, not left to whoever happens to check.
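What such a check looks like at the install site: a minimal sketch that turns a silent failure into a loud one. The function name, the expected fields, and the JSON shape are illustrative stand-ins, not any skill’s real contract.

python
import json

EXPECTED_KEYS = {"title", "url", "summary"}  # hypothetical contract for a search-style skill

def verify_skill_output(raw: str) -> list[dict]:
    """Fail loudly if the output doesn't parse, is empty, or misses expected fields."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as e:
        raise RuntimeError(f"skill output is not valid JSON: {e}") from e
    if not isinstance(items, list) or not items:
        raise RuntimeError("skill returned an empty or non-list result")
    for i, item in enumerate(items):
        if not isinstance(item, dict) or EXPECTED_KEYS - item.keys():
            raise RuntimeError(f"result {i} does not match the expected shape")
    return items

The specific assertions matter less than where they live: the check runs on every call, not only when a developer happens to look.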

Bug reports by cluster × source

The 10 signal categories collapse into 4 clusters. Each row below is a cluster; within each cluster, the three sources show what share of their reports landed there. GitHub leads operational correctness at 55.5%; HN leads security at 5.1%; Reddit leads docs & discovery at 15.0%. Same data, three vantage points.

Cluster share by source: each row is a cluster; percentages are within-source.

Cluster · GitHub (n = 10,256) · Hacker News (n = 5,097) · Reddit (n = 1,282)
  • Operational (quality · silent-failure · compat · install): 55.5% · 8.8% · 16.8%
  • Docs & discovery (docs · discovery · registry): 9.9% · 5.4% · 15.0%
  • Maintenance (abandoned · cross-skill conflicts): 3.1% · 1.3% · 3.1%
  • Security (runtime consent violations): 1.4% · 5.1% · 3.8%

Percentages are each cluster’s share of that source’s reports. Rows do not sum to 100% because other and noise are excluded. GitHub security CI: [1.18%, 1.64%]; discovery CI: [1.49%, 2.00%]*.

Pain × source · full breakdown

All 10 signal categories drilled down by source, grouped by cluster, with raw counts and within-source percent. The bottom two rows (other, noise) are residual buckets, shown for transparency rather than as findings. HN: 54.7% noise filtered · Reddit: 32.0% noise filtered.

Pain type (cluster) · GitHub n = 10,256 · Hacker News n = 5,097 · Reddit n = 1,282
  • quality (operational) · 20.72% (2,125) · 6.96% (355) · 7.57% (97)
  • silent-failure (operational) · 18.44% (1,891) · 0.49% (25) · 1.79% (23)
  • compat (operational) · 9.04% (927) · 0.47% (24) · 2.18% (28)
  • install (operational) · 7.26% (745) · 0.82% (42) · 5.30% (68)
  • docs (docs & discovery) · 5.35% (549) · 2.10% (107) · 6.16% (79)
  • discovery (docs & discovery) · 1.73% (177) · 2.53% (129) · 8.27% (106)
  • registry-meta (docs & discovery) · 2.87% (294) · 0.73% (37) · 0.55% (7)
  • maintenance (maintenance) · 2.36% (242) · 0.80% (41) · 1.95% (25)
  • cross-skill (maintenance) · 0.69% (71) · 0.47% (24) · 1.17% (15)
  • security (security) · 1.39% (143) · 5.06% (258) · 3.82% (49)
  • other (residual) · 22.82% (2,340) · 24.88% (1,268) · 29.25% (375)
  • noise (residual) · 7.33% (752) · 54.68% (2,787) · 31.98% (410)

* The discovery figure (1.7%) is a lower bound, not a prevalence estimate. GitHub Issues are filed by developers who can reproduce and articulate a bug; non-developer users who can’t find a skill abandon silently or post on Reddit instead. See Limitation 8.

View B · Scanner and audit findings

What security scanners catch, and what they miss

First-party scan: all 6,993 skills via SKILL.md pattern analysis, ClawHub signals, and GHSA cross-reference. Reproduce: python scripts/security-scan.py in the openclaw-skills repo. Cited as-published: Snyk ToxicSkills and public ClawHavoc reporting. Their corpora are not fully disclosed, so we treat those numbers as reference data, not precision comparisons.

  • Dangerous (first-party): 9.2% · 642 of 6,993 skills · active threat signals.
  • Caution (first-party): 43.4% · 3,036 of 6,993 skills · risk patterns present.
  • Snyk-flagged (cited): 36.8% · Snyk ToxicSkills · 3,984 skills scanned.

Three lenses on the install-time threat surface, not three readings of the same number. The two first-party columns are reproducible; the Snyk column is cited as-published. The runtime security figure in View A (1.4% of GitHub reports) measures a different surface and cannot be summed with these. The three failure types below explain why.

How each security failure type surfaces

Security failure types, by scanner detection and user visibility:

Latent code flaws · OAuth over-scoping, injection, credential mishandling.
  Scanner detection: High. Static patterns; this is what Snyk’s 36.8% measures.
  User-visible: After exploit only. No surface event until the flaw is triggered.

Supply chain compromise · ClawHavoc-class campaigns: malware via legitimate-looking skills (public reports range from 341 to 824 flagged skills).
  Scanner detection: Post-campaign only. Reputation methods catch known campaigns; novel ones evade.
  User-visible: Never. Installs cleanly, exfiltrates silently. No crash, no error, no issue.

Runtime consent violations · Agents acting visibly outside intended scope.
  Scanner detection: None. No public scanner covers this class today.
  User-visible: ~1% of reports. The only failure mode users can see, and even there the signal is faint.

Snyk’s findings, ranked

OAuth over-scoping fires on 70.1% of audited skills, but Snyk classifies it as informational — rule-detectable, rarely exploitable on its own. The threshold that matters is “at least one non-informational finding,” which holds for 36.8%. Critical-severity findings (RCE, unauthenticated sinks) hit 13.4%, roughly 534 of the 3,984 audited skills.

Snyk ToxicSkills audit, published findings (n = 3,984 skills audited; per-skill list not disclosed):
  • OAuth scope wider than task · skill requests permissions it demonstrably does not use · 70.1% · high
  • Command-injection pattern · shell-exec in SKILL.md body with user-controlled input · 43.4% · high
  • At least one security flaw · any non-informational finding across all rules · 36.82% · medium
  • Critical severity · RCE, creds-in-repo, unauthenticated sink · 13.4% · critical

Prevalence is within Snyk’s 3,984-skill corpus; the severity rating is independent of prevalence.

Scanner coverage matrix

The four scanners in the matrix don’t overlap on much. Snyk catches static patterns. Koi catches live campaigns via reputation. Our first-party scan found 9.2% dangerous and 43.4% caution across all 6,993 skills. None of them catch silent wrong-output bugs or environment-drift breakage; those require runtime evaluation, which no public scanner does today.

Scanner coverage matrix: what each published audit measures.

Credentials hard-coded in SKILL.md
  Snyk ToxicSkills: caught · pattern: `token\s*=`, known API formats.
  Koi ClawHavoc: not measured · not a primary focus.
  Vessel skill-check: caught · entropy + known-prefix regex.

OAuth scopes wider than declared task
  Snyk ToxicSkills: caught · requested scopes vs. task description.
  Koi ClawHavoc: not measured.
  Vessel skill-check: caught · intent-vs-scope classifier.

Command injection in SKILL.md shell blocks
  Snyk ToxicSkills: caught · AST grep for shell-exec patterns.
  Koi ClawHavoc: not measured.
  Vessel skill-check: caught · same AST pattern set.

Adversarial prompt instructions (prompt injection)
  Snyk ToxicSkills: not measured · natural language, no fixed pattern.
  Koi ClawHavoc: caught · known-campaign instruction fingerprints.
  Vessel skill-check: partial · trained patterns only; novel phrasings slip past.

Malicious skills in active campaigns
  Snyk ToxicSkills: not measured · point-in-time scan, no reputation.
  Koi ClawHavoc: caught · live campaign tracking (ClawHavoc).
  Vessel skill-check: not measured · single-skill audit, not reputation.

The gap · not covered by any published scanner
  Silent wrong-output bugs: not caught by any · requires runtime evaluation.
  Environment-drift / install-breaks-for-me: not caught by any · not in scope.
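For concreteness, a sketch of the “entropy + known-prefix regex” technique named in the Vessel column. The prefix list and the 3.5-bit threshold are assumptions for illustration; a production scanner would carry a maintained rule set.

python
import math
import re

# Known credential prefixes (illustrative subset): Anthropic, GitHub, AWS.
KNOWN_PREFIXES = re.compile(r"\b(?:sk-ant-|ghp_|AKIA)[A-Za-z0-9_\-]{8,}")
# Long opaque value assigned to a secret-looking name, per the `token\s*=` pattern above.
ASSIGNMENT = re.compile(r"""(?:token|key|secret|password)\s*=\s*["']?([^\s"']{16,})""", re.I)

def shannon_entropy(s: str) -> float:
    """Bits per character; random API keys score noticeably higher than prose."""
    return -sum(s.count(c) / len(s) * math.log2(s.count(c) / len(s)) for c in set(s))

def scan_skill_md(text: str) -> list[str]:
    findings = [m.group(0) for m in KNOWN_PREFIXES.finditer(text)]
    for m in ASSIGNMENT.finditer(text):
        if shannon_entropy(m.group(1)) > 3.5:  # illustrative threshold
            findings.append(m.group(0))
    return findings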

The community moved on the supply chain gap before the formal audits landed. Our scrape of OpenClaw skill repos from January through April 2026 captures multiple independently built scanners — SkillScan, Aguara, ClawSec Monitor, ClawSecure among them — each launched in direct response to the absence of registry-level trust signals. Their published prevalence numbers vary by methodology, but every one of them put dangerous-or-malicious prevalence in the double digits.

Named incidents in the same window confirm the supply chain risk is not theoretical: top-downloaded ClawHub skills distributing malware, plain-text malicious payloads visible in SKILL.md files, flagged weeks before the formal scanner reports were published.

What scanners fundamentally miss

A SKILL.md body is adversarial natural language, not an executable you can hash. A static scanner finds the patterns it was trained on; it does not catch a skill that installs cleanly, passes audit, and still emits silently wrong output. That gap is what we measured in View A.

Supply chain attacks are the deeper blind spot. The ~1% security figure in View A is not evidence that risk is low; it is a survivorship artifact. The users most exposed to ClawHavoc-style attacks are casual installers who don’t know they were compromised, so they are structurally absent from issue trackers.

Cited sources

Snyk ToxicSkills (2026)

Static code scan for credential leaks, OAuth scope width, and injection patterns. Published corpus: 3,984 skills from ClawHub and skills.sh. Read the Snyk write-up.

ClawHavoc reporting (Jan-Feb 2026)

Public reports describe a live malicious-skill campaign with counts ranging from 341 early identified skills to 824 later flagged skills. Reputation + campaign fingerprint methodology. ClawTank early report · LaunchMyOpenClaw scale report.

View C · Registry installability

The registry: 97.0% installable on Linux

First-party, reproducible: full scan of 6,993 public OpenClaw skills. We parsed every SKILL.md, resolved declared binaries against a curated install map, and publish the full funnel, dep taxonomy, and blocker list. This is a dependency-resolution analysis, not an execution test.

If you self-host a Linux box, this is the only install step you need. One apt line covers 6,784 of 6,993 public OpenClaw skills. No Docker, no uv, no language managers.

One command · covers 97.0% of skills

apt install python3 python3-pip nodejs npm curl jq git ffmpeg sqlite3 openssl ripgrep

Satisfies every declared binary dependency for 6,784 of 6,993 skills. No Homebrew, no uv, no Docker, no language managers.

How the 97.0% breaks down

Five buckets, in order. Most skills need nothing at all — pure prompts plus an API key. The apt line absorbs almost everything else. What’s left is a small tail: a pip/curl-installable batch and a handful tied to the Mac desktop.

  • Skills scanned (every skill in openclaw/skills): 6,993
  • Declare their deps in SKILL.md (97.8% transparent): 6,841
  • Run as-is, zero install (82.6% of scanned): 5,777
  • Covered by the apt line (+1,007 resolved via apt): 6,784
  • Outside the apt line (59 brew-only · 150 pip/curl/macOS-desktop): 209
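The funnel can be recomputed from the published catalog CSV. A sketch, assuming hypothetical column names (declared_bins, install_path); the real 23-column schema is documented in the dataset README.

python
import pandas as pd

df = pd.read_csv("coverage-output/catalog.csv")  # path illustrative

total = len(df)                                              # expect 6,993
zero_dep = (df["declared_bins"].fillna("") == "").sum()      # expect 5,777: run as-is
apt_ok = df["install_path"].isin(["zero-dep", "apt"]).sum()  # expect 6,784
print(total, zero_dep, apt_ok, total - apt_ok)               # last term: the 209-skill tail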

What the apt line unlocks

5,777 of the 6,993 skills ship with zero binary dependencies. The chart below is the long tail: the 1,216 skills that do declare binaries, and the 15 most-declared bins among them. Almost every bin at the head is either preinstalled on Ubuntu or one apt away.

  • python3 · 536 skills · apt install python3
  • curl · 313 skills · apt install curl
  • node · 178 skills · apt install nodejs
  • jq · 78 skills · apt install jq
  • python · 45 skills · apt install python3
  • npx · 32 skills · apt install nodejs
  • bash · 30 skills · preinstalled
  • uv · 28 skills · curl install script
  • npm · 28 skills · apt install npm
  • mcporter · 24 skills · direct download
  • pip · 23 skills · apt install python3-pip
  • git · 21 skills · apt install git
  • ffmpeg · 19 skills · apt install ffmpeg
  • pip3 · 13 skills · apt install python3-pip
  • yt-dlp · 9 skills · pip install yt-dlp

Beyond the apt line

About 3.0% of the monorepo declares a binary the apt command doesn’t cover. Most are one pip install or a curl script away: yt-dlp, uv, bun, Foundry’s cast. Named bins below.

  • cast · apt-missing · 8 skills
  • pbpaste · apt-missing · 6 skills
  • osascript · apt-missing · 4 skills
  • google-chrome-stable · apt-missing · 4 skills
  • xvfb-run · apt-missing · 4 skills
  • ssh · apt-missing · 3 skills
  • awk · apt-missing · 3 skills
  • dbus-launch · apt-missing · 3 skills
  • vdirsyncer · brew-only · 2 skills
  • khal · brew-only · 2 skills

apt-missing = not packaged for apt-get. brew-only = macOS-only Homebrew formula.

Excludes 35 self-references (openclaw/claude itself) and 12 pip/npm-installable tools.

How the most-installed skills fare

The top of the distribution is the test that matters for most self-hosters: do the skills people actually install run on a server? Nine of the top 10 run as-is after the apt line. caldav-calendar is the only genuine macOS blocker in the top tier.

Rank · Skill (slug · author) · Downloads · Deps · Linux
  1. Humanizer (humanizer · biostartechnology) · 92,489 · none · runs as-is
  2. Proactive Agent Lite (proactive-agent-lite · bestrocky) · 33,577 · none · runs as-is
  3. Xiaohongshu (小红书) Automation (xiaohongshu-mcp · borye) · 31,510 · none · runs as-is
  4. Tavily AI Search (tavily · bert-builder) · 30,814 · none · runs as-is
  5. Pdf (pdf · awspace) · 30,151 · none · runs as-is
  6. Docker Essentials (docker-essentials · arnarsson) · 27,684 · docker · apt install
  7. AgentMail (agentmail · adboio) · 27,495 · none · runs as-is
  8. Web Search (web-search · billyutw) · 26,571 · none · runs as-is
  9. Humanizer (ai-humanizer · brandonwise) · 26,545 · none · runs as-is
  10. Caldav Calendar (caldav-calendar · asleep123) · 25,777 · vdirsyncer, khal · macOS only

What you can’t get on Linux

After the pip/curl tail, the residual is small. 10 skills depend on macOS-only binaries for their primary capability. Four control a desktop app (AirPlay speakers, Adobe Photoshop, the Trae IDE, the macOS wallpaper daemon) and genuinely don’t port. The other six read text from the Mac clipboard. The transformation itself is general, so the capability transfers to Linux by swapping pbpaste for file input or xclip/wl-paste. Closest catalog matches below.

Skill · Downloads · Mac bin · Linux replacement in catalog

  • Airfoil · 2,096 · osascript · no catalog equivalent; desktop-app capability.
    Control AirPlay speakers via Airfoil from the command line. Connect, disconnect, set volume, and manage multi-room audio with simple CLI commands.
  • Photoshop Automator · 1,123 · osascript · no catalog equivalent; desktop-app capability.
    Automate Adobe Photoshop on Windows via ExtendScript to run scripts, update text layers, create layers, apply filters, play actions, and export images.
  • Nerve Bridge Skill · 591 · osascript · no catalog equivalent; desktop-app capability.
    Bi-directional control of Trae via macOS AppleScript with built-in feedback mechanism. Use when needing to execute code/commands in Trae IDE and wait for com...
  • reply-coach · 280 · pbpaste · copy-editing (file-based text rewrite)
  • reviewer-rebuttal-coach · 255 · pbpaste · copy-editing (file-based rewrite, swap pbpaste for file input)
  • collab-offer-polisher · 253 · pbpaste · medical-email-polisher (closest catalog match for polishing business messages)
  • wallpaper-auto-switch-pro-executable · 249 · osascript · no catalog equivalent; desktop-app capability.
  • policy-to-checklist · 248 · pbpaste · afrexai-qa-test-plan (document-to-checklist generator)
  • claim-risk-auditor · 244 · pbpaste · verify-claims (direct claim / fact-check equivalent)
  • rubric-gap-analyzer · 244 · pbpaste · afrexai-interview-architect (rubric-based scoring, file input)

Decision framework

A skill can install cleanly, pass every scanner, and still produce silently wrong output. Quality (20.7%) and silent-failure (18.4%) are the top two user-reported failure modes, and neither is what registries or scanners are built to catch. The minimum steps to close the gap, by role.

If you write skills

publishing to a registry

  1. Write at least one correctness test. (18.4% silent-failure rate.) A behavioral check, not a lint: one golden input → expected output. Catches the “ran and returned garbage” class of bug that exits 0 and looks fine. A minimal sketch follows this list.
  2. Fail loudly when uncertain. (20.7% quality complaints.) Return an explicit error or confidence signal. The gap between quality and silent-failure is whether the user knows something went wrong.
  3. Pin binary versions and declare every env var. (11.4% maintenance + compat.) Most skills break 3–6 months after a working install when something upstream changes. SKILL.md is the contract.
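A minimal sketch of item 1, assuming a hypothetical run_skill entry point; the golden fixture and the assertions are illustrative, not a prescribed harness.

python
# Golden input → expected output. `run_skill` is a stand-in for your
# skill's actual invocation.
GOLDEN_INPUT = "The apt line covers 6,784 of 6,993 skills, measured April 2026."

def test_golden_summary():
    result = run_skill("summarize", text=GOLDEN_INPUT)
    # Behavioral checks, not lint: the process exits 0 either way,
    # so assert on the output itself.
    assert result["summary"].strip(), "ran but returned an empty summary"
    assert "6,784" in result["summary"], "dropped the load-bearing number"
    assert len(result["summary"]) <= len(GOLDEN_INPUT), "summary grew the input"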

If you install skills

deploying into a workflow

  1. Run a scanner before installing. (36.8% have a scanner-flagged issue.) Static scan is cheap insurance for the latent risks users can’t see by reading SKILL.md.
  2. Test with your own representative inputs. (55.5% of reported pain is operational.) “Installs cleanly + passes scanner” is not a production filter. Neither registries nor scanners measure correctness on your data.
  3. Monitor for wrong output, not just errors. (A 0% error rate ≠ correctness.) Silent failure means users hit wrong-output bugs before crash bugs. Spot-check outputs or assert against an expected schema in production.
  4. Schedule a re-review at 6 months. (11.4% of pain surfaces post-install.) APIs change, deps drift, model behavior shifts. The bug that wasn’t there at install will be there in two quarters.

If you package skills into a product

embedding into something other people consume

  1. Build correctness monitoring into your runtime. (55.5% of pain lands on you, not the author.) Customers will hit operational bugs first and escalate to you, not the upstream skill author. Public registries and scanners don’t cover this layer — it’s yours to build.
  2. Surface confidence and “I’m uncertain” signals. (18.4% silent-failure → known unknown.) The dominant failure mode is a skill that ran and was silently wrong. A confidence signal converts a silent failure into something users can act on.
  3. Track correctness independently of security. (36.8% scanner-flagged, yet scanner-clean ≠ correct.) Scanner-clean skills still generate the majority of support load. Correctness is a separate measurement from safety and needs its own dashboard.

Implications

Four predictions for 2026–2028, each with an explicit falsification criterion. Confident enough to be wrong about. Each is a direct read on what the three-view gap would have to produce if the structural diagnosis here is right.

  1. Skill-level runtime correctness telemetry emerges as a paid layer.

    Agent-level evaluation is already crowded. LangSmith, Braintrust, Patronus, Langfuse, and Phoenix all measure end-to-end agent behaviour. The gap is the install-site layer below: per-skill output-shape monitoring, schema assertions, and confidence signals attached to the individual tool the agent just called. The 55.5% operational pain figure says someone has to fill it.

    Falsified if not: at least one VC-backed startup ships per-skill output validity monitoring at the install site (not agent-level), by 2027-04.
  2. At least one more ClawHub-class supply-chain incident lands in the next 12 months.

    ClawHavoc was not a one-off. It was the first incident large enough to be named. The structural conditions that made it possible (no pre-publish review, no provenance check, install-time trust) have not changed. The economics for an attacker (hundreds of flagged skills in public reporting, no crash, no error) only get better as the install base grows.

    Falsified if not: ≥1 publicly named malicious-skill campaign affecting >100 skills, reported by a recognised security vendor or news outlet, by 2027-04.
  3. Enterprise procurement starts asking for correctness SLAs, not just security attestations.

    Today’s agent-vendor RFPs ask SOC2-style security questions. The next round will ask the question View A actually exposed: what is your wrong-output rate, and what do you do when a skill returns plausible-but-bad output? Procurement catches up with the failure mode roughly two years after the field documents it.

    Falsified if not: a widely circulated agent-vendor RFP template (Gartner, IAPP, or a Fortune 500 sample) explicitly asks for skill-level correctness measurements, by 2028-04.
  4. Skill registries add some form of test/eval requirement for marketplace inclusion.

    Registries today gate on installability and (sometimes) static security scan. Neither catches the dominant failure mode. The first registry to publish a “verified skill” tier with an automated correctness check wins the trust premium, and the rest follow within a release cycle.

    Falsified if not: ClawHub, skills.sh, or another major registry ships a verified-skill tier conditioned on at least one automated correctness check (golden input → expected output, schema assertion, or runtime invariant), by 2027-10.

Predictions will be re-scored alongside each refresh of this study. The changelog records hits and misses.

Methodology

View A · User-reported pain

Sources. GitHub Issues API (three repos, paginated to GitHub’s 100-page cap). Hacker News via Algolia (11 keywords: "openclaw", "SKILL.md", "claude skill", "claude code skill", "clawhub", "claw skill", "openclaw/skills", "skill registry", "agent marketplace", "skill broken", "skill doesn’t work"). Reddit public JSON across 12 subreddits. Total scraped: 16,840. Classified: 16,635 (98.8%; 205 unclassified due to rate-limit exhaustion during the run).

Signal quality by source.

On-topic share within each scrape, after the classifier filters noise (keyword false positives, off-topic English uses of "skill"). GitHub is the cleanest channel; Hacker News is the noisiest. Headline percentages in View A use the on-topic counts as their denominator.

GitHub Issues: 9,504 on-topic · 752 off-topic · 92.7% on-topic. 3 repos: vercel-labs/skills, anthropics/claude-code, anthropics/claude-agent-sdk-typescript. Reproducible, specific bugs.

Reddit: 872 on-topic · 410 off-topic · 68.0% on-topic. 12 subreddits. Some operators troubleshoot in public; lots of adjacent LLM chatter.

Hacker News: 2,310 on-topic · 2,787 off-topic · 45.3% on-topic. Algolia keyword matches. The majority are unrelated English uses of "skill".

Classifier. Claude Haiku 4.5 (claude-haiku-4-5-20251001) via the Anthropic API, tool-use structured output with an enum-constrained schema. Single-label classification with a short (<50 char) free-text reason. Ephemeral prompt cache on the system message (12-type taxonomy definitions) for a ~90% input-token discount after first hit.

Taxonomy. 12 types: compat, silent-failure, maintenance, security, quality, docs, cross-skill, discovery, registry-meta, install, other, noise. Full definitions in the dataset README.
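A sketch of the classification call, assuming the current Anthropic Python SDK. The tool schema mirrors the 12-type enum; the system prompt here is an abbreviated stand-in for the published taxonomy definitions.

python
import anthropic

TYPES = ["compat", "silent-failure", "maintenance", "security", "quality", "docs",
         "cross-skill", "discovery", "registry-meta", "install", "other", "noise"]
TAXONOMY_DEFINITIONS = "…full 12-type definitions as published in the dataset README…"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def classify(report_text: str) -> dict:
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        # cache_control on the system block yields the ~90% input-token
        # discount after the first hit.
        system=[{"type": "text", "text": TAXONOMY_DEFINITIONS,
                 "cache_control": {"type": "ephemeral"}}],
        tools=[{
            "name": "classify_report",
            "description": "Assign exactly one pain type to a user report.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "pain_type": {"type": "string", "enum": TYPES},
                    "reason": {"type": "string", "maxLength": 50},
                },
                "required": ["pain_type", "reason"],
            },
        }],
        tool_choice={"type": "tool", "name": "classify_report"},  # force the enum output
        messages=[{"role": "user", "content": report_text}],
    )
    return msg.content[0].input  # e.g. {"pain_type": "silent-failure", "reason": "..."}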

View B · Scanner and audit findings

Cited as-published: Snyk ToxicSkills (2026) and public ClawHavoc reporting (January-February 2026). We did not rerun these external audits; we quote their headline numbers with attribution and treat them as reference data, not precision-comparable measurements.

View C · Registry installability

Full-catalog scan of openclaw/skills: parse every SKILL.md, extract declared binary dependencies, resolve each against a curated Linux install map (apt / direct download / runtime fetch / brew-only / no-Linux-path). ClawHub metadata and last-commit activity joined per skill. Script: catalog-coverage.py.
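In outline, the resolution step looks like the sketch below; the map entries are a small illustrative subset of the curated install map that ships with catalog-coverage.py.

python
# Declared binary → install path; the worst dependency determines the bucket.
INSTALL_MAP = {
    "python3": "apt", "curl": "apt", "node": "apt", "jq": "apt",
    "git": "apt", "ffmpeg": "apt", "pip": "apt",
    "uv": "runtime-fetch", "yt-dlp": "pip",
    "vdirsyncer": "brew-only", "khal": "brew-only",
    "pbpaste": "no-linux-path", "osascript": "no-linux-path",
}
SEVERITY = ["no-linux-path", "brew-only", "unknown", "runtime-fetch", "pip", "apt"]

def resolve(declared_bins: list[str]) -> str:
    if not declared_bins:
        return "zero-dep"  # runs as-is
    paths = {INSTALL_MAP.get(b, "unknown") for b in declared_bins}
    return next(p for p in SEVERITY if p in paths)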

Statistical reporting

Headline percentages are reported with 95% Wilson score intervals. Wilson is preferred over the normal approximation at small p or small n: it stays inside [0, 1] and gives sensible bounds even when the count is in the single digits (matters for the 1.39% security and 1.73% discovery figures).
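The interval in code; the 143-of-10,256 check reproduces the published GitHub security CI.

python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(143, 10_256)  # GitHub security reports
print(f"{lo:.2%} – {hi:.2%}")    # 1.18% – 1.64%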

Confusion matrix

Sonnet 4.6 was rerun on a 61-item stratified random sample from the Haiku output, blind to Haiku’s labels. This is model-model agreement, not human-adjudicated ground truth. The 5-cluster collapse below shows where the two models agree (diagonal) and where they disagree (off-diagonal). The “op” cluster here includes operational correctness (quality, silent-failure, compat, install) plus ecosystem health types (maintenance, cross-skill, registry-meta) — a superset of the headline stat, which counts only the four correctness types. The single load-bearing axis of this study, operational vs. security, has zero observed op↔security cells in either direction in this validation sample. Adjacent-boundary disagreements (quality ↔ silent-failure, both operational) account for the bulk of the 16 off-diagonal pairs.

  • Exact-label agreement: 73.8% (45/61 sample items)
  • Cohen’s κ: 0.71 · substantial agreement (Landis & Koch)
  • Op↔security confusions: 0 · on the load-bearing axis of the study

Sonnet 4.6 vs Haiku 4.5 · 5-cluster collapse

diagonal = agreement · off-diagonal = confusion

Haiku ↓ / Sonnet → · operational · security · discovery · docs · residual
  operational · 22 · 0 · 2 · 3 · 1
  security · 0 · 3 · 0 · 0 · 1
  discovery · 0 · 0 · 5 · 1 · 0
  docs · 3 · 0 · 0 · 7 · 0
  residual · 3 · 1 · 1 · 0 · 8

Diagonal cells are same-cluster agreement; off-diagonal cells are cross-cluster confusions. The op↔security intersections are zero in both directions in this validation sample, which is the load-bearing axis of the study.

Snapshot from aggregate validation notes. Per-row labels will be published as validation-sample.csv in the dataset repo; this view will then be computed from data.

Classifier validation

On a 61-item stratified random sample, Claude Sonnet 4.6 agreed with the Haiku 4.5 classifications at 73.8% (κ = 0.71, substantial agreement). Zero operational↔security confusions were observed in the sample. With 0 observed events, the rule-of-three upper bound is roughly 4.9% (3/61) for this validation sample. All 16 disagreements fell at adjacent-category boundaries (e.g. quality vs. silent-failure, both operational; compat vs. docs). 9 blank Hacker News comments were excluded as unclassifiable. Raw sample and per-row labels are in the dataset repo.

Sensitivity analysis. The operational:security ratio at baseline is 40:1 (5,688 vs. 143 GitHub reports). For the ratio to drop below 5:1, the classifier would need to have misclassified more than 30% of all operational reports as security, more than 100× the observed 0% operational↔security confusion rate. The finding is structural, not marginal.

Reproducibility

View C reruns end-to-end in about 5 minutes against a public-read GITHUB_TOKEN. View A is a multi-step pipeline (collectors → classifier → aggregation) that needs an ANTHROPIC_API_KEY; a full rerun is roughly 2 hours of wall time and $10–30 of Haiku inference.

bash
# clone and install (~30s)
$ git clone https://github.com/vesselofone/openclaw-skills.git
$ cd openclaw-skills && pip install -r requirements.txt

# View C · catalog scan (~5 min, public-read token)
$ export GITHUB_TOKEN=ghp_your_public_read_token
$ python3 scripts/catalog-coverage.py --all \
    --with-metadata --with-commits \
    --output coverage-output --work-dir coverage-output/repos

# View A · ecosystem classification (~2h, ~$10-30 Haiku)
$ export ANTHROPIC_API_KEY=sk-ant-...
# see scripts/ for collectors and classifier entrypoints

Data availability and ethics

  • Dataset license: CC BY 4.0
  • Scripts: MIT
  • Snapshot: April 2026 point-in-time scrape
  • Classifier: Claude Haiku 4.5 with enum-constrained output
  • Public data handling: the study uses public GitHub, Hacker News, and Reddit text for aggregate classification. The page reports category counts and methodology examples, not individual user profiles.

Limitations

  1. Static analysis only. View C does not execute skills; it resolves declared dependencies. SKILL.md completeness is voluntary. Interpret the installability number as dependency reachability, not runtime success.
  2. Single-label classification. A mention that describes both bad output and a compat bug must be assigned one bucket. Edge cases are forced, so category percentages are analytical approximations rather than adjudicated facts.
  3. GitHub pagination cap. Basic pagination tops out at 100 pages per repo. The anthropics/claude-code corpus tail is not covered. Treat GitHub percentages as proportions of collected reports, not full ecosystem prevalence.
  4. HN noise. Keyword matching on “openclaw”/"skill" yields 54.68% unrelated hits. The noise tag filters them; do not use the HN subset without filtering.
  5. Reddit rate-limit floor. Unauthenticated Reddit reads cap out. 1,282 rows is a floor, not a true count, so Reddit is directional evidence for discovery pain rather than a prevalence estimate.
  6. Classifier precision. Validation measures model-model agreement, not human-labeled ground truth. On a 61-item validation sample, Sonnet 4.6 agreed with Haiku at 73.8% (κ = 0.71), with zero observed operational↔security confusions. The sensitivity analysis shows the operational:security ratio would require >30% systematic misclassification to drop below 5:1.
  7. Moderation status is not security. ClawHub flags reflect staff moderation, not a security audit. They are registry metadata, not an independent safety label.
  8. User sample is developer-skewed. 61.7% of classified reports come from GitHub Issues, which are filed predominantly by developers integrating skills, a population that is unusually good at finding things. Pain modes that are visible to that population (silent wrong output, compat, install failures in their own environment) are probably faithfully represented. Pain modes that are more acute for non-developers (discovery, decision-fatigue across similar skills, onboarding friction) are likely under-counted. The 1.73% discovery figure should be read as a lower bound among developer reporters, not as a prevalence estimate across all skill users.

Frequently asked

Common questions from readers and reviewers. If yours isn’t here, open an issue on the dataset repo.

Why juxtapose registry, scanner, and user views at all?

Because each view answers a different question, and policy conversations about the skills ecosystem routinely confuse them. “Is the ecosystem healthy?” depends on whether you mean installable, safe, or operationally correct. The three answers point in different directions: high installability, a materially lower scanner-clean surface, and a GitHub pain profile dominated by operational correctness. That spread changes what you’d build next.

Why is security around 1% of user-reported pain but 37% of scanner findings?

The survivorship argument is in the introduction. In short: ClawHavoc-style attacks can produce no crash, no error, and no GitHub issue. The 1.39% measures conscious, reportable pain from technical users. It does not measure silent compromise.

Why Claude Haiku 4.5 instead of Opus for classification?

Cost and throughput. 16,000 single-label classifications with ephemeral prompt-caching on the taxonomy run in ~2 hours on Haiku at roughly $10–30 all-in. Opus would push the run into hundreds of dollars without meaningfully changing the distribution for a 12-class, enum-constrained schema. We publish the classifier prompt; anyone can rerun on a different model to verify.

Why is Hacker News so noisy?

54.68% of HN matches are unrelated keyword collisions (the word "skill" has strong non-AI-agent English usage). The noise tag filters them. If you cite an HN subset, filter first.

Do OpenClaw skills work on Linux?

Yes. 97.0% of the 6,993 public skills run on Linux after one apt install line. A small tail (~209) needs a pip install or a curl script, and about 10 skills depend on macOS-only binaries for their primary capability.

What is the one apt install line?
sudo apt install python3 python3-pip nodejs npm curl jq git ffmpeg sqlite3 openssl ripgrep

Satisfies every declared binary dependency for 6,784 of 6,993 public skills on Ubuntu 22.04/24.04 or Debian 12.

Is this a security audit?

No. View C is static compatibility analysis, not a security audit. View B cites external audits. View A classifies what users report, not what is objectively true. Per-skill security scoring is a separate pass; a free per-slug auditor is at vesselofone.com/tools/skill-check.

Will this study be refreshed?

View C is reproducible: clone the repo, re-run the script, diff the CSV. View A is point-in-time (April 2026); a future refresh will ship as a GitHub release with an immutable Zenodo DOI. The changelog lives on this page.

Changelog

  • 2026-04-20 · v1.0. Initial publication. Catalog scan (6,993 skills, one apt line). Ecosystem classification (16,635 mentions, 12-type pain taxonomy).

Your expertise deserves its own machine.

Private VMs. No shared infrastructure. Your agents, your data, your rules.