The AI Agent Skills Ecosystem:
Installability, Security, and User Pain
By Mehul Bhardwaj · Vessel
Abstract
We study the AI-agent skills ecosystem from three angles: registry installability, scanner-detected security risk, and user-reported pain. As of April 2026, 97.0% of OpenClaw skills install on Linux, Snyk flags 36.8% of skills for latent code flaws, and 55.5% of user complaints filed on GitHub describe wrong, missing, or broken output. Each number has a different denominator; the gap between them is the finding. Registries catch what fails to install. Scanners catch latent risk. No public tool watches for the skill that installs cleanly, passes audit, and silently produces wrong output, the failure users actually hit.
| View | Headline | What it measures |
|---|---|---|
| Registry | 97.0% | Linux-installable after one apt line. |
| Scanner | 36.8% | Snyk scanner-flagged skills. |
| GitHub pain | 55.5% | Operational-correctness complaints. |
Definition
By operational correctness we mean failures where a skill installs and runs, but the output is wrong, incomplete, silently empty, schema-invalid, or broken by environment drift. 55.5% of GitHub user complaints describe failures of this kind. Neither public scanners nor registries measure it directly.
Blind spot 1 · Registry
The registry shows 97.0% of OpenClaw skills are apt-installable on a standard Linux host: the package resolves, the check passes, the number looks clean. But installable means the archive arrived. It says nothing about the browser binary at a specific path, the OAuth flow that must complete before first use, or the API key that must be in the environment. “Passed the scan, installed cleanly, and still doesn’t work right” is not a rare edge case. It is the dominant complaint.
Blind spot 2 · Supply chain
Supply-chain attacks like ClawHavoc (a malicious-skill campaign, January 2026) produce no crash, no error, and no GitHub issue. Public ClawHavoc reports describe hundreds of skills distributing AMOS, a macOS credential-harvesting trojan, that installed cleanly, passed scanners, ran silently, and exfiltrated without the user knowing. Casual skill installers don’t appear in GitHub issue trackers because they don’t know they were hit.
Methods
We scanned all 6,993 public OpenClaw skills for Linux installability and dependencies, synthesized published security research (Snyk ToxicSkills and public ClawHavoc reporting), and classified 16,635 user reports across three ecosystems into 12 pain categories spanning install, correctness, security, and discovery (κ = 0.71). Security pain on GitHub is 1.4% [1.18%, 1.64%] (95% Wilson CI). Both datasets and scripts are open: doi.org/10.5281/zenodo.19691714.
The four failure modes, by who detects them
| Failure mode | Registry catches? | Scanner catches? | Users report? |
|---|---|---|---|
| Latent code flaw: OAuth over-scope, injection patterns, credential mishandling. | No. Registry installs the archive; does not analyse the code. | Yes. Snyk ToxicSkills flags 36.8% of skills. | Rarely. Surfaces only after exploitation; latent risks are invisible. |
| Supply-chain compromise: skill installs cleanly, exfiltrates silently. Public ClawHavoc reports range from 341 to 824 skills. | No. No build-time provenance check at the registry layer. | Post-campaign. Koi catches via reputation / fingerprints, not pre-publish. | Never. No crash, no error, no GitHub issue. Survivorship absence. |
| Operational correctness: skill ran, returned plausible-looking output, was wrong. | No. "Installable" is silent on whether the output is correct. | No. Static scanners cannot evaluate runtime output. | Yes, 55.5%. Quality + silent-failure + compat + install on GitHub. |
| Discovery: hard to find, compare, or evaluate a skill before installing. | Search only. Listing exists; ranking and evaluation signals are weak. | No. Discovery is not a security or installability concern. | Yes (floor). Non-developers post on Reddit; rarely file issues. 1.7% GH is a lower bound. |

Each row is the same gap viewed from a different angle. The largest miss, operational correctness, is the row where users see the failure but registries and static scanners do not measure it.
Study design and data
- View A: LLM-classified, validated at κ = 0.71. 16,635 public reports classified by Claude Haiku 4.5, validated against Sonnet 4.6 on a 61-item sample. Zero operational↔security confusions were observed in the sample. Findings are cross-ecosystem: OpenClaw, Vercel skills.sh, and Anthropic’s Claude Code (the latter two appear only here, as they use different formats incompatible with the catalog scan).
- View B: cited as-published reference data. Snyk ToxicSkills and public ClawHavoc reporting; we quote published numbers with attribution and did not rerun either audit. Numbers are directionally consistent; some exact figures are not independently verifiable.
- View C: first-party, reproducible. Full scan of the `openclaw/skills` monorepo (6,993 skills, April 2026). Each skill ships with a `SKILL.md` file containing metadata, declared binary dependencies, and a system prompt; the catalog scan reads these directly.
Datasets
Catalog: 6,993 skills, 23 columns · User reports: 16,635 classified mentions. CC BY 4.0.
Scripts
Reproducible build. Clone the repo, run the script, diff the CSV. MIT.
Cite this study (BibTeX)
@misc{vessel_agent_skills_ecosystem_2026_04,
author = {Bhardwaj, Mehul},
title = {The AI Agent Skills Ecosystem: Installability, Security, and User Pain},
year = {2026},
month = {4},
publisher = {Vessel},
doi = {10.5281/zenodo.19691714},
url = {https://doi.org/10.5281/zenodo.19691714},
howpublished = {\url{https://vesselofone.com/research/ai-agent-skills-ecosystem}},
license = {CC-BY-4.0}
}

View A · User-reported pain
The measurement nobody publishes
First-party, LLM-classified: 16,635 of 16,840 public user reports from GitHub Issues (n = 10,256), Hacker News (n = 5,097), and Reddit (n = 1,282), classified into 12 analytical categories using Claude Haiku 4.5 with a 12-type enum schema. These are collected-report proportions, not ecosystem prevalence estimates. Full methodology, prompt, and validation data.
What each source reveals differently
- Silent failure: GitHub 18.4% vs HN 0.5%. Developers reproduce wrong output and file a ticket. HN readers don’t know the output was wrong; they only see that the skill ran.
- Security: HN 5.1% vs GitHub 1.4%. Security is discussed on HN, not filed as bugs. OAuth over-scoping and credential exposure are invisible to users until exploited.
- Discovery: Reddit 8.3% vs GitHub 1.7%. Non-developers post “I can’t find a skill for X” on Reddit. They don’t file GitHub issues. The GitHub discovery figure is a floor.
The asymmetry defines the monitoring surface
The 18.4% vs 0.5% divergence isn’t about different user populations. Both GitHub and HN skew towards developers. When a developer finds wrong output, they open a GitHub issue. The same developer doesn’t post about it on HN; HN is where security risks and ecosystem concerns get discussed, not where reproducible bugs get filed. Silent failure only surfaces when the person running the skill actively verifies the output. The checks that do that (schema validation, expected-structure assertions, confidence signals) need to be in the system, not left to whoever happens to check.
Bug reports by cluster × source
The 10 categories collapse into 4 clusters. Each column is a cluster; within each cluster, the three sources show what share of their reports landed there. GitHub leads operational correctness at 55.5%; HN leads security at 5.1%; Reddit leads docs & discovery at 15.0%. Same data, three vantage points.
| Cluster | GitHub (n = 10,256) | Hacker News (n = 5,097) | Reddit (n = 1,282) |
|---|---|---|---|
| Operational: quality · silent-failure · compat · install | 55.5% | 8.8% | 16.8% |
| Docs & discovery: docs · discovery · registry | 9.9% | 5.4% | 15.0% |
| Maintenance: abandoned · cross-skill conflicts | 3.1% | 1.3% | 3.1% |
| Security: runtime consent violations | 1.4% | 5.1% | 3.8% |
Percentages are each cluster’s share of that source’s reports; highlighted cell = leading source for that cluster. Rows do not sum to 100% because other and noise are excluded. GitHub security CI: [1.18%, 1.64%]; discovery CI: [1.49%, 2.00%]*.
Pain × source · full breakdown
All 10 signal categories drilled down by source, grouped by cluster, with raw counts and within-source percent. Bars are normalised within each column so a 1.4% security bar in GitHub doesn’t disappear next to GitHub’s 20% quality bar. Bottom two rows (other, noise) are dimmed: residual buckets shown for transparency, not findings.
HN: 54.7% noise filtered · Reddit: 32.0% noise filtered
| Category | Cluster | GitHub | HN | Reddit |
|---|---|---|---|---|
| quality | operational | 20.72% (2,125) | 6.96% (355) | 7.57% (97) |
| silent-failure | operational | 18.44% (1,891) | 0.49% (25) | 1.79% (23) |
| compat | operational | 9.04% (927) | 0.47% (24) | 2.18% (28) |
| install | operational | 7.26% (745) | 0.82% (42) | 5.30% (68) |
| docs | docs & discovery | 5.35% (549) | 2.10% (107) | 6.16% (79) |
| discovery | docs & discovery | 1.73% (177) | 2.53% (129) | 8.27% (106) |
| registry-meta | docs & discovery | 2.87% (294) | 0.73% (37) | 0.55% (7) |
| maintenance | maintenance | 2.36% (242) | 0.80% (41) | 1.95% (25) |
| cross-skill | maintenance | 0.69% (71) | 0.47% (24) | 1.17% (15) |
| security | security | 1.39% (143) | 5.06% (258) | 3.82% (49) |
| other | residual | 22.82% (2,340) | 24.88% (1,268) | 29.25% (375) |
| noise | residual | 7.33% (752) | 54.68% (2,787) | 31.98% (410) |
* The discovery figure (1.7%) is a lower bound, not a prevalence estimate. GitHub Issues are filed by developers who can reproduce and articulate a bug; non-developer users who can’t find a skill abandon silently or post on Reddit instead. See Limitation 8.
View B · Scanner and audit findings
What security scanners catch, and what they miss
First-party scan: all 6,993 skills via SKILL.md pattern analysis, ClawHub signals, and GHSA cross-reference. Reproduce: python scripts/security-scan.py in the openclaw-skills repo. Cited as-published: Snyk ToxicSkills and public ClawHavoc reporting. Their corpora are not fully disclosed, so we treat those numbers as reference data, not precision comparisons.
- Dangerous: 9.2% (642 of 6,993 skills · active threat signals)
- Caution: 43.4% (3,036 of 6,993 skills · risk patterns present)
- Snyk-flagged (cited as-published): 36.8% of the 3,984 audited skills have at least one non-informational finding
Three lenses on the install-time threat surface, not three readings of the same number. The two first-party columns are reproducible; the Snyk column is cited as-published. The runtime security figure in View A (1.4% of GitHub reports) measures a different surface and cannot be summed with these. The three failure types below explain why.
How each security failure type surfaces
| Failure type | Scanner detection | User-visible |
|---|---|---|
Latent code flaws OAuth over-scoping, injection, credential mishandling. | High Static patterns. This is what Snyk’s 36.8% measures. | After exploit only No surface event until the flaw is triggered. |
Supply chain compromise ClawHavoc-class campaigns: malware via legitimate-looking skills (public reports range from 341 to 824 flagged skills). | Post-campaign only Reputation methods catch known campaigns; novel ones evade. | Never Installs cleanly, exfiltrates silently. No crash, no error, no issue. |
Runtime consent violations Agents acting visibly outside intended scope. | None No public scanner covers this class today. | ~1% of reports The only failure mode users can see, and even there the signal is faint. |
Snyk’s findings, ranked
OAuth over-scoping fires on 70.1% of audited skills, but Snyk classifies it as informational — rule-detectable, rarely exploitable on its own. The threshold that matters is “at least one non-informational finding,” which holds for 36.8%. Critical-severity findings (RCE, unauthenticated sinks) hit 13.4%, roughly 534 of the 3,984 audited skills.
| Finding | Prevalence | Severity |
|---|---|---|
| OAuth scope wider than task: skill requests permissions it demonstrably does not use. | 70.1% | high |
| Command-injection pattern: shell-exec in SKILL.md body with user-controlled input. | 43.4% | high |
| At least one security flaw: any non-informational finding across all rules. | 36.8% | medium |
| Critical severity: RCE, creds-in-repo, unauthenticated sink. | 13.4% | critical |

Prevalence is within Snyk’s 3,984-skill corpus; the severity badge is independent of prevalence.
Scanner coverage matrix
The four scanners in the matrix don’t overlap on much. Snyk catches static patterns. Koi catches live campaigns via reputation. Our first-party scan found 9.2% dangerous and 43.4% caution across all 6,993 skills. None of them catch silent wrong-output bugs or environment-drift breakage; those require runtime evaluation, which no public scanner does today.
| Finding class | Snyk ToxicSkills | Koi ClawHavoc | Vessel skill-check |
|---|---|---|---|
| Credentials hard-coded in SKILL.md | Caught: pattern `token\s*=`, known API formats. | Not measured: not a primary focus. | Caught: entropy + known-prefix regex. |
| OAuth scopes wider than declared task | Caught: requested scopes vs. task description. | Not measured. | Caught: intent-vs-scope classifier. |
| Command injection in SKILL.md shell blocks | Caught: AST grep for shell-exec patterns. | Not measured. | Caught: same AST pattern set. |
| Adversarial prompt instructions (prompt injection) | Not measured: natural language, no fixed pattern. | Caught: known-campaign instruction fingerprints. | Partial: trained patterns only; novel phrasings slip past. |
| Malicious skills in active campaigns | Not measured: point-in-time scan, no reputation. | Caught: live campaign tracking (ClawHavoc). | Not measured: single-skill audit, not reputation. |
| The gap · not covered by any published scanner | | | |
| Silent wrong-output bugs | Not caught: requires runtime evaluation. | Not caught: requires runtime evaluation. | Not caught: requires runtime evaluation. |
| Environment-drift / install-breaks-for-me | Not caught: not in scope. | Not caught: not in scope. | Not caught: not in scope. |
The community moved on the supply chain gap before the formal audits landed. Our scrape of OpenClaw skill repos from January through April 2026 captures multiple independently-built scanners — SkillScan, Aguara, ClawSec Monitor, ClawSecure among them — each launched in direct response to the absence of registry-level trust signals. Their published prevalence numbers vary by methodology, but every one of them put dangerous-or-malicious in double digits.
Named incidents in the same window confirm the supply chain risk is not theoretical: top-downloaded ClawHub skills distributing malware, plain-text malicious payloads visible in SKILL.md files, flagged weeks before the formal scanner reports were published.
What scanners fundamentally miss
A SKILL.md body is adversarial natural language, not an executable you can hash. A static scanner finds the patterns it was trained on; it does not catch a skill that installs cleanly, passes audit, and still emits silently wrong output. That gap is what we measured in View A.
Supply chain attacks are the deeper blind spot. The ~1% security figure in View A is not evidence that risk is low; it is a survivorship artifact. The users most exposed to ClawHavoc-style attacks are casual installers who don’t know they were compromised, so they are structurally absent from issue trackers.
Cited sources
Snyk ToxicSkills (2026)
Static code scan for credential leaks, OAuth scope width, and injection patterns. Published corpus: 3,984 skills from ClawHub and skills.sh. Read the Snyk write-up.
ClawHavoc reporting (Jan-Feb 2026)
Public reports describe a live malicious-skill campaign with counts ranging from 341 early identified skills to 824 later flagged skills. Reputation + campaign fingerprint methodology. ClawTank early report · LaunchMyOpenClaw scale report.
View C · Registry installability
The registry: 97.0% installable on Linux
First-party, reproducible: full scan of 6,993 public OpenClaw skills. We parsed every SKILL.md, resolved declared binaries against a curated install map, and publish the full funnel, dep taxonomy, and blocker list. This is a dependency-resolution analysis, not an execution test.
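The resolution step can be sketched in a few lines. This is a hedged illustration, not the real `catalog-coverage.py`: the frontmatter field name (`binaries`) and the install-map excerpt are assumptions about the SKILL.md shape, since the actual parser and curated map live in the repo.

```python
import re

# Assumed SKILL.md declaration shape: `binaries: [jq, ffmpeg]`.
# Field name and install-map excerpt are illustrative only.
APT_MAP = {"jq": "jq", "ffmpeg": "ffmpeg", "node": "nodejs", "curl": "curl"}

def declared_binaries(skill_md: str) -> list[str]:
    """Extract the declared binary list from a SKILL.md body."""
    m = re.search(r"^binaries:\s*\[(.*?)\]", skill_md, re.MULTILINE)
    return [b.strip() for b in m.group(1).split(",") if b.strip()] if m else []

def resolve(bins: list[str]) -> tuple[list[str], list[str]]:
    """Split declared binaries into apt-resolvable and everything else."""
    apt = [b for b in bins if b in APT_MAP]
    unresolved = [b for b in bins if b not in APT_MAP]
    return apt, unresolved
```

A skill with no match in the curated map falls into the brew-only / pip-curl / no-Linux-path tail discussed below.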
If you self-host a Linux box, this is the only install step you need. One apt line covers 6,784 of 6,993 public OpenClaw skills. No Docker, no uv, no language managers.
One command · covers 97.0% of skills

```shell
apt install python3 python3-pip nodejs npm curl jq git ffmpeg sqlite3 openssl ripgrep
```

Satisfies every declared binary dependency for 6,784 of 6,993 skills. No Homebrew, no uv, no Docker, no language managers.
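A quick way to check whether a host already satisfies the apt line is to probe PATH for each binary. This sketch assumes the Ubuntu executable names (ripgrep installs as `rg`, python3-pip as `pip3`), which may differ on other distros.

```python
import shutil

# Binaries the apt line is expected to provide (Ubuntu names assumed).
APT_LINE_BINS = ["python3", "pip3", "node", "npm", "curl", "jq",
                 "git", "ffmpeg", "sqlite3", "openssl", "rg"]

def missing_bins(bins: list[str]) -> list[str]:
    """Return the binaries from `bins` not found on PATH."""
    return [b for b in bins if shutil.which(b) is None]

if __name__ == "__main__":
    gone = missing_bins(APT_LINE_BINS)
    print("apt line satisfied" if not gone else f"still missing: {', '.join(gone)}")
```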
How the 97.0% breaks down
Five buckets, in order. Most skills need nothing at all — pure prompts plus an API key. The apt line absorbs almost everything else. What’s left is a small tail: a pip/curl-installable batch and a handful tied to the Mac desktop.
- Skills scanned: 6,993 (every skill in openclaw/skills)
- Declare their deps (SKILL.md): 97.8% transparent
- Runs as-is, zero install: 82.6% of scanned
- Covered by the apt line: +1,007 resolved via apt
- Outside the apt line: 59 brew-only · 150 pip/curl/macOS-desktop
What the apt line unlocks
5,777 of the 6,993 skills ship with zero binary dependencies. The chart below is the long tail: the 1,216 skills that do declare binaries, and the 15 most-declared bins among them. Almost every bin at the head is either preinstalled on Ubuntu or one apt away.
| Binary | Skills declaring | Linux path |
|---|---|---|
| python3 | 536 | apt install python3 |
| curl | 313 | apt install curl |
| node | 178 | apt install nodejs |
| jq | 78 | apt install jq |
| python | 45 | apt install python3 |
| npx | 32 | apt install nodejs |
| bash | 30 | preinstalled |
| uv | 28 | curl install script |
| npm | 28 | apt install npm |
| mcporter | 24 | direct download |
| pip | 23 | apt install python3-pip |
| git | 21 | apt install git |
| ffmpeg | 19 | apt install ffmpeg |
| pip3 | 13 | apt install python3-pip |
| yt-dlp | 9 | pip install yt-dlp |
Beyond the apt line
About 3.0% of the monorepo declares a binary the apt command doesn’t cover. Most are one pip install or a curl script away: yt-dlp, uv, bun, Foundry’s cast. Named bins below.

| Binary | Status | Skills |
|---|---|---|
| cast | apt-missing | 8 |
| pbpaste | apt-missing | 6 |
| osascript | apt-missing | 4 |
| google-chrome-stable | apt-missing | 4 |
| xvfb-run | apt-missing | 4 |
| ssh | apt-missing | 3 |
| awk | apt-missing | 3 |
| dbus-launch | apt-missing | 3 |
| vdirsyncer | brew-only | 2 |
| khal | brew-only | 2 |
apt-missing = not packaged for apt-get. brew-only = macOS-only Homebrew formula.
Excludes 35 self-references (openclaw/claude itself) and 12 pip/npm-installable tools.
How the most-installed skills fare
The top of the distribution is the test that matters for most self-hosters: do the skills people actually install run on a server? Nine of the top 10 run as-is after the apt line. caldav-calendar is the only genuine macOS blocker in the top tier.
| # | Skill | Downloads | Deps | Linux |
|---|---|---|---|---|
| 1 | Humanizer (`humanizer` · biostartechnology) | 92,489 | none | ✓ Runs as-is |
| 2 | Proactive Agent Lite (`proactive-agent-lite` · bestrocky) | 33,577 | none | ✓ Runs as-is |
| 3 | Xiaohongshu (小红书) Automation (`xiaohongshu-mcp` · borye) | 31,510 | none | ✓ Runs as-is |
| 4 | Tavily AI Search (`tavily` · bert-builder) | 30,814 | none | ✓ Runs as-is |
| 5 | Pdf (`pdf` · awspace) | 30,151 | none | ✓ Runs as-is |
| 6 | Docker Essentials (`docker-essentials` · arnarsson) | 27,684 | docker | ✓ apt install |
| 7 | AgentMail (`agentmail` · adboio) | 27,495 | none | ✓ Runs as-is |
| 8 | Web Search (`web-search` · billyutw) | 26,571 | none | ✓ Runs as-is |
| 9 | Humanizer (`ai-humanizer` · brandonwise) | 26,545 | none | ✓ Runs as-is |
| 10 | Caldav Calendar (`caldav-calendar` · asleep123) | 25,777 | vdirsyncer, khal | ✗ macOS only |
What you can’t get on Linux
After the pip/curl tail, the residual is small. 10 skills depend on macOS-only binaries for their primary capability. Four control a desktop app (AirPlay speakers, Adobe Photoshop, the Trae IDE, the macOS wallpaper daemon) and genuinely don’t port. The other six read text from the Mac clipboard. The transformation itself is general, so the capability transfers to Linux by swapping pbpaste for file input or xclip/wl-paste. Closest catalog matches below.
| Skill | Downloads | Mac bin | Linux replacement in catalog |
|---|---|---|---|
| Airfoil: control AirPlay speakers via Airfoil from the command line. Connect, disconnect, set volume, and manage multi-room audio with simple CLI commands. | 2,096 | osascript | No catalog equivalent. Desktop-app capability. |
| Photoshop Automator: automate Adobe Photoshop on Windows via ExtendScript to run scripts, update text layers, create layers, apply filters, play actions, and export images. | 1,123 | osascript | No catalog equivalent. Desktop-app capability. |
| Nerve Bridge Skill: bi-directional control of Trae via macOS AppleScript with built-in feedback mechanism. Use when needing to execute code/commands in Trae IDE and wait for com... | 591 | osascript | No catalog equivalent. Desktop-app capability. |
| reply-coach | 280 | pbpaste | `copy-editing` (file-based text rewrite) |
| reviewer-rebuttal-coach | 255 | pbpaste | `copy-editing` (file-based rewrite, swap pbpaste for file input) |
| collab-offer-polisher | 253 | pbpaste | `medical-email-polisher` (closest catalog match for polishing business messages) |
| wallpaper-auto-switch-pro-executable | 249 | osascript | No catalog equivalent. Desktop-app capability. |
| policy-to-checklist | 248 | pbpaste | `afrexai-qa-test-plan` (document-to-checklist generator) |
| claim-risk-auditor | 244 | pbpaste | `verify-claims` (direct claim / fact-check equivalent) |
| rubric-gap-analyzer | 244 | pbpaste | `afrexai-interview-architect` (rubric-based scoring, file input) |
Decision framework
A skill can install cleanly, pass every scanner, and still produce silently wrong output. Quality (20.7%) and silent-failure (18.4%) are the top two user-reported failure modes, and neither is what registries or scanners are built to catch. The minimum steps to close the gap, by role.
If you write skills · publishing to a registry
- Write at least one correctness test. (18.4% silent-failure rate.) A behavioral check, not a lint. One golden input → expected output. Catches the “ran and returned garbage” class of bug that exits 0 and looks fine.
- Fail loudly when uncertain. (20.7% quality complaints.) Return an explicit error or confidence signal. The gap between quality and silent-failure is whether the user knows something went wrong.
- Pin binary versions and declare every env var. (11.4% maintenance + compat.) Most break 3–6 months after a working install when something upstream changes. SKILL.md is the contract.
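The “one golden input → expected output” advice can be as small as a single pytest-style function. `run_skill`, the field names, and the fixture values below are hypothetical; substitute your skill’s real entrypoint.

```python
# Hypothetical skill entrypoint and golden fixture; only the test shape
# is the point. Replace run_skill with your actual skill invocation.
GOLDEN_INPUT = {"invoice_text": "Total due: $142.50, net 30"}
EXPECTED = {"total": 142.50, "terms_days": 30}

def run_skill(payload: dict) -> dict:
    # Stand-in that happens to return the expected answer; the real
    # entrypoint would invoke the installed skill.
    return {"total": 142.50, "terms_days": 30}

def test_golden_invoice():
    out = run_skill(GOLDEN_INPUT)
    # Behavioral assertions on values, not just "didn't crash":
    assert out["total"] == EXPECTED["total"]
    assert out["terms_days"] == EXPECTED["terms_days"]
```

A test like this runs in CI before every publish and catches the class of regression that a lint or static scan never sees.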
If you install skills · deploying into a workflow
- Run a scanner before installing. (36.8% have a scanner-flagged issue.) Static scan is cheap insurance for the latent risks users can’t see by reading SKILL.md.
- Test with your own representative inputs. (55.5% of reported pain is operational.) “Installs cleanly + passes scanner” is not a production filter. Neither registries nor scanners measure correctness on your data.
- Monitor for wrong output, not just errors. (A 0% error rate ≠ correctness.) Silent-failure means users hit wrong-output bugs before crash bugs. Spot-check outputs or assert against an expected schema in production.
- Schedule a re-review at 6 months. (11.4% surfaces post-install.) APIs change, deps drift, model behavior shifts. The bug that wasn’t there at install will be there in two quarters.
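An expected-schema assertion at the install site needs only the stdlib. The key/type map below is a hypothetical contract for a search-style skill, not any real skill’s schema; adjust it to what you actually run.

```python
# Hypothetical output contract for a search-style skill.
EXPECTED_SHAPE = {"results": list, "query": str}

def output_problems(out: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape holds."""
    problems = []
    for key, typ in EXPECTED_SHAPE.items():
        if key not in out:
            problems.append(f"missing key: {key}")
        elif not isinstance(out[key], typ):
            problems.append(f"wrong type for {key}: {type(out[key]).__name__}")
    # An empty result list exits 0 and looks fine: the silent-failure class.
    if not problems and not out["results"]:
        problems.append("results empty: possible silent failure")
    return problems
```

Wiring this after every skill call turns “plausible-looking but empty” into an alert instead of a support ticket.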
If you package skills into a product · embedding into something other people consume
- Build correctness monitoring into your runtime. (55.5% lands on you, not the author.) Customers will hit operational bugs first and escalate to you, not the upstream skill author. Public registries and scanners don’t cover this layer; it is yours to build.
- Surface confidence and “I’m uncertain” signals. (18.4% silent-failure → known unknown.) The dominant failure mode is a skill that ran and was silently wrong. A confidence signal converts a silent failure into something users can act on.
- Track correctness independently of security. (36.8% scanner-clean ≠ correct.) Scanner-clean skills still generate the majority of support load. Correctness is a separate measurement from safety and needs its own dashboard.
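One way to convert a silent failure into an actionable signal is to wrap every skill call in a result type that carries an explicit ok flag and a reason. `run_skill` and the empty-output heuristic are illustrative assumptions, not part of any skill runtime.

```python
from dataclasses import dataclass, field

@dataclass
class SkillResult:
    """Skill output plus an explicit ok/uncertain signal."""
    output: dict = field(default_factory=dict)
    ok: bool = False
    reason: str = ""

def call_with_signal(run_skill, payload) -> SkillResult:
    try:
        out = run_skill(payload)
    except Exception as exc:
        # Loud failure: already user-visible, but normalise it anyway.
        return SkillResult(reason=f"error: {exc}")
    if not out:
        # Empty output exits 0 and looks fine: the silent-failure class.
        return SkillResult(reason="empty output")
    return SkillResult(output=out, ok=True)
```

The point of the wrapper is that downstream code can branch on `ok` and show the user a reason, instead of forwarding plausible-but-empty output.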
Implications
Four predictions for 2026–2028, each with an explicit falsification criterion. Confident enough to be wrong about. Each is a direct read on what the three-view gap would have to produce if the structural diagnosis here is right.
1. Skill-level runtime correctness telemetry emerges as a paid layer. Agent-level evaluation is already crowded. LangSmith, Braintrust, Patronus, Langfuse, and Phoenix all measure end-to-end agent behaviour. The gap is the install-site layer below: per-skill output-shape monitoring, schema assertions, and confidence signals attached to the individual tool the agent just called. The 55.5% operational pain figure says someone has to fill it. Falsified unless, by 2027-04, at least one VC-backed startup ships per-skill output validity monitoring at the install site (not agent-level).
2. At least one more ClawHub-class supply-chain incident lands in the next 12 months. ClawHavoc was not a one-off. It was the first incident large enough to be named. The structural conditions that made it possible (no pre-publish review, no provenance check, install-time trust) have not changed. The economics for an attacker (hundreds of flagged skills in public reporting, no crash, no error) only get better as the install base grows. Falsified unless, by 2027-04, at least one publicly-named malicious-skill campaign affecting more than 100 skills is reported by a recognised security vendor or news outlet.
3. Enterprise procurement starts asking for correctness SLAs, not just security attestations. Today’s agent-vendor RFPs ask SOC2-style security questions. The next round will ask the question View A actually exposed: what is your wrong-output rate, and what do you do when a skill returns plausible-but-bad output? Procurement catches up with the failure mode roughly two years after the field documents it. Falsified unless, by 2028-04, a widely-circulated agent-vendor RFP template (Gartner, IAPP, or a Fortune 500 sample) explicitly asks for skill-level correctness measurements.
4. Skill registries add some form of test/eval requirement for marketplace inclusion. Registries today gate on installability and (sometimes) static security scan. Neither catches the dominant failure mode. The first registry to publish a “verified skill” tier with an automated correctness check wins the trust premium, and the rest follow within a release cycle. Falsified unless, by 2027-10, ClawHub, skills.sh, or another major registry ships a verified-skill tier conditioned on at least one automated correctness check (golden input → expected output, schema assertion, or runtime invariant).
Predictions will be re-scored alongside each refresh of this study. The changelog records hits and misses.
Methodology
View A · User-reported pain
Sources. GitHub Issues API (three repos, paginated to GitHub’s 100-page cap). Hacker News via Algolia (11 keywords: "openclaw", "SKILL.md", "claude skill", "claude code skill", "clawhub", "claw skill", "openclaw/skills", "skill registry", "agent marketplace", "skill broken", "skill doesn’t work"). Reddit public JSON across 12 subreddits. Total scraped: 16,840. Classified: 16,635 (98.8%; 205 unclassified due to rate-limit exhaustion during the run).
Signal quality by source.
On-topic share within each scrape, after the classifier filters noise (keyword false positives, off-topic English uses of "skill"). GitHub is the cleanest channel; Hacker News is the noisiest. Headline percentages in View A use the on-topic counts as their denominator.
- GitHub: 3 repos (vercel-labs/skills, anthropics/claude-code, anthropics/claude-agent-sdk-typescript). Reproducible, specific bugs.
- Reddit: 12 subreddits. Some operators troubleshoot in public; lots of adjacent LLM chatter.
- Hacker News: Algolia keyword matches. Majority are unrelated English uses of "skill".
Classifier. Claude Haiku 4.5 (claude-haiku-4-5-20251001) via the Anthropic API, tool-use structured output with an enum-constrained schema. Single-label classification with a short (<50 char) free-text reason. Ephemeral prompt cache on the system message (12-type taxonomy definitions) for a ~90% input-token discount after first hit.
Taxonomy. 12 types: compat, silent-failure, maintenance, security, quality, docs, cross-skill, discovery, registry-meta, install, other, noise. Full definitions in the dataset README.
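The enum-constrained setup can be sketched as a tool definition. The taxonomy and model id come from this section; the tool name and exact request shape are assumptions about the pipeline, not verified against the repo.

```python
# Taxonomy from the Methodology; tool name and schema layout are assumed.
TAXONOMY = ["compat", "silent-failure", "maintenance", "security", "quality",
            "docs", "cross-skill", "discovery", "registry-meta", "install",
            "other", "noise"]

CLASSIFY_TOOL = {
    "name": "classify_report",
    "description": "Assign exactly one pain category to a user report.",
    "input_schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": TAXONOMY},
            "reason": {"type": "string", "maxLength": 50},
        },
        "required": ["category", "reason"],
    },
}

# With the Anthropic SDK this would be passed roughly as (sketch only):
# client.messages.create(model="claude-haiku-4-5-20251001", max_tokens=256,
#     tools=[CLASSIFY_TOOL],
#     tool_choice={"type": "tool", "name": "classify_report"},
#     messages=[{"role": "user", "content": report_text}])
```

Forcing the tool via `tool_choice` is what makes the output single-label by construction rather than by parsing free text.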
View B · Scanner and audit findings
Cited as-published: Snyk ToxicSkills (2026) and public ClawHavoc reporting (January-February 2026). We did not rerun these external audits; we quote their headline numbers with attribution and treat them as reference data, not precision-comparable measurements.
View C · Registry installability
Full-catalog scan of openclaw/skills: parse every SKILL.md, extract declared binary dependencies, resolve each against a curated Linux install map (apt / direct download / runtime fetch / brew-only / no-Linux-path). ClawHub metadata and last-commit activity joined per skill. Script: catalog-coverage.py.
Statistical reporting
Headline percentages are reported with 95% Wilson score intervals. Wilson is preferred over the normal approximation at small p or small n: it stays inside [0, 1] and gives sensible bounds even when the count is in the single digits (matters for the 1.39% security and 1.73% discovery figures).
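The interval can be reproduced in a few lines; applying it to the GitHub security count (143 of 10,256) recovers the reported [1.18%, 1.64%].

```python
from math import sqrt

def wilson_ci(count: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = count / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

lo, hi = wilson_ci(143, 10_256)  # GitHub security reports from View A
print(f"[{lo:.2%}, {hi:.2%}]")   # → [1.18%, 1.64%]
```

The same call with the discovery count (177 of 10,256) reproduces the [1.49%, 2.00%] interval quoted above.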
Confusion matrix
Sonnet 4.6 was rerun on a 61-item stratified random sample from the Haiku output, blind to Haiku’s labels. This is model-model agreement, not human-adjudicated ground truth. The 5-cluster collapse below shows where the two models agree (diagonal) and where they disagree (off-diagonal). The “op” cluster here includes operational correctness (quality, silent-failure, compat, install) plus ecosystem health types (maintenance, cross-skill, registry-meta) — a superset of the headline stat, which counts only the four correctness types. The single load-bearing axis of this study, operational vs. security, has zero observed op↔security cells in either direction in this validation sample. Adjacent-boundary disagreements (quality ↔ silent-failure, both operational) account for the bulk of the 16 off-diagonal pairs.
Exact-label agreement
73.8%
45/61 sample items
Cohen’s κ
0.71
Substantial agreement (Landis & Koch)
Op↔security confusions
0
On the load-bearing axis of the study
Sonnet 4.6 vs Haiku 4.5 · 5-cluster collapse
diagonal = agreement · off-diagonal = confusion
| Haiku ↓ / Sonnet → | operational | security | discovery | docs | residual |
|---|---|---|---|---|---|
| operational | 22 | 0 | 2 | 3 | 1 |
| security | 0 | 3 | 0 | 0 | 1 |
| discovery | 0 | 0 | 5 | 1 | 0 |
| docs | 3 | 0 | 0 | 7 | 0 |
| residual | 3 | 1 | 1 | 0 | 8 |
Diagonal cells (emerald) are same-cluster agreement. Off-diagonal cells (silver) are cross-cluster confusions; intensity scales with cell count. The op↔security row/column intersections are zero in this validation sample, which is the load-bearing axis of the study.
Snapshot from aggregate validation notes. Per-row labels will be published as `validation-sample.csv` in the dataset repo; this view will then be computed from data.
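The printed matrix can be checked directly. Note that the κ computed from this 5-cluster collapse (~0.63) sits below the 12-label κ = 0.71 even though observed agreement is the same 45/61: collapsing categories raises chance agreement, which lowers κ.

```python
# Haiku rows × Sonnet cols; clusters: op, security, discovery, docs, residual.
MATRIX = [
    [22, 0, 2, 3, 1],
    [0, 3, 0, 0, 1],
    [0, 0, 5, 1, 0],
    [3, 0, 0, 7, 0],
    [3, 1, 1, 0, 8],
]

def agreement_and_kappa(m):
    """Observed agreement and Cohen's kappa from a square confusion matrix."""
    n = sum(map(sum, m))
    po = sum(m[i][i] for i in range(len(m))) / n          # diagonal share
    rows = [sum(r) for r in m]
    cols = [sum(c) for c in zip(*m)]
    pe = sum(r * c for r, c in zip(rows, cols)) / (n * n)  # chance agreement
    return po, (po - pe) / (1 - pe)

po, kappa = agreement_and_kappa(MATRIX)
print(f"agreement {po:.1%}, cluster-level kappa {kappa:.2f}")
```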
Classifier validation
On a 61-item stratified random sample, Claude Sonnet 4.6 agreed with the Haiku 4.5 classifications at 73.8% (κ = 0.71, substantial agreement). Zero operational↔security confusions were observed in the sample. With 0 observed events, the rule-of-three upper bound is roughly 4.9% for this validation sample. All 16 disagreements fell at adjacent-category boundaries (e.g. quality vs. silent-failure, both operational; compat vs. docs). 9 blank Hacker News comments were excluded as unclassifiable. Raw sample and per-row labels in the dataset repo.
Sensitivity analysis. The operational:security ratio at baseline is 40:1 (5,688 vs. 143 GitHub reports). For the ratio to drop below 5:1, the classifier would need to have misclassified more than 30% of all operational reports as security, more than 100× the observed 0% operational↔security confusion rate. The finding is structural, not marginal.
Reproducibility
View C reruns end-to-end in about 5 minutes against a public-read GITHUB_TOKEN. View A is a multi-step pipeline (collectors → classifier → aggregation) that needs an ANTHROPIC_API_KEY; a full rerun is roughly 2 hours of wall time and $10–30 of Haiku inference.
# clone and install (~30s)
$ git clone https://github.com/vesselofone/openclaw-skills.git
$ cd openclaw-skills && pip install -r requirements.txt
# View C · catalog scan (~5 min, public-read token)
$ export GITHUB_TOKEN=ghp_your_public_read_token
$ python3 scripts/catalog-coverage.py --all \
--with-metadata --with-commits \
--output coverage-output --work-dir coverage-output/repos
# View A · ecosystem classification (~2h, ~$10-30 Haiku)
$ export ANTHROPIC_API_KEY=sk-ant-...
# see scripts/ for collectors and classifier entrypoints
Data availability and ethics
- Dataset license: CC BY 4.0
- Scripts: MIT
- Snapshot: April 2026 point-in-time scrape
- Classifier: Claude Haiku 4.5 with enum-constrained output
- Public data handling: The study uses public GitHub, Hacker News, and Reddit text for aggregate classification. The page reports category counts and examples of methodology, not individual user profiles.
Limitations
- Static analysis only. View C does not execute skills; it resolves declared dependencies. SKILL.md completeness is voluntary. Interpret the installability number as dependency reachability, not runtime success.
- Single-label classification. A mention that describes both bad output and a compat bug must be assigned one bucket. Edge cases are forced, so category percentages are analytical approximations rather than adjudicated facts.
- GitHub pagination cap. Basic pagination tops out at 100 pages per repo. The anthropics/claude-code corpus tail is not covered. Treat GitHub percentages as proportions of collected reports, not full-ecosystem prevalence.
- HN noise. Keyword matching on “openclaw”/“skill” yields 54.68% unrelated hits. The noise tag filters them; do not use the HN subset without filtering.
- Reddit rate-limit floor. Unauthenticated Reddit reads hit the API rate limit, so 1,282 rows is a floor, not a true count; Reddit is directional evidence for discovery pain rather than a prevalence estimate.
- Classifier precision. Validation measures model-model agreement, not human-labeled ground truth. On a 61-item validation sample, Sonnet 4.6 agreed with Haiku at 73.8% (κ = 0.71), with zero observed operational↔security confusions. The sensitivity analysis shows the operational:security ratio would require >30% systematic misclassification to drop below 5:1.
- Moderation status is not security. ClawHub flags reflect staff moderation, not a security audit. They are registry metadata, not an independent safety label.
- User sample is developer-skewed. 61.7% of classified reports come from GitHub Issues, which are filed predominantly by developers integrating skills, a population that is unusually good at finding things. Pain modes that are visible to that population (silent wrong output, compat, install failures in their own environment) are probably faithfully represented. Pain modes that are more acute for non-developers (discovery, decision fatigue across similar skills, onboarding friction) are likely under-counted. The 1.73% discovery figure should be read as a lower bound among developer reporters, not as a prevalence estimate across all skill users.
Frequently asked
Common questions from readers and reviewers. If yours isn’t here, open an issue on the dataset repo.
Why juxtapose registry, scanner, and user views at all?
Because each view answers a different question, and policy conversations about the skills ecosystem routinely confuse them. “Is the ecosystem healthy?” depends on whether you mean installable, safe, or operationally correct. The three answers point in different directions: high installability, a materially lower scanner-clean surface, and a GitHub pain profile dominated by operational correctness. That spread changes what you’d build next.
Why is security around 1% of user-reported pain but 37% of scanner findings?
The survivorship argument is in the introduction. In short: ClawHavoc-style attacks can produce no crash, no error, and no GitHub issue. The 1.39% measures conscious, reportable pain from technical users. It does not measure silent compromise.
Why Claude Haiku 4.5 instead of Opus for classification?
Cost and throughput. 16,000 single-label classifications with ephemeral prompt-caching on the taxonomy run in ~2 hours on Haiku at roughly $10–30 all-in. Opus would push the run into hundreds of dollars without meaningfully changing the distribution for a 12-class, enum-constrained schema. We publish the classifier prompt; anyone can rerun on a different model to verify.
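As a rough illustration of what enum constraining buys at the post-processing stage, here is a minimal sketch. The label set below is a placeholder mixing the pain types named on this page with the catch-all bucket; it is not the published 12-class schema.

```python
# Hypothetical subset of the taxonomy: only labels mentioned on this page,
# plus "residual" as the catch-all. The real schema has 12 classes.
ALLOWED = {"quality", "silent-failure", "compat", "docs", "residual"}

def constrain(raw: str) -> str:
    """Snap a model response onto the enum; anything off-schema -> residual."""
    label = raw.strip().lower()
    return label if label in ALLOWED else "residual"

print(constrain("Silent-Failure"))      # silent-failure
print(constrain("totally off-script"))  # residual
```

The point of the enum constraint is that a small model's occasional free-text drift degrades to a counted residual rather than polluting the named categories.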
Why is Hacker News so noisy?
54.68% of HN matches are unrelated keyword collisions (the word "skill" has strong non-AI-agent English usage). The noise tag filters them. If you cite an HN subset, filter first.
Do OpenClaw skills work on Linux?
Yes, in the sense measured here: 97.0% of the 6,993 public skills have every declared binary dependency satisfied on Linux after one apt install line. A small tail (~209) needs a pip install or a curl script, and about 10 skills depend on macOS-only binaries for their primary capability.
What is the one apt install line?
sudo apt install python3 python3-pip nodejs npm curl jq git ffmpeg sqlite3 openssl ripgrep
This satisfies every declared binary dependency for 6,784 of 6,993 public skills on Ubuntu 22.04/24.04 or Debian 12.
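To check your own host against that list, a small reachability probe along these lines works; note the binary names differ from the apt package names in a few cases (python3-pip installs pip3, ripgrep installs rg, nodejs provides node).

```python
import shutil

def unreachable(binaries):
    """Return the declared binaries that do not resolve on PATH."""
    return [b for b in binaries if shutil.which(b) is None]

# Binary names provided by the apt line above (names as they appear on PATH).
DECLARED = ["python3", "pip3", "node", "npm", "curl", "jq",
            "git", "ffmpeg", "sqlite3", "openssl", "rg"]

missing = unreachable(DECLARED)
print("missing:", missing or "none")
```

This mirrors what the study calls dependency reachability: it confirms the binaries resolve, not that any skill runs correctly.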
Is this a security audit?
No. View C is static compatibility analysis, not a security audit. View B cites external audits. View A classifies what users report, not what is objectively true. Per-skill security scoring is a separate pass; a free per-slug auditor is at vesselofone.com/tools/skill-check.
Will this study be refreshed?
View C is reproducible: clone the repo, re-run the script, diff the CSV. View A is point-in-time (April 2026); a future refresh will ship as a GitHub release with an immutable Zenodo DOI. The changelog lives on this page.
Changelog
- 2026-04-20 · v1.0. Initial publication. Catalog scan (6,993 skills, one apt line). Ecosystem classification (16,635 mentions, 12-type pain taxonomy).

