
We Classified 16,635 OpenClaw Skill Complaints. Wrong Output Is the #1 Failure Mode.
We collected and classified 16,635 user mentions about OpenClaw skills across GitHub Issues, Hacker News, and Reddit. Wrong or missing output: 39.1%. Installation failures: 7.3%. And why GitHub, HN, and Reddit each surface a different problem.
The OpenClaw registry tracks whether skills install. The dominant user complaint is not installation failure. It is a skill that installs cleanly, runs without error, and produces wrong or missing output, with no signal that anything went wrong.
We classified 16,635 user mentions to find out what actually breaks.
How We Built the Dataset
Three sources:
- GitHub Issues: scraped across openclaw/openclaw, openclaw/skills, and NVIDIA/NemoClaw via the GitHub Issues API
- Hacker News: 16,840 raw mentions via the Algolia API, of which 54.7% were filtered out as noise (keyword collisions unrelated to the skills ecosystem); see the collection sketch after this list
- Reddit: across 12 subreddits where OpenClaw skills discussions appear
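The Hacker News collection step can be reproduced against the public Algolia HN Search API. A minimal sketch, assuming a single keyword query and pagination limit; the actual query list and the downstream noise filter live in the published methodology:

```python
# Minimal sketch of the Hacker News collection step using the public
# Algolia HN Search API. Query terms and page limits are assumptions,
# not the exact values used to build the published dataset.
import requests

ALGOLIA = "https://hn.algolia.com/api/v1/search_by_date"

def fetch_hn_mentions(query="openclaw skill", max_pages=10):
    """Yield raw HN stories and comments matching the query."""
    for page in range(max_pages):
        resp = requests.get(
            ALGOLIA,
            params={"query": query, "tags": "(story,comment)",
                    "hitsPerPage": 100, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        for hit in data["hits"]:
            yield {
                "id": hit["objectID"],
                "created_at": hit["created_at"],
                "text": hit.get("comment_text") or hit.get("title", ""),
            }
        if page >= data["nbPages"] - 1:
            break
```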
Each mention was classified into one of 12 categories using Claude Haiku 4.5 with an enum-constrained schema. We validated a 500-item sample against Claude Sonnet 4.6. Exact agreement: 73.8%, κ = 0.71. Zero operational-versus-security confusions in the validation set.
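A minimal sketch of what an enum-constrained classification call looks like with the Anthropic SDK's forced tool use, which constrains the model to return one of the listed labels. The category names below are illustrative placeholders, the real 12-label taxonomy and prompt are in the published methodology, and the model ID string is an assumption:

```python
# Sketch of the enum-constrained classification step. Category names are
# an illustrative subset; the published prompt defines the real 12 labels.
import anthropic

CATEGORIES = [
    "operational_correctness", "silent_failure", "compatibility",
    "installation", "security", "discovery", "other",  # illustrative subset
]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def classify(mention_text: str) -> str:
    msg = client.messages.create(
        model="claude-haiku-4-5",  # model ID placeholder; substitute your account's identifier
        max_tokens=64,
        tools=[{
            "name": "record_category",
            "description": "Record the single best-fitting complaint category.",
            "input_schema": {
                "type": "object",
                "properties": {"category": {"type": "string", "enum": CATEGORIES}},
                "required": ["category"],
            },
        }],
        tool_choice={"type": "tool", "name": "record_category"},
        messages=[{"role": "user",
                   "content": f"Classify this mention:\n\n{mention_text}"}],
    )
    # The forced tool call returns a tool_use block whose input follows the schema.
    return msg.content[0].input["category"]
```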
The methodology, prompt, and validation data are published at vesselofone.com/research/ai-agent-skills-ecosystem. The full dataset is at doi.org/10.5281/zenodo.19691714 under CC BY 4.0. Reproducing the full classification pipeline requires an Anthropic API key and runs in approximately two hours at $10-30 in API cost.
What People Actually Complain About
Aggregated across all three sources:
| Category | Share of mentions (per platform where noted) |
|---|---|
| Operational correctness failures | 20.7% |
| Silent failures (wrong output, no error) | 18.4% |
| Compatibility problems | 9.0% |
| Installation issues | 7.3% |
| Security concerns | 1.4% (GitHub) / 5.1% (Hacker News) |
| Discovery friction | 1.7% (GitHub) / 8.3% (Reddit) |
The top two categories together (39.1% of all classified mentions) describe the same underlying failure: a skill that appears to work but does not. The agent returns a result. The result is wrong. No error is raised. The user has to notice on their own.
Installation is 7.3%. It is the failure mode that gets the most tooling attention and generates comparatively few complaints.
What You Find Depends on Where You Look
The three sources surface different problems. This is not sampling noise; it reflects how each platform is used.
GitHub skews toward operational correctness. Developers who encounter wrong output file detailed bug reports: reproduction steps, expected versus actual output, version numbers. 55.5% of GitHub-filed complaints describe correctness failures of this kind. These are the most actionable reports, but they require the user to know the output was wrong.
Hacker News surfaces security at 5.1% versus 1.4% on GitHub. Security researchers and people following incident reports congregate there. Discussions of supply-chain attacks and OAuth over-scoping appear in HN threads well before they produce GitHub issues.
Reddit surfaces discovery friction at 8.3% versus 1.7% on GitHub. Finding the right skill in a catalog of 6,993 skills is a genuine problem that never produces a bug report. Users who cannot find what they need leave a subreddit comment, not a GitHub issue.
If you only read GitHub Issues, you miss the security risk distribution and the discovery problem entirely.
Why the Security Numbers Understate Actual Risk
Security complaints represent 1.4% of GitHub mentions. That figure almost certainly understates actual exposure.
A successful supply-chain attack is silent by design. Users whose agents were compromised by the ClawHavoc campaign (skills that installed cleanly, ran without errors, and exfiltrated data through the agent's normal output channel) are not in this dataset. They do not know they were compromised. They filed no issues.
The 1.4% counts users who recognized a security problem and filed a report. It does not count users who were affected.
What This Means for Skill Authors and Installers
The dominant failure mode (silent wrong output) has no automatic detection. No error is raised. No alert fires. The skill appears healthy by every operational metric.
For skill authors: return explicit errors on unexpected input rather than plausible-looking wrong output. A skill that fails loudly is easier to debug than one that succeeds silently with bad results. Write correctness tests against representative data before publishing.
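As a concrete illustration of failing loudly, here is a hypothetical skill handler; the function name and input shape are invented for the example and are not part of any OpenClaw API:

```python
# Hypothetical skill handler illustrating "fail loudly". Names and input
# shape are invented for this example, not part of any OpenClaw API.
def summarize_invoice(payload: dict) -> dict:
    required = {"invoice_id", "line_items", "currency"}
    missing = required - payload.keys()
    if missing:
        # Raise instead of guessing: a loud failure the agent and user can see.
        raise ValueError(f"summarize_invoice: missing fields {sorted(missing)}")

    if not payload["line_items"]:
        # An empty-but-plausible summary is exactly the silent failure mode;
        # make the emptiness explicit instead.
        raise ValueError("summarize_invoice: no line items to summarize")

    total = sum(item["amount"] for item in payload["line_items"])
    return {"invoice_id": payload["invoice_id"],
            "total": total,
            "currency": payload["currency"]}
```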
For installers: test skills with representative inputs before putting them in any workflow that touches real data. Monitor outputs, not just uptime. A skill that returns something is not the same as a skill that returns the right thing.
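One way to act on that before deployment is a small known-answer smoke test. Everything below is hypothetical scaffolding; run_skill stands in for however your setup actually invokes the skill:

```python
# Hypothetical pre-deployment smoke test: compare outputs against known
# answers, not just "did the skill return without error". run_skill()
# stands in for however your deployment actually invokes the skill.
GOLDEN_CASES = [
    ({"invoice_id": "A-1", "currency": "USD",
      "line_items": [{"amount": 40.0}, {"amount": 2.5}]},
     {"total": 42.5, "currency": "USD"}),
]

def smoke_test(run_skill) -> bool:
    for payload, expected in GOLDEN_CASES:
        result = run_skill(payload)
        # Check the fields that matter, not merely that something came back.
        if any(result.get(k) != v for k, v in expected.items()):
            print(f"MISMATCH for {payload['invoice_id']}: got {result}, want {expected}")
            return False
    return True
```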
Full methodology and dataset: vesselofone.com/research/ai-agent-skills-ecosystem. Dataset at doi.org/10.5281/zenodo.19691714. Scan scripts at github.com/vesselofone/openclaw-skills under MIT + CC BY 4.0.
A free per-skill auditor covering SKILL.md intent, OAuth scope width, and injection patterns: vesselofone.com/tools/skill-check.
Vessel is managed OpenClaw hosting on private Linux VMs. Every agent we provision runs the skill auditor at setup. The research and dataset are open source.

