I've recently tried skills like Garry Tan's GStack, spent a week with it, and realized it has some flaws (I'll post separately about that).

Here's my problem: how do I know if a skill or prompt is any good (e.g. GStack's /office-hours)? How do I compare similar skills (e.g. different "deep research" skills)?

Spotting broken software is (relatively) easy: it crashes or prints errors. Broken skills don't. Perfectly polished, confident-sounding skills routinely mislead me and waste my time, to the point where I wish I weren't using an LLM at all.

AI skills are software, and they should come with regression tests.

LLM teams have tons of prompt regression tests. LLM-wrapper SaaS companies have tons of prompt regression tests. But when it comes to open-source skills, the SKILL.md reads reasonably, yet the skill ships with zero tests (e.g. GStack's /office-hours has none at the time of writing).

Garry Tan, if you hear me: please consider shipping regression tests for your /office-hours, /plan-ceo-review, /plan-...
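
To make "regression tests for a skill" concrete, here is a minimal sketch of what one could look like. Everything in it is my own assumption, not anything GStack ships: `SkillCase`, `check_case`, `run_suite`, the sample case, and especially `fake_skill`, which stands in for whatever actually invokes the skill against an LLM. The checks are plain substring matches, which is crude but enough to show the shape: pin a prompt, pin what the answer must and must not contain, and fail loudly when that changes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SkillCase:
    """One pinned prompt plus the expectations its output must meet (hypothetical)."""
    name: str
    prompt: str
    must_include: List[str] = field(default_factory=list)
    must_not_include: List[str] = field(default_factory=list)


def check_case(case: SkillCase, run_skill: Callable[[str], str]) -> List[str]:
    """Run one case; return failure messages (an empty list means it passed)."""
    output = run_skill(case.prompt).lower()
    failures = []
    for phrase in case.must_include:
        if phrase.lower() not in output:
            failures.append(f"{case.name}: missing expected phrase {phrase!r}")
    for phrase in case.must_not_include:
        if phrase.lower() in output:
            failures.append(f"{case.name}: contains forbidden phrase {phrase!r}")
    return failures


def run_suite(cases: List[SkillCase], run_skill: Callable[[str], str]) -> List[str]:
    """Run every case and collect all failures."""
    failures: List[str] = []
    for case in cases:
        failures.extend(check_case(case, run_skill))
    return failures


def fake_skill(prompt: str) -> str:
    # Assumption: a canned reply standing in for a real LLM call,
    # so the sketch runs offline.
    return "Who is your first paying customer? Talk to ten users before building."


# Hypothetical case: an office-hours style skill should push on customers
# and should not hand out empty reassurance.
CASES = [
    SkillCase(
        name="office-hours-pushes-on-customers",
        prompt="I want to build an AI note-taking app.",
        must_include=["customer"],
        must_not_include=["definitely succeed"],
    ),
]

if __name__ == "__main__":
    problems = run_suite(CASES, fake_skill)
    print("PASS" if not problems else "\n".join(problems))
```

Swapping `fake_skill` for a function that really calls the model turns this into a check you can rerun after every edit to a SKILL.md. Substring checks only catch the regressions you thought to encode, so treat this as a floor, not a measure of quality.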