Proactively audits a SkillsBench task's instruction.md against QC rubric dimensions before submission. Evaluates prescriptiveness, skill mentions, realism, and clarity. Cross-references test_outputs.py to flag testability risks when recommending content removal.
Trigger: Ask to audit, review, or pre-screen a task's instructions.
Checks whether a task's test_outputs.py lacks tests for any explicit output requirement defined in instruction.md. Reports each requirement that has no corresponding test.
Trigger: Ask to find missing tests, check coverage gaps, or verify all outputs are tested.
Checks whether a task's test_outputs.py contains tests that unfairly penalize correct agent responses through overly strict or unreasonable assertions that the instructions do not justify.
Trigger: Ask to check for unfair tests, overly strict assertions, or tests that don't fairly assess outputs.
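One illustrative case of an overly strict assertion, assuming a task whose instructions ask for a numeric result without specifying exact precision. The function names and the expected value 0.3 are hypothetical; the sketch only contrasts exact float equality with a tolerance-based comparison.

```python
import math

EXPECTED = 0.3  # hypothetical reference value for the illustration


def strict_check(value: float) -> bool:
    # Exact equality: rejects answers that differ only by floating-point rounding.
    return value == EXPECTED


def tolerant_check(value: float, tol: float = 1e-6) -> bool:
    # Tolerance-based comparison: accepts answers within tol of the reference.
    return math.isclose(value, EXPECTED, abs_tol=tol)


agent_value = 0.1 + 0.2  # a correct computation that does not equal 0.3 exactly

print(strict_check(agent_value))    # False
print(tolerant_check(agent_value))  # True
```

Whether a tolerance is the fair choice depends on what instruction.md actually requires; if the instructions demand an exact format or value, a strict assertion is justified.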