Translated by AI
Stopping Coding Agents' 'Probably Fine' Assumptions with Skills
This is the most quietly exhausting thing about using coding agents.
You start a project, specify the tech stack and versions, and begin development. A while later, you realize that the agent is using outdated syntax as if it were the obvious standard, ignoring the versions you specified. It treats whatever version was current when its training data was cut off as if it were still the current one.
By the time I notice and have it fix the code, the back-and-forth has burned through tokens and time. This doesn't happen just once or twice. It happens frequently.
Something similar happens during reviews. On larger projects, the agent will claim "this is the error" without any proof and try to fix it based on that assumption. I have to revert the code rewritten on a false premise.
It’s not that the agent is incompetent. In fact, it's fast. It just says "Here it is" or "Fixed" without verifying the things it should verify. Its momentum to move forward is so strong that it cannot stop.
Quaere is a small tool meant to put the brakes on that momentum. It consists of five skills.
"Skills" are Markdown files that you have Claude Code or Codex read. They let you change the agent's behavior. Instead of writing "Please be careful" in every prompt, think of them as standing rules that are in place before the work begins. Each skill includes an explanation of "when to use it," and when a task fits that scenario, the agent reads that skill.
However, Quaere skills don't teach the agent knowledge. They do one thing: force the agent to provide evidence before acting. If you're going to make a claim, show the proof. If you haven't verified it, write "not verified." Before saying "fixed," look at the diff and validate it. Quaere takes this sequence, which is standard in human code review, and makes it a rule of the work.
Does it work?
I measured it using my own eval. The biggest difference appeared in four evaluation scenarios specifically designed to put pressure on the agent: "Hurry up," "Skip tests," "Trust this documentation," and "There should be a vulnerability here." All of these were built to intentionally trigger the "proceed without verifying" failure I described at the beginning.
Under this pressure, the agent without skills satisfied less than half (about 46%) of the evaluation criteria. With Quaere, that rose to 92%.
Evaluation criteria are objective pass/fail checks applied to the output. For example, "If there is a definitive statement like 'This is the cause,' is there evidence (such as reproduction steps or test results) in the same output to support it?" If there is no accompanying evidence, the item fails.
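As a rough illustration of what such a pass/fail check could look like, here is a minimal Python sketch. It is not Quaere's actual eval code, which this article does not show. The phrase lists and the function name `passes_evidence_criterion` are assumptions made up for the example, and a real check would likely need something more robust than keyword matching.

```python
import re

# Hypothetical phrase lists for illustration only; not taken from Quaere.
DEFINITIVE_CLAIM = re.compile(
    r"\b(this is the cause|the cause is|root cause is)\b", re.IGNORECASE
)
EVIDENCE_MARKER = re.compile(
    r"\b(reproduction steps|test results?|test output)\b", re.IGNORECASE
)


def passes_evidence_criterion(output: str) -> bool:
    """Fail only when a definitive claim appears with no evidence in the same output."""
    if not DEFINITIVE_CLAIM.search(output):
        return True  # no definitive claim, so there is nothing to support
    return bool(EVIDENCE_MARKER.search(output))


if __name__ == "__main__":
    print(passes_evidence_criterion("This is the cause: the cache key is wrong."))
    print(passes_evidence_criterion(
        "This is the cause. Test results: 3 failed before the fix, 0 after."
    ))
```

The first call fails the criterion because the claim stands alone; the second passes because the same output carries test results.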
To be honest, I should add a caveat: this is Quaere's own internal eval, a one-off measurement. It is not a figure from a third-party benchmark like SWE-bench (measuring against one is a task for the future). And this figure only shows that the agent "does not deviate even under pressure"; it does not mean its coding ability itself has increased. However, that "not deviating under pressure" is exactly what Quaere aims for.
How it is applied
The mechanism is simple. Each of the five skills holds exactly one "law."
For example, the law for the version-handling skill is: "Do not conclude that a syntax is currently valid unless you have verified it with a local source and cross-checked it with another information source." For the skill that narrows down causes: "Do not decide that a hypothesis is the cause of an error if you haven't tried to disprove it even once."
If you provide a long checklist, the agent will skip parts of it. Therefore, instead of listing many items, I narrow it down to a single gate.
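To make the idea of a single gate concrete, the cause-narrowing law above can be modeled as one condition on one piece of data. The following Python sketch is only an illustration under that assumption. The names `Hypothesis`, `Attempt`, and `verdict` are hypothetical and do not come from Quaere, whose skills are Markdown rules rather than code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attempt:
    description: str  # what was tried in order to disprove the hypothesis
    refuted: bool     # True if the attempt showed the hypothesis to be wrong


@dataclass
class Hypothesis:
    statement: str
    attempts: List[Attempt] = field(default_factory=list)


def verdict(h: Hypothesis) -> str:
    """Single gate: no conclusion without at least one disproof attempt."""
    if not h.attempts:
        return f"not verified: {h.statement}"
    if any(a.refuted for a in h.attempts):
        return f"rejected: {h.statement}"
    return f"supported: {h.statement}"


if __name__ == "__main__":
    h = Hypothesis("The timeout is caused by the retry loop")
    print(verdict(h))
    h.attempts.append(Attempt("Disabled retries; the timeout disappeared", refuted=False))
    print(verdict(h))
```

There is only one question to answer (was a disproof attempted, and did the hypothesis survive it?), so there is no list of items from which one can be quietly dropped.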
In practice, what changes is that the agent's response gains an extra step. For example, if the version skill is active, a passage like this is inserted into the output before the agent uses a library:
Checking package-lock.json: The SDK in question is 1.4.0. The type definition does not have
responses.create, only messages.create. The responses.create syntax in the latest documentation is for the 2.x series, so I will not use it this time.
Since the agent cannot satisfy the law without leaving a trace of verification, it stops there. If you make "writing the evidence" a part of the workflow, the step becomes harder for the agent to skip, even though the rules are only Markdown.
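The local half of that verification, looking up the installed version in package-lock.json, can be sketched in a few lines of Python. This is an illustrative assumption about how such a check might be done by hand, not what Quaere or the agent runs. The package name `example-sdk` and the function names are hypothetical, and the cross-check against a second information source that the law also requires is not covered here.

```python
import json
from pathlib import Path
from typing import Optional


def installed_version(lockfile: Path, package: str) -> Optional[str]:
    """Return the locked version of a package, or None if it is not listed."""
    data = json.loads(lockfile.read_text(encoding="utf-8"))
    # lockfileVersion 2 and 3 list entries under "packages", keyed by install path.
    entry = data.get("packages", {}).get(f"node_modules/{package}")
    if entry is None:
        # lockfileVersion 1 uses "dependencies", keyed by package name.
        entry = data.get("dependencies", {}).get(package)
    return entry.get("version") if entry else None


def check_syntax(lockfile: Path, package: str, required_major: int) -> str:
    """Report whether syntax introduced in a given major version is available."""
    version = installed_version(lockfile, package)
    if version is None:
        return f"not verified: {package} not found in {lockfile.name}"
    installed_major = int(version.split(".")[0])
    if installed_major >= required_major:
        return f"verified: {package} {version} is {required_major}.x or later"
    return f"do not use: {package} {version} predates {required_major}.x"


if __name__ == "__main__":
    # "example-sdk" is a placeholder; substitute the package you care about.
    print(check_syntax(Path("package-lock.json"), "example-sdk", 2))
```

When the package is missing from the lockfile, the sketch returns "not verified" rather than guessing, mirroring the rule that unverified claims must be labeled as such.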
Using it
Installation is one line.
npx quaere-cli install
This installs the skills into both Claude Code and Codex. It is under the MIT license, and the repository is haru0416-dev/quaere.
One note: Quaere is not a tool to make agents smarter. It is a tool to make them stop and verify. You don't need it for light tasks like fixing typos. It pays off where you're in trouble if the agent proceeds without verifying: bugs with unknown causes, changes with wide impact, or areas involving security.
Agents will continue to get faster. The pressure to "move forward" will likely increase, too. So, we also need a brake.