The content below is an AI-generated translation. This is an experimental feature and may contain errors.

Using Claude to Determine CVE Relevance: A Triaging Design for Prioritizing Vulnerabilities via Dependency Tree Matching

Deciding whether a vulnerability actually matters to your company is a task where AI excels, because it requires cross-referencing CVEs, dependency trees, and usage

Dependabot and npm audit can accurately answer, "Are there any dependencies that match this vulnerability?" This should not be left to AI, as it is a mechanical comparison between lockfiles and version ranges.

However, the part that actually consumes time in operations is what comes next: "Out of the 40 items identified as affected, which ones should we patch this week?" This prioritization cannot be decided by CVSS scores alone. It is an ambiguous task that cannot be fully rule-based, requiring the cross-referencing of CVE content, your internal dependency tree, and how the package is actually used.

This is where AI excels. Given the body of the security advisory and your dependency tree, it can rank each item's urgency for your company and explain why. In this article, I will present a triage design that divides roles into three layers: deterministic tools, AI, and humans.

Why "affected or not" is insufficient for smooth operations

Personally, I have set up a triage mechanism that fetches advisories with a CVSS score of 7.0 or higher from the GitHub Advisory Database every 5 minutes and streams them to Discord. However, this is a general-purpose filter that determines if it "affects cloud users in general," and it does not look at the dependency tree of specific projects. There are three limitations to this design.

  1. General relevance judgments produce noise: A CVE for axios might be flagged as "relevant to cloud users." However, if your company does not use that package, it is just noise.
  2. Deterministic tool output stops at a list of affected items: npm audit / pip-audit / govulncheck / Dependabot accurately answer whether an item is "affected" by comparing it with the lockfile. They are faster and more accurate than AI. However, their prioritization only goes as far as the generic CVSS score, and they do not provide the "urgency for your company."
  3. The real toil is in prioritization: You face 40 affected items and have to decide which ones to handle this week. That depends on reachability (whether the vulnerable code is actually called from your code), exposure (whether it sits behind a public endpoint or in an internal batch), and usage (whether you use the relevant feature). It cannot be determined by a single score like CVSS.

Given these three points, it is clear that trying to complete the determination in a single layer is a mistake. "Does it match?" and "Is it urgent for us?" are questions of different natures, and they require different tools.

Division of roles — Deterministic tools for "affected status," AI for "priority determination"

Assign tasks based on the question. If you force the AI to do everything, even tasks that can be solved deterministically become ambiguous, leading to hallucinations.

| Question | Responsibility | Reason |
| --- | --- | --- |
| Are there any dependencies that match this vulnerability? | npm audit / pip-audit / govulncheck / Dependabot | Comparing lockfiles and version ranges is a deterministic process. Giving it to AI creates room for misjudgment. |
| Assuming it matches, how urgent is it for our company right now? | Claude | Contextual inference of reachability, exposure, and usage is ambiguous and cannot be reduced to rules. |
| Advisories without structured version ranges | Claude | Fills the gap where deterministic tools cannot perform matches. |

I will elaborate on the third row. Deterministic matching assumes that the advisory has machine-readable version ranges. In the OSV schema, this corresponds to affected[].ranges, and in the GitHub Advisory DB, it corresponds to the extended field vulnerable_version_range (OSV schema / github/advisory-database).

However, there have been reports of instances where these ranges are missing during the conversion from CVE to OSV (osv.dev issue #4489). Deterministic tools miss advisories lacking these ranges. I supplement this with AI, which can estimate the range from the explanatory text.
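
One way to find the advisories that need this treatment is to check each record for machine-readable version data before matching. The following is a minimal sketch, assuming records follow the OSV schema's `affected[].ranges[].events` and `affected[].versions` fields; the function names are illustrative and not part of any tool mentioned here.

```python
def has_structured_ranges(osv_record: dict) -> bool:
    """Return True if any affected entry carries machine-readable version data."""
    for affected in osv_record.get("affected", []):
        # An explicit list of affected versions can also be matched mechanically.
        if affected.get("versions"):
            return True
        for version_range in affected.get("ranges", []):
            if version_range.get("events"):
                return True
    return False


def split_by_matchability(records: list) -> tuple:
    """Separate records a deterministic matcher can handle from those
    whose range has to be estimated from the advisory text."""
    matchable, needs_text_estimate = [], []
    for record in records:
        if has_structured_ranges(record):
            matchable.append(record)
        else:
            needs_text_estimate.append(record)
    return matchable, needs_text_estimate
```

Only the second list would go to the AI for range estimation; everything in the first list stays with the deterministic tools.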

Note that Go's govulncheck analyzes the call graph to deterministically determine reachability as well. In other words, for Go, reachability can also be determined mechanically. AI's reachability inference is most effective for npm/pip ecosystems, where such analysis is not standard.

Now that the roles are decided, we begin the implementation by creating the input for the AI: the dependency tree inventory.

Implementation — Creating an inventory and passing it to Claude

1. Consolidating the dependency tree into an inventory

Export the list of dependencies for each ecosystem and normalize them into a common format.

# npm: Include transitive dependencies
npm ls --all --json > /tmp/npm-tree.json

# pip: List of installed packages
pip list --format=json > /tmp/pip-list.json

# Go: Module graph
go list -m -json all > /tmp/go-mods.json

Align these into the following common format. Keeping track of the ecosystem, name, version, and whether it is a direct or transitive dependency is important, as "direct vs. transitive" will be a factor for urgency determination later.

[
  {"ecosystem": "npm", "name": "lodash", "version": "4.17.20", "type": "transitive"},
  {"ecosystem": "pip", "name": "requests", "version": "2.31.0", "type": "direct"}
]
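
The normalization could look roughly like the sketch below. It assumes the usual output shapes of the three commands above: nested `dependencies` objects for `npm ls`, a flat array for `pip list`, and a stream of concatenated JSON objects with `Path`, `Version`, `Main`, and `Indirect` for `go list`. Because `pip list` does not record direct versus transitive, the `direct_names` argument (for example, the names in your requirements file) is an assumption of this sketch, as is the `"go"` ecosystem label.

```python
import json


def from_npm_ls(tree: dict) -> list:
    """Flatten `npm ls --all --json`; top-level entries count as direct."""
    seen = {}

    def walk(deps: dict, depth: int) -> None:
        for name, info in deps.items():
            version = info.get("version")
            if version is None:  # e.g. missing optional dependencies
                continue
            kind = "direct" if depth == 0 else "transitive"
            # A package that is both direct and transitive stays "direct".
            if seen.get((name, version)) != "direct":
                seen[(name, version)] = kind
            walk(info.get("dependencies", {}), depth + 1)

    walk(tree.get("dependencies", {}), 0)
    return [
        {"ecosystem": "npm", "name": n, "version": v, "type": t}
        for (n, v), t in sorted(seen.items())
    ]


def from_pip_list(packages: list, direct_names: set) -> list:
    """Normalize `pip list --format=json`; direct_names is supplied by the caller."""
    def canon(name: str) -> str:
        return name.lower().replace("_", "-")

    direct = {canon(n) for n in direct_names}
    return [
        {"ecosystem": "pip", "name": p["name"], "version": p["version"],
         "type": "direct" if canon(p["name"]) in direct else "transitive"}
        for p in packages
    ]


def from_go_list(text: str) -> list:
    """Parse `go list -m -json all`, which emits concatenated JSON objects."""
    decoder = json.JSONDecoder()
    items, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # raw_decode does not skip leading whitespace
            pos += 1
            continue
        module, pos = decoder.raw_decode(text, pos)
        if module.get("Main"):  # the main module is not a dependency
            continue
        items.append({
            "ecosystem": "go", "name": module["Path"],
            "version": module.get("Version", ""),
            "type": "transitive" if module.get("Indirect") else "direct",
        })
    return items
```

Concatenating the three lists gives the inventory in the common format shown above.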

2. Leave "Is it affected?" to deterministic tools

Run a deterministic SCA (Software Composition Analysis) in parallel with the inventory creation.

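# npm audit and pip-audit exit non-zero when they report findings.
# "|| true" keeps a script running under "set -e"; note that it also hides real failures.
npm audit --json > /tmp/npm-audit.json || true
pip-audit -f json > /tmp/pip-audit.json || true
govulncheck -json ./... > /tmp/govulncheck.json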

The "affected packages + GHSA/CVE ID" output by these tools becomes the input for the subsequent AI judgment. Up to this point, "does it affect us?" has been answered deterministically.
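
As an illustration of pulling that input out of a report, the sketch below reads the npm audit JSON. It assumes the report format produced by npm 7 and later (`auditReportVersion: 2`), where `vulnerabilities` is keyed by package name and each `via` entry is either an advisory object or the name of another flagged package. The function name and output keys are hypothetical.

```python
import re

# GHSA IDs look like GHSA-xxxx-xxxx-xxxx; they appear in the advisory URL.
GHSA_PATTERN = re.compile(r"GHSA(?:-[0-9a-z]{4}){3}", re.IGNORECASE)


def affected_from_npm_audit(report: dict) -> list:
    """Collect one entry per (package, advisory) pair from an npm audit report."""
    findings = []
    for name, vuln in report.get("vulnerabilities", {}).items():
        for via in vuln.get("via", []):
            # String entries only name another flagged package; the advisory
            # itself is listed under that package, so skip them here.
            if not isinstance(via, dict):
                continue
            match = GHSA_PATTERN.search(via.get("url", ""))
            findings.append({
                "package": name,
                "ghsa_id": match.group(0) if match else None,
                "title": via.get("title"),
                "severity": via.get("severity"),
                "vulnerable_range": via.get("range"),
                "dependency_type": "direct" if vuln.get("isDirect") else "transitive",
            })
    return findings
```

The pip-audit and govulncheck reports have different shapes and would need their own small extractors.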

3. Have Claude determine the "urgency for our company"

Only pass advisories confirmed as affected to the AI. There are three inputs.

  • Advisory text (summary, impact, affected version range)
  • Target package and dependency type (direct/transitive)
  • Simple architectural context (e.g., "Public API is only payment-api. Others are internal batches")
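
A prompt that combines these three inputs might be assembled as in the following sketch. The wording of the instructions, the function name, and the exact enum strings for `exposure` are assumptions made for illustration; sending the prompt to the model is left out.

```python
import json

# Field descriptions shown to the model; the enum spellings are assumptions.
RESPONSE_FIELDS = {
    "ghsa_id": "the advisory ID as given",
    "matched_package": "package name",
    "dependency_type": "direct | transitive",
    "reachability": "likely | unlikely | unknown",
    "exposure": "internet-facing | internal | build-time",
    "urgency": "high | medium | low",
    "reason": "why this urgency was chosen",
    "recommended_action": "concrete next step",
}


def build_triage_prompt(advisory: dict, package: dict, architecture: str) -> str:
    """Combine advisory text, package info, and architecture notes into one prompt."""
    sections = [
        "A deterministic scanner has already confirmed that this advisory affects us.",
        "Judge the urgency for our environment, not the generic CVSS severity.",
        'If reachability cannot be determined from the information given, answer "unknown" instead of guessing.',
        "",
        "## Advisory",
        json.dumps(advisory, indent=2, ensure_ascii=False),
        "",
        "## Affected package",
        json.dumps(package, indent=2, ensure_ascii=False),
        "",
        "## Architecture context",
        architecture.strip(),
        "",
        "## Response format",
        "Return a single JSON object with exactly these fields:",
        json.dumps(RESPONSE_FIELDS, indent=2),
    ]
    return "\n".join(sections)
```

Keeping the response format in one dictionary makes it easy to keep the prompt and any later validation step in sync.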

Have the AI return the following structured JSON for each advisory.

{
  "ghsa_id": "GHSA-xxxx-xxxx-xxxx",
  "matched_package": "lodash",
  "dependency_type": "transitive",
  "reachability": "unknown",
  "exposure": "internal",
  "urgency": "medium",
  "reason": "lodash is an internal dependency for batch processing. No code path reachable from public endpoints could be confirmed. While it is a prototype pollution, input is limited to internal data.",
  "recommended_action": "Update lodash to 4.17.21 or higher in the next scheduled maintenance"
}

Define the criteria for each field:

  • reachability: Is the relevant feature called from your code? Three values: likely / unlikely / unknown. Have it return unknown when it cannot be determined. If you force a conclusion here, it will lead to hallucinations. A design that allows for abstention with a confidence level is effective for classification tasks in general.
  • exposure: Is it internet-facing, internal, or build-time only? Infer this from the architectural context.
  • urgency: high / medium / low. Not just the CVSS score, but the urgency "for your company" that takes reachability and exposure into account.
  • reason: Why this urgency was chosen. This serves as a basis for judgment when humans re-evaluate it.
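
Because the reply is meant to be consumed by scripts, it helps to reject anything that does not fit these criteria before acting on it. The sketch below is one possible validator; the enum strings for `exposure` and the function name are assumptions, and the other values come from the field definitions above.

```python
import json

ALLOWED_VALUES = {
    "dependency_type": {"direct", "transitive"},
    "reachability": {"likely", "unlikely", "unknown"},
    "exposure": {"internet-facing", "internal", "build-time"},
    "urgency": {"high", "medium", "low"},
}
REQUIRED_TEXT_FIELDS = ("ghsa_id", "matched_package", "reason", "recommended_action")


def parse_verdict(raw: str) -> dict:
    """Parse one model reply; raise ValueError if it does not fit the schema."""
    verdict = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(verdict, dict):
        raise ValueError("reply must be a JSON object")
    for field, allowed in ALLOWED_VALUES.items():
        if verdict.get(field) not in allowed:
            raise ValueError(f"{field} must be one of {sorted(allowed)}")
    for field in REQUIRED_TEXT_FIELDS:
        value = verdict.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{field} must be a non-empty string")
    return verdict
```

A reply that fails validation can be retried or routed to a human, rather than silently treated as low urgency.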

In earlier entries of this series, I had Claude classify terraform plan ~ (in-place updates) into 4 categories (related article) and categorize CloudWatch Logs errors (related article). The structure here is the same: I delegate the ambiguous areas where humans struggle to decide to the AI, as classifications with confidence levels.

4. Integrate into cron and CI

Run this at two trigger points.

cron (Daily advisory monitoring): Replace the judgment step of your existing triage script with a prompt that handles the inventory. Add the results of the deterministic SCA as input, and notify only those that are affected and marked as urgency: high. Compared to when we streamed notifications based on general relevance, notifications are now narrowed down to "high-priority items that affect us."

CI (Immediate check when adding dependencies): When the lockfile changes in a PR, regenerate the inventory and run the AI judgment only for the newly affected parts. If urgency: high is returned, comment on the PR or block the merge.

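# Run only when the lockfile changes in a PR (three-dot diff: changes since the merge base)
git diff --name-only origin/main...HEAD | grep -qE 'package-lock\.json|requirements\.txt|go\.sum' || exit 0

# Re-generate inventory -> deterministic SCA -> AI judgment for affected items only (pseudo-flow)
# build_inventory is a placeholder for your own inventory script
build_inventory > inventory.json
# npm audit exits non-zero when it finds vulnerabilities; do not let that abort the job
npm audit --json > audit.json || true
# Pass the 'affected' items from audit.json along with inventory/architecture context to the prompt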
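
The final step of that flow, turning the AI verdicts into a PR comment and a pass/fail result, could be sketched as follows. The function name, the comment wording, and the choice to list `reachability: unknown` items for review without blocking are assumptions of this sketch.

```python
def gate(verdicts: list) -> tuple:
    """Return (exit_code, comment_lines) for a list of verdict dicts.

    exit_code is 1 when any verdict has urgency "high", otherwise 0.
    """
    blocking = [v for v in verdicts if v.get("urgency") == "high"]
    needs_review = [
        v for v in verdicts
        if v.get("urgency") != "high" and v.get("reachability") == "unknown"
    ]
    lines = []
    for v in blocking:
        lines.append(f"BLOCK {v['ghsa_id']} ({v['matched_package']}): {v['reason']}")
    for v in needs_review:
        lines.append(
            f"REVIEW {v['ghsa_id']} ({v['matched_package']}): reachability unknown, check manually"
        )
    return (1 if blocking else 0), lines
```

In a CI job, the lines would be posted as the PR comment and the exit code passed to `sys.exit` so that a high-urgency finding fails the check.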

With this, "does it affect us?" and "when should we do it?" are offloaded to the machine side. What remains is the human domain.

Areas for human judgment

AI returns "candidates to look at this week" and the reasons why. The final decision rests with the human.

  • Whether to schedule emergency maintenance: A decision based on business impact, SLA, and the effort required for maintenance notifications. AI's urgency is just one of the inputs.
  • Immediate patch or mitigation strategy: If version updates cause dependency conflicts, choose a temporary measure (e.g., disabling the feature via settings).
  • Manual verification when reachability is unknown: For items where the AI could not determine reachability, check them by reading the code.
  • Re-evaluation based on external signals: Whether the relevant CVE is actually being exploited (CISA's KEV catalog), or the probability of exploitation (EPSS score). A low priority item may flip to high the moment a public exploit is released.
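
The re-evaluation in the last item can be partly mechanized so that humans see the flipped priority right away. The sketch below assumes you already have the set of CVE IDs from the KEV catalog and a mapping of CVE ID to EPSS score; the 0.5 threshold and the one-level bump are illustrative assumptions, not recommended values.

```python
URGENCY_ORDER = ["low", "medium", "high"]


def reevaluate(verdict: dict, cve_id: str, kev_ids: set, epss_scores: dict,
               epss_threshold: float = 0.5) -> dict:
    """Return a copy of the verdict, escalated when external signals warrant it."""
    updated = dict(verdict)
    if cve_id in kev_ids:
        # Known exploitation in the wild overrides the contextual judgment.
        updated["urgency"] = "high"
        updated["reason"] = verdict["reason"] + " Escalated: listed in the KEV catalog."
    elif epss_scores.get(cve_id, 0.0) >= epss_threshold and verdict["urgency"] != "high":
        # A high exploitation probability raises the urgency by one level.
        level = URGENCY_ORDER.index(verdict["urgency"])
        updated["urgency"] = URGENCY_ORDER[level + 1]
        updated["reason"] = verdict["reason"] + " Escalated: EPSS score above threshold."
    return updated
```

The escalation is appended to `reason` so that the person making the final call can see why the priority changed.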

By catching every case that might affect you with deterministic tools, using AI to narrow down when to act (with reasons), and having humans decide how to act, you can shrink the daily flood of security advisories to a few things to check today. A triage design works in operations when it neither dumps everything onto the AI nor locks the AI out, and offloads only the ambiguous priority determination.
