Translated by AI
dartrics: An AI-First Dart Metrics Tool for Humans and AI Agents
Introduction
Inspired by the Golden Week holidays, I decided to build a tool by AI, for AI, and of AI.
Every day, the amount of code written by AI agents continues to increase. While I find this convenient, it also feels like there are more issues to consider. In particular, AI agents seem to have a habit of incorporating surrounding code as reference models, which can lead to good results in some cases, but the opposite in others.
Additionally, considering the review process, the quality of the code when an AI submits it as "done" is crucial. Currently, I perform multiple round-trips from this point, but if the initial submission is poor, I sometimes have to request a complete rewrite. Since requesting a rewrite usually fixes the issue, I thought: if there were a way for the AI to maintain the quality of each piece of code, wouldn't the AI agent be able to evaluate and check its own code to produce better results?
While experimenting during Golden Week, an idea came to me: pass external, fixed standards to the AI instead of having it rely on surrounding code. If the standards are external, we should be able to suppress the bias toward overfitting to that surrounding code. dartrics is an attempt to ground this framework in academic metrics and turn it into a tool that is easy and practical to operate in 2026.
In this article, I will introduce why I made it, how I designed it, and how to use it.
dartrics
dartrics is a code analysis CLI for Dart and Flutter projects. Built on top of the analyzer package, it calculates multiple metrics and detects unused public APIs. The next section covers the specific pillars of its design.
With tools like Claude Code, it has become standard for AI agents to operate autonomously. In this context, if an AI agent can obtain evaluations of code from multiple perspectives during code analysis and self-review, it should be able to combine them appropriately to make decisions.
Based on this hypothesis, I designed it to be self-contained as a CLI and usable by AI agents without additional guidance.
dart pub global activate dartrics
You can install it with the above command. The intended usage is to give an AI agent an instruction such as "Use dartrics to detect candidates for unnecessary code and handle the parts where deletion is appropriate," and have it carry out the request. dartrics positions itself as a CLI that works the same way whether a human runs it directly or it is handed to an AI agent.
Feature Overview
dartrics is not a completely new tool. The category of measuring metrics itself has decades of accumulation. However, whereas existing tools were designed with humans as the primary readers, dartrics differs in that it was rebuilt with AI agents as the primary readers. Below, I will explain this, divided into three pillars.
Pillar 1: Designing for AI-first
The primary user of dartrics is the AI agent. Of course, humans can also use it directly as a CLI, but the axis for design decisions is whether it makes sense when passed to an AI. It is an AI-first metrics tool in the same sense as "mobile-first" or "API-first."
This policy has another implication: the choice not to try and make the CLI overly intelligent.
In the world of metrics tools, there is the idea that "if the tool were smarter, it could automatically fix violations." dartrics does not take that direction. Instead, it focuses on outputting results in an interpretable format. It takes the stance that code decisions should be delegated to AI agents, which are closer to "intelligence" than a CLI.
This is a decision that prioritizes the performance and improvement of AI agents. Furthermore, dartrics provides output formats tailored to different users (the --reporter option mentioned later). For AI, it offers ai (YAML-based, emphasizing token efficiency), and for humans, md or console. The aim is for each consumer to receive the format that suits it best, which fits how autonomous AI agents operate.
Pillar 2: Using "Correct AST" as a Foundation
Another pillar is measuring metrics using an AST equivalent to that used during compilation. Specifically, I used the official Dart analyzer package, referencing the same AST as dart analyze and IDEs.
If you just want to measure metrics "in some way," you could get by with regular expressions or simple parsers. However, this leads to problems such as:
- Not being able to follow new Dart syntax like generics, named parameters, and switch expressions
- Ambiguous determination of method boundaries, causing method length and complexity to deviate from reality
- Unreliable detection of unused code
Conversely, if you use an AST equivalent to compilation time, these discrepancies do not occur in principle.
It is particularly effective for unused code detection. The analyzer provides a complete AST that includes name resolution. This determines "where a certain function/class is called" at the same level as the compiler. Since various metrics can be measured after removing unused code, the numbers for coupling and complexity can be trusted as "numbers for living code."[1]
The accuracy of the information output by dartrics is supported by this foundation of a correct AST.
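The gap between text matching and a parsed tree can be illustrated with a small sketch. It is written in Python with the standard `ast` module purely as an analogy; it is not how dartrics or the Dart analyzer is implemented, and the sample function and the set of counted constructs are assumptions chosen for the demonstration.

```python
import ast
import re

# Sample input: the string literal deliberately contains branching keywords.
SOURCE = '''
def classify(n):
    label = "if n > 0 and n < 10"
    if n > 0 and n < 10:
        return label
    return "other"
'''


def regex_decision_points(source: str) -> int:
    """Naive approach: count every textual occurrence of a branching keyword."""
    return len(re.findall(r"\b(if|for|while|and|or)\b", source))


def ast_decision_points(source: str) -> int:
    """Count branching constructs on the parsed syntax tree instead."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp)):
            count += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" is one BoolOp with three values: two extra paths.
            count += len(node.values) - 1
    return count


print("regex:", regex_decision_points(SOURCE))
print("ast:  ", ast_decision_points(SOURCE))
```

The regex version also counts the keywords inside the string literal, so it reports twice as many decision points as the tree-based version. This is the kind of discrepancy that a compiler-grade AST avoids.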
Pillar 3: Anchoring Metrics to Academic Literature
The final pillar is ensuring that each metric is tied to its original academic literature. For example, Cyclomatic Complexity is linked to McCabe (1976), Cognitive Complexity to SonarSource (2017), LCOM4 to Hitz & Montazeri (1995), coupling to Chidamber & Kemerer (CK, 1994), instability to Martin (1994), and Halstead Volume to Halstead (1977).
Through dogfooding, I confirmed that current AI can hallucinate the definitions or the rationale for thresholds of metrics from its training data if it is not given appropriate instructions.
dartrics includes the rationale, refactor hints, and source for each violation in its output, eliminating the need for AI to search again for "what Cyclomatic Complexity is" or "why the threshold is 10." By citing the sources, we prevent AI hallucinations. At the same time, it allows humans to verify which literature an AI agent's decision is based on. These are the two effects we are aiming for.
However, since it is not always appropriate to apply metrics from literature to Dart, the metrics adopted and their thresholds are adjusted individually. The details of this process are summarized in doc/calibration.md in the repository, so please check it if you are concerned during operation.
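For readers who want a concrete reminder of what two of the cited definitions compute, here is a minimal Python sketch of the textbook formulas for Martin's instability and Halstead Volume. The function names and the zero-input conventions are assumptions made for this example; the calibrated definitions dartrics actually uses are the ones described in doc/calibration.md.

```python
import math


def instability(efferent: int, afferent: int) -> float:
    """Martin's instability I = Ce / (Ca + Ce).

    0.0 means maximally stable, 1.0 maximally unstable.
    Returning 0.0 for an isolated unit is a convention assumed here.
    """
    total = efferent + afferent
    return efferent / total if total else 0.0


def halstead_volume(
    distinct_operators: int,
    distinct_operands: int,
    total_operators: int,
    total_operands: int,
) -> float:
    """Halstead Volume V = N * log2(n), with N = N1 + N2 and n = n1 + n2."""
    vocabulary = distinct_operators + distinct_operands
    length = total_operators + total_operands
    return length * math.log2(vocabulary) if vocabulary else 0.0


print(instability(efferent=3, afferent=1))
print(halstead_volume(4, 4, 10, 6))
```

Having the formula and its source next to the number is what lets both the agent and a human reviewer check a judgment against the literature rather than against memory.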
Including this organization of academic literature, I used AI agents for much of the design and implementation of dartrics. From tracking the original sources of each metric, auditing the sources, and deciding on deletions, to writing the majority of the code, we proceeded with a system where a human proposed "what we want to achieve" and an AI agent carried out the actual work.[2]
Refinements in dartrics
The three pillars are designed to interlock within the repetition of iterations between the AI agent and the CLI output. The results of the AI executing the CLI are returned in a format readable by the AI (Pillar 1), as violations based on a correct AST (Pillar 2), and with grounds from academic literature (Pillar 3). The AI makes judgments based solely on this return value and connects it to the next CLI call. The pillars function individually, but they only gain their full meaning within this loop.
Below, I will introduce the implementation points refined to make this iteration loop successful.
Identifying Issues via ID
Violations are assigned IDs that are stable across executions. This allows for mechanical tracking of "which violations from the previous execution have been fixed and which remain."
Development using AI agents is no longer something that ends in a single shot. In the daily cycle, code is added and deleted. To run this loop, it becomes necessary to identify "which point the violation was referring to" among the metric results. A stable ID is a core design feature.[3]
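The idea of deriving an ID from the file, scope, and metric can be sketched as follows. This is a Python illustration under stated assumptions: the choice of SHA-256, the NUL separator, the 12-character truncation, and the sample values are all hypothetical, since the article only says that a SHA of the three elements is taken.

```python
import hashlib


def violation_id(file_path: str, scope: str, metric: str, length: int = 12) -> str:
    """Derive a deterministic ID from the three identifying elements.

    The NUL separator keeps ("a", "bc", ...) and ("ab", "c", ...) distinct.
    """
    key = "\0".join((file_path, scope, metric))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:length]


# Hypothetical violation coordinates, used only for the demonstration.
first = violation_id("lib/src/parser.dart", "Parser.parse", "cyclomatic-complexity")
second = violation_id("lib/src/parser.dart", "Parser.parse", "cyclomatic-complexity")
other = violation_id("lib/src/parser.dart", "Parser.parse", "method-length")

print(first == second)  # same inputs give the same ID on every run
print(first == other)   # a different metric gives a different ID
```

Because the ID does not depend on line numbers or on the measured value, it stays the same when unrelated code above the violation is added or removed.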
Output Formats
As touched upon in Pillar 1, output formats are prepared and optimized for different readers. The entrance is switched using --reporter.
| reporter | Target Audience | Usage |
|---|---|---|
| ai | AI agent | Pipeline to LLM. YAML-based, emphasizing token efficiency |
| console | Human | Interactive usage in terminal |
| md | Human | Pasting into PR comments or documentation |
| sarif | Machine / Review UI | Integration with GitHub Actions Code Scanning |
| json | Machine | Feeding into a custom dashboard |
The AI format (--reporter ai) is a YAML-based output that is less redundant than JSON and prioritizes token efficiency. Decoration is kept to a minimum so that more violations fit into the LLM's context window. When passed to Claude Code (Opus 4.7) or similar, the agent interprets the violation ID, rationale, and refactor hints as-is, without needing a formatting script.
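A rough way to see why a YAML-style layout is less redundant than JSON is to render the same record both ways and compare sizes. The Python sketch below does this with a hand-written emitter; the record and its field names are hypothetical and do not reproduce the real --reporter ai schema, and character count is only a coarse proxy for token count.

```python
import json

# Hypothetical violation record; field names are assumptions for this example.
violation = {
    "id": "a1b2c3d4e5f6",
    "metric": "cyclomatic-complexity",
    "file": "lib/src/parser.dart",
    "scope": "Parser.parse",
    "value": 14,
    "threshold": 10,
    "source": "McCabe (1976)",
}


def to_yaml_like(record: dict) -> str:
    """Render a flat mapping as 'key: value' lines (no quoting or escaping)."""
    return "\n".join(f"{key}: {value}" for key, value in record.items())


as_pretty_json = json.dumps(violation, indent=2)
as_compact_json = json.dumps(violation, separators=(",", ":"))
as_yaml = to_yaml_like(violation)

print("pretty JSON :", len(as_pretty_json))
print("compact JSON:", len(as_compact_json))
print("YAML-like   :", len(as_yaml))
```

The savings come from dropping the quotes, braces, and commas that JSON requires around every key and string value, which adds up when hundreds of violations are packed into one context window.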
Note that there is still room for improvement regarding output formats for humans and git management. If there is a form in which you would like to use the output, please feel free to provide feedback via Issue.
Detection of Unused Public APIs
The dartrics unused subcommand detects unused public APIs. It picks up code that is defined as a public API but is not referenced either internally or externally. This covers an area that dart fix and the unnecessary_* family of lints cannot catch, and it should be particularly useful during library development.
When used directly by humans, this is probably the feature that feels the most "convenient." This is because searching for unused code diligently is quite a chore. By listing candidate unused code via CLI and checking them individually, you can greatly save that effort. A filter option is also available, so please check the options if you are interested.
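The principle behind listing unused candidates, collecting public definitions and subtracting everything that is referenced, can be sketched in a few lines. This Python version works on a single module and matches by name only, so it is a simplified analogy rather than the project-wide, resolved analysis the article describes; the sample source and the `entry_points` parameter are assumptions.

```python
import ast

SOURCE = '''
def used_helper():
    return 1

def unused_helper():
    return 2

def _private():
    return 3

def main():
    return used_helper()
'''


def unused_public_names(source: str, entry_points=("main",)) -> list[str]:
    """Return public top-level definitions that are never referenced."""
    tree = ast.parse(source)
    # Public definitions: top-level functions and classes without a leading "_".
    defined = {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        and not node.name.startswith("_")
    }
    # References: every name that is read anywhere in the module.
    referenced = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }
    return sorted(defined - referenced - set(entry_points))


print(unused_public_names(SOURCE))
```

The output is a list of candidates, not a verdict: as with dartrics, each entry still needs a human or an agent to decide whether deletion is appropriate.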
Usage
Installation and Analysis
dart pub global activate dartrics
dartrics analyze lib/
If you want to pass it to Claude Code, etc., you can also pipe it with the AI format.
dartrics analyze lib/ --reporter ai | claude -p "Refactor violations"
The output is already in a format that the LLM can understand, so there is no need to insert a formatting script.
Closing the Loop with Regression Analysis
To check for improvements after refactoring, use the regression subcommand.
dartrics regression --before HEAD~1 --after HEAD --reporter ai
The loop of analyze → ask AI to fix → check diff with regression can be run with a single command. Is the created commit really an improvement? Is there a better way to improve it? Humans can run these checks manually, and running regression can also be formalized as a step in the code generation process.
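Conceptually, a before/after comparison reduces to set operations on stable violation IDs. The Python sketch below shows that idea only; the function name, the category labels, and the sample IDs are hypothetical and do not describe the actual output of the regression subcommand.

```python
def compare_runs(before: set[str], after: set[str]) -> dict[str, list[str]]:
    """Classify violations by comparing the ID sets of two analysis runs."""
    return {
        "fixed": sorted(before - after),       # present before, gone now
        "remaining": sorted(before & after),   # present in both runs
        "introduced": sorted(after - before),  # new in the later run
    }


# Hypothetical IDs from two runs.
before_ids = {"id-aaa", "id-bbb", "id-ccc"}
after_ids = {"id-bbb", "id-ddd"}

result = compare_runs(before_ids, after_ids)
print(result)
```

A commit that empties "introduced" while growing "fixed" is a reasonable signal of improvement, and an agent can use the "remaining" list as the work queue for its next pass.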
Configuration
Thresholds and targets can be adjusted in analysis_options.yaml.
dartrics:
  metrics:
    cyclomatic-complexity:
      warning: 10 # Default 10
    method-length:
      enabled: true # Default off, explicitly enabled
  exclude:
    - "lib/generated/**"
A complete list of configurable keys is defined in the repository's schemas/dartrics-config.schema.json.
Integration into IDE / dart analyze
In addition to the CLI, it is also provided as an analyzer plugin. By writing the following in analysis_options.yaml and restarting the analysis server, three lightweight rules (Cyclomatic Complexity, Cognitive Complexity, Number Of Parameters) will be displayed as inline warnings in your IDE or during dart analyze.
plugins:
  dartrics:
Heavy metrics (LCOM4, CBO, RFC, Martin-style coupling) and unused public API detection are limited to the CLI because they require cross-project indexing. This supports a two-stage usage: catching light violations in the IDE while writing code, and performing deep analysis and unused detection via the CLI.
Bundled Documentation
Definitions of metrics and how to build AI loops are bundled within the CLI.
dartrics manual # Metric definitions and interpretation
dartrics ai-loop # How to combine with AI agents
dartrics rules # List of active rules and thresholds
By bundling it in the CLI, the AI agent can automatically read the documentation and execute the CLI.
What moved me the most after building it was being able to bundle the documentation. An AI agent does not even need to search GitHub; it simply reads the official documentation in the CLI and goes straight into processing. I believe that providing this flow from the tool side is a value unique to CLIs in the AI era.
Other Subcommands
Besides what I have touched upon so far, the following subcommands are available:
dartrics unused # Detection of unused public APIs
dartrics report # Conversion from analysis result JSON to other formats
dartrics doctor # Environment check
You can check the list with dartrics --help, so please check it if you are interested.
Other Research
Since it went well with Dart, I wanted to try it with other languages. I asked Opus to create a Rust version, and after trying it out, something that actually works is taking shape. My conclusion at this point is that the design of an AI-first CLI and AST-based analysis is not unique to Dart.
There were many occasions where I reflected findings from the Rust implementation back to the Dart side; trying common ideas across multiple languages was meaningful. It also doubles the dogfooding target.
Since I am not familiar with Rust, I have not released cargo-rustics. It is strictly an experiment to check how portable the concept is. Its processing time is slower than the Dart version, but I count the very fact that it works as a result. I plan to dig deeper into this, as the use of Rust is likely to expand. It is also very interesting because it prompts me to investigate various language features.
Conclusion
dartrics is a tool designed on the premise of delegating decisions to AI agents.
As of May 2026, AI agents still have a bias in the Dart/Flutter domain and a habit of incorporating surrounding code as reference models. dartrics is a tool that reinforces the quality of context from the tool side. If AI performance improves, will this reinforcement become even more effective? It is safe to say that I am betting on the future performance improvements of AI agents.
It is still a tool under active development. Feedback and Issues are always welcome. Please try it out.
[1] No matter how high the cyclomatic complexity is, if no one calls the function, the first thing you should do is delete it.

[2] Perhaps it is simply that it was too difficult for an average engineer like me to introduce or manage these metrics, but AI could do both.

[3] Violations with the same file, scope, and metric are assigned a single ID by taking the SHA of those three elements. This design ensures the same ID is maintained across executions.