Evaluating Generative AI Tools for Programming (2025/04/14)

These are my personal impressions as of now. Since things are fluid, my opinions might change even by tomorrow.

Models

  • Claude-3.7-sonnet
    • Coding performance is overwhelmingly good. If you're unsure, just use this for now.
    • The limit for managing code is roughly 1,000 lines per file.
  • Gemini 2.5
    • Currently available for free. It's a good idea to use it extensively now to understand its quirks.
    • Since it can handle huge contexts, it's suitable for typical business programming—"reading a lot of code and writing a little."
    • It was unstable due to high load for a week after release, but it has stabilized recently.
    • As expected, its pure coding performance is inferior to Claude-3.7-sonnet.
  • deepseek-chat
    • Too slow to be useful with Cline.
    • I use it for brainstorming when building AI tools. It's cheap and safe even if you send huge amounts of data carelessly.

Coding Agents/Extensions

  • Cline
    • The so-called original.
    • I prioritize Roo, but Roo breaks spectacularly depending on the day's patch, so I use the original at those times.
    • MCP Marketplace is an advantage.
  • RooCode
    • A fork, but honestly, it has become something else entirely.
    • Being able to use Custom Mode yourself is convenient (in fact, the original's restriction to only Plan/Code is hard to use).
    • Diff application tends to break, and there are days when it doesn't work depending on the patch.
  • Gemini Code Assist
  • Copilot Agents
  • Copilot
  • Claude Code
    • A CLI workflow, not VS Code.
    • Since it narrows down the context based on git status, it's suitable for business programming on huge contexts.
    • It uses GH PR/git commit/git status in the prompts, so it's hard to make it work unless you're strictly following a git workflow.
    • While it's good if you're being serious, it didn't suit me because I'm the type who writes a lot of messy code first and creates commits later.

Although I know it has many users, Cursor doesn't fit my style at all, and I can't evaluate it fairly because it simply doesn't click with me.

MCP

Currently, there are none I can fully trust, so I generally avoid community-provided MCPs and build my own.

I occasionally use playwright-mcp and brave-search.

My custom MCPs and tool implementations are around here:

https://github.com/mizchi/ai-toolkit/tree/main/tools

TypeScript-Specific Issues

  • In any case, it is poor at executing .ts files.
    • It constantly fumbles with tsx, npx, and type: module until it breaks things.
  • It doesn't refactor unless specifically asked. If the code appears to work or passes tests, it tells the user it's done.
  • It tends to write too many try/catch blocks, which often lowers quality.
    • Even when I want errors caught collectively in the parent, the narrow context window makes it persistently swallow errors with try/catch, so error handling lacks consistency across the code.
  • Once it writes a print debug, it starts assuming it's okay to do so, generates a massive volume of logs, consumes them itself, and stops after hitting the context limit.
    • When I leave a task running before a break, it's usually stuck for this reason.
  • If it references an old library like jQuery or lodash even once, things unravel quickly from there. Human intervention is required early on.
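To make the try/catch point concrete, here is a minimal sketch of the pattern I want versus the pattern the AI tends to produce. The `parseConfig`/`loadAll` names are hypothetical, invented for illustration, not from my actual code:

```typescript
// Anti-pattern the AI tends to generate: every leaf function swallows
// its own errors, so the caller can never tell success from failure.
function parseConfigSwallowed(json: string): Record<string, unknown> {
  try {
    return JSON.parse(json);
  } catch {
    return {}; // the error disappears here; the caller sees an "empty config"
  }
}

// Preferred: let the leaf throw, and catch once at the boundary,
// where there is enough context to decide what to do with the failure.
function parseConfig(json: string): Record<string, unknown> {
  return JSON.parse(json); // may throw SyntaxError
}

function loadAll(sources: string[]): {
  configs: Record<string, unknown>[];
  errors: string[];
} {
  const configs: Record<string, unknown>[] = [];
  const errors: string[] = [];
  for (const src of sources) {
    try {
      // one catch site for all leaves, instead of one per function
      configs.push(parseConfig(src));
    } catch (e) {
      errors.push(e instanceof Error ? e.message : String(e));
    }
  }
  return { configs, errors };
}
```

With the swallowing version, a broken input and an empty-but-valid input are indistinguishable; with the boundary version, failures stay visible and are handled in one consistent place.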

How to Write Prompts

Instead of "the more the better," I have come to write very little.

For example, when I want to do TDD, I don't write in detail about TDD (assuming the AI has already learned it); I just write "Do TDD."

I provide at most a one-shot example of how to start writing the first test.

// test is implemented with vitest
import { test, expect } from "vitest";
test("1+1=2", () => {
  expect(1 + 1).toBe(2);
});

There is no definitive prompt; I add things to correct what didn't work through running tasks.

If there are libraries you want prioritized throughout the project, you just write them down.

Instead, I accumulate knowledge under the docs directory.

docs/
  how-to-test-with-lighthouse.md

Actual Workflow

In Cline, I imagine each source code file as a person and explicitly mention reference resources as if I'm replying repeatedly on X.

Rewrite @/src/get-perf.ts according to @/docs/how-to-test-with-lighthouse.md

The reason I'm writing fewer general-purpose prompts is that the docs side is growing.

In practice, the coding procedure follows these steps:

  • Have Claude generate code for implementation verification, aiming for a maximum of about 1,000 lines.
  • Have Claude write unit tests for the resulting code.
  • Once the operation is confirmed, refactor it myself.
  • Have Gemini integrate it into the existing codebase.
  • Have Gemini write integration tests.
  • Once refactored and module boundaries are sufficiently separated (narrowing the context), have Claude review it.
  • Reflect lessons learned in .clinerules or docs/*.md.
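As a concrete example of that last step, the lessons from this article might accumulate in a `.clinerules` file along these lines (the contents here are illustrative, not my actual rules file):

```markdown
# .clinerules (illustrative example)

- Run tests with vitest; start each feature from a failing test.
- Do not wrap leaf functions in try/catch; let errors propagate to the caller.
- Remove print debugging before reporting a task as done.
- Do not introduce jQuery or lodash; prefer standard APIs.
- Keep each file under roughly 1,000 lines.
```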

It has become a style where I start with so-called "vibe coding," and once it works, I refactor it myself or rewrite everything using it as a model. The most valuable part is the fact-finding of "whether what I wanted to do can be achieved with this library."

If I know the API needed to achieve the target, that fact alone is enough for me to rewrite it from scratch.

Refactoring and the Context Window

The biggest problem is that AI doesn't seem to have a perspective for managing large-scale code.

It appears that during the model training phase, they are trained with the goal of "getting it to work for now," and they probably haven't acquired concepts like "the smell of bad design."

That being said, there is no universal answer for "high-quality code," but if you don't want the project to fail, human intervention for refactoring is absolutely necessary.

Bad code propagates quickly. AI also degrades, so you must not compromise. Current AI does not meet my quality requirements and, in reality, continues to fall apart.

I can manage things well up to about 1,000 lines, and if context is lacking, it reaches its limit around 3,000 lines.

  • Claude runs out of context and keeps trying to fix a script of about 1,000 lines until it hits its limit and stops, costing about $10.
  • Gemini 2.5 has a large context window, so it can understand the code, but the quality of the generated code is poor, so it falls apart in terms of code quality before it can even get to the point of running.

Also, AI respects the original code too much, so it feels unsuitable for bold refactoring.

Impressions

  • Pros
    • It has clearly become faster to create code that uses complex libraries (e.g., lighthouse, puppeteer) for my specific needs.
    • Data analysis workloads are particularly fast.
    • AI excels at applying some kind of algorithm to existing code.
      • For example, "Weight the importance of crawled documents using the PageRank algorithm based on the link structure."
  • Cons
    • Explicitly specifying the documents to reference improves quality, but preparing the documents themselves is hard, and it's a burden on the human giving instructions.
    • Understanding existing code is truly difficult. Giving instructions directly correlates to human load.
    • My eyes have been hurting lately because I've been performing high-speed visual reviews of generated code.

Bonus: CLI Template

Besides Cline, I'm constantly making interactive CLI tools based on Function Calling.

I know the patterns I need, so I have about 350 lines of Deno code that implement them. I reuse this to build various things.

Note: bashTool is quite dangerous, so do not use it unless you understand what it does.

/**
 * No local dependency chat example with ai-sdk
 */
import {
  streamText,
  tool,
  jsonSchema,
  type ToolResult,
  type Tool,
  type CoreMessage,
} from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { google } from "@ai-sdk/google";
import { deepseek } from "@ai-sdk/deepseek";
import { parseArgs } from "node:util";
import { extract, toMarkdown } from "@mizchi/readability";
import path from "node:path";

async function runCommand(
  command: string
): Promise<{ code: number; stdout: string; stderr: string }> {
  const [cmd, ...args] = command.split(/\s+/g);
  const c = new Deno.Command(cmd, {
    args,
    stdout: "piped",
    stderr: "piped",
  });

  const { stdout, stderr, code } = await c.output();
  const decoder = new TextDecoder();
  return {
    code,
    stdout: decoder.decode(stdout),
    stderr: decoder.decode(stderr),
  };
}

const SYSTEM = `
You are an assistant who answers user questions.
Use tools as needed to answer user questions.
If only a URL is passed, read the content of that URL and summarize it.
<environment>
  pwd: ${Deno.cwd()}
</environment>
`
  .split("\n")
  .map((line) => line.trim())
  .join("\n");

// Resolve a model instance from an alias ("claude", "gemini", "deepseek") or a full model name
export function getModelByName<
  T extends
    | Parameters<typeof anthropic>[0]
    | Parameters<typeof google>[0]
    | Parameters<typeof deepseek>[0]
>(model: T, settings?: any) {
  if (model === "claude") {
    return anthropic("claude-3-7-sonnet-20250219", settings);
  }
  if (model === "gemini") {
    return google("gemini-2.5-pro-exp-03-25", settings);
  }
  if (model === "deepseek") {
    return deepseek("deepseek-chat", settings);
  }
  if (model.startsWith("claude-")) {
    return anthropic(model, settings);
  }
  if (model.startsWith("gemini-")) {
    return google(model, settings);
  }
  if (model.startsWith("deepseek-")) {
    return deepseek(model, settings);
  }
  throw new Error(`Model ${model} not supported`);
}

/// Tools
export const bashTool = tool({
  description: `
  Suggests the execution of a bash command to the user.
  The user confirms the command before execution. It may be rejected.
  `.trim(),
  parameters: jsonSchema<{ command: string; cwd?: string }>({
    type: "object",
    properties: {
      command: {
        type: "string",
        description: "The command to execute",
      },
      cwd: {
        type: "string",
        description: "Current Working Directory",
      },
    },
    required: ["command"],
  }),
  async execute({ command }) {
    const ok = confirm(`Run: ${command}`);
    if (!ok) {
      return `User denied.`;
    }
    try {
      const result = await runCommand(command);
      return result;
    } catch (e) {
      const message = e instanceof Error ? e.message : String(e);
      return message;
    }
  },
});

export const askTool = tool({
  description: "Ask a question to the user. Call this for user input",
  parameters: jsonSchema<{ question: string }>({
    type: "object",
    properties: {
      question: {
        type: "string",
        description: "The question to ask the user",
      },
    },
    required: ["question"],
  }),
  async execute({ question }) {
    console.log(`\n%c[ask] ${question}`, "color: green");
    const ret = prompt(">") ?? "no answer";
    if (!ret.trim()) Deno.exit(1);
    console.log(`\n%c[response] ${ret}`, "color: gray");
    return ret;
  },
});

export const readUrlTool = tool({
  description: "Read a URL and extract the text content",
  parameters: jsonSchema<{ url: string }>({
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "The URL to read",
      },
    },
    required: ["url"],
  }),
  async execute({ url }) {
    const res = await fetch(url).then((res) => res.text());
    const extracted = extract(res);
    return toMarkdown(extracted.root);
  },
});

export const readFileTool = tool({
  description: "Read an absolute file path and extract the text content",
  parameters: jsonSchema<{ filepath: string }>({
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "The absolute file path to read",
      },
    },
    required: ["filepath"],
  }),
  async execute({ filepath }) {
    if (!path.isAbsolute(filepath)) {
      return `Denied: filepath is not absolute path`;
    }
    const res = await Deno.readTextFile(filepath);
    return res;
  },
});

export const writeFileTool = tool({
  description: "Write text content to an absolute file path. User checks it",
  parameters: jsonSchema<{ filepath: string; content: string }>({
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "The absolute file path to write",
      },
      content: {
        type: "string",
        description: "The content to write to the file",
      },
    },
    required: ["filepath", "content"],
  }),
  async execute({ filepath, content }) {
    if (!path.isAbsolute(filepath)) {
      return `Denied: filepath is not absolute path`;
    }
    const ok = confirm(
      `Write ${filepath}(${content.length})\n${truncate(content)}\n`
    );
    if (!ok) return `User denied`;
    await Deno.writeTextFile(filepath, content);
    return "ok";
  },
});

const BUILTIN_TOOLS: Record<string, Tool> = {
  askTool,
  bashTool,
  readUrlTool,
  readFileTool,
  writeFileTool,
};

/// utils
function truncate(input: unknown, length: number = 100) {
  const str =
    typeof input === "string" ? input : JSON.stringify(input, null, 2);
  return str.length > length ? str.slice(0, length) + "..." : str;
}
const write = (text: string) => {
  Deno.stdout.write(new TextEncoder().encode(text));
};

async function loadMessages(filepath: string): Promise<CoreMessage[]> {
  try {
    const _ = await Deno.stat(filepath);
    const content = await Deno.readTextFile(filepath);
    return JSON.parse(content);
  } catch (_e) {
    return [];
  }
}

async function loadExternalTools(exprs: string[], cwd = Deno.cwd()) {
  const tools: Record<string, Tool> = {};
  for (const toolPath of exprs ?? []) {
    // from URL
    if (toolPath.startsWith("https://")) {
      const mod = await import(toolPath);
      tools[mod.toolName] = mod.default as Tool;
      continue;
    }
    // from local file
    const resolvedToolPath = path.join(cwd, toolPath);
    const mod = await import(resolvedToolPath);
    const baseName = path.basename(resolvedToolPath).replace(/\.tsx?$/, "");
    tools[baseName] = mod.default as Tool;
    console.log(`\n%c[tool-added] ${toolPath}`, "color: blue");
  }
  return tools;
}

/// Run
if (import.meta.main) {
  const parsed = parseArgs({
    args: Deno.args,
    options: {
      input: { type: "string", short: "i" },
      debug: { type: "boolean", short: "d" },
      modelName: { type: "string", short: "m" },
      maxSteps: { type: "string", short: "s" },
      maxTokens: { type: "string" },
      noBuiltin: { type: "boolean" },
      persist: { type: "string", short: "p" },
      tools: { type: "string", short: "t", multiple: true },
    },
    allowPositionals: true,
  });
  const modelName = parsed.values.modelName ?? "claude-3-7-sonnet-20250219";
  const debug = parsed.values.debug ?? false;
  const externals = parsed.values.tools
    ? await loadExternalTools(parsed.values.tools, Deno.cwd())
    : {};
  const usingTools: Record<string, Tool> = parsed.values.noBuiltin
    ? externals
    : {
        ...BUILTIN_TOOLS,
        ...externals,
      };
  let messages: CoreMessage[] = [];
  let writeMessages: (() => Promise<void>) | undefined = undefined;
  if (parsed.values.persist) {
    const outpath = path.join(Deno.cwd(), parsed.values.persist);
    messages = await loadMessages(outpath);
    writeMessages = async () => {
      await Deno.writeTextFile(outpath, JSON.stringify(messages, null, 2));
    };
    Deno.addSignalListener("SIGINT", async () => {
      try {
        await writeMessages?.();
      } finally {
        Deno.exit(0);
      }
    });
  }

  const firstPrompt = parsed.positionals.join(" ");
  if (firstPrompt) {
    messages.push({
      role: "user",
      content: firstPrompt,
    });
  }

  if (debug) {
    console.log("[options]", parsed.values);
    console.log("[tools]", Object.keys(usingTools));
    console.log("[messages]", messages.length);
  }

  const model = getModelByName(modelName, {});
  while (true) {
    if (messages.length > 0) {
      const stream = streamText({
        model,
        tools: usingTools,
        system: SYSTEM,
        messages: messages,
        maxSteps: parsed.values.maxSteps ? Number(parsed.values.maxSteps) : 100,
        maxTokens: parsed.values.maxTokens
          ? Number(parsed.values.maxTokens)
          : undefined,
        toolChoice: "auto",
      });
      for await (const part of stream.fullStream) {
        if (part.type === "text-delta") {
          write(part.textDelta); // Display on screen
          continue;
        }
        if (part.type === "tool-call") {
          console.log(
            `%c[tool-call:${part.toolName}] ${truncate(part.args)}`,
            "color: blue"
          );
          // @ts-ignore no-type
        } else if (part.type === "tool-result") {
          const toolResult = part as ToolResult<string, any, any>;
          console.log(
            `%c[tool-result:${toolResult.toolName}]\n${truncate(
              toolResult.result
            )}`,
            "color: green"
          );
        } else if (debug) {
          console.log(
            `%c[debug:${part.type}] ${truncate(part, 512)}`,
            "color: gray;"
          );
        }
      }
      const response = await stream.response;
      messages.push(...response.messages);
      await writeMessages?.();
      write("\n\n");
    }
    // Next input
    const nextInput = prompt(">");
    if (!nextInput || nextInput.trim() === "") {
      Deno.exit(0);
    }
    messages.push({ role: "user", content: nextInput });
  }
}

Discussion

たろきち

The most helpful article I've read so far!!
Until now it's been nothing but clickbait ("AI will do the coding, so coders aren't needed") or misinformation like "Gemini 2.5 Pro is the strongest at coding."
Clickbait aside, with the other articles I have to wonder whether the authors have ever actually generated anything. If you want to measure raw capability with something simple, try having it generate, say, a regular expression that enumerates C++ function names; that shows you what it can really do.
...That turned into a rant, but thank you for a genuinely useful article!!