🖐️
[Playwright × AI] Trying out Stagehand, which automates the browser from natural language

hosaka313
Published on 2024/12/16
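 🎭 Introduction 🤖
When I write Playwright code, I use AI in the following way.

Define a Page object in a file such as hoge.page.ts that collects methods returning element locators (selectors). (Human.)

Define a Step object in a file such as hoge.step.ts that collects the actions on those elements. (Mostly AI; it almost never gets this wrong.)

Write test cases in hoge.spec.ts that combine the two. (AI, then touched up by hand.)

Partly because Playwright's API is so well designed, the AI almost never gets the step implementation wrong. If it could also locate elements reliably and correctly, it starts to feel possible to instruct the AI in natural language and have it write at least throwaway E2E tests.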
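The Page/Step split described above can be sketched as follows. This is a minimal illustration rather than code from a real project: LoginPage, LoginSteps, and the selectors are hypothetical, and LocatorLike/PageLike are small stand-ins for Playwright's Locator and Page so that the sketch runs without a browser.

```typescript
// Minimal stand-ins for the parts of Playwright's Page/Locator used here
// (assumption: a real project would import the types from @playwright/test).
interface LocatorLike {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
}
interface PageLike {
  locator(selector: string): LocatorLike;
}

// hoge.page.ts: methods that only return locators (written by a human).
class LoginPage {
  private readonly page: PageLike;
  constructor(page: PageLike) {
    this.page = page;
  }
  emailInput(): LocatorLike {
    return this.page.locator("#email"); // hypothetical selector
  }
  passwordInput(): LocatorLike {
    return this.page.locator("#password"); // hypothetical selector
  }
  submitButton(): LocatorLike {
    return this.page.locator("button[type=submit]"); // hypothetical selector
  }
}

// hoge.step.ts: actions on those elements (the part AI writes well).
class LoginSteps {
  private readonly loginPage: LoginPage;
  constructor(loginPage: LoginPage) {
    this.loginPage = loginPage;
  }
  async login(email: string, password: string): Promise<void> {
    await this.loginPage.emailInput().fill(email);
    await this.loginPage.passwordInput().fill(password);
    await this.loginPage.submitButton().click();
  }
}

// A fake page that records calls, so the steps can be exercised offline.
function createRecordingPage(calls: string[]): PageLike {
  return {
    locator: (selector: string): LocatorLike => ({
      fill: async (value: string): Promise<void> => {
        calls.push(`fill ${selector} ${value}`);
      },
      click: async (): Promise<void> => {
        calls.push(`click ${selector}`);
      },
    }),
  };
}
```

In a spec file, the same LoginSteps would be constructed from Playwright's real page instead of the recording fake.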
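Naturally, plenty of people have had the same idea, and a quick search turned up several approaches to driving Playwright with natural language.
Of these, I tried Stagehand, which looked relatively polished.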

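 Stagehand
https://github.com/browserbase/stagehand
It is currently open source and can be used for free, though paid plans also exist.
https://www.browserbase.com/#pricing
The README states that "Stagehand is currently available as an early release", so it is not yet stable.
It supports OpenAI's gpt-4o/o1 and Claude 3.5 Sonnet as the LLM; this article uses OpenAI.
Incidentally, "stagehand" means roughly a stage crew member or backstage worker. Since a "playwright" is a dramatist, the name is presumably a deliberate nod.
Note: I have no affiliation with browserbase.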

 🌳 Environment
MacBook Air M3
Node v20.18.1
npm 10.8.2

 ⚙️ Setup
Set up as described in the README.

 Create the project
$ mkdir stagehand-test
$ cd stagehand-test
$ npm init
$ npm install @browserbasehq/stagehand zod

 Add .env
OPENAI_API_KEY=sk-xxxxxxx...

 Install the browsers (if not already installed)
$ npm exec playwright install

 Install tsx
This is for running .ts files. ts-node or any other runner you prefer works too.
$ npm install -D tsx

 Add the code
First, the README code as-is.
Following an init => act => extract flow, it tries to fetch the top contributor of the stagehand repository.
https://github.com/browserbase/stagehand/graphs/contributors
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({
  env: "LOCAL",
});

await stagehand.init();
await stagehand.page.goto("https://github.com/browserbase/stagehand");
await stagehand.act({ action: "click on the contributors" });
const contributor = await stagehand.extract({
  instruction: "extract the top contributor",
  schema: z.object({
    username: z.string(),
    url: z.string(),
  }),
});
await stagehand.close();
console.log(`Our favorite contributor is ${contributor.username}`);
Because the code contains top-level await, the project must be ESM rather than CommonJS, so add the following to package.json.
+  "type": "module",
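Stagehand's API is very simple, with only a handful of methods such as act, extract, and observe. This is because it has the LLM choose the Playwright method internally.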


 Running the sample code & checking the logs
 Run the sample code
$ npx tsx src/index.ts
This produces logs like the following.
...[omitted]
2024-12-15T22:27:22.270Z::[stagehand:openai] creating chat completion {"openAiOptions":{"value":"{\"messages\":[{\"role\":\"system\",\"content\":\"You are an AI assistant tasked with evaluating the progress and completion status of an extraction task.\\nAnalyze the extraction response and determine if the task is completed or if more information is needed.\\n\\nStrictly abide by the following criteria:\\n1. Once the instruction has been satisfied by the current extraction response, ALWAYS set completion status to true and stop processing, regardless of remaining chunks.\\n2. Only set completion status to false if BOTH of these conditions are true:\\n   - The instruction has not been satisfied yet\\n   - There are still chunks left to process (chunksTotal > chunksSeen)\"},{\"role\":\"user\",\"content\":\"Instruction: extract the top contributor\\nExtracted content: {\\n  \\\"username\\\": \\\"jeremypress\\\",\\n  \\\"url\\\": \\\"/jeremypress\\\"\\n}\\nchunksSeen: 0\\nchunksTotal: 6\"}],\"temperature\":0.1,\"top_p\":1,\"frequency_penalty\":0,\"presence_penalty\":0,\"model\":\"gpt-4o\"}","type":"object"}}
2024-12-15T22:27:22.984Z::[stagehand:openai] response {"response":{"value":"{\"id\":\"chatcmpl-AerN4j3bV6VvnRh2BpGqKXMKAO6ZZ\",\"object\":\"chat.completion\",\"created\":1734301642,\"model\":\"gpt-4o-2024-08-06\",\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\"{\\\"progress\\\":\\\"The top contributor has been extracted: username 'jeremypress'.\\\",\\\"completed\\\":true}\",\"refusal\":null},\"logprobs\":null,\"finish_reason\":\"stop\"}],\"usage\":{\"prompt_tokens\":262,\"completion_tokens\":21,\"total_tokens\":283,\"prompt_tokens_details\":{\"cached_tokens\":0,\"audio_tokens\":0},\"completion_tokens_details\":{\"reasoning_tokens\":0,\"audio_tokens\":0,\"accepted_prediction_tokens\":0,\"rejected_prediction_tokens\":0}},\"system_fingerprint\":\"fp_9faba9f038\"}","type":"object"},"requestId":{"value":"gyeye09i5wb","type":"string"}}
2024-12-15T22:27:22.984Z::[stagehand:extraction] received extraction response {"extraction_response":{"value":"{\"username\":\"jeremypress\",\"url\":\"/jeremypress\",\"metadata\":{\"progress\":\"The top contributor has been extracted: username 'jeremypress'.\",\"completed\":true}}","type":"object"}}
2024-12-15T22:27:22.984Z::[stagehand:extraction] got response {"extraction_response":{"value":"{\"username\":\"jeremypress\",\"url\":\"/jeremypress\",\"metadata\":{\"progress\":\"The top contributor has been extracted: username 'jeremypress'.\",\"completed\":true}}","type":"object"}}
Our favorite contributor is jeremypress
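The value was retrieved correctly from the contributors page.

However, over several runs it also occasionally returned the wrong answer, "Our favorite contributor is browserbase".
The cost of this sample was under $0.06 per run.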
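For a rough feel of where the cost comes from, the usage field in the logged OpenAI responses can be multiplied by per-token prices. The sketch below is only an illustration: estimateCostUsd is a hypothetical helper, and the prices are placeholder values supplied by the caller, so check the provider's current pricing before relying on any number.

```typescript
// Shape of the `usage` object seen in the logged OpenAI responses.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Prices in USD per 1M tokens. These are caller-supplied assumptions,
// not values taken from Stagehand or measured in this article.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Sum the cost of all LLM calls made during one run.
function estimateCostUsd(usages: Usage[], pricing: Pricing): number {
  return usages.reduce(
    (total, u) =>
      total +
      (u.prompt_tokens * pricing.inputPerMillion +
        u.completion_tokens * pricing.outputPerMillion) /
        1_000_000,
    0,
  );
}

// One call from the log above (262 prompt tokens, 21 completion tokens),
// priced with hypothetical rates.
const exampleCost = estimateCostUsd(
  [{ prompt_tokens: 262, completion_tokens: 21 }],
  { inputPerMillion: 2.5, outputPerMillion: 10 },
);
console.log(exampleCost.toFixed(6));
```

A full run makes many such calls, most of them with far larger prompts because the DOM chunks are included, which is why the total is much higher than a single small call.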


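 Checking the logs
The logs fall broadly into the following groups.

 Exchanges with OpenAI ([stagehand:openai])
JSON containing the model's response, token usage, the selected tool, finish_reason, and so on.

 Stagehand action logs ([stagehand:action])
Logs about the actions performed (clicks, URL navigation, etc.), such as the URL before and after clicking an element and information about the clicked element.

 DOM analysis and extraction logs ([stagehand:extraction] / [stagehand:extract])
Logs about retrieving DOM elements: the list of DOM elements (0:, 1:, 2:, ...), their text content, and notes on the extracted information.
The logs also suggest that Stagehand is a kind of tool agent: it gives the LLM tools and has it work toward a goal.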
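When reading a long run, it helps to split each line into its parts. The following is a small sketch that assumes the line shape seen in the logs above (timestamp, then ::[stagehand:category], then a message and a JSON payload). The format is inferred from the output, not from any documented contract, and parseLogLine and countByCategory are hypothetical helpers.

```typescript
// One parsed Stagehand log line.
interface LogEntry {
  timestamp: string;
  category: string; // e.g. "openai", "extraction"
  message: string; // text between the category and the JSON payload
  payload: unknown; // parsed JSON payload, or undefined if absent or unparseable
}

// Assumed line shape, inferred from the logs above:
//   <ISO timestamp>::[stagehand:<category>] <message> <JSON payload>
const LOG_LINE = /^(\S+)::\[stagehand:([^\]]+)\]\s+(.*)$/;

function parseLogLine(line: string): LogEntry | undefined {
  const match = LOG_LINE.exec(line);
  if (!match) return undefined;
  const timestamp = match[1];
  const category = match[2];
  const rest = match[3];
  const jsonStart = rest.indexOf("{");
  if (jsonStart === -1) {
    return { timestamp, category, message: rest.trim(), payload: undefined };
  }
  let payload: unknown;
  try {
    payload = JSON.parse(rest.slice(jsonStart));
  } catch {
    payload = undefined; // keep the entry even if the payload is truncated
  }
  return { timestamp, category, message: rest.slice(0, jsonStart).trim(), payload };
}

// Count log lines per category, skipping lines that are not Stagehand logs.
function countByCategory(lines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of lines) {
    const entry = parseLogLine(line);
    if (entry) counts[entry.category] = (counts[entry.category] ?? 0) + 1;
  }
  return counts;
}
```

Note that the payload's value fields are themselves JSON strings, so reading something like token usage takes a second JSON.parse on that inner string.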

 Peeking at the prompts
Prompts are the heart of an LLM application, so I looked at them to understand what the LLM is being asked to do.
https://github.com/browserbase/stagehand/blob/473ca3fe7a8e97cf4337ddfa43aba8f0fdf93412/lib/prompt.ts
In brief:

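 actSystemPrompt
The system prompt for performing Playwright actions based on the goal the user wants to achieve.
It is given the user's goal, the steps taken so far, and the current list of DOM elements.

It must use one of two tools, doAction or skipSection.
It sets completed to true when it judges that the goal has been achieved.
As described later, doAction returns the Playwright method to use.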

 verifyActCompletionSystemPrompt
Judges whether the goal has been completed, based on the user's goal, the list of steps, and a screenshot.

 Tools
The tool definitions for doAction and skipSection are below. doAction returns the Playwright method to execute.
The skipSection tool exists because Stagehand chunks the page in order to keep the context small.
export const actTools: Array<OpenAI.ChatCompletionTool> = [
  {
    type: "function",
    function: {
      name: "doAction",
      description:
        "execute the next playwright step that directly accomplishes the goal",
      parameters: {
        type: "object",
        required: ["method", "element", "args", "step", "completed"],
        properties: {
          method: {
            type: "string",
            description: "The playwright function to call.",
          },
          element: {
            type: "number",
            description: "The element number to act on",
          },
          args: {
            type: "array",
            description: "The required arguments",
            items: {
              type: "string",
              description: "The argument to pass to the function",
            },
          },
          step: {
            type: "string",
            description:
              "human readable description of the step that is taken in the past tense. Please be very detailed.",
          },
          why: {
            type: "string",
            description:
              "why is this step taken? how does it advance the goal?",
          },
          completed: {
            type: "boolean",
            description:
              "true if the goal should be accomplished after this step",
          },
        },
      },
    },
  },
  {
    type: "function",
    function: {
      name: "skipSection",
      description:
        "skips this area of the webpage because the current goal cannot be accomplished here",
      parameters: {
        type: "object",
        properties: {
          reason: {
            type: "string",
            description: "reason that no action is taken",
          },
        },
      },
    },
  },
];
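To make the doAction contract concrete, here is a minimal sketch of validating the tool-call arguments and dispatching them to an element. It is not Stagehand's code: DoActionArgs mirrors the schema above, while ElementLike, parseDoActionArgs, and runDoAction are hypothetical names, and only a small allow-list of methods is handled.

```typescript
// Mirrors the doAction parameters in the tool definition above.
interface DoActionArgs {
  method: string;
  element: number;
  args: string[];
  step: string;
  why?: string;
  completed: boolean;
}

// Hypothetical stand-in for a Playwright Locator (three methods only).
interface ElementLike {
  click(): Promise<void>;
  fill(value: string): Promise<void>;
  press(key: string): Promise<void>;
}

// Parse the JSON string found in tool_calls[].function.arguments
// and check the fields listed as required in the schema.
function parseDoActionArgs(json: string): DoActionArgs {
  const raw = JSON.parse(json) as Partial<DoActionArgs>;
  if (
    typeof raw.method !== "string" ||
    typeof raw.element !== "number" ||
    !Array.isArray(raw.args) ||
    typeof raw.step !== "string" ||
    typeof raw.completed !== "boolean"
  ) {
    throw new Error("doAction arguments do not match the schema");
  }
  return raw as DoActionArgs;
}

// Dispatch to an allow-listed method rather than calling arbitrary names
// chosen by the LLM.
async function runDoAction(
  action: DoActionArgs,
  elements: Map<number, ElementLike>,
): Promise<void> {
  const target = elements.get(action.element);
  if (!target) throw new Error(`unknown element ${action.element}`);
  switch (action.method) {
    case "click":
      return target.click();
    case "fill":
      return target.fill(action.args[0] ?? "");
    case "press":
      return target.press(action.args[0] ?? "");
    default:
      throw new Error(`unsupported method ${action.method}`);
  }
}
```

The element number refers to the numbered DOM element list that is sent to the LLM, which is why the sketch looks the target up in a map keyed by number.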

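 extract
The prompt for extracting information (text and DOM elements) from a web page.

 refine
The prompt for tidying up extracted content.
It builds a prompt that compares the existing results (previously extracted) with the new ones (newly extracted), removes duplicates, updates or adds information, and produces the final cleaned-up data.

 metadata
The prompt for judging whether the extraction task is complete.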
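The completion rule that this prompt asks the LLM to apply (it is quoted in the [stagehand:openai] log earlier) can be restated as a plain predicate. This only rewrites the prompt's criteria as code; in Stagehand the judgment is made by the LLM, and isExtractionComplete is a hypothetical name.

```typescript
interface ChunkProgress {
  chunksSeen: number;
  chunksTotal: number;
}

// Per the prompt: completion is false only when BOTH hold:
//   - the instruction has not been satisfied yet
//   - there are still chunks left (chunksTotal > chunksSeen)
function isExtractionComplete(
  instructionSatisfied: boolean,
  progress: ChunkProgress,
): boolean {
  const chunksLeft = progress.chunksTotal > progress.chunksSeen;
  return !(!instructionSatisfied && chunksLeft);
}

// The situation in the log: satisfied while chunksSeen is 0 of 6.
console.log(isExtractionComplete(true, { chunksSeen: 0, chunksTotal: 6 }));
```

In other words, extraction stops as soon as the instruction is satisfied, or when the chunks run out, whichever comes first.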

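 observe
The prompt for picking the elements that match a given condition out of the DOM candidate list and returning them as an array.

 ask
The prompt for answering the user's question briefly and simply.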

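 Trying it out
Let's try something other than the sample.

 Scenario to test
Open My Page from the PTNA top page.
Log in.
Get the logged-in user's name.
https://www.piano.or.jp/
The login name can be read from the header after logging in.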



My name is not Kuroda (黒田), by the way. I picked an arbitrary name for the test account.
The code looks like this.
The <REDACTED> parts hold the real values.
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({
  env: "LOCAL",
});

await stagehand.init();
await stagehand.page.goto("https://www.piano.or.jp/");
// "Open My Page"
await stagehand.act({ action: "マイページを開く" });

// "Enter %email% as the email address and %password% as the password."
await stagehand.act({
  action: "メールアドレスに %email% 、パスワードに %password%を 入力する。",
  variables: {
    email: "<REDACTED>",
    password: "<REDACTED>",
  }
});

// "Click the login button"
await stagehand.act({
  action: "ログインボタンをクリックする",
});

// "Get the login name"
const { loginName } = await stagehand.extract({
  instruction: "ログイン名を取得する",
  schema: z.object({
    loginName: z.string(),
  }),
});

console.log(loginName);
Run it.
$ npx tsx src/index.ts
After about 30 seconds of exploring, it logged in, and
2024-12-15T21:33:44.541Z::[stagehand:extraction] got response {"extraction_response":{"value":"{\"loginName\":\"黒田 テスト\",\"metadata\":{\"progress\":\"ログイン名 \\\"黒田 テスト\\\" が抽出されました。\",\"completed\":true}}","type":"object"}}
黒田 テスト
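the name was retrieved correctly. My impressions:
It is slow, partly because it scrolls the page while searching for elements.
Stability is shaky. It would be hard to use it as a replacement for automated E2E tests that run over and over.
For example, when I renamed the email variable to username in the code above, the run failed.

Seen as a substitute for manual testing by a user who knows nothing about the site, it may be useful. If it returns results reliably, the site is probably reasonably easy for humans to understand as well.
In short, it works but is still experimental. Gemini 2.0 Flash is strong in both speed and image recognition, so things might become a little more usable with Gemini.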

 Digging deeper
Let's read a bit more of the code.

 The act method
I looked at the act method, which is the linchpin of the whole thing.
https://github.com/browserbase/stagehand/blob/473ca3fe7a8e97cf4337ddfa43aba8f0fdf93412/lib/inference.ts#L95C1-L152C2
The code uses function calling and calls itself recursively when there is a tool call.
It is in turn called by the higher-level actHandler.
https://github.com/browserbase/stagehand/blob/473ca3fe7a8e97cf4337ddfa43aba8f0fdf93412/lib/handlers/actHandler.ts#L963-L1473
This is a fairly large function that includes caching and chunk handling, so only the main logic is covered here.
The doAction tool returns a Playwright method, which is passed to _performPlaywrightMethod.
        await this._performPlaywrightMethod(
          method,
          args,
          xpaths[0],
          domSettleTimeoutMs,
        );
The result is verified by an LLM that is given a separate prompt.
        const actionCompleted = await this._verifyActionCompletion({
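Putting these pieces together, the main flow can be sketched roughly as below. This is a heavily simplified reading of the description above and not Stagehand's actual actHandler: caching, retries, and vision are left out, restarting from the first chunk after an action is an assumption, and every name (Decision, ActDeps, actOverChunks) is hypothetical.

```typescript
// What the LLM is assumed to return for one chunk of the page.
type Decision =
  | { tool: "skipSection"; reason: string }
  | { tool: "doAction"; method: string; element: number; args: string[]; completed: boolean };

// Injected dependencies, so the loop can be exercised without an LLM or a browser.
interface ActDeps {
  chunkCount: number;
  decide(goal: string, chunk: number, steps: string[]): Promise<Decision>;
  perform(method: string, element: number, args: string[]): Promise<void>;
  verify(goal: string, steps: string[]): Promise<boolean>;
}

async function actOverChunks(
  goal: string,
  deps: ActDeps,
  maxSteps: number = 10,
): Promise<boolean> {
  const steps: string[] = [];
  let chunk = 0;
  while (steps.length < maxSteps && chunk < deps.chunkCount) {
    const decision = await deps.decide(goal, chunk, steps);
    if (decision.tool === "skipSection") {
      chunk += 1; // nothing useful in this chunk, so look at the next one
      continue;
    }
    await deps.perform(decision.method, decision.element, decision.args);
    steps.push(`${decision.method} on element ${decision.element}`);
    // The LLM's own "completed" flag is double-checked by a separate verifier.
    if (decision.completed && (await deps.verify(goal, steps))) return true;
    chunk = 0; // assumption: the page may have changed, so start from the top
  }
  return false; // ran out of chunks or steps without completing the goal
}
```

The loop also hints at why runs are slow: every skipped chunk and every verification is a separate LLM round trip.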

 About models
https://github.com/browserbase/stagehand/blob/473ca3fe7a8e97cf4337ddfa43aba8f0fdf93412/lib/llm/LLMProvider.ts
At the time of writing, the supported models are as follows; Gemini is not supported.
  private modelToProviderMap: { [key in AvailableModel]: ModelProvider } = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "gpt-4o-2024-08-06": "openai",
    "o1-mini": "openai",
    "o1-preview": "openai",
    "claude-3-5-sonnet-latest": "anthropic",
    "claude-3-5-sonnet-20240620": "anthropic",
    "claude-3-5-sonnet-20241022": "anthropic",
  };

 cache
Caching can be enabled by setting enableCaching to true in init(). When I tried it, the following files
tmp/.cache/action_cache.json
tmp/.cache/llm_calls.json
were generated.
{
  "2e35710e9ed9ff37dad921a339e72db4c422f64937b90f15dbafe52e0ccc4c38": {
    "data": {
      "id": "chatcmpl-AeqcnukK903dPRbqv18j0fyzP6YgI",
      "object": "chat.completion",
      "created": 1734298773,
      "model": "gpt-4o-2024-08-06",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": null,
            "tool_calls": [
              {
                "id": "call_Qxhb4vefxyzOAs17tRkFcseK",
                "type": "function",
                "function": {
                  "name": "skipSection",
                  "arguments": "{\"reason\":\"There are no input fields for email or password in the current DOM elements.\"}"
                }
              }
            ],
            "refusal": null
          },
          "logprobs": null,
          "finish_reason": "tool_calls"
        }
      ],...
}
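The 64-character hexadecimal key in llm_calls.json looks like a SHA-256 digest. Assuming the key is a hash of the request options (an assumption; exactly what Stagehand hashes was not confirmed here), an in-memory version of such a cache can be sketched as follows. stableStringify, cacheKey, and LlmCallCache are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so that logically equal requests
// produce the same string regardless of property order.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  // JSON.stringify returns undefined for undefined input; fall back to "null".
  return JSON.stringify(value) ?? "null";
}

// Assumed key derivation: SHA-256 over the serialized request.
function cacheKey(request: unknown): string {
  return createHash("sha256").update(stableStringify(request)).digest("hex");
}

class LlmCallCache<T> {
  private readonly entries = new Map<string, T>();

  // Return the cached response, or call `fetcher` once and remember the result.
  async getOrCall(request: unknown, fetcher: () => Promise<T>): Promise<T> {
    const key = cacheKey(request);
    const hit = this.entries.get(key);
    if (hit !== undefined) return hit;
    const fresh = await fetcher();
    this.entries.set(key, fresh);
    return fresh;
  }
}
```

Because the key depends on the whole request, any change in the prompt (including a different DOM chunk) is a cache miss, so a cache like this mainly helps when the same page is replayed unchanged.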

 useVision option
Available on the act and observe methods. One of true, false, or fallback.

Controls whether a screenshot is passed to the model for image recognition.

 Variables
Values can be passed in using placeholders.
await stagehand.act({
  action: "enter %username% into the username field",
  variables: {
    username: "john.doe@example.com",
  },
});
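Conceptually, each %name% placeholder is replaced with the matching entry in variables. A minimal sketch of that substitution, assuming simple %name% tokens (fillVariables is a hypothetical helper, not Stagehand's implementation, and when Stagehand performs the substitution is not shown here):

```typescript
// Replace every %name% token with variables[name].
// Unknown placeholders are left untouched so that mistakes stay visible.
function fillVariables(
  template: string,
  variables: Record<string, string>,
): string {
  return template.replace(/%(\w+)%/g, (token: string, name: string) =>
    Object.prototype.hasOwnProperty.call(variables, name)
      ? variables[name]
      : token,
  );
}

console.log(
  fillVariables("enter %username% into the username field", {
    username: "john.doe@example.com",
  }),
);
```

Leaving unknown tokens as-is makes a misspelled variable name easy to spot in the resulting instruction.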

 extract
Retrieves data from the page. The result can be validated with Zod.
const price = await stagehand.extract({
  instruction: "extract the price of the item",
  schema: z.object({
    price: z.number(),
  }),
});

 observe
Returns the actions that are possible on the current page.

Taking the Google search page as an example:
await stagehand.init();
await stagehand.page.goto("https://www.google.co.jp");
await stagehand.observe({ instruction: "可能なActionを日本語で具体的に羅列して" })
2024-12-15T21:50:35.275Z::[stagehand:observation] found elements {"elements":{"value":"[{\"description\":\"Googleについてリンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[1]/a[1]\"},{\"description\":\"ストアリンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[1]/a[2]\"},{\"description\":\"Gmailリンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/a[1]\"},{\"description\":\"画像リンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/a[1]\"},{\"description\":\"ログインリンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[2]/a[1]\"},{\"description\":\"Google 検索ボタンをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[3]/form[1]/div[1]/div[1]/div[3]/center[1]/input[1]\"},{\"description\":\"I'm Feeling Luckyボタンをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[3]/form[1]/div[1]/div[1]/div[3]/center[1]/input[2]\"},{\"description\":\"ホリデーセール特価を見るリンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[4]/div[1]/div[1]/div[1]/div[1]/div[3]/div[1]/promo-middle-slot[1]/div[1]/a[1]\"},{\"description\":\"Englishリンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[4]/div[3]/div[1]/a[1]\"},{\"description\":\"広告リンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[6]/div[2]/div[1]/a[1]\"},{\"description\":\"ビジネスリンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[6]/div[2]/div[1]/a[2]\"},{\"description\":\"検索の仕組みリンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[6]/div[2]/div[1]/a[3]\"},{\"description\":\"プライバシーリンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[6]/div[2]/div[2]/a[1]\"},{\"description\":\"規約リンクをクリックする\",\"selector\":\"xpath=/html/body[1]/div[1]/div[6]/div[2]/div[2]/a[2]\"},{\"description\":\"設定を開く\",\"selector\":\"xpath=/html/body[1]/div[1]/div[6]/div[2]/div[2]/span[1]/span[1]/g-popup[1]/div[1]\"}]","type":"object"}}
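The found elements payload is an array of objects with a description and a selector, so picking a candidate is ordinary array handling. A small sketch assuming that shape (ObservedElement and findByDescription are hypothetical names):

```typescript
interface ObservedElement {
  description: string;
  selector: string; // e.g. "xpath=/html/body[1]/..."
}

// Return the first observed element whose description contains the keyword.
function findByDescription(
  elements: ObservedElement[],
  keyword: string,
): ObservedElement | undefined {
  return elements.find((el) => el.description.includes(keyword));
}

// Two entries taken from the log above.
const observed: ObservedElement[] = [
  {
    description: "Googleについてリンクをクリックする",
    selector: "xpath=/html/body[1]/div[1]/div[1]/a[1]",
  },
  {
    description: "ストアリンクをクリックする",
    selector: "xpath=/html/body[1]/div[1]/div[1]/a[2]",
  },
];

const storeLink = findByDescription(observed, "ストア");
console.log(storeLink?.selector);
```

The selector string keeps its xpath= prefix, which Playwright's page.locator() accepts, so a chosen entry could be passed to it directly.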

 Tips
The README has a tips section. Nothing exotic; the gist is to keep steps small and specific.
https://github.com/browserbase/stagehand/tree/main?tab=readme-ov-file#prompting-tips

 Langchain
Langchain.js provides a tool for Stagehand.
https://js.langchain.com/docs/integrations/tools/stagehand/
If you want to do something more sophisticated than browser operation, you could drive it from a higher-level agent.

 Other Playwright × AI efforts
Finally, here are some other libraries built around the same idea of driving Playwright with natural language.

 fuji-web
A Chrome extension. It does not appear to use Playwright, but Stagehand's README mentions it as an inspiration.
https://github.com/normal-computing/fuji-web
From a quick skim, it uses techniques such as passing in past actions and annotated screenshots to improve accuracy.



https://www.normalcomputing.com/blog-posts/introducing-fuji-web

 Zerostep
Searching for ways to drive Playwright with natural language turns up @zerostep/playwright.
https://zerostep.com/
The code is below; an ai function takes the place of Stagehand's action.
import { test, expect } from '@playwright/test'
import { ai } from '@zerostep/playwright'

test.describe('GitHub', () => {
  test('verify the number of labels in a repo', async ({ page }) => {
    await page.goto('https://github.com/zerostep-ai/zerostep')
    await ai(`Click on the Issues tabs`, { page, test })

    await page.waitForURL('https://github.com/zerostep-ai/zerostep/issues')
    await ai('Click on Labels', { page, test })

    await page.waitForURL('https://github.com/zerostep-ai/zerostep/labels')
    const numLabels = await ai('How many labels are listed?', { page, test })

    expect(parseInt(numLabels)).toEqual(9)
  })
})
However, the free plan limits the ai function to 500 calls per month.
https://github.com/zerostep-ai/zerostep

 Auto Playwright
This is built on a similar idea; using the auto function, the code looks like this.
import { test, expect } from "@playwright/test";
import { auto } from "auto-playwright";

test("auto Playwright example", async ({ page }) => {
  await page.goto("/");

  // `auto` can query data
  // In this case, the result is plain-text contents of the header
  const headerText = await auto("get the header text", { page, test });

  // `auto` can perform actions
  // In this case, auto will find and fill in the search text input
  await auto(`Type "${headerText}" in the search box`, { page, test });

  // `auto` can assert the state of the website
  // In this case, the result is a boolean outcome
  const searchInputHasHeaderText = await auto(`Is the contents of the search box equal to "${headerText}"?`, { page, test });

  expect(searchInputHasHeaderText).toBe(true);
});
https://github.com/lucgagan/auto-playwright

 Shortest
Supports Claude.
import { shortest } from '@antiwork/shortest'

shortest('Login to the app using email and password', { username: process.env.GITHUB_USERNAME, password: process.env.GITHUB_PASSWORD })
https://github.com/anti-work/shortest

 Browser-use
Provides a tool for Langchain agents. Because it has the LLM work out even the execution steps toward a goal, it could be called a higher-level agent than Stagehand.

https://github.com/browser-use/browser-use

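 Conclusion
I tried Stagehand as a tool that aims to drive Playwright with natural language.
Its implementation of ideas such as caching and using screenshots as a fallback was instructive. In practice, though, even gpt-4o took a long time just to log in and was not stable, so the models still feel underpowered.
Even so, with some ingenuity it is ready today for one-off use cases, for example running Stagehand against your own website and fixing whatever it gets stuck on.
Many challenges remain, but it is exciting to glimpse a future where tests and specs can be written in natural language that the business side also understands. With 2025 all but certain to be the year of agents, I am leaving this article as a record of a repository worth watching.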
Publication: The Piano Teachers' National Association of Japan (PTNA)