🌟

Browser Useの内部実装の解説Part1

2024/12/31に公開

OpenAI

Browser Useとは

Browser UseとはAIエージェントがブラウザを直接操作して目的の操作や情報の取得を全自動で行ってくれるツールで、OSSとして公開されています。Part1では実際にBrowser Useを動かしながら、コードの一部を覗いていこうと思います。好評でしたらPart2も書こうかなくらいの気持ちです。

Quick Start

では早速使ってみましょう。まずはpythonの環境のセットアップからです。私はpyenvを使用しています。pyenvが未インストールの方は別途検索してpyenvのインストールだけ完了させておいてください。
まず2024/12/30日時点では、pythonのバージョンが

requires-python = ">=3.11"

である必要があるそうです。私は、とりあえずpython3.11.8にしました。以下でインストールできます。

pyenv install 3.11.8

インストールが完了したら、現在のpython環境をインストールしたものに変更する必要があるので、

pyenv global 3.11.8 # pyenv local 3.11.8の場合は現在のディレクトリ以下のみで変更。どちらでも良いです。

次に、demo.pyを作成してください。
中身は

from langchain_openai import ChatOpenAI
from browser_use import Agent
import asyncio

async def main():
    agent = Agent(
        task="Find a one-way flight from Bali to Oman on 12 January 2025 on Google Flights. Return me the cheapest option.",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()
    print(result)

asyncio.run(main())

次に.envを作成しましょう。
.envファイルには環境変数を指定します。

OPENAI_API_KEY=ここにapikeyを書いてください。

# Set to false to disable anonymized telemetry
ANONYMIZED_TELEMETRY=true

# LogLevel: Set to debug to enable verbose logging, set to result to get results only. Available: result | debug | info
BROWSER_USE_LOGGING_LEVEL=info

あとは実行するだけです。

python demo.py

すると何やら画面が開き、ブラウザが勝手に動き、しばらくすると閉じると思います。同時にターミナルの出力を見ると

📍 Step 14
INFO     [agent] 👍 Eval: Success - Results for flights from Bali to Oman are visible.
INFO     [agent] 🧠 Memory: Found the cheapest flight option from Bali to Oman on 12 January 2025.
INFO     [agent] 🎯 Next goal: Identify and confirm the cheapest flight option from the results.
INFO     [agent] 🛠️  Action 1/1: {"done":{"text":"The cheapest one-way flight from Bali to Oman on 12 January 2025 is to Muscat, costing ¥63,822 with a duration of 44 hours and 55 minutes and one stop."}}
INFO     [agent] 📄 Result: The cheapest one-way flight from Bali to Oman on 12 January 2025 is to Muscat, costing ¥63,822 with a duration of 44 hours and 55 minutes and one stop.
INFO     [agent] ✅ Task completed successfully
INFO     [agent] Created history GIF at agent_history.gif

のようになっていました。どうやら14ステップで操作が完了して、

Result: The cheapest one-way flight from Bali to Oman on 12 January 2025 is to Muscat, costing ¥63,822 with a duration of 44 hours and 55 minutes and one stop.

これが結果になっているらしいです。今回目的は、Find a one-way flight from Bali to Oman on 12 January 2025 on Google Flights. Return me the cheapest option.とdemo.pyに書いていましたので、BaliからOmanまでのフライトについてきちんと調べられていてすごいですね。それでは、具体的な実装を一つずつ覗いていきましょう。

Agent

demo.pyの最初に出てくるAgent

from browser_use import Agent

ここですね。中身は、、、900行くらいある。。。全部はここでは見れないですね🥺
一旦__init__でもみてみますか。。。

def __init__(
		self,
		task: str,
		llm: BaseChatModel,
		browser: Browser | None = None,
		browser_context: BrowserContext | None = None,
		controller: Controller = Controller(),
		use_vision: bool = True,
		save_conversation_path: Optional[str] = None,
		max_failures: int = 5,
		retry_delay: int = 10,
		system_prompt_class: Type[SystemPrompt] = SystemPrompt,
		max_input_tokens: int = 128000,
		validate_output: bool = False,
		generate_gif: bool = True,
		include_attributes: list[str] = [
			'title',
			'type',
			'name',
			'role',
			'tabindex',
			'aria-label',
			'placeholder',
			'value',
			'alt',
			'aria-expanded',
		],
		max_error_length: int = 400,
		max_actions_per_step: int = 10,
	):

引数を見ていきましょう。
task：str型で、これはAgentに渡す目標ですね。
llm：BaseChatModel型で、これはLLMの抽象的なインターフェースだと思います。
demoではこの二つのみを指定していたので、残りはデフォルト値で読んでいきましょう。

self.browser = Browser()
self.browser_context = BrowserContext(browser=self.browser)

どうやらデフォルトでは、ブラウザとブラウザコンテキストを作成するようですね。

self.telemetry = ProductTelemetry()

これはなんですかね。謎です。

self._setup_action_models()

何かしらの初期化関数でしょうか？

self.message_manager = MessageManager(
			llm=self.llm,
			task=self.task,
			action_descriptions=self.controller.registry.get_prompt_description(),
			system_prompt_class=self.system_prompt_class,
			max_input_tokens=self.max_input_tokens,
			include_attributes=self.include_attributes,
			max_error_length=self.max_error_length,
			max_actions_per_step=self.max_actions_per_step,
		)

		# Tracking variables
		self.history: AgentHistoryList = AgentHistoryList(history=[])

ここら辺も何かインスタンスを作成していますね。
とまあ、いろんなクラスが出てきましたが、Agentというくらいなので、このクラスが中核になっていることは確かでしょう。一旦おいておきましょうか。次に

result = await agent.run()

これを実行しているので、run関数を見ていきます。

agent.run

async def run(self, max_steps: int = 100) -> AgentHistoryList:

返り値はAgentHistoryListですね。つまり、AgentHistoryListには、Agentの行動の結果が記録されていくのでしょうね。

self.telemetry.capture(
				AgentRunTelemetryEvent(
					agent_id=self.agent_id,
					task=self.task,
				)
			)

なぞ。

await self.step()

これが実行処理っぽい！

if self._too_many_failures():
					break

途中でやめる機能。

if self.history.is_done():
					if (
						self.validate_output and step < max_steps - 1
					):  # if last step, we dont need to validate
						if not await self._validate_output():
							continue

					logger.info('✅ Task completed successfully')
					break

validationして、うまくいっていたら終了。ダメならretryかな？

finally:
			self.telemetry.capture(
				AgentEndTelemetryEvent(
					agent_id=self.agent_id,
					task=self.task,
					success=self.history.is_done(),
					steps=len(self.history.history),
				)
			)
			if not self.injected_browser_context:
				await self.browser_context.close()

			if not self.injected_browser and self.browser:
				await self.browser.close()

			if self.generate_gif:
				self.create_history_gif()

終わった後、もしくはエラー後の処理。telemetryがまた呼び出されていますね。あとはインスタンスを閉じたりとかの処理だと思います。create_history_gif()これは知らない。

agent.step

state = await self.browser_context.get_state(use_vision=self.use_vision)

現在の状態を取得。状態がなんなのかはよく分からない。

self.message_manager.add_state_message(state, self._last_result, step_info)

謎。

input_messages = self.message_manager.get_messages()

これがLLMへの指示かな？

model_output = await self.get_next_action(input_messages)

ここで、指令を渡して実行しているっぽい！

agent.get_next_action

async def get_next_action(self, input_messages: list[BaseMessage]) -> AgentOutput:
		"""Get next action from LLM based on current state"""

		structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True)
		response: dict[str, Any] = await structured_llm.ainvoke(input_messages)

多分structured_llm.ainvoke(input_messages)が実行部分ですね。

structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True)

これはLangChainの関数らしく、要するにLLMに構造化データを吐き出すように設定を変更しているのですね。では、self.AgentOutputを見ればどんな構造なのかわかりそうですね。

Action model: {'$defs': {'ClickElementAction': {'properties': {'index': {'title': 'Index', 'type': 'integer'}, 'xpath': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Xpath'}}, 'required': ['index'], 'title': 'ClickElementAction', 'type': 'object'}, 'DoneAction': {'properties': {'text': {'title': 'Text', 'type': 'string'}}, 'required': ['text'], 'title': 'DoneAction', 'type': 'object'}, 'ExtractPageContentAction': {'properties': {'value': {'default': 'text', 'enum': ['text', 'markdown', 'html'], 'title': 'Value', 'type': 'string'}}, 'title': 'ExtractPageContentAction', 'type': 'object'}, 'GoToUrlAction': {'properties': {'url': {'title': 'Url', 'type': 'string'}}, 'required': ['url'], 'title': 'GoToUrlAction', 'type': 'object'}, 'InputTextAction': {'properties': {'index': {'title': 'Index', 'type': 'integer'}, 'text': {'title': 'Text', 'type': 'string'}, 'xpath': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Xpath'}}, 'required': ['index', 'text'], 'title': 'InputTextAction', 'type': 'object'}, 'OpenTabAction': {'properties': {'url': {'title': 'Url'

こんな感じですね。Actionを定義していて、例えばClickElementActionだとなんかしらの要素をクリッkする。DoneActionだと操作の完了？とかなんですかね？
要するにブラウザの操作のインターフェースになっているっぽいです。

result: list[ActionResult] = await self.controller.multi_act(
				model_output.action, self.browser_context
			)
			self._last_result = result

ここを見ると、llmとの会話で得られたactionをcontrollerにコンテキストと一緒に渡しているっぽいですね！

全体像を把握しよう

今回はAgentの一部を見ただけですが、私の理解では以下のような構造になっていそうだと思いました。

Controller：ブラウザ操作やアクション管理実行。人間の操作を抽象化したもの？
Browser：ブラウザの状態管理やUI部分。
Agent：以前は人間が考えていた部分。思考の統括者のような存在
といった感じでしょうか？このような視点で概要を理解していると、コードが読みやすくなる気がします。次回はもっと具体的にコードを読み進めていきます。