🕷️

Cloud Functions上でheadress-chromeをPythonで動かす

2021/12/25に公開

Google Cloud

はじめに

Cloud Functions上でheadless-chrome を Python で動かす方法を紹介します。
Cloud FunctionとDocker Desktopを導入してあるものとします。
今回書いたコードは、Gitにアップしてあります
- https://github.com/hayashikengo/python-selenium-gcf-template
Cloud Functionsへのデプロイは、Mac OSで実行しています。
crawler ディレクトリを作って、その中で作業しています。

背景

GCPのserverless環境で、クローラー（Python）を動かしたいという要件があった。
データクレンジング処理に使うライブラリは、Pythonが充実しているので、Pythonを使いたい。

インフラ構成

CloudSchedulerで Pub/Sub を使用して Cloud Functionsをトリガーし、クローラーを実行します。インフラ構築手順は、公式ドキュメントにまとまっているので、この記事では触れません。

手順（headress-chrome on Cloud Functions）

headress-chromeのインストール

Cloud Functions は、Puppeteerはサポートしてるみたいだが、headress-chromeはサポートし
ていないみたい。

なので今回は、有志の方が lamda用に headless-chrome の build image を公開してくれているものを使います。

Terminal
mkdir -p bin/
curl -SL https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-37/stable-headless-chromium-amazonlinux-2017-03.zip > headless-chromium.zip
unzip headless-chromium.zip -d bin/
rm headless-chromium.zip

chromedriverのインストール

serverless-chrome に対応している chromedriver 2.37 を使用します。

Terminal
curl -SL https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip > chromedriver.zip
unzip chromedriver.zip -d bin/
rm chromedriver.zip

.gcloudignoreの設定

GCP Functionsのアップロード上限が500MBなので、.gcloudignoreで不要なファイルはアップロードしないようにしています。

.gcloudignore

.git
.gitignore

README.md
__pycache__/
venv

lsd コマンドでファイルサイズを見てみます。headless-chromiumが97MBありますが、500MB以下なので許容範囲内。

Iterm
$ l bin
.rwxr-xr-x hayashikengo staff 7.5 MB Tue Mar 20 08:22:36 2018  chromedriver
.rwxr-xr-x hayashikengo staff  97 MB Thu Feb 15 19:07:40 2018  headless-chromium

functionの実装

GCF functionsでは、python3.9のイメージを使うので、python 3.9の環境をローカルで作ります。環境がない場合は、こちらからインストールしてください。

必要なライブラリをインストールして、requirements.txtに書き出します。requirements.txtに書き出されたライブラリはFunctionデプロイ時にインストールされます。

requirement.txt
fake-useragent==0.1.11
selenium==3.13.0

main.pyにデプロイする関数を実装します。

main.py
import os

from fake_useragent import UserAgent
from selenium import webdriver


def crawler(request):
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--window-size=1280x1696')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--hide-scrollbars')
    chrome_options.add_argument('--enable-logging')
    chrome_options.add_argument('--log-level=0')
    chrome_options.add_argument('--v=99')
    chrome_options.add_argument('--single-process')
    chrome_options.add_argument('--ignore-certificate-errors')
    chrome_options.add_argument('user-agent='+UserAgent().random)
    chrome_options.binary_location = os.getcwd() + "/bin/headless-chromium"
    driver = webdriver.Chrome(
        os.getcwd() + "/bin/chromedriver", chrome_options=chrome_options)

    driver.get('https://en.wikipedia.org/wiki/Special:Random')
    line = driver.find_element_by_class_name('firstHeading').text
    print(line)
    driver.quit()
    return line

GCP Functionsにデプロイ

gcloudコマンドが入っていない方は、公式よりインストール知てください。
https://cloud.google.com/sdk/gcloud

gcloudコマンドでデプロイ。python 3.9で動くのでpython39イメージを指定。

Terminal
gcloud functions deploy crawler \
    --region asia-northeast1 \
    --runtime python39 \
    --allow-unauthenticated \
    --trigger-http

urlを開いてみる。

Terminal
# urlを調べる
gcloud functions describe crawler --region asia-northeast1 | grep url

# アクセスしてみる
curl https://asia-northeast1-xxxx.cloudfunctions.net/crawler
Neftenbach <- ランダムな値が帰ってくる

ランダムなカラムが帰ってくればOKでです！

さいごに

最後まで読んでいただいて、ありがとうございますmm
Cloud Functions上でheadless-chromeをPythonで動かす方法を書きました。
後は、main.pyの中身を書いて、requirements.txtに必要なライブラリを追加すれば動きます！

様々なライブラリの提供者に感謝しつつ、自分も貢献できるように精進します！
twitterでもBE・アプリ開発など情報発信していますので、よかったらフォローしてくださいmm
https://twitter.com/kenbu05

番外編

GitHubに今回のコードテンプレートを公開してます

使用方法は、README.mdに書いてあります。

bin/*の実行ファイルもGitHubで管理したい

Gitで管理するにはGit LFSを使えば、Gitで管理できます。

Scrapyを使いたい

普通にCloud Functions上で、Scrapyを動かすとcrashします。
Scrapyは内部的にTwistedにて、グローバルオブジェクトを使用しているため、冪等性が担保されていないためです。

解決策としては、Scrapy用の子プロセスを生成して、その内部で実行することによってTwistedのスコープをプロセス内に留める事ができます。冪等性が担保され、crashが起きずに起動できます。

詳しくは、こちらの記事にて。

ファイルを扱いたい

Cloud Functionsでは、tmp配下のファイルが扱えます。

参考

Discussion

ログインするとコメントできます