⌛

ScrapyでExponential Backoffっぽいリトライをする

2021/01/21に公開

Python

scrapy

tech

RetryMiddlewareをオーバーライドしてしまうのが楽でした。

middlewares.py

import time
import random
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

(追記)

class CustomRetryMiddleware(RetryMiddleware):
    # override
    def __init__(self, settings):
        self.download_delay = settings.getfloat('DOWNLOAD_DELAY')
        super().__init__(settings)

    # override
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)

            # Exponential Backoff (with Jitter)
            retry_times = request.meta.get('retry_times', 0)
            sec = (2 ** retry_times) + round(random.random(), 4)
            spider.logger.info('⏳ Retring: {} ({} times, wait +{} sec)'.format(spider.name, retry_times + 1, sec))
            time.sleep(sec + self.download_delay)

            return self._retry(request, reason, spider) or response
        return response

デフォルトで有効になっているオリジナルのRetryMiddlewareを無効にし、作成したCustomRetryMiddlewareを代わりに有効にします。

RetryMiddlewareと同じくRETRY_HTTP_CODESで設定されているHTTPステータスコードがリトライ対象になるので、必要なら合わせて設定してください。

setting.py

DOWNLOADER_MIDDLEWARES = {
   'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
   (略)
   '[project_name_is_here].middlewares.CustomRetryMiddleware': 550,
}

RETRY_TIMES = 5

httpbinを叩くSpiderを書いて動作確認。

spiders/backoff.py

import scrapy

class BackoffSpider(scrapy.Spider):
    name = 'backoff'
    allowed_domains = ["httpbin.org"]
    start_urls = (
        "http://www.httpbin.org/",
        "http://www.httpbin.org/status/201",
        "http://www.httpbin.org/status/500",  # in RETRY_HTTP_CODES
        "http://www.httpbin.org/status/501",  # not in RETRY_HTTP_CODES
    )

    def parse(self, response):
        return {
            "url": response.url,
            "status": response.status,
        }

RETRY_TIMESで設定した回数まで2^n秒遅延しつつリトライしていればOK。
(DOWNLOAD_DELAY分とかJitter分も加味して遅延させていますが、そのへんはお好みでカスタマイズしてください)

$ scrapy crawl backoff
(略)
2021-01-21 18:58:51 [backoff] INFO: Spider opened: backoff
2021-01-21 18:58:51 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in /Users/terukizm/goto-eater-official-info-crawler/.scrapy/httpcache
2021-01-21 18:58:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-01-21 18:58:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/robots.txt> (referer: None) ['cached']
2021-01-21 18:58:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None) ['cached']
2021-01-21 18:58:51 [backoff] INFO: ⏳ Retring: backoff (1 times, wait +1.2131 sec)
2021-01-21 18:58:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
2021-01-21 18:58:55 [backoff] INFO: ⏳ Retring: backoff (2 times, wait +2.0295 sec)
2021-01-21 18:59:01 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
2021-01-21 18:59:01 [backoff] INFO: ⏳ Retring: backoff (3 times, wait +4.5924 sec)
2021-01-21 18:59:08 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 3 times): 500 Internal Server Error
2021-01-21 18:59:08 [backoff] INFO: ⏳ Retring: backoff (4 times, wait +8.5823 sec)
2021-01-21 18:59:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 4 times): 500 Internal Server Error
2021-01-21 18:59:20 [backoff] INFO: ⏳ Retring: backoff (5 times, wait +16.488 sec)
2021-01-21 18:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 5 times): 500 Internal Server Error
2021-01-21 18:59:39 [backoff] INFO: ⏳ Retring: backoff (6 times, wait +32.1995 sec)
2021-01-21 19:00:14 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 6 times): 500 Internal Server Error
2021-01-21 19:00:14 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None) ['cached']
2021-01-21 19:00:14 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
2021-01-21 19:00:14 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.httpbin.org/>
{'url': 'http://www.httpbin.org/', 'status': 200}
2021-01-21 19:00:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://www.httpbin.org/status/500>: HTTP status code is not handled or not allowed
2021-01-21 19:00:15 [scrapy.core.engine] DEBUG: Crawled (201) <GET http://www.httpbin.org/status/201> (referer: None)
2021-01-21 19:00:15 [scrapy.core.scraper] DEBUG: Scraped from <201 http://www.httpbin.org/status/201>
{'url': 'http://www.httpbin.org/status/201', 'status': 201}
2021-01-21 19:00:15 [scrapy.core.engine] DEBUG: Crawled (501) <GET http://www.httpbin.org/status/501> (referer: None)
2021-01-21 19:00:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <501 http://www.httpbin.org/status/501>: HTTP status code is not handled or not allowed
2021-01-21 19:00:15 [scrapy.core.engine] INFO: Closing spider (finished)
2021-01-21 19:00:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2870,
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 11671,
 'downloader/response_count': 10,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/201': 1,
 'downloader/response_status_count/500': 6,
 'downloader/response_status_count/501': 1,
 'elapsed_time_seconds': 83.854904,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 1, 21, 10, 0, 15, 599183),
 'httpcache/firsthand': 2,
 'httpcache/hit': 8,
 'httpcache/miss': 2,
 'httpcache/store': 2,
 'httperror/response_ignored_count': 2,
 'httperror/response_ignored_status_count/500': 1,
 'httperror/response_ignored_status_count/501': 1,
 'item_scraped_count': 2,
 'log_count/DEBUG': 13,
 'log_count/ERROR': 1,
 'log_count/INFO': 20,
 'memusage/max': 51486720,
 'memusage/startup': 50556928,
 'response_received_count': 5,
 'retry/count': 5,
 'retry/max_reached': 1,
 'retry/reason_count/500 Internal Server Error': 5,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 9,
 'scheduler/dequeued/memory': 9,
 'scheduler/enqueued': 9,
 'scheduler/enqueued/memory': 9,
 'start_time': datetime.datetime(2021, 1, 21, 9, 58, 51, 744279)}
2021-01-21 19:00:15 [scrapy.core.engine] INFO: Spider closed (finished)

AutoThrottleとか分散クローリングとかでもちゃんと動くかはわかんないですが(twisted.internet.task.deferLaterとか使わないといけないのかも…)、参考になれば幸いです。

参考: https://stackoverflow.com/questions/55431996/scrapy-set-delay-to-retry-middleware

Discussion