Doing Exponential Backoff-ish retries in Scrapy
The easiest way turned out to be just overriding RetryMiddleware.
middlewares.py
import time
import random
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
class CustomRetryMiddleware(RetryMiddleware):

    # override
    def __init__(self, settings):
        self.download_delay = settings.getfloat('DOWNLOAD_DELAY')
        super().__init__(settings)

    # override
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Exponential Backoff (with Jitter)
            retry_times = request.meta.get('retry_times', 0)
            sec = (2 ** retry_times) + round(random.random(), 4)
            spider.logger.info('⏳ Retrying: {} ({} times, wait +{} sec)'.format(spider.name, retry_times + 1, sec))
            time.sleep(sec + self.download_delay)
            return self._retry(request, reason, spider) or response
        return response
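For reference, here is a standalone illustration (not part of the middleware) of the wait schedule the formula above produces: the n-th retry waits roughly 2^n seconds plus up to one second of jitter, on top of DOWNLOAD_DELAY.

import random

# Illustrative only: same formula as in process_response above
for retry_times in range(6):
    sec = (2 ** retry_times) + round(random.random(), 4)
    print('retry #{}: wait +{} sec'.format(retry_times + 1, sec))
# -> roughly 1, 2, 4, 8, 16, 32 seconds, each plus 0-1 sec of jitter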
Disable the stock RetryMiddleware, which is enabled by default, and enable the CustomRetryMiddleware you just created in its place.
As with the stock RetryMiddleware, the HTTP status codes listed in RETRY_HTTP_CODES are the ones that get retried, so adjust that setting as needed (an example follows the settings block below).
settings.py
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
(snip)
'[project_name_is_here].middlewares.CustomRetryMiddleware': 550,
}
RETRY_TIMES = 5
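RETRY_HTTP_CODES and DOWNLOAD_DELAY (which the middleware's __init__ reads) live in the same settings.py. The values below are just an example, not a recommendation:

RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # example: retry typical transient errors
DOWNLOAD_DELAY = 1.0  # example: base delay added on top of the backoff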
To check that it works, I wrote a Spider that hits httpbin.
spiders/backoff.py
import scrapy


class BackoffSpider(scrapy.Spider):
    name = 'backoff'
    allowed_domains = ["httpbin.org"]
    start_urls = (
        "http://www.httpbin.org/",
        "http://www.httpbin.org/status/201",
        "http://www.httpbin.org/status/500",  # in RETRY_HTTP_CODES
        "http://www.httpbin.org/status/501",  # not in RETRY_HTTP_CODES
    )

    def parse(self, response):
        return {
            "url": response.url,
            "status": response.status,
        }
If the request is retried up to the count set in RETRY_TIMES, waiting 2^n seconds before each attempt, it's working. (The actual sleep also adds the DOWNLOAD_DELAY and the jitter, but feel free to customize that part to taste — one possible variant follows.)
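For instance, if you prefer the so-called "full jitter" strategy (picking the wait uniformly from the whole backoff window instead of adding up to one second), the delay calculation could be swapped for something like this untested sketch. MAX_BACKOFF is a made-up cap for illustration, not a Scrapy setting:

# Variant: "full jitter", replacing `sec = (2 ** retry_times) + round(random.random(), 4)`
MAX_BACKOFF = 60  # hypothetical upper bound in seconds
sec = random.uniform(0, min(MAX_BACKOFF, 2 ** retry_times))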
$ scrapy crawl backoff
(snip)
2021-01-21 18:58:51 [backoff] INFO: Spider opened: backoff
2021-01-21 18:58:51 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in /Users/terukizm/goto-eater-official-info-crawler/.scrapy/httpcache
2021-01-21 18:58:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-01-21 18:58:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/robots.txt> (referer: None) ['cached']
2021-01-21 18:58:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None) ['cached']
2021-01-21 18:58:51 [backoff] INFO: ⏳ Retrying: backoff (1 times, wait +1.2131 sec)
2021-01-21 18:58:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
2021-01-21 18:58:55 [backoff] INFO: ⏳ Retrying: backoff (2 times, wait +2.0295 sec)
2021-01-21 18:59:01 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
2021-01-21 18:59:01 [backoff] INFO: ⏳ Retrying: backoff (3 times, wait +4.5924 sec)
2021-01-21 18:59:08 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 3 times): 500 Internal Server Error
2021-01-21 18:59:08 [backoff] INFO: ⏳ Retrying: backoff (4 times, wait +8.5823 sec)
2021-01-21 18:59:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 4 times): 500 Internal Server Error
2021-01-21 18:59:20 [backoff] INFO: ⏳ Retrying: backoff (5 times, wait +16.488 sec)
2021-01-21 18:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 5 times): 500 Internal Server Error
2021-01-21 18:59:39 [backoff] INFO: ⏳ Retrying: backoff (6 times, wait +32.1995 sec)
2021-01-21 19:00:14 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 6 times): 500 Internal Server Error
2021-01-21 19:00:14 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None) ['cached']
2021-01-21 19:00:14 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
2021-01-21 19:00:14 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.httpbin.org/>
{'url': 'http://www.httpbin.org/', 'status': 200}
2021-01-21 19:00:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://www.httpbin.org/status/500>: HTTP status code is not handled or not allowed
2021-01-21 19:00:15 [scrapy.core.engine] DEBUG: Crawled (201) <GET http://www.httpbin.org/status/201> (referer: None)
2021-01-21 19:00:15 [scrapy.core.scraper] DEBUG: Scraped from <201 http://www.httpbin.org/status/201>
{'url': 'http://www.httpbin.org/status/201', 'status': 201}
2021-01-21 19:00:15 [scrapy.core.engine] DEBUG: Crawled (501) <GET http://www.httpbin.org/status/501> (referer: None)
2021-01-21 19:00:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <501 http://www.httpbin.org/status/501>: HTTP status code is not handled or not allowed
2021-01-21 19:00:15 [scrapy.core.engine] INFO: Closing spider (finished)
2021-01-21 19:00:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2870,
'downloader/request_count': 10,
'downloader/request_method_count/GET': 10,
'downloader/response_bytes': 11671,
'downloader/response_count': 10,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/201': 1,
'downloader/response_status_count/500': 6,
'downloader/response_status_count/501': 1,
'elapsed_time_seconds': 83.854904,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 1, 21, 10, 0, 15, 599183),
'httpcache/firsthand': 2,
'httpcache/hit': 8,
'httpcache/miss': 2,
'httpcache/store': 2,
'httperror/response_ignored_count': 2,
'httperror/response_ignored_status_count/500': 1,
'httperror/response_ignored_status_count/501': 1,
'item_scraped_count': 2,
'log_count/DEBUG': 13,
'log_count/ERROR': 1,
'log_count/INFO': 20,
'memusage/max': 51486720,
'memusage/startup': 50556928,
'response_received_count': 5,
'retry/count': 5,
'retry/max_reached': 1,
'retry/reason_count/500 Internal Server Error': 5,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 9,
'scheduler/dequeued/memory': 9,
'scheduler/enqueued': 9,
'scheduler/enqueued/memory': 9,
'start_time': datetime.datetime(2021, 1, 21, 9, 58, 51, 744279)}
2021-01-21 19:00:15 [scrapy.core.engine] INFO: Spider closed (finished)
I don't know whether this holds up with AutoThrottle or distributed crawling (you may need to use twisted.internet.task.deferLater or the like...), but I hope it serves as a useful reference.
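If you want to try the non-blocking route, a rough, untested sketch based on the StackOverflow answer linked below might look like the following. It assumes Scrapy's downloader middleware manager will resolve a Deferred returned from process_response, as that answer suggests:

from twisted.internet import reactor
from twisted.internet.task import deferLater

# Inside process_response, instead of time.sleep(...):
# return a Deferred that fires with the retry Request (or the original
# Response) once the backoff has elapsed, without blocking the reactor.
d = deferLater(reactor, sec + self.download_delay, lambda: None)
d.addCallback(lambda _: self._retry(request, reason, spider) or response)
return d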
Reference: https://stackoverflow.com/questions/55431996/scrapy-set-delay-to-retry-middleware