
Trying Out Scrapy: First Steps


Creating a working directory

  • (base) macpro:~ sharland$ cd dev/
  • (base) macpro:dev sharland$ mkdir scrapy;cd scrapy
  • (base) macpro:scrapy sharland$ pwd
/Users/sharland/dev/scrapy

Creating a virtual environment

  • (base) macpro:scrapy sharland$ python -m venv /Users/sharland/dev/scrapy/venv
  • (base) macpro:scrapy sharland$ source venv/bin/activate
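  • To leave the virtual environment later, the standard deactivate command can be used (a usage note, not part of the original session):
(venv) (base) macpro:scrapy sharland$ deactivate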

Installing Scrapy

  • (venv) (base) macpro:scrapy sharland$ pip install --upgrade pip
Output:
Collecting pip
  Downloading https://files.pythonhosted.org/packages/08/e3/57d4c24a050aa0bcca46b2920bff40847db79535dc78141eb83581a52eb8/pip-23.1.2-py3-none-any.whl (2.1MB)
    100% |████████████████████████████████| 2.1MB 11.6MB/s 
Installing collected packages: pip
  Found existing installation: pip 19.0.3
    Uninstalling pip-19.0.3:
      Successfully uninstalled pip-19.0.3
Successfully installed pip-23.1.2
  • (venv) (base) macpro:scrapy sharland$ pip install scrapy
Output:
Collecting scrapy
  Downloading https://files.pythonhosted.org/packages/23/c6/a337d9ccf8c7ab10262367b4351af89e1c85ae4f1a0b98d52381bc931c80/Scrapy-2.8.0-py2.py3-none-any.whl (272kB)
    100% |████████████████████████████████| 276kB 11.3MB/s 
Collecting packaging (from scrapy)
  Using cached https://files.pythonhosted.org/packages/ab/c3/57f0601a2d4fe15de7a553c00adbc901425661bf048f2a22dfc500caf121/packaging-23.1-py3-none-any.whl
Collecting queuelib>=1.4.2 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/06/1e/9e3bfb6a10253f5d95acfed9c5732f4abc2ef87bdf985594ddfb99d222da/queuelib-1.6.2-py2.py3-none-any.whl
Collecting parsel>=1.5.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/36/d9/b67d9f251a69037c79bac90f975c84696f5ca68045bd1b97e68804625757/parsel-1.8.1-py2.py3-none-any.whl
Collecting Twisted>=18.9.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/ac/63/b5540d15dfeb7388fbe12fa55a902c118fd2b324be5430cdeac0c0439489/Twisted-22.10.0-py3-none-any.whl (3.1MB)
    100% |████████████████████████████████| 3.1MB 5.2MB/s 
Collecting tldextract (from scrapy)
  Downloading https://files.pythonhosted.org/packages/fb/21/dad9eaedad757362458f92f9345307cc847956ab9775ee9ab5a0fcb912cf/tldextract-3.4.1-py3-none-any.whl (92kB)
    100% |████████████████████████████████| 102kB 14.0MB/s 
Collecting cryptography>=3.4.6 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/85/86/a17a4baf08e0ae6496b44f75136f8e14b843fd3d8a3f4105c0fd79d4786b/cryptography-40.0.2-cp36-abi3-macosx_10_12_x86_64.whl (2.8MB)
    100% |████████████████████████████████| 2.8MB 8.2MB/s 
Requirement already satisfied: setuptools in ./venv/lib/python3.7/site-packages (from scrapy) (40.8.0)
Collecting cssselect>=0.9.1 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/06/a9/2da08717a6862c48f1d61ef957a7bba171e7eefa6c0aa0ceb96a140c2a6b/cssselect-1.2.0-py2.py3-none-any.whl
Collecting pyOpenSSL>=21.0.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/b7/6d/d7377332703ffd8821878794aca4fb54637da654bf3e467ffb32109c2147/pyOpenSSL-23.1.1-py3-none-any.whl (57kB)
    100% |████████████████████████████████| 61kB 16.1MB/s 
Collecting service-identity>=18.1.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/93/5a/5e93f280ec7be676b5a57f305350f439d31ced168bca04e6ffa64b575664/service_identity-21.1.0-py2.py3-none-any.whl
Collecting lxml>=4.3.0 (from scrapy)
  Using cached https://files.pythonhosted.org/packages/af/cc/2136ec0afa2625ae45c2318c40a74ed8de2d669af12e98bb2fb356069698/lxml-4.9.2-cp37-cp37m-macosx_10_15_x86_64.whl
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/d8/17/30a7228107e568291491a5ea0f8ad8c96c813a92b34ef71fafb489755633/w3lib-2.1.1-py3-none-any.whl
Collecting itemloaders>=1.0.1 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/9a/98/d03afe36d01b1adf03435bb306d0e7f87d498c94ba8db5290f716b350bb8/itemloaders-1.1.0-py3-none-any.whl
Collecting zope.interface>=5.1.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/f6/46/d10583c874ed67386ba09630cacc9baea446f429e02cff973ce34fc6419e/zope.interface-6.0-cp37-cp37m-macosx_10_15_x86_64.whl (202kB)
    100% |████████████████████████████████| 204kB 13.4MB/s 
Collecting PyDispatcher>=2.0.5; platform_python_implementation == "CPython" (from scrapy)
  Downloading https://files.pythonhosted.org/packages/66/0e/9ee7bc0b48ec45d93b302fa2d787830dca4dc454d31a237faa5815995988/PyDispatcher-2.0.7-py3-none-any.whl
Collecting protego>=0.1.15 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/81/4d/3e01f10d6dd2d35793711c2e27a07e547c6aec0ab8d3199bb83e68956fdb/Protego-0.2.1-py2.py3-none-any.whl
Collecting itemadapter>=0.1.0 (from scrapy)
  Downloading https://files.pythonhosted.org/packages/3b/9e/e1a5a2882d5a3cbf9018d18102edc4cc34de8a207e6c5eb765784298fb48/itemadapter-0.8.0-py3-none-any.whl
Collecting jmespath (from parsel>=1.5.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/31/b4/b9b800c45527aadd64d5b442f9b932b00648617eb5d63d2c7a6587b7cafc/jmespath-1.0.1-py3-none-any.whl
Collecting typing-extensions; python_version < "3.8" (from parsel>=1.5.0->scrapy)
  Using cached https://files.pythonhosted.org/packages/31/25/5abcd82372d3d4a3932e1fa8c3dbf9efac10cc7c0d16e78467460571b404/typing_extensions-4.5.0-py3-none-any.whl
Collecting incremental>=21.3.0 (from Twisted>=18.9.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/77/51/8073577012492fcd15628e811db585f447c500fa407e944ab3a18ec55fb7/incremental-22.10.0-py2.py3-none-any.whl
Collecting constantly>=15.1 (from Twisted>=18.9.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/b9/65/48c1909d0c0aeae6c10213340ce682db01b48ea900a7d9fce7a7910ff318/constantly-15.1.0-py2.py3-none-any.whl
Collecting attrs>=19.2.0 (from Twisted>=18.9.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/f0/eb/fcb708c7bf5056045e9e98f62b93bd7467eb718b0202e7698eb11d66416c/attrs-23.1.0-py3-none-any.whl (61kB)
    100% |████████████████████████████████| 61kB 5.5MB/s 
Collecting Automat>=0.8.0 (from Twisted>=18.9.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/29/90/64aabce6c1b820395452cc5472b8f11cd98320f40941795b8069aef4e0e0/Automat-22.10.0-py2.py3-none-any.whl
Collecting hyperlink>=17.1.1 (from Twisted>=18.9.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/6e/aa/8caf6a0a3e62863cbb9dab27135660acba46903b703e224f14f447e57934/hyperlink-21.0.0-py2.py3-none-any.whl (74kB)
    100% |████████████████████████████████| 81kB 16.1MB/s 
Collecting requests>=2.1.0 (from tldextract->scrapy)
  Downloading https://files.pythonhosted.org/packages/cf/e1/2aa539876d9ed0ddc95882451deb57cfd7aa8dbf0b8dbce68e045549ba56/requests-2.29.0-py3-none-any.whl (62kB)
    100% |████████████████████████████████| 71kB 21.7MB/s 
Collecting filelock>=3.0.8 (from tldextract->scrapy)
  Downloading https://files.pythonhosted.org/packages/ad/73/b094a662ae05cdc4ec95bc54e434e307986a5de5960166b8161b7c1373ee/filelock-3.12.0-py3-none-any.whl
Collecting requests-file>=1.4 (from tldextract->scrapy)
  Downloading https://files.pythonhosted.org/packages/77/86/cdb5e8eaed90796aa83a6d9f75cfbd37af553c47a291cd47bc410ef9bdb2/requests_file-1.5.1-py2.py3-none-any.whl
Collecting idna (from tldextract->scrapy)
  Using cached https://files.pythonhosted.org/packages/fc/34/3030de6f1370931b9dbb4dad48f6ab1015ab1d32447850b9fc94e60097be/idna-3.4-py3-none-any.whl
Collecting cffi>=1.12 (from cryptography>=3.4.6->scrapy)
  Downloading https://files.pythonhosted.org/packages/b5/7d/df6c088ef30e78a78b0c9cca6b904d5abb698afb5bc8f5191d529d83d667/cffi-1.15.1-cp37-cp37m-macosx_10_9_x86_64.whl (178kB)
    100% |████████████████████████████████| 184kB 19.8MB/s 
Collecting pyasn1-modules (from service-identity>=18.1.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/cd/8e/bea464350e1b8c6ed0da3a312659cb648804a08af6cacc6435867f74f8bd/pyasn1_modules-0.3.0-py2.py3-none-any.whl (181kB)
    100% |████████████████████████████████| 184kB 12.0MB/s 
Collecting pyasn1 (from service-identity>=18.1.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/14/e5/b56a725cbde139aa960c26a1a3ca4d4af437282e20b5314ee6a3501e7dfc/pyasn1-0.5.0-py2.py3-none-any.whl (83kB)
    100% |████████████████████████████████| 92kB 13.2MB/s 
Collecting six (from service-identity>=18.1.0->scrapy)
  Using cached https://files.pythonhosted.org/packages/d9/5a/e7c31adbe875f2abbb91bd84cf2dc52d792b5a01506781dbcf25c91daf11/six-1.16.0-py2.py3-none-any.whl
Collecting importlib-metadata; python_version < "3.8" (from attrs>=19.2.0->Twisted>=18.9.0->scrapy)
  Downloading https://files.pythonhosted.org/packages/30/bb/bf2944b8b88c65b797acc2c6a2cb0fb817f7364debf0675792e034013858/importlib_metadata-6.6.0-py3-none-any.whl
Collecting urllib3<1.27,>=1.21.1 (from requests>=2.1.0->tldextract->scrapy)
  Using cached https://files.pythonhosted.org/packages/7b/f5/890a0baca17a61c1f92f72b81d3c31523c99bec609e60c292ea55b387ae8/urllib3-1.26.15-py2.py3-none-any.whl
Collecting charset-normalizer<4,>=2 (from requests>=2.1.0->tldextract->scrapy)
  Using cached https://files.pythonhosted.org/packages/e1/b4/53678b2a14e0496fc167fe9b9e726ad33d670cfd2011031aa5caeee6b784/charset_normalizer-3.1.0-cp37-cp37m-macosx_10_9_x86_64.whl
Collecting certifi>=2017.4.17 (from requests>=2.1.0->tldextract->scrapy)
  Using cached https://files.pythonhosted.org/packages/71/4c/3db2b8021bd6f2f0ceb0e088d6b2d49147671f25832fb17970e9b583d742/certifi-2022.12.7-py3-none-any.whl
Collecting pycparser (from cffi>=1.12->cryptography>=3.4.6->scrapy)
  Using cached https://files.pythonhosted.org/packages/62/d5/5f610ebe421e85889f2e55e33b7f9a6795bd982198517d912eb1c76e1a53/pycparser-2.21-py2.py3-none-any.whl
Collecting zipp>=0.5 (from importlib-metadata; python_version < "3.8"->attrs>=19.2.0->Twisted>=18.9.0->scrapy)
  Using cached https://files.pythonhosted.org/packages/5b/fa/c9e82bbe1af6266adf08afb563905eb87cab83fde00a0a08963510621047/zipp-3.15.0-py3-none-any.whl
Installing collected packages: packaging, queuelib, jmespath, lxml, w3lib, typing-extensions, cssselect, parsel, incremental, zope.interface, constantly, zipp, importlib-metadata, attrs, six, Automat, idna, hyperlink, Twisted, urllib3, charset-normalizer, certifi, requests, filelock, requests-file, tldextract, pycparser, cffi, cryptography, pyOpenSSL, pyasn1, pyasn1-modules, service-identity, itemadapter, itemloaders, PyDispatcher, protego, scrapy
Successfully installed Automat-22.10.0 PyDispatcher-2.0.7 Twisted-22.10.0 attrs-23.1.0 certifi-2022.12.7 cffi-1.15.1 charset-normalizer-3.1.0 constantly-15.1.0 cryptography-40.0.2 cssselect-1.2.0 filelock-3.12.0 hyperlink-21.0.0 idna-3.4 importlib-metadata-6.6.0 incremental-22.10.0 itemadapter-0.8.0 itemloaders-1.1.0 jmespath-1.0.1 lxml-4.9.2 packaging-23.1 parsel-1.8.1 protego-0.2.1 pyOpenSSL-23.1.1 pyasn1-0.5.0 pyasn1-modules-0.3.0 pycparser-2.21 queuelib-1.6.2 requests-2.29.0 requests-file-1.5.1 scrapy-2.8.0 service-identity-21.1.0 six-1.16.0 tldextract-3.4.1 typing-extensions-4.5.0 urllib3-1.26.15 w3lib-2.1.1 zipp-3.15.0 zope.interface-6.0
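  • As a quick sanity check (not part of the original session), the scrapy CLI can confirm the installed version; the exact text it prints depends on the installed packages:
(venv) (base) macpro:scrapy sharland$ scrapy version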

Creating and running a test program

  • (venv) (base) macpro:scrapy sharland$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}

        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)
EOF
  • (venv) (base) macpro:scrapy sharland$ scrapy runspider myspider.py
Output:
2023-05-03 23:40:39 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2023-05-03 23:40:39 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.7.4 (default, Aug 13 2019, 15:17:50) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 23.1.1 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform Darwin-22.4.0-x86_64-i386-64bit
2023-05-03 23:40:39 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2023-05-03 23:40:39 [py.warnings] WARNING: /Users/sharland/dev/scrapy/venv/lib/python3.7/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-05-03 23:40:39 [scrapy.utils.log] DEBUG: Using reactor: t

... (output omitted) ...

2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'A Practical Guide to JSON Parsing with Python'}
2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'A Practical Guide to JSON Parsing with Python'}
2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'How to (safely) extract data from social media platforms and news sites'}
2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'Zyte API – a single solution for web data extraction'}
2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'Black Friday 2022 – an analysis of web scraping patterns'}
2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'How web scraping can be used for digital transformation'}
2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'Zyte vs import.io: Which is the best alternative?'}
2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'Web scraping e-commerce: 5 ways to help you succeed'}
2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'The Scraper’s System Part 2: Explorer’s Compass to analyze websites\xa0'}
2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'Maximize the quality of news and article data extraction'}
2023-05-03 23:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/>
{'title': 'The Scraper’s System: a secret sauce to architect scalable web scraping applications'}
2023-05-03 23:40:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zyte.com/blog/page/2/> (referer: https://www.zyte.com/blog/)
2023-05-03 23:40:41 [py.warnings] WARNING: /Users/sharland/dev/scrapy/venv/lib/python3.7/site-packages/scrapy/selector/unified.py:83: UserWarning: Selector got both text and root, root is being ignored.
  super().__init__(text=text, type=st, root=root, **kwargs)

2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': 'A Practical Guide to JSON Parsing with Python'}
2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': '4 key steps to develop an Automated Data QA process'}
2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': 'Reflecting on the 2022 Web Data Extraction Summit | Zyte'}
2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': 'How Data Mining and AI create business value'}
2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': '6 Key Takeaways from Extract Summit 2022'}
2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': 'Hacktoberfest 2022: Contribute to the Open-Source Community'}
2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': 'Web Data Extraction Summit 2022'}
2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': 'Guarantee the best results for product data extraction'}
2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': 'Web snapshots? The what, the why, and the how'}
2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': 'How to successfully build an Enterprise Data Extraction infrastructure'}
2023-05-03 23:40:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/2/>
{'title': 'How to optimize your data strategy with web data scraping'}
2023-05-03 23:40:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zyte.com/blog/page/3/> (referer: https://www.zyte.com/blog/page/2/)
2023-05-03 23:40:42 [py.warnings] WARNING: /Users/sharland/dev/scrapy/venv/lib/python3.7/site-packages/scrapy/selector/unified.py:83: UserWarning: Selector got both text and root, root is being ignored.
  super().__init__(text=text, type=st, root=root, **kwargs)

... (output omitted) ...

2023-05-03 23:41:05 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/27/>
{'title': 'A Practical Guide to JSON Parsing with Python'}
2023-05-03 23:41:05 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/27/>
{'title': 'Spoofing your Scrapy bot IP using tsocks'}
2023-05-03 23:41:05 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zyte.com/blog/page/27/>
{'title': 'Hello, world'}
2023-05-03 23:41:05 [scrapy.core.engine] INFO: Closing spider (finished)
2023-05-03 23:41:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7213,
 'downloader/request_count': 27,
 'downloader/request_method_count/GET': 27,
 'downloader/response_bytes': 780780,
 'downloader/response_count': 27,
 'downloader/response_status_count/200': 27,
 'elapsed_time_seconds': 25.591,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 5, 3, 14, 41, 5, 209032),
 'httpcompression/response_bytes': 5194192,
 'httpcompression/response_count': 27,
 'item_scraped_count': 289,
 'log_count/DEBUG': 317,
 'log_count/INFO': 10,
 'log_count/WARNING': 28,
 'memusage/max': 57356288,
 'memusage/startup': 57356288,
 'request_depth_max': 26,
 'response_received_count': 27,
 'scheduler/dequeued': 27,
 'scheduler/dequeued/memory': 27,
 'scheduler/enqueued': 27,
 'scheduler/enqueued/memory': 27,
 'start_time': datetime.datetime(2023, 5, 3, 14, 40, 39, 618032)}
2023-05-03 23:41:05 [scrapy.core.engine] INFO: Spider closed (finished)
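  • The scraped titles above appear only in the log. To write them to a file instead, Scrapy's feed export options can be passed to runspider; a usage sketch (-o appends to the named file, -O overwrites it):
(venv) (base) macpro:scrapy sharland$ scrapy runspider myspider.py -o titles.json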

Creating a project

  • (venv) (base) macpro:scrapy sharland$ mkdir projects;cd projects
  • (venv) (base) macpro:projects sharland$ scrapy startproject books
Output:
New Scrapy project 'books', using template directory '/Users/sharland/dev/scrapy/venv/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /Users/sharland/dev/scrapy/projects/books

You can start your first spider with:
    cd books
    scrapy genspider example example.com
  • (venv) (base) macpro:projects sharland$ tree
.
└── books
    ├── books
    │   ├── __init__.py
    │   ├── items.py
    │   ├── middlewares.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       └── __init__.py
    └── scrapy.cfg

3 directories, 7 files
Item            Description
scrapy.cfg      Configuration file for creating and deploying spiders
spiders         Folder in which spiders are created
items.py        Containers for the items obtained by scraping
middlewares.py  Custom logic that extends request/response processing
pipelines.py    Cleansing and validating scraped data and writing it to a database
settings.py     File for configuring the various parameters
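As a sketch of how items.py is typically used (BookItem is a hypothetical class, not part of the generated template), fields for the book data scraped later in this article could be declared like this:

import scrapy

class BookItem(scrapy.Item):
    # Hypothetical item declaring one Field per value scraped later on
    title = scrapy.Field()
    price = scrapy.Field()
    thumbnail = scrapy.Field()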

Creating a spider

  • (venv) (base) macpro:projects sharland$ cd books/
  • Command syntax
scrapy genspider [-t template] spider_name URL
  • Target URL to scrape: https://books.toscrape.com/catalogue/category/books/business_35/index.html
  • Create the spider
    • (venv) (base) macpro:books sharland$ scrapy genspider books_basic books.toscrape.com/catalogue/category/books/business_35/index.html
Output:
Created spider 'books_basic' using template 'basic' in module:
  books.spiders.books_basic
  • Check the generated files
    • (venv) (base) macpro:books sharland$ tree
      • Generated file: books_basic.py
.
├── books
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-37.pyc
│   │   └── settings.cpython-37.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-37.pyc
│       └── books_basic.py
└── scrapy.cfg

4 directories, 11 files
  • Command to list the available templates
    • (venv) (base) macpro:books sharland$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
Template  Purpose                     Notes
basic     Basic spider
crawl     Crawling ordinary websites  Follows links according to defined rules
csvfeed   Reading CSV files
xmlfeed   Reading XML files
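To illustrate the crawl template (a sketch only; this spider is not generated in this session, and the class name and rule values are assumptions), a CrawlSpider declares rules that tell Scrapy which links to follow:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BooksCrawlSpider(CrawlSpider):
    # Hypothetical spider showing the shape of the 'crawl' template
    name = "books_crawl"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    # Follow links whose URL contains catalogue/ and pass each
    # fetched page to parse_item
    rules = (
        Rule(LinkExtractor(allow=r"catalogue/"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}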

Inspecting the generated spider file

  • (venv) (base) macpro:books sharland$ cat books/spiders/books_basic.py
import scrapy

class BooksBasicSpider(scrapy.Spider):
    name = "books_basic"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        pass
Item              Description                                 Notes
BooksBasicSpider  Subclasses scrapy.Spider                    Override only the functionality to be added
name              Spider name given to the genspider command  Must be unique within a project
allowed_domains   List of domains the spider may access       Multiple domains can be specified
start_urls        URLs where the spider starts scraping       Generated with http:// by default; change to https:// if needed
parse             Parsing method                              Extracts data with XPath or CSS selectors
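These attributes are consumed by scrapy.Spider's start_requests method, which can be overridden when the initial requests need customization. A minimal sketch of roughly what the default does (the spider name here is hypothetical):

import scrapy

class StartUrlsDemoSpider(scrapy.Spider):
    name = "start_urls_demo"  # hypothetical name
    start_urls = ["https://books.toscrape.com/"]

    # Roughly the default behavior: turn each entry in start_urls into
    # a Request whose response is handed to parse()
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)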

Updating the generated spider file

Implement the scraping code in the parse method. Note that, as written below, the XPaths beginning with // search the entire document rather than each <article>, and the yield sits outside the for loop, so the spider emits a single item whose fields are lists covering every book on the page (see the variant sketch after the run output below).

import scrapy

class BooksBasicSpider(scrapy.Spider):
    name = "books_basic"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/category/books/business_35/index.html"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        articles = sel.xpath('//*[@id="default"]/div/div/div/div/section/div[2]/ol/li/article')
        for article in articles:
            # XPaths starting with '//' search from the document root, so each
            # call returns the values for every article on the page as a list
            title = article.xpath('//h3/a/@title').extract()
            price = article.xpath('//div[2]/p[1]/text()').extract()
            thumbnail = article.xpath('//div[1]/a/img/@src').extract()
        # The yield is outside the loop, so exactly one item is emitted
        yield {
            'title': title,
            'price': price,
            'thumbnail': thumbnail,
        }
XPath item                                                        Description
//*[@id="default"]/div/div/div/div/section/div[2]/ol/li/article   Base path of the repeated tags
//h3/a/@title                                                     title attribute of the <a> tag inside the <h3> tag
//div[2]/p[1]/text()                                              Text of the first <p> tag inside the second <div> tag
//div[1]/a/img/@src                                               src attribute of the <img> tag inside the <a> tag of the first <div> tag
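XPaths like these can be tested interactively before being wired into a spider; scrapy shell is the standard tool for this (the session below is illustrative, not from the original run):

  • (venv) (base) macpro:books sharland$ scrapy shell "https://books.toscrape.com/catalogue/category/books/business_35/index.html"
>>> response.xpath('//h3/a/@title').getall()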

Running the scrape

  • (venv) (base) macpro:books sharland$ scrapy runspider books/spiders/books_basic.py
Output:
2023-05-04 22:42:04 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: books)
2023-05-04 22:42:04 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.7.4 (default, Aug 13 2019, 15:17:50) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 23.1.1 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform Darwin-22.4.0-x86_64-i386-64bit
2023-05-04 22:42:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'books',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'books.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_LOADER_WARN_ONLY': True,
 'SPIDER_MODULES': ['books.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-05-04 22:42:04 [asyncio] DEBUG: Using selector: KqueueSelector
2023-05-04 22:42:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-05-04 22:42:04 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-05-04 22:42:04 [scrapy.extensions.telnet] INFO: Telnet Password: aa4402688b6272c8
2023-05-04 22:42:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-05-04 22:42:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-05-04 22:42:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-05-04 22:42:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-05-04 22:42:04 [scrapy.core.engine] INFO: Spider opened
2023-05-04 22:42:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-05-04 22:42:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-05-04 22:42:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://books.toscrape.com/robots.txt> (referer: None)
2023-05-04 22:42:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/category/books/business_35/index.html> (referer: None)
2023-05-04 22:42:05 [py.warnings] WARNING: /Users/sharland/dev/scrapy/venv/lib/python3.7/site-packages/scrapy/selector/unified.py:83: UserWarning: Selector got both text and root, root is being ignored.
  super().__init__(text=text, type=st, root=root, **kwargs)

2023-05-04 22:42:05 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/category/books/business_35/index.html>
{'title': ['The Dirty Little Secrets of Getting Your Dream Job', 'The Third Wave: An Entrepreneur’s Vision of the Future', 'The 10% Entrepreneur: Live Your Startup Dream Without Quitting Your Day Job', 'Shoe Dog: A Memoir by the Creator of NIKE', 'Made to Stick: Why Some Ideas Survive and Others Die', 'Quench Your Own Thirst: Business Lessons Learned Over a Beer or Two', 'The Art of Startup Fundraising', 'Born for This: How to Find the Work You Were Meant to Do', "The E-Myth Revisited: Why Most Small Businesses Don't Work and What to Do About It", 'Rich Dad, Poor Dad', "The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses", 'Rework'], 
'price': ['£33.34', '£12.61', '£27.55', '£23.99', '£38.85', '£43.14', '£21.00', '£21.59', '£36.91', '£51.74', '£33.92', '£44.88'], 
'thumbnail': ['../../../../media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg', '../../../../media/cache/d0/77/d077a30042df6b916bfc8d257345c69e.jpg', '../../../../media/cache/82/93/82939ca78da0b724f16ec814849514fd.jpg', '../../../../media/cache/19/aa/19aa1184a3565b1dae6092146018e109.jpg', '../../../../media/cache/e2/2e/e22e4a82d97f9f0689d5295a98f5dcff.jpg', '../../../../media/cache/2d/fd/2dfdc52bcdbd82dee50372bc46c83e15.jpg', '../../../../media/cache/b3/7b/b37be83183f1dcb759d92bda8f8998a4.jpg', '../../../../media/cache/aa/67/aa677a97ecdcbbde7471f1c90ed0cf6f.jpg', '../../../../media/cache/11/2c/112c55a6bcd401c3bd603f5ddb2e6b82.jpg', '../../../../media/cache/18/f4/18f45d31e3892fee589e23f15d759ee3.jpg', '../../../../media/cache/39/f1/39f167dff90d7f84f5c8dc5e05d4051b.jpg', '../../../../media/cache/54/10/5410a58193e2373c04b3021ade78a82b.jpg']}
2023-05-04 22:42:05 [scrapy.core.engine] INFO: Closing spider (finished)
2023-05-04 22:42:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 493,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 39595,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.100196,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 5, 4, 13, 42, 5, 859625),
 'item_scraped_count': 1,
 'log_count/DEBUG': 6,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 57233408,
 'memusage/startup': 57233408,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2023, 5, 4, 13, 42, 4, 759429)}
2023-05-04 22:42:05 [scrapy.core.engine] INFO: Spider closed (finished)
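As the stats confirm (item_scraped_count: 1), the run produced a single item whose fields are lists, because the // XPaths search the whole document and the yield sits outside the loop. A variant sketch that instead emits one item per book, using XPaths relative to each <article> (a hypothetical spider, not run in this session):

import scrapy

class BooksPerItemSpider(scrapy.Spider):
    # Hypothetical variant of books_basic emitting one item per book
    name = "books_per_item"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/category/books/business_35/index.html"]

    def parse(self, response):
        # response.xpath() can be called directly; no explicit Selector needed
        articles = response.xpath('//*[@id="default"]/div/div/div/div/section/div[2]/ol/li/article')
        for article in articles:
            # Paths starting with './' stay relative to the current <article>
            yield {
                'title': article.xpath('./h3/a/@title').get(),
                'price': article.xpath('./div[2]/p[1]/text()').get(),
                'thumbnail': article.xpath('./div[1]/a/img/@src').get(),
            }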

Pushing to GitHub

  • (base) macpro:scrapy sharland$ cd ~/dev/scrapy/;git init
  • (base) macpro:scrapy sharland$ vim .gitignore
.git/
venv/
test/
.DS_Store
.gitignore
  • (base) macpro:scrapy sharland$ git add .
  • (base) macpro:scrapy sharland$ git commit -m "2023/5/5(fri)09:20"
  • (base) macpro:scrapy sharland$ git status
  • (base) macpro:scrapy sharland$ git remote add origin https://github.com/teruroom/scrapy.git
  • (base) macpro:scrapy sharland$ git remote -v
origin  https://github.com/teruroom/scrapy.git (fetch)
origin  https://github.com/teruroom/scrapy.git (push)
  • (base) macpro:scrapy sharland$ git push -u origin master