Notes on web scraping in Python

Other references
- Automatically run Google searches over combinations of given keywords
  https://qiita.com/mckyhrs/items/b0ab0d9caf032b6b2a2a
- "I didn't know how beautiful BeautifulSoup was until I actually used it"
  https://naruport.com/blog/2019/7/13/how-to-use-of-beautiful-soup-4/
- BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?
  https://stackoverflow.com/questions/45494505/beautifulsoup-whats-the-difference-between-lxml-and-html-parser-and-html5
- html5lib and lxml parsers in Python
  https://www.geeksforgeeks.org/html5lib-and-lxml-parsers-in-python/
- lxml - XML and HTML with Python
  https://lxml.de/index.html
Installation
$ pip install requests
$ pip install beautifulsoup4
$ pip install lxml

1. Fetch the HTML (Requests)
Basics
import requests

if __name__ == "__main__":
    url = "https://www.python.org/"
    res = requests.get(url)
    print("res.url", res.url)                  # str
    print("res.status_code", res.status_code)  # int, e.g. 200
    print("res.headers", res.headers)          # dict-like
    print("res.encoding", res.encoding)        # str, e.g. "utf-8"
    print("res.text", res.text)                # str - the HTML body
TODO json
""" The original snippet didn't work: headers= was missing its value,
and Content-Type describes a request body, so it does nothing on a GET.
Against an endpoint that actually returns JSON, something like this should work:
headers = {"Accept": "application/json"}
res = requests.get(url, headers=headers)
res.json()
"""
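As a note toward the TODO above: Response.json() essentially parses the response body as JSON, much like calling json.loads on res.text. A minimal offline sketch (the payload string is made up for illustration):

```python
import json

# A body a JSON API might return (made-up payload for illustration);
# requests' res.json() performs essentially this parse on the body.
body = '{"status": "ok", "count": 3}'
data = json.loads(body)
print(data["status"], data["count"])  # ok 3
```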

2. Parse the HTML (BeautifulSoup)
2.1 Get information from the first matching tag
- Setup first
import requests
from bs4 import BeautifulSoup as bs

if __name__ == "__main__":
    url = "https://www.python.org/"
    res = requests.get(url)
    # Parse the HTML with BeautifulSoup
    soup = bs(res.text, "lxml")
- Get the first matching tag
    print("soup.find('h2')", soup.find('h2'))  # debug
    # shorthand: soup.h2
※ Note: calling .find() for a tag that does not exist returns None.
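You can see the None behaviour without hitting the network by parsing an inline snippet (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup as bs

# Made-up HTML fragment for illustration
html = "<html><body><h2>Hello</h2></body></html>"
soup = bs(html, "html.parser")

print(soup.find("h2"))  # <h2>Hello</h2>
print(soup.find("h3"))  # None - no such tag in the document

# Guard before accessing .text, or a missing tag raises AttributeError
tag = soup.find("h3")
if tag is not None:
    print(tag.text)
```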
- Get the text from a tag
    print("soup.find('h2').text", soup.find('h2').text)  # debug
    # Equivalent forms:
    # soup.h2.text
    # soup.find('h2').text
    # soup.h2.get_text()
    # soup.find('h2').get_text()
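The forms above are interchangeable (.text is an alias for .get_text()), but get_text() also takes separator and strip arguments, which matter when a tag has nested children. A small offline sketch with made-up markup:

```python
from bs4 import BeautifulSoup as bs

# Made-up fragment for illustration
html = "<h2>  Hello <em>World</em>  </h2>"
soup = bs(html, "html.parser")

print(repr(soup.h2.text))                  # '  Hello World  ' - same as get_text()
print(soup.h2.get_text(strip=True))        # 'HelloWorld' - each string stripped, joined with no separator
print(soup.h2.get_text(" ", strip=True))   # 'Hello World' - stripped strings joined with a space
```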
- Try it on a real page
Fetch the title text of my Zenn article, "[第一部]これならわかる!Flask+Nginx+uWSGIをAWSに丁寧にデプロイ":
url = "https://zenn.dev/kumamoto/articles/361c906f973f49"
res = requests.get(url)
soup = bs(res.text, "lxml")
print("soup.find('h1').text", soup.find('h1').text)  # debug
- Supplementary note
If you omit the parser argument, BeautifulSoup emits a warning like this:
"""
GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 15 of the file main.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
"""
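To silence the warning, name the parser explicitly. The common choices generally agree on well-formed HTML; a minimal sketch (lxml and html5lib are shown commented out since they need separate installs):

```python
from bs4 import BeautifulSoup as bs

html = "<html><body><h2>Parsers</h2></body></html>"  # made-up fragment

# Naming the parser explicitly avoids GuessedAtParserWarning
soup = bs(html, "html.parser")     # backed by the stdlib, no extra install
# soup = bs(html, "lxml")          # fast; requires `pip install lxml`
# soup = bs(html, "html5lib")      # most lenient; requires `pip install html5lib`

print(soup.h2.text)  # Parsers
```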
2.2 Get all matching tags
import requests
from bs4 import BeautifulSoup as bs

if __name__ == "__main__":
    url = "https://www.python.org/"
    res = requests.get(url)
    soup = bs(res.text, "lxml")
    results = soup.find_all("h2")
    # Build the list by appending (the original pre-allocated [None] * 10,
    # which raises IndexError if the page has more than ten h2 tags)
    results_l = []
    for r in results:
        results_l.append(r.text)
    # As a one-liner
    texts = [html_tag.text for html_tag in soup.find_all("h2")]
    print("texts", texts)  # debug
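The same find_all pattern, runnable offline against a made-up fragment:

```python
from bs4 import BeautifulSoup as bs

# Made-up fragment with several h2 tags for illustration
html = """
<html><body>
  <h2>Downloads</h2>
  <h2>Docs</h2>
  <h2>Community</h2>
</body></html>
"""
soup = bs(html, "html.parser")

# find_all returns every match in document order
texts = [tag.text for tag in soup.find_all("h2")]
print(texts)  # ['Downloads', 'Docs', 'Community']
```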