
Notes on scraping with Python


https://teratail.com/questions/337700

Other references

Installation

$ pip install requests
$ pip install beautifulsoup4
$ pip install lxml

1. Fetching HTML (Requests)

Basics

import requests

if __name__ == "__main__":
    url = "https://www.python.org/"
    res = requests.get(url)
    print("res.url", res.url)  # str
    print("res.status_code", res.status_code)  # int e.g.200
    print("res.headers", res.headers)  # dict
    print("res.encoding", res.encoding)  # str - utf-8
    print("res.text", res.text)  # str - html

JSON

    # The original attempt was marked "doesn't work": `headers=` was left
    # without a value, and Content-Type describes a request *body*, which a
    # GET doesn't have. To ask for JSON, send an Accept header to an endpoint
    # that actually serves JSON (httpbin.org assumed here as a test endpoint):
    headers = {"Accept": "application/json"}
    res = requests.get("https://httpbin.org/json", headers=headers)
    print(res.json())  # parses the JSON body into a dict

2. Parsing HTML (BeautifulSoup)

2.1 Getting information from the first matching tag

  • First, the setup
import requests
from bs4 import BeautifulSoup as bs
if __name__ == "__main__":
    url = "https://www.python.org/"
    res = requests.get(url)
    # Parse the HTML with BeautifulSoup
    soup = bs(res.text, "lxml")
  • Get the first matching tag
    print("soup.find('h2')", soup.find('h2'))  # debug
    # soup.h2 is shorthand for soup.find('h2')

※ Note: calling .find() for a tag that doesn't exist returns None.
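
So chaining .text straight onto a missing tag raises AttributeError. A small sketch of the safe pattern, continuing from the soup above:

    tag = soup.find("h9")  # no such tag on the page -> None
    if tag is not None:
        print(tag.text)
    else:
        print("tag not found")  # this branch runs, since find() returned None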

  • Get the text out of a tag
    print("soup.find('h2').text", soup.find('h2').text)  # debug
    # All of these return the same string:
    # soup.h2.text
    # soup.find('h2').text
    # soup.h2.get_text()
    # soup.find('h2').get_text()
  • Trying it on a real page

Let's extract the title text 『[第一部]これならわかる!Flask+Nginx+uWSGIをAWSに丁寧にデプロイ』 from my own Zenn article.

    url = "https://zenn.dev/kumamoto/articles/361c906f973f49"
    res = requests.get(url)
    soup = bs(res.text, "lxml")
    print("soup.find('h1').text", soup.find('h1').text)  # debug
  • Supplementary note: omitting the parser argument triggers this warning
    """
    GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best
    available HTML parser for this system ("html.parser"). This usually isn't a problem,
    but if you run this code on another system, or in a different virtual environment,
    it may use a different parser and behave differently.
    The code that caused this warning is on line 15 of the file main.py. To get rid of
    this warning, pass the additional argument 'features="html.parser"' to the
    BeautifulSoup constructor.
    """

2.2 Getting all matching tags

import requests
from bs4 import BeautifulSoup as bs

if __name__ == "__main__":
    url = "https://www.python.org/"
    res = requests.get(url)
    soup = bs(res.text, "lxml")
    results = soup.find_all("h2")
    results_l = []
    for r in results:
        results_l.append(r.text)  # collect each tag's text

# The same thing as a one-liner
    texts = [html_tag.text for html_tag in soup.find_all("h2")]
    print("texts", texts)  # debug