
Notes on scraping with Python


https://teratail.com/questions/337700

Other references

Installation

$ pip install requests
$ pip install beautifulsoup4
$ pip install lxml

1. Fetching HTML (Requests)

Basics

import requests

if __name__ == "__main__":
    url = "https://www.python.org/"
    res = requests.get(url)
    print("res.url", res.url)  # str
    print("res.status_code", res.status_code)  # int e.g.200
    print("res.headers", res.headers)  # dict
    print("res.encoding", res.encoding)  # str - utf-8
    print("res.text", res.text)  # str - html

JSON

    # The original attempt was marked "doesn't work": `headers=` was left
    # without a value, and Content-Type describes a request *body*, which a
    # GET doesn't have. To ask for JSON, send an Accept header to an endpoint
    # that actually serves JSON (httpbin.org assumed here as a test endpoint):
    headers = {"Accept": "application/json"}
    res = requests.get("https://httpbin.org/json", headers=headers)
    print(res.json())  # parses the JSON body into a dict

2. Parsing HTML (BeautifulSoup)

2.1 Getting information from the first matching tag

  • First, the setup
import requests
from bs4 import BeautifulSoup as bs
if __name__ == "__main__":
    url = "https://www.python.org/"
    res = requests.get(url)
    # Parse the HTML with BeautifulSoup
    soup = bs(res.text, "lxml")
  • Get the first matching tag
    print("soup.find('h2')", soup.find('h2'))  # debug
    # soup.h2 is shorthand for soup.find('h2')

※ Note: calling .find() for a tag that doesn't exist returns None.
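
So chaining .text straight onto a missing tag raises AttributeError. A small sketch of the safe pattern, continuing from the soup above:

    tag = soup.find("h9")  # no such tag on the page -> None
    if tag is not None:
        print(tag.text)
    else:
        print("tag not found")  # this branch runs, since find() returned None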

  • Get the text out of a tag
    print("soup.find('h2').text", soup.find('h2').text)  # debug
    # All of these return the same string:
    # soup.h2.text
    # soup.find('h2').text
    # soup.h2.get_text()
    # soup.find('h2').get_text()
  • Trying it on a real page

Let's extract the title text 『[第一部]これならわかる!Flask+Nginx+uWSGIをAWSに丁寧にデプロイ』 from my own Zenn article.

    url = "https://zenn.dev/kumamoto/articles/361c906f973f49"
    res = requests.get(url)
    soup = bs(res.text, "lxml")
    print("soup.find('h1').text", soup.find('h1').text)  # debug
  • Supplementary note: omitting the parser argument triggers this warning
    """
    GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best
    available HTML parser for this system ("html.parser"). This usually isn't a problem,
    but if you run this code on another system, or in a different virtual environment,
    it may use a different parser and behave differently.
    The code that caused this warning is on line 15 of the file main.py. To get rid of
    this warning, pass the additional argument 'features="html.parser"' to the
    BeautifulSoup constructor.
    """

2.2 Getting all matching tags

import requests
from bs4 import BeautifulSoup as bs

if __name__ == "__main__":
    url = "https://www.python.org/"
    res = requests.get(url)
    soup = bs(res.text, "lxml")
    results = soup.find_all("h2")
    results_l = []
    for r in results:
        results_l.append(r.text)  # collect each tag's text

# The same thing as a one-liner
    texts = [html_tag.text for html_tag in soup.find_all("h2")]
    print("texts", texts)  # debug