🥣

A Quick Look at Parsing HTML with BeautifulSoup4

Published 2024/03/13

Python

Scraping

Beautiful Soup 4

tech

BeautifulSoup is a Python library for extracting data from HTML and XML files.
This post is a rough summary of its basic usage.

Documentation: BeautifulSoup Documentation

📌 Preparation

⌨️ Tested environment

python_version >= 3.8
requirements.txt

beautifulsoup4==4.12.3
certifi==2024.2.2
charset-normalizer==3.3.2
idna==3.6
requests==2.31.0
soupsieve==2.5
urllib3==2.2.1

🛠️ Installing the libraries

pip install beautifulsoup4 requests

Note: requests is used in the last section, where we scrape an external website.

📝 Sample source

from bs4 import BeautifulSoup
import re
import requests

# Example HTML content
html_content = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample Web Page</title>
</head>
<body>
    <h1>Welcome to My Web Page</h1>
    <p>This is a paragraph of text.</p>
    <ul>
        <li id="list_1" class="class_1">List item one</li>
        <li id="list_2" class="class_2">List item two</li>
        <li id="list_3" class="class_3">List item three</li>
    </ul>
    <custom>
      <a href="http://www.example1.com">Visit Example1</a>
      <a href="http://www.example2.com">Visit Example2</a>
      <a href="http://www.example3.com">Visit Example3</a>
    </custom>
</body>
</html>
"""

# Create the BeautifulSoup object, specifying the HTML parser.
soup = BeautifulSoup(html_content, 'html.parser')

📌 Basic usage

Printing the whole HTML

✅ Print the entire document

print(soup)

💻 Result

<!DOCTYPE html>

<html>
<head>
<title>Sample Web Page</title>
</head>
<body>
<h1>Welcome to My Web Page</h1>
<p>This is a paragraph of text.</p>
<ul>
<li class="class_1" id="list_1">List item one</li>
<li class="class_2" id="list_2">List item two</li>
<li class="class_3" id="list_3">List item three</li>
</ul>
<custom>
<a href="http://www.example1.com">Visit Example1</a>
<a href="http://www.example2.com">Visit Example2</a>
<a href="http://www.example3.com">Visit Example3</a>
</custom>
</body>
</html>

✅ Extract the text without tags

print(soup.get_text())

💻 Result

Sample Web Page


Welcome to My Web Page
This is a paragraph of text.

List item one
List item two
List item three


Visit Example1
Visit Example2
Visit Example3
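
The blank lines in the result above come from whitespace between tags. As a minimal sketch (the shortened HTML here is made up for illustration), `get_text()` also accepts `separator` and `strip` arguments that tidy this up:

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical fragment; not the article's full sample HTML.
html = "<ul><li>List item one</li>\n<li>List item two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Default: text nodes are concatenated as-is, including the newline between the <li> tags.
raw_text = soup.get_text()

# strip=True trims each text node and drops whitespace-only ones;
# separator is inserted between the remaining pieces.
clean_text = soup.get_text(separator=" | ", strip=True)

print(repr(raw_text))
print(clean_text)
```

This is handy when the text goes straight into a single line of a CSV or log.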

✅ Extract the text without tags using `stripped_strings`

for string in soup.stripped_strings:
    print(string)

💻 Result

Sample Web Page
Welcome to My Web Page
This is a paragraph of text.
List item one
List item two
List item three
Visit Example1
Visit Example2
Visit Example3
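
`stripped_strings` is a generator, so it can only be consumed once. A small sketch, assuming you want to reuse the values, is to wrap it in `list()`:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment based on the list in the sample source.
html = """
<ul>
    <li>List item one</li>
    <li>List item two</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Materialize the generator so the values can be indexed and reused.
items = list(soup.ul.stripped_strings)
joined = ", ".join(items)

print(items)
print(joined)
```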

Getting information about a specific DOM element

✅ Access an element by tag name

print(soup.title)             # the title tag
print(soup.title.name)        # tag name
print(soup.title.string)      # contents of the tag
print(soup.ul.text)           # text inside the ul tag
print(soup.title.parent.name) # tag name of the title's parent element
print(soup.p)                 # the p element
print(soup.ul)                # the ul element
print(soup.li.text)           # NOTE: when several matching elements exist, the first one is returned

💻 Result

<title>Sample Web Page</title>

title

Sample Web Page

List item one
List item two
List item three

head

<p>This is a paragraph of text.</p>

<ul>
<li class="class_1" id="list_1">List item one</li>
<li class="class_2" id="list_2">List item two</li>
<li class="class_3" id="list_3">List item three</li>
</ul>

List item one
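
One thing to watch with tag-name access: when the tag does not exist, the result is `None`, so chaining `.text` onto it raises `AttributeError`. A minimal sketch of guarding against that (the `table` lookup is just an example of a tag that is absent):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment without any <table> element.
html = """
<html>
<head><title>Sample Web Page</title></head>
<body><p>Hello</p></body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns None when no such tag is found.
table = soup.table
print(table)

# Guard before touching .string or get_text() to avoid AttributeError.
table_text = table.get_text() if table is not None else ""
title_text = soup.title.string if soup.title is not None else ""
print(repr(table_text), repr(title_text))
```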

Getting attributes

✅ Get the attributes of a specific element

print(soup.li['id'])          # id attribute of the li element
print(soup.li['class'])       # class attribute of the li element
print(soup.li.attrs)          # all attributes of the li element

💻 Result

list_1
['class_1']
{'id': 'list_1', 'class': ['class_1']}
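
Note that `class` comes back as a list because it is a multi-valued attribute. Also, subscript access raises `KeyError` for a missing attribute. The following sketch (the extra `highlighted` class is an invented example) shows `get()` and `has_attr()` as safer alternatives:

```python
from bs4 import BeautifulSoup

# Hypothetical element with two class names and no href attribute.
html = '<li id="list_1" class="class_1 highlighted">List item one</li>'
soup = BeautifulSoup(html, "html.parser")
li = soup.li

# Multi-valued attributes such as class are split into a list.
print(li["class"])

# get() returns None, or the given default, instead of raising KeyError.
print(li.get("href"))
print(li.get("href", "#"))

# has_attr() checks for presence without reading the value.
print(li.has_attr("id"))

# Subscript access on a missing attribute raises KeyError.
missing = False
try:
    li["href"]
except KeyError:
    missing = True
print(missing)
```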

Searching for elements

✅ `find` method (search for a single element)

print(soup.find('a'))                             # the first a tag
print(soup.find('li', id='list_3'))               # the li tag whose id is 'list_3'

💻 Result

<a href="http://www.example1.com">Visit Example1</a>

<li class="class_3" id="list_3">List item three</li>
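
Besides `id`, `find` can filter on the class. Since `class` is a reserved word in Python, the keyword argument is spelled `class_`. A short sketch, which also shows that `find` returns `None` when nothing matches (`list_9` is a deliberately nonexistent id):

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li id="list_1" class="class_1">List item one</li>
    <li id="list_2" class="class_2">List item two</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ filters by CSS class.
second = soup.find("li", class_="class_2")

# The attrs dictionary is an equivalent way to write the same filter.
same = soup.find("li", attrs={"class": "class_2"})

# No match: find returns None rather than raising.
not_found = soup.find("li", id="list_9")

print(second)
print(not_found)
```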

✅ `find_all` method (search for multiple elements, returned as a list)

print(soup.find_all('li'))                                      # all li tags
print(soup.find_all(['li', 'a']))                               # all li and a tags
print(soup.find_all(string="List item one"))                    # text nodes equal to 'List item one'
print(soup.find_all(string=["List item one", "List item two"])) # text nodes equal to any of the listed strings
print(soup.find_all(string=re.compile("List")))                 # text nodes containing 'List'

💻 Result

[<li class="class_1" id="list_1">List item one</li>, <li class="class_2" id="list_2">List item two</li>, <li class="class_3" id="list_3">List item three</li>]

[<li class="class_1" id="list_1">List item one</li>, <li class="class_2" id="list_2">List item two</li>, <li class="class_3" id="list_3">List item three</li>, <a href="http://www.example1.com">Visit Example1</a>, <a href="http://www.example2.com">Visit Example2</a>, <a href="http://www.example3.com">Visit Example3</a>]

['List item one']

['List item one', 'List item two']

['List item one', 'List item two', 'List item three']
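
As the last three results show, `string=` on its own returns text nodes rather than tags. The sketch below covers a few other filters that `find_all` accepts: `limit`, a filter function, and `string=` combined with a tag name. The helper `is_odd_item` is a made-up name for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li id="list_1" class="class_1">List item one</li>
    <li id="list_2" class="class_2">List item two</li>
    <li id="list_3" class="class_3">List item three</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# limit stops the search after the given number of matches.
first_two = soup.find_all("li", limit=2)


def is_odd_item(tag):
    """Hypothetical filter: li tags whose id ends in 1 or 3."""
    return tag.name == "li" and tag.get("id", "").endswith(("1", "3"))


# A function that takes a tag and returns True/False can be used as the filter.
odd_items = soup.find_all(is_odd_item)

# Combining a tag name with string= returns the tags, not the text nodes.
li_two = soup.find_all("li", string="List item two")

print([t["id"] for t in first_two])
print([t["id"] for t in odd_items])
print(li_two)
```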

✅ Iterate over the `find_all` results and print a specific attribute

for link in soup.find_all('a'):
    print(link.get('href'))

💻 Result

http://www.example1.com
http://www.example2.com
http://www.example3.com
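
On real pages some `a` tags have no `href`, and some links are relative. A minimal sketch, assuming you only want tags that carry the attribute or only absolute URLs (the `/about` link and the link-less tag are invented for the example):

```python
import re

from bs4 import BeautifulSoup

html = """
<a href="http://www.example1.com">Visit Example1</a>
<a href="/about">About</a>
<a>No link</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that have an href attribute at all.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# A compiled regular expression filters on the attribute value.
absolute = [a["href"] for a in soup.find_all("a", href=re.compile(r"^https?://"))]

print(hrefs)
print(absolute)
```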

✅ Search for elements by regular expression with `find_all`

for tag in soup.find_all(re.compile("^custom$")):
    print(tag.name)

💻 Result

custom
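
If you are more comfortable with CSS selectors, the same kinds of searches can probably be written with `select` and `select_one`, which rely on the soupsieve package listed in requirements.txt. This is a rough sketch as an alternative, not something the steps above depend on:

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li id="list_1" class="class_1">List item one</li>
    <li id="list_3" class="class_3">List item three</li>
</ul>
<custom>
    <a href="http://www.example1.com">Visit Example1</a>
    <a href="http://www.example2.com">Visit Example2</a>
</custom>
"""
soup = BeautifulSoup(html, "html.parser")

# select returns a list of every match; select_one returns the first match or None.
by_class = soup.select("li.class_1")
by_id = soup.select_one("#list_3")

# Child combinator plus attribute selector: <a> tags with href directly under <custom>.
links = [a["href"] for a in soup.select("custom > a[href]")]

print(by_class)
print(by_id)
print(links)
```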

Getting child elements

✅ `contents` attribute (child elements as a list)

print(soup.ul.contents)
print(soup.ul.contents[3].string)

💻 Result

['\n', <li class="class_1" id="list_1">List item one</li>, '\n', <li class="class_2" id="list_2">List item two</li>, '\n', <li class="class_3" id="list_3">List item three</li>, '\n']

List item two
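
The `'\n'` entries are whitespace text nodes, which is why the second `li` sits at index 3 rather than 1. If only the tags are wanted, one option is to filter by type, as in this small sketch:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<ul>\n<li>List item one</li>\n<li>List item two</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

# contents mixes tags with whitespace text nodes.
all_children = soup.ul.contents

# Keep only Tag instances so the indexes match the visible items.
tag_children = [child for child in all_children if isinstance(child, Tag)]

print(len(all_children), len(tag_children))
print(tag_children[1].string)
```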

✅ `children` attribute (child elements as an iterator)

for child in soup.custom.children:
    print(child)

💻 Result

<a href="http://www.example1.com">Visit Example1</a>

<a href="http://www.example2.com">Visit Example2</a>

<a href="http://www.example3.com">Visit Example3</a>
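
The blank lines in the result are the whitespace text nodes between the tags. Another way to get only the direct child tags, sketched here under the assumption that text nodes are not needed, is `find_all` with `recursive=False` (the nested `<b>` is added just to show the difference):

```python
from bs4 import BeautifulSoup

html = """
<custom>
    <a href="http://www.example1.com">Visit <b>Example1</b></a>
    <a href="http://www.example2.com">Visit Example2</a>
</custom>
"""
soup = BeautifulSoup(html, "html.parser")

# recursive=False restricts the search to direct children.
direct_links = soup.custom.find_all("a", recursive=False)

# Passing True as the name matches every tag, still direct children only.
direct_tags = soup.custom.find_all(True, recursive=False)

# <b> is a grandchild of <custom>, so it is not found without recursion.
direct_bold = soup.custom.find_all("b", recursive=False)

print(len(direct_links), len(direct_tags), direct_bold)
```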

✅ `descendants` attribute (all nested elements, recursively, as an iterator)

for descendant in soup.custom.descendants:
    print(descendant)

💻 Result

<a href="http://www.example1.com">Visit Example1</a>
Visit Example1

<a href="http://www.example2.com">Visit Example2</a>
Visit Example2

<a href="http://www.example3.com">Visit Example3</a>
Visit Example3
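
Each link appears twice in the result because `descendants` yields both the tag and the text node inside it. A minimal sketch of telling the two apart by type:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<custom><a href="http://www.example1.com">Visit Example1</a></custom>'
soup = BeautifulSoup(html, "html.parser")

# Tags have a name; text nodes are NavigableString instances.
tag_names = [d.name for d in soup.custom.descendants if isinstance(d, Tag)]
texts = [str(d) for d in soup.custom.descendants if isinstance(d, NavigableString)]

print(tag_names)
print(texts)
```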

Loading HTML from the web and processing it

✅ Fetch the HTML and extract the h1 elements

url      = "https://www.crummy.com/software/BeautifulSoup/bs4/doc"
response = requests.get(url)
html     = response.text
soup     = BeautifulSoup(html, 'html.parser')

for h1_dom in soup.find_all("h1"):
    print(h1_dom.text)

💻 Result

Beautiful Soup Documentation¶
Quick Start¶
Installing Beautiful Soup¶
Making the soup¶
Kinds of objects¶
Navigating the tree¶
Searching the tree¶
Modifying the tree¶
Output¶
Specifying the parser to use¶
Encodings¶
Line numbers¶
Comparing objects for equality¶
Copying Beautiful Soup objects¶
Advanced parser customization¶
Troubleshooting¶
Translating this documentation¶
Beautiful Soup 3¶
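
The trailing `¶` in each heading is the text of a permalink anchor inside the `h1`. The sketch below is one possible way to make the fetch a little more defensive and drop that anchor. `fetch_html` and `heading_texts` are hypothetical helper names, and the `headerlink` class is an assumption based on the usual Sphinx markup, so check the actual page source before relying on it.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10.0):
    """Download a page; raises requests.HTTPError on a 4xx/5xx response."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def heading_texts(html):
    """Return the h1 texts without the permalink marker."""
    soup = BeautifulSoup(html, "html.parser")
    headings = []
    for h1 in soup.find_all("h1"):
        # Assumption: permalink anchors carry class="headerlink".
        for anchor in h1.select("a.headerlink"):
            anchor.decompose()  # remove the anchor from the tree
        headings.append(h1.get_text(strip=True))
    return headings


# Offline sample that imitates the assumed markup.
sample = '<h1>Quick Start<a class="headerlink" href="#quick-start">¶</a></h1>'
sample_headings = heading_texts(sample)
print(sample_headings)

# Against the live page (not run here):
# heading_texts(fetch_html("https://www.crummy.com/software/BeautifulSoup/bs4/doc"))
```

Passing a `timeout` keeps the script from hanging on an unresponsive server, and `raise_for_status()` surfaces error pages instead of silently parsing them.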

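📌 Summary

This was a brief walkthrough of basic HTML processing with BeautifulSoup, from installation to actual code.
With it, you can efficiently collect and process the information you need for web scraping and data analysis.
Use BeautifulSoup to make exploring and analyzing HTML data easier and more powerful.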
