Web Scraping with Python 3

Overview

I needed to scrape web pages with Python, so these are my notes. There are two broad approaches: fetching the HTML statically with a plain HTTP request, or retrieving it dynamically by driving a real browser, which also gives access to content rendered by JavaScript. This article covers the latter.
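
For contrast, the static approach mentioned above can be sketched with the standard library alone. This is only an illustration and not part of the script in this article: the `TitleParser` class, the `extract_title` helper, and the sample HTML are hypothetical names and data introduced here.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the <title> text of an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


if __name__ == "__main__":
    # Static retrieval parses the HTML exactly as delivered; no JavaScript runs.
    # In a real script the string would come from an HTTP response body.
    sample = "<html><head><title> Example </title></head><body></body></html>"
    print(extract_title(sample))
```

Because nothing executes the page's scripts, this approach only sees what the server sent, which is why a browser-driven approach is used in the rest of this article.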

Environment

  • Windows 11 x64
  • Python 3.12.10
  • pip 25.0.1

Installing Geckodriver

Geckodriver is the WebDriver implementation for Firefox, so the Firefox browser must be installed. Chrome has an equivalent driver (ChromeDriver), but I did not use it this time. You need to choose a Geckodriver version that matches the versions of Firefox and of the Selenium package for Python you are using. Version compatibility can be checked on this page. The Geckodriver binary itself can be downloaded from this page.
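
If you install the binary manually, one quick way to confirm that it is discoverable is to look it up on `PATH`. The sketch below uses only the standard library. The helper names `find_executable` and `first_version_line` are hypothetical, and it assumes the binary is named `geckodriver` (on Windows, `shutil.which` also resolves `geckodriver.exe` through `PATHEXT`).

```python
import shutil
import subprocess


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def first_version_line(path):
    """Run `<path> --version` and return the first line of its output, or None."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    path = find_executable("geckodriver")
    if path is None:
        print("geckodriver: not found on PATH")
    else:
        print(f"geckodriver: {path}")
        print(first_version_line(path))
```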

Installing Dependencies

pip install selenium webdriver_manager urllib3
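
To confirm that the packages ended up in the interpreter you intend to use, the installed versions can be listed with the standard library's `importlib.metadata`. This is a small sketch; the helper name `installed_versions` is an assumption of this example, not something the packages provide.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    names = ["selenium", "webdriver_manager", "urllib3"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```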

Code

scraping.py
import sys
import time
from urllib.parse import urlparse

from selenium import webdriver
from selenium.webdriver.common.by import By

# Set the character encoding of the output to UTF-8
sys.stdout.reconfigure(encoding="utf-8")

# Get the URL from the command-line arguments
if len(sys.argv) < 2:
	print("Usage: python scraping.py <URL>")
	sys.exit(1)

url = sys.argv[1]

# Setup Firefox driver
driver = webdriver.Firefox()

# Scrape hogehoge site
def scrape_hogehoge(driver):
	title = driver.find_element(By.CSS_SELECTOR, ".main")
	inner = driver.find_element(By.CSS_SELECTOR, ".inner")
	p = inner.find_element(By.CSS_SELECTOR, "p")
	print(f"{title.text}: {p.text.strip()}")

# Scrape piyopiyo site
def scrape_piyopiyo(driver):
	title = driver.find_element(By.CSS_SELECTOR, "h1")
	print(title.text)

# Scrape other sites
def scrape_other_site(driver):
	title = driver.title # driver.title returns the page's <title> text, so no selector is needed
	print(title)

try:
	# Access the specified URL
	driver.get(url)
	# Wait a fixed 3 seconds for the page to finish loading
	time.sleep(3)
	# Get the domain and dispatch to the matching scraper
	domain = urlparse(driver.current_url).netloc

	if "hogehoge.com" in domain:
		scrape_hogehoge(driver)
	elif "piyopiyo.io" in domain:
		scrape_piyopiyo(driver)
	else:
		scrape_other_site(driver)
finally:
	driver.quit()
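
The `in` checks above are substring matches, so a host such as `nothogehoge.com` would also be routed to `scrape_hogehoge`. If that matters, one possible variant matches the host exactly or as a subdomain. This is a sketch under assumptions: `pick_scraper` and the string labels are hypothetical stand-ins for the scraper functions, and `hogehoge.com` and `piyopiyo.io` are the placeholder domains from the script.

```python
from urllib.parse import urlparse

# Placeholder domains from the script, mapped to labels that stand in
# for the scraper functions (hypothetical, for illustration only).
SCRAPERS = {
    "hogehoge.com": "hogehoge",
    "piyopiyo.io": "piyopiyo",
}


def pick_scraper(url, default="other"):
    """Return the label whose domain equals the URL's host or is a parent of it."""
    # hostname (unlike netloc) excludes any port or user info.
    host = (urlparse(url).hostname or "").lower()
    for domain, label in SCRAPERS.items():
        if host == domain or host.endswith("." + domain):
            return label
    return default


if __name__ == "__main__":
    for url in (
        "https://www.hogehoge.com/a",
        "https://piyopiyo.io/",
        "https://nothogehoge.com/",
    ):
        print(url, "->", pick_scraper(url))
```

In the script itself, the dictionary values could be the scraper functions, so the selected entry is called with `driver` instead of returning a label.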

Discussion