Trying out Web Scraping
My wife loves "Michi-no-Eki" (roadside stations), and we have many opportunities to visit them on holidays and weekends. Since it's troublesome to look them up every time, I decided to convert all the Michi-no-Eki locations in the Kinki region into CSV data so I can easily grasp where they are.
What is Web Scraping?
Web scraping is the process of fetching web pages, parsing their HTML, and extracting only the information you need (strings, numerical values, and so on).
Points to Consider
Depending on how the code is written, scraping can put far more load on a server than a person clicking through pages with a mouse. (I believe it is fine unless the request rate is extreme.)
For that reason, some websites restrict automated access.
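One simple way to keep the load low is to pause between requests. The following is only a minimal sketch of that idea: `fetch_politely`, `fake_fetch`, and the delay value are hypothetical names and numbers chosen for illustration, and the stand-in fetcher is used so the example runs without touching any real site.

```python
import time
from typing import Callable, Iterable


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
) -> list[str]:
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in fetcher (hypothetical) so the sketch runs without network access.
def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


pages = fetch_politely(["page-1", "page-2", "page-3"], fake_fetch, delay_seconds=0.1)
print(len(pages))
```

In the notebook, the `fetch` argument could be something like `lambda u: requests.get(u, timeout=10).text`. This article only requests a single page, so the delay matters mainly if you later loop over the detail pages.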
The first thing to check is the site's robots.txt:
Append "/robots.txt" to the root URL of the site (for this site, that would be https://www.kkr.mlit.go.jp/robots.txt).
This is where the site tells automated programs (crawlers):
"Please do not enter here."
A plain-text page listing these rules will be displayed.
If you are not comfortable with English, please use a translation tool.
The page I am scraping this time is here. ↓
[https://www.kkr.mlit.go.jp/road/michi_no_eki/ichiran.html]
User-agent: *
Disallow: /n_info/mitoshi/eizen/gaiyou/ Clear◎
Disallow: /disclosure/ Clear◎
Disallow: /asuwa/bousai/ Clear◎
*Disallow marks the paths that crawlers are asked not to access.
None of these paths cover the page I want to scrape, so it is perfectly clear.
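The same check can also be done in code. Below is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules are copied from the listing above and passed in directly, so nothing is downloaded. The `blocked` URL is a hypothetical path, made up only to show what a disallowed result looks like.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt shown above.
robots_lines = [
    "User-agent: *",
    "Disallow: /n_info/mitoshi/eizen/gaiyou/",
    "Disallow: /disclosure/",
    "Disallow: /asuwa/bousai/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

target = "https://www.kkr.mlit.go.jp/road/michi_no_eki/ichiran.html"
blocked = "https://www.kkr.mlit.go.jp/disclosure/index.html"  # hypothetical example path

print(parser.can_fetch("*", target))   # page used in this article
print(parser.can_fetch("*", blocked))  # falls under a Disallow rule
```

To check the live file instead, `RobotFileParser` also offers `set_url()` and `read()`, at the cost of one extra request to the site.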
Next, I checked the Terms of Service.


This is also clear.
I recommend reading the Terms of Service thoroughly.
Structure of HTML Files
One of the components of a website's structure is the .html file.
Let's take a look at the structure of an HTML file. The method to check is extremely easy.
Go to the site you want to inspect and press the F12 key, or right-click and open the developer tools (or choose "View Page Source" if that option is not available).

Source: Ministry of Land, Infrastructure, Transport and Tourism, Kinki Regional Development Bureau website
I can see that it has this kind of structure.
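To make the nesting concrete, here is a small sketch that prints the tag outline of a fragment shaped like one row of the station list. The fragment is simplified from the page. `OutlinePrinter` is a hypothetical helper written for this illustration. It uses the standard-library `html.parser` as a text-only stand-in for what the developer tools show.

```python
from html.parser import HTMLParser

# Simplified fragment in the shape of one row of the station list.
fragment = """
<ul class="table body">
  <li class="cell station-name">Minamiechizen Sankairi</li>
  <li class="cell location">39-2-2 Makidani, Minamiechizen-cho, Nanjo-gun</li>
  <li class="cell to-detail"><a href="contents/fukui/minamiechizensankairi.html">Detail</a></li>
</ul>
"""


class OutlinePrinter(HTMLParser):
    """Collect each opening tag with its class attribute, indented by depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        self.outline.append("  " * self.depth + f"<{tag}> {css_class}".rstrip())
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = OutlinePrinter()
printer.feed(fragment)
print("\n".join(printer.outline))
```

The outline shows a `ul` containing several `li` cells, one of which wraps the link to the detail page. The class names on those tags are what the scraping code below searches for.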
Let's Actually Perform Scraping
The development environment is Windows 11, Python 3 (Jupyter Lab). Details are documented in my repository.
-Environment-
Run in Windows PowerShell
-------------
py -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt  # install all required libraries
jupyter lab
-------------
websq-mitinoeki/
|_____.venv/
|_____main.ipynb
|_____requirements.txt
|_____(Plan to output csv here....)
Import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
Specify the URL
url = "https://www.kkr.mlit.go.jp/road/michi_no_eki/ichiran.html#prefecture-4"
Extract HTML
r = requests.get(url)
r.encoding = "utf-8"
print(r.text)
#'<!DOCTYPE html>
#<html lang="ja">
#<head>
#<meta charset="utf-.......
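As a side note on `r.encoding = "utf-8"`: when a response's headers declare a text content type without a charset, requests falls back to ISO-8859-1, which garbles Japanese text. Whether this particular site omits the charset is an assumption on my part. The sketch below only illustrates what the wrong decoding does to UTF-8 bytes, using plain Python strings instead of a real response.

```python
text = "道の駅"
raw_bytes = text.encode("utf-8")  # what the server sends over the wire

wrong = raw_bytes.decode("iso-8859-1")  # decoding with the wrong charset
right = raw_bytes.decode("utf-8")       # decoding with the page's declared charset

print(repr(wrong))
print(right)
```

Setting `r.encoding` before reading `r.text` tells requests which of these two decodings to use.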
soup = BeautifulSoup(r.text, 'html.parser')
print(soup)
#<!DOCTYPE html>
#<html lang="ja">
#<head>
#<meta charset="utf-8"/>
Specify Tags
table_body = soup.find_all("ul", class_="table body")
# Find every <ul class="table body"> tag in the parsed soup
print(table_body[0])  # find_all() returns a list-like ResultSet, so look at the first element
#<ul class="table body">
#<li class="cell station-name">Minamiechizen Sankairi [ Minamiechizen Sankairi]
#</li>
#<li class="cell location">39-2-2 Makidani, Minamiechizen-cho, Nanjo-gun</li>
#<li class="cell line multi-line-text">Prefectural Road Nakagoya Takefu Line</li>
#<li class="cell to-detail"><a href="contents/fukui/minamiechizensankairi.html">Detail</a></li>
#</ul>
Check various outputs
print(table_body[0].find("li", class_="station-name").text.strip())
#'Minamiechizen Sankairi [ Minamiechizen Sankairi]'
print(table_body[0].find("li", class_="location").text.strip())
#'39-2-2 Makidani, Minamiechizen-cho, Nanjo-gun'
print(table_body[0].find("a").get("href"))
#'contents/fukui/minamiechizensankairi.html'
print("https://www.kkr.mlit.go.jp/road/michi_no_eki/" + table_body[0].find("a").get("href"))
#'https://www.kkr.mlit.go.jp/road/michi_no_eki/contents/fukui/minamiechizensankairi.html'
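Instead of concatenating strings by hand, the relative `href` can also be resolved against the page URL with the standard-library `urljoin`. This is an alternative sketch, not what the notebook does. The two strings are the page URL and the `href` value shown above.

```python
from urllib.parse import urljoin

page_url = "https://www.kkr.mlit.go.jp/road/michi_no_eki/ichiran.html"
href = "contents/fukui/minamiechizensankairi.html"

# urljoin replaces the last path segment of page_url with the relative href.
detail_url = urljoin(page_url, href)
print(detail_url)
```

This also copes with links written as absolute paths or full URLs, where plain concatenation would produce a broken address.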
Organize the data format
contents_data = []
head_url = "https://www.kkr.mlit.go.jp/road/michi_no_eki/"
for row in table_body:
    # Each row is one <ul class="table body"> element
    name = row.find("li", class_="station-name").text.strip()
    location = row.find("li", class_="location").text.strip()
    detail_url = head_url + row.find("a").get("href")
    data = {
        "StationName": name,
        "Address": location,
        "URL": detail_url
    }
    contents_data.append(data)
print(contents_data)
#[{'StationName': 'Minamiechizen Sankairi [ Minamiechizen Sankairi]',
# 'Address': '39-2-2 Makidani, Minamiechizen-cho, Nanjo-gun',
# 'URL': 'https://www.kkr.mlit.go.jp/road/michi_no_eki/contents/fukui/minamiechizensankairi.html'},
# {'StationName': 'Echizen Ono Arashima no Sato [ Echizen Ono Arashima no Sato]',
# 'Address': '137-21 Warabiou, Ono-shi',
# 'URL': 'https://www.kkr.mlit.go.jp/road/michi_no_eki/contents/fukui/echizenonoarashimano#sato.html'},
# {'StationName': 'Kyouryukeikoku Katsuyama[ Kyouryukeikoku Katsuyama ]',
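The station names in this output carry a bracketed part after the name. If only the name itself is wanted, a regular expression can split the two. This is a rough sketch under the assumption that every entry follows the "name [ bracketed text ]" shape seen above. `split_station_name` is a hypothetical helper, and I am guessing that the bracketed part is a reading of the name.

```python
import re

# name, optional spaces, "[", bracketed text, optional spaces, "]"
NAME_PATTERN = re.compile(r"^(?P<name>.*?)\s*\[\s*(?P<reading>.*?)\s*\]$")


def split_station_name(raw: str) -> tuple[str, str]:
    """Return (name, bracketed text); the second item is empty if no brackets are found."""
    match = NAME_PATTERN.match(raw.strip())
    if match is None:
        return raw.strip(), ""
    return match.group("name"), match.group("reading")


print(split_station_name("Minamiechizen Sankairi [ Minamiechizen Sankairi]"))
print(split_station_name("Kyouryukeikoku Katsuyama[ Kyouryukeikoku Katsuyama ]"))
```

Entries without brackets are returned unchanged, so the helper can be applied to every row without raising an error.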
Create a pandas DataFrame
df = pd.DataFrame(contents_data)
print(df)

Output to CSV
df.to_csv("mitinoeki.csv", index=False)
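To confirm the file round-trips cleanly, the sketch below writes one record and reads it back. It uses the standard-library `csv` module rather than pandas so it runs on its own. The temporary file name `mitinoeki_check.csv` and the single sample record are stand-ins for the real output. The `utf-8-sig` encoding is my own choice here, since the byte-order mark it adds helps Excel recognise the file as UTF-8 and display Japanese text correctly.

```python
import csv
import tempfile
from pathlib import Path

# One sample record in the same shape as contents_data above.
contents_data = [
    {
        "StationName": "Minamiechizen Sankairi [ Minamiechizen Sankairi]",
        "Address": "39-2-2 Makidani, Minamiechizen-cho, Nanjo-gun",
        "URL": "https://www.kkr.mlit.go.jp/road/michi_no_eki/contents/fukui/minamiechizensankairi.html",
    },
]

# Write to a temporary directory so the real output file is left alone.
csv_path = Path(tempfile.mkdtemp()) / "mitinoeki_check.csv"

with csv_path.open("w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["StationName", "Address", "URL"])
    writer.writeheader()
    writer.writerows(contents_data)

# Read the file back to confirm nothing was lost or garbled.
with csv_path.open(newline="", encoding="utf-8-sig") as f:
    rows = list(csv.DictReader(f))

print(len(rows), rows[0]["Address"])
```

With pandas, the equivalent option would be passing `encoding="utf-8-sig"` to `df.to_csv`.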
Discussion