Trying out Web Scraping



My wife loves "Michi-no-Eki" (roadside stations), and we often visit them on weekends and holidays. Looking them up every time is tedious, so I decided to collect all the Michi-no-Eki in the Kinki region into a CSV file so that I can see at a glance where they are.

What is Web Scraping?

Web scraping is the process of parsing web pages (HTML) and extracting the information you need, such as strings and numerical values.

Points to Consider

Depending on how the code is written, scraping can put more load on a server than a person clicking through pages with a mouse, because a script can send many requests in a short time. (I believe it is fine unless it is an extreme case.)

For that reason, some websites restrict automated access.
The first thing to check is:




Append "robots.txt" to the site's top-level URL (https://<domain>/robots.txt).

This is where the site tells crawlers:
"Automated programs (crawlers), please do not enter here."

A plain-text page listing these rules is displayed.
If you are not comfortable with English, please use a translation tool.
The page I am scraping this time is:

[https://www.kkr.mlit.go.jp/road/michi_no_eki/ichiran.html]

This is what the site's robots.txt showed:

User-agent: *
Disallow: /n_info/mitoshi/eizen/gaiyou/  Clear◎
Disallow: /disclosure/   Clear◎
Disallow: /asuwa/bousai/   Clear◎

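* "Disallow" marks the paths that crawlers are asked to stay out of. "Clear◎" is my own note that the page I want (/road/michi_no_eki/ichiran.html) is not under that path.
None of the rules apply to it, so it is clear.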
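
The same check can also be scripted. Below is a minimal sketch using Python's standard `urllib.robotparser`, fed with only the three Disallow rules quoted above (an assumption: the live robots.txt may contain more rules). The helper name `is_allowed` is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumption: only the rules quoted in this article, not the full live file.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /n_info/mitoshi/eizen/gaiyou/",
    "Disallow: /disclosure/",
    "Disallow: /asuwa/bousai/",
]


def is_allowed(url, robots_lines=ROBOTS_LINES, user_agent="*"):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from a list of lines, no network access
    return parser.can_fetch(user_agent, url)


TARGET = "https://www.kkr.mlit.go.jp/road/michi_no_eki/ichiran.html"
print(is_allowed(TARGET))                                              # target page
print(is_allowed("https://www.kkr.mlit.go.jp/disclosure/index.html"))  # under a Disallow path
```

Against the live site, you would call `set_url()` with the robots.txt address and then `read()` on the parser instead of `parse()`, so that the current rules are used.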

Next, I checked the Terms of Service.






This check also passed.
I recommend reading the Terms of Service thoroughly before scraping any site.
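
On the server-load concern mentioned earlier: this article only fetches a single page, but if the script were extended to request many pages (for example, each station's detail page), spacing the requests out keeps the load modest. Here is a minimal sketch, assuming a fixed minimum gap between requests is acceptable. The `Throttle` class and its parameters are hypothetical and not part of any library.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last = None     # time of the previous wait() call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch (not executed here):
# throttle = Throttle(min_interval=1.0)
# for page_url in page_urls:
#     throttle.wait()          # sleeps only if the last request was under 1 s ago
#     r = requests.get(page_url)
```

The first call never sleeps, and later calls sleep only for the part of the interval that has not already elapsed.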




Structure of HTML Files

One of the building blocks of a website is the .html file.
Let's take a look at the structure of an HTML file. Checking it is very easy.
Open the site you want to inspect and press the F12 key, or right-click and select "Developer Tools" (or "View Page Source" if the former is not available).



Source: Ministry of Land, Infrastructure, Transport and Tourism, Kinki Regional Development Bureau website


I can see that it has this kind of structure.
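
The same nesting can also be printed as text. The following is a rough sketch using only the standard library's `html.parser`; the sample fragment is modelled on one station row as it appears in the output later in this article (simplified), and the names `OutlinePrinter` and `outline` are made up for this example.

```python
from html.parser import HTMLParser

# Simplified fragment modelled on one row of the station list.
SAMPLE_HTML = """
<ul class="table body">
  <li class="cell station-name">Minamiechizen Sankairi</li>
  <li class="cell location">39-2-2 Makidani, Minamiechizen-cho, Nanjo-gun</li>
  <li class="cell to-detail"><a href="contents/fukui/minamiechizensankairi.html">Detail</a></li>
</ul>
"""


class OutlinePrinter(HTMLParser):
    """Collect one indented line per opening tag, written as tag.class1.class2."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        label = tag + "." + ".".join(css_class.split()) if css_class else tag
        self.lines.append("  " * self.depth + label)
        self.depth += 1

    def handle_endtag(self, tag):
        # Note: unclosed void tags such as <br> would skew the depth;
        # the sample fragment contains none.
        self.depth = max(0, self.depth - 1)


def outline(html):
    parser = OutlinePrinter()
    parser.feed(html)
    parser.close()
    return parser.lines


for line in outline(SAMPLE_HTML):
    print(line)
```

Each station is a `ul` with classes `table` and `body`, and its name, address, and detail link sit in `li` elements below it. These are the tags and classes targeted in the scraping steps.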




Let's Actually Perform Scraping

The development environment is Windows 11 with Python 3 (JupyterLab). Details are documented in my repository.


https://github.com/kenkenkengo0421/websq-mitinoeki/tree/main


-Environment-

Run in Windows PowerShell

-------------

py -m venv .venv

.\.venv\Scripts\Activate.ps1

pip install -r requirements.txt  # install all required libraries

jupyter lab 

-------------

websq-mitinoeki/
|_____.venv/
|_____main.ipynb
|_____requirements.txt
|_____(mitinoeki.csv will be written here)



Import libraries

import pandas as pd 
import requests
from bs4 import BeautifulSoup


Specify the URL

url = "https://www.kkr.mlit.go.jp/road/michi_no_eki/ichiran.html#prefecture-4"


Extract HTML

r = requests.get(url)   # download the page
r.encoding = "utf-8"    # the page declares charset utf-8
print(r.text)

#<!DOCTYPE html>
#<html lang="ja">
#  <head>
#    <meta charset="utf-.......

soup = BeautifulSoup(r.text, "html.parser")  # parse the HTML into a searchable tree
print(soup)

#<!DOCTYPE html>
#<html lang="ja">
#<head>
#<meta charset="utf-8"/>
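
A plain `requests.get(url)` has no timeout unless one is given, and it does not raise an error when the server answers with an error status. The following is a hedged sketch of a slightly more defensive fetch. The function name `fetch_html` and its `get` parameter are hypothetical; `get` exists only so that the function can be exercised without network access, and by default it falls back to `requests.get`.

```python
def fetch_html(url, get=None, timeout=10.0):
    """Download url and return its text, failing loudly on HTTP errors."""
    if get is None:
        import requests  # imported lazily so the sketch can be tested offline
        get = requests.get

    response = get(url, timeout=timeout)  # give up if the server does not respond in time
    response.raise_for_status()           # raise an exception on 4xx / 5xx responses
    response.encoding = "utf-8"           # same encoding override as in this article
    return response.text


# Usage sketch (not executed here):
# html = fetch_html("https://www.kkr.mlit.go.jp/road/michi_no_eki/ichiran.html")
# soup = BeautifulSoup(html, "html.parser")
```

The 10-second timeout is an arbitrary choice for illustration.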


Specify Tags

table_body = soup.find_all("ul", class_="table body")
# Find all <ul class="table body"> tags within the parsed soup

print(table_body[0])   # find_all() returns a list-like ResultSet, so take the first element

#<ul class="table body">
#<li class="cell station-name">Minamiechizen Sankairi [ Minamiechizen Sankairi]
#</li>
#<li class="cell location">39-2-2 Makidani, Minamiechizen-cho, Nanjo-gun</li>
#<li class="cell line multi-line-text">Prefectural Road Nakagoya Takefu Line</li>
#<li class="cell to-detail"><a href="contents/fukui/minamiechizensankairi.html">Detail</a></li>
#</ul>
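
One detail worth knowing: passing a string that contains a space to `class_` matches the class attribute as an exact string, so a row written as `class="body table"` would be missed. A CSS selector via `select()` matches on the individual classes instead. The sketch below uses a made-up HTML fragment with placeholder station names "A" and "B" to show the difference; whether the real page ever varies the class order is unknown to me.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two data rows with the classes in different order, plus a header row.
SAMPLE_HTML = """
<ul class="table body"><li class="cell station-name">A</li></ul>
<ul class="body table"><li class="cell station-name">B</li></ul>
<ul class="table head"><li class="cell">Name</li></ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match on the class attribute: only class="table body".
exact = soup.find_all("ul", class_="table body")

# CSS selector: any <ul> that has both classes, in any order.
by_selector = soup.select("ul.table.body")

exact_names = [ul.find("li", class_="station-name").text for ul in exact]
selector_names = [ul.find("li", class_="station-name").text for ul in by_selector]

print(exact_names)
print(selector_names)
```

For the page in this article, the `find_all` call works as shown in the output above, so this only matters if the markup changes.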


Check various outputs

print(table_body[0].find("li", class_="station-name").text.strip())
#'Minamiechizen Sankairi [ Minamiechizen Sankairi]'

print(table_body[0].find("li", class_="location").text.strip())
#'39-2-2 Makidani, Minamiechizen-cho, Nanjo-gun'

print(table_body[0].find("a").get("href"))
#'contents/fukui/minamiechizensankairi.html'

print("https://www.kkr.mlit.go.jp/road/michi_no_eki/" + table_body[0].find("a").get("href"))
#'https://www.kkr.mlit.go.jp/road/michi_no_eki/contents/fukui/minamiechizensankairi.html'
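
As an alternative to joining the strings by hand, the standard library's `urllib.parse.urljoin` resolves a relative link against the URL of the page it was found on. A small sketch using the URL and `href` values from this article:

```python
from urllib.parse import urljoin

# The page URL used in this article (the #fragment is ignored when resolving links).
PAGE_URL = "https://www.kkr.mlit.go.jp/road/michi_no_eki/ichiran.html#prefecture-4"

# The relative href extracted from the first row.
href = "contents/fukui/minamiechizensankairi.html"

detail_url = urljoin(PAGE_URL, href)
print(detail_url)
```

This also copes with links that start with "/" or that are already absolute, which plain string concatenation would get wrong.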


Organize the data format


contents_data = []
head_url = "https://www.kkr.mlit.go.jp/road/michi_no_eki/"

# Build one dict per station row
for row in table_body:
    name = row.find("li", class_="station-name").text.strip()
    location = row.find("li", class_="location").text.strip()
    detail_url = head_url + row.find("a").get("href")  # relative link -> absolute URL

    data = {
        "StationName": name,
        "Address": location,
        "URL": detail_url,
    }

    contents_data.append(data)

print(contents_data)
#[{'StationName': 'Minamiechizen Sankairi [ Minamiechizen Sankairi]',
#  'Address': '39-2-2 Makidani, Minamiechizen-cho, Nanjo-gun',
#  'URL': 'https://www.kkr.mlit.go.jp/road/michi_no_eki/contents/fukui/minamiechizensankairi.html'},
# {'StationName': 'Echizen Ono Arashima no Sato [ Echizen Ono Arashima no Sato]',
#  'Address': '137-21 Warabiou, Ono-shi',
#  'URL': 'https://www.kkr.mlit.go.jp/road/michi_no_eki/contents/fukui/echizenonoarashimanosato.html'},
# {'StationName': 'Kyouryukeikoku Katsuyama[ Kyouryukeikoku Katsuyama ]',
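
Each StationName in the output above carries a second copy of the name in square brackets (presumably the reading of the name on the original page, which is an assumption on my part). If you would rather keep the two parts in separate columns, a regular expression can split them. This is a minimal sketch; the function name `split_station_name` is made up, and the pattern assumes half-width or full-width square brackets as in the output shown here.

```python
import re

# name, optional spaces, opening bracket, reading, closing bracket
NAME_PATTERN = re.compile(r"^(.*?)\s*[\[［]\s*(.*?)\s*[\]］]\s*$")


def split_station_name(raw):
    """Split 'Name [ Reading ]' into (name, reading); reading is '' if absent."""
    text = raw.strip()
    match = NAME_PATTERN.match(text)
    if match is None:
        return text, ""
    return match.group(1), match.group(2)


print(split_station_name("Minamiechizen Sankairi [ Minamiechizen Sankairi]"))
print(split_station_name("Kyouryukeikoku Katsuyama[ Kyouryukeikoku Katsuyama ]"))
```

A name without brackets is returned unchanged with an empty reading, so the function is safe to apply to every row.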


Create a pandas DataFrame

df = pd.DataFrame(contents_data)
print(df)



Output to CSV

df.to_csv("mitinoeki.csv", index=False)  # index=False leaves out the row-index column
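
One thing that may matter for Japanese text: a UTF-8 CSV without a byte order mark can show up garbled when opened directly in Excel on Windows. Writing with the `utf-8-sig` encoding adds that mark, and `to_csv` accepts an `encoding` argument for this. The sketch below shows the same idea with only the standard `csv` module, so that the round trip can be checked without pandas. The helper names and the sample row (taken from the output above) are for illustration only.

```python
import csv
import os
import tempfile

FIELDS = ["StationName", "Address", "URL"]

# One sample row copied from the output shown in this article.
ROWS = [
    {
        "StationName": "Minamiechizen Sankairi [ Minamiechizen Sankairi]",
        "Address": "39-2-2 Makidani, Minamiechizen-cho, Nanjo-gun",
        "URL": "https://www.kkr.mlit.go.jp/road/michi_no_eki/contents/fukui/minamiechizensankairi.html",
    },
]


def write_rows(path, rows):
    """Write rows as CSV with a UTF-8 byte order mark (utf-8-sig)."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read the CSV back into a list of dicts, skipping the byte order mark."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return [dict(row) for row in csv.DictReader(f)]


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "mitinoeki.csv")
    write_rows(csv_path, ROWS)
    read_back = read_rows(csv_path)

print(read_back == ROWS)
```

The address contains commas, and the `csv` module quotes such fields automatically, so the row survives the round trip intact.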
