
Aligning Koui Genji Monogatari with a Modern Japanese Translation in Digital Genji Monogatari

Published 2024/01/07

Python

XML

TEI

idea

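Overview

"Digital Genji Monogatari" (デジタル源氏物語) is a site that collects and creates a variety of data related to The Tale of Genji and links these data together. Its goal is to propose an environment that supports not only research on The Tale of Genji but also education and research activities that use classical texts.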

https://genji.dl.itc.u-tokyo.ac.jp/

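One of the features the site provides is the alignment of Koui Genji Monogatari (校異源氏物語) with a modern Japanese translation. Corresponding passages in Koui Genji Monogatari and in the Yosano Akiko translation published on Aozora Bunko are highlighted together.

This article describes the steps used to implement this feature.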

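Data

We create data like the following.

In the Koui Genji Monogatari text data, anchor tags record the alignment: each anchor points to a pair consisting of a Yosano Akiko translation file and an ID within that file.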

<text>
  <body>
    <p>
      <lb/>
      <pb facs="#zone_2055" n="2055"/>
      <lb/>
      <seg corresp="https://w3id.org/kouigenjimonogatari/api/items/2055-01.json">
        <anchor corresp="https://genji.dl.itc.u-tokyo.ac.jp/api/items/tei/yosano/56.xml#YG5600000300"/>
        やまにおはしてれいせさせ給やうに経仏なとくやうせさせ給
        <anchor corresp="https://genji.dl.itc.u-tokyo.ac.jp/api/items/tei/yosano/56.xml#YG5600000400"/>
        又の日はよかはに
      </seg>
      <lb/>
      ...
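To make the structure concrete, here is a minimal sketch of how the anchors in such a seg element could be read back with Python's standard library. It assumes a simplified fragment without the TEI namespace (the published files may declare one, in which case the tag names would need a namespace prefix), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the sample above (no TEI namespace; this is an assumption).
sample = """<seg corresp="https://w3id.org/kouigenjimonogatari/api/items/2055-01.json">
  <anchor corresp="https://genji.dl.itc.u-tokyo.ac.jp/api/items/tei/yosano/56.xml#YG5600000300"/>
  やまにおはしてれいせさせ給やうに経仏なとくやうせさせ給
  <anchor corresp="https://genji.dl.itc.u-tokyo.ac.jp/api/items/tei/yosano/56.xml#YG5600000400"/>
  又の日はよかはに
</seg>"""

seg = ET.fromstring(sample)

pairs = []
for anchor in seg.iter("anchor"):
    # Split "…/56.xml#YG5600000300" into the file URL and the ID after "#".
    file_url, _, target_id = anchor.get("corresp").rpartition("#")
    # The text that follows an anchor is stored in its .tail attribute.
    text = (anchor.tail or "").strip()
    pairs.append((file_url.rsplit("/", 1)[-1], target_id, text))

for file_name, target_id, text in pairs:
    print(file_name, target_id, text)
```

Each tuple holds the translation file name, the ID inside that file, and the Koui Genji Monogatari text that starts at the anchor.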

To create this data, I developed and used a dedicated tool.

Unfortunately, the tool does not work as of 2024-01-07, but the video below shows how it operates. I plan to repair the tool in the future.

https://youtu.be/hOp_PxYUrZk

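The work above produces Google Docs like the following.

For each line of Koui Genji Monogatari, the ID of the corresponding passage in the Yosano Akiko translation is inserted in the form \[YG(\d+)\].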

2055-01 [YG5600000300]やまにおはしてれいせさせ給やうに経仏なとくやうせさせ給[YG5600000400]又の日はよかはに
2055-02 おはしたれはそうつおとろきかしこまりきこえ給[YG5600000500]としころ御いのりなとつけか
2055-03 たらひたまひけれとことにいとしたしきことはなかりけるをこのたひ一品の宮
2055-04 の御心ちのほとにさふらひ給へるにすくれたまへるけん物し給けりとみたまひ
2055-05 てよりこよなうたうとひたまひていますこしふかきちきりくはへ給てけれはお
2055-06 も〱しくおはするとのゝかくわさとおはしましたることゝもてさはきゝこえ
2055-07 給[YG5600000600]御物かたりなとこまやかにしておはすれは御ゆつけなとまいり給[YG5600000700]すこし人
2055-08 〱しつまりぬるに[YG5600000800]をのゝわたりにしり給へるやとりや侍とゝひ給へはしか侍
...

One Google Doc per volume of The Tale of Genji is saved to Google Drive.
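The line format shown above can be split into a line ID and an ordered sequence of text runs and markers. The following is a minimal sketch, not the code used for the site. It assumes that the line ID and the body are separated by a single space, as in the sample, and `parse_line` is a hypothetical helper name.

```python
import re

# Marker format described in the article: [YG<digits>]
MARKER = re.compile(r"\[(YG\d+)\]")


def parse_line(line: str):
    """Split '2055-01 [YG...]text[YG...]text' into (line_id, parts).

    parts is a list of ("anchor", id) and ("text", string) tuples
    in document order.
    """
    # Assumption: the first space separates the line ID from the body.
    line_id, _, body = line.partition(" ")
    parts = []
    pos = 0
    for match in MARKER.finditer(body):
        if match.start() > pos:
            parts.append(("text", body[pos:match.start()]))
        parts.append(("anchor", match.group(1)))
        pos = match.end()
    if pos < len(body):
        parts.append(("text", body[pos:]))
    return line_id, parts


line_id, parts = parse_line("2055-02 おはしたれは[YG5600000500]としころ")
print(line_id, parts)
```

Keeping the parts in document order means that a later step can emit text and anchors in exactly the sequence in which they appear in the line.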

Processing

Getting the list of file names and IDs from Google Drive

Connecting to Google Drive

#| export
import os.path

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

#| export
class GoogleDriveClient:

    def __init__(self, credential_path):

        # If modifying these scopes, delete the file token.json.
        SCOPES = [
            "https://www.googleapis.com/auth/drive.metadata.readonly"
        ]

        creds = None
        # The file token.json stores the user's access and refresh tokens, and is
        # created automatically when the authorization flow completes for the first
        # time.
        if os.path.exists("token.json"):
            creds = Credentials.from_authorized_user_file("token.json", SCOPES)
        # If there are no (valid) credentials available, let the user log in.
        if not creds or not creds.valid:
            if creds and creds.expired and creds.refresh_token:
                creds.refresh(Request())
            else:
                flow = InstalledAppFlow.from_client_secrets_file(
                    credential_path, SCOPES
                )
                creds = flow.run_local_server(port=0)
            # Save the credentials for the next run
            with open("token.json", "w") as token:
                token.write(creds.to_json())

        try:
            service = build('drive', 'v3', credentials=creds)
            self.drive_service = service

        except HttpError as e:
            print(e)
            print("Error while creating API client")
            raise e

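Retrieving the list

import json

# Assumption: path to the OAuth client secrets file; adjust to your environment.
credential_path = "credentials.json"
client = GoogleDriveClient(credential_path)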

service = client.drive_service

# Call the Drive v3 API
results = service.files().list(
    q="'1QgS4z_5vk8AEz95iA3q7j41-U3oDdfpx' in parents",
    pageSize=100, fields="nextPageToken, files(id, name)").execute()
items = results.get('files', [])

config = {}

if not items:
    print('No files found.')
else:
    for item in items:
        config[item['name']] = item['id']

with open("data/config.json", "w") as f:
    json.dump(config, f, indent=4)
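The call above requests `nextPageToken` but reads only the first page, which is enough when the folder holds fewer files than `pageSize`. If a folder could hold more, the listing can be repeated with `pageToken` until no token is returned. The sketch below shows that loop. `list_all_files` is a hypothetical helper name, and `service` is assumed to be a Drive v3 service object such as `client.drive_service`.

```python
def list_all_files(service, folder_id, page_size=100):
    """Return {file name: file ID} for every file directly under folder_id.

    Follows nextPageToken so that folders with more than page_size
    entries are listed completely.
    """
    files = {}
    page_token = None
    while True:
        response = service.files().list(
            q=f"'{folder_id}' in parents",
            pageSize=page_size,
            fields="nextPageToken, files(id, name)",
            pageToken=page_token,
        ).execute()
        for item in response.get("files", []):
            files[item["name"]] = item["id"]
        # An absent token means the last page has been reached.
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return files
```

For example, `config = list_all_files(client.drive_service, "<folder ID>")` would build the same dictionary as the loop above, where the folder ID is the one used in the `q` parameter.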

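Processing each Google Doc

Based on the file names (volume numbers) and IDs obtained above, each Google Doc is processed.

This step produces the Koui Genji Monogatari XML data with the anchor tags introduced at the beginning of this article inserted.

The source Koui Genji Monogatari XML data that is edited in this step is published at the following URL.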

https://kouigenjimonogatari.github.io/
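The article does not include the conversion code itself, so the following is only a sketch of how one annotated line might be turned into a seg element with anchor children. The two URL prefixes are copied from the sample data. `build_seg` is a hypothetical helper, and the translation file name (for example `56.xml`) is passed in explicitly rather than derived from the ID.

```python
import re
import xml.etree.ElementTree as ET

# URL prefixes as they appear in the sample data.
SEG_BASE = "https://w3id.org/kouigenjimonogatari/api/items/"
YOSANO_BASE = "https://genji.dl.itc.u-tokyo.ac.jp/api/items/tei/yosano/"


def build_seg(line_id: str, body: str, yosano_file: str) -> ET.Element:
    """Build <seg> with <anchor/> children from a body containing [YG...] markers."""
    seg = ET.Element("seg", {"corresp": f"{SEG_BASE}{line_id}.json"})
    # With a capturing group, re.split returns: text, id, text, id, ..., text
    tokens = re.split(r"\[(YG\d+)\]", body)
    seg.text = tokens[0] or None
    for yg_id, following_text in zip(tokens[1::2], tokens[2::2]):
        anchor = ET.SubElement(
            seg, "anchor", {"corresp": f"{YOSANO_BASE}{yosano_file}#{yg_id}"}
        )
        # Text after an anchor belongs to the anchor's tail in ElementTree.
        anchor.tail = following_text or None
    return seg


seg = build_seg(
    "2055-01",
    "[YG5600000300]やまにおはして[YG5600000400]又の日はよかはに",
    "56.xml",
)
print(ET.tostring(seg, encoding="unicode"))
```

In a full pipeline, the resulting element would replace the matching seg, identified by its `corresp` attribute, in the source XML.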

(Reference) Formatting the XML data

To format the XML data, I wrote a function like the following.

from xml.dom import minidom

# Note: this appears to be a method excerpted from a class, hence the unused `self`.
def pretty(self, xml_string: str) -> str:
    """
    Pretty prints the XML string.
    :param xml_string: XML string to pretty print.
    :return: Pretty printed XML string.
    """

    # Build a DOM tree from the string.
    dom = minidom.parseString(xml_string)

    # Get the pretty-printed XML.
    pretty_xml_as_string = dom.toprettyxml()

    # Remove blank lines.
    pretty_xml_as_string = '\n'.join([line for line in pretty_xml_as_string.split('\n') if line.strip()])

    return pretty_xml_as_string
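The blank-line filter matters because `toprettyxml()` also writes out the whitespace-only text nodes of input that is already indented, each on its own line. The short sketch below illustrates this with a small made-up fragment. It uses only the standard library and is not taken from the site's code.

```python
from xml.dom import minidom

# A small, already-indented fragment (made up for illustration).
source = '<p>\n  <lb/>\n  <pb n="2055"/>\n</p>'

raw = minidom.parseString(source).toprettyxml()

# Lines that contain only whitespace come from the original indentation.
blank_lines = [line for line in raw.split("\n") if not line.strip()]

# The same filter as in pretty(): keep only lines with visible content.
cleaned = "\n".join(line for line in raw.split("\n") if line.strip())

print(f"{len(blank_lines)} whitespace-only lines removed")
print(cleaned)
```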

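By contrast, Beautiful Soup's prettify() method, used as shown below, seemed to include unwanted line breaks in its output.

from bs4 import BeautifulSoup

# Create a BeautifulSoup object.
soup = BeautifulSoup(html_content, "html.parser")

# Get the pretty-printed HTML and print it.
pretty_html = soup.prettify()
print(pretty_html)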

Summary

These are my notes on the steps needed to align Koui Genji Monogatari with the modern Japanese translation in Digital Genji Monogatari.
