Closed on 2024/09/04

Notes: Livedoor News Corpus

The GraphRAG implementation in LlamaIndex (v2) uses a dataset of news articles.
https://docs.llamaindex.ai/en/stable/examples/cookbooks/GraphRAG_v2/
To try this in Japanese, the Livedoor News Corpus looks like the closest match, so I'll take a quick look at its contents.
https://www.rondhuit.com/download.html#ldcc

livedoor News Corpus
Overview: This corpus was created by collecting news articles from "livedoor News", operated by NHN Japan Corporation, that fall under the Creative Commons license described below, and removing HTML tags as far as possible.
トピックニュース (Topic News)   http://news.livedoor.com/category/vender/news/
Sports Watch   http://news.livedoor.com/category/vender/208/
ITライフハック (IT Life Hack)   http://news.livedoor.com/category/vender/223/
家電チャンネル (Kaden Channel)   http://news.livedoor.com/category/vender/kadench/
MOVIE ENTER   http://news.livedoor.com/category/vender/movie_enter/
独女通信 (Dokujo Tsushin)   http://news.livedoor.com/category/vender/90/
エスマックス (S-MAX)   http://news.livedoor.com/category/vender/smax/
livedoor HOMME   http://news.livedoor.com/category/vender/homme/
Peachy   http://news.livedoor.com/category/vender/ldgirls/
Collection period: early September 2012

License: Each article file is covered by the Creative Commons "Attribution-NoDerivs" license. Credit requirements differ by news category, so see the LICENSE.txt in each subdirectory of the extracted download. livedoor is a registered trademark of NHN Japan Corporation.

What follows is basically an almost verbatim copy of the article below, so this is really just a note to myself.
https://zenn.dev/robes/articles/c2c65d9aef7562

kun432

!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz

import os
import pandas as pd
import tarfile

tar_file_path = "/content/ldcc-20140209.tar.gz"
extract_folder = "/content/ldcc_data/"

with tarfile.open(tar_file_path, "r:gz") as tar:
    tar.extractall(path=extract_folder)
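
Newer Python releases warn when `extractall` is called without an extraction filter. The following is a minimal sketch of a more defensive extraction, not what the original article does. `extract_archive` is a hypothetical helper name, and `filter="data"` is only passed when the running Python supports extraction filters.

```python
import tarfile


def extract_archive(tar_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir (hypothetical helper)."""
    with tarfile.open(tar_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Extraction filters are available: the "data" filter rejects
            # absolute paths and members that would land outside dest_dir.
            tar.extractall(path=dest_dir, filter="data")
        else:
            # Older Python without extraction filters: plain extraction.
            tar.extractall(path=dest_dir)
```

In the Colab session above this could be called as `extract_archive(tar_file_path, extract_folder)`.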

Directory structure

ldcc_data
└── text
    ├── CHANGES.txt
    ├── dokujo-tsushin/
    │   ├── dokujo-tsushin-4778030.txt
    │   ├── dokujo-tsushin-4778031.txt
    │   ├── dokujo-tsushin-4782522.txt
    │   ├── dokujo-tsushin-4788357.txt
    │   (snip)
    ├── it-life-hack/
    │   ├── (snip)
    ├── kaden-channel/
    │   ├── (snip)
    ├── livedoor-homme/
    │   ├── (snip)
    ├── movie-enter/
    │   ├── (snip)
    ├── peachy/
    │   ├── (snip)
    ├── README.txt
    ├── smax/
    │   ├── (snip)
    ├── sports-watch/
    │   ├── (snip)
    └── topic-news/
        ├── (snip)
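
As a quick sanity check on the layout above, the number of article files per category directory can be counted. This is a rough sketch that assumes every category directory holds article `.txt` files plus one `LICENSE.txt`, as the tree suggests; `count_articles` is a hypothetical helper name.

```python
import os


def count_articles(text_root):
    """Return {category_dir: number_of_article_files} (hypothetical helper)."""
    counts = {}
    for name in sorted(os.listdir(text_root)):
        category_dir = os.path.join(text_root, name)
        if not os.path.isdir(category_dir):
            # Skip top-level files such as CHANGES.txt and README.txt.
            continue
        counts[name] = sum(
            1
            for f in os.listdir(category_dir)
            if f.endswith(".txt") and f != "LICENSE.txt"
        )
    return counts
```

In the Colab session above this could be called as `count_articles("/content/ldcc_data/text")`.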

I wanted to keep the category information and to keep working with pandas DataFrames, so I modified the code slightly.

def read_articles_from_directory(directory_path):
    """Read every article file in one category directory into a list of dicts."""
    # LICENSE.txt sits next to the articles but is not an article itself.
    files = [f for f in os.listdir(directory_path) if f != "LICENSE.txt"]

    articles = []
    for file in files:
        with open(os.path.join(directory_path, file), "r", encoding="utf-8") as f:
            lines = f.readlines()
        # First line: URL, second: date, third: title, the rest: body.
        articles.append({
            "url": lines[0].strip(),
            "date": lines[1].strip(),
            "title": lines[2].strip(),
            "body": "".join(lines[3:]).strip(),
        })

    return articles

def read_category_from_directory(directory_path):
    """Use the second-to-last line of LICENSE.txt as the category label."""
    with open(os.path.join(directory_path, "LICENSE.txt"), "r", encoding="utf-8") as f:
        lines = f.readlines()

    return lines[-2].strip()

text_dir = os.path.join(extract_folder, "text")
# Everything under text/ except the two top-level text files is a category directory.
directories = [d for d in os.listdir(text_dir) if d not in ["CHANGES.txt", "README.txt"]]

dfs = {}
csv_file_paths = {}
for directory in directories:
    directory_path = os.path.join(text_dir, directory)
    category = read_category_from_directory(directory_path)
    articles = read_articles_from_directory(directory_path)
    df = pd.DataFrame(articles)
    df["category"] = category
    dfs[directory] = df

    # Also save each category as its own CSV file.
    csv_path = f"/content/{directory}.csv"
    df.to_csv(csv_path, index=False)
    csv_file_paths[directory] = csv_path

# display() is provided by the notebook environment (Colab / Jupyter).
for df in dfs.values():
    display(df.head(5))
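
For later use it may be handier to have all categories in one DataFrame instead of one per directory. A minimal sketch, assuming `dfs` is the dict of per-directory DataFrames built above; `combine_categories` and the extra `directory` column are my own additions, not part of the original article.

```python
import pandas as pd


def combine_categories(dfs):
    """Concatenate per-directory DataFrames into one (hypothetical helper).

    Adds a "directory" column so each row remembers which directory it came from.
    """
    frames = [df.assign(directory=name) for name, df in dfs.items()]
    # ignore_index=True renumbers the rows from 0 across all categories.
    return pd.concat(frames, ignore_index=True)
```

With the data loaded above, `combine_categories(dfs)["category"].value_counts()` would then show the number of articles per category.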
