Closed on 2025/03/12

Extracting Japanese data from Common Crawl

Using Common Crawl's `content_languages` column

What is the `content_languages` column?

The language of a document is identified by Compact Language Detector 2 (CLD2). It is able to identify 160 different languages and up to 3 languages per document. The table lists the percentage covered by the primary language of a document (returned first by CLD2). So far, only HTML pages are passed to the language detector.

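This column records the CLD2 detection result. CLD2 can return more than one language for a single text. When a document contains several languages, for example English (eng) and Polish (pol), the codes are joined with a comma, as in `eng,pol`.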
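As a minimal sketch of how such values could be handled on the client side, the Python below splits a `content_languages` value on commas and checks for Japanese. The function names (`parse_content_languages`, `is_japanese_only`, `contains_japanese`) are hypothetical helpers made up for illustration and are not part of any Common Crawl tooling.

```python
def parse_content_languages(value):
    """Split a content_languages value such as 'eng,pol' into ['eng', 'pol'].

    Empty or missing values (the column may be NULL) yield an empty list.
    """
    if not value:
        return []
    return [code.strip() for code in value.split(",") if code.strip()]


def is_japanese_only(value):
    """True when 'jpn' is the only detected language (same as = 'jpn' in SQL)."""
    return parse_content_languages(value) == ["jpn"]


def contains_japanese(value):
    """True when 'jpn' appears anywhere among the detected languages."""
    return "jpn" in parse_content_languages(value)


if __name__ == "__main__":
    for sample in ["jpn", "eng,pol", "jpn,eng", None]:
        print(sample, is_japanese_only(sample), contains_japanese(sample))
```

`is_japanese_only` mirrors an exact match on `'jpn'`, while `contains_japanese` would also accept mixed-language pages such as `jpn,eng`.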

Collecting with Athena

Run the following query in Athena.

SELECT
    warc_filename,
    warc_record_offset,
    warc_record_length
FROM ccindex
WHERE
    crawl = 'CC-MAIN-2023-50'
    AND subset = 'warc'
    AND content_languages = 'jpn' -- only records detected as Japanese alone
;

Time in queue: 107 ms
Run time: 1 min 32.493 sec
Data scanned: 31.99 GB
Number of records: 74,044,705
CSV size: 153.2 KB
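The three columns returned by the query are enough to fetch a single record without downloading the whole WARC file. The following is a rough sketch, assuming the public HTTPS endpoint `https://data.commoncrawl.org/` and that each record is stored as an independent gzip member, so a byte-range request for `warc_record_offset` and `warc_record_length` returns data that decompresses on its own. The function names are hypothetical.

```python
import gzip
import urllib.request

# Assumption: Common Crawl data is served over HTTPS under this prefix.
BASE_URL = "https://data.commoncrawl.org/"


def byte_range_header(offset, length):
    """Build an HTTP Range header value covering exactly one WARC record."""
    offset, length = int(offset), int(length)  # CSV values arrive as strings
    return f"bytes={offset}-{offset + length - 1}"


def decode_record(compressed):
    """Decompress one gzip member into the raw WARC record bytes."""
    return gzip.decompress(compressed)


def fetch_record(warc_filename, offset, length):
    """Fetch and decompress a single record (performs a network request)."""
    request = urllib.request.Request(
        BASE_URL + warc_filename,
        headers={"Range": byte_range_header(offset, length)},
    )
    with urllib.request.urlopen(request) as response:
        return decode_record(response.read())
```

Calling `fetch_record` with the values from one CSV row should return the WARC headers, the HTTP headers, and the page body for that record. With about 74 million rows, a real run would need batching and rate limiting, which this sketch leaves out.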

Reference: when the `url` column is also retrieved

SELECT
    url,
    warc_filename,
    warc_record_offset,
    warc_record_length
FROM ccindex
WHERE
    crawl = 'CC-MAIN-2023-50'
    AND subset = 'warc'
    AND content_languages = 'jpn'
;

Query results

Time in queue: 122 ms
Run time: 1 min 50.652 sec
Data scanned: 72.20 GB
Number of records: 74,044,705
CSV size: 13.9 GB

Tips

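How to process the Common Crawl index with Athena
- https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format
- Columns have been added since that article was written, so for CREATE EXTERNAL TABLE it is better to use the query at https://github.com/commoncrawl/cc-index-table/blob/main/src/sql/athena/cc-index-create-table-flat.sql .
The `url` column is large, so selecting it uses a lot of resources (in the queries above, data scanned rose from 31.99 GB to 72.20 GB).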
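To put the last tip in numbers, here is a small arithmetic sketch that uses only the "data scanned" figures reported for the two queries. The function names are hypothetical.

```python
def extra_scanned_gb(with_url_gb, without_url_gb):
    """Additional data scanned when the url column is selected."""
    return with_url_gb - without_url_gb


def scan_ratio(with_url_gb, without_url_gb):
    """How many times more data is scanned with the url column."""
    return with_url_gb / without_url_gb


if __name__ == "__main__":
    without_url = 31.99  # GB scanned by the query without url
    with_url = 72.20     # GB scanned by the query with url
    print(f"extra: {extra_scanned_gb(with_url, without_url):.2f} GB")
    print(f"ratio: {scan_ratio(with_url, without_url):.2f}x")
```

Since Athena bills by the amount of data scanned, leaving out `url` when it is not needed should reduce the cost by roughly the same ratio.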

vzvu3k6k

Swallow corpus

It is built by processing Common Crawl, and extracting Japanese text is the first step of its pipeline.
