Notes on The RefinedWeb Dataset for Falcon LLM

How the dataset used to train Falcon LLM was built.

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
https://arxiv.org/abs/2306.01116

The full dataset is on the order of 5T (5 trillion) tokens, of which 600G tokens (roughly 500 GB) have been publicly released.

syoyo

The approach is similar to CCNet, but it seems to go further in these areas:

Fuzzy deduplication (cc_net only does exact comparison on hashes of the text)
URL-based filtering (removing adult content, etc.)

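Models trained on RefinedWeb outperform models trained on The Pile (another dataset), which suggests RefinedWeb is the higher-quality dataset.

All of the data comes from the web (mostly Common Crawl), and the dataset is public.
As of 2023/06, it may be the state of the art among publicly available datasets.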


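Why dedup matters

Repeated text in the dataset does more damage as the network gets larger.

At 1B parameters, on the order of 100 duplicates is harmful
At 175B parameters, even a few duplicates tend to give bad results (could have a disproportionate effect)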

Scaling laws and interpretability of learning from repeated data


3.1 Document preparation

Reading the data

Common Crawl data is distributed as WARC files (raw HTML responses) or WET files (plain text only).
RefinedWeb starts processing from the WARC files (using the warcio library).

URL filtering

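The first step is filtering by URL.
URLs related to fraud, adult content, violence, gambling and the like are filtered out.

A blocklist
A score based on how many words from a pre-defined list the URL contains

See Appendix G.1 of the paper for details.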
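
A minimal sketch of the two URL rules (blocklist plus word score) could look like the following. The domain names, word weights, and threshold are all made up for illustration; the actual lists are described in Appendix G.1 of the paper and are not reproduced here.

```python
from urllib.parse import urlsplit

# Hypothetical data: placeholders, not the lists used for RefinedWeb.
BLOCKED_DOMAINS = {"bad-casino.example", "adult-site.example"}
WORD_WEIGHTS = {"casino": 1.0, "porn": 1.0, "bet": 0.5}
SCORE_THRESHOLD = 1.0  # assumed cut-off


def url_score(url: str) -> float:
    """Sum the weights of the flagged words that appear in the URL."""
    lowered = url.lower()
    return sum(weight for word, weight in WORD_WEIGHTS.items() if word in lowered)


def is_blocked(url: str) -> bool:
    """True if the host is blocklisted or the word score reaches the threshold."""
    host = (urlsplit(url).hostname or "").lower()
    # Match the blocklisted domain itself or any of its subdomains.
    in_blocklist = any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)
    return in_blocklist or url_score(url) >= SCORE_THRESHOLD
```

Plain substring matching is the simplest choice but it over-triggers (for example "bet" inside "alphabet"), so a real word list would need word-boundary handling or careful weighting.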

Text extraction

Text is extracted with trafilatura https://trafilatura.readthedocs.io/en/latest/ and then formatted further with regexes (at most two consecutive newlines; URLs are removed).
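
The regex post-processing can be sketched as below. It is applied to text that has already been extracted; the two patterns are my own guesses at reasonable expressions, not the ones used in the paper.

```python
import re

# Assumed patterns for illustration only.
URL_RE = re.compile(r"https?://\S+")
MANY_NEWLINES_RE = re.compile(r"\n{3,}")


def tidy_extracted_text(text: str) -> str:
    """Remove URLs and limit runs of newlines to two."""
    text = URL_RE.sub("", text)
    text = MANY_NEWLINES_RE.sub("\n\n", text)
    return text.strip()
```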

Language identification

As in ccnet, the language is identified with fastText, using a classifier trained on the Wikipedia dataset.

RefinedWeb keeps only English.

After language identification the dataset is down to roughly half
(see Figure 2).


3.2 Filtering: document-wise and line-wise

Repetition removal

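Many documents contain repeated content (caused by crawler errors or low-quality web sources), so this is removed.
(It could be handled at the dedup stage, but it is easier to deal with per document, further upstream.)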

Document-wise filtering

Scaling Language Models: Methods, Analysis & Insights from Training Gopher
https://arxiv.org/abs/2112.11446

Repetition is filtered following the approach of the Gopher paper above (=> which heuristics exactly?)
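
One of the repetition heuristics in the Gopher paper is the fraction of duplicated lines in a document. A minimal sketch of that single check follows; the 0.3 threshold is an assumption for illustration, and the other Gopher checks (duplicate paragraphs, repeated n-grams) are not shown.

```python
from collections import Counter


def duplicate_line_fraction(document: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in document.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # Every occurrence after the first counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(lines)


def is_too_repetitive(document: str, max_fraction: float = 0.3) -> bool:
    """Assumed threshold; tune it against real data."""
    return duplicate_line_fraction(document) > max_fraction
```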

Line-wise corrections.

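Many interleaved lines still remain (presumably things like social-media "3 likes" counters or navigation text mixed into the body?), so they are fixed with a line-correction filter.
The details, quoted from the paper: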

Following manual inspection of the data, we devised a line-wise filtering strategy. We analyse documents line-by-line, and
discard or edit the lines based on the following rules:
• If it is mainly composed of uppercase characters (discard);
• If it is only composed of numerical characters (discard);
• If it is a counter (e.g. 3 likes) (discard);
• If it only contains one word (discard);
• If it is short (≤ 10 words) and matches a pattern (edit):
– At the beginning of the line (e.g. sign-in);
– At the end of the line (e.g. Read more...);
– Anywhere in the line (e.g. items in cart).
Finally, if the words in the flagged lines represent more than 5% of the total document words, the document is discarded.
We derived these filters through manual inspection of the data, and note that they require adaptation across languages.

At this point the data is down to roughly 1/4 of the original.
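
The line-wise rules quoted above can be sketched roughly as follows. The regex patterns, the uppercase ratio, and the function names are all hypothetical; the paper does not list its exact patterns, so this only illustrates the discard / edit / 5% logic.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only.
COUNTER_RE = re.compile(r"^\d+\s+(likes?|comments?|shares?|views?)$", re.IGNORECASE)
EDIT_PATTERNS = [
    re.compile(r"^sign[- ]in\b\s*", re.IGNORECASE),     # beginning of the line
    re.compile(r"\s*read more\.*$", re.IGNORECASE),     # end of the line
    re.compile(r"\s*items in cart\s*", re.IGNORECASE),  # anywhere in the line
]


def mostly_uppercase(line: str, ratio: float = 0.5) -> bool:
    letters = [c for c in line if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > ratio


def filter_lines(document: str, max_flagged: float = 0.05) -> Optional[str]:
    """Return the cleaned document, or None if too many words were flagged."""
    kept, flagged_words, total_words = [], 0, 0
    for line in document.splitlines():
        stripped = line.strip()
        words = stripped.split()
        if not words:
            kept.append(line)  # keep blank lines as they are
            continue
        total_words += len(words)
        discard = (
            mostly_uppercase(stripped)
            or stripped.replace(" ", "").isdigit()
            or COUNTER_RE.match(stripped) is not None
            or len(words) == 1
        )
        if discard:
            flagged_words += len(words)
            continue
        if len(words) <= 10:
            edited = stripped
            for pattern in EDIT_PATTERNS:
                edited = pattern.sub(" ", edited).strip()
            if edited != stripped:
                # The line was edited, so it counts as flagged.
                flagged_words += len(words)
                stripped = edited
        if stripped:
            kept.append(stripped)
    if total_words and flagged_words / total_words > max_flagged:
        return None
    return "\n".join(kept)
```

As the paper notes, rules like these need to be adapted per language; the uppercase and word-count checks in particular do not carry over to Japanese as they are.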


3.3. Deduplication: fuzzy, exact, and across dumps

Even after filtering, duplicates and repetition remain across documents, so these are removed.
Dedup is computationally expensive.

Fuzzy deduplication

Compute a MinHash signature for each document and use it to remove near-duplicate documents.
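
To make the idea concrete, here is a small MinHash sketch using only the standard library. It splits on whitespace instead of using a real tokenizer, uses 64 hash functions, and compares signatures directly; those are simplifications of mine, and the bucketing (LSH) step a large-scale pipeline needs is left out.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set:
    """Set of word n-grams; whitespace split is a stand-in for a tokenizer."""
    tokens = text.split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(shingle: tuple, seed: int) -> int:
    # Seeding the input gives a family of different hash functions.
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text: str, num_hashes: int = 64, n: int = 5) -> list:
    """For each hash function keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    if not grams:
        return []
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Share of matching positions, an estimate of the Jaccard similarity."""
    if not sig_a or len(sig_a) != len(sig_b):
        return 0.0
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates, and one of them dropped.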

Exact deduplication

Matching is done at the token level with a suffix array; any span of 50 or more consecutive matching tokens is removed.
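
The suffix-array idea can be illustrated on a toy scale as below. This is a naive construction (sorting whole suffixes) that only works for small inputs and only reports the repeated spans; it is not the scalable implementation used for the real dataset, and the function name is my own.

```python
def find_repeated_spans(tokens: list, min_len: int = 50) -> list:
    """Return sorted (start_a, start_b, length) for repeated token runs."""
    n = len(tokens)
    # Naive suffix array: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(n), key=lambda i: tokens[i:])
    matches = []
    # Suffixes sharing a long prefix end up next to each other after sorting.
    for a, b in zip(suffix_array, suffix_array[1:]):
        length = 0
        while a + length < n and b + length < n and tokens[a + length] == tokens[b + length]:
            length += 1
        if length >= min_len:
            matches.append((min(a, b), max(a, b), length))
    return sorted(matches)


tokens = "a b c x a b c y".split()
print(find_repeated_spans(tokens, min_len=3))  # [(0, 4, 3)]
```

In a real corpus the documents would be concatenated into one token stream with separators between them, and the later copy of each reported span would then be cut or dropped.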

URL deduplication

T.B.W.


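Normalization for dedup

Remove punctuation
Lowercase
NFD Unicode normalization (for Japanese, NFKC is probably the better choice)
Remove accents
Normalize whitespace (presumably collapsing runs of whitespace into one)
(Replacing digits with 0, as cc_net does, does not seem to be done?)

Tokenize with the GPT-2 tokenizer and turn the result into n-grams (5-grams in RefinedWeb).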
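
The normalization steps and the n-gram construction can be sketched like this. Only ASCII punctuation is stripped, and whitespace splitting stands in for the GPT-2 tokenizer so that the example stays dependency-free; both are simplifications of mine.

```python
import re
import string
import unicodedata

# ASCII punctuation only; a fuller version would use Unicode categories.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """NFD, strip accents, lowercase, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFD", text)
    # After NFD, accents are separate combining marks and can be dropped.
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def ngrams(tokens: list, n: int = 5) -> list:
    """Consecutive n-grams over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(normalize("Café,  au   LAIT!"))  # cafe au lait
```

For Japanese text, swapping "NFD" for "NFKC" and skipping the accent removal would be the natural adaptation, since NFD decomposes dakuten into combining marks that the accent-stripping step would then delete.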

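For other details see Lee et al. 2022 https://arxiv.org/abs/2107.06499.
Lee et al. 2022 guard against false positives by computing things like the Jaccard coefficient, but RefinedWeb skips this because the data is too large.
(=> for Japanese data, if the dataset is small, computing the Jaccard coefficient might be worthwhile)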
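
For reference, an exact Jaccard coefficient over two n-gram sets is cheap to compute for a single pair of documents; the cost comes from doing it for every candidate pair. A minimal sketch (treating two empty sets as identical is my own convention):

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```

On a small dataset this could be run on the candidate pairs that MinHash flags, to drop false positives before removing anything.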

G.3.2. EXACT SUBSTRING DEDUPLICATION

The code from Lee et al. 2022 is available here

This step is done by building on that code (details not written up here).