
Gacha-Style Summaries and Translations of the Cloudflare Blog

Published 2024/09/16

Let's play with Workers AI, using The Cloudflare Blog as material.

Todo

I'll build a blog gacha (a capsule-toy-style random draw).
It fetches a random post, summarizes it (AI), and translates it (AI).

Blog

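The first post was published on April 28, 2009. In the 15 years since (5476 days at the time of writing), roughly 3015 posts have been published (if my count is right).

That is a pace of more than one post every two days.
Reading one post a day would take a little over 8 years.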
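
As a quick sanity check on the two pace figures, here is a small sketch that uses only the numbers quoted above (5476 days, 3015 posts). The variable names are mine, and a year is assumed to be 365 days.

```typescript
// Figures quoted in this post; nothing here is fetched from the blog itself.
const daysSinceFirstPost = 5476;
const postCount = 3015;

// Average gap between posts, in days.
const daysPerPost = daysSinceFirstPost / postCount;

// Years needed to read everything at one post per day (365-day years assumed).
const yearsAtOnePerDay = postCount / 365;

console.log(daysPerPost.toFixed(2), yearsAtOnePerDay.toFixed(2)); // prints: 1.82 8.26
```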

As a preliminary survey, I measured the text size of every post; the total came to 31 megabytes.

The shortest body is 6 words, 40 bytes (a hashtag).
The longest is 12K words, 71K bytes (a video transcript).

Distribution of word counts and post sizes
# Roughly 200 to 3000 words, 1200 bytes to 20K bytes

# Word count of the body text per post
> x <- scan("blog-wordcount.txt")
Read 3015 items
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    6.0   608.5  1066.0  1275.5  1664.0 12692.0
> quantile(x, c(0.05,0.5,0.95), type = 7)
    5%    50%    95%
 191.7 1066.0 3108.3

# Size in bytes of the body text per post
> x <- scan("blog-size.txt")
Read 3015 items
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     40    3999    7014    8483   11068   71203
> quantile(x, c(0.05,0.5,0.95), type = 7)
     5%     50%     95%
 1263.4  7014.0 21240.4
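
The `type = 7` argument in the R session selects linear interpolation between neighboring order statistics, which is R's default quantile definition. For readers who do not use R, the following is a minimal TypeScript sketch of that definition. The function name `quantileType7` and the sample data are made up for illustration and are not taken from the blog data.

```typescript
// Quantile with linear interpolation between order statistics,
// the definition R calls "type 7".
function quantileType7(values: number[], p: number): number {
  if (values.length === 0) throw new Error("values must not be empty");
  if (p < 0 || p > 1) throw new Error("p must be within [0, 1]");
  const sorted = [...values].sort((a, b) => a - b);
  const h = (sorted.length - 1) * p; // fractional 0-based index
  const lo = Math.floor(h);
  const hi = Math.min(lo + 1, sorted.length - 1);
  return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]);
}

// Made-up sample, not the real per-post data.
const sample = [10, 20, 30, 40, 50];
console.log(quantileType7(sample, 0.5), quantileType7(sample, 0.25)); // prints: 30 20
```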

AI model

Workers AI offers one model each for summarization and translation, so I'll try those.

Summarization

summarization | bart-large-cnn

Translation

translation | m2m100-1.2b

Flow

The communication flow looks like this.

Results

The post was summarized and translated as shown below.

The log from this run makes the flow easy to follow.

console.log()

The body is split (TEXT), each piece is summarized (SUMMARY), and the combined summary is split again and translated (TRANS).

  (log) TEXT: We recently released a new version of Cloudflare Resolver which adds a piece of information called “Extended DNS Errors” (EDE) along with the response code under certain circumstances. This will be helpful in tracing DNS resolution errors and figuring out what went wrong behind the scenes. (image from: https://www.pxfuel.com/en/free-photo-expka)A tight-lipped agentThe DNS protocol was designed to map domain names to IP addresses. To inform the client about the result of the lookup, the protocol has a 4 bit field, called response code/RCODE. The logic to serve a response might look something like this: function lookup(domain) { ... switch result { case "No error condition": return NOERROR with client expected answer case "No record for the request type": return NOERROR case "The request domain does not exist": return NXDOMAIN case "Refuse to perform the specified operation for policy reasons": return REFUSE default("Server failure: unable to process this query due to a problem with the name server"): return SERVFAIL } } try { lookup(domain) } catch { return SERVFAIL } Although the context hasn't changed much, protocol extensions such as DNSSEC have been added, which makes the RCODE run out of space to express the server's internal status. To keep backward compatibility, DNS servers have to squeeze various statuses into existing ones. This behavior could confuse the client, especially with the "catch-all" SERVFAIL: something went wrong but what exactly?Most often, end users don't talk to authoritative name servers directly, but use a stub and/or a recursive resolver as an agent to acquire the information it needs. 
When a user receives SERVFAIL, the failure can be one of the following:The stub resolver fails to send the request.The stub resolver doesn’t get a response.The recursive resolver, which the stub resolver sends its query to, is overloaded.The recursive resolver is unable to communicate with upstream authoritative servers.The recursive resolver fails to verify the DNSSEC chain.The authoritative server takes too long to respond....In such cases, it is nearly impossible for the user to know exactly what’s wrong. The resolver is usually the one to be blamed, because, as an agent, it fails to get back the answer, and doesn’t return a clear reason for the failure in the response.Keep backward compatibilityIt seems we need to return more information, but (there's always a but) we also need to keep the behavior
  (log) TEXT: of existing clients unchanged.One way is to extend the RCODE space, which came out with the Extension mechanisms for DNS or EDNS. It defines a 8 bit EXTENDED-RCODE, as high-order bits to current 4 bit RCODE. Together they make up a 12 bit integer. This changes the processing of RCODE, requires both client and server to fully support the logic unfortunately.Another approach is to provide out-of-band data without touching the current RCODE. This is how Extended DNS Errors is defined. It introduces a new option to EDNS, containing an INFO-CODE to describe error details with an EXTRA-TEXT as an optional supplement. The option can be repeated as many times as needed, so it's possible for the client to get a full error chain with detailed messages. The INFO-CODE is just something like RCODE, but is 16 bits wide, while the EXTRA-TEXT is an utf-8 encoded string. For example, let’s say a client sends a request to a resolver, and the requested domain has two name servers. The client may receive a SERVFAIL response with an OPT record (see below) which contains two extended errors, one from one of the authoritative servers that shows it's not ready to serve, and the other from the resolver, showing it cannot connect to the other name server. ;; OPT PSEUDOSECTION: ; ... ; EDE: 14 (Not Ready) ; EDE: 23 (Network Error): (cannot reach upstream 192.0.2.1) ; ... Google has something similar in their DoH JSON API, which provides diagnostic information in the "Comment" field.Let's dig into itOur 1.1.1.1 service has an initial support of the draft version of Extended DNS Errors, while we are still trying to find the best practice. As we mentioned above, this is not a breaking change, and existing clients will not be affected. The additional options can be safely ignored without any problem, since the RCODE stays the same.If you have a newer version of dig, you can simply check it out with a known problematic domain. 
As you can see, due to DNSSEC verification failing, the RCODE is still SERVFAIL, but the extended error shows the failure is "DNSSEC Bogus". $ dig @1.1.1.1 dnssec-failed.org ; <<>> DiG 9.16.4-Debian <<>> @1.1.1.1 dnssec-failed.org ; (1 server found) ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1111 ;;
  (log) TEXT: flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 1232 ; EDE: 6 (DNSSEC Bogus) ;; QUESTION SECTION: ;dnssec-failed.org. IN A ;; Query time: 111 msec ;; SERVER: 1.1.1.1#53(1.1.1.1) ;; WHEN: Wed Sep 01 00:00:00 PDT 2020 ;; MSG SIZE rcvd: 52 Note that Extended DNS Error relies on EDNS. So to be able to get one, the client needs to support EDNS, and needs to enable it in the request. At the time of writing this blog post, we see about 17% of queries that 1.1.1.1 received had EDNS enabled within a short time range. We hope this information will help you uncover the root cause of a SERVFAIL in the future.
  (log) SUMMARY:  Extended DNS Error relies on EDNS. To be able to get one, the client needs to support EDNS, and needs to enable it in the request. At the time of writing this blog post, we see about 17% of queries that 1.1.1 received had EDNS enabled within a short time range.
  (log) SUMMARY: Extended DNS Errors is a way to provide out-of-band data without touching the current RCODE. It introduces a new option to EDNS, containing an INFO-CODE to describe error details. The option can be repeated as many times as needed, so it's possible for the client to get a full error chain with detailed messages.
  (log) SUMMARY: Cloudflare Resolver adds information called "Extended DNS Errors" (EDE) EEDE is a piece of information that can be used to trace DNS resolution errors.
  (log) TRANS: Cloudflare Resolver adds information called "Extended DNS Errors" (EDE) EEDE is a piece of information that can be used to trace DNS resolution errors. Extended DNS Errors is a way to provide out-of-band data without touching the current RCODE. It introduces a new option to EDNS, containing an INFO-CODE to describe error details. The option can be repeated as many times as needed, so it's possible for the client to get a full error chain with detailed messages. Extended DNS Error relies on EDNS. To be able to get one, the client needs to support EDNS,
  (log) TRANS: Cloudflare Resolver は「Extended DNS Errors(EDE)」と呼ばれる情報を追加します。EEDE は DNS 解析エラーを追跡するために使用できる情報の一部です。Extended DNS Errors は、現在の RCODE に触れずにバンド外のデータを提供する方法です。それは EDNS に新しいオプションを導入し、エラーの詳細を説明する INFO-CODE を含みます。オプションは必要に応じて何度も繰り返すことができますので、クライアントは詳細なメッセージを含む完全なエラーチェーンを得ることができます。Extended DNS Error は EDNS に依存します。それを取得するために、クライアントは EDNS をサポートする必要があります。
  (log) TRANS: and needs to enable it in the request. At the time of writing this blog post, we see about 17% of queries that 1.1.1 received had EDNS enabled within a short time range.
  (log) TRANS: このブログ記事を書く時点で、私たちは 1.1.1 が受信したクエリの約 17% が短期間で EDNS を有効にしました。
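
One detail in this log: the SUMMARY lines appear in the reverse order of the TEXT lines, yet the combined text in the first TRANS line follows the article order. That matches how `Promise.all` behaves, since its result array follows the input order regardless of which call finishes first. The sketch below illustrates this with a stand-in summarizer; `fakeSummarize` and its timer delays are hypothetical and do not call Workers AI.

```typescript
// Records the order in which the fake calls actually finish.
const completionOrder: number[] = [];

// Hypothetical stand-in for a model call: later chunks finish sooner, on purpose.
function fakeSummarize(chunk: string, index: number, total: number): Promise<string> {
  const delayMs = (total - index) * 10;
  return new Promise((resolve) => {
    setTimeout(() => {
      completionOrder.push(index);
      resolve(`summary of ${chunk}`);
    }, delayMs);
  });
}

async function summarizeAll(chunks: string[]): Promise<string[]> {
  // Promise.all resolves to results in the order of the input array,
  // not in the order the promises settled.
  return Promise.all(chunks.map((chunk, i) => fakeSummarize(chunk, i, chunks.length)));
}

const demo = summarizeAll(["part1", "part2", "part3"]).then((summaries) => {
  console.log(completionOrder); // order in which the calls finished
  console.log(summaries.join(" ")); // still in input order
  return summaries;
});
```

So out-of-order log lines are harmless here, as long as the pieces are joined from the array that `Promise.all` returns.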

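Here is one more. As an aside,
you can see that older posts spell the name "CloudFlare", unlike today.
It works like a time machine, which is part of the fun.
Drawing a post that hits your personal 🎯 would be a nice surprise.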

Code

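Text was passed to the AI in chunks of
384 words for summarization and
96 words for translation.
Finding the right trade-off between chunk length and the amount of information is hard.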
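
The splitting helper itself is not shown in this post, so the following is only a guess at what a word-count splitter of this kind might look like. The name `splitByWordCount` is hypothetical, and splitting on whitespace is an assumption based on the log output.

```typescript
// Hypothetical word-count splitter: splits on whitespace and groups
// at most maxWords words per chunk. The post's real helper may differ.
function splitByWordCount(text: string, maxWords: number): string[] {
  if (!Number.isInteger(maxWords) || maxWords <= 0) {
    throw new Error("maxWords must be a positive integer");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

const sampleChunks = splitByWordCount("one two three four five", 2);
console.log(sampleChunks); // prints: [ 'one two', 'three four', 'five' ]
```

A splitter like this ignores sentence boundaries, which would explain why a TRANS chunk in the log ends mid-sentence at "the client needs to support EDNS,".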

Code excerpt
    // Highest page number at this moment
    const maxPage = await getMaxPageNumber();
    // Pick one page number at random
    const randomPage = Math.floor(Math.random() * maxPage) + 1;
    // List of posts on that page
    const posts = await scrapePage(randomPage);
    // Pick one post from the list at random
    const randomPost = posts[Math.floor(Math.random() * posts.length)];
    // Fetch the post and extract the fields we need
    const { content, date } = await fetchBlogContent(randomPost.url);
    // Maximum chunk size (in words) sent to the summarization model per call
    const maxLength = 384;
    // Split the article into chunks of at most maxLength words
    const chunks = splitTextIntoChunks(content, maxLength);
    // Send each chunk to the summarization model and collect the text responses
    const summaries = await Promise.all(chunks.map(async (chunk) => {
      const message = {
        input_text: chunk,
        max_length: maxLength
      };
      console.log("TEXT: " + chunk);
      const response = await c.env.AI.run('@cf/facebook/bart-large-cnn', message);
      console.log("SUMMARY: " + response.summary);
      return response.summary;
    }));
    // Join the partial summaries into one text
    const combinedSummary = summaries.join(' ');
    // Maximum chunk size (in words) sent to the translation model per call
    const maxTransLength = 96;
    // Split the combined summary into chunks of at most maxTransLength words
    const summaryChunks = splitTextIntoChunks(combinedSummary, maxTransLength);
    // Send each chunk to the translation model, one at a time, and collect the text responses
    const translatedChunks: string[] = [];
    for (const tchunk of summaryChunks) {
      const translatedChunk = await c.env.AI.run('@cf/meta/m2m100-1.2b', {
        text: tchunk,
        source_lang: "english",
        target_lang: "japanese"
      });
      // Both the source chunk and its translation are logged under the TRANS label
      console.log("TRANS: " + tchunk);
      console.log("TRANS: " + translatedChunk.translated_text);
      translatedChunks.push(translatedChunk.translated_text || '');
    }
    // Join the translated pieces into one text
    const sumJp = translatedChunks.join(' ');
    // Return the needed fields as JSON
    return c.json({
      title: randomPost.title,
      date: date,
      url: randomPost.url,
      summaryJ: sumJp,
      summary: combinedSummary,
      content: content
    });

Metric

The metrics looked like this.

CPU Time

p95 | 611.8 ms

Wall Time

p95 | 24,727.6 ms

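Unsurprisingly, the way this is built, Wall Time grows with the size of the post.
The big bump at the end of the graph (a 30-second wait) comes from this. That is a long wait.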
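
To get a feel for why large posts take longer, it helps to count the model calls. The sketch below uses the chunk sizes from the Code section and assumes the lengths in the Blog section are word counts. The function names are hypothetical. Note that in the code excerpt the summarization calls run concurrently through `Promise.all`, while the translation calls run one after another in a `for` loop, so the translation step is a plausible contributor to Wall Time, though I have not measured that separately.

```typescript
// Chunk sizes from the Code section; the function names are hypothetical.
const SUMMARY_CHUNK_WORDS = 384;
const TRANSLATION_CHUNK_WORDS = 96;

// Number of summarization calls for an article of the given word count.
function summaryCalls(articleWords: number): number {
  return Math.ceil(articleWords / SUMMARY_CHUNK_WORDS);
}

// Number of translation calls for a combined summary of the given word count.
function translationCalls(summaryWords: number): number {
  return Math.ceil(summaryWords / TRANSLATION_CHUNK_WORDS);
}

// Median and maximum article lengths from the Blog section.
console.log(summaryCalls(1066), summaryCalls(12692)); // prints: 3 34
```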

Execution Duration

p95 | 2.1 GB-sec

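Thoughts

If the source material is good, the summary tends to be good, and the translation is likely to be good as well.
So the summary is the key...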

Next

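  • Simply splitting on words and handing the pieces to BART seems to work, but would it be better to use a tokenizer?
  • There seem to be various approaches to summarization (extractive, abstractive, and so on).
  • This time the article is split, summarized piece by piece, and the pieces are joined. Next, it would be great to find a model that looks at the whole article and weights the important parts before summarizing.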

https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
https://ai.meta.com/research/publications/bart-denoising-sequence-to-sequence-pre-training-for-natural-language-generation-translation-and-comprehension/
