Open. Comment added on 2023/10/16. 6 comments.

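Analysis notes on jp-azureopenai-samples/5.internal-document-search

Bicep basics

azure.yaml goes on to read infra\main.bicep.

The Search Index is created only after the infrastructure has been provisioned, by the script specified in the postprovision hook.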

hooks:
    postprovision:
      windows:
        shell: pwsh
        run: ./scripts/prepdocs.ps1
        interactive: true
        continueOnError: false

Akira Otaka

The Search Index is created here, although all the script really does is call prepdocs.py.

prepdocs.ps1

Start-Process `
  -FilePath $venvPythonPath `
  -ArgumentList "./scripts/prepdocs.py $cwd/data/* `
    --storageaccount $env:AZURE_STORAGE_ACCOUNT `
    --container $env:AZURE_STORAGE_CONTAINER `
    --searchservice $env:AZURE_SEARCH_SERVICE `
    --index $env:AZURE_SEARCH_INDEX `
    --formrecognizerservice $env:AZURE_FORMRECOGNIZER_SERVICE `
    --tenantid $env:AZURE_TENANT_ID -v" `
  -Wait `
  -NoNewWindow

create_search_index

Defines the following fields:
- id
- content (semantic field)
- category
- sourcepage
- sourcefile

upload_blobs

Only PDFs are split into single pages before being uploaded.
Are they stored one page at a time as well?

get_document_text

def get_document_text(filename):
    offset = 0
    page_map = []
    if args.localpdfparser:
        reader = PdfReader(filename)
        pages = reader.pages
        for page_num, p in enumerate(pages):
            page_text = p.extract_text()
            page_map.append((page_num, offset, page_text))
            offset += len(page_text)

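When localpdfparser is set

enumerate gives the page number and extract_text() gives the text on that page.
A tuple of (page number, character offset from the start of the document, page content) is appended to the list.

Otherwise the text seems to come from a remote service, presumably Form Recognizer given the --formrecognizerservice argument in prepdocs.ps1 (rather than Cognitive Search)?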

    return page_map
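
To make the shape of page_map concrete, here is a minimal sketch that builds the same (page number, offset, text) tuples from a plain list of strings. The helper name build_page_map and the sample pages are made up for illustration; the real code gets its page text from PdfReader.

```python
def build_page_map(page_texts):
    """Return (page_num, offset, page_text) tuples, like the localpdfparser branch."""
    page_map = []
    offset = 0  # number of characters seen before the current page
    for page_num, page_text in enumerate(page_texts):
        page_map.append((page_num, offset, page_text))
        offset += len(page_text)
    return page_map


# Hypothetical page contents standing in for PDF pages.
pages = ["Hello.", "こんにちは。", "Bye"]
page_map = build_page_map(pages)
for entry in page_map:
    print(entry)
# (0, 0, 'Hello.')
# (1, 6, 'こんにちは。')
# (2, 12, 'Bye')
```

The offset counts characters, not bytes, so Japanese text advances it by one per character.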

split_text

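Treats roughly 1000 characters as one section and works out which page each section belongs to. Sections are not simply cut at a delimiter: the next section starts 100 characters back, so consecutive sections overlap. Perhaps this guards against runs of consecutive delimiters, and perhaps having more of the same (or similar) text improves search accuracy?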

def split_text(page_map):
    SENTENCE_ENDINGS = [".", "!", "?", "。", "！", "？"]
    WORDS_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n", "、", "，", "；", "：", "（", "）", "【",
                    "】", "「", "」", "『", "』", "《", "》"]

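Section delimiters.
I tried adding Japanese delimiters as well.
One caveat: MAX_SECTION_LENGTH takes precedence. The search for a delimiter only begins once a section has reached that length, so text is not split at every punctuation mark.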

    if args.verbose: print(f"Splitting '{filename}' into sections")

    def find_page(offset):
        l = len(page_map)
        for i in range(l - 1):
            if offset >= page_map[i][1] and offset < page_map[i + 1][1]:
                return i
        return l - 1

    all_text = "".join(p[2] for p in page_map)

page_map holds one tuple per page with the content in the third element (index 2), so this joins the content of all pages into one string.
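
The relationship between the joined text and find_page can be checked with a small standalone sketch. This version takes page_map as an explicit argument instead of closing over it (a change made here only so that it runs on its own), and the three-page page_map is invented sample data.

```python
def find_page(page_map, offset):
    """Return the index of the page that contains the given character offset."""
    last = len(page_map) - 1
    for i in range(last):
        # Page i covers offsets from its own start up to the start of page i + 1.
        if page_map[i][1] <= offset < page_map[i + 1][1]:
            return i
    return last  # anything at or past the last page's start belongs to the last page


# Hypothetical page_map: (page_num, offset, text)
page_map = [(0, 0, "abc"), (1, 3, "defgh"), (2, 8, "ij")]
all_text = "".join(p[2] for p in page_map)

print(all_text)                # abcdefghij
print(find_page(page_map, 2))  # 0
print(find_page(page_map, 3))  # 1
print(find_page(page_map, 9))  # 2
```

Because a section is assigned to the page of its start offset, a section that runs across a page boundary is still reported under the page where it begins.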

    length = len(all_text)
    start = 0
    end = length

    while start + SECTION_OVERLAP < length:

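start begins at 0; from the second iteration on it is set to end - SECTION_OVERLAP, for example 901 when the first section ends at 1001 and SECTION_OVERLAP is 100. In other words, each new section steps back SECTION_OVERLAP characters from the previous end, and because start still advances overall, the loop continues as long as start + SECTION_OVERLAP has not reached the end of the content.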

        last_word = -1
        end = start + MAX_SECTION_LENGTH

        if end > length:
            end = length
        else:
            # Try to find the end of the sentence

            while end < length and (end - start - MAX_SECTION_LENGTH) < SENTENCE_SEARCH_LIMIT and all_text[
                end] not in SENTENCE_ENDINGS:

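Conditions for the loop to continue

- end has not reached the end of the content
- the search has gone less than SENTENCE_SEARCH_LIMIT (100 characters) past MAX_SECTION_LENGTH
- the character at end is not a sentence ending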

                if all_text[end] in WORDS_BREAKS:
                    last_word = end
                end += 1
            if end < length and all_text[end] not in SENTENCE_ENDINGS and last_word > 0:
                end = last_word  # Fall back to at least keeping a whole word
        if end < length:
            end += 1
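
The end search above can be tried in isolation. The following is a sketch that pulls the same logic into a standalone function; the name find_section_end, the shortened WORDS_BREAKS list, and the tiny limits are assumptions chosen so the behaviour is visible on a short string.

```python
SENTENCE_ENDINGS = [".", "!", "?", "。", "！", "？"]
WORDS_BREAKS = [",", " ", "、"]  # shortened list, for the demo only


def find_section_end(text, start, max_len, search_limit):
    """Pick the end index for a section that begins at `start`."""
    length = len(text)
    end = start + max_len
    if end > length:
        return length
    last_word = -1
    # Scan forward, at most `search_limit` characters, looking for a sentence ending.
    while (end < length and (end - start - max_len) < search_limit
           and text[end] not in SENTENCE_ENDINGS):
        if text[end] in WORDS_BREAKS:
            last_word = end
        end += 1
    # No sentence ending found: fall back to the last word break that was seen.
    if end < length and text[end] not in SENTENCE_ENDINGS and last_word > 0:
        end = last_word
    if end < length:
        end += 1  # keep the delimiter inside the section
    return end


text = "今日は晴れです。明日は雨、でも出かけます。"
print(text[0:find_section_end(text, 0, 5, 10)])  # 今日は晴れです。
print(text[0:find_section_end(text, 0, 5, 1)])   # 今日は晴れです
print(text[8:find_section_end(text, 8, 3, 3)])   # 明日は雨、
```

The second call shows the length limit winning: with a search limit of 1 the sentence ending is out of reach and no word break was seen, so the section is cut mid-sentence. The third call shows the fallback to a word break.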

        # Try to find the start of the sentence or at least a whole word boundary
        last_word = -1
        while start > 0 and start > end - MAX_SECTION_LENGTH - 2 * SENTENCE_SEARCH_LIMIT and all_text[
            start] not in SENTENCE_ENDINGS:
            if all_text[start] in WORDS_BREAKS:
                last_word = start
            start -= 1
        if all_text[start] not in SENTENCE_ENDINGS and last_word > 0:
            start = last_word
        if start > 0:
            start += 1

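How the start position is found

- On the first iteration it is always 0.
- From the second iteration on, it is the character right after (+1) a SENTENCE_ENDINGS character.
- Loop: while the section is still shorter than MAX_SECTION_LENGTH plus twice SENTENCE_SEARCH_LIMIT (a margin?), start moves backward one character at a time until it hits a SENTENCE_ENDINGS character, remembering the most recent WORDS_BREAKS position on the way. If no sentence ending turns up, start falls back to that word break.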
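
The backward search for the start position can be sketched the same way. The function name find_section_start, the shortened WORDS_BREAKS list, and the small limits are illustrative assumptions; the loop itself follows the code above.

```python
SENTENCE_ENDINGS = [".", "!", "?", "。", "！", "？"]
WORDS_BREAKS = [",", " ", "、"]  # shortened list, for the demo only


def find_section_start(text, start, end, max_len, search_limit):
    """Move `start` backward to a sentence start, or at least a word boundary."""
    last_word = -1
    # Walk backward while the section stays under max_len + 2 * search_limit.
    while (start > 0 and start > end - max_len - 2 * search_limit
           and text[start] not in SENTENCE_ENDINGS):
        if text[start] in WORDS_BREAKS:
            last_word = start
        start -= 1
    # Stopped without reaching a sentence ending: use the word break, if any.
    if text[start] not in SENTENCE_ENDINGS and last_word > 0:
        start = last_word
    if start > 0:
        start += 1  # begin just after the delimiter
    return start


text = "今日は晴れです。明日は雨、でも出かけます。"
end = len(text)
print(text[find_section_start(text, 10, end, 10, 5):end])  # 明日は雨、でも出かけます。
print(text[find_section_start(text, 14, end, 5, 3):end])   # でも出かけます。
```

In the first call the walk reaches the "。" at index 7, so the section starts at the next sentence. In the second call the margin runs out first, and the section starts after the "、" instead.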

        section_text = all_text[start:end]
        yield (section_text, find_page(start))

        last_table_start = section_text.rfind("<table")
        if (last_table_start > 2 * SENTENCE_SEARCH_LIMIT and last_table_start > section_text.rfind("</table")):
            # If the section ends with an unclosed table, we need to start the next section with the table.
            # If table starts inside SENTENCE_SEARCH_LIMIT, we ignore it, as that will cause an infinite loop for tables longer than MAX_SECTION_LENGTH
            # If last table starts inside SECTION_OVERLAP, keep overlapping
            if args.verbose: print(
                f"Section ends with unclosed table, starting next section with the table at page {find_page(start)} offset {start} table start {last_table_start}")
            start = min(end - SECTION_OVERLAP, start + last_table_start)

        else:
            start = end - SECTION_OVERLAP

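The next start position is SECTION_OVERLAP characters back from the previous end (100 characters by default). As a result, roughly 100 characters can be registered twice, once in each of two consecutive sections.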
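
To see how start and end advance, here is a stripped-down sketch of the outer loop with the sentence and word boundary adjustments and the table handling left out. The function name section_ranges is made up, and the constants are assumed to be 1000 and 100 to match the figures in these notes.

```python
MAX_SECTION_LENGTH = 1000  # assumed value, matching the "about 1000 characters" above
SECTION_OVERLAP = 100      # assumed value, matching the 100-character overlap above


def section_ranges(length, max_len=MAX_SECTION_LENGTH, overlap=SECTION_OVERLAP):
    """Yield (start, end) pairs for a text of `length` characters."""
    start = 0
    while start + overlap < length:
        end = min(start + max_len, length)
        yield (start, end)
        start = end - overlap  # step back so consecutive sections overlap


print(list(section_ranges(2500)))
# [(0, 1000), (900, 1900), (1800, 2500)]
```

In the real split_text the ends are pushed forward to a sentence or word boundary and then incremented by one, so the actual offsets differ slightly from these round numbers (for example 901 rather than 900).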

    if start + SECTION_OVERLAP < end:
        yield (all_text[start:end], find_page(start))