Detecting and masking sensitive strings with Microsoft Presidio

Presidio (Origin from Latin praesidium ‘protection, garrison’) helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more.

kun432

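I wanted to use NeMo-Guardrails to reject LLM input that contains sensitive information, so I skimmed its documentation.

The approach looks fairly crude. Can it really catch these inputs?
It also appears to use an LLM internally, but I want to reject the input before anything is sent to an LLM.
(I haven't read the documentation properly, so I may be wrong.)

So I looked through other parts of the documentation and found this: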

Presidio-based Sensitive Data Detection

NeMo Guardrails supports detecting sensitive data out-of-the-box using Presidio, which provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more. You can detect sensitive data on user input, bot output, or the relevant chunks retrieved from the knowledge base.

This caught my interest, so I'll look into Presidio first.

The following article introduces it in Japanese, which is hugely helpful.

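The documentation is here.

https://microsoft.github.io/presidio/

"How it works" is easy to follow.

Detection with regular expressions (phone numbers, etc.)
Named entities are detected with ML/NLP
Checksum validation to confirm the match (where applicable)
Extraction of keywords that serve as context for the match
Anonymization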
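To make the "regex, then checksum" steps concrete for myself, here is a minimal standard-library sketch. It is not Presidio's code: the pattern `CARD_PATTERN` and the helper names are my own assumptions, and the checksum is the Luhn algorithm commonly used for credit card numbers.

```python
import re

# Hypothetical loose pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str):
    """Regex finds candidates; the checksum discards the ones that cannot be valid."""
    return [
        (m.start(), m.end(), m.group())
        for m in CARD_PATTERN.finditer(text)
        if luhn_valid(m.group())
    ]


print(find_card_numbers("My credit card number is 1111-2222-3333-4444."))
```

The regex alone would accept any 16-digit string; the checksum step is what lets a recognizer report a high score with some justification.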

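Installation

Installation is done with pip. Docker images are also provided and apparently make Presidio usable through an API, so I'd like to try that when I feel like it.

This time I install with pip on Colaboratory. The NLP engine can be chosen from spaCy, Transformers, and Stanza; I'll go with the default, spaCy.

Install the Presidio packages.

# Analysis package, apparently shared by text and image processing
!pip install presidio_analyzer

# Package for anonymizing text
!pip install presidio_anonymizer

# Package for redacting images
!pip install presidio_image_redactor

Image redaction appears to use Tesseract to extract text from images, so the following also need to be installed.

!sudo apt install tesseract-ocr
!sudo apt install libtesseract-dev

Next, download the spaCy model. For now I'll use the English model, as in the documentation. This is the model.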

!python -m spacy download en_core_web_lg

QuickStart

I follow "Getting started with Presidio".

Analyze whether the text contains a phone number.

from presidio_analyzer import AnalyzerEngine

text="My phone number is 212-555-5555"

# Initialize the engine and load the NLP module
analyzer = AnalyzerEngine()

# Run the analyzer
results = analyzer.analyze(
    text=text,
    entities=["PHONE_NUMBER"],
    language='en'
)

print(results)

The position and score of the phone number string are printed.

WARNING:presidio-analyzer:configuration file /usr/local/lib/python3.10/dist-packages/conf/default.yaml not found.  Using default config: {'nlp_engine_name': 'spacy', 'models': [{'lang_code': 'en', 'model_name': 'en_core_web_lg'}]}.
WARNING:presidio-analyzer:configuration file is missing 'ner_model_configuration'. Using default
WARNING:presidio-analyzer:model_to_presidio_entity_mapping is missing from configuration, using default
WARNING:presidio-analyzer:low_score_entity_names is missing from configuration, using default
WARNING:presidio-analyzer:labels_to_ignore is missing from configuration, using default
[type: PHONE_NUMBER, start: 19, end: 31, score: 0.75]

Now anonymize it.

from presidio_anonymizer import AnonymizerEngine

# Initialize the anonymizer
anonymizer = AnonymizerEngine()

# Pass the text and the analyzer results to the anonymizer
anonymized_text = anonymizer.anonymize(text=text,analyzer_results=results)

print(anonymized_text)

text: My phone number is <PHONE_NUMBER>
items:
[
    {'start': 19, 'end': 33, 'entity_type': 'PHONE_NUMBER', 'text': '<PHONE_NUMBER>', 'operator': 'replace'}
]

Next, images. I don't have a good sample image... I only want to see how it behaves, so for now I use one that is provided for tests in the presidio GitHub repository.

from presidio_image_redactor import ImageRedactorEngine
from PIL import Image

image = Image.open("original_image.png")

redactor = ImageRedactorEngine()
redactor.redact(image=image)

A masked image like this is generated.

List of detectable entities

Text processing examples

Let's dig a little deeper into text processing. Japanese is covered at the very end.

1. Deny-list-based PII detection

With PatternRecognizer, you can pass an array of strings as a deny list and detect them.

from presidio_analyzer import PatternRecognizer

titles_list = [
    "Sir",
    "Ma'am",
    "Madam",
    "Mr.",
    "Mrs.",
    "Ms.",
    "Miss",
    "Dr.",
    "Professor",
]

# Define the PatternRecognizer
titles_recognizer = PatternRecognizer(supported_entity="TITLE", deny_list=titles_list)

text1 = "I suspect Professor Plum, in the Dining Room, with the candlestick"

# Analyze directly with the PatternRecognizer
result = titles_recognizer.analyze(text1, entities=["TITLE"])

print(f"Result:\n {result}")

Result:
 [type: TITLE, start: 10, end: 19, score: 1.0]

A PatternRecognizer you have defined can also be added to AnalyzerEngine.

from presidio_analyzer import PatternRecognizer
from presidio_analyzer import AnalyzerEngine

titles_list = [
    "Sir",
    "Ma'am",
    "Madam",
    "Mr.",
    "Mrs.",
    "Ms.",
    "Miss",
    "Dr.",
    "Professor",
]

# Define the PatternRecognizer
titles_recognizer = PatternRecognizer(supported_entity="TITLE", deny_list=titles_list)

# Initialize AnalyzerEngine
analyzer = AnalyzerEngine()

# Add the PatternRecognizer to AnalyzerEngine
analyzer.registry.add_recognizer(titles_recognizer)

text1 = "I suspect Professor Plum, in the Dining Room, with the candlestick."

# Analyze with AnalyzerEngine
result = analyzer.analyze(text=text1, language="en")

print(f"Result:\n {result}")

Result.

Result:
 [type: TITLE, start: 10, end: 19, score: 1.0, type: PERSON, start: 20, end: 24, score: 0.85]

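Here it also picks up a person name (PERSON), even though I didn't ask for it. No entities argument is passed to analyze(), so every registered recognizer runs.

Print the list of detected spans.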

print("Identified these PII entities:")
for r in result:
    print(f"- {text1[r.start:r.end]} as {r.entity_type}")

Identified these PII entities:
- Professor as TITLE
- Plum as PERSON

2. Regex-based PII detection

Combine Pattern and PatternRecognizer to detect with a regular expression pattern.

from presidio_analyzer import Pattern, PatternRecognizer

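# Define the regex pattern with a `Pattern` object (raw string so that \d is not treated as a string escape)
numbers_pattern = Pattern(name="numbers_pattern", regex=r"\d+", score=0.5)

# Define a PatternRecognizer from the regex pattern. Multiple patterns can be given.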
number_recognizer = PatternRecognizer(
    supported_entity="NUMBER", patterns=[numbers_pattern]
)

text = "I live in 510 Broad st."

numbers_result = number_recognizer.analyze(text=text, entities=["NUMBER"])

print("Result:", numbers_result)
print("Identified these PII entities:")
for r in numbers_result:
    print(f"- {text[r.start:r.end]} as {r.entity_type}")

Result: [type: NUMBER, start: 10, end: 13, score: 0.5]
Identified these PII entities:
- 510 as NUMBER

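3. Rule-based logic detection

The previous section detected digits; this time, let's detect textual representations of numbers such as "Number One". This is done with a custom class that uses spaCy's tokenization.

Create your own recognizer class that inherits from the abstract class EntityRecognizer
A class that inherits from EntityRecognizer must implement two methods, load and analyze
The recognizer must accept an NlpArtifacts object, which holds the preprocessed input text

The recognizer definition has the following structure.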

from typing import List
from presidio_analyzer import EntityRecognizer, RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts


class MyRecognizer(EntityRecognizer):
    def load(self) -> None:
        """No loading is needed."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Put the logic that detects a specific PII type here.
        """
        pass

I still don't fully get it, but let's move on.

Implement a NumbersRecognizer that detects both numeric and textual representations of numbers.

from typing import List
from presidio_analyzer import EntityRecognizer, RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts


class NumbersRecognizer(EntityRecognizer):

    expected_confidence_level = 0.7  # confidence level this recognizer reports

    def load(self) -> None:
        """No loading is needed."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Check whether the text contains tokens that represent numbers (123, One, Two, Three, etc.).
        """
        results = []

        # Check `token.like_num` on each token produced by spaCy
        for token in nlp_artifacts.tokens:
            if token.like_num:
                result = RecognizerResult(
                    entity_type="NUMBER",
                    start=token.idx,
                    end=token.idx + len(token),
                    score=self.expected_confidence_level,
                )
                results.append(result)
        return results


# Create a recognizer instance from NumbersRecognizer
new_numbers_recognizer = NumbersRecognizer(supported_entities=["NUMBER"])

It seems that a recognizer that requires NlpArtifacts has to be called as part of the AnalyzerEngine flow.

from presidio_analyzer import AnalyzerEngine

text = "Roberto lives in Five 10 Broad st."
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(new_numbers_recognizer)

numbers_result = analyzer.analyze(text=text, language="en")
print("Results:")
for res in numbers_result:
    print(f"- {str(res)}")
print("Identified these PII entities:")
for r in numbers_result:
    print(f"- {text[r.start:r.end]} as {r.entity_type}")

Result

Results:
- type: PERSON, start: 0, end: 7, score: 0.85
- type: NUMBER, start: 17, end: 21, score: 0.7
- type: NUMBER, start: 22, end: 24, score: 0.7
Identified these PII entities:
- Roberto as PERSON
- Five as NUMBER
- 10 as NUMBER

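5. Using different models and languages

The numbering jumps a bit here.

To use a different model or a different language, you need the following.

An NlpEngine containing an NLP model that performs NLP tasks such as tokenization, lemmatization (think of it as more advanced stemming), and named entity recognition
EntityRecognizer objects that perform the various kinds of PII detection

As its internal NLP engine, Presidio supports spaCy and Stanza. Download the models you need from these.

For example, download a Spanish model.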

!python -m spacy download es_core_news_md

From Python code, it can be downloaded as follows.

import spacy
spacy.cli.download("es_core_news_md")

Use spaCy with the English and Spanish models.

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider


# Set the NLP engine name and the mapping from language codes to models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "es", "model_name": "es_core_news_md"},
        {"lang_code": "en", "model_name": "en_core_web_lg"},
    ],
}

# Create the NLP engine from the mapping configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_spanish = provider.create_engine()

# Initialize AnalyzerEngine with the NLP engine and the supported languages
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_spanish, supported_languages=["en", "es"]
)

# Analyze in Spanish
text_spanish = "Mi nombre es Morris"
results_spanish = analyzer.analyze(text=text_spanish, language="es")
print("Results from Spanish request:")
print(results_spanish)
print("Identified these PII entities:")
for r in results_spanish:
    print(f"- {text_spanish[r.start:r.end]} as {r.entity_type}")

print()

# Analyze in English
text_english = "My name is Morris"
results_english = analyzer.analyze(text=text_english, language="en")
print("Results from English request:")
print(results_english)
print("Identified these PII entities:")
for r in results_english:
    print(f"- {text_english[r.start:r.end]} as {r.entity_type}")

Results from Spanish request:
[type: PERSON, start: 13, end: 19, score: 0.85]
Identified these PII entities:
- Morris as PERSON

Results from English request:
[type: PERSON, start: 11, end: 17, score: 0.85]
Identified these PII entities:
- Morris as PERSON

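For languages that spaCy/Stanza/huggingface do not support, or support only partially, other frameworks can also be used.

However, Presidio always expects to receive a spaCy model, so when using another framework it is apparently best to do the following.

Use a simple spaCy pipeline such as en_core_web_sm as the NLP engine's model
Use a recognizer that calls the external framework or service as the named entity detection model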

4. PII detection with external services and frameworks

I'll skip this one...

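6. Leveraging context words

Context words are apparently words that appear before or after the text to be detected as PII and are related to it.

For example, suppose we want to detect the zip code in the following text as PII.

My zip code is 90210.

Here the preceding "zip code" tells us that the number is a zip code, which raises the confidence of the detection.

You can also implement your own context detection. By default, LemmaContextAwareEnhancer compares the lemmatized tokens of the sentence against the context words.

Now let's detect a zip code. A (US) zip code is basically five digits (there is also an optional 5-digit plus 4-digit form, which I didn't know about), so a regular expression alone is prone to false positives. Let's see how the confidence differs between analysis with the regex alone and analysis that also uses context.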

from presidio_analyzer import (
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
    AnalyzerEngine,
)

# Create a PatternRecognizer that detects zip codes with a regex
regex = r"(\b\d{5}(?:\-\d{4})?\b)"
zipcode_pattern = Pattern(name="zip code (weak)", regex=regex, score=0.01)
zipcode_recognizer = PatternRecognizer(
    supported_entity="US_ZIP_CODE", patterns=[zipcode_pattern]
)

# Add it to the registry used by AnalyzerEngine
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)

# Analyze the text
text = "My zip code is 90210"
result = analyzer.analyze(text=text, language="en")
print(f"Result:\n {result}")
print("Identified these PII entities:")
for r in result:
    print(f"- {text[r.start:r.end]} as {r.entity_type}")
print()

Result. Detection works.

Result:
 [type: US_ZIP_CODE, start: 15, end: 20, score: 0.01]
Identified these PII entities:
- 90210 as US_ZIP_CODE

However, five digits alone is weak evidence in general, which is why the score is set to 0.01.

Now try a recognizer with context words configured.

from presidio_analyzer import (
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
    AnalyzerEngine,
)

# Create a PatternRecognizer with a regex pattern and context words
regex = r"(\b\d{5}(?:\-\d{4})?\b)"
zipcode_pattern = Pattern(name="zip code (weak)", regex=regex, score=0.01)
zipcode_recognizer_w_context = PatternRecognizer(
    supported_entity="US_ZIP_CODE",
    patterns=[zipcode_pattern],
    context=["zip", "zipcode"],     # context words
)

# Add it to the registry used by AnalyzerEngine
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer_w_context)
analyzer = AnalyzerEngine(registry=registry)

# Analyze the text
text = "My zip code is 90210"
result = analyzer.analyze(text=text, language="en")
print(f"Result:\n {result}")
print("Identified these PII entities:")
for r in result:
    print(f"- {text[r.start:r.end]} as {r.entity_type}")
print()

Result. The score has gone up.

Result:
 [type: US_ZIP_CODE, start: 15, end: 20, score: 0.4]
Identified these PII entities:
- 90210 as US_ZIP_CODE

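Unless specified otherwise when AnalyzerEngine is initialized, a LemmaContextAwareEnhancer is created internally, and when a context word matches, it apparently adds 0.35 to the score. The score after a match also has a minimum of 0.4, so here 0.01 + 0.35 = 0.36 is raised to 0.4.

This scoring can also be configured.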
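To check my understanding of the arithmetic, here is a tiny sketch of the scoring rule as described above: add a factor, apply a floor, cap at 1.0. The function name `enhance_score` and the `max_score` cap are my own assumptions; the parameter names mirror the LemmaContextAwareEnhancer arguments.

```python
def enhance_score(
    base_score: float,
    context_found: bool,
    context_similarity_factor: float = 0.35,
    min_score_with_context_similarity: float = 0.4,
    max_score: float = 1.0,  # assumed upper bound for a confidence score
) -> float:
    """Return the score after a (possible) context-word match."""
    if not context_found:
        # No supporting context word nearby: the score is left alone.
        return base_score
    boosted = base_score + context_similarity_factor
    # A context match guarantees at least the configured minimum.
    boosted = max(boosted, min_score_with_context_similarity)
    return min(boosted, max_score)


print(enhance_score(0.01, context_found=True))   # weak zip code pattern with context
print(enhance_score(0.01, context_found=False))  # same pattern without context
```

With the defaults, the weak zip code pattern lands on the floor of 0.4 rather than 0.36, which matches the result above.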

from presidio_analyzer import (
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
    AnalyzerEngine,
)
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer

# Create a PatternRecognizer with a regex pattern and context words
regex = r"(\b\d{5}(?:\-\d{4})?\b)"
zipcode_pattern = Pattern(name="zip code (weak)", regex=regex, score=0.01)
zipcode_recognizer_w_context = PatternRecognizer(
    supported_entity="US_ZIP_CODE",
    patterns=[zipcode_pattern],
    context=["zip", "zipcode"],     # context words
)

# Create a LemmaContextAwareEnhancer and set the score added on a context match and the minimum score
context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.45,
    min_score_with_context_similarity=0.4
)

# Add the recognizer to the registry used by AnalyzerEngine
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer_w_context)
analyzer = AnalyzerEngine(
    registry=registry,
    context_aware_enhancer=context_aware_enhancer,     # the LemmaContextAwareEnhancer with custom scoring
)

# Analyze the text
text = "My zip code is 90210"
result = analyzer.analyze(text=text, language="en")
print(f"Result:\n {result}")
print("Identified these PII entities:")
for r in result:
    print(f"- {text[r.start:r.end]} as {r.entity_type}")
print()

0.01 + 0.45 = 0.46 > 0.4, so the score becomes 0.46.

Result:
 [type: US_ZIP_CODE, start: 15, end: 20, score: 0.46]
Identified these PII entities:
- 90210 as US_ZIP_CODE

When the text to analyze and its context live in different places, as below, the context can also be passed explicitly.

record = {"column_name": "zip", "text": "My code is 90210"}
result = analyzer.analyze(text=record["text"], language="en", context=[record["column_name"]])
print(f"Result:\n {result}")
print("Identified these PII entities:")
for r in result:
    # Slice the analyzed text itself (record["text"]), not the earlier `text` variable
    print(f"- {record['text'][r.start:r.end]} as {r.entity_type}")
print()

Result:
 [type: US_ZIP_CODE, start: 11, end: 16, score: 0.46]
Identified these PII entities:
- 90210 as US_ZIP_CODE

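7. Tracing the detection process

You can check what process the analyzer followed to reach a detection. For example:

Which recognizer made the detection?
Which regex pattern was used?
What mechanism did the ML model use to interpret the text?
Which context words raised the score?
The confidence score before and after each step

For details on the decision process, see the following documentation.

The official example returned nothing for me, so I changed the sample text slightly. To output the decision process, pass return_decision_process=True to AnalyzerEngine's analyze().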

from presidio_analyzer import AnalyzerEngine
import pprint

analyzer = AnalyzerEngine()

text="My phone number is 212-555-5555"

results = analyzer.analyze(
    text=text, language="en", return_decision_process=True
)

# For the normal result output,
# print(results)
# returns [type: PHONE_NUMBER, start: 19, end: 31, score: 0.75].

decision_process = results[0].analysis_explanation

pp = pprint.PrettyPrinter()
print("Decision process output:\n")
pp.pprint(decision_process.__dict__)

It comes back like this.

Decision process output:

{'original_score': 0.4,
 'pattern': None,
 'pattern_name': None,
 'recognizer': 'ABCMeta',
 'regex_flags': None,
 'score': 0.75,
 'score_context_improvement': 0.35,
 'supportive_context_word': 'phone',
 'textual_explanation': 'Recognized as US region phone number, using '
                        'PhoneRecognizer',
 'validation_result': None}

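This step seems to be particularly memory-hungry: on the standard Colaboratory runtime it used up the memory and crashed. Presidio itself may already have been using a fair amount of memory.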

8. No-code pattern detection

It is "no-code" in the sense that regex and deny-list recognizers can be defined in YAML.

A sample YAML is published.

recognizers:
  -
    name: "Zip code Recognizer"
    supported_language: "de"
    patterns:
      -
         name: "zip code (weak)"
         regex: "(\\b\\d{5}(?:\\-\\d{4})?\\b)"
         score: 0.01
    context:
     - zip
     - code
    supported_entity: "ZIP"
  -
    name: "Titles recognizer"
    supported_language: "en"
    supported_entity: "TITLE"
    deny_list:
      - Mr.
      - Mrs.
      - Ms.
      - Miss
      - Dr.
      - Prof.

Load the YAML with the add_recognizers_from_yaml method of a RecognizerRegistry object.

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

yaml_file = "recognizers.yaml"

registry = RecognizerRegistry()
registry.add_recognizers_from_yaml(yaml_file)

analyzer = AnalyzerEngine(registry=registry)

text = "Mr. Plum wrote a book"
result = analyzer.analyze(text=text, language="en")

print("Result:")
for res in result:
    print(f"- {str(res)}")
print()
print("Identified these PII entities:")
for r in result:
    print(f"- {text[r.start:r.end]} as {r.entity_type}")
print()

Result:
- type: TITLE, start: 0, end: 3, score: 1.0

Identified these PII entities:
- Mr. as TITLE

You can also load the predefined recognizers first and then add the YAML-defined recognizers, as follows.

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

yaml_file = "recognizers.yaml"

registry = RecognizerRegistry()
registry.load_predefined_recognizers()  # Load all predefined recognizers (credit card, phone number, etc.)
registry.add_recognizers_from_yaml(yaml_file)

analyzer = AnalyzerEngine(registry=registry)

text = "Mr. Plum wrote a book"
result = analyzer.analyze(text=text, language="en")

print("Result:")
for res in result:
    print(f"- {str(res)}")
print()
print("Identified these PII entities:")
for r in result:
    print(f"- {text[r.start:r.end]} as {r.entity_type}")
print()

Result:
- type: TITLE, start: 0, end: 3, score: 1.0
- type: PERSON, start: 4, end: 8, score: 0.85

Identified these PII entities:
- Mr. as TITLE
- Plum as PERSON

For reference, I looked at the recognizers loaded by load_predefined_recognizers.

loaded_recognizers = registry.get_recognizers(language="en", all_fields=True)
for r in loaded_recognizers:
    print("{}:\n  {},\n  {}".format(r.name, r.supported_entities, r.context))        

UsBankRecognizer:
  ['US_BANK_NUMBER'],
  ['bankcheck', 'account', 'account#', 'acct', 'save', 'debit']
UsLicenseRecognizer:
  ['US_DRIVER_LICENSE'],
  ['driver', 'license', 'permit', 'lic', 'identification', 'dls', 'cdls', 'lic#', 'driving']
UsItinRecognizer:
  ['US_ITIN'],
  ['individual', 'taxpayer', 'itin', 'tax', 'payer', 'taxid', 'tin']
UsPassportRecognizer:
  ['US_PASSPORT'],
  ['us', 'united', 'states', 'passport', 'passport#', 'travel', 'document']
UsSsnRecognizer:
  ['US_SSN'],
  ['social', 'security', 'ssn', 'ssns', 'ssn#', 'ss#', 'ssid']
NhsRecognizer:
  ['UK_NHS'],
  ['national health service', 'nhs', 'health services authority', 'health authority']
SgFinRecognizer:
  ['SG_NRIC_FIN'],
  ['fin', 'fin#', 'nric', 'nric#']
AuAbnRecognizer:
  ['AU_ABN'],
  ['australian business number', 'abn']
AuAcnRecognizer:
  ['AU_ACN'],
  ['australian company number', 'acn']
AuTfnRecognizer:
  ['AU_TFN'],
  ['tax file number', 'tfn']
AuMedicareRecognizer:
  ['AU_MEDICARE'],
  ['medicare']
InPanRecognizer:
  ['IN_PAN'],
  ['permanent account number', 'pan']
CreditCardRecognizer:
  ['CREDIT_CARD'],
  ['credit', 'card', 'visa', 'mastercard', 'cc ', 'amex', 'discover', 'jcb', 'diners', 'maestro', 'instapayment']
CryptoRecognizer:
  ['CRYPTO'],
  ['wallet', 'btc', 'bitcoin', 'crypto']
DateRecognizer:
  ['DATE_TIME'],
  ['date', 'birthday']
EmailRecognizer:
  ['EMAIL_ADDRESS'],
  ['email']
IbanRecognizer:
  ['IBAN_CODE'],
  ['iban', 'bank', 'transaction']
IpRecognizer:
  ['IP_ADDRESS'],
  ['ip', 'ipv4', 'ipv6']
MedicalLicenseRecognizer:
  ['MEDICAL_LICENSE'],
  ['medical', 'certificate', 'DEA']
PhoneRecognizer:
  ['PHONE_NUMBER'],
  ['phone', 'number', 'telephone', 'cell', 'cellphone', 'mobile', 'call']
UrlRecognizer:
  ['URL'],
  ['url', 'website', 'link']
SpacyRecognizer:
  ['DATE_TIME', 'NRP', 'LOCATION', 'PERSON', 'ORGANIZATION'],
  []
Titles recognizer:
  ['TITLE'],
  None

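9. Ad-hoc detection

Regex and deny-list recognizers can be configured ad hoc through the Analyzer API.

The intended use seems to be adding detection logic for a single request only, on top of recognizers that are already configured.

I'm working with the SDK for now, so I'll skip this.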

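10. Simple anonymization

So far it has been detection only; from here on, anonymization. Anonymization uses AnonymizerEngine, which takes the text and the AnalyzerEngine results and returns the text with the detected spans masked. How each span is masked can be configured per detected entity type. By default, each span is replaced with its entity name.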

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = (
    "Hello, Mr. Williams. My name is John Doe. Call me John. \n"
    "My phone number is 212-555-5555 and email adderss is example@example.com. \n"
    "My credit card number is 1111-2222-3333-4444."
)

analyzer = AnalyzerEngine()

analyzer_results = analyzer.analyze(
    text=text,
    language='en',
)

print("Analyzer Results:")
for ar in analyzer_results:
    print(f"- {str(ar)}")

print()

# Initialize AnonymizerEngine
engine = AnonymizerEngine()

# Pass the text and its analyzer results to the anonymize method
result = engine.anonymize(
    text=text, analyzer_results=analyzer_results
)

print("De-identified text")
print(result.text)

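Some detections overlap (the URL at 118-129 sits inside the EMAIL_ADDRESS at 110-129), but everything is detected and replaced thoroughly.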

Analyzer Results:
- type: EMAIL_ADDRESS, start: 110, end: 129, score: 1.0
- type: CREDIT_CARD, start: 157, end: 176, score: 1.0
- type: PERSON, start: 11, end: 19, score: 0.85
- type: PERSON, start: 32, end: 40, score: 0.85
- type: PERSON, start: 50, end: 54, score: 0.85
- type: PHONE_NUMBER, start: 76, end: 88, score: 0.75
- type: URL, start: 118, end: 129, score: 0.5

De-identified text
Hello, Mr. <PERSON>. My name is <PERSON>. Call me <PERSON>. 
My phone number is <PHONE_NUMBER> and email adderss is <EMAIL_ADDRESS>. 
My credit card number is <CREDIT_CARD>.
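The URL result does not show up in the anonymized text, so overlapping results must be resolved somewhere. Here is a simplified sketch of one way to do it: drop any span fully contained in a span that ranks higher by score, then by length. This is my own illustration with hypothetical names (`Span`, `drop_contained`), not Presidio's actual conflict-resolution logic.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


def drop_contained(spans: List[Span]) -> List[Span]:
    """Drop spans that lie entirely inside an already-kept, higher-ranked span."""
    kept: List[Span] = []
    # Visit higher scores first; break ties in favor of longer spans.
    for s in sorted(spans, key=lambda s: (-s.score, -(s.end - s.start))):
        if any(k.start <= s.start and s.end <= k.end for k in kept):
            continue
        kept.append(s)
    # Return in text order for readability.
    return sorted(kept, key=lambda s: s.start)


results = [
    Span("EMAIL_ADDRESS", 110, 129, 1.0),
    Span("CREDIT_CARD", 157, 176, 1.0),
    Span("PERSON", 11, 19, 0.85),
    Span("PERSON", 32, 40, 0.85),
    Span("PERSON", 50, 54, 0.85),
    Span("PHONE_NUMBER", 76, 88, 0.75),
    Span("URL", 118, 129, 0.5),
]
kept = drop_contained(results)
for span in kept:
    print(span)
```

Applied to the analyzer results above, only the URL span is dropped, which is consistent with the de-identified text showing just `<EMAIL_ADDRESS>` at that position.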

To customize the Anonymizer's processing, use OperatorConfig.

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

text = (
    "Hello, Mr. Williams. My name is John Doe. Call me John. \n"
    "My phone number is 212-555-5555 and email adderss is example@example.com. \n"
    "My credit card number is 1111-2222-3333-4444."
)

analyzer = AnalyzerEngine()

analyzer_results = analyzer.analyze(
    text=text,
    language='en',
)

print("Analyzer Results:")
for ar in analyzer_results:
    print(f"- {str(ar)}")

print()

# AnonymizerEngineを初期化
engine = AnonymizerEngine()

# Define what anonymize does for each entity type
operators = {
    # Default: replace with the string "<ANONYMIZED>"
    "DEFAULT": OperatorConfig(
        "replace", {"new_value": "<ANONYMIZED>"}
    ),
    # PERSON: replace with the string "<SOMEONE>"
    "PERSON": OperatorConfig(
        "replace", {"new_value": "<SOMEONE>"}
    ),
    # EMAIL_ADDRESS: hash with md5
    "EMAIL_ADDRESS": OperatorConfig(
        "hash", {"hash_type": "md5"}
    ),
    # PHONE_NUMBER: mask the last 9 characters with "*"
    "PHONE_NUMBER": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "*",
            "chars_to_mask": 9,
            "from_end": True,
        },
    ),
    # CREDIT_CARD: mask the first 15 characters with "X"
    "CREDIT_CARD": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "X",
            "chars_to_mask": 15,
            "from_end": False,
        },
    ),
}

# Pass the text and its analyzer results to the anonymize method.
# The operators argument sets the anonymization behavior per entity type.
result = engine.anonymize(
    text=text, analyzer_results=analyzer_results, operators=operators,
)

print("De-identified text")
print(result.text)

Result

Analyzer Results:
- type: EMAIL_ADDRESS, start: 110, end: 129, score: 1.0
- type: CREDIT_CARD, start: 157, end: 176, score: 1.0
- type: PERSON, start: 11, end: 19, score: 0.85
- type: PERSON, start: 32, end: 40, score: 0.85
- type: PERSON, start: 50, end: 54, score: 0.85
- type: PHONE_NUMBER, start: 76, end: 88, score: 0.75
- type: URL, start: 118, end: 129, score: 0.5

De-identified text
Hello, Mr. <SOMEONE>. My name is <SOMEONE>. Call me <SOMEONE>. 
My phone number is 212********* and email adderss is 23463b99b62a72f26ed677cc556c44e8. 
My credit card number is XXXXXXXXXXXXXXX4444.

That rewrote a lot.

The available operations are listed in the following.
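The mask settings took me a moment to read, so here is a plain-Python sketch of what `chars_to_mask` and `from_end` do to a matched string. The function `mask` is a hypothetical stand-in of mine, not the Presidio operator itself.

```python
def mask(text: str, masking_char: str, chars_to_mask: int, from_end: bool) -> str:
    """Replace `chars_to_mask` characters at one end of `text` with `masking_char`."""
    # Never mask more characters than the text has.
    n = max(0, min(chars_to_mask, len(text)))
    if n == 0:
        return text
    if from_end:
        return text[: len(text) - n] + masking_char * n
    return masking_char * n + text[n:]


print(mask("212-555-5555", "*", 9, from_end=True))
print(mask("1111-2222-3333-4444", "X", 15, from_end=False))
```

Hyphens count as characters, which is why 15 masked characters leave exactly the last four digits of the card number visible.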

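11. Custom anonymization

The previous section customized the anonymization behavior with OperatorConfig; for finer-grained customization, the replacement can also be done with a function.

For example, replacing people's names with fictitious names.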

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig, EngineResult, RecognizerResult
from faker import Faker

text = "Hello, Mr. Williams. My name is John Doe. Call me John."

analyzer = AnalyzerEngine()

analyzer_results = analyzer.analyze(
    text=text,
    language='en',
)

print("Analyzer Results:")
for ar in analyzer_results:
    print(f"- {str(ar)}")

print()

# Prepare a function that returns a fictitious name with faker (it must accept the value even though the argument is unused)
fake = Faker()

def fake_name(x):
    return fake.name()


# Define custom handling for PERSON entities via the "lambda" parameter
operators = {"PERSON": OperatorConfig("custom", {"lambda": fake_name})}

anonymizer = AnonymizerEngine()

anonymized_results = anonymizer.anonymize(
    text=text, analyzer_results=analyzer_results, operators=operators
)

print(anonymized_results.text)

Result

Analyzer Results:
- type: PERSON, start: 11, end: 19, score: 0.85
- type: PERSON, start: 32, end: 40, score: 0.85
- type: PERSON, start: 50, end: 54, score: 0.85

Hello, Mr. Stephanie Gutierrez. My name is Robert Griffin. Call me Marc White.

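The following are mentioned as other use cases.

Identify the gender and generate a random value of the same gender (e.g. Laura -> Pam)
Identify a date pattern and shift the date (01-01-2020 -> 05-01-2020)
Identify an age and bucket it by decade (89 -> 80s)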
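Two of these use cases can be sketched as plain functions that take the matched text and return the replacement, the same shape as fake_name above. The function names, the day-month-year format, and the 4-day shift are my assumptions, chosen so that the date example from the list works out.

```python
from datetime import datetime, timedelta


def shift_date(value: str, days: int = 4, fmt: str = "%d-%m-%Y") -> str:
    """Shift a date string by a fixed number of days, keeping its format."""
    shifted = datetime.strptime(value, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


def age_to_decade(value: str) -> str:
    """Replace an exact age with its decade bucket, e.g. 89 -> 80s."""
    age = int(value)
    return f"{(age // 10) * 10}s"


print(shift_date("01-01-2020"))
print(age_to_decade("89"))
```

Presumably these could be passed as the "lambda" value of a "custom" OperatorConfig, as with fake_name, provided a recognizer supplies matching entities; an age entity in particular would need a custom recognizer.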

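12. Encryption and decryption

The Anonymizer has built-in support for encrypting and decrypting detected entities. It uses AES-CBC and requires an encryption key.

First, run detection with the Analyzer as usual.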

from presidio_analyzer import AnalyzerEngine

text="My name is James Bond"

analyzer = AnalyzerEngine()

analyzer_results = analyzer.analyze(
    text=text,
    language='en',
)

print("Analyzer Results:")
for ar in analyzer_results:
    print(f"- {str(ar)}")
print("Identified these PII entities:")
for r in analyzer_results:
    print(f"- {text[r.start:r.end]} as {r.entity_type}")
print()

The name is detected.

Analyzer Results:
- type: PERSON, start: 11, end: 21, score: 0.85
Identified these PII entities:
- James Bond as PERSON

Now encrypt.

from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import (
    RecognizerResult,
    OperatorResult,
    OperatorConfig,
)

# Set the key used for encryption and decryption
crypto_key = "WmZq4t7w!z%C&F)J"

anonymizer = AnonymizerEngine()

# Pass the text and the analyzer results and run anonymize
# OperatorConfig is set to "encrypt" PERSON entities
anonymize_result = anonymizer.anonymize(
    text=text,
    analyzer_results=analyzer_results,
    operators={"PERSON": OperatorConfig("encrypt", {"key": crypto_key})},
)

anonymize_result

It is encrypted: text holds the text with the entity encrypted, and items holds information about the encrypted entity.

text: My name is eW/Ql/BgmJCpW36wOHHlYGnxiyAGovKLf6r6UCEhfII=
items:
[
    {'start': 11, 'end': 55, 'entity_type': 'PERSON', 'text': 'eW/Ql/BgmJCpW36wOHHlYGnxiyAGovKLf6r6UCEhfII=', 'operator': 'encrypt'}
]

Now decrypt it.

# Initialize DeanonymizeEngine
engine = DeanonymizeEngine()

# Pass the anonymized text and entities and run deanonymize
# OperatorConfig is set to "decrypt"; DEFAULT applies it to every entity type
deanonymized_result = engine.deanonymize(
    text=anonymize_result.text,
    entities=anonymize_result.items, 
    operators={"DEFAULT": OperatorConfig("decrypt", {"key": crypto_key})},
)

deanonymized_result

It is decrypted.

text: My name is James Bond
items:
[
    {'start': 11, 'end': 21, 'entity_type': 'PERSON', 'text': 'James Bond', 'operator': 'decrypt'}
]

It can also be decrypted directly with Decrypt.

from presidio_anonymizer.operators import Decrypt

# Get the text of the anonymized entity
encrypted_entity_value = anonymize_result.items[0].text

# Decrypt directly with Decrypt
Decrypt().operate(text=encrypted_entity_value, params={"key": crypto_key})
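One thing worth noting about the key: AES only accepts 128-, 192-, or 256-bit keys, so the 16-character key above is a 128-bit key. A small check like the following could catch a bad key before calling the anonymizer. The helper name is hypothetical, and I am assuming the string key is encoded as UTF-8.

```python
def is_valid_aes_key(key: str) -> bool:
    """Return True if `key` encodes to a valid AES key length (128/192/256 bits)."""
    # Assumption: the string key is turned into bytes with UTF-8.
    key_bits = len(key.encode("utf-8")) * 8
    return key_bits in (128, 192, 256)


print(is_valid_aes_key("WmZq4t7w!z%C&F)J"))
print(is_valid_aes_key("too-short"))
```

Non-ASCII characters take more than one byte in UTF-8, so counting characters instead of bytes would give the wrong answer for such keys.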

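13. Excluding values from PII detection with an allow list

In contrast to a deny list, there is also an allow list. Values on the allow list are no longer detected as PII.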

from presidio_analyzer import AnalyzerEngine

websites_list = [
    "bing.com",
]

text = "My favorite website is bing.com, his is microsoft.com"

analyzer = AnalyzerEngine()

result = analyzer.analyze(
    text = text,
    language = 'en',
    allow_list = websites_list     # pass the allow list here
)

print("Analyzer Results:")
for ar in result:
    print(f"- {str(ar)}")
print("Identified these PII entities:")
for r in result:
    print(f"- {text[r.start:r.end]} as {r.entity_type}")
print()

Only the value given in allow_list is not detected as PII.

Analyzer Results:
- type: URL, start: 40, end: 53, score: 0.85
Identified these PII entities:
- microsoft.com as URL
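Conceptually this looks like a post-filter on the detections: anything whose matched text is on the allow list gets dropped. A minimal sketch of that idea with plain tuples follows; `filter_allowed` is a hypothetical helper of mine using exact string matching, and Presidio's own implementation may differ.

```python
def filter_allowed(text, spans, allow_list):
    """Drop (start, end, entity_type) spans whose matched text is on the allow list."""
    allowed = set(allow_list)
    return [
        (start, end, entity_type)
        for (start, end, entity_type) in spans
        if text[start:end] not in allowed
    ]


sample_text = "My favorite website is bing.com, his is microsoft.com"
detections = [(23, 31, "URL"), (40, 53, "URL")]
print(filter_allowed(sample_text, detections, ["bing.com"]))
```

Because the match is exact, a value like "www.bing.com" would still be reported unless it is added to the list as well.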

For more on the Analyzer/Anonymizer, see also the following.

Images are covered here. I don't need image handling yet, so I'll skip it.

Using Japanese

So, let's try Japanese, using the following as a reference.

First, the spaCy Japanese models are here.

https://spacy.io/models/ja

ja_core_news_sm（11MB）
ja_core_news_md（40MB）
ja_core_news_lg（529MB）
ja_core_news_trf（320MB）

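The model naming conventions are described here.

https://spacy.io/models#conventions

Roughly summarized:

For Japanese (ja)
General-purpose pipeline covering tagging, parsing, lemmatization, named entity recognition, and so on (core)
- The alternative is dep, which has no named entity recognition
Trained on news data (news)
- The alternative is web, trained on web data
sm/md/lg/trf indicate the package size
- sm/md/lg are for CPU?
- trf is a Transformer model?

See the site above for other details. There was also a description in the following.

All of them are licensed under CC BY-SA.

Now let's try it. Install and download everything needed. I had hit out-of-memory crashes several times by this point, so I switched to a high-memory runtime.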

!pip install presidio_analyzer
!pip install presidio_anonymizer
!pip install presidio_image_redactor

!python -m spacy download ja_core_news_sm
!python -m spacy download ja_core_news_md
!python -m spacy download ja_core_news_lg
!python -m spacy download ja_core_news_trf

!sudo apt install tesseract-ocr
!sudo apt install libtesseract-dev

Based on what I've done so far, it should look something like this? For now I used ja_core_news_lg as the model.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "ja", "model_name": "ja_core_news_lg"},
    ],
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine, supported_languages=["ja"]
)

anonymizer = AnonymizerEngine()

The text is my usual one. I don't think it's a good sample at all, though, lol.

text = """\
ドウデュース（欧字名:Do Deuce、2019年5月7日 - ）は、日本の競走馬。主な勝ち鞍は2021年の朝日杯フューチュリティステークス、2022年の東京優駿、2023年の有馬記念。

馬名の意味は「する＋テニス用語（勝利目前の意味）」。2021年のJRA賞最優秀2歳牡馬である。

戦績
デビュー前
2019年5月7日、北海道安平町のノーザンファームで誕生。松島正昭が代表を務める株式会社キーファーズの所有馬となり、ノーザンファーム空港牧場で育成の後、栗東トレーニングセンターの友道康夫厩舎に入厩した。

2歳（2021年）
9月5日に小倉競馬場で行われた2歳新馬戦（芝1800メートル）に武豊鞍上で出走。1番人気に推されると、レースは直線でガイアフォースとの追い比べをクビ差制してデビュー勝ちを果たした。

次走はリステッド競走のアイビーステークスを選択。2番人気に推され、レースでは追い比べから抜け出すと、最後は追い込んできたグランシエロをクビ差凌いで優勝、デビュー2連勝とした。

続いて朝日杯フューチュリティステークスに出走。重賞勝ち馬セリフォスやジオグリフをはじめとした自身と同じ無敗馬が多く顔を揃える中、3番人気に支持される。レースでは直線で外に出すと、先に抜け出していたセリフォスを半馬身差で差し切り優勝、無傷3連勝でGI初制覇を果たした。鞍上の武豊はこの競走22回目の挑戦で初制覇となり、日本の中央競馬 (JRA) の平地GI完全制覇までホープフルステークスを残すのみとした。また馬主である松島及びキーファーズにとっては初の単独所有馬によるGI勝利、並びに国内GI初制覇となった。\
"""

First, let's try detection.

analyzer_result = analyzer.analyze(
    text=text,
    language="ja"
)

print("Results:")
for res in analyzer_result:
    print(f"- {str(res)}")
print("Identified these PII entities:")
for r in analyzer_result:
    print(f"- {text[r.start:r.end]} as {r.entity_type}")

Results:
- type: DATE_TIME, start: 20, end: 29, score: 0.85
- type: LOCATION, start: 35, end: 37, score: 0.85
- type: DATE_TIME, start: 48, end: 53, score: 0.85
- type: DATE_TIME, start: 71, end: 76, score: 0.85
- type: DATE_TIME, start: 82, end: 87, score: 0.85
- type: DATE_TIME, start: 121, end: 126, score: 0.85
- type: DATE_TIME, start: 153, end: 162, score: 0.85
- type: LOCATION, start: 163, end: 166, score: 0.85
- type: PERSON, start: 182, end: 186, score: 0.85
- type: PERSON, start: 242, end: 246, score: 0.85
- type: DATE_TIME, start: 259, end: 264, score: 0.85
- type: DATE_TIME, start: 266, end: 270, score: 0.85
- type: PERSON, start: 324, end: 327, score: 0.85
- type: LOCATION, start: 418, end: 424, score: 0.85
- type: LOCATION, start: 475, end: 480, score: 0.85
- type: PERSON, start: 481, end: 486, score: 0.85
- type: LOCATION, start: 545, end: 550, score: 0.85
- type: PERSON, start: 583, end: 585, score: 0.85
- type: LOCATION, start: 605, end: 607, score: 0.85
- type: LOCATION, start: 656, end: 658, score: 0.85
Identified these PII entities:
- 2019年5月7日 as DATE_TIME
- 日本 as LOCATION
- 2021年 as DATE_TIME
- 2022年 as DATE_TIME
- 2023年 as DATE_TIME
- 2021年 as DATE_TIME
- 2019年5月7日 as DATE_TIME
- 北海道 as LOCATION
- 松島正昭 as PERSON
- 友道康夫 as PERSON
- 2021年 as DATE_TIME
- 9月5日 as DATE_TIME
- ガイア as PERSON
- グランシエロ as LOCATION
- セリフォス as LOCATION
- ジオグリフ as PERSON
- セリフォス as LOCATION
- 武豊 as PERSON
- 日本 as LOCATION
- 松島 as LOCATION

Next, let's try anonymizing.

result = anonymizer.anonymize(
    text=text,
    analyzer_results=analyzer_result
)

print(result.text)

ドウデュース（欧字名:Do Deuce、<DATE_TIME> - ）は、<LOCATION>の競走馬。主な勝ち鞍は<DATE_TIME>の朝日杯フューチュリティステークス、<DATE_TIME>の東京優駿、<DATE_TIME>の有馬記念。

馬名の意味は「する＋テニス用語（勝利目前の意味）」。<DATE_TIME>のJRA賞最優秀2歳牡馬である。

戦績
デビュー前
<DATE_TIME>、<LOCATION>安平町のノーザンファームで誕生。<PERSON>が代表を務める株式会社キーファーズの所有馬となり、ノーザンファーム空港牧場で育成の後、栗東トレーニングセンターの<PERSON>厩舎に入厩した。

2歳（<DATE_TIME>）
<DATE_TIME>に小倉競馬場で行われた2歳新馬戦（芝1800メートル）に武豊鞍上で出走。1番人気に推されると、レースは直線で<PERSON>フォースとの追い比べをクビ差制してデビュー勝ちを果たした。

次走はリステッド競走のアイビーステークスを選択。2番人気に推され、レースでは追い比べから抜け出すと、最後は追い込んできた<LOCATION>をクビ差凌いで優勝、デビュー2連勝とした。

続いて朝日杯フューチュリティステークスに出走。重賞勝ち馬<LOCATION>や<PERSON>をはじめとした自身と同じ無敗馬が多く顔を揃える中、3番人気に支持される。レースでは直線で外に出すと、先に抜け出していた<LOCATION>を半馬身差で差し切り優勝、無傷3連勝でGI初制覇を果たした。鞍上の<PERSON>はこの競走22回目の挑戦で初制覇となり、<LOCATION>の中央競馬 (JRA) の平地GI完全制覇までホープフルステークスを残すのみとした。また馬主である<LOCATION>及びキーファーズにとっては初の単独所有馬によるGI勝利、並びに国内GI初制覇となった。
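
To make the relationship between the detected spans and the masked text a bit more concrete, here is a minimal sketch of the replacement step in plain Python. It is not Presidio's implementation: `mask_spans` is a hypothetical helper, and it assumes the spans do not overlap. It just swaps each `start`/`end` range for `<ENTITY_TYPE>`, working from the end of the string backwards so that earlier offsets stay valid. The two spans are taken from the first two lines of the lg results above.

```python
def mask_spans(text, spans):
    """Replace each (start, end, entity_type) span with <ENTITY_TYPE>.

    Assumes the spans do not overlap. Spans are processed from the
    last one to the first so that replacing one span does not shift
    the offsets of the spans that come before it.
    """
    masked = text
    for start, end, entity_type in sorted(spans, reverse=True):
        masked = masked[:start] + f"<{entity_type}>" + masked[end:]
    return masked


# First sentence of the sample text, with the first two detections from the lg run
sample = "ドウデュース（欧字名:Do Deuce、2019年5月7日 - ）は、日本の競走馬。"
spans = [
    (20, 29, "DATE_TIME"),  # 2019年5月7日
    (35, 37, "LOCATION"),   # 日本
]

print(mask_spans(sample, spans))
```
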

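As expected, proper nouns such as horse names may be tough. "ドウデュース" and "Do Deuce" are left as they are, while other horse names do get anonymized. And "武豊" is anonymized in some places but not in others.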

For reference, let's look at how detection and anonymization differ by model. The example above is lg, so here are the others.

`sm`

- 2019年5月7日 as DATE_TIME
- 日本 as LOCATION
- 2021年 as DATE_TIME
- 2022年 as DATE_TIME
- 2023年 as DATE_TIME
- 2021年 as DATE_TIME
- 2019年5月7日 as DATE_TIME
- 北海道 as LOCATION
- 安平町 as LOCATION
- 松島正昭 as PERSON
- 友道康夫 as PERSON
- 2021年 as DATE_TIME
- 9月5日 as DATE_TIME
- 武豊鞍上 as PERSON
- ガイア as PERSON
- グランシエロ as LOCATION
- 鞍上 as PERSON
- 武豊 as PERSON
- 日本 as LOCATION
- JRA as LOCATION
- 松島 as PERSON

ドウデュース（欧字名:Do Deuce、<DATE_TIME> - ）は、<LOCATION>の競走馬。主な勝ち鞍は<DATE_TIME>の朝日杯フューチュリティステークス、<DATE_TIME>の東京優駿、<DATE_TIME>の有馬記念。

馬名の意味は「する＋テニス用語（勝利目前の意味）」。<DATE_TIME>のJRA賞最優秀2歳牡馬である。

戦績
デビュー前
<DATE_TIME>、<LOCATION><LOCATION>のノーザンファームで誕生。<PERSON>が代表を務める株式会社キーファーズの所有馬となり、ノーザンファーム空港牧場で育成の後、栗東トレーニングセンターの<PERSON>厩舎に入厩した。

2歳（<DATE_TIME>）
<DATE_TIME>に小倉競馬場で行われた2歳新馬戦（芝1800メートル）に<PERSON>で出走。1番人気に推されると、レースは直線で<PERSON>フォースとの追い比べをクビ差制してデビュー勝ちを果たした。

次走はリステッド競走のアイビーステークスを選択。2番人気に推され、レースでは追い比べから抜け出すと、最後は追い込んできた<LOCATION>をクビ差凌いで優勝、デビュー2連勝とした。

続いて朝日杯フューチュリティステークスに出走。重賞勝ち馬セリフォスやジオグリフをはじめとした自身と同じ無敗馬が多く顔を揃える中、3番人気に支持される。レースでは直線で外に出すと、先に抜け出していたセリフォスを半馬身差で差し切り優勝、無傷3連勝でGI初制覇を果たした。<PERSON>の<PERSON>はこの競走22回目の挑戦で初制覇となり、<LOCATION>の中央競馬 (<LOCATION>) の平地GI完全制覇までホープフルステークスを残すのみとした。また馬主である<PERSON>及びキーファーズにとっては初の単独所有馬によるGI勝利、並びに国内GI初制覇となった。

`md`

- 2019年5月7日 as DATE_TIME
- 日本 as LOCATION
- 2021年 as DATE_TIME
- 2022年 as DATE_TIME
- 2023年 as DATE_TIME
- 2021年 as DATE_TIME
- 2019年5月7日 as DATE_TIME
- 北海道安平町 as LOCATION
- 松島正昭 as PERSON
- 友道康夫厩舎 as PERSON
- 2021年 as DATE_TIME
- 9月5日 as DATE_TIME
- ガイア as PERSON
- ジオグリフ as PERSON
- 鞍上 as PERSON
- 武豊 as PERSON
- 日本 as LOCATION

ドウデュース（欧字名:Do Deuce、<DATE_TIME> - ）は、<LOCATION>の競走馬。主な勝ち鞍は<DATE_TIME>の朝日杯フューチュリティステークス、<DATE_TIME>の東京優駿、<DATE_TIME>の有馬記念。

馬名の意味は「する＋テニス用語（勝利目前の意味）」。<DATE_TIME>のJRA賞最優秀2歳牡馬である。

戦績
デビュー前
<DATE_TIME>、<LOCATION>のノーザンファームで誕生。<PERSON>が代表を務める株式会社キーファーズの所有馬となり、ノーザンファーム空港牧場で育成の後、栗東トレーニングセンターの<PERSON>に入厩した。

2歳（<DATE_TIME>）
<DATE_TIME>に小倉競馬場で行われた2歳新馬戦（芝1800メートル）に武豊鞍上で出走。1番人気に推されると、レースは直線で<PERSON>フォースとの追い比べをクビ差制してデビュー勝ちを果たした。

次走はリステッド競走のアイビーステークスを選択。2番人気に推され、レースでは追い比べから抜け出すと、最後は追い込んできたグランシエロをクビ差凌いで優勝、デビュー2連勝とした。

続いて朝日杯フューチュリティステークスに出走。重賞勝ち馬セリフォスや<PERSON>をはじめとした自身と同じ無敗馬が多く顔を揃える中、3番人気に支持される。レースでは直線で外に出すと、先に抜け出していたセリフォスを半馬身差で差し切り優勝、無傷3連勝でGI初制覇を果たした。<PERSON>の<PERSON>はこの競走22回目の挑戦で初制覇となり、<LOCATION>の中央競馬 (JRA) の平地GI完全制覇までホープフルステークスを残すのみとした。また馬主である松島及びキーファーズにとっては初の単独所有馬によるGI勝利、並びに国内GI初制覇となった。

`trf`

Identified these PII entities:
- ドウデュース as PERSON
- 2019年5月7日 as DATE_TIME
- 日本 as LOCATION
- 2021年 as DATE_TIME
- 2022年 as DATE_TIME
- 2023年 as DATE_TIME
- 2021年 as DATE_TIME
- 2019年5月7日 as DATE_TIME
- 北海道安平町 as LOCATION
- ノーザンファーム as LOCATION
- 松島正昭 as PERSON
- 友道康夫 as PERSON
- 2021年 as DATE_TIME
- 9月5日 as DATE_TIME
- 武豊 as PERSON
- ガイアフォース as PERSON
- セリフォス as PERSON
- ジオグリフ as PERSON
- セリフォス as PERSON
- 武豊 as PERSON
- 日本 as LOCATION
- 松島 as PERSON
- キーファーズ as PERSON

<PERSON>（欧字名:Do Deuce、<DATE_TIME> - ）は、<LOCATION>の競走馬。主な勝ち鞍は<DATE_TIME>の朝日杯フューチュリティステークス、<DATE_TIME>の東京優駿、<DATE_TIME>の有馬記念。

馬名の意味は「する＋テニス用語（勝利目前の意味）」。<DATE_TIME>のJRA賞最優秀2歳牡馬である。

戦績
デビュー前
<DATE_TIME>、<LOCATION>の<LOCATION>で誕生。<PERSON>が代表を務める株式会社キーファーズの所有馬となり、ノーザンファーム空港牧場で育成の後、栗東トレーニングセンターの<PERSON>厩舎に入厩した。

2歳（<DATE_TIME>）
<DATE_TIME>に小倉競馬場で行われた2歳新馬戦（芝1800メートル）に<PERSON>鞍上で出走。1番人気に推されると、レースは直線で<PERSON>との追い比べをクビ差制してデビュー勝ちを果たした。

次走はリステッド競走のアイビーステークスを選択。2番人気に推され、レースでは追い比べから抜け出すと、最後は追い込んできたグランシエロをクビ差凌いで優勝、デビュー2連勝とした。

続いて朝日杯フューチュリティステークスに出走。重賞勝ち馬<PERSON>や<PERSON>をはじめとした自身と同じ無敗馬が多く顔を揃える中、3番人気に支持される。レースでは直線で外に出すと、先に抜け出していた<PERSON>を半馬身差で差し切り優勝、無傷3連勝でGI初制覇を果たした。鞍上の<PERSON>はこの競走22回目の挑戦で初制覇となり、<LOCATION>の中央競馬 (JRA) の平地GI完全制覇までホープフルステークスを残すのみとした。また馬主である<PERSON>及び<PERSON>にとっては初の単独所有馬によるGI勝利、並びに国内GI初制覇となった。

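As a rough impression, it feels like md <<< sm < lg << trf? md misses a lot and also has false positives (it treats "鞍上", a word for the rider in the saddle, as a person name). sm is far better but has false positives too. lg has a few misses. trf seems to cover pretty much everything. That's how it looks to me.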
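
To put the comparison on slightly firmer ground, here is a small sketch that diffs the PERSON detections of each model against trf. The strings are copied from the result listings above. Using trf as the baseline is just an assumption for illustration; it is not ground truth (it also labels horses and a company as PERSON).

```python
# PERSON detections per model, copied from the listings above (duplicates collapsed)
person_hits = {
    "sm": {"松島正昭", "友道康夫", "武豊鞍上", "ガイア", "鞍上", "武豊", "松島"},
    "md": {"松島正昭", "友道康夫厩舎", "ガイア", "ジオグリフ", "鞍上", "武豊"},
    "lg": {"松島正昭", "友道康夫", "ガイア", "ジオグリフ", "武豊"},
    "trf": {
        "ドウデュース", "松島正昭", "友道康夫", "武豊", "ガイアフォース",
        "セリフォス", "ジオグリフ", "松島", "キーファーズ",
    },
}

baseline = person_hits["trf"]  # assumption: treat trf as the reference point

for model in ("sm", "md", "lg"):
    hits = person_hits[model]
    only_here = sorted(hits - baseline)  # found by this model but not by trf
    missing = sorted(baseline - hits)    # found by trf but not by this model
    print(f"{model}: not in trf = {only_here}, missing vs trf = {missing}")
```
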

kun432

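I've got a general feel for how it works. It does seem a bit resource-hungry, and naturally it's not 100% perfect, so I'm not sure about using it to anonymize data after the fact. But I did get the sense that it should be quite usable for checking LLM input.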
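
As a rough idea of what such an input check could look like, here is a minimal sketch. None of this was tried in the notes above. `Detection` is a stand-in dataclass that only mirrors the fields shown in the analyzer results (`entity_type`, `start`, `end`, `score`), and `should_block`, the blocked entity types, and the score threshold are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Stand-in for one analyzer result (same fields as printed above)."""
    entity_type: str
    start: int
    end: int
    score: float


def should_block(detections, blocked_types=frozenset({"PERSON"}), min_score=0.5):
    """Return True if any detection is of a blocked type with a high enough score."""
    return any(
        d.entity_type in blocked_types and d.score >= min_score
        for d in detections
    )


# Two detections taken from the lg results above
detections = [
    Detection("DATE_TIME", 20, 29, 0.85),
    Detection("PERSON", 182, 186, 0.85),
]

if should_block(detections):
    print("Input rejected before it reaches the LLM")
else:
    print("Input accepted")
```

Which entity types to block would depend on the use case. Dates and locations, for example, show up in plenty of harmless input.
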

This scrap was closed on 2024/02/03.
