Computing image similarity with Vision: what is an Image Feature Print?

Published 2023/12/13

iOS

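One way to compute image similarity on iOS/macOS/visionOS is to use Vision's VNGenerateImageFeaturePrintRequest / VNFeaturePrintObservation.

This article covers how to implement image similarity computation with them, and summarizes what Apple's primary sources say about what an "Image Feature Print" actually is.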

`VNGenerateImageFeaturePrintRequest`

The VNGenerateImageFeaturePrintRequest class was added in iOS 13 and computes an "Image Feature Print".

An image-based request to generate feature prints from an image.

The details section of the API reference says only the following:

This request returns the feature print data it generates as an array of VNFeaturePrintObservation objects.

There is no explanation of what a "feature print" is.

About the results of VNGenerateImageFeaturePrintRequest

The API reference says the request "returns an array of VNFeaturePrintObservation objects", but the official sample is implemented like this:

return request.results?.first as? VNFeaturePrintObservation

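Judging from the actual behavior as well, it seems safe to assume that one image yields exactly one VNFeaturePrintObservation object (the case where several are generated does not seem to need handling).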
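To make the flow concrete, here is a minimal sketch of generating a feature print from a CGImage. The helper name featurePrint(for:) is my own and not part of Vision, and taking only the first result is an assumption carried over from the official sample line above. The whole thing is wrapped in canImport(Vision) so it is skipped on platforms without Vision.

```swift
import Foundation
#if canImport(Vision)
import Vision
import CoreGraphics

/// Returns the feature print for `image`, or nil if Vision produced no result.
/// `featurePrint(for:)` is a hypothetical helper name, not a Vision API.
func featurePrint(for image: CGImage) throws -> VNFeaturePrintObservation? {
    let request = VNGenerateImageFeaturePrintRequest()
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    // perform(_:) runs synchronously and throws if the request fails.
    try handler.perform([request])
    // Assumption (same as the official sample): only the first observation matters.
    return request.results?.first as? VNFeaturePrintObservation
}
#endif
```

On newer SDKs results is already typed as an array of VNFeaturePrintObservation, so the cast may only produce a compiler warning there.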

`VNFeaturePrintObservation`

The reference for VNFeaturePrintObservation says this:

An observation that provides the recognized feature print.

It gives no details at all about what a "Feature Print" is.

The class's properties and methods are limited to the following:

/**
	@brief The type of each element in the data.
*/
open var elementType: VNElementType { get }

/**
	@brief The total number of elements in the data.
*/
open var elementCount: Int { get }

/**
	@brief The feature print data.
*/
open var data: Data { get }

/**
@brief Computes the distance between two observations.
@discussion The larger the distance the more dissimilar the feature prints are. In case of an error this method returns false with an error describing the error condition, for instance comparing two non-comparable feature prints.
*/
open func computeDistance(_ outDistance: UnsafeMutablePointer<Float>, to featurePrint: VNFeaturePrintObservation) throws

The elementType property (of type VNElementType) indicates whether the values in data are float or double.
The elementCount property indicates the number of elements in data.
The data property holds the actual feature print data.
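For inspection purposes, the three properties can be combined roughly as in the sketch below. The helper names are hypothetical, and the layout (tightly packed Float values at the start of data) is my assumption, since the documentation does not describe it. The Vision-specific part is guarded by canImport(Vision).

```swift
import Foundation
#if canImport(Vision)
import Vision
#endif

/// Copies `count` Float values out of raw bytes.
/// Hypothetical helper; assumes the bytes are tightly packed Floats.
func floatElements(in data: Data, count: Int) -> [Float]? {
    guard count >= 0, data.count >= count * MemoryLayout<Float>.size else { return nil }
    var values = [Float](repeating: 0, count: count)
    // copyBytes copies at most as many bytes as the destination buffer can hold.
    _ = values.withUnsafeMutableBufferPointer { data.copyBytes(to: $0) }
    return values
}

#if canImport(Vision)
/// Returns the elements of a feature print when elementType reports Float storage.
func floatElements(of observation: VNFeaturePrintObservation) -> [Float]? {
    guard observation.elementType == .float else { return nil }
    return floatElements(in: observation.data, count: observation.elementCount)
}
#endif
```

This only exposes the raw numbers; what each element means is not documented.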

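The important part is the computeDistance method, which computes the distance between two VNFeaturePrintObservation objects.

In other words, to compute the similarity of two images, this is the method to use.

The official sample also evaluates similarity using the value computed with this method, as follows: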

var distance = Float(0)
try contestantFPO.computeDistance(&distance, to: originalFPO)

(FPO stands for Feature Print Observation.)
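Putting generation and comparison together, an end-to-end comparison of two images might look like the following sketch. featurePrintDistance and imageDistance are names I made up for illustration, and using only the first observation per image is the same assumption as in the official sample.

```swift
import Foundation
#if canImport(Vision)
import Vision
import CoreGraphics

/// Wraps computeDistance(_:to:) so the distance comes back as a return value.
/// `featurePrintDistance` is a hypothetical name used only in this sketch.
func featurePrintDistance(from first: VNFeaturePrintObservation,
                          to second: VNFeaturePrintObservation) throws -> Float {
    var distance = Float(0)
    // Throws, for example, when the two feature prints are not comparable.
    try first.computeDistance(&distance, to: second)
    return distance
}

/// Generates a feature print for each image and returns the distance between them,
/// or nil if Vision returned no observation for either image.
func imageDistance(_ a: CGImage, _ b: CGImage) throws -> Float? {
    func featurePrint(_ image: CGImage) throws -> VNFeaturePrintObservation? {
        let request = VNGenerateImageFeaturePrintRequest()
        try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
        return request.results?.first as? VNFeaturePrintObservation
    }
    guard let printA = try featurePrint(a), let printB = try featurePrint(b) else {
        return nil
    }
    return try featurePrintDistance(from: printA, to: printB)
}
#endif
```

A smaller return value means the two images are more similar; what counts as "similar enough" is up to the app.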

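The documentation does not explain what each element of the data property, which holds the actual feature print, represents. So reading data directly and doing something custom with it does not look like the intended use; realistically, computeDistance seems to be the only way to use it.

What is an "Image Feature Print"?

The API reference explains nothing about the "Image Feature Print", but the WWDC19 session "Understanding Images in Vision Framework" is the one place where it is discussed in detail.

In the video this starts a little after 20:00; in the slides, from around p. 99.

Referring to that session, let's look at what an Image Feature Print is.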

Two ways to describe the content of an image

Alongside a slide on this topic, the session says:

When we talk about Image Similarity, what we really mean is a method to describe the content of an image and another method to compare those descriptions.

The most basic way in which I can describe the contents of an image is using the source pixels themselves.

If I did a search in this fashion, however, it's extremely fragile, and it's easily fooled by small changes like rotations or lighting augmentations that drastically change the pixel values but not the semantic content in the image.

In short, comparing images by their raw pixels is not robust at all: simply flipping an image is enough for it to be judged a different image.
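The fragility is easy to reproduce with a toy example. The sketch below is my own illustration, not something from the session: it treats a tiny grayscale image as a 2D array and uses mean absolute difference as a stand-in for a naive pixel comparison.

```swift
/// Mean absolute difference between two equally sized grayscale images.
func meanAbsoluteDifference(_ a: [[UInt8]], _ b: [[UInt8]]) -> Double {
    var total = 0
    var count = 0
    for (rowA, rowB) in zip(a, b) {
        for (pixelA, pixelB) in zip(rowA, rowB) {
            total += abs(Int(pixelA) - Int(pixelB))
            count += 1
        }
    }
    return count == 0 ? 0 : Double(total) / Double(count)
}

// A tiny "image": dark on the left half, bright on the right half.
let original: [[UInt8]] = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
// Mirroring keeps the content (one dark/bright edge) but changes every pixel.
let mirrored = original.map { Array($0.reversed()) }

print(meanAbsoluteDifference(original, original)) // 0.0
print(meanAbsoluteDifference(original, mirrored)) // 255.0
```

The mirrored image shows the same thing, yet it gets the maximum possible pixel difference.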

So it would be better to compare images using a semantic description of their content.

What I really want is a more high-level description of what the content of the image is, perhaps something like natural language. I could make use of the image classification API I was describing previously in order to extract a set of words that describe my image. I could then retrieve other images with a similar set of classifications. I might even combine this with something like word vectors to account for similar but not exactly matching words like cat and kitten.

Describing and comparing image content with natural language

Judging from that slide alone, it looks as if you could simply run the images through an image classification model and compare the outputs (e.g. Kitten, Bowl).

But that approach has its limits.

Well, if I performed a search like this, I might get similar objects in a very general sense, but the way in which those objects appear and the relationships between them could be very different.

As well, I would be limited by the taxonomy of my classifier. That is, any object that appeared in my image that wasn't in my classification networks taxonomy couldn't be expressed in a search like this.

Right: with this approach, similarity can only be compared in terms of objects covered by the image classification model's taxonomy, and how objects appear and relate to one another is not taken into account.
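A rough sketch of such a label-based comparison, using Jaccard similarity over sets of labels. The measure and the function name are my own choice for illustration; the session does not name a specific one.

```swift
/// Jaccard similarity of two label sets: |A ∩ B| / |A ∪ B|.
/// Returns 0 when both sets are empty.
func labelSimilarity(_ a: Set<String>, _ b: Set<String>) -> Double {
    let union = a.union(b)
    guard !union.isEmpty else { return 0 }
    return Double(a.intersection(b).count) / Double(union.count)
}

// "kitten" and "cat" do not match as strings, so only "bowl" is shared:
// one common label out of three distinct labels.
let first: Set<String> = ["kitten", "bowl"]
let second: Set<String> = ["cat", "bowl"]
print(labelSimilarity(first, second))

// Objects outside the classifier's taxonomy produce no labels at all,
// so two images of the same unknown object cannot be matched this way.
let unknownA: Set<String> = []
let unknownB: Set<String> = []
print(labelSimilarity(unknownA, unknownB)) // 0.0
```

Neither case says anything about how the objects look or how they are arranged, which is the other limit mentioned in the session.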

Vector representation of images

So what is Vision's "Image Feature Print", which solves these problems?

What I really want is a high-level description of the objects that appear in the image that isn't fixated on the exact pixel values but still cares about them. I also want this to apply to any natural image and not just those within a specific taxonomy. As it turns out, this kind of representation learning is something that's naturally engendered in our classification network as part of its training process.

The upper layers of the network contain all of the salient information necessary to perform classification while discarding any redundant or unnecessary information that doesn't aid it in that task. We can make use of these upper layers then to act as our feature descriptor, and it's something we refer to as the feature print.

Now, the feature print is a vector that describes the content of the image that isn't constrained to a particular taxonomy, even the one that the classification network was trained on. It simply leverages what the network has learned about images during its training process.

So it appears to be the output of the upper layers of an image classification network.
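Once an image is represented as a vector, "how similar" becomes "how far apart". The sketch below uses Euclidean distance purely as a common example of a vector distance. Whether computeDistance uses this or some other measure is not stated in the documentation, so treat that as my assumption.

```swift
/// Euclidean distance between two equal-length vectors; nil if the lengths differ.
/// Generic illustration only, not necessarily what Vision computes internally.
func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float? {
    guard a.count == b.count else { return nil }
    var sumOfSquares: Float = 0
    for (x, y) in zip(a, b) {
        let difference = x - y
        sumOfSquares += difference * difference
    }
    return sumOfSquares.squareRoot()
}

// Two made-up 2-dimensional "descriptors".
if let distance = euclideanDistance([0, 0], [3, 4]) {
    print(distance) // 5.0
}
```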

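I see, so that's why it's called a Feature Print...

About Vision Feature Print

Vision Feature Print is a feature extraction model built into iOS. Details here:

The session then gives an example showing that this Feature Print lets you compute similarity more accurately.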

If we look at these pairs of images, we can compare how similar their feature prints are, and the smaller the value is, the more similar the two images are in a semantic sense. We can see that even though the two images of the cats are visually dissimilar, they have a much more similar feature print than the visually similar pairs of different animals. To make this a little more concrete, let's go through a specific example. Let's say I have the source image on screen, and I want to find other semantically similar images to it.
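Finding "other semantically similar images" then comes down to sorting candidates by their distance to the source. Here is a sketch under my own naming (rankBySimilarity and rank are hypothetical helpers). The distances in the usage example are made-up placeholder values, not measured results.

```swift
import Foundation
#if canImport(Vision)
import Vision
#endif

/// Sorts candidates by ascending distance to the source (most similar first).
func rankBySimilarity<Item>(
    _ candidates: [Item],
    distanceToSource: (Item) throws -> Float
) rethrows -> [(item: Item, distance: Float)] {
    let scored: [(item: Item, distance: Float)] = try candidates.map { item in
        (item: item, distance: try distanceToSource(item))
    }
    return scored.sorted { $0.distance < $1.distance }
}

#if canImport(Vision)
/// Ranks feature prints against a source feature print using computeDistance.
func rank(_ candidates: [VNFeaturePrintObservation],
          against source: VNFeaturePrintObservation) throws
    -> [(item: VNFeaturePrintObservation, distance: Float)] {
    try rankBySimilarity(candidates) { candidate in
        var distance = Float(0)
        try candidate.computeDistance(&distance, to: source)
        return distance
    }
}
#endif

// Usage with placeholder distances (illustrative values only).
let placeholderDistances: [String: Float] = ["cat.jpg": 0.4, "dog.jpg": 1.2, "car.jpg": 1.8]
let ranked = rankBySimilarity(Array(placeholderDistances.keys)) {
    placeholderDistances[$0] ?? .infinity
}
print(ranked.map { $0.item }) // ["cat.jpg", "dog.jpg", "car.jpg"]
```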

Incidentally, in the slide,