Closed 3 months ago

Trying out Jina-CLIP-v2, a multilingual and multimodal embedding model

I found out about it here:
https://twitter.com/JinaAI_/status/1859659764281782420
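Jina-CLIP-v2 is a 0.9B-parameter (900 million) multilingual, multimodal embedding model that supports 89 languages, 512x512 image resolution, an 8192-token input length, and Matryoshka representations that can be truncated down to 64 dimensions for both images and text. Details: https://jina.ai/news/jina-clip-v2-multilingual-multimodal-embeddings-for-text-and-images/
Naturally, it delivers strong performance on retrieval and classification tasks. As with Jina-CLIP v1, the text encoder of Jina-CLIP v2 also works on its own as a dense retriever, with performance on par with jina-embeddings-v3, currently the best multilingual embedding model under 1B parameters.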
https://twitter.com/JinaAI_/status/1859659765892411462
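Like other CLIP-style models, Jina-CLIP v2 consists of a text encoder (Jina XLM-RoBERTa, 561M parameters) and a vision encoder (EVA02-L14, 304M parameters), for a total of 865M parameters. The same text encoder is also used as the backbone of jina-embeddings-v3. The two encoders are trained jointly and are designed to produce aligned representations of images and text.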
https://twitter.com/JinaAI_/status/1859659767670768023
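Give the model a try! It is available through the Search Foundation API and Hugging Face SentenceTransformers, or through AWS, Azure, and GCP for on-premises deployment.
In addition, our partners Pinecone, Weaviate, and Qdrant offer smooth API integration with existing vector databases.
We would love your feedback on the model! ❤️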

https://huggingface.co/jinaai/jina-clip-v2

Official blog
https://jina.ai/news/jina-clip-v2-multilingual-multimodal-embeddings-for-text-and-images/

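Jina CLIP v2: Multilingual Multimodal Embeddings for Text and Images
Jina-CLIP v2 is a 0.9B-parameter multimodal embedding model with multilingual support for 89 languages, 512x512 high-resolution image handling, and Matryoshka representations.
Multimodal embedding models make it possible to search and understand data across different modalities (such as text and images) through a coherent representation. This makes them a foundation for neural information retrieval and multimodal generative AI applications. The key features of the newly released Jina-CLIP v2 are as follows: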

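Improved performance:

A 3% improvement over v1 on text-image and text-text retrieval tasks. As with v1, the v2 text encoder works as a multilingual, long-context dense retriever, performing on par with the state-of-the-art jina-embeddings-v3 (the best model under 1B parameters on MTEB).

Multilingual support:

With the jina-embeddings-v3 text encoder as its text tower, it supports multilingual image retrieval in 89 languages, with up to a 4% improvement over nllb-clip-large-siglip.

Higher image resolution:

v2 supports 512x512 input images, a large step up from the 224x224 of v1. This allows more detailed image processing and improves feature extraction and the recognition of fine visual elements.

Matryoshka representations:

The output dimension of both text and image embeddings can be reduced from 1024 down to 64, lowering storage and processing costs while retaining strong performance.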
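
To make the Matryoshka idea concrete, here is a minimal sketch of client-side truncation. It assumes that a Matryoshka-style embedding can be shortened by keeping its leading components and rescaling to unit length; whether the API's `dimensions` option does exactly this on the server is not confirmed here. The function name `truncate_embedding` and the random stand-in vector are made up for illustration.

```python
import numpy as np


def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    vec = np.asarray(embedding, dtype=np.float64)
    if not 0 < dim <= vec.shape[-1]:
        raise ValueError(f"dim must be between 1 and {vec.shape[-1]}")
    head = vec[..., :dim]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    # Avoid dividing by zero for an all-zero prefix.
    return head / np.where(norm == 0, 1.0, norm)


# Random stand-in for a 1024-dimensional embedding (not real model output).
rng = np.random.default_rng(0)
full = rng.normal(size=1024)
small = truncate_embedding(full, 64)
print(small.shape)
```

Rescaling after truncation keeps dot products comparable to cosine similarities, which matters if the downstream index assumes unit-length vectors.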

Let's try it in Colaboratory.

I'll use a sports image dataset on Kaggle (gpiosenka/sports-classification).

Install the Kaggle Python library

!pip install kaggle

Set the credentials as environment variables

from google.colab import userdata
import os

os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')

Download the dataset. It is downloaded under the dataset directory.

from kaggle import KaggleApi

api = KaggleApi()
api.authenticate()
api.dataset_download_files('gpiosenka/sports-classification', path='dataset', unzip=True)

Here is what the dataset contains

%cd dataset

Output (directory listing)

total 75084
drwxr-xr-x 102 root root     4096 Nov 24 00:12  valid
drwxr-xr-x 102 root root     4096 Nov 24 00:11  train
drwxr-xr-x 102 root root     4096 Nov 24 00:11  test
-rw-r--r--   1 root root   686576 Nov 24 00:11  sports.csv
-rw-r--r--   1 root root 76183400 Nov 24 00:11 'EfficientNetB0-100-(224 X 224)- 98.40.h5'

Output (directory tree)

.
├── EfficientNetB0-100-(224 X 224)- 98.40.h5
├── sports.csv
├── test
│   ├── air hockey
│   │   ├── 1.jpg
│   │   ├── 2.jpg
│   │   ├── 3.jpg
│   │   ├── 4.jpg
│   │   └── 5.jpg
│   ├── ampute football
│   │   ├── 1.jpg
│   │   ├── 2.jpg
│   │   ├── 3.jpg
│   │   ├── 4.jpg
│   │   └── 5.jpg
│   ├── archery
│   │   ├── 1.jpg
│   │   ├── 2.jpg
│   │   ├── 3.jpg
(snip)

Under each of the valid, test, and train directories, images are stored per sport category with sequentially numbered file names.
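
Given that layout, a small helper can enumerate one category in numeric order instead of hard-coding paths. This is only a sketch: `list_category_images` is a made-up name, it assumes every file is named `<number>.jpg` as in the tree above, and the demo builds a throwaway directory rather than touching the real dataset.

```python
import tempfile
from pathlib import Path


def list_category_images(root, split, category):
    """Return the .jpg paths of one category, sorted by their numeric file name."""
    folder = Path(root) / split / category
    # Sort numerically so that 10.jpg comes after 2.jpg.
    return sorted(folder.glob("*.jpg"), key=lambda p: int(p.stem))


# Demo on a temporary directory that mimics dataset/test/baseball.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "test" / "baseball"
    folder.mkdir(parents=True)
    for n in (10, 2, 1):
        (folder / f"{n}.jpg").touch()
    names = [p.name for p in list_category_images(tmp, "test", "baseball")]

print(names)  # ['1.jpg', '2.jpg', '10.jpg']
```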

This time I picked the following six files.

images = [
    "test/baseball/1.jpg",
    "test/baseball/2.jpg",
    "test/horse racing/1.jpg",
    "test/horse racing/2.jpg",
    "test/swimming/1.jpg",
    "test/swimming/2.jpg",
]

Here is what they look like.

from PIL import Image as PILImage
import matplotlib.pyplot as plt
import math

num_rows = math.ceil(len(images) / 2)

fig, axes = plt.subplots(num_rows, 2, figsize=(6, num_rows * 3))
axes = axes.flatten()

for i, (ax, img_path) in enumerate(zip(axes, images)):
    try:
        img = PILImage.open(img_path)
        ax.imshow(img)
        ax.set_title(img_path, fontsize=10)
        ax.axis('off')
    except Exception as e:
        ax.text(0.5, 0.5, str(e), fontsize=12, ha='center')
        ax.axis('off')

for ax in axes[len(images):]:
    ax.axis('off')

plt.tight_layout()
plt.show()

I have an LLM read these images and describe them as text. This time I use OpenAI.

!pip install openai

from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

import base64
from openai import OpenAI


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def get_text_from_image(image_path):
    base64_image = encode_image(image_path)
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
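            {
                "role": "system",
                "content": (
                    # English gloss of the prompt: "Describe what is shown in this image concisely
                    # in one sentence. Always mention the name of the sport. Do not use phrases
                    # like 'this image is' or 'an image of ...'."
                    "この画像に写っているものを1文で簡潔に説明して。スポーツ名は必ず言及すること。「この画像は」や「〜な画像」という言及はしないこと。"
                )
            },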
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            # The selected files are .jpg, so declare them as JPEG.
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        temperature=0.0,
    )

    return response.choices[0].message.content

texts = []

for t in images:
    text = get_text_from_image(t)
    texts.append(text)

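This produced the following texts. They are in Japanese because the prompt is in Japanese; in English they read roughly: a baseball player at the moment of swinging a bat; players sharing their joy after a baseball game; three horses battling fiercely in a horse race; a horse and jockey training for horse racing; competitive swimmers racing in the water; a swimmer swimming underwater.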

texts

Output

['野球の選手がバットを振っている瞬間。',
 '野球の試合後、選手たちが喜びを分かち合っているシーン。',
 '競馬のレースで、3頭の馬が激しく競り合っている。',
 '競馬のトレーニングを行っている馬と騎手。',
 '競泳の選手が水中で競い合っている。',
 '水泳をしている選手が水中で泳いでいる様子。']
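
The data URL sent along with each image carries a MIME type. As a minimal sketch, the type can be derived from the file extension with the standard `mimetypes` module rather than written by hand. `to_data_url` is a hypothetical helper name, and the three bytes written to the temporary file are just a JPEG signature, not a real picture.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def to_data_url(image_path):
    """Encode a file as a base64 data URL, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(str(image_path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {image_path}")
    payload = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


# Demo with a tiny placeholder file named like the dataset images.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "1.jpg"
    sample.write_bytes(b"\xff\xd8\xff")
    data_url = to_data_url(sample)

print(data_url.split(",")[0])  # data:image/jpeg;base64
```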

Now I generate embeddings for the images and the texts with Jina-CLIP-v2. Images are passed either as a URL or as a base64-encoded string.

import requests
from google.colab import userdata

url = 'https://api.jina.ai/v1/embeddings'

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer ' + userdata.get('JINA_API_KEY')
}


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def get_embedding_from_image(image_path):
    data = {
        'input': [
            {"image": encode_image(image_path)}
        ],
        'model': 'jina-clip-v2',
        'encoding_type': 'float',
        'dimensions': '1024' 
    }

    response = requests.post(url, headers=headers, json=data)
    return response.json()['data'][0]['embedding']


def get_embedding_from_text(text):
    data = {
        'input': [
            {"text": text}
        ],
        'model': 'jina-clip-v2',
        'encoding_type': 'float',
        'dimensions': '1024' 
    }

    response = requests.post(url, headers=headers, json=data)
    return response.json()['data'][0]['embedding']

data = []

for i in images:
    emb = get_embedding_from_image(i)
    data.append({"label": i, "type": "image", "embeddings": emb})

for t in texts:
    emb = get_embedding_from_text(t)
    data.append({"label": t, "type": "text", "embeddings": emb})
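
The loops above send one request per item. Since `input` is a list, several items could probably go into a single request. The sketch below only builds the request body and reads a response-shaped dictionary, without calling the network. It assumes that the endpoint accepts mixed text and image entries in one call and returns `data` in input order, which I have not verified here; `build_batch_payload` and `embeddings_in_order` are made-up names, and `dimensions` is passed as an integer.

```python
def build_batch_payload(texts=(), images_b64=(), dimensions=1024):
    """Build one request body carrying several text and image inputs."""
    inputs = [{"text": t} for t in texts] + [{"image": b} for b in images_b64]
    if not inputs:
        raise ValueError("at least one text or image is required")
    return {"model": "jina-clip-v2", "input": inputs, "dimensions": dimensions}


def embeddings_in_order(response_json):
    """Collect the vectors from a response shaped like the one parsed above."""
    return [item["embedding"] for item in response_json["data"]]


# Placeholder inputs; the image entry would be a real base64 string in practice.
payload = build_batch_payload(texts=["野球", "競馬"], images_b64=["<base64 string>"])
print(len(payload["input"]))  # 3
```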

Compare these by similarity and visualize the result.

!pip install japanize-matplotlib

import numpy as np
import matplotlib.pyplot as plt
import japanize_matplotlib 

def cosine_similarity_matrix(vectors):
    v_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.dot(v_norm, v_norm.T)

# Extract the vectors
embeddings = np.array([item["embeddings"] for item in data])
similarity_matrix = cosine_similarity_matrix(embeddings)

# Extract the labels (truncated to 20 characters)
labels = [item["label"][:20]+"..." for item in data]

# Plot a heatmap with Matplotlib
fig, ax = plt.subplots(figsize=(8, 6))
cax = ax.matshow(similarity_matrix, cmap="viridis")

# Add a colorbar
fig.colorbar(cax)

# Set the axis labels
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=90)  # rotate the labels by 90 degrees
ax.set_yticklabels(labels)

# Turn off grid lines
ax.grid(False)

# Set the title
plt.title("Text and Image Similarity Matrix", pad=20)

# Show the plot
plt.tight_layout()
plt.show()

The similarity matrix came out as follows.

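When similarities are compared without distinguishing text from images, there is a gap between same-modality pairs and cross-modal pairs. Even so, texts and images of the same sport clearly score higher with each other.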

Next, compare text against images only.

# Split the embeddings into text and image
text_embeddings = np.array([item["embeddings"] for item in data if item["type"] == "text"])
image_embeddings = np.array([item["embeddings"] for item in data if item["type"] == "image"])

# Compute cosine similarity between two sets of vectors
def cosine_similarity(vectors1, vectors2):
    v1_norm = vectors1 / np.linalg.norm(vectors1, axis=1, keepdims=True)
    v2_norm = vectors2 / np.linalg.norm(vectors2, axis=1, keepdims=True)
    return np.dot(v1_norm, v2_norm.T)

similarity_matrix_text_image = cosine_similarity(text_embeddings, image_embeddings)

# Split the labels
text_labels = [item["label"][:20]+"..." for item in data if item["type"] == "text"]
image_labels = [item["label"][:20]+"..." for item in data if item["type"] == "image"]

# Plot with Matplotlib
fig, ax = plt.subplots(figsize=(8, 6))
cax = ax.matshow(similarity_matrix_text_image, cmap="viridis")

# Add a colorbar
fig.colorbar(cax)

# Set the axis labels
ax.set_xticks(range(len(image_labels)))
ax.set_yticks(range(len(text_labels)))
ax.set_xticklabels(image_labels, rotation=90)  # image labels on the columns
ax.set_yticklabels(text_labels)  # text labels on the rows

# Turn off grid lines
ax.grid(False)

# Set the title
plt.title("Text and Image Similarity Matrix", pad=20)

# Show the plot
plt.tight_layout()
plt.show()

For uses such as retrieving images from a text query, or the reverse, it looks perfectly usable.
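
As a rough sketch of that text-to-image use, the function below ranks candidate vectors by cosine similarity to a query vector. The three-dimensional vectors are made-up toy values standing in for real embeddings, and `rank_by_cosine` is a hypothetical name.

```python
import numpy as np


def rank_by_cosine(query, candidates, labels, top_k=3):
    """Return the top_k (label, score) pairs by cosine similarity to the query."""
    q = np.asarray(query, dtype=float)
    c = np.asarray(candidates, dtype=float)
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return [(labels[i], float(scores[i])) for i in order]


# Toy stand-ins for image embeddings and one text query embedding.
toy_images = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_labels = ["test/baseball/1.jpg", "test/baseball/2.jpg", "test/swimming/1.jpg"]
results = rank_by_cosine([1.0, 0.0, 0.1], toy_images, toy_labels, top_k=2)
print([label for label, _ in results])
```

With the real data, `query` would be one row of `text_embeddings` and `candidates` would be `image_embeddings`; swapping the two gives image-to-text search.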

Judging from Jina's API, for retrieval it seems better to specify "task": "retrieval.query" when generating the query embedding.
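
A minimal sketch of how that could look when building the request body. It assumes the `task` field sits at the top level next to `model` and `input`, and that it is only set on the query side; I have not checked how the API treats other values. `build_text_payload` is a made-up helper and makes no network call.

```python
def build_text_payload(text, is_query=False, dimensions=1024):
    """Build a request body for one text, tagging it as a retrieval query if asked."""
    payload = {
        "model": "jina-clip-v2",
        "input": [{"text": text}],
        "dimensions": dimensions,
    }
    if is_query:
        # Assumption: only the query side carries the task hint.
        payload["task"] = "retrieval.query"
    return payload


query_payload = build_text_payload("野球の試合", is_query=True)
document_payload = build_text_payload("野球の選手がバットを振っている瞬間。")
print(query_payload["task"], "task" in document_payload)
```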

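There is also this option:
L2 normalization

Scales the embedding so that its Euclidean (L2) norm becomes 1. Useful when downstream use involves dot products, classification, or visualization.
In this case, would it be better to turn it on? Let me try.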
def get_embedding_from_image(image_path):
    data = {
        'input': [
            {"image": encode_image(image_path)}
        ],
        'model': 'jina-clip-v2',
        'encoding_type': 'float',
        'dimensions': '1024',
        "normalized": True,    # added
    }

    response = requests.post(url, headers=headers, json=data)
    return response.json()['data'][0]['embedding']


def get_embedding_from_text(text):
    data = {
        'input': [
            {"text": text}
        ],
        'model': 'jina-clip-v2',
        'encoding_type': 'float',
        'dimensions': '1024',
        "normalized": True,    # added
    }

    response = requests.post(url, headers=headers, json=data)
    return response.json()['data'][0]['embedding']
Here is the result of regenerating the embeddings and visualizing them again.
Not much changed, it seems.
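
One likely reason for the small difference: the heatmap code already divides every vector by its norm, and cosine similarity does not depend on vector length. The sketch below checks this on random stand-in vectors (not real embeddings).

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two vectors of arbitrary length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def l2_normalize(v):
    """Scale a vector so that its L2 norm is 1."""
    return v / np.linalg.norm(v)


# Two random vectors with deliberately different scales.
rng = np.random.default_rng(42)
a = rng.normal(size=1024) * 3.0
b = rng.normal(size=1024) * 0.5

raw = cosine(a, b)
normalized_dot = float(np.dot(l2_normalize(a), l2_normalize(b)))
print(np.isclose(raw, normalized_dot))  # True
```

Normalization on the API side would matter more when the downstream step is a plain dot product, for example in a vector database configured with a dot-product metric.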

A quick summary of the performance section
https://jina.ai/ja/news/jina-clip-v2-multilingual-multimodal-embeddings-for-text-and-images/#cross-modal-retrieval-performance

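Cross-modal retrieval performance
Supports 89 languages and performs strongly on multilingual image retrieval.
Performs on par with or better than NLLB-CLIP-SigLIP.
Its size (865M parameters) sits between nllb-siglip-base (507M) and nllb-siglip-large (1.2B).

English-only text and images
Reaches 98.0% on Flickr30k image-to-text retrieval, a state-of-the-art result.
Up to 3.3% better than v1 on COCO retrieval tasks.
Remains competitive with NLLB-CLIP-SigLIP.

Multilingual text and images
Up to +3.8% on Crossmodal 3600 image-to-text retrieval.
On text-to-image retrieval, NLLB-SigLIP is narrowly ahead (the gap is within 3%).

Text-only dense retriever performance
On the multilingual MTEB benchmark, it scores 69.86% on retrieval and 67.77% on semantic similarity.
Competitive with jina-embeddings-v3.
On English tasks, its retrieval score is nearly double that of NLLB-SigLIP.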

Because it supports Matryoshka Representations, the number of dimensions can be reduced.
https://jina.ai/news/jina-clip-v2-multilingual-multimodal-embeddings-for-text-and-images/#matryoshka-representation-performance

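Matryoshka representation performance
Both the text and image encoders retain strong performance even when reduced to 64 dimensions.
Even with a 75% reduction in dimensions, over 99% of the performance is retained on text, image, and cross-modal tasks.

Image classification
Across 37 image classification benchmarks, the model proved robust to dimension reduction.
Going from 1024 to 64 dimensions (a 94% reduction) costs only 8% in top-5 accuracy and 12.5% in top-1 accuracy.

Cross-modal retrieval
Image and text embeddings reduced to 64 dimensions still give robust retrieval performance.
93% of image-to-text and 90% of text-to-image retrieval performance is retained.

Text-only retrieval
On the English MTEB benchmark, 64-dimensional text embeddings lose only 2.1% on semantic similarity.
Retrieval shows a moderate 17.5% drop compared with 1024 dimensions, but overall performance stays efficient.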
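
To put the reduction figures into storage terms, here is a small arithmetic sketch. It assumes embeddings are stored as 4-byte floats and uses a hypothetical corpus of one million vectors; both are illustration values, not numbers from the blog.

```python
def storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage needed for the vectors alone (no index overhead)."""
    return num_vectors * dimensions * bytes_per_value


def reduction_percent(full_dim, reduced_dim):
    """How much smaller the reduced vector is, as a percentage of the full one."""
    return 100.0 * (1 - reduced_dim / full_dim)


# 1024 is the full size, 256 is the 75% reduction, 64 is the smallest size mentioned.
for dim in (1024, 256, 64):
    mib = storage_bytes(1_000_000, dim) / 2**20
    print(f"{dim:>4} dims: {mib:8.1f} MiB, {reduction_percent(1024, dim):.2f}% smaller")
```

The 1024 to 64 case comes out at 93.75%, which matches the "94% reduction" quoted above after rounding.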

Image pricing and recommendations
https://jina.ai/news/jina-clip-v2-multilingual-multimodal-embeddings-for-text-and-images/
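Billing counts each 512x512 region as one tile = 4000 tokens (so something like 1024x1024 would presumably be 4 tiles).
The model is optimized for 512x512.
Portrait or landscape images are resized so that the long side fits within the tile, which means the leftover area becomes black bars.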
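
A rough cost estimate under that reading could be sketched as follows. The rule used here (count how many 512x512 tiles are needed to cover the image, 4000 tokens each) is my extrapolation from the note above, not a documented formula, and `estimate_image_tokens` is a made-up name.

```python
import math

TILE_SIZE = 512          # pixels per tile edge
TOKENS_PER_TILE = 4000   # tokens billed per tile


def estimate_image_tokens(width, height):
    """Estimate billed tokens, assuming the image is covered by 512x512 tiles."""
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(512, 512))    # 4000
print(estimate_image_tokens(1024, 1024))  # 16000
```

If that assumption holds, downscaling images to 512x512 before sending them would keep each image at a single tile.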

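Summary
Multilingual support is the biggest benefit. The API remains simple and looks easy to use.
Also, several integrations are introduced. Pinecone appears to put images and text into the same vector space, whereas Qdrant and Weaviate appear to vectorize the image and the text separately and store them as a single object, so how the data is managed will differ a bit between them. If you want to try one, I think Pinecone is the simplest and easiest to follow.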
https://docs.pinecone.io/models/jina-clip-v2
https://weaviate.io/developers/weaviate/model-providers/jinaai/embeddings-multimodal
https://qdrant.tech/documentation/embeddings/jina-embeddings/
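
To illustrate the two data layouts described above, here is a plain-Python sketch. It does not use any of the three vendors' client libraries; `SharedSpaceRecord` and `MultiVectorObject` are hypothetical types, and the short vectors are made-up stand-ins for real embeddings.

```python
from dataclasses import dataclass, field


@dataclass
class SharedSpaceRecord:
    """One vector per record; image and text vectors sit side by side in one space."""
    id: str
    vector: list
    metadata: dict = field(default_factory=dict)


@dataclass
class MultiVectorObject:
    """One object per item, holding a separate named vector for each modality."""
    id: str
    vectors: dict
    properties: dict = field(default_factory=dict)


image_vec = [0.10, 0.20, 0.30]  # stand-in for an image embedding
text_vec = [0.10, 0.25, 0.28]   # stand-in for a caption embedding

# Layout 1: two independent records that share one vector space.
shared = [
    SharedSpaceRecord("baseball-1-image", image_vec, {"modality": "image"}),
    SharedSpaceRecord("baseball-1-text", text_vec, {"modality": "text"}),
]

# Layout 2: a single object that carries both vectors.
combined = MultiVectorObject("baseball-1", {"image": image_vec, "text": text_vec})

print(len(shared), sorted(combined.vectors))
```

In the first layout a query searches across both modalities at once; in the second, the caller would choose which named vector to search against.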
