Cohereを試す

kun432

https://cohere.com/

kun432

がちょっと気になってるので

kun432

CohereはAIプラットフォームという説明があるけど、これだけではわからないので、ChatGPTに聞いてみた。

cohereというサービスについて教えてください。

「Cohere」とは、自然言語理解と生成に焦点を当てた人工知能技術を提供する企業です。彼らのサービスは、大規模な言語モデルを活用して、テキストの生成、要約、意図の解釈、その他の言語関連のタスクをサポートします。Cohereのプラットフォームは、ビジネスや開発者が彼らのアプリケーションやサービスに統合できるよう設計されています。

Cohereの主な特徴は以下の通りです：

テキスト生成: 与えられたプロンプトに基づいて、関連性のあるテキストを生成する能力。

テキスト解析: 感情分析、意図の識別、キーワードの抽出など、テキストの理解を支援する機能。

カスタマイズ可能: 特定の用途や業界に合わせて、モデルのカスタマイズが可能。

API統合: 開発者が簡単に自分のアプリケーションやシステムに統合できるAPIを提供。

Cohereは、ビジネスが自動化されたカスタマーサポート、コンテンツの生成、データ分析など、多くの分野で利用可能なツールを提供しています。また、この技術は、言語モデリングの最新の進歩を活用しており、非常に柔軟で強力な能力を持っています。ただし、このようなAIサービスの使用には、倫理的な考慮やプライバシーの保護も重要な要素となります。

OpenAIやAnthropicとの違いは何でしょうか？

OpenAI、Anthropic、およびCohereはすべて、先進的な人工知能技術に焦点を当てた企業ですが、それぞれに特徴があります。

OpenAI

製品: OpenAIは、GPTシリーズ（GPT-3など）やDALL-E、Codexなど、幅広いAI技術を提供しています。

特徴: OpenAIは、汎用性が高く革新的な言語モデルと画像生成モデルでよく知られています。特にGPT-3は、その広範囲な応用能力と柔軟性で注目を集めています。

焦点: OpenAIは研究と応用の両方に重点を置いており、AIの安全性と倫理的な使用に関しても強い関心を持っています。

Anthropic

製品: Anthropicは、AIの安全性と解釈可能性に特化しています。

特徴: この企業は、AIモデルの振る舞いをより理解しやすく、予測可能にすることに重点を置いています。彼らのアプローチは、AIの倫理的な使用と透明性を強化することに注目しています。

焦点: Anthropicの主な目標は、AI技術の安全性と解釈可能性を高めることです。

Cohere

製品: Cohereは、主に自然言語処理と生成に重点を置いたAIサービスを提供しています。

特徴: Cohereのサービスは、テキスト生成、解析、意図の識別など、言語に関連するタスクを実行する能力を持っています。

焦点: Cohereは、言語理解と生成に特化しており、APIを通じて他のアプリケーションやサービスとの統合を容易にしています。
これらの企業はそれぞれ、AI技術の異なる側面に焦点を当てており、独自の製品とアプローチを持っています。また、AIの倫理的使用と進化への取り組みも各企業によって異なります。

LLMを使ったアプリ開発者の立場から見て、それぞれの違いはどういうところにあるでしょうか？

Large Language Models（LLM）を使用するアプリケーション開発者の観点から見た場合、OpenAI、Anthropic、およびCohereの間の主な違いは次のようになります：

1. モデルの機能と性能

OpenAI: GPT-3やそれに類するモデルは、非常に多様な言語タスクに対応しており、高い柔軟性と広範な知識ベースを持っています。これにより、アプリケーション開発者は、チャットボット、文章生成、自動コーディングなど多岐にわたる用途でモデルを活用できます。

Anthropic: この企業の提供するモデルは、特にAIの安全性と解釈可能性に重点を置いています。これは、モデルの決定を理解しやすく、予測可能なものにすることを意味します。これは、特にユーザーインターフェースやユーザーエクスペリエンスが重要なアプリケーションにとって価値があります。

Cohere: Cohereは言語理解と生成に特化しており、API統合の容易さとカスタマイズ性に重点を置いています。これにより、特定のニーズや業界特有の要件に合わせてモデルを調整することが可能です。

2. 統合とアクセスの容易さ

OpenAI: OpenAIはAPIを通じた容易な統合を提供し、開発者が迅速にAI機能をアプリケーションに組み込むことができます。また、広範なドキュメントとコミュニティのサポートも利点です。

Anthropic: Anthropicのモデルがどの程度容易に統合可能か、または開発者向けのサポートがどの程度提供されているかは、具体的な情報が限られています。しかし、安全性と透明性に焦点を置いたモデルは、特定のアプリケーションにおいて重要な要素となる可能性があります。

Cohere: CohereはAPIを通じて柔軟な統合オプションを提供しており、特に言語関連のタスクに焦点を当てたアプリケーション開発に適しています。

3. コストと利用条件

各企業は異なる価格設定と利用条件を持っています。開発者は、コスト効率、使用頻度、データのプライバシーに関するポリシーなどを検討する必要があります。

4. カスタマイズと拡張性

OpenAIとCohereは、特定のニーズに合わせてモデルをカスタマイズするオプションを提供する可能性があります。一方で、Anthropicは安全性と解釈可能性に重点を置いているため、その方面でのカスタマイズが期待できます。

総じて、アプリケーション開発者は、自身のプロジェクトの特定の要件、目的、および利用可能な予算に基づいて、これらの企業の提供するLLMを評価する必要があります。また、プライバシーと倫

理的な観点から、使用するAIモデルを選択する際には慎重に検討することが求められます。

まあピンとこないけども、イメージ的にはOpenAIやClaudeの競合というイメージ。あとは使い勝手だったりとか担ってくると思うで、とりあえず触ってみる。

kun432

何ができるのかは、公式のドキュメントと以下のQiitaの記事が参考になる。

https://docs.cohere.com/docs

OpenAIであるようなものはこの辺。OpenAIだとプロンプトエンジニアリングでやるようなタスクも具体的なエンドポイントとして用意されているのは、考え方が違うんだろうなと思わせる。

/generate
/embed
/chat
/classify
/summarize

で、「言語理解と生成」に特化しているということでなるほどと思うのは、OpenAIにはない以下のようなエンドポイントがあるのは面白い。

/tokenize
- テキストをBPE（byte-pair encoding）を使用してトークン化する
- ~~日本語だとこれは意味がなさそう~~
/detokenize
- 上記の逆。BPE（byte-pair encoding）でトークン化された文字列をテキストに変換する
- 日本語だとこれは意味がなさそう
/detect-language
- テキストの言語を判定する
/rerank
- クエリとテキスト配列を渡すと、類似度順に再ランキングしてくれる。
- 日本語にも対応

この中で、リランキングをAPIで提供しているというのはちょっと熱い。リランキングはRAGで一般的に使用されるベクトル検索結果の精度を上げるための一つの方法。

手法はいくつかあるけど、自分の理解だとこんな感じ（間違ってるかもしれない）

RRF(reciprocal rank fusion)を使って、ハイブリッド検索の類似度スコアをリランキング
- 速度も速く、実装もそんなに難しくないが、精度は高くない
- ハイブリッド検索が必要になる、つまりキーワード検索の環境を用意する必要がある
Sentence Transformersベースで、multivector/cross encorderモデルを使ってリランキング
- 精度は高いが、速度が遅い
- リランキングのためのモデルが必要、つまりGPU環境が必要になる

というところで、一定のハードルが出てくる。

これがAPIだけでできるとするならば、少なくとも環境を用意する必要がなくなるので実装が楽なのではないか？と考えれる。

Cohereについて調べようと思ったのはこれが最大の理由だったりする。

ただrerankの料金がよくわからない・・・

https://cohere.com/pricing

Price per search unit: 0.001USD

Model Price / Unit

Default $1.00 / 1k search units

Custom $2.00 / 1k search units

We count a single search unit as a query with up to 100 documents to be ranked. Documents longer than 510 tokens when including the length of the search query will be split up into multiple chunks, where each chunk counts as a singular document.

Model	Price / Unit
Default	$1.00 / 1k search units
Custom	$2.00 / 1k search units

んー、計算がわからないけど。100文書ってのがリランキング対象のテキストのことでこれが1検索ユニット、で個々の文書が500トークンを超えると別の文書に分割される、つまり雑に計算すると500トークン✕100文書=1検索ユニット＝$0.001っていう理解でいいのかな？

最初、検索ユニットをトークンだと思ったので「たっか！」と思ったのだけど、やりたいのはRAGでプロンプトに組み込むin-contextのところなので、仮に1コンテキスト500字を10個並べてリランキングさせたいだけなら、1検索ユニットに収まるのでざっくりそこまで高くないのかも。

kun432

Cohereではモデルのカスタマイズもできる。

「特定のタスクに特化したデータセット」とあるので、ファインチューニングだと思う。OpenAIのchatCompletionの場合はいろんなタスクをすべてプロンプトで表現するのだけども、Cohereの場合はまず3種類の用途に分かれている。

Generate
Classify
Embed
Rerank

Classification/Rerank専用のインタフェースを持っているのと、Embedのファインチューニングがあるというのが特徴かな。

あとはカスタマイズしたモデルをベースモデルと比較・評価するための数値もとれるらしい。OpenAIの方にもあるのかなと思ったらあった。

kun432

あと気軽に試せるPlaygroundのようなものもある。

https://dashboard.cohere.com/playground

できるのは

Chat
Generate
Classify
Embed
Summrize

試してみた感じ、生成系は日本語は理解はしているけど出力はできないっぽい。Chatでリファレンスが表示されているけども、Web検索したりってのもあるみたいで、ChatGPTのBingとかLangChainのToolみたいな概念もあるっぽい。

実際には動かしてないけど、ClassifyとかEmbedだと多言語モデルが選択できる。

Playground以外にはあとCoralというのがあるけど、これはChatGPTと同じような位置づけなのだと思う。少し試してみたけど日本語は出来なかった。

kun432

ではPythomで書いてみる。Quickstart Tutorialをみてみる。

用意されているのは以下だけど全部classifyに関するものになっている。

Customer Support
Intent Recognition
Sentiment Analysis
Toxicity Detection

その場で実行できるインタラクティブなチュートリアルになってる（アカウント作成は必要になるがAPIキーが自動でセットされる）のはとても良いのだけど、もうちょっとエンドポイントごとに試したい。

ということで、APIリファレンスで進める。こちらもその場で実行できるようになっているのだけど、今回はColaboratoryでやってみることにする。あと、前提として日本語で使えるかどうかを確認したい、特に多言語に対応したEmbeddingモデルがあるのでEmbeddingを使っているところをしっかり確認したい。

https://docs.cohere.com/reference/about

※ここめちゃめちゃ参考になりました。

準備

!pip install cohere

CohereのAPIキーをsecretから読みこんでクライアント初期化

from google.colab import userdata
import cohere

co = cohere.Client(userdata.get('COHERE_API_KEY'))

kun432

/generate (Co.Generate)

OpenAIでいうcompletion。

response = co.generate(
  prompt='問題:富士山の高さを2倍すると何m？\n答え:',
)
print(response)

[cohere.Generation {
	id: 9dd56c9d-558b-44d6-8ab4-db7f2465e61b
	prompt: 問題:富士山の高さを2倍すると何m？
答え:
	text:  4000m
	likelihood: None
	finish_reason: COMPLETE
	token_likelihoods: None
}]

response = co.generate(
  prompt='問題:現在の日本の総理大臣は？\n答え:',
)
print(response)

[cohere.Generation {
	id: 4f2239de-8bbc-43e8-a6d1-9da50f84fd3e
	prompt: 問題:現在の日本の総理大臣は？
答え:
	text:  Fumio Kishida
	likelihood: None
	finish_reason: COMPLETE
	token_likelihoods: None
}]

一応、日本語は理解はしてて回答もできなくはないんだけど、必ずしも日本語で返ってくるというわけではない感じ。

例えば

response = co.generate(
  prompt='こんにちは。今日の',
  k=1,
  temperature=0
)
print(response)

[cohere.Generation {
	id: 806b3d9c-3500-435f-9f0b-c1f6c2befb0a
	prompt: こんにちは。今日の
	text:  こんにちは! You've caught me on a good day as today is Thursday, and thus it is Thursday in Japan as well! 今日は?
	likelihood: None
	finish_reason: COMPLETE
	token_likelihoods: None
}]

completionなんだけど、最初の例みたいに具体的なタスクを明示したプロンプトにする必要がありそう。あと色々触ってみた感じだとパラメータもきちんと指定したほうが良さそう。

選択できるモデルにはバリエーションがある。

command（デフォルト）
command-light
command-nightly
command-light-nightly

試してみた感じだと、nightlyはさすがにちょっと使えないかなと言う感じ。プロンプトやパラメータにもよると思うので参考程度に。

kun432

/chat (Co.Chat)

ChatCompletion。

response = co.chat(
  chat_history=[
    {"role": "USER", "message": "重力を発見したのは誰？"},
    {"role": "CHATBOT", "message": "重力の発見者として広く認知されているのはアイザック・ニュートンです。"}
  ],
  message="彼が生まれたのは何年？",
)
print(response)

cohere.Chat {
	id: f808f4f4-d6c2-455a-a81a-f7eed20fbfab
	response_id: f808f4f4-d6c2-455a-a81a-f7eed20fbfab
	generation_id: 8016f7b3-6f7a-4ba6-8ef5-38909ffa7548
	message: 彼が生まれたのは何年？
	text: 西暦十一十三年十月archar(11)の日に、アイザック・ニュートンは世界に出産されました（1870年）。
	conversation_id: None
	prompt: None
	chat_history: None
	preamble: None
	client: <cohere.client.Client object at 0x7daa7a880700>
	token_count: {'prompt_tokens': 150, 'response_tokens': 60, 'total_tokens': 210, 'billed_tokens': 187}
	meta: {'api_version': {'version': '1'}}
	is_search_required: None
	citations: None
	documents: None
	search_results: None
	search_queries: None
}

ちょっと日本語が破綻気味だけどもまあ会話っぽく回答が生成される。

connectorsを使うと外部データの参照、つまりRAGっぽいことができる。現状はweb-searchのみ指定できる。

response = co.chat(
  chat_history=[
    {"role": "USER", "message": "重力を発見したのは誰？"},
    {"role": "CHATBOT", "message": "重力の発見者として広く認知されているのはアイザック・ニュートンです。"}
  ],
  message="彼の妻の名前は？",
  connectors=[{"id": "web-search"}],
)
print(response)

cohere.Chat {
	id: 843d1fdd-6ef1-4cc1-b5ab-f82e5bac1718
	response_id: 843d1fdd-6ef1-4cc1-b5ab-f82e5bac1718
	generation_id: 1d5344ab-3b95-49a2-aec8-e44b1f15acbb
	message: 彼の妻の名前は？
	text: I couldn't find any information in my search, but i'll try to answer you anyway. ういドリフュンです。
	conversation_id: None
	prompt: None
	chat_history: None
	preamble: None
	client: <cohere.client.Client object at 0x7daa7a880700>
	token_count: {'prompt_tokens': 448, 'response_tokens': 32, 'total_tokens': 480, 'billed_tokens': 105}
	meta: {'api_version': {'version': '1'}}
	is_search_required: None
	citations: None
	documents: []
	search_results: [{'search_query': {}, 'document_ids': [], 'connector': {'id': 'web-search'}, 'error_message': 'web results not found in JSON', 'continue_on_failure': True}]
	search_queries: [{'text': '彼の妻の名前は？', 'generation_id': '5985f9e3-3661-4804-9835-612755dbcbeb'}]
}

全然うまくいけてないんだけど、何かしら調べようとしているのはわかる。

ちなみに英語だとうまくいく。

response = co.chat(
  chat_history=[
    {"role": "USER", "message": "Who discovered gravity?"},
    {"role": "CHATBOT", "message": "The man who is widely credited with discovering gravity is Sir Isaac Newton"}
  ],
  message="What year was he born?",
  connectors=[{"id": "web-search"}]
)
print(response)

cohere.Chat {
	id: b13016c4-d131-4435-ac27-85f024d0494c
	response_id: b13016c4-d131-4435-ac27-85f024d0494c
	generation_id: bf1fe086-3bc7-4fb3-bc1b-1741c12f17f4
	message: What year was he born?
	text: Isaac Newton was born on January 4, 1643, but had two birth dates, as England was still using the Julian calendar, which was ten days behind the Gregorian calendar that the rest of the continent was using. This means his birthdate is recorded as December 25, 1642. He died on March 20, 1726.
	conversation_id: None
	prompt: None
	chat_history: None
	preamble: None
	client: <cohere.client.Client object at 0x7daa5e6ae8c0>
	token_count: {'prompt_tokens': 1412, 'response_tokens': 67, 'total_tokens': 1479, 'billed_tokens': 90}
	meta: {'api_version': {'version': '1'}}
	is_search_required: None
	citations: [{'start': 25, 'end': 40, 'text': 'January 4, 1643', 'document_ids': ['web-search_4:0', 'web-search_2:8']}, {'start': 50, 'end': 65, 'text': 'two birth dates', 'document_ids': ['web-search_2:8']}, {'start': 70, 'end': 113, 'text': 'England was still using the Julian calendar', 'document_ids': ['web-search_2:7']}, {'start': 125, 'end': 140, 'text': 'ten days behind', 'document_ids': ['web-search_0:69', 'web-search_2:7']}, {'start': 145, 'end': 163, 'text': 'Gregorian calendar', 'document_ids': ['web-search_0:69', 'web-search_2:7']}, {'start': 173, 'end': 205, 'text': 'rest of the continent was using.', 'document_ids': ['web-search_2:7']}, {'start': 221, 'end': 264, 'text': 'birthdate is recorded as December 25, 1642.', 'document_ids': ['web-search_0:54', 'web-search_0:69', 'web-search_2:7']}, {'start': 268, 'end': 291, 'text': 'died on March 20, 1726.', 'document_ids': ['web-search_0:54']}]
	documents: [{'id': 'web-search_4:0', 'snippet': 'Arts & Entertainment\n\nIsaac Newton Birthday\n\nIsaac Newton, born January 4, 1643, was an English physicist and mathematician. He developed the laws of motion and is credited as one of the great minds of the 17th-century Scientific Revolution. In 1687, Newton published his most acclaimed work, “Philosophiae Naturalis Principia Mathematica” (“Mathematical Principles of Natural Philosophy”), which is regarded as the most significant book on physics.\n\nMarch 31, 1727 (age 84)\n\nIsaac Newton was born on January 4, 1643, in Lincolnshire, England. Newton was the only child of a local farmer, also named Isaac, who died three months before his son was born.', 'title': 'Isaac Newton - Age, Bio, Birthday, Family, Net Worth | National Today', 'url': 'https://nationaltoday.com/birthday/isaac-newton/'}, {'id': 'web-search_2:8', 'snippet': ' Artist’s concept via NASA.\n\nBottom line: Isaac Newton could claim two birth dates, but now we celebrate his birthday as January 4, 1643. Newton’s work in gravity and the laws of motion form the basis of much of today’s understanding of physics and astronomy.\n\n34X885Facebook6Pinterest23BufferShare\n\nDeborah Byrd View Articles\n\nDeborah Byrd created the EarthSky radio series in 1991 and founded EarthSky.org in 1994. Today, she serves as Editor-in-Chief of this website. She has won a galaxy of awards from the broadcasting and science communities, including having an asteroid named 3505 Byrd in her honor. In 2020, she was the Education Prize from the American Astronomical Society, the largest organization of professional astronomers in North America.', 'title': 'EarthSky | Isaac Newton born today in 1643', 'url': 'https://earthsky.org/human-world/this-date-in-science-isaac-newtons-birthday/'}, {'id': 'web-search_2:7', 'snippet': '\n\nThe rest of the continent had already adopted the Gregorian calendar, which is the same calendar we use today. However, at the time of Newton’s birth, the English were still using the Julian calendar, which lagged ten days behind because of a faulty method of accounting for leap years. (Coincidentally, 1642 was the year that Galileo died).\n\nSo Newton himself would have said his birthday was December 25. But everywhere outside of England he was born on January 4. Read more about Newton’s birthday discrepancy.\n\nEinstein’s 1916 theory of general relativity didn’t replace Newton’s theory of gravity. But it did change our understanding of gravity so that now we know massive objects cause a distortion in space-time, which passing objects feel as gravity.', 'title': 'EarthSky | Isaac Newton born today in 1643', 'url': 'https://earthsky.org/human-world/this-date-in-science-isaac-newtons-birthday/'}, {'id': 'web-search_0:69', 'snippet': " At Newton's birth, Gregorian dates were ten days ahead of Julian dates; thus, his birth is recorded as taking place on 25 December 1642 Old Style, but it can be converted to a New Style (modern) date of 4 January 1643. By the time of his death, the difference between the calendars had increased to eleven days. Moreover, he died in the period after the start of the New Style year on 1 January but before that of the Old Style new year on 25 March. His death occurred on 20 March 1726, according to the Old Style calendar, but the year is usually adjusted to 1727. A full conversion to New Style gives the date 31 March 1727.\n\n^ This claim was made by William Stukeley in 1727, in a letter about Newton written to Richard Mead.", 'title': 'Isaac Newton - Wikipedia', 'url': 'https://en.wikipedia.org/wiki/Isaac_Newton'}, {'id': 'web-search_0:54', 'snippet': " Diligent, sagacious and faithful, in his expositions of nature, antiquity and the holy Scriptures, he vindicated by his philosophy the majesty of God mighty and good, and expressed the simplicity of the Gospel in his manners. Mortals rejoice that there has existed such and so great an ornament of the human race! He was born on 25th December 1642, and died on 20th March 1726.\n\nIn a 2005 survey of members of Britain's Royal Society (formerly headed by Newton) asking who had the greater effect on the history of science, Newton or Albert Einstein, the members deemed Newton to have made the greater overall contribution.", 'title': 'Isaac Newton - Wikipedia', 'url': 'https://en.wikipedia.org/wiki/Isaac_Newton'}]
	search_results: [{'search_query': {'text': 'what year was isaac newton born?', 'generation_id': '3d5dc5c5-d8f8-43a6-ad3f-4065a4dc8541'}, 'document_ids': ['web-search_0:69', 'web-search_0:54', 'web-search_2:8', 'web-search_2:7', 'web-search_4:0'], 'connector': {'id': 'web-search'}}]
	search_queries: [{'text': 'what year was isaac newton born?', 'generation_id': '3d5dc5c5-d8f8-43a6-ad3f-4065a4dc8541'}]
}

kun432

/summarize (Co.Summrize)

要約。現状は英語しか対応していない。

text=(
  "Ice cream is a sweetened frozen food typically eaten as a snack or dessert. "
  "It may be made from milk or cream and is flavoured with a sweetener, "
  "either sugar or an alternative, and a spice, such as cocoa or vanilla, "
  "or with fruit such as strawberries or peaches. "
  "It can also be made by whisking a flavored cream base and liquid nitrogen together. "
  "Food coloring is sometimes added, in addition to stabilizers. "
  "The mixture is cooled below the freezing point of water and stirred to incorporate air spaces "
  "and to prevent detectable ice crystals from forming. The result is a smooth, "
  "semi-solid foam that is solid at very low temperatures (below 2 °C or 35 °F). "
  "It becomes more malleable as its temperature increases.\n\n"
  "The meaning of the name \"ice cream\" varies from one country to another. "
  "In some countries, such as the United States, \"ice cream\" applies only to a specific variety, "
  "and most governments regulate the commercial use of the various terms according to the "
  "relative quantities of the main ingredients, notably the amount of cream. "
  "Products that do not meet the criteria to be called ice cream are sometimes labelled "
  "\"frozen dairy dessert\" instead. In other countries, such as Italy and Argentina, "
  "one word is used fo\r all variants. Analogues made from dairy alternatives, "
  "such as goat's or sheep's milk, or milk substitutes "
  "(e.g., soy, cashew, coconut, almond milk or tofu), are available for those who are "
  "lactose intolerant, allergic to dairy protein or vegan."
)

response = co.summarize(
  text=text,
)
print(response)

SummarizeResponse(id='77bfa2f5-37a3-4ca4-b789-65a269f64c76',
  summary='Ice cream is a sweetened frozen food made from milk or cream and flavoured with a sweetener and spice. The name varies between countries and ice cream made from dairy alternatives is available.',
  meta={'api_version': {'version': '1'}}
)

一応日本語でも入力は受け取ってくれるけど、出力が英語になる。

from pprint import pprint

text=(
  "アイスクリームは、一般的にスナックやデザートとして食べられる甘味のある冷凍食品である。" 
  "ミルクやクリームから作られ、甘味料で味付けされる。" 
  "砂糖やその代替品、ココアやバニラなどのスパイスで味付けされる。" 
  "イチゴや桃のような果物で味付けされている。" 
  "フレーバークリームのベースと液体窒素を一緒に泡立てて作ることもできる。"
  "安定剤に加えて着色料が加えられることもある。"
  "混合物は水の凝固点以下に冷却され、空隙を取り込むために撹拌される。"
  "氷の結晶ができないようにする。その結果、滑らかな、"
  "非常に低い温度（2℃以下）で固体である半固体の泡である。"
)

response = co.summarize(
  text=text,
)
print(response)

SummarizeResponse(id='1861a112-6d0a-48e9-8267-0b8c73ed312f', 
  summary='Ice cream is a frozen dessert that is typically made from dairy products, such as milk and cream, and often combined with a sweetener, either sugar or an alternative, and a spice, such as cocoa or vanilla, or a fruit, such as strawberries or peaches. The fluid is agitated while it is cooled, usually by electric mixer. Disodium phosphate, dextrose and carrageenan can be added as stabilizers. The mixture is cooled below the freezing point of water and agitated to incorporate air spaces and to prevent detectable ice crystals from forming. The result is a smooth, semi-solid foam that is solid at very low temperatures (below 2 °C).',
  meta={'api_version': {'version': '1'}}
)

あと最低250文字っていう縛りがある。

kun432

ここまでがテキスト生成に関するもの。総じての印象としては、日本語はちょっと弱そう、というところ。

kun432

本題のEmbed関連は後にして、ユーティリティっぽいやつを先に。

kun432

/tokenize（Co.Tokenize）

テキスト文字列をトークン分割する。

response = co.tokenize(
  text='tokenize me! :D',
  model='command' # optional
)
print(response)

cohere.Tokens {
	tokens: [10002, 2261, 2012, 8, 2792, 43]
	token_strings: ['token', 'ize', ' me', '!', ' :', 'D']
	meta: {'api_version': {'version': '1'}}
}

byte-pair encodingを使っているので日本語だとちょっとうまくいかない。

response = co.tokenize(
  text='トークン化してください',
  model='command'
)
print(response)

cohere.Tokens {
	tokens: [33794, 20181, 37122, 23925, 54464, 38016, 51512, 66083, 6758, 262, 6758, 251, 36442]
	token_strings: ['ト', 'ー', 'ク', 'ン', '化', 'し', 'て', 'く', '�', '�', '�', '�', 'い']
	meta: {'api_version': {'version': '1'}}
}

と思いきや、マルチリンガルのEmbeddingモデルを使えばできる！v3.0を使う。

response = co.tokenize(
    text="明日の天気は晴れでしょう",
    model="embed-multilingual-v3.0"
)
print(response)

cohere.Tokens {
	tokens: [6, 81766, 154, 171391, 342, 237785, 12547]
	token_strings: ['', '明日', 'の', '天気', 'は', '晴れ', 'でしょう']
	meta: {'api_version': {'version': '1'}}
}

commandとembedだと使い方が明確に違う気がするんだけど、果たしてこれ正しい使い方なんだろうか？

参考までにいくつか他のモデルを使った場合の比較

import time

models = [
    "command",
    "command-light",
    "command-nightly",
    "embed-english-v3.0",
    "embed-multilingual-v3.0",
    "embed-multilingual-light-v3.0",
    "embed-multilingual-v2.0",
]

for m in models:
    print(f"##### {m} #####\n")
    response = co.tokenize(
        text="明日の天気は晴れでしょう",
        model=m
    )
    print(response)
    print()
    time.sleep(10)     # トライアルのAPIキーはrate limitが10APIコール/分なので

##### command #####

cohere.Tokens {
	tokens: [55299, 38094, 21455, 49413, 36730, 253, 51183, 50972, 120, 74946, 50725, 38016, 6966, 237, 57354]
	token_strings: ['明', '日', 'の', '天', '�', '�', 'は', '�', '�', 'れ', 'で', 'し', '�', '�', 'う']
	meta: {'api_version': {'version': '1'}}
}

##### command-light #####

cohere.Tokens {
	tokens: [55299, 38094, 21455, 49413, 36730, 253, 51183, 50972, 120, 74946, 50725, 38016, 6966, 237, 57354]
	token_strings: ['明', '日', 'の', '天', '�', '�', 'は', '�', '�', 'れ', 'で', 'し', '�', '�', 'う']
	meta: {'api_version': {'version': '1'}}
}

##### command-nightly #####

cohere.Tokens {
	tokens: [55299, 38094, 21455, 49413, 36730, 253, 51183, 50972, 120, 74946, 50725, 38016, 6966, 237, 57354]
	token_strings: ['明', '日', 'の', '天', '�', '�', 'は', '�', '�', 'れ', 'で', 'し', '�', '�', 'う']
	meta: {'api_version': {'version': '1'}}
}

##### embed-english-v3.0 #####

cohere.Tokens {
	tokens: [1865, 1864, 1671, 1811, 100, 1672, 100, 100]
	token_strings: ['明', '日', 'の', '天', '', 'は', '', '']
	meta: {'api_version': {'version': '1'}}
}

##### embed-multilingual-v3.0 #####

cohere.Tokens {
	tokens: [6, 81766, 154, 171391, 342, 237785, 12547]
	token_strings: ['', '明日', 'の', '天気', 'は', '晴れ', 'でしょう']
	meta: {'api_version': {'version': '1'}}
}

##### embed-multilingual-light-v3.0 #####

cohere.Tokens {
	tokens: [6, 81766, 154, 171391, 342, 237785, 12547]
	token_strings: ['', '明日', 'の', '天気', 'は', '晴れ', 'でしょう']
	meta: {'api_version': {'version': '1'}}
}

##### embed-multilingual-v2.0 #####

cohere.Tokens {
	tokens: [6664, 6640, 3714, 5262, 7224, 3715, 6708, 320916, 245911]
	token_strings: ['明', '日', 'の', '天', '気', 'は', '晴', 'れで', '##しょう']
	meta: {'api_version': {'version': '1'}}
}

embed-multilingualの新しいバージョンならちゃんと日本語でもトークン化してくれる様子。実際には余計な品詞を取り除きたかったりもすると思うので、これだけでは足りないと思うけども。

kun432

/detokenize (Co.Detokenize)

byte-pair encodingのトークンから元のテキスト表現を生成する。

first_response = co.tokenize(
    text="明日の天気は晴れでしょう、そして気温も高いでしょう",
    model="command"
)

response = co.detokenize(
  tokens=first_response.tokens,
  model="command"
)
print(response)

明日の天気は晴れでしょう、そして気温も高いでしょう

マルチリンガルのEmbedモデルの新しいバージョンを使っても同じ。

first_response = co.tokenize(
    text="明日の天気は晴れでしょう、そして気温も高いでしょう",
    model="embed-multilingual-v3.0"
)

response = co.detokenize(
  tokens=first_response.tokens,
  model="embed-multilingual-v3.0"
)
print(response)

明日の天気は晴れでしょう、そして気温も高いでしょう

kun432

/detect-language (Co.Detect_Language)

テキストの言語を判定する。

response = co.detect_language(
  texts=['Hello world', "'Здравствуй, Мир'", "こんにちは、世界", "안녕하세요", "世界你好"]
)
print(response)

cohere.DetectLanguageResponse {
    results: [
        Language<language_code: "en", language_name: "English">,
        Language<language_code: "ru", language_name: "Russian">,
        Language<language_code: "ja", language_name: "Japanese">,
        Language<language_code: "ko", language_name: "Korean">,
        Language<language_code: "zh", language_name: "Chinese">
    ]
     meta: {'api_version': {'version': '1'}}
}

kun432

本題のEmbedに移るのだけど、tokenizeでマルチリンガルのEmbed結構強力な気がしてきた。ちょっと期待。

kun432

/embed (Co.Embed)

response = co.embed(
  texts=['hello', 'goodbye'],
  model='embed-english-v3.0',
  input_type='classification'
)
print(response)

cohere.Embeddings {
    embeddings: [
        [
            0.016296387,
            -0.008354187,
            -0.04699707, 
            ...
            0.047332764,
            0.0023212433,
            0.0052719116
        ]
    ]
    compressed_embeddings: []
    meta: {'api_version': {'version': '1'}}
}

input_typeは4種類。なるほど、使い分けが必要になるのか。

search_document: 検索用のベクターデータベースに格納する埋め込み用のドキュメントをエンコードする場合に使用する。
search_query: 関連するドキュメントを検索するためにベクトルDBにクエリを発行する際に使用する。
classification: 埋め込みデータをテキスト分類器の入力として使うときに使う。
clustering: 埋め込みデータをクラスタリングする。

とりあえずマルチリンガルを使ってみる。

response = co.embed(
  texts=['こんにちは', '世界'],
  model='embed-multilingual-v3.0',
  input_type='classification'
)
print(response)

cohere.Embeddings {
    embeddings: [
        [
            0.0072784424,
            0.048553467,
            -0.0056419373,
            0.062683105,
            -0.03982544,
            0.0051574707
        ]
    ]
    compressed_embeddings: []
    meta: {'api_version': {'version': '1'}}
}

元のテキストについては

1コールあたりの最大テキスト数は96
最適な品質を得るために、各テキストの長さを512トークン以下が推奨

とある。ってことは結局512ってことなのか。

あと各embedモデルの次元数は以下。

embed-english-v3.0: 1024
embed-multilingual-v3.0: 1024
embed-english-light-v3.0: 384
embed-multilingual-light-v3.0: 384
embed-english-v2.0: 4096
embed-english-light-v2.0: 1024
embed-multilingual-v2.0: 768

ふと思ったのだけども、

~~tokenizeしてから渡せばいいのでは？~~
~~その場合日本語でちゃんと分かち書きされてるとよりembeddingが有効になるのでは？~~

text = "明日の天気は晴れでしょう、そして気温も高いでしょう"

tokenize_response = co.tokenize(
    text=text,
    model="embed-multilingual-light-v3.0",
)

tokenized_embedding_response = co.embed(
  texts=tokenize_response.token_strings,
  model='embed-multilingual-v3.0',
  input_type='classification'
)

simplefilter_embedding_response = co.embed(
  texts=[text],
  model='embed-multilingual-v3.0',
  input_type='classification'
)

print(len(tokenized_embedding_response.embeddings))
print(len(simplefilter_embedding_response.embeddings))

13
1

~~なるほど、これはあまり意味がないことがわかった。~~

ここについては下を参照。ちゃんと分かち書きとした上でembedされると思って良さそう。

それはともかく、種類も4つあるし、これはちょっと何かしら評価してみないと判断できないな。

kun432

評価してくれてる記事を見つけた

一方、扱える入力長は512までなので、text-embedding-ada-002の8191と比べると小さな値です。

ここ最初にドキュメントを見たときに自分は間違って解釈してて、単に配列で渡せば配列で返してくれるっていうだけ（バッチ実行のイメージ）で、入力できる文字列としての最大長は結局512ということなのね。

あと、

CohereのEmved v3とOpenAIのtext-embedding-ada-002では扱える入力長に差がありますが、同じテキストに対するトークン数を調べてみましょう。以下のテキストを入力として、OpenAIのTokenizerでトークン数を調べると127トークンになりました。一方、Cohereのembed-multilingual-v3.0のトークナイザーでトークン数を調べると63トークンという結果になりました。今回のケースではCohereのモデルは半分程度のトークン数で済んでいます。
東京都と周辺7県で首都圏を構成している。特に東京圏（東京都・神奈川県・千葉県・埼玉県）の総人口は約3500万人に達し、日本の人口の約30%が集中している。東京都市圏としては世界最大級の人口を有する国際的大都市である。

import tiktoken

text = "東京都と周辺7県で首都圏を構成している。特に東京圏（東京都・神奈川県・千葉県・埼玉県）の総人口は約3500万人に達し、日本の人口の約30%が集中している。東京都市圏としては世界最大級の人口を有する国際的大都市である。"

encoding = tiktoken.get_encoding("cl100k_base")
print(len(encoding.encode(text)))

response = co.tokenize(
    text=text,
    model="embed-multilingual-v3.0"
)
print(response.length)

127
63

ここはちゃんと日本語の分かち書きが出来ているからだよね。

print(response.token_strings)

['', '東京都', 'と', '周辺', '7', '県', 'で', '首都', '圏', 'を', '構成', 'している', '。', '特に', '東京', '圏', '(', '東京都', '・', '神奈川県', '・', '千葉県', '・', '埼玉県', ')', 'の', '総', '人口', 'は', '約', '3', '500', '万人', 'に', '達', 'し', '、', '日本の', '人口', 'の', '約', '30%', 'が', '集中', 'している', '。', '東京都', '市', '圏', 'としては', '世界', '最大', '級', 'の', '人口', 'を', '有', 'する', '国際', '的大', '都市', 'である', '。']

トークン分割具合によっては、512トークン上限というのはそれほどインパクトないのかもしれない。それでもOpenAIの8192はだいぶ大きいのだけど。

kun432

/embed (Co.Embed)

Embedモデルを使った分類。

from cohere.responses.classify import Example

examples=[
  Example("見たらすぐにお返事ください", "Spam"),
  Example("プレゼントの受け取り場所はコチラ", "Spam"),
  Example("1億円に当選しました！", "Spam"),
  Example("あなたのアカウントは停止されました!", "Spam"),
  Example("Amazonプライム会費のお支払い方法に問題があります。", "Spam"),
  Example("キャリアを拝見して連絡をしております！", "Spam"),
  Example("Amazonを装った不審なメール等にご注意ください", "Not spam"),
  Example("Apple からの領収書です", "Not spam"),
  Example("配達完了:ご注文商品の配達が完了しました。'", "Not spam"),
  Example("定期購入はキャンセルされました。別の商品をおすすめします", "Not spam"),
]

inputs=[
  "明日の会議について",
  "Re: ご注文にういてのお問い合わせ",
]

response = co.classify(
  inputs=inputs,
  examples=examples,
  model="embed-multilingual-v3.0"
)
print(response)

[
    Classification<prediction: "Spam", confidence: 0.6115304, labels: {'Not spam': LabelPrediction(confidence=0.3884696), 'Spam': LabelPrediction(confidence=0.6115304)}>,
    Classification<prediction: "Spam", confidence: 0.5856705, labels: {'Not spam': LabelPrediction(confidence=0.4143295), 'Spam': LabelPrediction(confidence=0.5856705)}>]
]

まあ両方ともSpamじゃないのだけどね。ちょっとデータが少なすぎるか。

モデルはドキュメントによるとv2.0のモデルだけしか書いてないけど、v3.0でも問題なさそう。

Exampleってたぶんfew shot的な意味合いなんだろうけど、こういうのはモデル自体の学習に依存しそうな気もする。

kun432

/rerank (Co.Rerank)

前フリが長かったけども、今回一番興味があるのはこれ。クエリとドキュメントを比較して再ランキングする。RAGのベクトル検索の精度向上で使えないかなということで。

ただこれはまだv3.0のモデルはなさそう。。。。

docs = [
    "カーソンシティはアメリカのネバダ州の州都である。",
    "北マリアナ諸島連邦は太平洋に浮かぶ島々である。首都はサイパンである。",
    "ワシントンD.C.（単にワシントンまたはD.C.とも呼ばれ、正式にはコロンビア特別区）は、アメリカ合衆国の首都である。連邦区である。",
    "死刑制度はアメリカ合衆国が国である以前から存在する。2017年現在、50州のうち30州で死刑が合法である。"
]

response = co.rerank(
  model = "rerank-multilingual-v2.0",
  query = 'アメリカ合衆国の首都は？',
  documents = docs,
  top_n = 3,
)
print(response)

[
    RerankResult<document['text']: ワシントンD.C.（単にワシントンまたはD.C.とも呼ばれ、正式にはコロンビア特別区）は、アメリカ合衆国の首都である。連邦区である。, index: 2, relevance_score: 0.99992096>,
    RerankResult<document['text']: 北マリアナ諸島連邦は太平洋に浮かぶ島々である。首都はサイパンである。, index: 1, relevance_score: 0.170508>,
    RerankResult<document['text']: カーソンシティはアメリカのネバダ州の州都である。, index: 0, relevance_score: 0.13753247>
]

渡せるドキュメントについては

A list of document objects or strings to rerank.
If a document is provided the text fields is required and all other fields will be preserved in the response.
The total max chunks (length of documents * max_chunks_per_doc) must be less than 10000.

再ランク付けするドキュメントオブジェクトまたは文字列のリスト。
ドキュメントが提供された場合、テキストフィールドは必須であり、その他のフィールドはすべてレスポンスに保存される。
最大チャンクの合計（ドキュメントの長さ * max_chunks_per_doc）は10000未満でなければならない。

とあるけど、max_chunks_per_docはオプションで指定もできるらしいのだけど、これどういう動きをするのだろうか。あと、デフォルト値はどこなのかと探したら以下にあった。

By default, the endpoint will error if the user tries to pass more than 1000 documents at a time because max_chunks_per_doc has a default of 10. The way we calculate the maximum number of documents that can be passed to the endpoint is with this inequality: Number of documents * max_chunks_per_doc >10,000. If Number of documents * max_chunks_per_doc exceeds 10,000, the endpoint will return an error.

これみてもちょっと意味がわからないな。。。Rerankのドキュメントをもう少し読むこととする。

あと、ここに

return_documents（返却される配列にdocumentを含めるか否か）というパラメータも試したのですが、こちらはうまく動作しませんでした。

とあるけど、これコード見てもそんな属性はないように見える。別のところでv3.0は書いてないけど使えたりもしたので、もしかするとドキュメント更新が追いついてないのかも？

と思ったらここにあった。

In the SDK, documents are always returned, and return_documents is not a valid parameter.

ちなみにPythonクライアントのReadTheDocはとても残念な感じ。。。

kun432

Rerankについて

ここはもう少しドキュメントを読んでみることにする。

Re-rankingは伝統的な意味検索の2段階プロセスを改善する機能です。まず、初期の検索機構がドキュメントの集合を大まかにスキャンし、ドキュメントリストを作成します。次に、Re-ranker機構がこの候補ドキュメントリストを取り、要素を再ランク付けします。

CohereのSDKを使用し、co.rerank()モデルにドキュメントを供給し、クエリを行います。必要なパラメータにはドキュメントリスト、検索クエリ、モデルが含まれ、オプショナルなパラメータとして、返されるドキュメント数や最大チャンク数などがあります。

また、Cohereは多言語モデルrerank-multilingual-v2.0を提供しており、これは英語、中国語、フランス語、ドイツ語、インドネシア語、イタリア語、ポルトガル語、ロシア語、スペイン語、アラビア語、オランダ語、ヒンディー語、日本語、ベトナム語などを含む複数の言語で訓練されています。

co.rerank()のパラメータについて

CohereのRe-ranking機能における主要なパラメータは以下の通りです：

ドキュメント (documents): JSONまたは文字列オブジェクトのリストが必要です。再ランク付けするドキュメントのオブジェクトリストを想定しており、通常はtextキーを持っています。

クエリ (query): ドキュメントをランク付けするための検索クエリの文字列が必要です。

モデル (model): 2つのモデルがあります。英語モデルrerank-english-v2.0と多言語モデルrerank-multilingual-v2.0です。

トップN (top_n): オプショナルで、返されるドキュメント数を指定します。デフォルトは提供されたドキュメントリストの長さです。

最大チャンク数 (max_chunks_per_doc): オプショナルで、ドキュメントが512トークンを超える場合、ドキュメントを分割する最大チャンク数を決定します。例えば、ドキュメントが6000トークンの場合、デフォルトの10を使用すると、ドキュメントは512トークンの10個のチャンクに分割され、最後の880トークンは無視されます。

ドキュメントの返却 (return_documents): APIでのみ使用可能なオプショナルなパラメータで、デフォルトはfalseです。trueに設定すると、関連テキストと共にドキュメントが返されます。falseの場合、ドキュメントは返されず、レスポンスオブジェクトには{index, relevance_score}が含まれます。ここでindexはドキュメントの順序を示します。

なるほど、max_chunks_per_docのところは、Embedの最大入力トークンが512だからだよね。で、ある程度文書が大きいとそれを超えることは普通にあるけど、まるっと捨てるんじゃなくて、トークン分割した上で　なるだけ受け取ってくれて、分割数の上限を超えたものだけが捨てられるということか。ということは1文書あたり5120トークンまでは受けれるけども、前述の

最大チャンクの合計（ドキュメントの長さ * max_chunks_per_doc）は10000未満でなければならない。

があるので、この辺との兼ね合いが必要と。

あと、return_documentsはSDKだと意味がないわけね。

In the SDK, documents are always returned, and return_documents is not a valid parameter.

kun432

Rerankのベストプラクティス

CohereのRe-rankingベストプラクティスに関するページの内容は以下の通りです：

1. 最適化パフォーマンス:

制約の最小値と最大値：

ドキュメント数：1〜1000

ドキュメントあたりのトークン数：1〜N/A

クエリあたりのトークン数：1〜256

2. ドキュメントチャンキング:

Cohereはドキュメントを512トークンのチャンクに分割します。例えば、クエリが50トークンでドキュメントが1024トークンの場合、ドキュメントは以下のチャンクに分割されます：

relevance_score_1 = <query[0,50], document[0,460]>

relevance_score_2 = <query[0,50], document[460,920]>

relevance_score_3 = <query[0,50],document[920,1024]>

チャンキングにより制御を行いたい場合は、自分でドキュメントをチャンクに分割することを推奨します。

3. 最大ドキュメント数:

デフォルトでは、エンドポイントはユーザーが一度に1000以上のドキュメントを渡そうとするとエラーを返します。これはmax_chunks_per_docのデフォルト値が10であるためです。最大ドキュメント数の計算方法は次の不等式で表されます：Number of documents * max_chunks_per_doc >10,000。

4. クエリ:

モデルは510トークンのコンテキスト長で訓練されています。クエリが256トークンを超える場合、最初の256トークンに切り捨てられます。

5. 結果の解釈:

co.rerank()からの最も重要な出力は、レスポンスオブジェクトに表示される絶対ランクです。スコアはクエリに依存し、送信されたクエリとパッセージによって高くまたは低くなる可能性があります。関連スコアは0から1の範囲で正規化され、1に近いスコアは高い関連性を、0に近いスコアは低い関連性を示します。関連ドキュメントをフィルタリングするためのスコアのしきい値を決定するには、以下のプロセスを推奨します：

30〜50の代表的なクエリを選択し、各クエリに対して特定のユースケースに対して境界線上にあると考えられるドキュメントを提供し、(クエリ, ドキュメント)ペアのリストを作成します。

このリストをループでrerankエンドポイントに渡し、関連スコアを収集します。

sample_scoresの平均値を使用して、関連性のないドキュメントをフィルタリングする際の基準とします。

ここのsample_scoresってのはそういうプロパティとかがあるのではなくて、relevance_scoreの集合ッていう意味ね。

なるほど、OpenAIのでかいembeddingに慣れてるとあんまり気にしてこなかったけど、トークン数はクエリもドキュメントもちょっとケアが必要な感じだな。

kun432

とりあえずのまとめ

試してみた限り、テキスト生成系は日本語においてはちょっと厳しい。

ただ、Embed/Rerankのマルチリンガルモデルは日本語での使用において非常に魅力を感じる。Tokenizeの動きを見る限りは日本語の分かち書きというところで一定の考慮が入っているようだし、特にRerankについては、Sentence TransformerだとGPUが必要になってくるのがAPI経由でさらっとできるのは非常に強力。

現状Rerankで提供されているモデルはv2.0だけのようで、他のモデルを見てもv2.0よりもv3.0のほうが明らかに良いようにみえるので、マルチリンガルv3.0なRerankモデルで入力トークン長がもっと増えるといいなぁというところ。

あと、Embedのカスタムモデルが作れるというのも、ユースケースによっては役に立つ場合があるかもしれない。

kun432

ドキュメントについては、更新が追いついてなさそう、っていうところと、ちょっと分散してて追いにくい感があるので、物足りなさを感じる。

kun432

とりあえず一旦シンプルなRAGのサンプルで、Cohere Embed/Rerankを作って試してみようと思う。

kun432

トライアルキーの制限

60 APIコール/分
500 APIコール/月

kun432

このスクラップは2023/11/14にクローズされました

Cohereを試す

OpenAI

Anthropic

Cohere

1. モデルの機能と性能

2. 統合とアクセスの容易さ

3. コストと利用条件

4. カスタマイズと拡張性

準備

/generate (Co.Generate)

/chat (Co.Chat)

/summarize (Co.Summrize)

/tokenize（Co.Tokenize）

/detokenize (Co.Detokenize)

/detect-language (Co.Detect_Language)

/embed (Co.Embed)

/embed (Co.Embed)

/rerank (Co.Rerank)

Rerankについて

Rerankのベストプラクティス

1. 最適化パフォーマンス:

2. ドキュメントチャンキング:

3. 最大ドキュメント数:

4. クエリ:

5. 結果の解釈:

とりあえずのまとめ