
Try as many pretrained image-captioning machine learning models as possible with Huggingface Transformers and discuss the results 🤗. Execute:

Published 2023/03/26

Understood 🤗. I will try as many pretrained image-captioning machine learning models as possible with Huggingface Transformers and discuss the results 🤗.

The pretrained models I evaluated are listed below 🤗 (a minimal usage sketch follows the list).

  • ViT GPT-2 image captioning
    • nlpconnect/vit-gpt2-image-captioning
  • GIT base textcaps
    • microsoft/git-base-textcaps
  • GIT large textcaps
    • microsoft/git-large-r-textcaps
  • BLIP image captioning base
    • Salesforce/blip-image-captioning-base
  • BLIP image captioning large
    • Salesforce/blip-image-captioning-large
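
For a quick sanity check before running the full script in the Method section, any of these checkpoints can also be exercised through the generic image-to-text pipeline. This is only a minimal sketch, assuming a reasonably recent transformers release; the model id and image URL are just examples taken from this article.

# Minimal sketch: caption one image with the generic image-to-text pipeline.
# Swap the model id for any of the checkpoints listed above to compare them.
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
result = captioner("https://gyazo.com/aa448d6e6a3c30d7d2837ece5c010990/raw")
print(result[0]["generated_text"])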

Discussion

My observations are as follows 🤗.

  • The following model could not be run because my machine does not have enough resources 🤗 (see the sketch after this list)
    • BLIP-2
  • In addition, the following machine learning models, which are said to have achieved SOTA (state-of-the-art) results in the past, could not easily be tried on Hugging Face 🤗
    • ClipCap
    • OFA
    • I would like to evaluate these models as well at some point 🤗
  • The generated captions vary considerably from one pretrained model to another 🤗
  • GIT even tries to read the text inside the image, but when that text is Japanese it gets it completely wrong 🤗
  • Among the models I tried, Salesforce/blip-image-captioning-large appears to perform best 🤗
    • However, it occasionally produces completely nonsensical output 🤗
    • Achieving this level of accuracy without any fine-tuning is impressive 🤗
  • I look forward to further progress in machine learning models for image caption generation 🤗
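
For reference, the following is a minimal sketch of how BLIP-2 could be tried via transformers (Blip2Processor / Blip2ForConditionalGeneration, available from transformers 4.27, with the Salesforce/blip2-opt-2.7b checkpoint). I could not run this myself due to memory limits, so treat it as an untested sketch; the checkpoint is several GB and needs ample RAM/VRAM.

# Untested sketch: BLIP-2 captioning with Salesforce/blip2-opt-2.7b.
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
blip2_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2_model.to(device)

image_url = "https://gyazo.com/aa448d6e6a3c30d7d2837ece5c010990/raw"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

blip2_inputs = blip2_processor(images=raw_image, return_tensors="pt").to(device)
blip2_generated_ids = blip2_model.generate(**blip2_inputs, max_new_tokens=50)
blip2_caption = blip2_processor.batch_decode(blip2_generated_ids, skip_special_tokens=True)[0].strip()
print("BLIP-2: ", blip2_caption)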

Results

Each pretrained model generated the following image captions 🤗.

----- -----
image url:  https://gyazo.com/aa448d6e6a3c30d7d2837ece5c010990/raw
ViT GPT-2:  a computer screen with a picture of a person 
GIT base:  a white piece of paper with the words " q " on it.
GIT large:  a white piece of paper with the words " q " on it.
BLIP base:  a screenshote of a person with a text that reads,'i'm'and the words that are in japanese
BLIP large:  a close up of a cell phone screen showing a text message from a person on the phone phone
----- -----
image url:  https://gyazo.com/1dfd35d390540fa36c4eaa05f7a72123/raw
ViT GPT-2:  a collage of a tv show with a person in it 
GIT base:  a person is texting on a phone with chinese characters on the screen.
GIT large:  a person is texting on a phone with chinese characters on the screen.
BLIP base:  a screenshote of a man in a white shirt and a woman in a white shirt and a man
BLIP large:  a screenshote of a twitter account with a twitter account and a twitter account on twitter, with a
----- -----
image url:  https://gyazo.com/3315918464c8229d26109bd8a8658322/raw
ViT GPT-2:  a man taking a picture of himself in a mirror 
GIT base:  a person wearing a shirt that says " verizo " on it.
GIT large:  a person wearing a shirt that says " verizo " on it.
BLIP base:  a man taking a self camera in a mirror mirror image taken from the front of a mirror in a mirror
BLIP large:  a woman in a white shirt is taking a selfie with her phone phone and taking a selfie
----- -----
image url:  https://gyazo.com/76a6568aa5e5216a83ba3956fe2a4831/raw
ViT GPT-2:  a car is parked in front of a tall building 
GIT base:  a black van is driving down the street in front of a building with a billboard on it.
GIT large:  a black van is driving down the street in front of a building with a billboard on it.
BLIP base:  a large building with a large sign that says it's the best place to be in the city
BLIP large:  cars are parked on the street in front of a large building with a large advertisement on the side
----- -----
image url:  https://gyazo.com/476a9f87aa0d0086c286ffa410b57ba4/raw
ViT GPT-2:  a street with a bunch of traffic on it 
GIT base:  a billboard for the movie theater is on the corner of a street.
GIT large:  a billboard for the movie theater is on the corner of a street.
BLIP base:  a street scene of a shopping mall with a lot of people walking around it, and a large billboard sign
BLIP large:  cars are parked on the street in a city with many people walking around it and cars driving down the street
----- -----
image url:  https://gyazo.com/d62fc47fef04814c32c336039e8a6c36/raw
ViT GPT-2:  a person holding a cell phone in a kitchen 
GIT base:  a woman taking a picture of herself with a scarf on her head.
GIT large:  a woman taking a picture of herself with a scarf on her head.
BLIP base:  a woman in a black shirt is taking a self - in the kitchen sink and taking a self - in the kitchen
BLIP large:  woman in black dress with black hair and a black hat taking a selfie with her phone phone
----- -----
image url:  https://gyazo.com/f3338211047e3e9a31e15b83279dc316/raw
ViT GPT-2:  a woman in a dress is sitting on a chair 
GIT base:  a woman is sitting in a chair with a cloth tied around her head.
GIT large:  a woman is sitting in a chair with a cloth tied around her head.
BLIP base:  a woman with long hair in a pink dress and a white shirt with a white shirt and a black hair
BLIP large:  a woman with long hair and a ponytail style ponytail style with a ponytail style with a ponytail style
----- -----
image url:  https://gyazo.com/6323d5b189100056a813d40152312a2c/raw
ViT GPT-2:  a man in a suit is holding a sign 
GIT base:  a poster for a man with a shirt that says " i'm a. "
GIT large:  a poster for a man with a shirt that says " i'm a. "
BLIP base:  a poster for a japanese movie poster for the film,'the karate kid's room, with a
BLIP large:  a poster of a man in a blue jacket and jeans jacket with a white shirt and jeans jacket

Method

The script I ran is shown below 🤗.

app.py
import sys
import requests
from PIL import Image
from transformers import VisionEncoderDecoderModel
from transformers import AutoProcessor, AutoModelForCausalLM, AutoTokenizer
from transformers import ViTImageProcessor
from transformers import BlipProcessor, BlipForConditionalGeneration
#from transformers import OFATokenizer, OFAModel
import torch

if len(sys.argv) == 1:
  print("URL of Image is missing!")
  print("python3 app.py [image_url]")
  sys.exit(1)

device = "cuda" if torch.cuda.is_available() else "cpu"

print("----- -----")
image_url = sys.argv[1]
print("image url: ", image_url)
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')

#
# ViT GPT-2
#
vit_gpt2_model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
vit_gpt2_model.to(device)
vit_gpt2_feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
vit_gpt2_tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
vit_gpt2_inputs = vit_gpt2_feature_extractor(images=raw_image, return_tensors="pt").to(device)
vit_gpt2_generated_ids = vit_gpt2_model.generate(**vit_gpt2_inputs, max_length=50)
vit_gpt2_generated_caption = vit_gpt2_tokenizer.batch_decode(vit_gpt2_generated_ids, skip_special_tokens=True)[0]
print("ViT GPT-2: ", vit_gpt2_generated_caption)

#
# GIT
#
# base textcaps
git_base_processor = AutoProcessor.from_pretrained("microsoft/git-base-textcaps")
git_base_model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textcaps")
git_base_model.to(device)
git_base_inputs = git_base_processor(images=raw_image, return_tensors="pt").to(device)
git_base_generated_ids = git_base_model.generate(**git_base_inputs, max_length=50)
git_base_generated_caption = git_base_processor.batch_decode(git_base_generated_ids, skip_special_tokens=True)[0]
print("GIT base: ", git_base_generated_caption)
# large textcaps
git_large_processor = AutoProcessor.from_pretrained("microsoft/git-large-r-textcaps")
git_large_model = AutoModelForCausalLM.from_pretrained("microsoft/git-large-r-textcaps")
git_large_model.to(device)
git_large_inputs = git_large_processor(images=raw_image, return_tensors="pt").to(device)
git_large_generated_ids = git_large_model.generate(**git_large_inputs, max_length=50)
git_large_generated_caption = git_large_processor.batch_decode(git_large_generated_ids, skip_special_tokens=True)[0]
print("GIT large: ", git_large_generated_caption)

#
# BLIP
#
# image captioning base
blip_base_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_base_caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_base_caption_model.to(device)
blip_base_caption_inputs = blip_base_processor(raw_image, return_tensors="pt").to(device)
blip_base_caption_outputs = blip_base_caption_model.generate(**blip_base_caption_inputs, min_length=20, max_length=50)
blip_base_generated_caption = blip_base_processor.decode(blip_base_caption_outputs[0], skip_special_tokens=True)
print("BLIP base: ", blip_base_generated_caption)
# image captioning large
blip_large_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip_large_caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
blip_large_caption_model.to(device)
blip_large_caption_inputs = blip_large_processor(raw_image, return_tensors="pt").to(device)
blip_large_caption_outputs = blip_large_caption_model.generate(**blip_large_caption_inputs, min_length=20, max_length=50)
blip_large_generated_caption = blip_large_processor.decode(blip_large_caption_outputs[0], skip_special_tokens=True)
print("BLIP large: ", blip_large_generated_caption)

images.txt
https://gyazo.com/aa448d6e6a3c30d7d2837ece5c010990
https://gyazo.com/1dfd35d390540fa36c4eaa05f7a72123
https://gyazo.com/3315918464c8229d26109bd8a8658322
https://gyazo.com/76a6568aa5e5216a83ba3956fe2a4831
https://gyazo.com/476a9f87aa0d0086c286ffa410b57ba4
https://gyazo.com/d62fc47fef04814c32c336039e8a6c36
https://gyazo.com/f3338211047e3e9a31e15b83279dc316
https://gyazo.com/6323d5b189100056a813d40152312a2c

caption.sh
cat images.txt | xargs -d '\n' -I {} python app.py {}/raw
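
For environments without GNU xargs, a minimal Python-only alternative to caption.sh could look like the sketch below; it assumes app.py and images.txt sit in the current directory and appends /raw to each URL, just as the shell script does.

# Sketch: drive app.py over every URL in images.txt without xargs.
import subprocess

with open("images.txt") as f:
    for line in f:
        url = line.strip()
        if url:
            subprocess.run(["python", "app.py", url + "/raw"], check=True)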
