How LLMs Built for Text Completion Can Generate and Recognize Images


Introduction

"LLMs (Large Language Models)" are widely known as models that predict the continuation of text.
However, LLMs have become capable of more than just words; they can now generate images and understand and describe their content.
How is it possible that a model which is only supposed to predict text can do all of that?

In this article, I will explain the mechanism as simply as possible.

What Exactly Do LLMs Do?

The basics of LLMs are as follows:

  • They learn to predict the "next word" from a vast amount of data.
  • They acquire a mechanism to look at the context (surrounding words) to make predictions.
  • As a result, they become able to produce natural text like a human.

At the foundation of this mechanism is a model called the Transformer, which features a mechanism known as "Self-Attention" that focuses on important parts within the text.
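The "predict the next word" objective can be sketched with a toy counting model. This is not how a real LLM works internally (a Transformer uses learned attention over the whole context, not raw counts), but the training goal is the same idea: given what came before, guess what comes next.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which in a tiny corpus.
# Real LLMs replace the counting table with a Transformer, but the objective
# is the same: predict the next token from the context.
corpus = "the cat sat on the mat the cat ate the fish".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the word that most frequently followed `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in this corpus
```

A real model outputs a probability for every word in its vocabulary rather than a single guess, which is what lets it generate varied, natural-sounding text.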

Why They Can Now Handle More Than Just Text

Text is not the only kind of information: images and audio are information too.
To handle these, LLMs have been improved in the following ways:

1. Converting images into numerical sequences

The images humans see are collections of pixels, but an AI cannot handle them in that raw form.

While humans can intuitively understand images as "drawings" or "photographs," an image on its own has no meaning to an AI.
This is because AI only understands "numbers."

Therefore, images are converted into sequences of numbers called vectors or features.

Through this conversion, images can be treated in the same way as words in a sentence.

How do you turn them into numbers?

Images are actually made up of a collection of very small dots.
These dots are called pixels.

  • Each pixel contains "color" information.
  • Colors are often represented by numerical values for the intensity of Red, Green, and Blue (RGB).
  • Examples:
    • Pure red → (255, 0, 0)
    • Black → (0, 0, 0)
    • White → (255, 255, 255)

Looking at the entire image:

many pixels × their respective color values = a very long list of numbers.

This is what "converting an image into a numerical sequence" means.

Conversion from pixels to numerical sequences (vectors)

Suppose we have a small image of 2 pixels wide × 2 pixels high.

[ Red ] [ Blue ]
[ White ] [ Black ]

Representing this with RGB values looks like this:

Red → (255, 0, 0)
Blue → (0, 0, 255)
White → (255, 255, 255)
Black → (0, 0, 0)

Arranging these in order:

[255, 0, 0, 0, 0, 255, 255, 255, 255, 0, 0, 0]

This large number of digits lined up in a single row is the vector (numerical sequence) that AI handles.
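The 2×2 example above can be flattened in a few lines of code. This is just the raw-pixel version of "image to vector"; real systems add the feature-extraction step described next.

```python
# The 2×2 image from the example, as (R, G, B) tuples in reading order.
pixels = [
    (255, 0, 0),      # red
    (0, 0, 255),      # blue
    (255, 255, 255),  # white
    (0, 0, 0),        # black
]

# Flatten every channel of every pixel into one long list: the vector.
vector = [channel for pixel in pixels for channel in pixel]
print(vector)
# [255, 0, 0, 0, 0, 255, 255, 255, 255, 0, 0, 0]
```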

Why is it called a "Vector"?

In the world of mathematics and machine learning, a sequence of numbers arranged in order is called a "vector."

  • Images → Vectors of thousands to millions of numbers
  • Sentences → Vectors where words are converted into numerical values

In this way, different types of data can be handled in the same format: a "sequence of numbers."

Feature Extraction (An additional stage of processing)

In actual AI, rather than just lining up pixels, important features such as:

  • Outlines
  • Shapes
  • Patterns
  • Color distribution

are extracted and converted into more meaningful, shorter vectors.

The models used for this include:

  • CNN (Convolutional Neural Network)
  • Vision Transformer (ViT)
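The first step of a Vision Transformer can be sketched as "cut the image into patches, flatten each patch." The sizes below are made up for illustration (real ViTs typically use 16×16-pixel patches and then project each patch through a learned layer), but the cutting itself looks like this:

```python
# Minimal sketch of ViT-style patching: split a 2-D image into fixed-size
# square patches and flatten each patch into its own vector. In a real ViT,
# each flattened patch is then linearly projected and fed to a Transformer
# exactly like a sequence of word tokens.

def split_into_patches(image, patch_size):
    """image: 2-D grid (list of rows) of pixel values; returns flattened patches."""
    patches = []
    for top in range(0, len(image), patch_size):
        for left in range(0, len(image[0]), patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches

# A tiny 4×4 grayscale "image" split into four 2×2 patches.
image = [[ 1,  2,  3,  4],
         [ 5,  6,  7,  8],
         [ 9, 10, 11, 12],
         [13, 14, 15, 16]]
print(split_into_patches(image, 2))
# [[1, 2, 5, 6], [3, 4, 7, 8], [9, 10, 13, 14], [11, 12, 15, 16]]
```

This is why a Transformer built for word sequences can consume an image at all: after patching, the image *is* a sequence.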

Why convert to vectors (features)?

By converting to numerical values, AI becomes capable of the following:

  • Comparing how "similar" images are to each other
  • Judging that "this image looks like a cat" or "this is close to a dog"
  • Comparing images and text on the same playing field (numbers)

In short, by turning images into numbers, AI can perform calculations and comparisons.
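"Comparing how similar two images are" becomes ordinary arithmetic once both are vectors. A common measure is cosine similarity; the four-number "features" below are invented purely for illustration.

```python
import math

# Cosine similarity: 1.0 means the vectors point the same way (very similar),
# values near 0 mean they are unrelated.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 4-number feature vectors, for illustration only.
cat_photo_1 = [0.9, 0.1, 0.8, 0.2]
cat_photo_2 = [0.8, 0.2, 0.9, 0.1]
car_photo   = [0.1, 0.9, 0.2, 0.8]

print(cosine_similarity(cat_photo_1, cat_photo_2))  # close to 1.0
print(cosine_similarity(cat_photo_1, car_photo))    # much lower
```

"This image looks like a cat" is, under the hood, exactly this kind of comparison against features the model has learned.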

Thinking with an Analogy

  • Human:
    • Can intuitively judge, "This photo is a cat."
  • AI:
    • "This sequence of numbers is close to 'cat numbers' I've seen in the past."
    • → Judges, "There is a high probability that this is a cat."

The sequence of numbers serving as this "basis for judgment" is called a vector or feature.

In this way, by converting images into numerical sequences, AI puts images into a "form it can understand" for learning and inference.

2. Learning Images and Text Together

When trained on data that pairs images with their descriptions, the model learns the correspondence between images and text.

This method of learning multiple types of information together is called multimodal learning.

Multimodal Learning

Specifically, for example, a large amount of data like the following is prepared:

  • Photo of a cat

  • The sentence "A cat curled up and sleeping on a sofa"

In the model:

  • Images are converted into numerical values by an "image encoder."

  • Text is converted into numerical values by a "text encoder."

Then the model learns that

  • this sequence of numerical values for the image

  • this sequence of numerical values for the sentence

are semantically close.

The important point is that it is not being taught the word "cat" directly.
It simply memorizes a statistical relationship between numbers: which image features and which text features frequently appear together.
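A toy version of this "shared space" idea (the approach popularized by CLIP-style models) can be sketched as follows. The encoders here are fake hand-made lookup tables standing in for trained networks; the point is only that once images and text live in the same numeric space, matching them is a similarity search.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Pretend encoder outputs *after* multimodal training: matched image/text
# pairs have ended up close together in the shared space. These numbers are
# invented for illustration.
image_features = {
    "cat_photo.jpg": [0.9, 0.1],
    "dog_photo.jpg": [0.1, 0.9],
}
text_features = {
    "a cat sleeping on a sofa": [0.8, 0.2],
    "a dog running in a park":  [0.2, 0.8],
}

# Retrieval: which caption best matches the cat photo?
img = image_features["cat_photo.jpg"]
best = max(text_features, key=lambda t: cosine(img, text_features[t]))
print(best)  # the cat caption wins, by similarity alone
```

Training adjusts the real encoders so that this kind of match comes out right across millions of image/caption pairs; no one ever tells the model what "cat" means.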

What kind of learning is this in human terms?

This is very easy to understand when compared to human learning.

For example, with a small child:

  • You point at a cat and say "It's a cat" over and over.

  • You show them a dog and say "It's a dog."

By repeating this, the child gradually learns the correspondence: "this appearance = this word."

AI is much the same:

  • When this appearance is present, this word usually appears with it.

  • When this word appears, this appearance usually corresponds to it.

It learns these relationships little by little from a vast amount of data.

Why this enables image understanding

By repeating this learning, within the model:

  • Similar images are positioned close to each other.

  • Sentences with similar meanings are also positioned close to each other.

As a result:

  • When shown an image, it can output a sentence with a similar meaning.

  • When given a sentence, it can imagine an image with similar features.

This is what is really happening behind the impression that "LLMs, which are supposed to only complete text, can understand and generate images."

How image generation (Text-to-Image) is possible

The mechanism for creating an "image" from "text" is currently dominated by a technology called Diffusion Models. This is achieved by combining the "language understanding" of LLMs with the "drawing power" of image generation AI.

1. Carving an image out of "static"

Image generation AI doesn't draw with a brush on a blank canvas; instead, it uses a counterintuitive method: it starts from "noise" (a grainy image like TV static) and completes a picture by gradually removing that noise.

  1. Initially, pure static (random noise) is prepared.
  2. The AI, told that "there should be a picture of a cat here," gradually removes the noise.
  3. A cat's outline slowly emerges, eventually becoming a clear image.
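The three steps above can be caricatured in code. This is a cartoon, not a real diffusion model: where a real model uses a trained neural network (conditioned on the text prompt) to predict the noise at each step, the stand-in function below simply measures the distance to a target. The loop structure — start from noise, subtract a little predicted noise, repeat — is the part being illustrated.

```python
import random

random.seed(0)
target = [0.2, 0.8, 0.5, 0.1]              # the "cat picture" we want to reach
image = [random.random() for _ in target]  # step 1: pure random noise

def predicted_noise(current, target):
    """Stand-in for the trained network: how far each value is from the target."""
    return [c - t for c, t in zip(current, target)]

for step in range(50):                     # step 2: gradually remove the noise
    noise = predicted_noise(image, target)
    image = [c - 0.1 * n for c, n in zip(image, noise)]

# step 3: a clear "image" has emerged, very close to the target
print([round(v, 2) for v in image])
```

Each pass removes only a small fraction of the estimated noise, which is why diffusion sampling takes many steps rather than one.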

2. LLM as the "Director"

So, how is it decided "what kind of picture to make"? This is where the LLM (or text encoder) comes in.

  • The role of the LLM (Director): It understands the meaning of the sentence (prompt) entered by the user and converts it into a "numerical instruction sheet" that the AI can understand.
  • The role of the image generation model (Painter): Following that instruction sheet, it makes the shape emerge from the static exactly as instructed.

In short, it is precisely because there is an LLM that can deeply understand the meaning of text that accurate instructions can be given to the painter.
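The director/painter division of labor can be sketched with two fake functions. Both are invented for illustration (a real text encoder produces a dense embedding, not keyword counts, and a real generator runs the diffusion loop); the point is that the painter only ever sees numbers, never the words themselves.

```python
# Toy "director + painter" pipeline. All names and logic here are
# illustrative stand-ins, not a real text-to-image system.

KEYWORDS = ["cat", "dog", "sofa", "park"]

def text_encoder(prompt):
    """Fake director: turn the prompt into a numeric 'instruction sheet'."""
    return [prompt.count(k) for k in KEYWORDS]

def painter(instruction):
    """Fake painter: decide what to draw from the numbers alone."""
    return "drawing: " + ", ".join(k for k, n in zip(KEYWORDS, instruction) if n)

instruction = text_encoder("a cat sleeping on a sofa")
print(instruction)           # [1, 0, 1, 0]
print(painter(instruction))  # drawing: cat, sofa
```

The quality of the final image depends heavily on how well the director understands the prompt, which is why strong language models improved image generation so much.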

Why a text-only mechanism can also handle images

Both text and images are ultimately treated as sequences of numbers.
LLMs are models that learn the patterns and relationships of those numbers.

Therefore, the power of pattern understanding cultivated through text can also be applied to images.

Summary

  • LLMs are trained by predicting the continuation of text.
  • Images are converted into numerical values and treated the same way as text.
  • Image generation uses a mechanism that makes a picture emerge from "noise" (Diffusion Models).
  • LLMs understand the "meaning of words" and give precise instructions to the generation model.

The reason LLMs can handle images is that the mechanism itself is very versatile.
