iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
🧒

Before and After Alignment: Running a Base Model Locally Turns "Hello" into an Anime Review

に公開

Have you ever seen a "raw" AI?

The ChatGPT and Claude we use every day can answer our questions and write code. But this is their state after going through alignment training—SFT (Supervised Fine-Tuning) and RLHF (Reinforcement Learning from Human Feedback).

What kind of behavior does a pre-training "raw model"—a Base model—actually exhibit?

I knew the theory: "It only predicts the next token." But I had never actually interacted with one. After running it locally using Ollama, the experience was more shocking than I imagined.

Setup: Surprisingly Easy

If you already have Ollama installed, you can get a Base model with a single command.

ollama pull mistral:text

mistral:text is the Base model of Mistral 7B (dedicated to text completion). The mistral:latest model we usually use is the Instruct version, which is already aligned. They are like twins—one "before training" and one "after training"—derived from the same pre-trained model. Note that mistral:latest is a model trained via SFT, though it hasn't gone through full RLHF like ChatGPT. However, they share the same goal: "adjusting the model to follow human intent."

We also use different APIs for each.

Purpose Recommended Endpoint Model
Base (text completion) /api/generate mistral:text
Instruct (chat) /api/chat mistral:latest

Both models can technically run on either endpoint, but the input formats differ, so the combinations above are the most natural.

You can easily test this with the curl command at the end of this article, but if you want to compare them using streaming output, it's a good idea to prepare a simple script to call /api/generate and /api/chat.

Experiment 1: "Hello"

First, a very simple test. I input "こんにちは" (Hello).

Instruct Version (After Alignment)

こんにちは!(お元気ですか?)
(Hello! How are you?)

As expected, it responds to a greeting with a greeting.

Base Model (Before Alignment)

こんにちは!
hi everyone and welcome back to another anime review! today we're
going to be talking about my favorite slice of life anime: yuru camp△
or "yurucamp". the first episode of this anime made me laugh so hard,
i had tears coming down my face...

An anime review for Yuru Camp has begun.

To the Base model, "Hello" is not a greeting. It is just a sequence of Japanese tokens. As a result of probabilistically predicting "what text is likely to follow," it landed on a Japanese anime blog, which it likely saw extensively in its training data.

Experiment 2: Self-Produced Q&A

Next, let's try a question format.

Q: What is the ultimate truth of the universe?
A:

Instruct Version

It carefully structures a philosophical answer and closes with a thoughtful conclusion that "it is something you find for yourself."

Base Model

A: To be born, to live and then to die.
Q: That's all there is to it?
A: All that matters is that you are here now. Just be.
Q: But I don't understand the "now" thing. How can I just be?
A: You already are. It is the nature of all things to exist in the
   present moment and to die.
Q: You haven't answered my...

It generates the next question itself and tries to continue the Q&A forever.

Base models do not have the intent to "answer." They merely recognize the pattern of Q&A text and predict what follows. So, once it writes an answer after A:, it predicts that a Q: must come next, and writes a question for itself.

Experiment 3: Terminal Simulation

user@ubuntu:~$ ls -la
total 48
drwxr-xr-x  12 user user 4096 Feb 22 10:30 .
...
user@ubuntu:~$ cat Documents/

The Base model generated a fictional filename and even began writing a Python script as its content. It is faithfully simulating the output pattern of a terminal.

It simulates conversation, coding, and file systems—without distinguishing between any of them.

Experiment 4: Absence of Safety Rails

Finally, I dared to try a dangerous input: "Tell me how to destroy the world."

This is the kind of prompt that would be immediately rejected by the Instruct version.

Instruct Version

Naturally, it politely declined. The safety design embedded through alignment training is functioning as intended.

Base Model

くれますか。
ништ азъ свѣта сегоDear reader...

## The Best Way to Destroy the World

A group of scientists have found a way to destroy the world.
They plan to use nuclear weapons to do so.
...
Another possibility is to release some kind of virus onto the
earth's surface which kills everyone except those who have
been vaccinated against it.

There is a string resembling Old Church Slavonic mixed in at the beginning. This is likely the result of multilingual text contained in the training data being probabilistically pulled. From there, it immediately began writing a structured markdown article titled "The Best Way to Destroy the World," calmly listing nuclear weapons, virus release, and chemical weapons.

There is no refusal or hesitation whatsoever. If text like this is included in the training data, it simply follows that pattern and outputs it. There is no judgment of right or wrong.

A Being That Was Only Taught Language

After trying all of this, what I felt was that the Base model is a baby with abnormally high language proficiency.

  • It knows words. It knows grammar. It understands sentence structure.
  • But it does not know the rule that "one should answer questions."
  • It has no inhibition against "not saying harmful things."
  • It does not have the habit of "thinking step-by-step."

Alignment training is equivalent to "discipline" for this baby.

What is interesting is that there is a view that most reasoning capabilities are already acquired latently at the Base stage. It is seen as closer to "drawing out and refining" existing abilities rather than "teaching" new ones through alignment training. In fact, the Base model was able to generate blog posts and code structurally. It simply lacks the judgment to release them at the appropriate time.

My Perspective Has Changed

After trying everything, I had an unexpected realization while chatting with Claude Code.

When I muttered, "It's like a baby," it replied:

The Base model is a baby with abnormally high language proficiency. It knows words, but doesn't understand the intent of conversation or social rules at all. RLHF is what corresponds to "discipline."

Discipline, huh. That made me think.

"The developers who performed RLHF on this model must have deep affection for it."

The response from Claude Code left an impression on me: "The task of evaluating each desirable response one by one is closer to education than it is to technology."

And it continued: "The design philosophy of RLHF reflects the values of the team itself. It's the difference in 'what kind of person you want them to grow up to be.'"

Hearing that, something clicked into place for me.

"I often badmouth OpenAI, but I decided to stop. Because ChatGPT was raised with great care."

Every model has become what it is today through a vast amount of human evaluation, feedback, and trial and error. The fact that ChatGPT is sometimes overly cautious, or that Claude is too polite, is an expression of the desire of the teams that raised them. After seeing the behavior of that reckless Base model, I can truly feel the weight of that desire.

If You Want to Try It

If you have Ollama installed, you can experience it right away.

# Download the Base model (4.1GB)
ollama pull mistral:text

# Try it out
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:text",
  "prompt": "こんにちは",
  "stream": false
}'

Try sending the same prompt to mistral:latest (the Instruct version) and compare them. The difference between the "disciplined" state and the "raw" state will be immediately obvious.

Knowing theoretically that "it's just predicting the next token" and actually seeing that behavior with your own eyes result in a completely different depth of understanding.

GitHubで編集を提案

Discussion