Implementing Memory for Local LLMs (Ollama) in 200 Lines: Replacing the Slow Mem0 with ChromaDB

Introduction

ChatGPT and Claude have a "memory" feature: they carry preferences and facts you shared in one conversation over to the next.

Local LLMs run through Ollama lack this. Every session starts from a clean slate, so no matter how many times you say "My favorite food is sushi," a later session will answer, "I don't know your preferences."

This article records my trial and error in implementing a memory function for local LLMs running on Ollama. It covers why my first choice, Mem0, proved impractical, and how I ended up building my own fast memory system using only ChromaDB and embeddings.

Final Configuration:

  • Chat model: gemma3:12b (Ollama)
  • Embedding model: nomic-embed-text (Ollama)
  • Vector DB: ChromaDB (Local persistent)
  • OS: Windows / macOS (Compatible with both)

Methodologies for Giving Memory to Local LLMs

There are three main ways to achieve memory.

1. RAG (Retrieval-Augmented Generation) Method

Past conversations are stored in a vector DB. When a new query comes in, relevant memories are retrieved via similarity search and injected into the prompt. This is the most common and practical approach.

User Input → Embedding → Vector DB Search → Retrieve Relevant Memories
    ↓
Add memories to system prompt
    ↓
LLM responds based on memories
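
The retrieval step in this flow can be illustrated without any vector DB. The following is a minimal sketch, assuming toy 3-dimensional vectors that are made up for illustration (real embeddings have hundreds of dimensions); cosine_similarity and retrieve are hypothetical helper names, and a plain list stands in for the database.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec: list[float], store: list[tuple[str, list[float]]],
             top_k: int = 2) -> list[str]:
    """Return the texts of the top_k stored memories most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]

# Toy "embeddings" (invented values, not produced by any model)
toy_store = [
    ("User likes sushi", [0.9, 0.1, 0.0]),
    ("User plays guitar", [0.0, 0.2, 0.9]),
    ("User works in Tokyo", [0.1, 0.9, 0.1]),
]
print(retrieve([1.0, 0.0, 0.1], toy_store, top_k=1))  # ['User likes sushi']
```

A real vector DB does the same ranking with an index instead of a full sort, which is what keeps search fast as memories accumulate.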

2. Summary-based Memory

Every time a conversation gets long, the LLM itself is asked to summarize past conversations to keep them compact. LangChain's ConversationSummaryMemory falls into this category.

Because the LLM must be called every time a summary is needed, this approach has high processing costs for local LLMs.
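
To make the cost concrete, here is a rough sketch of a rolling summary. It is not LangChain's implementation; update_summary, the max_chars threshold, and the summarize callback are all assumptions for illustration. The point is that the summarize step, which would be an LLM call in practice, fires whenever the accumulated text grows past the limit.

```python
from typing import Callable

def update_summary(summary: str, new_turns: list[str],
                   summarize: Callable[[str], str],
                   max_chars: int = 200) -> str:
    """Fold new turns into the running summary, re-summarizing only when too long."""
    combined = (summary + "\n" + "\n".join(new_turns)).strip()
    if len(combined) <= max_chars:
        return combined          # still short: no LLM call needed
    return summarize(combined)   # too long: one (expensive) LLM call

# Stand-in summarizer; a real one would send the text to the chat model
def fake_summarize(text: str) -> str:
    return text[:50] + "..."

running = update_summary("", ["User: My hobby is guitar."], fake_summarize)
print(running)
```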

3. Key-Value Store Method

Structured information, such as user profiles or preferences, is stored in JSON/SQLite and inserted into the system prompt each time. It is simple, but requires writing rule-based logic to decide "what to store."
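
As an illustration of this approach, the sketch below keeps profile facts in SQLite and renders them as lines for the system prompt. The table layout and the function names (open_profile_db, set_fact, profile_as_prompt) are hypothetical; the hard part the text mentions, deciding which facts to extract from a conversation, is deliberately left out.

```python
import sqlite3

def open_profile_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a tiny key-value table for user profile facts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS profile (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn

def set_fact(conn: sqlite3.Connection, key: str, value: str) -> None:
    """Insert a fact, overwriting any previous value stored under the same key."""
    conn.execute("INSERT OR REPLACE INTO profile (key, value) VALUES (?, ?)", (key, value))
    conn.commit()

def profile_as_prompt(conn: sqlite3.Connection) -> str:
    """Render all facts as bullet lines suitable for appending to a system prompt."""
    rows = conn.execute("SELECT key, value FROM profile ORDER BY key").fetchall()
    return "\n".join(f"- {k}: {v}" for k, v in rows)

conn = open_profile_db()
set_fact(conn, "favorite_food", "sushi")
set_fact(conn, "hobby", "guitar")
print(profile_as_prompt(conn))
```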

I adopted the RAG method. It needs no logic for deciding what to store, and because retrieval quality is delegated to the embedding model, it stays simple to implement even in a local environment.

The First Choice: Mem0 (and its failure)

Why I chose Mem0

Mem0 (distinct from old MemGPT-related libraries) is a library designed as a memory layer for LLMs. It has native integration with Ollama, and its ease of setup via pip install mem0ai was attractive.

pip install mem0ai
ollama pull nomic-embed-text

from mem0 import Memory

config = {
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "gemma3:12b",
            "temperature": 0.1,
            "ollama_base_url": "http://localhost:11434",
        }
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text",
            "ollama_base_url": "http://localhost:11434",
            "embedding_dims": 768,
        }
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "mem0_ollama",
            "embedding_model_dims": 768,
        }
    },
    "version": "v1.1",
}

m = Memory.from_config(config)
m.add("I like Python and my hobby is playing guitar", user_id="taro")
memories = m.search("What is my hobby?", user_id="taro")

The code looks simple and beautiful. However, that's where the problems started.

Pitfall 1: Vector Dimension Mismatch

Upon the first startup, Qdrant, used internally by Mem0, created a collection with 1536 dimensions (for OpenAI's Embedding model) by default. Since nomic-embed-text uses 768 dimensions, a dimension mismatch error occurred during search.

ValueError: shapes (0,1536) and (768,) not aligned

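To resolve this, I had to explicitly specify embedding_dims: 768 in the embedder config and embedding_model_dims: 768 in the vector_store config. I also had to delete the old DB data, because changing the config alone did not fix the collection that had already been created with 1536 dimensions.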

# Deletion of old DB is required
rm -rf ~/.mem0 ~/.qdrant  # macOS/Linux

Although Mem0's documentation mentions Ollama integration, it did not sufficiently cover such explicit dimension specifications or the deletion of existing data.
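
The rm command above only covers macOS/Linux. Since this article targets Windows as well, a cross-platform equivalent could be sketched in Python as follows. The helper name remove_dirs is hypothetical, and the directory names are simply the ones from the command above; confirm where your installation actually stores its data before deleting anything.

```python
import shutil
from pathlib import Path

def remove_dirs(paths: list[Path]) -> list[Path]:
    """Delete each directory that exists; return the ones actually removed."""
    removed = []
    for p in paths:
        if p.is_dir():
            shutil.rmtree(p)  # recursive delete, like rm -rf
            removed.append(p)
    return removed
```

Calling remove_dirs([Path.home() / ".mem0", Path.home() / ".qdrant"]) would mirror the shell command. Directories that do not exist are skipped, so the call is safe to repeat.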

Pitfall 2: Fatal Slowness

After solving the dimension problem and getting it to run, I hit a far more serious issue: after each response was displayed, it took a very long time before the next prompt appeared.

The cause lies in Mem0's add() method (m.add() in the code above). When Mem0 stores a conversation in memory, it performs the following internally:

  1. Sends the conversation to the LLM to decide "what should be remembered"
  2. Embeds the extracted information
  3. Stores it in the vector DB

In other words, the LLM is called twice per conversational turn: once for the chat response and once for memory extraction. For a model like gemma3:12b, this "hidden LLM call" takes tens of seconds, making the chat feel unusable.

Even when I ran it in a background thread, it still occupied Ollama's GPU resources and slowed the response to the next query. The situation was no different on a MacBook Pro M2.
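
To see where the time goes, each step can be timed separately. Below is a minimal sketch; timed is a hypothetical helper, and the time.sleep calls are stand-ins for the chat request and the Mem0 add call, so the numbers it prints are not measurements of Mem0.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(label: str):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Stand-ins for the real calls; replace with the chat request and m.add(...)
with timed("chat"):
    time.sleep(0.01)
with timed("memory_add"):
    time.sleep(0.02)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

Wrapping the real calls this way makes it easy to check whether the delay before the next prompt comes from the response itself or from the save step.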

Lesson from Mem0

I believe Mem0 is a good choice when combined with fast external APIs like the OpenAI API. However, with local LLMs, the architecture of "calling the LLM every time for memory extraction" becomes a fatal bottleneck.

Building In-House Memory: ChromaDB + Embedding

I discarded Mem0 and decided to implement a lightweight memory system from scratch. The design philosophy is simple:

Do not use an LLM to save memories. Complete the process using only Embeddings.

Architecture

[Saving]
Conversation Pair → Embed user input → Save to ChromaDB
(No LLM required, completes in tens of milliseconds)

[Retrieval]
New Input → Embed → Cosine similarity search in ChromaDB → Retrieve relevant memories
    ↓
Inject into system prompt → LLM responds

Comparison with Mem0:

Item                 | Mem0                                   | In-House Implementation
---------------------|----------------------------------------|-------------------------------------------
LLM calls per turn   | 2 times (response + memory extraction) | 1 time (response only)
Memory saving speed  | Tens of seconds (LLM dependent)        | Tens of milliseconds (Embedding only)
Dependency packages  | mem0ai, qdrant-client, etc.            | chromadb and requests
Memory quality       | High (summarized by LLM)               | Stores raw conversation (practical enough)

Setup

pip install chromadb requests
ollama pull nomic-embed-text
ollama pull gemma3:12b  # Model of your choice

Retrieving Embeddings

Call Ollama's /api/embed endpoint directly. This reduces dependencies by avoiding extra libraries.

import requests

OLLAMA_BASE_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"

def get_embedding(text: str) -> list[float]:
    resp = requests.post(
        f"{OLLAMA_BASE_URL}/api/embed",
        json={"model": EMBED_MODEL, "input": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embeddings"][0]
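
Since a dimension mismatch was the first pitfall with Mem0, it may be worth checking the vector length once before storing anything. The following is a small sketch; check_dims and EXPECTED_DIMS are hypothetical names, and 768 is the nomic-embed-text dimension mentioned earlier in this article.

```python
EXPECTED_DIMS = 768  # nomic-embed-text, as stated earlier in this article

def check_dims(embedding: list[float], expected: int = EXPECTED_DIMS) -> list[float]:
    """Return the embedding unchanged, or raise if its length is unexpected."""
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, expected {expected}; "
            "check the embedding model and any existing collection"
        )
    return embedding
```

For example, wrapping the call as check_dims(get_embedding("hello")) at startup would surface a wrong model or a stale collection immediately, instead of as a shape error during search.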

SimpleMemory Class

A lightweight memory class that wraps ChromaDB. It handles only four operations: saving, searching, listing, and deleting.

import os
import time
import hashlib
import chromadb

MEMORY_DIR = os.path.join(os.path.expanduser("~"), ".chat_memory")
MAX_RECALL = 5

class SimpleMemory:
    def __init__(self, user_id: str):
        os.makedirs(MEMORY_DIR, exist_ok=True)
        self.client = chromadb.PersistentClient(path=MEMORY_DIR)
        self.user_id = user_id
        self.collection = self.client.get_or_create_collection(
            name=f"memory_{user_id}",
            metadata={"hnsw:space": "cosine"},
        )

    def add(self, user_input: str, assistant_response: str):
        """Save conversation pair to memory"""
        text = f"User: {user_input}\nAssistant: {assistant_response}"
        doc_id = hashlib.md5(text.encode()).hexdigest()
        embedding = get_embedding(user_input)

        self.collection.upsert(
            ids=[doc_id],
            embeddings=[embedding],
            documents=[text],
            metadatas=[{"timestamp": time.time(), "user_input": user_input}],
        )

    def search(self, query: str, n: int = MAX_RECALL) -> list[str]:
        """Retrieve relevant memories for a query"""
        if self.collection.count() == 0:
            return []

        embedding = get_embedding(query)
        results = self.collection.query(
            query_embeddings=[embedding],
            n_results=min(n, self.collection.count()),
        )
        return results["documents"][0] if results["documents"] else []

    def get_all(self) -> list[dict]:
        """Retrieve all memories (newest first)"""
        if self.collection.count() == 0:
            return []
        results = self.collection.get(include=["documents", "metadatas"])
        memories = []
        for doc, meta in zip(results["documents"], results["metadatas"]):
            memories.append({"text": doc, "time": meta.get("timestamp", 0)})
        memories.sort(key=lambda x: x["time"], reverse=True)
        return memories

    def clear(self):
        """Clear all memories"""
        self.client.delete_collection(f"memory_{self.user_id}")
        self.collection = self.client.get_or_create_collection(
            name=f"memory_{self.user_id}",
            metadata={"hnsw:space": "cosine"},
        )

The key points are:

  • Embed only user input: Including the assistant's response in the embedding can reduce search accuracy when a similar question is asked later. Computing similarity from the user's own words retrieves past conversations on the same topic more accurately.
  • Prevent duplicates with upsert: Because the hash of the conversation text is used as the ID, saving the same conversation pair again overwrites the existing entry instead of adding a duplicate.
  • Persistence with PersistentClient: Data survives PC restarts. Stored in ~/.chat_memory/.
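
The deduplication behavior can be checked in isolation. In this sketch a plain dict stands in for the ChromaDB collection, and make_doc_id is a hypothetical helper that reuses the ID formula from SimpleMemory.add.

```python
import hashlib

def make_doc_id(user_input: str, assistant_response: str) -> str:
    """Same formula as SimpleMemory.add: MD5 of the combined conversation text."""
    text = f"User: {user_input}\nAssistant: {assistant_response}"
    return hashlib.md5(text.encode()).hexdigest()

# A dict stands in for the collection: writing the same key twice overwrites
store: dict[str, str] = {}
for _ in range(3):
    store[make_doc_id("I like sushi", "Noted!")] = "I like sushi"

# Same user input, different assistant reply: the hash changes
store[make_doc_id("I like sushi", "Got it, sushi!")] = "I like sushi"

print(len(store))  # 2
```

Note the consequence: the ID covers the assistant's reply too, so repeating the same statement creates a new entry whenever the model answers differently. Hashing only the user input would be one way to change that, at the cost of overwriting earlier replies.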

Injecting Memories into the Prompt

Embed the retrieved memories into the system prompt and pass them to the LLM.

SYSTEM_PROMPT = "You are a helpful Japanese assistant. If information about the user's past is provided, please incorporate it naturally into the conversation."

def chat_with_ollama(user_input, memories, history):
    memory_text = ""
    if memories:
        memory_text = (
            "\n\n【Information from past conversations with this user】\n"
            + "\n".join(f"- {m}" for m in memories)
        )

    system = SYSTEM_PROMPT + memory_text

    messages = [{"role": "system", "content": system}]
    messages.extend(history[-10:])  # Last 10 messages (5 user/assistant exchanges)
    messages.append({"role": "user", "content": user_input})

    # Ollama API call (non-streaming here, for simplicity)
    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/chat",
        json={"model": "gemma3:12b", "messages": messages, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

With this method, the LLM is called only once per chat response. Since memory saving is completed via embedding alone, it finishes in an instant.
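
The prompt assembly can also be separated from the HTTP call, which makes it possible to inspect exactly what the model receives without running Ollama. The following is a sketch under that assumption; build_messages and its max_history parameter are hypothetical names, while the message layout follows the function above.

```python
def build_messages(system_prompt: str, memories: list[str],
                   history: list[dict], user_input: str,
                   max_history: int = 10) -> list[dict]:
    """Assemble the message list: system prompt (+ memories), recent history, new input."""
    system = system_prompt
    if memories:
        system += (
            "\n\n【Information from past conversations with this user】\n"
            + "\n".join(f"- {m}" for m in memories)
        )
    messages = [{"role": "system", "content": system}]
    # history[-0:] would return the whole list, so handle 0 explicitly
    recent = history[-max_history:] if max_history > 0 else []
    messages.extend(recent)
    messages.append({"role": "user", "content": user_input})
    return messages

demo = build_messages(
    "You are a helpful assistant.",
    ["User: I like sushi\nAssistant: Noted!"],
    [],
    "What food do I like?",
)
print(demo[0]["content"])
```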

Streaming Support: Dramatically Improving Perceived Speed

The in-house memory made things faster, but the chat still felt slow. The cause was not the memory but stream: False.

When stream: False is used, nothing is displayed on the screen until the LLM finishes generating the entire text. When returning a long response with a 12b model, you end up staring at a blank screen for over ten seconds.

If you enable streaming, text begins to flow onto the screen the moment the first token is generated. Even if the generation speed itself is the same, the perceived waiting time is vastly different.

import json

def chat_with_ollama(user_input, memories, history):
    memory_text = ""
    if memories:
        memory_text = (
            "\n\n【Information from past conversations with this user】\n"
            + "\n".join(f"- {m}" for m in memories)
        )

    system = SYSTEM_PROMPT + memory_text

    messages = [{"role": "system", "content": system}]
    messages.extend(history[-10:])
    messages.append({"role": "user", "content": user_input})

    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/chat",
        json={"model": "gemma3:12b", "messages": messages, "stream": True},
        timeout=120,
        stream=True,  # Enable streaming for requests as well
    )
    response.raise_for_status()

    # Display tokens sequentially while building the full text
    print("\nAI > ", end="", flush=True)
    full_text = []
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            token = chunk.get("message", {}).get("content", "")
            print(token, end="", flush=True)
            full_text.append(token)
    print("\n")

    return "".join(full_text)

There are two key points:

  1. "stream": True on the Ollama side: The response becomes JSON Lines format, and chunks are returned for each token.
  2. stream=True on the requests side: Receive the HTTP response in chunks. Without this, the requests library will buffer the entire response, rendering streaming ineffective even if enabled on the Ollama side.

Also note that without flush=True, Python's output buffering causes the text to appear in bursts rather than token by token.
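
The line-by-line parsing can be tried out without a running server by feeding it hand-written lines. This is a minimal sketch: iter_tokens is a hypothetical helper, the sample lines are made up in the shape the code above reads (message.content), and stopping on a "done" flag is an assumption about the final chunk rather than something shown in this article.

```python
import json
from typing import Iterable, Iterator

def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment from each JSON line, skipping blank lines."""
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        token = chunk.get("message", {}).get("content", "")
        if token:
            yield token
        if chunk.get("done"):  # assumed end-of-stream marker
            break

# Hand-written sample lines standing in for response.iter_lines()
sample = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    b"",
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print("".join(iter_tokens(sample)))  # Hello
```

In the chat function, the same generator could be driven by response.iter_lines(), printing each token with flush=True as it arrives.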

Complete Code

Below is the complete version, combining the pieces discussed so far.

chat_with_memory.py (Full version)
"""
Ollama Chat with Memory (Lightweight Version)
========================================
No Mem0 required. High-speed version that stores memories using only Embedding.
LLM is called only once per chat response.

Requirements:
  pip install chromadb requests

Usage:
  python chat_with_memory.py

Special Commands:
  /memory   - List saved memories
  /forget   - Delete all memories
  /user     - Switch user ID
  /quit     - Exit
"""

import os
import time
import json
import hashlib
import requests
import chromadb

# ============================================================
# Settings - Please modify as needed
# ============================================================
OLLAMA_BASE_URL = "http://localhost:11434"
CHAT_MODEL = "gemma3:12b"          # Model for chat
EMBED_MODEL = "nomic-embed-text"   # Embedding model
USER_ID = "default_user"
SYSTEM_PROMPT = "You are a helpful assistant. If the user's past information is provided, please incorporate it naturally into the conversation."

# Memory settings
MEMORY_DIR = os.path.join(os.path.expanduser("~"), ".chat_memory")
MAX_RECALL = 5       # Max number of memories to retrieve
HISTORY_TURNS = 10   # Number of recent messages (user + assistant) sent to the LLM


# ============================================================
# Get Embedding (Direct Ollama API call)
# ============================================================
def get_embedding(text: str) -> list[float]:
    resp = requests.post(
        f"{OLLAMA_BASE_URL}/api/embed",
        json={"model": EMBED_MODEL, "input": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embeddings"][0]


# ============================================================
# Lightweight Memory Class
# ============================================================
class SimpleMemory:
    def __init__(self, user_id: str):
        os.makedirs(MEMORY_DIR, exist_ok=True)
        self.client = chromadb.PersistentClient(path=MEMORY_DIR)
        self.user_id = user_id
        self.collection = self.client.get_or_create_collection(
            name=f"memory_{user_id}",
            metadata={"hnsw:space": "cosine"},
        )

    def switch_user(self, user_id: str):
        self.user_id = user_id
        self.collection = self.client.get_or_create_collection(
            name=f"memory_{user_id}",
            metadata={"hnsw:space": "cosine"},
        )

    def add(self, user_input: str, assistant_response: str):
        text = f"User: {user_input}\nAssistant: {assistant_response}"
        doc_id = hashlib.md5(text.encode()).hexdigest()
        embedding = get_embedding(user_input)
        self.collection.upsert(
            ids=[doc_id],
            embeddings=[embedding],
            documents=[text],
            metadatas=[{"timestamp": time.time(), "user_input": user_input}],
        )

    def search(self, query: str, n: int = MAX_RECALL) -> list[str]:
        if self.collection.count() == 0:
            return []
        embedding = get_embedding(query)
        results = self.collection.query(
            query_embeddings=[embedding],
            n_results=min(n, self.collection.count()),
        )
        return results["documents"][0] if results["documents"] else []

    def get_all(self) -> list[dict]:
        if self.collection.count() == 0:
            return []
        results = self.collection.get(include=["documents", "metadatas"])
        memories = []
        for doc, meta in zip(results["documents"], results["metadatas"]):
            memories.append({"text": doc, "time": meta.get("timestamp", 0)})
        memories.sort(key=lambda x: x["time"], reverse=True)
        return memories

    def clear(self):
        self.client.delete_collection(f"memory_{self.user_id}")
        self.collection = self.client.get_or_create_collection(
            name=f"memory_{self.user_id}",
            metadata={"hnsw:space": "cosine"},
        )


# ============================================================
# Ollama Chat (Streaming)
# ============================================================
def chat_with_ollama(user_input: str, memories: list[str], history: list) -> str:
    memory_text = ""
    if memories:
        memory_text = (
            "\n\n【Information from past conversations with this user】\n"
            + "\n".join(f"- {m}" for m in memories)
        )

    system = SYSTEM_PROMPT + memory_text
    messages = [{"role": "system", "content": system}]
    messages.extend(history[-HISTORY_TURNS:])
    messages.append({"role": "user", "content": user_input})

    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/chat",
        json={"model": CHAT_MODEL, "messages": messages, "stream": True},
        timeout=120,
        stream=True,
    )
    response.raise_for_status()

    print("\nAI > ", end="", flush=True)
    full_text = []
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            token = chunk.get("message", {}).get("content", "")
            print(token, end="", flush=True)
            full_text.append(token)
    print("\n")

    return "".join(full_text)


# ============================================================
# Display Helper
# ============================================================
def show_memories(mem: SimpleMemory):
    all_mem = mem.get_all()
    if not all_mem:
        print("\n  📭 No memories found.\n")
        return
    print(f"\n  🧠 Memory List ({len(all_mem)} items):")
    print("  " + "-" * 50)
    for i, m in enumerate(all_mem, 1):
        first_line = m["text"].split("\n")[0]
        print(f"  {i}. {first_line}")
    print()


# ============================================================
# Main Loop
# ============================================================
def main():
    global USER_ID

    print("=" * 55)
    print("  🤖 Chat with Memory (Ollama + ChromaDB)")
    print("=" * 55)
    print(f"  Model      : {CHAT_MODEL}")
    print(f"  Embedding  : {EMBED_MODEL}")
    print(f"  User ID    : {USER_ID}")
    print(f"  Storage    : {MEMORY_DIR}")
    print("-" * 55)
    print("  /memory  List memories  /forget  Delete all")
    print("  /user    Switch user    /quit    Exit")
    print("-" * 55)

    mem = SimpleMemory(USER_ID)
    print(f"  ✅ Ready! (Memories: {mem.collection.count()})\n")

    history = []

    while True:
        try:
            user_input = input("You > ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\n👋 Exiting.")
            break

        if not user_input:
            continue

        if user_input == "/quit":
            print("👋 Exiting.")
            break
        elif user_input == "/memory":
            show_memories(mem)
            continue
        elif user_input == "/forget":
            mem.clear()
            history.clear()
            print("\n  🗑️ All memories deleted.\n")
            continue
        elif user_input == "/user":
            new_id = input("  New User ID > ").strip()
            if new_id:
                USER_ID = new_id
                mem.switch_user(USER_ID)
                history.clear()
                print(f"  ✅ Switched user to {USER_ID}.\n")
            continue

        try:
            memories = mem.search(user_input)
        except Exception:
            memories = []

        try:
            answer = chat_with_ollama(user_input, memories, history)
        except requests.exceptions.ConnectionError:
            print("\n  ❌ Cannot connect to Ollama.")
            print("     → Run 'ollama serve' first.\n")
            continue
        except Exception as e:
            print(f"\n  ❌ Error: {e}\n")
            continue

        history.append({"role": "user", "content": user_input})
        history.append({"role": "assistant", "content": answer})

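        try:
            mem.add(user_input, answer)
        except Exception as e:
            # Saving is best-effort: warn instead of failing silently, and keep chatting
            print(f"  ⚠️ Could not save memory: {e}\n")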


if __name__ == "__main__":
    main()

How to Use

# Prepare models
ollama pull gemma3:12b
ollama pull nomic-embed-text

# Install packages
pip install chromadb requests

# Run
python chat_with_memory.py

After startup, memories are automatically accumulated just by conversing normally.

You > My name is Taro. My hobby is playing guitar.
AI > Nice to meet you, Taro! Guitar is great. ...

You > What was my hobby again?
AI > You play the guitar! You told me that before. ...

You can verify saved memories using the /memory command.

You > /memory

  🧠 Memory List (1 item):
  --------------------------------------------------
  1. User: My name is Taro. My hobby is playing guitar.

Future Improvements

The system works well enough in practice, but the following improvements are worth considering:

Memory Recency Management: Since timestamps are already saved, implementing a mechanism to decay the score of old memories or automatically delete them after a certain period would prevent memories from growing indefinitely.
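
One possible form of such decay is an exponential half-life applied to the similarity score. The sketch below assumes a score where larger means more relevant; decayed_score and the 30-day half-life are hypothetical choices, and the timestamp is the time.time() value already stored in each memory's metadata.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86400

def decayed_score(similarity: float, timestamp: float,
                  now: Optional[float] = None,
                  half_life_days: float = 30.0) -> float:
    """Halve a memory's similarity score for every half_life_days of age."""
    now = time.time() if now is None else now
    age_days = max(0.0, now - timestamp) / SECONDS_PER_DAY
    return similarity * 0.5 ** (age_days / half_life_days)

now = time.time()
fresh = decayed_score(0.8, now, now=now)
month_old = decayed_score(0.8, now - 30 * SECONDS_PER_DAY, now=now)
print(f"fresh={fresh:.2f} month_old={month_old:.2f}")
```

Re-ranking search results by this score would favor recent memories, and entries whose score falls below some cutoff could be candidates for deletion.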

Memory Summary Integration: If memories of the same topic increase, periodically use the LLM in a background batch process to summarize and integrate them. Performance won't be an issue if done in batches rather than in real-time.

Selection of Embedding Model: nomic-embed-text is strong in English, but intfloat/multilingual-e5-large could be considered for better Japanese precision, though this depends on VRAM availability.

Summary

  • Memory functionality for local LLMs can be realized practically using only Embedding + Vector DB.
  • Mem0 is good for API-based LLMs, but the LLM call for memory extraction becomes a bottleneck for local LLMs.
  • Streaming is not an improvement in generation speed, but an improvement in perceived speed, and its impact on user experience is significant.
  • Dependencies are only chromadb and requests. It is cross-platform (Windows/macOS) and works anywhere Ollama runs.

The completed code is about 200 lines. In the end, understanding the mechanism and building it myself, rather than leaning heavily on libraries, turned out to be the fastest route.
