Translated by AI
The Challenge of Giving AI Agents a Knowledge Graph: Lessons Learned from EidoGraph

Recently, I've been enjoying building AI agents and AI assistants from scratch, and I've been developing my own while learning along the way. In that process, I encountered Knowledge Graphs (KG), which led me to write this article.
Have you ever used an AI agent every day and thought, "I wish you understood me better before taking action"? This is the development record of EidoGraph, a personal project that sought the answer to that question in a "Knowledge Graph". I wrote down what I learned throughout the development process, including how I redesigned the architecture after encountering a specific research paper, why I chose a graph over vector search, and how Knowledge Graphs can transform the context foundation for AI agents.
0. Introduction — About this project
EidoGraph is a personal project that extracts the values, thought patterns, and decision criteria of an author from raw data like X posts and blog articles, and structures them into a knowledge graph. The goal is simple:
To create context so that an AI can infer "how the person would likely judge, even in unknown situations."
I am targeting prompt/context-based personality mimicry rather than LLM fine-tuning.
By the way, the name comes from "Eidos," which I adopted because it has the following etymology:
A philosophical term primarily derived from Greek, referring to "form/idea."
I am a Unity engineer and don't have much backend knowledge, and I had never used a knowledge graph before. I was also new to Neo4j, the graph database. With Claude Code as my partner, I used GitHub milestones to manage design and implementation. It is currently a monorepo with the following 11 packages:
contracts ─ llm ─ graph-db
├─ ingestion ─ extraction ─ graph-builder ─ graph-query ─ feedback
│ ↓
└─ pipeline ─ api-server / mcp-server / demo-chat
Looking back, I already had the concept when I discovered knowledge graphs, but what mattered most was learning why knowledge graphs are effective for AI agents, which I did through encounters with other papers and concepts.
1. What is a Knowledge Graph in the first place?
To describe a Knowledge Graph (KG) in one sentence:
A data structure that represents the world using "nodes" (entities) and "edges" (relationships). Edges are typed, allowing them to hold meanings such as "A is a DERIVES_FROM of B."
For example, it is represented as follows. It can hold information like what formed a belief.
[Belief: Safety first in production]
│
│ DERIVES_FROM
↓
[DecisionRule: Write refresh token expiration tests for JWT]
│
│ EVIDENCES
↓
[Episode: Auth vulnerability found in Project X, March 2025]
The difference from a Relational DB is that you can treat the relationships themselves as first-class data. While you can do something similar in an RDB using JOINs, graph databases are optimized for traversal operations. A characteristic of Neo4j is that you can write queries using the Cypher query language that look like ASCII art, resembling how you would draw it in a diagram, such as (r:DecisionRule)-[:DERIVES_FROM]->(b:Belief).
What I learned from books is that because graph databases focus on "traversing relationships," each node holds pointer-like links to its neighbors. Where an RDB would need costly copying and joining, a graph database can follow those links and search at high speed.
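To make "typed edges" and "traversal" concrete, here is a minimal in-memory sketch. The type names (KgEdge, EdgeType), the node ids, the helper followEdges, and the edge directions are all assumptions made for illustration; this is neither Neo4j's API nor EidoGraph's code.

```typescript
// Hypothetical in-memory model; a real project would keep this in a graph database.
type EdgeType = "DERIVES_FROM" | "EVIDENCES";

interface KgEdge {
  from: string; // id of the source node
  type: EdgeType;
  to: string; // id of the target node
}

// Edge directions here are assumptions of this sketch.
const exampleEdges: KgEdge[] = [
  { from: "rule:jwt-refresh-tests", type: "DERIVES_FROM", to: "belief:safety-first" },
  { from: "episode:project-x-auth", type: "EVIDENCES", to: "rule:jwt-refresh-tests" },
];

// Follow only edges of one type starting at `start`, returning the visited node ids.
function followEdges(edges: KgEdge[], start: string, type: EdgeType): string[] {
  const path: string[] = [start];
  let current = start;
  // Bounded by edges.length so that a cycle cannot loop forever.
  for (let step = 0; step < edges.length; step++) {
    const next = edges.filter((e) => e.from === current && e.type === type)[0];
    if (next === undefined) break;
    path.push(next.to);
    current = next.to;
  }
  return path;
}

// "Why does this rule exist?" becomes a traversal over one edge type.
const whyRuleExists = followEdges(exampleEdges, "rule:jwt-refresh-tests", "DERIVES_FROM");
console.log(whyRuleExists.join(" -> "));
```

The point of the sketch is that the answer is a path of named edges, which is what makes the result explainable.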
The difference from a Vector DB is more fundamental:
| | Vector DB | Knowledge Graph |
|---|---|---|
| Unit of Search | "Semantically similar text snippets" | "Explicitly related nodes" |
| Inference | Implicit (proximity in embedding space) | Explicit (traversal by edge type) |
| Explainability | Weak (hard to explain why it's similar) | Strong (retains which edges were traversed) |
| Weakness | Weak with semantic ambiguity | High construction cost |
While vector search returns "things that are similar," a knowledge graph returns "how they are related." This is where it becomes effective in the AI agent context described below.
2. The evolution of learning
This project went through three fundamental design changes. As I learned more about knowledge graphs and related concepts, my understanding sharpened, and I fed that back into the project.
2.1 Stage 1: Organizing "Nodes and Edges"
As mentioned, I lacked knowledge of knowledge graphs, so I consulted Claude Code on how to structure the project. I did, however, have a rough picture of extracting facts and connecting them with edges.
For example, when there are three different statements: "I want to write tests first," "I want to go with TDD," and "I prefer test-first," these are integrated into one Preference node, "Prefer test-first development," rather than three nodes. The original statements are kept in a separate layer as Evidence nodes and connected with a SUPPORTED_BY edge.
I continued to have discussions because I didn't know how to realize this.
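One way the integration described above might look in code is sketched below. The interface names, the id scheme, and the function integrateStatements are hypothetical; deciding which statements belong together and writing the canonical text are assumed to happen in an earlier step and are passed in as arguments.

```typescript
// Hypothetical shapes for illustration only.
interface EvidenceNode { id: string; kind: "Evidence"; text: string; }
interface PreferenceNode { id: string; kind: "Preference"; canonicalText: string; }
interface SupportedByEdge { from: string; type: "SUPPORTED_BY"; to: string; }

interface IntegrationResult {
  preference: PreferenceNode;
  evidence: EvidenceNode[];
  edges: SupportedByEdge[];
}

// Build one canonical Preference node and keep each original statement as Evidence.
function integrateStatements(canonicalText: string, statements: string[]): IntegrationResult {
  const preference: PreferenceNode = { id: "pref:1", kind: "Preference", canonicalText };
  const evidence = statements.map((text, i): EvidenceNode => ({
    id: "ev:" + (i + 1),
    kind: "Evidence",
    text,
  }));
  const edges = evidence.map((ev): SupportedByEdge => ({
    from: preference.id,
    type: "SUPPORTED_BY",
    to: ev.id,
  }));
  return { preference, evidence, edges };
}

const integrated = integrateStatements("Prefer test-first development", [
  "I want to write tests first",
  "I want to go with TDD",
  "I prefer test-first",
]);
console.log(integrated.edges.length + " statements support one Preference node");
```

Keeping the sources as separate Evidence nodes means the canonical text can be regenerated later without losing the original wording.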
2.2 Stage 2: Separation of concerns between Rules and LLM
In the initial implementation, I thought, "If I let the LLM do everything, it will do it intelligently," with the LLM handling chunking, noise reduction, and semantic extraction. However, that could cause costs to skyrocket: with a lot of text to process, just building the knowledge graph would become expensive. So I looked for ways to extract information without using an LLM. (Actually, I asked Claude Code.)
The answer returned was a hybrid structure where you extract using rule-based methods and "have the LLM generate text for the connecting parts such as edges."
| Stage | Responsible | Reason |
|---|---|---|
| Noise reduction, classification, topic tagging, frequency counting | Rule-based | Can be processed mechanically. Little value in leaving to an LLM |
| Canonical text generation, relationship extraction, semantic integration, confidence estimation | LLM | LLM's value lies in resolving ambiguity |
As the name "Generative AI" suggests, LLMs are good at generating text. In other words, I used it not to organize what can be extracted from existing data, but to "generate" the content derived from it. Meanwhile, methods for extracting from given data using rule-based systems are technologies that existed before the advent of LLMs, so I combined them. (Of course, the rule-based part was outside my field of expertise, so it was something Claude Code implemented while we brainstormed.)
To borrow Claude Code's words:
LLMs are good at "giving meaning to ambiguous things," but "throwing away things that are clearly noise" is poor cost-performance. The former's answer lies in the median of the probability distribution, while the latter only requires cutting off the edges of the distribution, so rules are sufficient.
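The division of labor above can be sketched as a cheap rule-based filter followed by an injected LLM step. The noise patterns, the function names, and the CanonicalizeFn type are assumptions for illustration, and the LLM call is kept synchronous for brevity where a real client would be asynchronous.

```typescript
// Hypothetical noise rules; a real project would tune these per source type.
const NOISE_PATTERNS: RegExp[] = [
  /^\s*$/, // empty lines
  /^https?:\/\/\S+$/, // lines that are only a URL
  /^(RT|@\w+)\b/, // retweets and bare replies
];

// Rule-based stage: deterministic, so the same input always gives the same output.
function isNoise(line: string): boolean {
  const trimmed = line.trim();
  return NOISE_PATTERNS.some((p) => p.test(trimmed));
}

// The LLM stage is injected as a function so the rules can be tested without it.
type CanonicalizeFn = (lines: string[]) => string;

function buildCanonicalText(rawLines: string[], canonicalize: CanonicalizeFn): string {
  const kept = rawLines.filter((line) => !isNoise(line)); // cheap filtering first
  return canonicalize(kept); // only the surviving lines reach the expensive step
}

// Stand-in for an LLM call: it just joins the lines it was given.
const fakeLlm: CanonicalizeFn = (lines) => lines.join(" / ");
console.log(buildCanonicalText(["", "https://example.com", "I prefer test-first"], fakeLlm));
```

Because the filter runs before the LLM, every line it drops is text that never has to be paid for in tokens.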
2.3 Stage 3: Encountering the HumanLM paper and adding the A-Layer
The biggest turning point was an encounter with a specific paper. I was interested in the memory systems of AI agents and assistants and found this paper while gathering information through various methods.
HumanLM: Simulating Users with State Alignment Beats Response Imitation (Wu et al., 2026)
https://arxiv.org/abs/2603.03303
This paper argues that:
Imitating superficial phrasing (Response Imitation) cannot replicate personality. It is necessary to align internal states that have psychological foundations (State Alignment).
In the experiment, superficial mimicry using Supervised Fine-Tuning (SFT) improved by 6.5% (worst case), while internal state alignment achieved a 16.3% improvement. The argument is that "speech habits" and "thought habits" are different things, and it is not true mimicry unless you handle the latter.
In short, even if you superficially imitate how someone speaks, you cannot fathom their underlying true intentions, resulting in shallow statements. I understand that by imitating from the "way of thinking," it became possible to mimic in a way that is suited to the environment, not just superficially but internally.
The paper describes personality in 6 psychological dimensions:
| Dimension | Definition |
|---|---|
| Belief | What is considered true |
| Goal | What is to be achieved |
| Value | What is fundamentally important |
| Stance | Standpoint in a specific context |
| Emotion | Emotional tendency affecting information processing |
| Communication | How information is structured and conveyed |
After reading this, I felt it should be incorporated into the knowledge graph I was implementing, so I requested a redesign and incorporated this content.
At the time, the graph was centered on DecisionRule ("Write refresh token tests for JWT"), Preference ("Prefer TypeScript"), and Experience (specific episodes), which correspond to the B-Layer and C-Layer. This functions as a "collection of judgment rules for known situations," but it could not answer two questions:
- How to judge in unknown situations that do not fall under existing rules?
- Which one to prioritize when multiple rules conflict?
The paper describes exactly the method for solving this.
Therefore, I redesigned it based on the HumanLM paper and added the A-Layer (abstract layer of beliefs and values).
A-Layer (Abstract) : Belief / Goal / Value / Stance / Emotion / Communication
B-Layer (Rules) : DecisionRule / Preference / ReasoningPattern
C-Layer (Concrete) : Experience / Evidence / TopicInterest
I also added new inter-layer relationships:
- DERIVES_FROM (B → A): "This rule is derived from this belief"
- REINFORCES (C → A): "This episode reinforced this belief"
- CONFLICTS_WITH (A ↔ A): "These two values conflict under certain conditions"
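The layer assignments and inter-layer relationships listed above can be written down as data and checked mechanically. The names KgLayer, LAYER_OF, EDGE_LAYERS, and isValidEdge are hypothetical; only the labels, layers, and edge types come from the text.

```typescript
type KgLayer = "A" | "B" | "C";

// Node label -> layer, following the three-layer listing above.
const LAYER_OF: Record<string, KgLayer> = {
  Belief: "A", Goal: "A", Value: "A", Stance: "A", Emotion: "A", Communication: "A",
  DecisionRule: "B", Preference: "B", ReasoningPattern: "B",
  Experience: "C", Evidence: "C", TopicInterest: "C",
};

// Edge type -> [source layer, target layer].
const EDGE_LAYERS: Record<string, [KgLayer, KgLayer]> = {
  DERIVES_FROM: ["B", "A"],
  REINFORCES: ["C", "A"],
  CONFLICTS_WITH: ["A", "A"],
};

// True only when the edge type is known and both endpoints sit in the expected layers.
function isValidEdge(edgeType: string, fromLabel: string, toLabel: string): boolean {
  const allowed = EDGE_LAYERS[edgeType];
  if (allowed === undefined) return false;
  return LAYER_OF[fromLabel] === allowed[0] && LAYER_OF[toLabel] === allowed[1];
}

console.log(isValidEdge("DERIVES_FROM", "DecisionRule", "Belief")); // true
console.log(isValidEdge("DERIVES_FROM", "Belief", "DecisionRule")); // false
```

A check like this could run before writing to the database so that an LLM-proposed edge pointing the wrong way is rejected early.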
Querying some of these in Neo4j displays them as follows:
[Figure: Neo4j graph view of the query result]
Zooming in on the bottom right of the result:
[Figure: zoomed-in view of the bottom-right area of the graph]
Now I can trace "why I have that rule." The route of Facing an unknown situation → No corresponding concrete rule → Infer from the A-layer has been established.
The lesson here was:
The schema of a knowledge graph is determined by "what can be inferred" rather than "what to remember."
The schema is the input port for a virtual inference engine.
2.4 Stage 4: Data bias and the difficulty of collection
Even if the schema is well-organized, it's meaningless if the data is biased. In fact, aggregating the graph constructed from 178 blog articles showed this distribution:
| Category | Number of Nodes | Desired Number |
|---|---|---|
| TopicInterest | 58 | Sufficient |
| Communication | 39 | Sufficient |
| Belief | 23 | Generally good |
| Value | 1 | Need 15+ |
| Stance | 2 | Need 10+ |
| Emotion | 0 | Need 5+ |
Since technical blogs are the information source, there is a lot of technical interest, but information related to human core essence, such as "what is fundamentally valued (Value)" and "what excites me (Emotion)," is disastrously scarce.
What I realized here was that the type of raw data dictates the types of nodes that can be extracted. Technical memos are suitable for extracting DecisionRule, but extracting Value requires "scenes where I was forced to choose between two options" or "scenes where strong emotions were expressed." These are rarely included in technical blogs.
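A small coverage check makes the gap in the table above explicit. The interface and function names are hypothetical; the counts and minimums are the ones from the table, limited to the rows that state a numeric minimum.

```typescript
interface CategoryTarget {
  category: string;
  count: number; // nodes currently in the graph
  minimum: number; // desired lower bound
}

// Rows from the table above that state a numeric minimum.
const coverageTargets: CategoryTarget[] = [
  { category: "Value", count: 1, minimum: 15 },
  { category: "Stance", count: 2, minimum: 10 },
  { category: "Emotion", count: 0, minimum: 5 },
];

// For each under-filled category, how many more nodes are needed.
function missingNodes(targets: CategoryTarget[]): Record<string, number> {
  const gaps: Record<string, number> = {};
  targets.forEach((t) => {
    if (t.count < t.minimum) gaps[t.category] = t.minimum - t.count;
  });
  return gaps;
}

console.log(missingNodes(coverageTargets)); // { Value: 14, Stance: 8, Emotion: 5 }
```

Running a check like this after each ingestion would show which kinds of source material still need to be collected.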
I did not expect to build a complete graph right away from the data I had on hand; my plan was to run an AI assistant on top of the generated knowledge graph and grow the graph through use. So I wrote a skill for another AI assistant to "pick up mainly A-Layer items during daily conversation and send them to the Feedback API." The idea is to incorporate conversation as a data source.
3. Why Knowledge Graphs for AI agents?
So far, that's what I've learned. From here on, I will write about why knowledge graphs are meaningful for AI agents in general, based on what I can see right now.
3.1 Differences between fine-tuning, RAG, and knowledge graphs
There are several ways to give "personal context" to an AI agent.
| Method | What to remember | Strengths | Weaknesses |
|---|---|---|---|
| Fine-tuning | Baked into model weights | Zero context consumption during inference | High update cost, unexplainable, hard to control forgetting |
| RAG (Vector Search) | Retrieve raw text by similarity | Low build cost, immediate updates | Returns "similar sentences" rather than "ways of thinking" |
| Knowledge Graph | Structured relationships | Explicit inference path, small update granularity | High build cost (which this project reduces with LLMs) |
Fine-tuning is powerful, but it is extremely difficult to express, for example, "values changed over time." RAG is flexible, but it is difficult to stably extract abstract judgment axes like "prioritize quality over speed" using the scale of similarity.
The strength of a knowledge graph is that you can explicitly design "from which abstraction to infer for unknown situations." When an agent receives a new question:
- Retrieve relevant concrete rules (B-Layer)
- If no rule applies, retrieve higher-level beliefs/values (A-Layer) to infer
- If there are conflicting values, pass the conflict itself to the agent
This process can be implemented not as a median of vector similarity, but as an explicit edge traversal.
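The three steps above can be sketched as a single retrieval function. This is a simplification under stated assumptions: topic tags stand in for the actual graph query, and the names KgItem, ConflictPair, and retrieveContext are hypothetical rather than EidoGraph's API.

```typescript
interface KgItem { id: string; layer: "A" | "B"; topics: string[]; text: string; }
// Ids of two A-Layer nodes joined by a CONFLICTS_WITH edge.
interface ConflictPair { a: string; b: string; }

interface RetrievedContext {
  source: "B-Layer" | "A-Layer";
  items: KgItem[];
  conflicts: ConflictPair[];
}

function retrieveContext(topic: string, items: KgItem[], conflicts: ConflictPair[]): RetrievedContext {
  const onTopic = items.filter((n) => n.topics.indexOf(topic) !== -1);
  // Step 1: prefer concrete rules when any apply.
  const rules = onTopic.filter((n) => n.layer === "B");
  if (rules.length > 0) {
    return { source: "B-Layer", items: rules, conflicts: [] };
  }
  // Step 2: no rule applies, so fall back to higher-level beliefs and values.
  const beliefs = onTopic.filter((n) => n.layer === "A");
  const ids = beliefs.map((n) => n.id);
  // Step 3: pass along conflicts between the retrieved beliefs instead of hiding them.
  const relevant = conflicts.filter((c) => ids.indexOf(c.a) !== -1 && ids.indexOf(c.b) !== -1);
  return { source: "A-Layer", items: beliefs, conflicts: relevant };
}
```

Returning the conflicts alongside the beliefs leaves the final trade-off to the agent, which matches the third step in the list.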
3.2 What is beneficial from the perspective of the agent harness?
As you can see by using AI agents like Claude Code daily, the context window is a finite resource. You can't write everything in CLAUDE.md, and even if you do, it will be read every turn, including information unrelated to the situation, blurring the inference.
When a knowledge graph is placed as the context foundation, it looks like this from the harness side:
- Dynamically construct context: If the current topic is "security," only extract and send related Belief/DecisionRule/Experience.
- Awareness of token budget: You can construct a retrieval policy like "top N items, token limit X."
- Localized updates: When values change, simply lower the strength of the corresponding node and add a new node. No need to rewrite the entire prompt.
- Accountability: When the agent says, "I recommend this because you prioritize quality," it can also present the evidence nodes and evidence episodes.
Of these, the last one, accountability, is also extremely important outside personality mimicry, for example in AI agents that deal with internal corporate knowledge. Being able to trace and show "why this decision was made" via edges should be a watershed for future agent reliability.
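A retrieval policy of the form "top N items, token limit X" might be sketched as follows. The function names are hypothetical, and the token estimate of roughly four characters per token is a crude assumption standing in for a real tokenizer.

```typescript
interface ContextCandidate { text: string; weight: number; }

// Rough stand-in for a tokenizer: about 4 characters per token (an assumption).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Take the topN highest-weight candidates, then keep those that fit the token limit.
function selectWithinBudget(
  candidates: ContextCandidate[],
  topN: number,
  tokenLimit: number
): ContextCandidate[] {
  const ranked = candidates
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.weight - a.weight)
    .slice(0, topN);
  const selected: ContextCandidate[] = [];
  let used = 0;
  for (const c of ranked) {
    const cost = estimateTokens(c.text);
    if (used + cost > tokenLimit) continue; // skip what does not fit, keep trying smaller items
    selected.push(c);
    used += cost;
  }
  return selected;
}
```

Skipping an oversized item rather than stopping at it is a design choice: it fills the budget more completely at the cost of occasionally passing over a higher-weight node.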
3.3 The use case of a "Secretary Agent"
EidoGraph was originally conceived as a secretary agent that issues instructions to worker agents and performs primary reviews of reports on behalf of the master (the human). A secretary needs to do three main things:
- Mimic the master's instruction style to generate instructions for workers
- Review worker reports based on the master's judgment criteria
- Independently judge by inferring from the master's beliefs even in unknown situations
The third is where the knowledge graph really shines. Judgments the master has never made cannot be retrieved from past statements by exact matching. However, if there is a structure like:
- A-Layer: "Code quality is prioritized over schedule" (Value, strength: 0.9)
- B-Layer: "Load test unverified technology before production" (DecisionRule)
- C-Layer: "Early microservices failed in Project Y" (Experience)
Then, for a new question like "Should we start microservices?", you can generate inferences from the A-Layer and C-Layer even without a directly matching rule.
I believe this behavior of "no direct rule → infer from abstraction" is a unique value of knowledge graphs that is hard to reproduce with fine-tuning or RAG.
4. Important separations in design
I've summarized the separations I realized were important after implementing them. They are generally applicable design principles, so they should be usable in other projects.
4.1 Separate confidence and weight
type GraphNode = {
  confidence: number // Certainty of extraction/integration (system-side circumstance)
  weight: number // Importance in personality mimicry (user-side circumstance)
  // ...
}
At first, I tried to manage both as a single "score" and failed.
- confidence is low = Extraction is suspicious, so I want to re-confirm using another method
- weight is low = Extraction is certain, but it's not important to the person, so it's not shown normally
If you mix them, you can't judge either. "System confidence" and "importance to the user" are separate axes. I think this is a separation that works not only for KG but for search rankings and recommendation systems in general.
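Keeping the two axes separate lets each one drive its own decision, as in the sketch below. The thresholds and the names GraphNodeScores, NodeAction, and decideAction are illustrative assumptions, not values or identifiers from EidoGraph.

```typescript
type GraphNodeScores = { confidence: number; weight: number };
type NodeAction = "recheck" | "hide" | "show";

// Thresholds are illustrative defaults only.
function decideAction(node: GraphNodeScores, minConfidence = 0.5, minWeight = 0.3): NodeAction {
  if (node.confidence < minConfidence) return "recheck"; // the extraction itself is doubtful
  if (node.weight < minWeight) return "hide"; // certain, but not important to the person
  return "show";
}

console.log(decideAction({ confidence: 0.2, weight: 0.9 })); // recheck
console.log(decideAction({ confidence: 0.9, weight: 0.1 })); // hide
console.log(decideAction({ confidence: 0.9, weight: 0.9 })); // show
```

With a single merged score, the first two nodes above could end up with similar values even though they call for different actions.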
4.2 Separate extraction and graph-builder
Extraction is "take signals from this sentence," integration is "make canonical sentences from multiple signals."
- Extraction can be deterministic (same input, same output), so it's easy to test
- Integration depends on the LLM and is probabilistic, so the evaluation strategy changes
- There are cases where I only want to replace the extraction (add a different source type)
Separation of responsibility is consistent with "separation of test strategy." I can strongly state this as a rule of thumb.
4.3 LLM provider abstraction
I also implemented an LlmClient interface to allow switching between OpenAI, Claude CLI, and OpenAI-compatible servers (LM Studio, Ollama, vLLM). It supported only OpenAI at first; I added the latter two to reduce cost. However, token consumption with Claude Code was quite high, either because it doesn't cache well or because my implementation didn't mesh with it, so that path is usable but no more than that.
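A provider abstraction of this kind might look like the sketch below. Only the interface name LlmClient comes from the text; the method name complete, the provider field, the stub class, and the factory are assumptions, and the call is synchronous for brevity where a real client would be asynchronous.

```typescript
// Hypothetical shape; the article only states that an LlmClient interface exists.
interface LlmClient {
  readonly provider: string;
  complete(prompt: string): string; // synchronous for brevity
}

type ProviderKind = "openai" | "claude-cli" | "openai-compatible";

// Stub implementation so the example runs without network access.
class EchoClient implements LlmClient {
  constructor(public readonly provider: string) {}
  complete(prompt: string): string {
    return "[" + this.provider + "] " + prompt;
  }
}

function createClient(kind: ProviderKind): LlmClient {
  // A real factory would return a provider-specific implementation here.
  return new EchoClient(kind);
}

// Calling code depends only on the interface, so the provider can be swapped freely.
function summarize(client: LlmClient, text: string): string {
  return client.complete("Summarize: " + text);
}

console.log(summarize(createClient("openai-compatible"), "three test-first statements"));
```

Since callers such as summarize see only the interface, switching to a cheaper local model is a one-line change at the factory call.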
5. Accuracy evaluation
When I introduced a local LLM, I created an evaluation framework (Intrinsic / Extrinsic / LLM-as-Judge), and it was surprising that the Qwen3 local LLM had a higher score than OpenAI's gpt-5.4-mini.
Automating evaluation becomes a foundation for objectively answering:
- Which model's KG construction has higher quality as persona context?
- How does the quality change with prompt changes even with the same model?
- Is a certain change causing a regression?
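The regression question in particular reduces to comparing two score sets. The sketch below assumes each of the three evaluation axes yields one number where higher is better; the field names, the tolerance, and the function findRegressions are hypothetical.

```typescript
// One score per evaluation axis; higher is assumed to be better.
interface EvalScores { intrinsic: number; extrinsic: number; judge: number; }

// Names of the metrics where the candidate is worse than the baseline by more than `tolerance`.
function findRegressions(baseline: EvalScores, candidate: EvalScores, tolerance = 0.02): string[] {
  const metrics: (keyof EvalScores)[] = ["intrinsic", "extrinsic", "judge"];
  return metrics.filter((m) => candidate[m] < baseline[m] - tolerance);
}

const beforeChange: EvalScores = { intrinsic: 0.8, extrinsic: 0.7, judge: 0.75 };
const afterChange: EvalScores = { intrinsic: 0.81, extrinsic: 0.6, judge: 0.74 };
console.log(findRegressions(beforeChange, afterChange)); // [ 'extrinsic' ]
```

The tolerance exists because LLM outputs vary from run to run, so small drops should not be flagged as regressions.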
I thought that for personal projects leveraging LLMs, the priority of an evaluation framework is higher than expected. After all, LLMs output probabilistically and output far more text than a human can read, so I feel that such mechanical checking mechanisms will become very important in the future.
6. What to expect from AI agents in the future
Finally, I will write down what I currently think about how to use AI agents in the future.
6.1 "Portable Personality Context"
The KG constructed with EidoGraph can now be retrieved from external AI assistants via the MCP Server. What this means is:
You can carry your "values/thought patterns" between different AI agents.
I think a world is coming where you refer to the same knowledge graph when writing code with Claude Code, writing documents with Cursor, or chatting with another AI. You could say it's an AI assistant that understands you deeply. Like Iron Man's JARVIS. Or like Raphael in "That Time I Got Reincarnated as a Slime." The biggest weakness of the current AI context is "starts from zero every time you change the conversation," but if you put KG in the center, continuity of context is guaranteed at the DB level.
6.2 Agents actively nurturing KG
Currently, humans provide feedback to update the KG, but I am already prototyping a structure where the agent autonomously detects "this statement is a new Value candidate" and throws it into the Feedback API. The direction where the agent is both a user of the KG and a nurturer of the KG seems natural as a self-improvement loop.
However, there is a risk that "hallucinated values will be written into the KG." In EidoGraph, I have a pending_review status to leave final approval to the human. Automatic KG expansion by agents is prone to becoming a confirmation bias loop, so I plan to keep it semi-automated for the time being.
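The semi-automated flow can be sketched as a small state machine. Only the pending_review status comes from the text; the other states, the FeedbackCandidate shape, and the function names are assumptions for illustration.

```typescript
// "pending_review" is from the article; "approved" and "rejected" are assumed here.
type ReviewState = "pending_review" | "approved" | "rejected";

interface FeedbackCandidate { id: string; label: string; text: string; state: ReviewState; }

// Agents may only propose; every proposal starts as pending_review.
function proposeCandidate(id: string, label: string, text: string): FeedbackCandidate {
  return { id, label, text, state: "pending_review" };
}

// Only a human decision moves a candidate out of pending_review.
function reviewCandidate(c: FeedbackCandidate, approve: boolean): FeedbackCandidate {
  if (c.state !== "pending_review") throw new Error("already reviewed: " + c.id);
  return { ...c, state: approve ? "approved" : "rejected" };
}

// Agents querying the graph see approved candidates only.
function visibleToAgents(all: FeedbackCandidate[]): FeedbackCandidate[] {
  return all.filter((c) => c.state === "approved");
}
```

Because unreviewed candidates never reach visibleToAgents, a hallucinated value cannot feed back into the agent's own context and reinforce itself.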
6.3 Designing how to "forget" KG
Human values change over time. EidoGraph has lastSeenAt / weight and is designed to express "forgetting old values" via time decay or deprecated flags.
However, the policy for "what to forget" has not been started yet. In RAG, "deleting from index" is enough, but in KG, there is a ripple effect: "If you forget a certain node, what happens to the DERIVES_FROM relationship that depended on it?" I think this is a research theme for the future.
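Both halves of the problem can be sketched in a few lines: time decay of weight, and finding what a forgotten node would leave behind. The half-life formula, the field shapes, and the function names are assumptions; the article states only that lastSeenAt and weight exist and that the forgetting policy is undecided.

```typescript
interface DecayingNode { id: string; weight: number; lastSeenAt: Date; }
// A rule (from) DERIVES_FROM a belief (to).
interface DerivesFromEdge { from: string; to: string; }

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Exponential decay: the weight halves every `halfLifeDays` since the node was last seen.
function decayedWeight(node: DecayingNode, now: Date, halfLifeDays: number): number {
  const ageDays = Math.max(0, (now.getTime() - node.lastSeenAt.getTime()) / MS_PER_DAY);
  return node.weight * Math.pow(0.5, ageDays / halfLifeDays);
}

// Rules whose only DERIVES_FROM target is the node about to be forgotten.
function orphanedBy(forgottenId: string, edges: DerivesFromEdge[]): string[] {
  const dependents = edges.filter((e) => e.to === forgottenId).map((e) => e.from);
  return dependents.filter((id) => !edges.some((e) => e.from === id && e.to !== forgottenId));
}
```

Listing the orphaned rules before deletion turns the ripple effect into something a human or a policy can inspect, rather than a silent side effect.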
7. Summary — What I learned
- The schema of a knowledge graph is determined by "what can be inferred" rather than "what to remember." The schema is the input port for a virtual inference engine.
- Do not turn raw sentences into nodes as they are; integrate them into canonical sentences. Keep the source in a separate layer. Otherwise, similar nodes will explode.
- Do not use LLMs for everything; separate responsibilities with rule-based systems. LLMs have value in resolving ambiguity and are not suitable for noise reduction.
- KGs without an abstract layer (A-Layer) cannot infer in unknown situations. As the HumanLM paper points out, "phrasing" and "way of thinking" are different.
- The type of data source dictates the types of nodes that can be extracted. Value does not come from technical blogs. It is necessary to combine conversation, journals, and dialogue.
- Separate confidence and weight. "System confidence" and "importance to the user" are separate axes.
- The evaluation framework is more essential than I imagined. It is better to prepare it early in projects involving LLMs.
- In the future of AI agents, KG will become a "portable personality context." It can function as a central DB to maintain context across multiple agents.
Knowledge graphs may have lower visibility than vector DBs because the construction cost is high and immediate returns are hard to see. However, my frank impression is that there is nothing else that can replace them in situations where you want to pass "ways of thinking" to an AI agent.
No fine-tuning needed, RAG is not enough; I think the narrow and deep space between them is where knowledge graphs belong. I hope this article becomes a starting point for the design of someone holding the same questions.
References / Related Links
- HumanLM: Simulating Users with State Alignment Beats Response Imitation (Wu et al., 2026) — https://arxiv.org/abs/2603.03303
Discussion
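I read the article with great interest. I strongly agree with positioning the knowledge graph as a third path, neither fine-tuning nor RAG, as the foundation for having an AI agent infer "how the person would likely judge in an unknown situation." In particular, expressing "ways of thinking" through a three-layer structure that adds the A-Layer is a structure that clicked for me as well.
Sorry to bring up a small detail, but I was curious about how CONFLICTS_WITH is operated, so let me ask a question.
An edge that explicitly holds contradictions between Beliefs and Values seems like an important axis for handling how values change over time. How do you handle it in the implementation? Does the LLM notice at extraction time that something "contradicts an existing Belief," or does a separate pipeline periodically search for conflict candidates? And when a conflict is found, do you mark one side as deprecated, keep both and wrap them in a higher-level meta-belief, or let them coexist weighted by confidence/weight?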
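Thank you for your comment!
I'm glad it resonated with you. I want AI to understand me better and behave in a more human-like way, so I'm exploring various approaches every day. (I'm also developing an assistant separately from this knowledge graph.)
Let me answer your questions.
First, as you suggested, conflict detection runs through two paths.
One is having the LLM notice conflicts at the stage of extracting knowledge from blog articles and similar sources. The other is a mechanism that detects conflicts automatically when an AI agent or assistant queries the knowledge graph.
As for what happens after detection, the conflicting contents are compared as node data, and confidence / weight are adjusted.
I also envision the following as another update path.
The knowledge graph is intended not only for an AI assistant to pull knowledge from. It also has a mechanism that, when a human gives feedback during a conversation (for example, "That doesn't match my thinking"), writes that feedback back into the knowledge graph. This creates a loop in which the content is updated the more the human uses it.
That said, I wrote the specification and had an AI implement it, and I'm only now at the stage of running it in practice to see how it goes, so I don't yet know whether it works as intended.
Does that answer your question?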
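Thank you for the reply!
I see, so you've built a mechanism that detects conflicts through multiple paths. The fact that the data is also adjusted at query time is an approach I haven't taken so far, so it's very helpful.
A loop where the content is updated the more it's used sounds great! I'll borrow that idea (lol).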