Translated by AI
MCP vs CLI + Skills: The Context Cache Trade-off and Implementation-Dependent Behavior in Claude Code
I was curious about the comparison between MCP and the cli + skill approach, so I asked Claude about it. Both should theoretically lead to the same outcome, and since they are ultimately just API calls under the hood, there shouldn't be any fundamental difference. Why, then, were these two approaches born, and why is it said that they differ?
Through our dialogue, I learned several interesting points. Specifically regarding the current implementation of Claude Code:
- MCP injects its tool definitions into the context on every turn.
- Adding permission to use an MCP server breaks the context cache and slows things down.
Both appear to hold. The conclusions I drew from them:
- cli + skill seems to be superior.
- As for the benefits of MCP (like credential management), if we have a local BFF server acting as a sidecar, couldn't we achieve the same thing with or without MCP?
These are the two main points of consideration. All of this likely depends on the current implementation (v2.1.69), though I don't expect it to change significantly in the near future.
MCP vs Skill + Cli? Really?
I've recently seen arguments that skill+cli consumes fewer tokens than MCP. Is that even theoretically possible? I understand that a persistent server might be heavy because it never frees up memory, but that has nothing to do with the context. If it were about prompt caching, I would understand.
March 4
I compared the overhead of MCP tool definitions with the efficiency of skill+CLI.
It's a great question. Theoretically, it is possible. The main cause is the context occupied by tool definitions.
In the case of MCP: The tool definitions (name, description, JSON schema) for connected MCP servers are included in the context of every request. For example, if you connect Gmail, Google Calendar, and Slack, and each has 10–20 tool definitions, that's 30–60 schemas included in every turn. It is common for this to amount to several thousand tokens.
In the case of Skill+CLI: The tool definitions are limited to just a few (like bash, view, str_replace). The details of a skill are only read via view when needed, so tokens for unrelated skills are not consumed at all. Once used, they aren't loaded in the next turn (though they remain in the conversation history).
In short, the structural difference is:
- MCP: N tool definitions are always present in the context → number of turns × tool definition tokens accumulate.
- Skill+CLI: Tool definitions are fixed and minimal, and skill content is read on-demand only once.
The prompt caching you mentioned is exactly relevant here; if caching works, the tool definition portion isn't recomputed, which significantly mitigates the cost. However, in cases where caching doesn't work (initial request, cache TTL expiry, changes in tool definition order, etc.), a difference emerges.
Also, while MCP returns responses as structured JSON blocks (mcp_tool_use → mcp_tool_result), CLI uses stdout plain text. In some cases, CLI is more token-efficient for responses too, as MCP has metadata overhead.
So, "skill+CLI consumes fewer tokens than MCP" holds up as a theory based on the difference between constant loading vs. on-demand loading of tool definitions. However, in environments where prompt caching is effective, that difference should be quite small.
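The "number of turns × tool definition tokens" arithmetic can be sketched roughly as follows. This is only a back-of-the-envelope model that ignores prompt caching; the function names and every number (45 tools, 100 tokens per definition, an 800-token skill file read at turn 5) are assumptions chosen for illustration, not measurements.

```python
def mcp_definition_tokens(tool_count: int, tokens_per_tool: int, turns: int) -> int:
    """Tool-definition tokens sent over a conversation when every tool is always in `tools`."""
    return tool_count * tokens_per_tool * turns


def skill_cli_tokens(fixed_tools: int, tokens_per_tool: int, turns: int,
                     skill_tokens: int, read_at_turn: int) -> int:
    """Fixed tool definitions every turn, plus one skill file that enters the
    history at `read_at_turn` and is re-sent as history on each later turn."""
    definitions = fixed_tools * tokens_per_tool * turns
    skill_in_history = skill_tokens * (turns - read_at_turn + 1)
    return definitions + skill_in_history


# Hypothetical numbers: 45 MCP tools vs 4 built-in tools, 100 tokens each, 20 turns.
mcp_total = mcp_definition_tokens(tool_count=45, tokens_per_tool=100, turns=20)
skill_total = skill_cli_tokens(fixed_tools=4, tokens_per_tool=100, turns=20,
                               skill_tokens=800, read_at_turn=5)
print(mcp_total, skill_total)  # 90000 20800
```

Note that the skill file is counted for every turn after it is read, since it stays in the conversation history.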
Does MCP load tool usage every time?
Wow, does MCP really load tool usage every time? Even for unrelated tools? I assumed it was once at startup and once when using a specific MCP tool. Is it every PromptSubmit?
March 4
I explained that MCP tool definitions are included in every API request.
Yes, every time. This isn't specific to MCP; it's because the LLM API itself is stateless.
Since LLMs have no "memory," we send all of the following in every request (i.e., every turn):
- System prompt
- All available tool definitions
- Entire conversation history
The MCP server itself starts up once as a persistent process, and "fetching" the tool list happens once at that time. But the tool definitions fetched there are sent every time as tools parameters in the request to the LLM.
User input → Client builds API request
  → system prompt
  → tools: [All tool definitions fetched from MCP] ← Included every time
  → messages: [Entire conversation history]
  → Send to LLM
From the LLM's perspective, there is no distinction between the "first time" and the "10th turn"; it receives all information as if it's fresh each time. Therefore, if there are 30 tools in the MCP server, 30 schemas are included every turn.
Prompt caching is an optimization precisely for this "isn't it a waste to send the same thing every time?" issue. If the prefix (system prompt + tool definitions) is the same as the previous turn, it reads from the cache to reduce billing.
So to summarize:
- MCP server startup/fetching tool list → Once
- Sending tool definitions to LLM → Every turn (because it's stateless)
- Prompt caching → Sent every time, but costs are reduced if a cache hit occurs.
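A minimal sketch of what "stateless" means for the client, assuming a request shape with system, tools, and messages fields as described above. The helper name build_request and the two tool names are hypothetical; the point is only that the same tools list is attached on every turn while messages grows.

```python
from typing import Any


def build_request(system: str, tools: list[dict[str, Any]],
                  history: list[dict[str, Any]], user_input: str) -> dict[str, Any]:
    """Assemble the full payload for one turn; nothing is remembered server-side."""
    messages = history + [{"role": "user", "content": user_input}]
    return {"system": system, "tools": tools, "messages": messages}


# Fetched once from the MCP server at startup (hypothetical definitions).
mcp_tools = [{"name": "gmail_search"}, {"name": "gcal_list_events"}]

history: list[dict[str, Any]] = []
turn1 = build_request("You are helpful.", mcp_tools, history, "hi")
history = turn1["messages"] + [{"role": "assistant", "content": "hello"}]
turn2 = build_request("You are helpful.", mcp_tools, history, "what's on my calendar?")

print(len(turn1["tools"]), len(turn2["tools"]))        # 2 2
print(len(turn1["messages"]), len(turn2["messages"]))  # 1 3
```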
What about MCP tool search?
Wow, that's definitely a huge waste. It seems like MCP tool search has been introduced.
March 4
I connected the perceived MCP tool search feature to optimization strategies.
That's exactly the solution to that problem. In fact, tool_search is used in this conversation right now, and it's written in my tool definitions:
"ALL tools listed below are deferred — you MUST call tool_search first to load them before you can use any of them."
So the structure is:
- Traditional MCP: All tool definitions are included in tools every turn → Constant context occupation.
- tool_search method: Only the tool_search meta-tool is loaded constantly. Search when needed → Load the relevant tool definition → Use it.
This is basically the same idea as the skill+CLI's "view SKILL.md only when needed." It's lazy loading.
One thing that looks like a difference is that tool definitions loaded via tool_search remain in the conversation history and accumulate as the conversation progresses. But skill content read with the CLI's view also remains in the history, so the two are actually the same in that regard.
In the end, both are using the same optimization of "stop loading everything all the time, load only what you need on-demand." With the advent of tool_search, the token efficiency gap between MCP and skill+CLI has likely narrowed significantly.
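The lazy-loading idea can be illustrated with a toy version. This is not how the real tool_search is implemented (the conversation does not reveal that); the catalog, the substring matching, and the return shape are all assumptions made for the sketch.

```python
# Hypothetical catalog; in a real client these would come from MCP servers.
CATALOG = {
    "gmail_search": "Search Gmail messages",
    "gmail_send": "Send an email via Gmail",
    "gcal_list_events": "List Google Calendar events",
}


def tool_search(query: str) -> list[dict[str, str]]:
    """Return definitions whose name or description mentions the query."""
    q = query.lower()
    return [
        {"name": name, "description": desc}
        for name, desc in CATALOG.items()
        if q in name.lower() or q in desc.lower()
    ]


# Only the meta-tool is present up front; everything else is deferred.
always_loaded = [{"name": "tool_search", "description": "Find and load deferred tools"}]
loaded = tool_search("gmail")
print([t["name"] for t in loaded])  # ['gmail_search', 'gmail_send']
```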
Even with prompt caching and tool search, is there still a difference?
What remains as a difference between skill+cli and tool search, considering prompt caching?
March 4
I examined the differences in cache stability and response format.
Good question. To summarize:
Parts where the difference is almost gone:
- The on-demand loading mechanism itself is equivalent.
- The fact that loaded content remains in the conversation history is also the same.
Differences that remain:
- Prefix stability (cache hit rate): This is the biggest one. Prompt caching works on prefix matching, so:
  - skill+CLI: The tools array is always fixed as [bash, view, str_replace, create_file] → Cache always hits.
  - tool_search: If loaded tools are added to the tools array, the prefix might change, causing the cache to break.
This is implementation-dependent, but if the tools array changes dynamically, there will be a significant difference in cache efficiency.
- Response token efficiency
  - CLI: Plain text in stdout. It can also be filtered with grep or head before being returned.
  - MCP: Structured blocks of mcp_tool_use → mcp_tool_result, with metadata overhead. The output cannot be filtered before it is returned.
- Flexibility in tool invocation
  - CLI: Multiple commands can be piped together in a single bash call → finished in one turn.
  - MCP: One tool per invocation → turns increase when multiple steps are required → conversation history bloats.
Conclusion: Considering prompt caching, the difference in prefix stability is the most impactful. skill+CLI's tool definitions are fixed, so the cache isn't broken. If the implementation of tool_search dynamically adds tools, that causes cache misses, leading to a cost difference.
How does prompt caching work?
I see. How far does a prompt cache consider a "cache hit"? I assume it means if the arguments are the same, the result is the same, but the context is a huge block of state rather than just functional inputs, isn't it?
March 4
I unraveled the mechanism of prompt caching step by step.
Sharp question. The intuition that "if arguments are the same, results are the same" is close, but here is the more concrete mechanism.
What is cached is the KV cache.
Transformers calculate Key-Value pairs at each layer for the input token sequence. Prompt caching is a mechanism that reuses these pre-calculated KV pairs.
And why it works with "prefix matching":
An important property of Transformers is that the KV calculation for token N depends only on tokens 1 to N (causal attention). In other words:
[system prompt][tools definition][msg1][msg2][msg3][msg4 ← NEW]
If the sequence up to msg3 is exactly the same as the previous time, the KV calculation result up to msg3 is mathematically identical to the previous one. Therefore, recalculation is unnecessary, and you only need to calculate for msg4.
Why it works even if the context is a "block of state":
Because conversation history is append-only. Past messages don't change; new messages are just added to the end. Therefore, there is always a matching part from the beginning.
- Turn 1: [sys][tools][msg1] → Entirely new calculation
- Turn 2: [sys][tools][msg1][msg2] → Cache hit up to msg1, only msg2 is new
- Turn 3: [sys][tools][msg1][msg2][msg3] → Cache hit up to msg2, only msg3 is new
Cases where the cache breaks:
When the prefix changes. Specifically:
- Adding/deleting/reordering tools definitions (the case where tools are dynamically increased via tool_search)
- Changing the system prompt
- Editing/deleting parts of the conversation history
- Trimming old messages to fit within the context window
When these occur, all KV pairs after the change point must be recalculated.
This makes it clear why skill+CLI is strong: the tools array is [bash, view, str_replace, create_file] and never changes → the prefix is always stable → as the conversation gets longer, you only need to calculate for the new messages.
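Prefix matching can be sketched at the level of whole blocks. Real caches work on token sequences, so treating each block as one comparable string is a simplifying assumption, and the helper name cached_prefix_len is made up for this example.

```python
def cached_prefix_len(previous: list[str], current: list[str]) -> int:
    """Number of leading blocks shared by two requests (reusable from cache)."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n


# Append-only history: everything sent last turn is still a valid prefix.
turn2 = ["sys", "tools:v1", "msg1", "msg2"]
turn3 = ["sys", "tools:v1", "msg1", "msg2", "msg3"]
print(cached_prefix_len(turn2, turn3))  # 4

# The tools block changed: only the system prompt is still reusable.
turn3_changed = ["sys", "tools:v2", "msg1", "msg2", "msg3"]
print(cached_prefix_len(turn2, turn3_changed))  # 1
```

Because the tools block sits ahead of the messages, changing it invalidates the cached history behind it, even though the messages themselves are identical.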
What are the contents of tool_use?
Is tool_use the tools actually used, or the tools you gave permission for, or a portion thereof?
March 4
I disambiguated tool_use from permitted tools definitions.
The terminology here is confusing, so let's organize it. The API request structure is like this:
{
"system": "system prompt",
"tools": [ // ← Definitions of all permitted tools
{"name": "bash", "description": "...", "input_schema": {...}},
{"name": "web_search", "description": "...", "input_schema": {...}},
// ... All of them, regardless of whether they are used
],
"messages": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": [
{"type": "tool_use", "name": "bash", "input": {...}} // ← Tool actually used
]},
{"role": "user", "content": [
{"type": "tool_result", "content": "..."} // ← The result
]}
]
}
- tools parameter: All permitted tools. All are sent every time, even if unused. This is included in the prefix part.
- tool_use block: A record of what the model actually called within the conversation history. It is part of messages.
The order of the token sequence is:
[system] → [all tool definitions] → [messages...]
Therefore, what is important for caching is the tools parameter (all permitted tool definitions). If this changes, the subsequent messages portion also results in a cache miss.
When an MCP tool is dynamically loaded via tool_search, if it is implemented to be added to the tools array, the prefix changes and the cache is broken. The increase in tool_use records in the conversation history is just appending, so that is not an issue.
Doesn't context caching work with AllowTools?
If you give permission in advance with AllowTools, in principle, context caching should work, right?
March 4
I examined the trade-off between cache effectiveness and token reduction.
Yes, that is theoretically correct. But then we return to the original problem.
If you pre-define all tools in tools:
- Prefix fixed → cache works ✓
- But all tool definitions are included in the context every time → high token consumption ✗
If you load dynamically with tool_search:
- Don't include unnecessary tool definitions → low token consumption ✓
- tools array can change → possibility of cache breaking ✗
In the end, it's a trade-off, balancing cache hit rate and context occupation.
However, there is actually another important axis: the issue of the model's attention. If there are 50 tool definitions in tools, there is a practical problem where the model's precision in selecting the appropriate tool drops. Even if the cost is lowered by caching, it's meaningless if the inference quality drops.
Therefore, the real value of tool_search lies not just in saving tokens, but also in narrowing down the number of tools shown to the model to improve selection accuracy.
Where skill+CLI excels in this regard is that the tool definitions themselves are always just a few (like bash), and information on "what can be done" is read as skill content on-demand, so the burden of tool selection doesn't even exist.
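The cache-hit versus context-size trade-off can be made concrete with a rough cost model. The 10% price for cached tokens is an assumed parameter (actual pricing depends on the provider), cache-write surcharges are ignored, and all token counts are invented for the example.

```python
def billed_input_tokens(prefix_tokens: int, new_tokens: int,
                        cache_hit: bool, read_percent: int = 10) -> int:
    """Input tokens for one turn, expressed in full-price token equivalents.

    `read_percent` is an assumed discount: cached tokens cost that percentage
    of the normal price. Cache-write surcharges are ignored for simplicity.
    """
    prefix_cost = prefix_tokens * read_percent // 100 if cache_hit else prefix_tokens
    return prefix_cost + new_tokens


# Hypothetical turn late in a conversation: 10,000 tokens of history, 200 new tokens.
all_tools_fixed = billed_input_tokens(6000 + 10000, 200, cache_hit=True)
dynamic_hit = billed_input_tokens(500 + 10000, 200, cache_hit=True)
dynamic_miss = billed_input_tokens(500 + 10000, 200, cache_hit=False)
print(all_tools_fixed, dynamic_hit, dynamic_miss)  # 1800 1250 10700
```

Under these assumptions, a single cache miss on a long history costs far more than the tool definitions that dynamic loading saved, which is why prefix stability dominates.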
Then why does Skill benefit from caching?
I see. In the case of skill, is Bash(cd) and Bash(cat) different permissions? Or both Bash?
March 4
I clarified the structure of tool permissions.
Both are the same single bash tool. As for tool definition, it is only this:
{
"name": "bash",
"description": "Run a bash command in the container",
"input_schema": {
"properties": {
"command": {"type": "string"},
"description": {"type": "string"}
}
}
}
Whether it's cd, cat, or grep | head, it's just a different string entered into the command parameter.
This is the fundamental reason for skill+CLI's token efficiency: all operations are consolidated into a few general-purpose tools. Reading files, calling APIs, processing data—it's all bash.
With MCP, doing the same thing requires individual definitions:
{"name": "gmail_search", "input_schema": {...}},
{"name": "gmail_send", "input_schema": {...}},
{"name": "gmail_read", "input_schema": {...}},
{"name": "gcal_list_events", "input_schema": {...}},
{"name": "gcal_create_event", "input_schema": {...}},
// ... one tool per operation
Since each has a description and a JSON Schema, the tool definitions alone amount to several thousand tokens.
In other words, the level of abstraction is fundamentally different. bash is "one tool that can do anything," while MCP is "typed tools for specific operations." MCP wins in terms of type safety and validation, but bash is overwhelmingly superior in token efficiency.
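As a crude illustration of the size gap, the serialized length of the definitions can be compared. Character count is only a stand-in for token count, and the typed tools here use placeholder descriptions and schemas that are assumptions, not the real Gmail or Calendar definitions.

```python
import json

bash_tool = [{
    "name": "bash",
    "description": "Run a bash command in the container",
    "input_schema": {"properties": {"command": {"type": "string"},
                                    "description": {"type": "string"}}},
}]

# Hypothetical typed tools, one per operation, with placeholder schemas.
operations = ["gmail_search", "gmail_send", "gmail_read",
              "gcal_list_events", "gcal_create_event"]
typed_tools = [{
    "name": op,
    "description": f"Perform the {op} operation",
    "input_schema": {"properties": {"query": {"type": "string"}}},
} for op in operations]


def definition_chars(tools: list[dict]) -> int:
    """Serialized size in characters, a crude stand-in for token count."""
    return len(json.dumps(tools))


print(definition_chars(bash_tool) < definition_chars(typed_tools))  # True
```

Real MCP definitions usually carry longer descriptions and richer schemas than these placeholders, so the gap in practice would tend to be larger.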
Is the MCP implementation quite foolish?
I see, this feels like the game is completely over. Since tool_search was added, I thought maybe the MCP implementation wouldn't be that foolish, but it's more foolish than I thought.
March 4
I admitted the efficiency issues of MCP while adjusting.
Well, to defend MCP, they are solving a different problem.
bash is indeed the most token-efficient, but:
- The model may get curl options wrong or generate commands vulnerable to injection, much like SQL injection.
- What happens cannot be predicted in advance (because it's arbitrary code execution).
- It is dangerous without a sandbox.
- It is unclear to the user what they are allowing this model to do.
Typed tools in MCP:
- Allow input validation.
- Have clear permission scopes (e.g., allow read, forbid write).
- Are easy to log and audit.
- Lower the risk of the model assembling wrong commands.
In other words, MCP is designed to sacrifice token efficiency for safety and interoperability. If you are connecting to someone else's system in an enterprise environment, MCP is much safer than running curl via bash.
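The validation and permission-scope benefits listed above can be sketched as a check that runs before any tool executes. The tool table, the scope names, and validate_call are hypothetical; a real MCP server would validate against a full JSON Schema rather than a simple type map.

```python
ALLOWED_SCOPES = {"read"}  # hypothetical policy: allow read, forbid write

TOOLS = {
    "gmail_search": {"scope": "read", "required": {"query": str}},
    "gmail_send": {"scope": "write", "required": {"to": str, "body": str}},
}


def validate_call(name: str, args: dict) -> None:
    """Raise before execution if the call is unknown, out of scope, or malformed."""
    spec = TOOLS.get(name)
    if spec is None:
        raise ValueError(f"unknown tool: {name}")
    if spec["scope"] not in ALLOWED_SCOPES:
        raise PermissionError(f"{name} needs scope {spec['scope']!r}")
    for field, expected in spec["required"].items():
        if not isinstance(args.get(field), expected):
            raise TypeError(f"{name}.{field} must be {expected.__name__}")


validate_call("gmail_search", {"query": "from:boss"})  # passes silently
```

A bash command string offers no equivalent hook: the only thing to inspect is free-form text, which is why the same guarantees are hard to get there.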
However, for individual developers using it in their own environment, that safety overhead honestly looks like mere waste. Rather than "foolish," it's more a case of "overkill for my own use case."
Still, since the token efficiency difference is structural, a trend might emerge where MCP servers move toward a design like a bash wrapper where "a single general-purpose tool handles multiple operations."
Wouldn't that be solved by having a BFF server?
If necessary, MCP could act as a BFF, where the public signature as MCP is just start and stop for the web server. Once started, it would persist credentials, etc., and the actual calls would be accessed via skill through this BFF. That sounds reasonable.
March 4
I examined the BFF architecture design for MCP.
That's a very sound idea. It's clean in terms of role division.
The parts where MCP really has value:
- OAuth and other authentication flow management
- Token refresh
- Connection lifecycle management
- Control of permission scopes
The parts MCP doesn't need to do:
- Tool definitions for individual API operations
So MCP only stands up an authenticated local proxy at start. After that, the skill just calls it:
# skill only knows the local BFF endpoint
curl "localhost:3100/gmail/search?q=from:boss"
curl "localhost:3100/gcal/events?date=today"
With this:
- The tools array is fixed as [bash, view, ...] → Perfect caching.
- Authentication complexity is confined to MCP.
- API operational flexibility is on the bash side.
- If you write the "list of BFF endpoints" in the skill file, you can read it via view only when needed.
Current MCP integrates authentication management and operational definition, which is the root cause of poor token efficiency. Separating it into an authentication layer (MCP) and an operational layer (skill+CLI) is very natural architecturally.
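The credential-holding part of such a BFF might look roughly like this. Only the step that maps a local path to an authenticated upstream request is shown; the Sidecar class, the routing table, the example.invalid URLs, and the bearer-token scheme are all assumptions, and no network call is made.

```python
import urllib.request

# Hypothetical routing table: local BFF path prefix -> upstream API base URL.
UPSTREAMS = {
    "/gmail/": "https://gmail.example.invalid/",
    "/gcal/": "https://gcal.example.invalid/",
}


class Sidecar:
    """Holds credentials in memory; callers on localhost never see them."""

    def __init__(self, tokens: dict[str, str]):
        self._tokens = tokens

    def build_upstream_request(self, local_path: str) -> urllib.request.Request:
        """Translate a local BFF path into an upstream request with auth attached."""
        for prefix, base in UPSTREAMS.items():
            if local_path.startswith(prefix):
                url = base + local_path[len(prefix):]
                req = urllib.request.Request(url)
                req.add_header("Authorization", f"Bearer {self._tokens[prefix]}")
                return req
        raise KeyError(f"no upstream for {local_path}")


sidecar = Sidecar({"/gmail/": "token-a", "/gcal/": "token-b"})
req = sidecar.build_upstream_request("/gmail/search?q=from:boss")
print(req.full_url)  # https://gmail.example.invalid/search?q=from:boss
```

The skill side stays unaware of tokens entirely: it only issues plain requests to localhost, and the sidecar decides which credential to attach.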
Essentially, it feels like MCP becomes a sidecar pattern.
This is called a sidecar pattern, right?
Ah, so this is called a sidecar pattern.
March 4
I recognized the user's unfamiliarity with sidecar pattern terminology.
Exactly. It's a term that became widespread in the Kubernetes world, where a small container that handles auxiliary functions is attached next to a main container (the app itself). It's exactly the image of a sidecar attached to a motorcycle.
Typical examples:
- A sidecar for log collection next to the main app
- A sidecar for authentication proxy (like Envoy) next to the main app
- A sidecar for service mesh (like Istio) next to the main app
What they have in common is that they don't intervene in the main processing but only handle auxiliary concerns (authentication, logs, communication) from the side.
In the configuration we just discussed:
- Main: skill+CLI is responsible for actual API operations.
- Sidecar: MCP is only responsible for authentication/token management.
And the main process is almost unaware of the sidecar's existence, just sending requests to localhost:3100. The complexity of authentication is hidden by the sidecar.
This is the story of how the evolution of "I don't want to write auth logic in the app" → "Let's leave it to the sidecar" in the microservices world might be reproduced in LLM tool calling.