Translated by AI
30 Core Concepts Every AI Engineer Should Master
From tokens to MCP, from RAG to vector databases—a quick reference to rapidly build your AI engineering knowledge framework.
Introduction
If you start as an AI engineer now, your screen will be flooded with new terminology:
"Use RAG to augment LLMs, store in vector databases with Embeddings, perform tool calls via MCP, and optimize output with Few-shot Prompting..."
It sounds like an alien language.
However, these concepts are not actually difficult. What is difficult is that there is no good way to organize them, and their relationships remain unclear.
This article serves as a knowledge map to help you build a complete AI engineering knowledge framework.
Layer 1: Fundamental Concepts (Required)
1. Token
Definition: The smallest unit of text a model processes, roughly a word or a fragment of a word.
Examples:
- "Hello World" → ["Hello", " World"] (2 tokens)
- "你好世界" → ["你好", "世界"] (2 tokens in this illustration; actual counts depend on the tokenizer, and Chinese text often needs about one token or more per character)
Importance:
- API providers bill by token count (not character count)
- Models have a token limit (e.g., GPT-4 Turbo supports 128k tokens)
- Optimizing token usage = reducing costs
Tools:
- OpenAI Tokenizer: https://platform.openai.com/tokenizer
- tiktoken (Python library)
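Since billing is per token, a rough cost estimate is simple arithmetic. The sketch below shows the idea; the prices and the `estimate_cost` helper are hypothetical placeholders for illustration, not any provider's actual rates.

```python
# Hypothetical prices in USD per 1M tokens; real rates vary by provider and model.
PRICE_PER_M_INPUT = 10.0
PRICE_PER_M_OUTPUT = 30.0


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request."""
    total = input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT
    return total / 1_000_000


# A request with 2,000 prompt tokens and 500 completion tokens
print(f"${estimate_cost(2_000, 500):.4f}")
```

Input and output tokens are usually priced differently, which is why the helper takes them separately.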
2. Context / Context Window
Definition: The maximum number of tokens a model can take into account at once, in other words, how much it can "remember" in a single request.
Analogy: Your "short-term memory" when chatting with a friend. If the conversation gets too long, you forget what was said earlier; models are the same.
Examples:
- GPT-4 Turbo: 128k tokens (approx. 100,000 Chinese characters)
- Claude 3: 200k tokens
- Gemini 1.5 Pro: 1M tokens
What happens when you run out of Context?
- The earliest content gets truncated, so the model "forgets" it (or the API rejects the request as too long)
- You need to work around the limit with RAG or summarization techniques
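One common way to stay within the window is to keep only the most recent messages that fit a token budget. The following is a minimal sketch under simplifying assumptions: `rough_token_count` is a crude stand-in that counts whitespace-separated words, where a real application would use an actual tokenizer, and both function names are made up for this example.

```python
def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept: list[dict] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = rough_token_count(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "My name is Taro and I live in Osaka"},
    {"role": "assistant", "content": "Nice to meet you Taro"},
    {"role": "user", "content": "What is my name"},
]
print(trim_history(history, max_tokens=10))
```

With a budget of 10, the oldest message is dropped, which is exactly the "forgetting" described above: the model no longer sees the user's name.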
3. Prompt / Prompt Engineering
Definition: Instructions and input provided to the AI.
Advanced Techniques:
Few-shot Prompting
Providing examples to the model to help it learn patterns:
Example:
Input: "The weather is nice today" → Sentiment: Positive
Input: "This product is the worst" → Sentiment: Negative
Analyze now: "This feature is okay" → Sentiment: ?
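In code, a few-shot prompt is usually just string assembly. A minimal sketch, where `build_few_shot_prompt` is an illustrative helper name rather than part of any library:

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Format labeled examples followed by the unlabeled query."""
    lines = [f'Input: "{text}" → Sentiment: {label}' for text, label in examples]
    lines.append(f'Input: "{query}" → Sentiment:')
    return "\n".join(lines)


examples = [
    ("The weather is nice today", "Positive"),
    ("This product is the worst", "Negative"),
]
prompt = build_few_shot_prompt(examples, "This feature is okay")
print(prompt)
```

Ending the prompt on an unfinished "Sentiment:" line nudges the model to complete the pattern with a single label.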
Chain-of-Thought (CoT)
Making the model "think slowly":
Problem: Taro has 5 apples, eats 2, and buys 3. How many does he have now?
Answer (Standard): 6
Answer (CoT):
1. Initial: 5
2. Ate 2: 5 - 2 = 3
3. Bought 3: 3 + 3 = 6
Final answer: 6
System Prompt
Defining the model's "persona" and behavioral rules:
System: You are a strict legal assistant. You must cite legal articles in your responses.
User: How should I handle a breach of contract?
4. Temperature / Top-p
Definition: Controlling the "randomness" of the model's output.
Temperature:
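- 0.0: Near-deterministic; the model almost always picks the most likely token, though identical output is not strictly guaranteed (suitable for code generation)
- 1.0: Standard randomness (suitable for chat)
- 2.0: Very high randomness (can suit poetry or brainstorming, but output may become incoherent)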
Top-p (Nucleus Sampling):
- 0.1: Sample only from the most likely tokens whose cumulative probability reaches 10% (conservative)
- 0.9: Sample from the most likely tokens whose cumulative probability reaches 90% (balanced)
Rules of thumb:
- Code: temperature=0, top_p=0.1
- Articles: temperature=0.7, top_p=0.9
- Creative work: temperature=1.5, top_p=0.95
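To see what these two knobs do mechanically, here is a small pure-Python sketch. The logits are made-up scores for three candidate tokens, and the function names are illustrative. Note that this sketch requires a temperature above 0; APIs typically treat 0 as "always pick the most likely token".

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: list[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print(top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9))
```

Lower temperatures concentrate probability on the top token, while top-p cuts off the unlikely tail before sampling.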
5. Embedding / Vectorization
Definition: Converting text into numerical vectors (sequences of floating-point numbers).
Necessity:
- Computers cannot directly measure how similar in meaning raw strings like "cat" and "dog" are
- But the distance between vectors [0.2, 0.8, ...] and [0.3, 0.7, ...] can be calculated
Example:
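from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
text = "Artificial intelligence changes the world"
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text,
)
# The vector itself is nested inside the response object
vector = response.data[0].embedding
# vector: [0.023, -0.45, 0.78, ..., 0.12] (1536 dimensions by default)
print(len(vector))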
Applications:
- Semantic search
- Recommendation systems
- Foundation of RAG
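The "distance between vectors" is most often measured with cosine similarity. A minimal sketch using toy 3-dimensional vectors invented for illustration (real embeddings have hundreds or thousands of dimensions):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only
cat = [0.2, 0.8, 0.1]
dog = [0.3, 0.7, 0.2]
car = [0.9, 0.1, 0.6]

print("cat vs dog:", round(cosine_similarity(cat, dog), 3))
print("cat vs car:", round(cosine_similarity(cat, car), 3))
```

In this toy data, "cat" and "dog" point in nearly the same direction, so their score is much higher than "cat" versus "car".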
Layer 2: Architecture and Models
6. Transformer
Definition: The core architecture of modern LLMs (proposed by Google in 2017).
Core Innovation: Self-Attention mechanism
- Traditional RNN: Processes tokens one at a time in sequence, so it is slow and struggles to retain information over long text
- Transformer: Parallel processing, fast, capable of handling long-range dependencies
Components:
- Encoder: Understanding input (e.g., BERT)
- Decoder: Generating output (e.g., GPT)
- Encoder-Decoder: Translation tasks (e.g., T5)
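The core of self-attention fits in a few lines. The sketch below is deliberately simplified: it uses the input vectors directly as queries, keys, and values, whereas a real Transformer first applies learned projection matrices and uses multiple heads. The input values are made up.

```python
import math


def softmax(row: list[float]) -> list[float]:
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: list[list[float]]) -> list[list[float]]:
    """Scaled dot-product attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    output = []
    for query in x:
        # How strongly this token attends to every token (including itself)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        # New representation: attention-weighted average of all token vectors
        output.append([sum(w * value[j] for w, value in zip(weights, x)) for j in range(d)])
    return output


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up token vectors
for row in self_attention(tokens):
    print([round(v, 3) for v in row])
```

Every output row is computed independently of the others, which is what makes the parallel processing mentioned above possible.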
7. LLM (Large Language Model)
Definition: A language model with a very large number of parameters (typically billions or more), trained on large text corpora.
Scale Comparison:
- GPT-3: 175B parameters
- GPT-4: Not officially disclosed; unofficial estimates put it around 1.7T parameters
- Llama 2: 7B / 13B / 70B parameters
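Parameter count translates directly into memory: the weights alone need roughly (number of parameters) × (bytes per parameter). A back-of-the-envelope sketch, where `weight_memory_gb` is an illustrative helper and the estimate ignores activations, the KV cache, and runtime overhead:

```python
# Bytes needed to store one parameter at common precisions
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("Llama 2 7B", 7e9), ("Llama 2 70B", 70e9), ("GPT-3", 175e9)]:
    sizes = {dtype: weight_memory_gb(params, dtype) for dtype in BYTES_PER_PARAM}
    print(name, sizes)
```

This is why a 7B model in fp16 fits on a single consumer GPU while a 175B model does not.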
10. RAG (Retrieval-Augmented Generation)
Definition: Retrieval + Generation. The system first retrieves relevant documents, then has the LLM generate an answer grounded in them.
Workflow:
- User question: "What is the context length of Llama 3?"
- Retrieval: Find relevant documents in the knowledge base
- Generation: Pass the document and the question together to the LLM to get an answer
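The workflow above can be sketched without any external services. This toy version ranks documents by word overlap instead of embeddings and stops at building the prompt; the sample documents are placeholders, and `retrieve` and `build_prompt` are illustrative names. A real system would use embedding search and then send the prompt to an LLM.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Combine retrieved documents and the question into one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


# Placeholder knowledge base for illustration
knowledge_base = [
    "Llama 3 launched with a context length of 8k tokens.",
    "Qdrant is a vector database written in Rust.",
    "Streaming returns output incrementally.",
]
question = "What is the context length of Llama 3?"
top_docs = retrieve(question, knowledge_base)
print(build_prompt(question, top_docs))
```

The model never needs retraining: updating the knowledge base is enough to change what it can answer.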
Why RAG is necessary:
- Model knowledge has a training cutoff (e.g., GPT-4 Turbo's knowledge ends in 2023)
- Models have never seen corporate data
- Fine-tuning is expensive, whereas RAG is more flexible
RAG vs Fine-tuning:
| Scenario | Use RAG | Use Fine-tuning |
|---|---|---|
| Frequently updated data | ✅ | ❌ |
| Learn specific style | ❌ | ✅ |
| Cost-conscious | ✅ | ❌ |
| Need to cite sources | ✅ | ❌ |
11. Vector Database
Definition: A database dedicated to storing and searching vectors.
Why not use a standard DB?:
- Vector dimensions are high (768/1536/3072 dimensions)
- Requires efficient "similarity search" (KNN/ANN)
- Standard databases are not built for this kind of search out of the box (they need extensions such as pgvector)
Mainstream Products:
- Pinecone: Managed service, easy
- Weaviate: Open-source, supports hybrid search
- Qdrant: Open-source, written in Rust, high performance
- Milvus: Open-source, developed by Zilliz, large-scale
- Chroma: Open-source, Python-native, lightweight
- pgvector: PostgreSQL extension
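Conceptually, a vector database does the following, just far more efficiently. This sketch is an in-memory store with exact brute-force search; `TinyVectorStore` is a made-up class for illustration, and the products listed above rely on approximate nearest neighbor (ANN) indexes to avoid scanning every vector.

```python
import math


class TinyVectorStore:
    """In-memory store with exact (brute-force) nearest-neighbor search."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, vector: list[float]) -> None:
        self._items.append((doc_id, vector))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    def search(self, query: list[float], k: int = 2) -> list[tuple[str, float]]:
        """Score every stored vector against the query and return the top k."""
        scored = [(doc_id, self._cosine(query, vec)) for doc_id, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add("cat", [0.2, 0.8, 0.1])  # toy vectors, invented for illustration
store.add("dog", [0.3, 0.7, 0.2])
store.add("car", [0.9, 0.1, 0.6])
results = store.search([0.25, 0.75, 0.1], k=2)
print(results)
```

Brute force costs one comparison per stored vector per query, which is fine for thousands of items but becomes the bottleneck at millions.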
15. Function Calling / Tool Use
Definition: Allowing an LLM to call external tools (APIs, DBs, calculators, etc.).
Example:
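from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# JSON Schema description of the tool the model is allowed to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather information",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the weather in Tokyo?"}],
    tools=tools
)

# If the model chose to call a tool, the request appears here (otherwise None)
print(response.choices[0].message.tool_calls)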
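The model itself never runs anything: it only returns the tool name and a JSON string of arguments, and your code performs the call. A minimal sketch of that dispatch step, where `get_weather` is a stub and `TOOL_REGISTRY` and `execute_tool_call` are illustrative names:

```python
import json


def get_weather(location: str) -> str:
    """Stub tool; a real version would call a weather API."""
    return f"Sunny in {location}"


# Maps tool names (as declared to the model) to local Python functions
TOOL_REGISTRY = {"get_weather": get_weather}


def execute_tool_call(name: str, arguments_json: str) -> str:
    """Look up the requested tool and call it with the decoded arguments."""
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {name}")
    arguments = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**arguments)


# The name and arguments would normally come from the model's tool call
print(execute_tool_call("get_weather", '{"location": "Tokyo"}'))
```

The result is then sent back to the model in a follow-up message so it can phrase the final answer.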
Applications:
- Real-time data lookup (stock prices, weather)
- Executing actions (sending emails, booking tickets)
- Accessing private data (internal databases)
16. MCP (Model Context Protocol)
Definition: A standardized tool-calling protocol proposed by Anthropic (released in November 2024).
Problem solved:
- Previously, each AI application implemented tool calling independently
- MCP standardizes this, making tools reusable across applications
Architecture:
AI App ←→ MCP Client ←→ MCP Server ←→ Tools/Data Sources
Significance:
- Similar to the USB protocol: connect everything with one standard
- Foundation for future AI tool ecosystems
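MCP messages are exchanged as JSON-RPC 2.0. As a rough illustration of what the client sends to the server when invoking a tool, the sketch below builds such a request by hand. The `make_tool_call_request` helper is made up for this example, and the exact schema should be checked against the MCP specification; in practice an SDK handles this layer.

```python
import json


def make_tool_call_request(request_id: int, tool_name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 request asking an MCP server to run a tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


request = make_tool_call_request(1, "get_weather", {"location": "Tokyo"})
print(request)
```

Because every server speaks this same message format, a tool written once can be reused by any MCP-capable application.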
Layer 3: Engineering Practice
18. Agent
Definition: AI that can autonomously reason, call tools, and complete tasks.
Workflow:
- Reasoning: Task analysis
- Acting: Calling tools
- Observing: Checking results
- Repeating: Until completion
Example:
Task: Book a flight from Tokyo to Osaka for tomorrow
Agent:
Reasoning: Flight information is needed
Acting: Call search_flights("Tokyo", "Osaka", "tomorrow")
Observing: Found 3 flights
Reasoning: User selection required
Acting: Call ask_user("Which flight would you like?")
Observing: User chose flight 2
Acting: Call book_flight(flight_id=2)
Observing: Booking successful
Done!
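The reason-act-observe loop can be sketched in a few lines. In this simplified version the tools are stubs, the user-selection step is skipped, and `scripted_policy` is a hard-coded stand-in for the LLM, so all names and data here are assumptions for illustration.

```python
from typing import Callable, Optional


def search_flights(origin: str, destination: str, date: str) -> list[dict]:
    """Stub tool: returns three made-up flights."""
    return [{"flight_id": n, "route": f"{origin}-{destination}", "date": date} for n in (1, 2, 3)]


def book_flight(flight_id: int) -> str:
    """Stub tool: pretends the booking succeeded."""
    return f"Booked flight {flight_id}"


TOOLS: dict[str, Callable] = {"search_flights": search_flights, "book_flight": book_flight}


def scripted_policy(observations: list) -> Optional[dict]:
    """Stand-in for the LLM: decide the next action, or None when done."""
    if not observations:
        args = {"origin": "Tokyo", "destination": "Osaka", "date": "tomorrow"}
        return {"tool": "search_flights", "args": args}
    if len(observations) == 1:
        chosen = observations[0][1]  # pretend the user picked the second flight
        return {"tool": "book_flight", "args": {"flight_id": chosen["flight_id"]}}
    return None


def run_agent(policy: Callable, tools: dict, max_steps: int = 5) -> list:
    """Repeat reason -> act -> observe until the policy says it is done."""
    observations: list = []
    for _ in range(max_steps):
        action = policy(observations)  # Reasoning
        if action is None:
            break
        result = tools[action["tool"]](**action["args"])  # Acting
        observations.append(result)  # Observing
    return observations


observations = run_agent(scripted_policy, TOOLS)
print(observations[-1])
```

The `max_steps` cap matters in practice: without it, an agent that never decides it is finished would loop forever.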
19. Streaming Output
Definition: Returning results incrementally without waiting for the full generation to complete.
UX:
- Non-streaming: Wait 10 seconds → See complete answer
- Streaming: See the first tokens almost immediately, with the rest displayed incrementally
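On the consuming side, streaming usually means iterating over chunks as they arrive. The sketch below imitates that with a plain generator; `fake_stream` is a stand-in for a real streaming API response, not an actual SDK call.

```python
from typing import Iterator


def fake_stream(text: str, chunk_size: int = 4) -> Iterator[str]:
    """Stand-in for a streaming API: yields the answer a few characters at a time."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


answer = "Streaming shows partial output early."
received = []
for chunk in fake_stream(answer):
    print(chunk, end="", flush=True)  # render each piece as soon as it arrives
    received.append(chunk)
print()
```

The total generation time is the same either way; what changes is how soon the user sees the first output.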
21. SDD (Specification-Driven Development)
Definition: Defining "specifications" (specs) first, then generating code.
Traditional development:
Requirements → Design → Code → Testing
SDD:
Requirements → Create Specs (test cases/constraints) → AI generation → Automated verification
Significance:
- A development paradigm suited to AI-assisted coding
- Humans focus on "what they want," AI is responsible for "how to do it"
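One way to picture the SDD flow is to write the spec as data and check any generated implementation against it automatically. A toy sketch, where the spec, `candidate_title_case`, and `verify` are all invented for illustration:

```python
# The spec: each case is (input, expected output)
SPEC = [
    ("hello world", "Hello World"),
    ("", ""),
    ("ai engineer", "Ai Engineer"),
]


def candidate_title_case(text: str) -> str:
    """Pretend this body was generated by an AI from the spec."""
    return " ".join(word.capitalize() for word in text.split(" "))


def verify(func, spec) -> list[str]:
    """Run every spec case and collect a description of each failure."""
    failures = []
    for given, expected in spec:
        actual = func(given)
        if actual != expected:
            failures.append(f"{given!r}: expected {expected!r}, got {actual!r}")
    return failures


failures = verify(candidate_title_case, SPEC)
print("PASS" if not failures else failures)
```

If verification fails, the failure messages can be fed back to the AI for another attempt, while the spec itself stays fixed.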
30. Hallucination
Definition: When the model "confidently spews nonsense."
Example:
User: Introduce the book "Quantum Buddhism"
AI: This book was published by Taro Yamada in 2018...
(Actually, this book does not exist)
Causes:
- Models "predict the next word," they don't "query a database"
- Noise in training data
Mitigation methods:
- RAG: Provide authentic documents
- Lower Temperature: Reduce randomness
- Prompt constraints: Tell it to say "I don't know" if it doesn't know
- Citation: Require the model to cite its basis
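Requiring citations is most useful when you also check them. A minimal sketch, assuming the model was asked to quote its sources verbatim: the `unsupported_quotes` helper (an illustrative name) flags any quote that cannot be found in the provided documents. The sample data is made up.

```python
def unsupported_quotes(quotes: list[str], sources: list[str]) -> list[str]:
    """Return the quotes that do not appear verbatim in any source document."""
    lowered = [s.lower() for s in sources]
    return [q for q in quotes if not any(q.lower() in s for s in lowered)]


sources = ["Qdrant is an open-source vector database written in Rust."]
quotes = ["written in Rust", "published by Taro Yamada in 2018"]
print(unsupported_quotes(quotes, sources))
```

A quote flagged here is a strong hint that the model invented it, though exact matching will also flag honest paraphrases.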
How to learn these concepts?
1. Hierarchical Learning
Week 1: Token, Context, Prompt, Temperature (Experiment by writing some prompts)
Week 2: Embedding, RAG, Vector DB (Build a simple Q&A system)
Week 3: Function Calling, Agent (Make AI call a weather API)
Week 4: Fine-tuning, Quantization (Fine-tune a small model)
2. Practice
Project suggestions:
- Week 1: Personal Knowledge Base Q&A (RAG)
- Week 2: Multi-tool AI Assistant (Function Calling)
- Week 3: Code Generator (Few-shot + CoT)
- Week 4: Fine-tuned Customer Service Bot
Conclusion
These 30 concepts are the "Periodic Table" of AI engineering.
You don't need to learn everything at once. But knowing that they exist, their relationships, and where to use them will help you quickly identify issues when you encounter them.
Remember two things:
- Concepts are dead, applications are alive. Don't memorize definitions; understand the scenarios.
- AI technology iterates rapidly. Today's best practice might be obsolete in half a year. Keep learning and embrace change.
Choose one concept you are interested in and try it out now.
The best way to understand AI is to work with AI.
The original version of this article can be found here → hongqi-lgs.github.io