Translated by AI
Building RAG Agents with LangGraph Tool Calling (Part 2)
Introduction
Please refer to the previous article for the purpose of this post.
In this part, I will immediately begin explaining the code from the previous article.
There are some interesting topics for those who want to understand the details, such as the implementation of a custom LangChain Retriever class and how similarity and distance are sometimes confused within LangChain. I hope you enjoy it!
References
(Book links are Amazon affiliate links)
Articles
How to migrate from LangChain agents to LangGraph
Similarity search methods with Chroma DB
How to create a custom Retriever class
Document loaders in LangChain
Google's Gen AI models
Books
Practical Introduction to RAG and AI Agents with LangChain and LangGraph
Practical Introduction to Chat System Construction with ChatGPT/LangChain
By using LangChain, you can execute various models with a unified codebase.
I highly recommend these books for LangChain, as they cover almost everything you need.
They also explain create_react_agent, which is currently the recommended way to build RAG Agents with LangGraph, and cover complex Agent construction and design methods. It's very informative!
Introduction to Large Language Models
Introduction to Large Language Models II: Implementation and Evaluation of Generative LLMs
I often recommend these; they provide a broad overview of everything from LLM fine-tuning to RLHF, RAG, and distributed learning. I refer to them constantly.
In addition to the RAG content introduced in this article, they also touch upon Instruction Tuning assuming RAG, which is very interesting.
If you work with LLMs, you won't regret buying them.
They are like books that reveal new discoveries the more you read them, much like how "surume" (dried squid) gets more flavorful the more you chew it.
Fine-tuning LLMs and RAG: Practice through Chatbot Development
Compared to the two books above, this one focuses more specifically on RAG and fine-tuning. It's written quite simply and is easy to understand.
Also, although not the main focus of this article, it is common to consider hybrid search by adding keyword search when implementing RAG. This book delves into that area as well.
It also introduces tips for using BM25Retriever (often used for keyword search) with Japanese documents, making it a very practical book.
Artifacts
Please see the following GitHub repository.
Code Explanation
Vectorizing Documents with an Embedding Model and Storing them in a Database
The actual code can be found below.
Constant Definitions
# --- Constant Definitions ---
EM_MODEL_NAME = "models/text-embedding-004"
RAG_FOLDER_PATH = "./inputs"
CHROMA_DB = "./chroma/chroma_langchain_db"
CHROMA_NAME = "example_collection"
We define the Embedding model to be used, the name of the Chroma DB, and the storage location.
In this case, we are using text-embedding-004 as the Embedding model.
The only reason for this choice is that it is free.
Defining the Embedding Model
# Define the text embedding model (dense embedding)
embedding_model = GoogleGenerativeAIEmbeddings(model=EM_MODEL_NAME)
Here, we define the Embedding model that will vectorise the documents.
We are using the following free model:
Defining the Chroma DB
vector_store = Chroma(
    collection_name=CHROMA_NAME,
    embedding_function=embedding_model,
    persist_directory=CHROMA_DB,  # Where to save data locally, remove if not necessary
)
We define the Chroma DB using the LangChain wrapper.
With just this, you can create the specified database locally. It's quite simple.
Since we specify the Embedding model here, when storing documents, the system will automatically vectorize and store them if you use the designated methods.
Also, by specifying a local path in persist_directory, the database will be persisted there, allowing it to be called from other code.
In this article, we will actually call it from the code that uses RAG to have the LLM generate answers.
Loading Documents and Creating Chunks, IDs, Metadata, etc.
def load_text_files_from_folder(folder_path):
    """
    Function to load all text files (.txt) from a specified folder
    :param folder_path: Path to the folder to load
    :return: List of loaded documents
    """
    # Get all .txt files in the folder
    text_files = glob.glob(os.path.join(folder_path, "*.txt"))
    # Load all text files
    documents = []
    for file in text_files:
        loader = TextLoader(file)
        documents.extend(loader.load())  # Add content of each file to the list
    print(f"Loaded {len(documents)} documents from {folder_path}")
    return documents  # Return the list of loaded documents
def main():
    ...
    # Load text files
    documents = load_text_files_from_folder(RAG_FOLDER_PATH)
    # Split into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100,
        separators=[
            "\n\n",
            "\n",
            " ",
            ".",
            ",",
            "\u200b",  # Zero-width space
            "\uff0c",  # Fullwidth comma
            "\u3001",  # Ideographic comma
            "\uff0e",  # Fullwidth full stop
            "\u3002",  # Ideographic full stop
            "",
        ],
    )
    doc_splits = text_splitter.split_documents(documents)
    # Extract the text portion of the chunks
    texts = [doc.page_content for doc in doc_splits]
    # Optional IDs and metadata
    ids = ["i_" + str(i + 1) for i in range(len(texts))]
    metadatas = [{"my_metadata": i} for i in range(len(texts))]
    ...
Document Loading
Here, we first use the load_text_files_from_folder function to load local text files in Document format.
Within the function, TextLoader is used to read text files. This is also a convenient feature of LangChain. There are also classes implemented for loading PDFs, CSVs, etc. For more details, see below:
Using these, you can load a folder's files of various types based on their extensions, together with their metadata.
The obtained data is concatenated into documents using documents.extend(loader.load()).
Chunk Splitting
Next, the obtained text data is split into chunks. We are using RecursiveCharacterTextSplitter here.
By using this, the text is split at the specified separators within the document, so the chunks roughly follow contextual boundaries such as paragraphs (though this can be difficult for text in which these separators rarely appear).
Also, I've added more separators referring to the following article. It seems better to add more for Japanese (though it didn't change much within the scope of this experiment).
In this case, since the amount of text is small and it's a single file, we split it into 500-character chunks with a 100-character overlap. Within each segment of at most 500 characters, the split point falls where one of the separators above matches.
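To make the chunk_size and chunk_overlap arithmetic concrete, here is a minimal pure-Python sketch (my own illustration, not the actual RecursiveCharacterTextSplitter, which additionally honors the separator list when choosing cut points):

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    # Each new chunk starts (chunk_size - chunk_overlap) characters after the
    # previous one, so consecutive chunks share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("x" * 1000)
print([len(c) for c in chunks])  # [500, 500, 200]
```

With 1000 characters, chunks start at positions 0, 400, and 800, so each neighboring pair overlaps by exactly 100 characters.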
Setting Text, IDs, and Metadata
Finally, we extract the text part and assign IDs and metadata in the following section.
doc_splits = text_splitter.split_documents(documents)
# Extract the text portion of the chunks
texts = [doc.page_content for doc in doc_splits]
# optional IDs and metadata
ids = ["i_" + str(i + 1) for i in range(len(texts))]
metadatas = [{"my_metadata": i} for i in range(len(texts))]
Here, IDs and metadata are assigned quite arbitrarily, but when having a LangGraph agent perform RAG, you can also provide this ID and metadata information. Therefore, by properly specifying document information (such as filenames or page numbers), you could present primary sources when responding to customer Q&A.
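As a sketch of that idea (with a hypothetical `chunk_index` field and a `FakeDoc` stand-in for LangChain's real `Document` objects), metadata carrying the source file and chunk position could be built like this:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for LangChain's Document, for illustration only
@dataclass
class FakeDoc:
    page_content: str
    metadata: dict = field(default_factory=dict)

# Mimic the output of split_documents() on a single source file
doc_splits = [
    FakeDoc("Article 8 ...", {"source": "./inputs/privacy_policy.txt"}),
    FakeDoc("Article 10 ...", {"source": "./inputs/privacy_policy.txt"}),
]

ids = [f"i_{i + 1}" for i in range(len(doc_splits))]
metadatas = [
    {"source": doc.metadata.get("source", "unknown"), "chunk_index": i}
    for i, doc in enumerate(doc_splits)
]
print(ids)  # ['i_1', 'i_2']
```

Storing the source path (and, for PDFs, the page number that loaders put in `metadata`) is what lets the agent later cite where an answer came from.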
Vectorizing and Storing in the DB
# ---- dense embedding ----
dense_embeddings = embedding_model.embed_documents(texts)
# Check the contents of embeddings
print("dense embeddings (partial):", dense_embeddings[0][:5]) # Check the first embedding
print("dense embeddings length:", len(dense_embeddings))
#https://github.com/langchain-ai/langchain/blob/5d581ba22c68ab46818197da907278c1c45aad41/libs/partners/chroma/langchain_chroma/vectorstores.py#L502
result = vector_store.add_texts(
    texts=texts,
    metadatas=metadatas,
    ids=ids,
)
In the code above, the documents are vectorized into dense_embeddings. Then, the vectors are stored in Chroma DB using the add_texts method.
The add_texts method is implemented for most databases in LangChain, so it can often be used as is. However, since implementations can vary depending on the contributor, I recommend checking the original source code once, as there might be even easier-to-use methods.
Checking if Stored
The storage step is now complete, but I also checked whether the data was really saved to the DB.
# The following code is for checking the content of data saved in chroma DB
# Example of retrieving all data (documents and metadata only)
data = vector_store._collection.get(include=["documents", "metadatas", "embeddings"])
print("Documents (3 items):", data["documents"][:3])
print("Metadatas (all items):", data["metadatas"])
print("Embeddings (partial):", data["embeddings"][0][:5])
Running this allows you to retrieve the document and metadata information for the vector data stored in Chroma DB.
LLM Answering Questions via RAG from Stored DB Content
The actual code can be found below.
I will explain this in the same way.
Constant Definitions
# --- Constant Definitions ---
LLM_MODEL_NAME = "gemini-2.0-flash-001"
EM_MODEL_NAME = "models/text-embedding-004"
CHROMA_DB = "./chroma/chroma_langchain_db"
CHROMA_NAME = "example_collection"
# Query text
query = "If a user under 16 years old uses our service from abroad without parental consent, how is it handled? Is there a possibility that the data will be stored outside the country?"
Here, we specify the Chroma DB information set in the previous code, as well as the Embedding model and LLM model to be used.
We also define the user's query here.
By the way, this question is one whose answer can be found in Articles 8 and 10 of the Privacy Policy.
Defining Models and the DB
# Define the text embedding model (dense embedding)
embedding_model = GoogleGenerativeAIEmbeddings(model=EM_MODEL_NAME)

# Chroma
vector_store = Chroma(
    collection_name=CHROMA_NAME,
    embedding_function=embedding_model,
    persist_directory=CHROMA_DB,  # Where to save data locally, remove if not necessary
)

# Chat model (LLM)
llm = ChatGoogleGenerativeAI(
    model=LLM_MODEL_NAME,
    temperature=0.2,
    max_tokens=512,
)
Basically, this is the same as the code above.
For the LLM, I have selected gemini-2.0-flash-001. It is a free model.
Implementing the Logic for Searching the Database
class VectorSearchRetriever(BaseRetriever):
    """
    Retriever class for performing vector search.
    """

    vector_store: SkipValidation[Any]
    embedding_model: SkipValidation[Any]
    k: int = 5  # Number of Document chunks to return

    class Config:
        arbitrary_types_allowed = True

    def _get_relevant_documents(self, query: str) -> List[Document]:
        # Dense embedding
        embedding = self.embedding_model.embed_query(query)
        search_results = self.vector_store.similarity_search_by_vector_with_relevance_scores(
            embedding=embedding,
            k=self.k,
        )
        # Extract only the list of Documents
        return [doc for doc, _ in search_results]

    async def _aget_relevant_documents(self, query: str) -> List[Document]:
        return self._get_relevant_documents(query)

...

def main():
    ...
    # Prepare DenseRetriever
    dence_retriever = VectorSearchRetriever(
        vector_store=vector_store,
        embedding_model=embedding_model,
        k=5,
    )
    ...
When implementing RAG in LangChain, you use a convenient class called a Retriever.
For the Chroma DB used in this project, there is a very handy method called as_retriever, which allows you to easily use it as a Retriever class like this:
dence_retriever = vector_store.as_retriever()
However, looking at various Google Cloud services, it seems that not all vector stores necessarily support the .as_retriever() method yet.
(Though I believe support will continue to expand in the future.)
Even if a database doesn't support the .as_retriever() method, LangChain allows you to create your own custom Retriever class to achieve the same result.
So, I decided to implement one, partly for my own learning.
Details on how to create a custom Retriever class are described here:
Checking this, it seems you should inherit from the BaseRetriever class and implement the synchronous method _get_relevant_documents and the asynchronous method _aget_relevant_documents.
In the _get_relevant_documents method, you implement the process of vectorizing the user's query and calculating the similarity between that vector and the vectors stored in the DB to retrieve the relevant documents.
Checking the LangChain implementation source for Chroma DB, various methods such as similarity_search_by_vector_with_relevance_scores are available.
For example, the following are implemented in Chroma DB (Source):
- `similarity_search`
    - A method that takes the query itself as input and retrieves the top k documents with high similarity.
    - Internally, it passes the query to the Embedding model to vectorize it.
    - Internally, the `similarity_search_with_score` method is called, and only the document part is returned.
- `similarity_search_with_score`
    - Takes the query itself as input and retrieves the top k documents with high similarity. It also outputs the calculated similarity scores.
    - Note that the similarity here is represented as distance; smaller values mean higher similarity.
    - For cosine distance, it's (1 - cosine similarity).
- `similarity_search_by_vector`
    - A method that takes a manually embedded vector as input and retrieves the top k documents with high similarity.
- `similarity_search_by_vector_with_relevance_scores`
    - Takes a manually embedded vector as input and retrieves the top k documents with high similarity. It also outputs the calculated similarity (distance).
    - Similarly, smaller distance values indicate higher similarity.
- `similarity_search_with_vectors`
    - Takes the query itself as input and retrieves the top k documents with high similarity. It also returns each document's embedding vector.
- `similarity_search_by_image` (image-based)
    - A method that takes an image as input and outputs the top k similar images.
    - To use this, the Embedding model set in the vector_store must support image embedding.
- `similarity_search_by_image_with_relevance_score` (image-based)
    - Almost identical to the above method, but also outputs similarity (distance).
    - Again, lower distance values mean higher similarity.
- `max_marginal_relevance_search` (MMR-based)
    - A method that uses MMR (Maximal Marginal Relevance) to retrieve documents while considering diversity relative to the other retrieved documents.
    - It takes the query, first retrieves `fetch_k` documents, and then narrows them down to `k` documents using MMR; in this process, documents with similar content are removed.
- `max_marginal_relevance_search_by_vector` (MMR-based)
    - Similar to the above method, but uses a pre-embedded vector for search.
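To double-check the "cosine distance = 1 - cosine similarity" relationship noted above, here is a small pure-Python example (my own illustration, independent of Chroma):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity: dot product divided by the product of norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

a = [1.0, 0.0]
b = [1.0, 1.0]

sim = cosine_similarity(a, b)   # 1/sqrt(2), about 0.7071
dist = 1.0 - sim                # cosine distance, about 0.2929
print(round(sim, 4), round(dist, 4))

# Identical vectors: similarity 1.0, so distance 0.0 (smallest distance = most similar)
print(1.0 - cosine_similarity(a, a))
```

So a score of 0 from the distance-returning methods means "identical direction", the opposite convention from a similarity score of 0.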
For this implementation, I used the similarity_search_by_vector_with_relevance_scores method—no particular reason, I just thought the longer name looked cool.
In the actual implementation, it is used as follows:
def _get_relevant_documents(self, query: str) -> List[Document]:
    # Dense embedding
    embedding = self.embedding_model.embed_query(query)
    search_results = self.vector_store.similarity_search_by_vector_with_relevance_scores(
        embedding=embedding,
        k=self.k,
    )
    # Extract only the list of Documents
    return [doc for doc, _ in search_results]
As specified by the method, the query is first vectorized using embedding_model.embed_query. Then, k documents are retrieved using the similarity_search_by_vector_with_relevance_scores method. Since the distance is not needed, only the Documents are returned.
(Admittedly, one might argue "Then just use the similarity_search method from the start," and they would be right.)
I implemented it this way purely out of hope for potential future extensibility.
Finally, it is used in the main function as follows:
# Prepare DenseRetriever
dence_retriever = VectorSearchRetriever(
    vector_store=vector_store,
    embedding_model=embedding_model,
    k=5,
)
To reiterate, the following implementation is sufficient (and equivalent):
dence_retriever = vector_store.as_retriever(search_kwargs={'k': 5})
(Supplementary: Detailed Discussion) Implementation of as_retriever
Incidentally, as_retriever can be configured in three ways:
- "similarity" (default)
- "mmr"
- "similarity_score_threshold"
In the "similarity" mode, the similarity_search method is ultimately called internally.
We can trace it as follows:
The above is the as_retriever method. Here, a VectorStoreRetriever class is finally created.
The VectorStoreRetriever class is implemented as a Retriever class, just as I did above (so the essence is the same).
Therefore, by looking at the _get_relevant_documents method, you can understand what processing is being performed.
The internal logic is as follows:
def _get_relevant_documents(
    self, query: str, *, run_manager: CallbackManagerForRetrieverRun, **kwargs: Any
) -> list[Document]:
    _kwargs = self.search_kwargs | kwargs
    if self.search_type == "similarity":
        docs = self.vectorstore.similarity_search(query, **_kwargs)
    elif self.search_type == "similarity_score_threshold":
        docs_and_similarities = (
            self.vectorstore.similarity_search_with_relevance_scores(
                query, **_kwargs
            )
        )
        docs = [doc for doc, _ in docs_and_similarities]
    elif self.search_type == "mmr":
        docs = self.vectorstore.max_marginal_relevance_search(query, **_kwargs)
    else:
        msg = f"search_type of {self.search_type} not allowed."
        raise ValueError(msg)
    return docs
Here, you can see the options for similarity, similarity_score_threshold, and mmr.
When self.search_type == "similarity" is specified (which is the default), the similarity_search method is called.
Thus, it is equivalent to my implementation.
(Supplementary: Detailed Discussion 2) Mixing Similarity and Distance
When reading the internal implementation, it's incredibly confusing because "distance" and "similarity" are all mixed together.
For example, looking back at the example from "Detailed Discussion 1," we can see this by following the similarity_score_threshold pattern.
Looking at the comments in the implementation of the as_retriever method above, it seems it's intended to be used like this:
# Only retrieve documents that have a relevance score
# above a certain threshold
docsearch.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={'score_threshold': 0.8}
)
In conclusion, in the above notation, the threshold is "similarity."
That is, 1 is the most similar, and 0 is the most distant.
Wait, what????
Didn't you just say earlier that since it's "distance," 0 is the most similar?!
Yes, you are absolutely right. The reason this happens is likely because the implementation varies for each DB, and conversions are cleverly inserted in between.
Let's take a look.
When search_type="similarity_score_threshold" is specified, the similarity_search_with_relevance_scores method is called.
This is implemented as follows:
def similarity_search_with_relevance_scores(
    self,
    query: str,
    k: int = 4,
    **kwargs: Any,
) -> list[tuple[Document, float]]:
    """Return docs and relevance scores in the range [0, 1].

    0 is dissimilar, 1 is most similar.

    Args:
        query: Input text.
        k: Number of Documents to return. Defaults to 4.
        **kwargs: kwargs to be passed to similarity search. Should include:
            score_threshold: Optional, a floating point value between 0 to 1 to
                filter the resulting set of retrieved docs.

    Returns:
        List of Tuples of (doc, similarity_score).
    """
    score_threshold = kwargs.pop("score_threshold", None)

    docs_and_similarities = self._similarity_search_with_relevance_scores(
        query, k=k, **kwargs
    )
    if any(
        similarity < 0.0 or similarity > 1.0
        for _, similarity in docs_and_similarities
    ):
        warnings.warn(
            "Relevance scores must be between"
            f" 0 and 1, got {docs_and_similarities}",
            stacklevel=2,
        )

    if score_threshold is not None:
        docs_and_similarities = [
            (doc, similarity)
            for doc, similarity in docs_and_similarities
            if similarity >= score_threshold
        ]
        if len(docs_and_similarities) == 0:
            logger.warning(
                "No relevant docs were retrieved using the relevance score"
                f" threshold {score_threshold}"
            )
    return docs_and_similarities
As you can see by reading the latter half, it ultimately retrieves only documents where similarity >= score_threshold.
And as stated in the comments at the top, "0 is dissimilar, 1 is most similar."
Now, the method actually retrieving the documents is _similarity_search_with_relevance_scores.
This is implemented as follows:
def _similarity_search_with_relevance_scores(
    self,
    query: str,
    k: int = 4,
    **kwargs: Any,
) -> list[tuple[Document, float]]:
    """Default similarity search with relevance scores. Modify if necessary
    in subclass.
    Return docs and relevance scores in the range [0, 1].
    0 is dissimilar, 1 is most similar.

    Args:
        query: Input text.
        k: Number of Documents to return. Defaults to 4.
        **kwargs: kwargs to be passed to similarity search. Should include:
            score_threshold: Optional, a floating point value between 0 to 1 to
                filter the resulting set of retrieved docs

    Returns:
        List of Tuples of (doc, similarity_score)
    """
    relevance_score_fn = self._select_relevance_score_fn()
    docs_and_scores = self.similarity_search_with_score(query, k, **kwargs)
    return [(doc, relevance_score_fn(score)) for doc, score in docs_and_scores]
Here, the method that actually fetches the text, similarity_search_with_score, is not implemented in the base VectorStore class we are currently looking at (Proof).
In other words, it uses the implementation of the class Chroma(VectorStore), which is the Chroma DB class inheriting from this VectorStore class.
That means it is the following implementation, where, as mentioned earlier, the score is calculated as "distance."
- `similarity_search_with_score`
    - A method that takes the query itself as input and retrieves the top k documents with high similarity. It also outputs the calculated similarity scores.
    - However, the similarity here is represented as distance; smaller values mean higher similarity.
    - For cosine distance, it's (1 - cosine similarity).
Now, the part that bridges this gap is the rest of the code above:
return [(doc, relevance_score_fn(score)) for doc, score in docs_and_scores]
If you look closely, the relevance_score_fn function is applied to the score part.
And this function is defined as follows:
relevance_score_fn = self._select_relevance_score_fn()
This _select_relevance_score_fn method is also not implemented in the VectorStore class (Proof).
This is because the implementation of whether it uses "similarity" or "distance" differs for each DB.
In this case, the implementation for class Chroma(VectorStore) is as follows:
def _select_relevance_score_fn(self) -> Callable[[float], float]:
    """Select the relevance score function based on collections distance metric.

    The most similar documents will have the lowest relevance score. Default
    relevance score function is euclidean distance. Distance metric must be
    provided in `collection_metadata` during initialization of Chroma object.
    Example: collection_metadata={"hnsw:space": "cosine"}. Available distance
    metrics are: 'cosine', 'l2' and 'ip'.

    Returns:
        The relevance score function.

    Raises:
        ValueError: If the distance metric is not supported.
    """
    if self.override_relevance_score_fn:
        return self.override_relevance_score_fn

    distance = "l2"
    distance_key = "hnsw:space"
    metadata = self._collection.metadata

    if metadata and distance_key in metadata:
        distance = metadata[distance_key]

    if distance == "cosine":
        return self._cosine_relevance_score_fn
    elif distance == "l2":
        return self._euclidean_relevance_score_fn
    elif distance == "ip":
        return self._max_inner_product_relevance_score_fn
    else:
        raise ValueError(
            "No supported normalization function"
            f" for distance metric of type: {distance}."
            "Consider providing relevance_score_fn to Chroma constructor."
        )
Looking at the code, distance = "l2" seems to be the default, so self._euclidean_relevance_score_fn is being used.
(At this point, observant readers will realize that "passing some kind of function means the score is being changed to a different value." If the score could be used as is—i.e., score = similarity—an identity function would suffice.)
This method is implemented in the VectorStore class.
@staticmethod
def _euclidean_relevance_score_fn(distance: float) -> float:
    """Return a similarity score on a scale [0, 1]."""
    # The 'correct' relevance function
    # may differ depending on a few things, including:
    # - the distance / similarity metric used by the VectorStore
    # - the scale of your embeddings (OpenAI's are unit normed. Many
    #   others are not!)
    # - embedding dimensionality
    # - etc.
    # This function converts the Euclidean norm of normalized embeddings
    # (0 is most similar, sqrt(2) most dissimilar)
    # to a similarity function (0 to 1)
    return 1.0 - distance / math.sqrt(2)
As you can see in the last line, "distance" is being converted into "similarity."
Because this kind of processing is included, you need to consider whether you are dealing with "similarity" or "distance" by checking the original implementation.
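Plugging a few distances into that conversion makes the mapping concrete; this is a pure-Python restatement of the formula above (assuming unit-normed embeddings, as the comment in the source notes):

```python
import math

def euclidean_relevance_score(distance: float) -> float:
    # Same conversion as LangChain's _euclidean_relevance_score_fn:
    # distance 0 (identical) -> 1.0, distance sqrt(2) (orthogonal unit vectors) -> 0.0
    return 1.0 - distance / math.sqrt(2)

print(euclidean_relevance_score(0.0))            # 1.0 (most similar)
print(euclidean_relevance_score(math.sqrt(2)))   # 0.0 (most dissimilar)
print(round(euclidean_relevance_score(0.5), 4))  # 0.6464
```

So a small "distance" coming out of Chroma becomes a large "relevance score" on the [0, 1] scale, which is exactly why the threshold in `score_threshold` is a similarity, not a distance.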
(Supplementary: Detailed Discussion 3) Difference between embed_documents and embed_query
In LangChain, you probably use two main methods when converting text into embedding vectors.
When the target to be vectorized is the document to be searched:
embeddings = embedding_model.embed_documents(texts)
When the target to be vectorized is the user's query:
embedding = embedding_model.embed_query(query)
The reason for this distinction is that some embedding models are designed to output different vectors depending on the intended use. This is because in RAG, the text we want to retrieve isn't necessarily "text similar to the question statement," but rather "text that contains the expected answer."
Therefore, rather than vectorizing the question statement as-is, generating a hypothetical answer to the question and vectorizing that text might ultimately improve search accuracy. In embedding models capable of such specialized processing, there is value in using these methods separately as shown above.
It appears that Google's embedding models incorporate this kind of processing.
For details, you can see from the link above that the results of the following two differ.
query_embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001", task_type="retrieval_query"
)
query_vecs = [query_embeddings.embed_query(q) for q in [query, query_2, answer_1]]

doc_embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001", task_type="retrieval_document"
)
doc_vecs = [doc_embeddings.embed_query(q) for q in [query, query_2, answer_1]]
The difference lies in the task_type. When it is retrieval_query, it assumes a user's query and outputs a vector close to the expected answer text. On the other hand, when it is retrieval_document, it assumes the document being searched and is vectorized as-is.
Quite clever. As users, let's just use embed_query for user queries and embed_documents for documents without overthinking it.
For a more detailed explanation, please also refer to the following:
Actually making the LLM answer with a chain
Thank you to everyone who read through the detailed technical discussion. Now, back to the main topic.
At this point, we have successfully created the Retriever class, so all that's left is to construct and use the chain using standard LCEL notation. Specifically, the implementation is as follows.
# Prompt definition (translated from the original Japanese)
prompt_template = """
You are an AI assistant belonging to asap Inc.
You are expected to provide appropriate information when a user asks questions related to the service or makes small talk.
If the user asks for information about the service, please retrieve information from the sources below and answer.
When answering based on the sources below, please also output the text of the retrieved information so that the user can check the primary source.

Sources
{context}
"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", prompt_template),
        ("human", "{query}"),
    ]
)
# Define the chain (get context with retriever, fit into Prompt, send to LLM)
chain = (
    {"context": dence_retriever, "query": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Execution
print("===== Execution result of DenseRetriever =====")
dense_docs = dence_retriever.invoke(query)
print("\nDenseRetrieved Documents:", dense_docs)

print("\n================= LLM Execution Result =================")
result = chain.invoke(query)
print(result)
The prompt is written assuming use as a corporate Q&A bot. Then, it's implemented by incorporating the created dence_retriever into the chain.
For a detailed explanation of LCEL notation, I have written an article on that as well, so please refer to it.
Having the LLM Answer Questions via RAG from Stored DB Content Using Agents
Now, let's implement the Agent.
Even though I call it an Agent, we are primarily using Tool Calling.
We set the RAG function as a Tool, and then the LLM decides whether or not to use that Tool—a common pattern for Tool Calling.
Currently, implementing Tool Calling is recommended to be done as an Agent using LangGraph, so that's why I'm referring to it as an Agent.
The corresponding code can be found below.
The content is very similar to search_rag_documents_local.py, so I will only explain the differences.
Setting the Retriever as a Tool
# Prepare DenseRetriever
dence_retriever = VectorSearchRetriever(
    vector_store=vector_store,
    embedding_model=embedding_model,
    k=5,
)

tool = dence_retriever.as_tool(
    name="Document_Search_Tool",
    description="A tool for retrieving information related to service terms and conditions.",
)
In the code above, we set the created custom Retriever class as a Tool.
To set it as a Tool, simply use the as_tool method. It's quite easy.
name is where you enter the name of the Tool. Basically, only English characters are accepted, and spaces are not allowed.
description is where you provide an explanation for the Tool. Both Japanese and spaces are allowed here, but keep in mind that the LLM looks at this description to determine which Tool to use for a given query.
Therefore, it is important to describe it clearly and in detail.
Defining the Prompt
# Prompt Definition
prompt_template = """
You are an AI assistant belonging to asap Inc.
You are expected to provide appropriate information when a user asks questions related to the service or initiates small talk.
If the user asks for information regarding the service, please use the tool to retrieve information and answer.
When answering using information retrieved from the tool, please also output the text of the retrieved information so that the user can verify the primary source.
"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", prompt_template),
        ("human", "{messages}"),
    ]
)
This is slightly different from the previous code, but the base remains the same.
The difference is that the mention of the source ({context}) has been removed from the system prompt.
Since we will be executing a Tool to obtain external information after the query, there is no need to explicitly describe it here.
Also, since we are using LangGraph, the conversation state is passed around as messages; accordingly, the human turn here uses the {messages} placeholder.
Defining the Agent
# Creating the Agent
agent = create_react_agent(
    model=llm,
    tools=[tool],
    state_modifier=prompt,
)
In this section, the Agent is defined.
In LangGraph, you generally use create_react_agent.
This "react" refers to "ReAct" (Reasoning and Acting), not the frontend framework React.
I haven't read the paper on ReAct yet, so I'll leave an explanation article here for now. I'll study it properly sometime.
The important thing is that by setting it up as above, the LLM model, the Tool (RAG), and the prompt are all configured as an Agent.
Executing Agents
print("\n================= Agent Execution Results =================")
result = agent.invoke({'messages':query})
print_agent_result_details(result)
print("\n================= Final Output Result =================")
print(result['messages'][-1].content)
In the above, the Agent is executed, and the intermediate steps are displayed. The print_agent_result_details function is a function I wrote within this code to display the otherwise hard-to-read Agent output in a clearer way.
Executing an Agent is done using invoke, just like a Chain.
You might occasionally see articles using a run method, but be careful as that is an old, deprecated way of writing it (I've been caught by that before).
Summary
It was quite long across two parts, but thank you to everyone who read through.
This time, we used a local DB and local documents.
In an actual RAG implementation, you would likely use documents and databases on cloud services like Google Cloud.
In future posts, I would like to explain the implementation methods for those cases.
(That's the part I struggled with, so I want to record it for my own reference...)
See you in the next post!