
Summarizing Long Documents with LangChain: MapReduce and Refine Methods


Is everyone making good use of ChatGPT?
ChatGPT can be used for many purposes, and summarization is one of its specialties.
As an LLM (Large Language Model), it excels at tasks such as summarization, creating meeting minutes, and keyword extraction.
However, LLMs have a limit on input length (the context window, measured in tokens). As the input grows, so does the computational cost, so the amount of text a model can accept at once is capped.
So how do we work around this limit? LangChain provides two methods to solve this problem.

What is LangChain?

LangChain is a library for efficiently extending the capabilities of language models like ChatGPT. Using this tool, common challenges of language models, such as handling long inputs or orchestrating complex multi-step tasks, can be solved with a small amount of code.
https://www.langchain.com
https://zenn.dev/umi_mori/books/prompt-engineer/viewer/langchain_overview

Installing Necessary Libraries

pip install openai
pip install langchain
pip install tiktoken

We will be using the OpenAI API from here on, so please set your OpenAI API key.
Refer to other articles for instructions on how to issue one.

map_reduce and refine methods

Below is a link to the official documentation regarding summarization.
https://python.langchain.com/docs/use_cases/summarization
When summarizing long texts, we take the following approach!

  1. Split the text into fixed-size chunks.
  2. Process each chunk.

Step 1, splitting the text, looks like this. The splitter below divides the text into chunks of at most the specified number of tokens (counted with tiktoken) and returns them as a list.

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1500, chunk_overlap=0, separator=".",
)
texts = text_splitter.split_text(text)

Please check the following link for details.
https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter
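As a mental model, a simplified plain-Python version of this splitter might look as follows. It counts characters rather than tiktoken tokens, which is a deliberate simplification; only the greedy packing behavior is the same in spirit:

```python
def split_text(text: str, chunk_size: int = 1500, separator: str = ".") -> list[str]:
    """Greedy character-based splitter (simplified stand-in for
    CharacterTextSplitter.from_tiktoken_encoder, which counts tokens)."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            # the piece still fits into the current chunk
            current = candidate
        else:
            # flush the current chunk and start a new one with this piece
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

texts = split_text("a" * 10 + "." + "b" * 10 + "." + "c" * 10, chunk_size=21)
print(texts)  # → ['aaaaaaaaaa.bbbbbbbbbb', 'cccccccccc']
```

The real splitter also supports chunk_overlap, which this sketch omits for brevity.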

LangChain provides two methods for processing the split texts in Step 2.
These are the map_reduce method and the refine method. Let's understand the differences by looking at the diagrams and code!

map_reduce method

The map_reduce method follows the flow below.

  1. Split the document
  2. Execute a prompt to the LLM for each document (map)
  3. Execute a prompt to the LLM to integrate each obtained response result (reduce) ⇒ Final Answer
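Before looking at the LangChain version, the flow above can be sketched in plain Python. The toy_llm_summarize stub (first-sentence extraction) is purely illustrative and stands in for the prompted ChatOpenAI calls:

```python
def toy_llm_summarize(text: str) -> str:
    # Toy "LLM": pretend the summary is just the first sentence.
    return text.split(".")[0].strip() + "."

def map_reduce_summarize(docs: list[str]) -> str:
    # map: summarize each chunk independently (this step is parallelizable)
    partial_summaries = [toy_llm_summarize(d) for d in docs]
    # reduce: combine the partial summaries and summarize once more
    combined = " ".join(partial_summaries)
    return toy_llm_summarize(combined)

docs = ["First topic. More detail here.", "Second topic. Even more detail."]
print(map_reduce_summarize(docs))  # → First topic.
```

Because each map call only sees its own chunk, the calls are independent, which is exactly why this method parallelizes well.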

The code is as follows.

from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document

# Prompt applied to each chunk (map step)
map_prompt_template = """Summarize the following text by theme.
------
{text}
------
"""

# Prompt used to combine the per-chunk summaries (reduce step)
map_combine_template = """Summarize the following text by theme.
------
{text}
------
"""

map_first_prompt = PromptTemplate(template=map_prompt_template, input_variables=["text"])
map_combine_prompt = PromptTemplate(template=map_combine_template, input_variables=["text"])

map_chain = load_summarize_chain(
    llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"),
    reduce_llm=ChatOpenAI(temperature=0, model_name="gpt-4"),
    collapse_llm=ChatOpenAI(temperature=0, model_name="gpt-4"),
    chain_type="map_reduce",
    map_prompt=map_first_prompt,
    combine_prompt=map_combine_prompt,
    collapse_prompt=map_combine_prompt,
    token_max=5000,
    verbose=True,
)

with open("/path/to/file/long_text.txt", "r", encoding="utf-8") as f:
    text = f.read()

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1500, chunk_overlap=0, separator=".",
)
texts = text_splitter.split_text(text)

docs = [Document(page_content=t) for t in texts]
result = map_chain({"input_documents": docs}, return_only_outputs=True)
print(result["output_text"])

I'll explain the important parts!

map_prompt_template = """Summarize the following text by theme.
------
{text}
------
"""

This is the prompt for each split text. {text} represents the split text.

llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

This is the base language model. temperature is a parameter that controls the diversity of the generated text. It ranges from 0 to 2 in the OpenAI API, and for summarization a low value is recommended. model_name allows you to specify the OpenAI model; in this case, we are using gpt-3.5-turbo.

map_combine_template = """Summarize the following text by theme.
------
{text}
------
"""

This is the prompt to combine each summarized split text. {text} contains the responses from the LLM so far.
You can use the same language model as specified above to combine the text, but you can also specify a different one! Personally, from a cost-performance perspective, I recommend using gpt-3.5-turbo for map and gpt-4 for reduce!

reduce_llm=ChatOpenAI(temperature=0, model_name="gpt-4")

Also, if the combined text exceeds the model's limit during the reduce step, an error will occur. In such cases, the intermediate summaries can first be compressed ("collapsed"). You can set the upper limit with token_max, and specify the prompt used when the limit is exceeded with collapse_prompt. Here, I've set it to the same prompt as the reduce step.

collapse_llm=ChatOpenAI(temperature=0, model_name="gpt-4")
collapse_prompt=map_combine_prompt

refine method

The refine method follows the flow below.

  1. Split the document
  2. Execute the prompt to the LLM on the first chunk
  3. Execute the prompt to the LLM on the answer from step 2 together with the next chunk
  4. Repeat step 3; the answer after the last chunk is the final answer
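The flow above can likewise be sketched in plain Python. Again, toy_llm_refine is an illustrative stub standing in for the prompted LLM call; here it simply appends the new chunk's first sentence to the running summary:

```python
def toy_llm_refine(existing_answer: str, text: str) -> str:
    # Toy "LLM": merge the previous answer with the new chunk's first sentence.
    first_sentence = text.split(".")[0].strip() + "."
    return (existing_answer + " " + first_sentence).strip()

def refine_summarize(docs: list[str]) -> str:
    summary = ""
    for doc in docs:  # sequential: each step sees the previous answer
        summary = toy_llm_refine(summary, doc)
    return summary

docs = ["First topic. Detail.", "Second topic. Detail."]
print(refine_summarize(docs))  # → First topic. Second topic.
```

Note that the loop is inherently sequential: step N cannot start until step N-1 has produced its answer, which is why refine does not parallelize.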

The code is as follows.

from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document

# Prompt for the first chunk
refine_first_template = """Summarize the following text by theme.
------
{text}
------
"""

# Prompt for every subsequent chunk: previous answer + new chunk
refine_template = """Below is an existing summary, followed by new text. Update the summary by theme to incorporate the new text.
------
{existing_answer}
{text}
------
"""

refine_first_prompt = PromptTemplate(input_variables=["text"], template=refine_first_template)
refine_prompt = PromptTemplate(input_variables=["existing_answer", "text"], template=refine_template)

refine_chain = load_summarize_chain(
    ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k"),
    chain_type="refine",
    question_prompt=refine_first_prompt,
    refine_prompt=refine_prompt,
)

with open("/path/to/file/long_text.txt", "r", encoding="utf-8") as f:
    text = f.read()

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1500, chunk_overlap=0, separator=".",
)
texts = text_splitter.split_text(text)

docs = [Document(page_content=t) for t in texts]
result = refine_chain({"input_documents": docs}, return_only_outputs=True)
print(result["output_text"])

I'll explain the important parts!

refine_first_template = """Summarize the following text by theme.
------
{text}
------
"""

This is the prompt executed first. {text} is the split text.

refine_template = """Below is an existing summary, followed by new text. Update the summary by theme to incorporate the new text.
------
{existing_answer}
{text}
------
"""

This is the prompt repeated from the second chunk onwards. It instructs the model to update the summary using the previous answer and the content of the new chunk. {existing_answer} contains the answer produced before this prompt runs.

Which one is better?

We have introduced two methods, and I'm sure you're wondering which one is better. However, both methods have their pros and cons, and it's impossible to declare one as definitively superior.
Here are the merits and demerits of each, albeit from a textbook perspective.

map_reduce method

  • Pros
    Since processes can be executed in parallel, it is suitable for processing large volumes of documents.
  • Cons
    Because each document is processed individually, there is a possibility of overlooking relationships or context between documents. This method might be disadvantageous particularly when documents refer to each other or when the overall context is crucial.

refine method

  • Pros
    By adopting an iterative approach, it is possible to improve results using information obtained at each step.
  • Cons
    Since it processes only one document at a time, this method might be disadvantageous for tasks requiring detailed information from many documents or when documents frequently refer to each other.

Personally, I recommend the map_reduce method, since it allows for more parameter configurations and is faster thanks to parallel processing! But honestly, it would be nice to be able to run both methods at once and compare the answers...
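If you do want to run both methods on the same chunks and compare, a minimal harness might look like the following. The run_map_reduce and run_refine stubs here are placeholders; in practice you would wrap calls to the map_chain and refine_chain objects built above:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder summarizers standing in for the two LangChain chains;
# swap in e.g. lambda d: map_chain({"input_documents": d}, return_only_outputs=True)
def run_map_reduce(docs: list[str]) -> str:
    return " / ".join(d.split(".")[0] for d in docs)

def run_refine(docs: list[str]) -> str:
    summary = ""
    for d in docs:
        summary = (summary + " " + d.split(".")[0]).strip()
    return summary

docs = ["Topic A. Detail.", "Topic B. Detail."]
with ThreadPoolExecutor() as pool:
    # submit both strategies concurrently, then compare their answers
    futures = {name: pool.submit(fn, docs)
               for name, fn in [("map_reduce", run_map_reduce), ("refine", run_refine)]}
    for name, fut in futures.items():
        print(name, ":", fut.result())
```

Since both strategies call an external API and spend most of their time waiting on I/O, running them in threads costs little extra wall-clock time.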

Conclusion

So far, we have explained in detail the introduction and usage of map_reduce and refine using the LangChain library. These methods are extremely powerful tools when processing large amounts of text data for tasks like summarization and classification!

map_reduce facilitates text splitting and parallel processing, allowing for smooth processing even with large datasets. On the other hand, refine processes the chunks sequentially, updating the summary step by step, which enables fine-grained control.

By using these methods appropriately, you can effectively analyze and summarize text data, improving the accuracy of information extraction. Furthermore, these technologies can be applied across various industries and scenarios, serving as a powerful means to derive valuable insights from text data.

Text summarization is one of the most powerful use cases for LLMs! I hope this article serves as a first step in unlocking that potential. If you have any questions or issues, feel free to use the community or documentation! Happy coding!
