
Trying Out GraphRAG for Python (Neo4j)


GraphRAG for Python

Neo4j blog post: GraphRAG Python Package: Accelerating GenAI With Knowledge Graphs

Official guide: GraphRAG for Python

The official Neo4j package that includes the GraphRAG functionality.

The GraphRAG Python package from Neo4j provides end-to-end workflows that take you from unstructured data to knowledge graph creation, knowledge graph retrieval, and full GraphRAG pipelines in one place. Whether you’re using Python to build knowledge assistants, search APIs, chatbots, or report generators, this package makes it easy to incorporate knowledge graphs to improve your retrieval-augmented generation (RAG) relevance, accuracy, and explainability.



The documentation is available here.

The official Neo4j GraphRAG Package for Python enables developers to build graph retrieval-augmented generation (GraphRAG) applications using Neo4j and Python.
As a first-party library, it provides a robust, feature-rich, high-performance solution, with the guarantee of long-term support and maintenance from Neo4j.

Supported Neo4j versions:

  • Neo4j >= 5.18.1
  • Neo4j Aura >= 5.18.0

Supported Python versions:

  • Python 3.12
  • Python 3.11
  • Python 3.10
  • Python 3.9

GraphRAG: Adding Knowledge to GenAI

GraphRAG, which combines knowledge graphs with RAG, addresses common problems with large language models (LLMs), such as hallucinations, and adds domain-specific context, delivering higher quality and effectiveness than conventional RAG approaches.
Knowledge graphs provide the contextual data LLMs need to answer reliably as trusted agents in complex workflows.
Unlike most RAG solutions, GraphRAG can integrate both structured and semi-structured information into the retrieval process.

The GraphRAG Python package makes it easy to create knowledge graphs and knowledge graph retrieval patterns that combine graph traversal, query generation with text2Cypher, vector search, and full-text search.
It also provides additional tooling to support complete RAG pipelines, so you can implement GraphRAG seamlessly in your GenAI applications and workflows.


1. Build a Knowledge Graph

The pipeline produces three kinds of nodes:

  1. Document: metadata for document sources
  2. Chunk: text chunks from the documents, with embeddings to power vector retrieval
  3. __Entity__: entities extracted from the text chunks

Neo4j Driver

import neo4j

# NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD must point at your instance
driver = neo4j.GraphDatabase.driver(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
)
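A quick way to fail fast on a bad URI or credentials is the standard Neo4j Python driver connectivity check:

# optional: raises immediately if the connection cannot be established
driver.verify_connectivity()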

LLM & Embedding Model

from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings

llm = OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={
        "response_format": {"type": "json_object"},  # use json_object formatting for best results
        "temperature": 0,  # turn temperature down for more deterministic results
    },
)

# create the text embedder
embedder = OpenAIEmbeddings()

LLM model: OpenAI, Google VertexAI, Anthropic, Cohere, Azure OpenAI, local Ollama models, and any chat model that works with LangChain are supported.
Embedding model: the default OpenAI embedding model, text-embedding-ada-002, is used here. Embedding models from other providers can also be used (SentenceTransformer, VertexAI, MistralAI, Cohere, Azure OpenAI, Ollama, LlamaIndex); a SentenceTransformer sketch follows.
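As an illustration of swapping providers, here is a minimal sketch using a local SentenceTransformer model (the model name is just an example, not from the article):

from neo4j_graphrag.embeddings import SentenceTransformerEmbeddings

# local embedder; all-MiniLM-L6-v2 produces 384-dimensional vectors,
# so any vector index built later must use dimensions=384 instead of 1536
embedder = SentenceTransformerEmbeddings(model="all-MiniLM-L6-v2")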


📦 Installation

To install the latest stable release, run:

pip install neo4j-graphrag

Note: installing Python packages in a virtual environment is recommended.
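For example, a throwaway environment with the standard venv workflow:

python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install neo4j-graphrag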


Configuring the LLM model

Azure OpenAI

from neo4j_graphrag.llm import AzureOpenAILLM
llm = AzureOpenAILLM(
    model_name="gpt-4o",
    azure_endpoint="https://example-endpoint.openai.azure.com/",  # update with your endpoint
    api_version="2024-06-01",  # update appropriate version
)

VertexAI

from neo4j_graphrag.llm import VertexAILLM
from vertexai.generative_models import GenerationConfig

generation_config = GenerationConfig(temperature=0.0)
llm = VertexAILLM(
    model_name="gemini-1.5-flash-001", generation_config=generation_config
)

Anthropic

from neo4j_graphrag.llm import AnthropicLLM

llm = AnthropicLLM(
    model_name="claude-3-opus-20240229",
    model_params={"max_tokens": 1000},  # max_tokens must be specified
)

MistralAI

from neo4j_graphrag.llm import MistralAILLM

llm = MistralAILLM(
    model_name="mistral-small-latest",
)

Cohere

from neo4j_graphrag.llm import CohereLLM

llm = CohereLLM(
    model_name="command-r",
)

Ollama (OpenAI API)

from neo4j_graphrag.llm import OpenAILLM
llm = OpenAILLM(
    api_key="ollama",
    base_url="http://127.0.0.1:11434/v1",
    model_name="orca-mini"
)

Ollama (LangChain)

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="llama3:8b")

Optional Inputs: Schema & Prompt Template

potential_schema: lists the entities and relationships the LLM should look for in the text (a sketch of the triple format follows the definitions below).

basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]

academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]

medical_node_labels = ["Anatomy", "BiologicalProcess", "Cell", "CellularComponent",
                      "CellType", "Condition", "Disease", "Drug",
                      "EffectOrPhenotype", "Exposure", "GeneOrProtein", "Molecule",
                      "MolecularFunction", "Pathway"]

node_labels = basic_node_labels + academic_node_labels + medical_node_labels

# define relationship types
rel_types = ["ACTIVATES", "AFFECTS", "ASSESSES", "ASSOCIATED_WITH", "AUTHORED",
   "BIOMARKER_FOR",]

Custom prompt for entity extraction (replacing the default prompt):

prompt_template = '''
You are a medical researcher tasked with extracting information from papers 
and structuring it in a property graph to inform further medical and research Q&A.

Extract the entities (nodes) and specify their type from the following Input text.
Also extract the relationships between these nodes. The relationship direction goes from the start node to the end node. 


Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity", "properties": {{"name": "name of entity" }} }}],
  "relationships": [{{"type": "TYPE_OF_RELATIONSHIP", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Description of the relationship"}} }}] }}

- Use only the information from the Input text. Do not add any additional information.  
- If the input text is empty, return empty JSON. 
- Make sure to create as many nodes and relationships as needed to offer rich medical context for further research.
- An AI knowledge assistant must be able to read this graph and immediately understand the context to inform detailed research questions. 
- Multiple documents will be ingested from different sources and we are using this property graph to connect information, so make sure entity types are fairly general. 

Use only the following nodes and relationships (if provided):
{schema}

Assign a unique ID (string) to each node, and reuse it to define relationships.
Do respect the source and target node types for relationship and
the relationship direction.

Do not return any additional information other than the JSON.

Examples:
{examples}

Input text:

{text}
'''

Creating the SimpleKGPipeline

from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder_pdf = SimpleKGPipeline(
    llm=llm,  # the extraction LLM defined above
    driver=driver,
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True,
)

Running the Knowledge Graph Builder

pdf_file_paths = [
    'truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf',
    'truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf',
    'truncated-pdfs/pgpm-13-39-trunc.pdf',
]

# run_async is a coroutine: await it in a notebook, or wrap it in asyncio.run()
for path in pdf_file_paths:
    print(f"Processing: {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"Result: {pdf_result}")

A Note on Custom & Detailed Knowledge Graph Building

SimpleKGPipeline chains together the following components, each of which can be swapped out (see the sketch after this list):

  • Document Parser: extracts text from documents, such as PDFs.
  • Text Splitter: splits text into smaller pieces manageable by the LLM context window (token limit).
  • Chunk Embedder: computes the text embeddings for each chunk.
  • Schema Builder: provides a schema to ground the LLM entity extraction for an accurate and easily navigable knowledge graph.
  • Entity & Relation Extractor: extracts relevant entities and relations from the text.
  • Knowledge Graph Writer: saves the identified entities and relations to the knowledge graph.
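As one example of swapping a component, the package ships adapters for LangChain text splitters, so a character splitter can stand in for the fixed-size one; a sketch assuming langchain-text-splitters is installed:

from langchain_text_splitters import CharacterTextSplitter
from neo4j_graphrag.experimental.components.text_splitters.langchain import LangChainTextSplitterAdapter

# wrap a LangChain splitter so SimpleKGPipeline can use it as its text_splitter
text_splitter = LangChainTextSplitterAdapter(
    CharacterTextSplitter(chunk_size=500, chunk_overlap=100, separator=".")
)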

[Figure: Knowledge Graph Builder pipeline]


2. Retrieve Data From Your Knowledge Graph

  • Vector Retriever: performs similarity searches using vector embeddings.
  • Vector Cypher Retriever: combines vector search with retrieval queries in Cypher, Neo4j's graph query language, to traverse the graph and incorporate additional nodes and relationships.
  • Hybrid Retriever: combines vector and full-text search.
  • Hybrid Cypher Retriever: combines vector and full-text search with Cypher retrieval queries for additional graph traversal.
  • Text2Cypher: converts natural language queries into Cypher queries to run against Neo4j.
  • Weaviate & Pinecone Neo4j Retriever: allows you to search vectors stored in Weaviate or Pinecone and connect them to nodes in Neo4j using external id properties.
  • Custom Retriever: allows for tailored retrieval methods based on specific needs.

Vector Retriever

Retrieves data from the knowledge graph using approximate nearest neighbor (ANN) vector search.

from neo4j_graphrag.indexes import create_vector_index

# 1536 dimensions matches OpenAI's text-embedding-ada-002
create_vector_index(driver, name="text_embeddings", label="Chunk",
                    embedding_property="embedding", dimensions=1536,
                    similarity_fn="cosine")

Create the vector retriever:

from neo4j_graphrag.retrievers import VectorRetriever

vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    return_properties=["text"],
)

Run the retriever with a prompt:

import json

vector_res = vector_retriever.get_search_results(
    query_text="How is precision medicine applied to Lupus?",
    top_k=3,
)
for i in vector_res.records:
    print("====\n" + json.dumps(i.data(), indent=4))
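The Hybrid Retriever listed above combines this vector index with a full-text index over the same chunks; a minimal sketch, assuming a full-text index named text_fulltext (the name is not from the article):

from neo4j_graphrag.indexes import create_fulltext_index
from neo4j_graphrag.retrievers import HybridRetriever

# full-text index over the same Chunk text property; the name is illustrative
create_fulltext_index(driver, name="text_fulltext", label="Chunk",
                      node_properties=["text"])

hybrid_retriever = HybridRetriever(
    driver,
    vector_index_name="text_embeddings",
    fulltext_index_name="text_fulltext",
    embedder=embedder,
    return_properties=["text"],
)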

Vector Cypher Retriever

After retrieving nodes with vector search, you can bring in graph traversal logic using Cypher, Neo4j's graph query language.

Create a retriever that fetches Chunk nodes via vector search and then traverses out to entities up to three hops away:

from neo4j_graphrag.retrievers import VectorCypherRetriever

vc_retriever = VectorCypherRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    retrieval_query="""
//1) Go out 2-3 hops in the entity graph and get relationships
WITH node AS chunk
MATCH (chunk)<-[:FROM_CHUNK]-()-[relList:!FROM_CHUNK]-{1,2}()
UNWIND relList AS rel

//2) collect relationships and text chunks
WITH collect(DISTINCT chunk) AS chunks, 
  collect(DISTINCT rel) AS rels

//3) format and return context
RETURN '=== text ===\n' + apoc.text.join([c in chunks | c.text], '\n---\n') + '\n\n=== kg_rels ===\n' +
  apoc.text.join([r in rels | startNode(r).name + ' - ' + type(r) + '(' + coalesce(r.details, '') + ')' +  ' -> ' + endNode(r).name ], '\n---\n') AS info
"""
)

Run the retriever with the same prompt:

vc_res = vc_retriever.get_search_results(query_text="How is precision medicine applied to Lupus?", top_k=3)

# print output
kg_rel_pos = vc_res.records[0]['info'].find('\n\n=== kg_rels ===\n')
print("# Text Chunk Context:")
print(vc_res.records[0]['info'][:kg_rel_pos])
print("# KG Context From Relationships:")
print(vc_res.records[0]['info'][kg_rel_pos:])
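Text2Cypher, also from the retriever list above, skips vector search entirely and generates Cypher from natural language; a minimal sketch (the schema hint and question are illustrative):

from neo4j_graphrag.retrievers import Text2CypherRetriever

# hand-written schema hint to ground the generated Cypher; contents are illustrative
neo4j_schema = """
Node properties:
Chunk {text: STRING}
Relationship properties:
FROM_CHUNK {}
"""

t2c_retriever = Text2CypherRetriever(
    driver=driver,
    llm=llm,  # reuses the LLM defined earlier
    neo4j_schema=neo4j_schema,
)
t2c_res = t2c_retriever.search(query_text="Which genes are associated with Lupus?")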


3. Instantiate and Run GraphRAG

The GraphRAG Python package makes it easy to instantiate and run GraphRAG pipelines via its GraphRAG class.

Create GraphRAG objects for both the vector retriever and the vector Cypher retriever:

from neo4j_graphrag.llm import OpenAILLM as LLM
from neo4j_graphrag.generation import RagTemplate
from neo4j_graphrag.generation.graphrag import GraphRAG

llm = LLM(model_name="gpt-4o",  model_params={"temperature": 0.0})

rag_template = RagTemplate(template='''Answer the Question using the following Context. Only respond with information mentioned in the Context. Do not inject any speculative information not mentioned. 

# Question:
{query_text}
 
# Context:
{context}

# Answer:
''', expected_inputs=['query_text', 'context'])

v_rag  = GraphRAG(llm=llm, retriever=vector_retriever, prompt_template=rag_template)
vc_rag = GraphRAG(llm=llm, retriever=vc_retriever, prompt_template=rag_template)

Ask a question and compare the different knowledge graph retrieval patterns:

q = "How is precision medicine applied to Lupus? provide in list format."

print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k':5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':5}).answer}")

Ask a more complex question that requires information from multiple text chunks, and compare again:

q = "Can you summarize systemic lupus erythematosus (SLE)? including common effects, biomarkers, and treatments? Provide in detailed list format."

v_rag_result = v_rag.search(q, retriever_config={'top_k': 5}, return_context=True)
vc_rag_result = vc_rag.search(q, retriever_config={'top_k': 5}, return_context=True)

print(f"Vector Response: \n{v_rag_result.answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag_result.answer}")