
Trying Out GraphRAG for Python (Neo4j)


GraphRAG for Python

Neo4j blog post: GraphRAG Python Package: Accelerating GenAI With Knowledge Graphs

Official guide: GraphRAG for Python

The official Neo4j package that includes the GraphRAG functionality.

The GraphRAG Python package from Neo4j provides end-to-end workflows that take you from unstructured data to knowledge graph creation, knowledge graph retrieval, and full GraphRAG pipelines in one place. Whether you’re using Python to build knowledge assistants, search APIs, chatbots, or report generators, this package makes it easy to incorporate knowledge graphs to improve your retrieval-augmented generation (RAG) relevance, accuracy, and explainability.



The documentation is available here.

The official Neo4j GraphRAG Package for Python enables developers to build graph retrieval-augmented generation (GraphRAG) applications using Neo4j and Python.
As a first-party library, it provides a robust, feature-rich, high-performance solution, with the guarantee of long-term support and maintenance from Neo4j.

Supported Neo4j versions:

  • Neo4j >= 5.18.1
  • Neo4j Aura >= 5.18.0

Supported Python versions:

  • Python 3.12
  • Python 3.11
  • Python 3.10
  • Python 3.9

GraphRAG: Adding Knowledge to GenAI

GraphRAG, which combines knowledge graphs with RAG, addresses common problems with large language models (LLMs), such as hallucinations, and adds domain-specific context, delivering higher quality and effectiveness than conventional RAG approaches.
Knowledge graphs provide the contextual data LLMs need to answer reliably as trusted agents in complex workflows.
Unlike most RAG solutions, GraphRAG can integrate both structured and semi-structured information into the retrieval process.

The GraphRAG Python package makes it easy to create knowledge graphs and knowledge graph retrieval patterns that combine graph traversal, query generation with text2Cypher, vector search, and full-text search.
It also provides additional tooling to support complete RAG pipelines, so you can implement GraphRAG seamlessly in your GenAI applications and workflows.


1. Build a Knowledge Graph

The pipeline produces three kinds of nodes:

  1. Document: metadata for document sources
  2. Chunk: text chunks from the documents, with embeddings to power vector retrieval
  3. __Entity__: entities extracted from the text chunks

Neo4j Driver

import neo4j

# NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD must point at your instance
driver = neo4j.GraphDatabase.driver(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
)
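A quick way to fail fast on a bad URI or credentials is the standard Neo4j Python driver connectivity check:

# optional: raises immediately if the connection cannot be established
driver.verify_connectivity()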

LLM & Embedding Model

from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings

llm = OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={
        "response_format": {"type": "json_object"},  # use json_object formatting for best results
        "temperature": 0,  # turn temperature down for more deterministic results
    },
)

# create the text embedder
embedder = OpenAIEmbeddings()

LLM model: OpenAI, Google VertexAI, Anthropic, Cohere, Azure OpenAI, local Ollama models, and any chat model that works with LangChain are supported.
Embedding model: the default OpenAI embedding model, text-embedding-ada-002, is used here. Embedding models from other providers can also be used (SentenceTransformer, VertexAI, MistralAI, Cohere, Azure OpenAI, Ollama, LlamaIndex); a SentenceTransformer sketch follows.
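As an illustration of swapping providers, here is a minimal sketch using a local SentenceTransformer model (the model name is just an example, not from the article):

from neo4j_graphrag.embeddings import SentenceTransformerEmbeddings

# local embedder; all-MiniLM-L6-v2 produces 384-dimensional vectors,
# so any vector index built later must use dimensions=384 instead of 1536
embedder = SentenceTransformerEmbeddings(model="all-MiniLM-L6-v2")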


📦 Installation

To install the latest stable release, run:

pip install neo4j-graphrag

Note: installing Python packages in a virtual environment is recommended.
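For example, a throwaway environment with the standard venv workflow:

python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install neo4j-graphrag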


Configuring the LLM model

Azure OpenAI

from neo4j_graphrag.llm import AzureOpenAILLM
llm = AzureOpenAILLM(
    model_name="gpt-4o",
    azure_endpoint="https://example-endpoint.openai.azure.com/",  # update with your endpoint
    api_version="2024-06-01",  # update appropriate version
)

VertexAI

from neo4j_graphrag.llm import VertexAILLM
from vertexai.generative_models import GenerationConfig

generation_config = GenerationConfig(temperature=0.0)
llm = VertexAILLM(
    model_name="gemini-1.5-flash-001", generation_config=generation_config
)

Anthropic

from neo4j_graphrag.llm import AnthropicLLM

llm = AnthropicLLM(
    model_name="claude-3-opus-20240229",
    model_params={"max_tokens": 1000},  # max_tokens must be specified
)

MistralAI

from neo4j_graphrag.llm import MistralAILLM

llm = MistralAILLM(
    model_name="mistral-small-latest",
)

Cohere

from neo4j_graphrag.llm import CohereLLM

llm = CohereLLM(
    model_name="command-r",
)

Ollama (OpenAI API)

from neo4j_graphrag.llm import OpenAILLM
llm = OpenAILLM(
    api_key="ollama",
    base_url="http://127.0.0.1:11434/v1",
    model_name="orca-mini"
)

Ollama (LangChain)

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="llama3:8b")

Optional Inputs: Schema & Prompt Template

potential_schema: lists the entities and relationships the LLM should look for in the text (a sketch of the triple format follows the definitions below).

basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]

academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]

medical_node_labels = ["Anatomy", "BiologicalProcess", "Cell", "CellularComponent",
                      "CellType", "Condition", "Disease", "Drug",
                      "EffectOrPhenotype", "Exposure", "GeneOrProtein", "Molecule",
                      "MolecularFunction", "Pathway"]

node_labels = basic_node_labels + academic_node_labels + medical_node_labels

# define relationship types
rel_types = ["ACTIVATES", "AFFECTS", "ASSESSES", "ASSOCIATED_WITH", "AUTHORED",
   "BIOMARKER_FOR",]

Custom prompt for entity extraction (replacing the default prompt):

prompt_template = '''
You are a medical researcher tasked with extracting information from papers 
and structuring it in a property graph to inform further medical and research Q&A.

Extract the entities (nodes) and specify their type from the following Input text.
Also extract the relationships between these nodes. The relationship direction goes from the start node to the end node. 


Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity", "properties": {{"name": "name of entity" }} }}],
  "relationships": [{{"type": "TYPE_OF_RELATIONSHIP", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Description of the relationship"}} }}] }}

- Use only the information from the Input text. Do not add any additional information.  
- If the input text is empty, return empty JSON. 
- Make sure to create as many nodes and relationships as needed to offer rich medical context for further research.
- An AI knowledge assistant must be able to read this graph and immediately understand the context to inform detailed research questions. 
- Multiple documents will be ingested from different sources and we are using this property graph to connect information, so make sure entity types are fairly general. 

Use only the following nodes and relationships (if provided):
{schema}

Assign a unique ID (string) to each node, and reuse it to define relationships.
Do respect the source and target node types for relationship and
the relationship direction.

Do not return any additional information other than the JSON.

Examples:
{examples}

Input text:

{text}
'''

Creating the SimpleKGPipeline

from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder_pdf = SimpleKGPipeline(
    llm=llm,  # the extraction LLM defined above
    driver=driver,
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True,
)

Running the Knowledge Graph Builder

pdf_file_paths = [
    'truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf',
    'truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf',
    'truncated-pdfs/pgpm-13-39-trunc.pdf',
]

# run_async is a coroutine: await it in a notebook, or wrap it in asyncio.run()
for path in pdf_file_paths:
    print(f"Processing: {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"Result: {pdf_result}")

A Note on Custom & Detailed Knowledge Graph Building

SimpleKGPipeline chains together the following components, each of which can be swapped out (see the sketch after this list):

  • Document Parser: extracts text from documents, such as PDFs.
  • Text Splitter: splits text into smaller pieces manageable by the LLM context window (token limit).
  • Chunk Embedder: computes the text embeddings for each chunk.
  • Schema Builder: provides a schema to ground the LLM entity extraction for an accurate and easily navigable knowledge graph.
  • Entity & Relation Extractor: extracts relevant entities and relations from the text.
  • Knowledge Graph Writer: saves the identified entities and relations to the knowledge graph.
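As one example of swapping a component, the package ships adapters for LangChain text splitters, so a character splitter can stand in for the fixed-size one; a sketch assuming langchain-text-splitters is installed:

from langchain_text_splitters import CharacterTextSplitter
from neo4j_graphrag.experimental.components.text_splitters.langchain import LangChainTextSplitterAdapter

# wrap a LangChain splitter so SimpleKGPipeline can use it as its text_splitter
text_splitter = LangChainTextSplitterAdapter(
    CharacterTextSplitter(chunk_size=500, chunk_overlap=100, separator=".")
)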

[Figure: Knowledge Graph Builder pipeline]


2. Retrieve Data From Your Knowledge Graph

  • Vector Retriever: performs similarity searches using vector embeddings.
  • Vector Cypher Retriever: combines vector search with retrieval queries in Cypher, Neo4j's graph query language, to traverse the graph and incorporate additional nodes and relationships.
  • Hybrid Retriever: combines vector and full-text search.
  • Hybrid Cypher Retriever: combines vector and full-text search with Cypher retrieval queries for additional graph traversal.
  • Text2Cypher: converts natural language queries into Cypher queries to run against Neo4j.
  • Weaviate & Pinecone Neo4j Retriever: allows you to search vectors stored in Weaviate or Pinecone and connect them to nodes in Neo4j using external id properties.
  • Custom Retriever: allows for tailored retrieval methods based on specific needs.

Vector Retriever

Retrieves data from the knowledge graph using approximate nearest neighbor (ANN) vector search.

from neo4j_graphrag.indexes import create_vector_index

# 1536 dimensions matches OpenAI's text-embedding-ada-002
create_vector_index(driver, name="text_embeddings", label="Chunk",
                    embedding_property="embedding", dimensions=1536,
                    similarity_fn="cosine")

Create the vector retriever:

from neo4j_graphrag.retrievers import VectorRetriever

vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    return_properties=["text"],
)

Run the retriever with a prompt:

import json

vector_res = vector_retriever.get_search_results(
    query_text="How is precision medicine applied to Lupus?",
    top_k=3,
)
for i in vector_res.records:
    print("====\n" + json.dumps(i.data(), indent=4))
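The Hybrid Retriever listed above combines this vector index with a full-text index over the same chunks; a minimal sketch, assuming a full-text index named text_fulltext (the name is not from the article):

from neo4j_graphrag.indexes import create_fulltext_index
from neo4j_graphrag.retrievers import HybridRetriever

# full-text index over the same Chunk text property; the name is illustrative
create_fulltext_index(driver, name="text_fulltext", label="Chunk",
                      node_properties=["text"])

hybrid_retriever = HybridRetriever(
    driver,
    vector_index_name="text_embeddings",
    fulltext_index_name="text_fulltext",
    embedder=embedder,
    return_properties=["text"],
)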

Vector Cypher Retriever

After retrieving nodes with vector search, you can bring in graph traversal logic using Cypher, Neo4j's graph query language.

Create a retriever that fetches Chunk nodes via vector search and then traverses out to entities up to three hops away:

from neo4j_graphrag.retrievers import VectorCypherRetriever

vc_retriever = VectorCypherRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    retrieval_query="""
//1) Go out 2-3 hops in the entity graph and get relationships
WITH node AS chunk
MATCH (chunk)<-[:FROM_CHUNK]-()-[relList:!FROM_CHUNK]-{1,2}()
UNWIND relList AS rel

//2) collect relationships and text chunks
WITH collect(DISTINCT chunk) AS chunks, 
  collect(DISTINCT rel) AS rels

//3) format and return context
RETURN '=== text ===\n' + apoc.text.join([c in chunks | c.text], '\n---\n') + '\n\n=== kg_rels ===\n' +
  apoc.text.join([r in rels | startNode(r).name + ' - ' + type(r) + '(' + coalesce(r.details, '') + ')' +  ' -> ' + endNode(r).name ], '\n---\n') AS info
"""
)

Run the retriever with the same prompt:

vc_res = vc_retriever.get_search_results(query_text="How is precision medicine applied to Lupus?", top_k=3)

# print output
kg_rel_pos = vc_res.records[0]['info'].find('\n\n=== kg_rels ===\n')
print("# Text Chunk Context:")
print(vc_res.records[0]['info'][:kg_rel_pos])
print("# KG Context From Relationships:")
print(vc_res.records[0]['info'][kg_rel_pos:])
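Text2Cypher, also from the retriever list above, skips vector search entirely and generates Cypher from natural language; a minimal sketch (the schema hint and question are illustrative):

from neo4j_graphrag.retrievers import Text2CypherRetriever

# hand-written schema hint to ground the generated Cypher; contents are illustrative
neo4j_schema = """
Node properties:
Chunk {text: STRING}
Relationship properties:
FROM_CHUNK {}
"""

t2c_retriever = Text2CypherRetriever(
    driver=driver,
    llm=llm,  # reuses the LLM defined earlier
    neo4j_schema=neo4j_schema,
)
t2c_res = t2c_retriever.search(query_text="Which genes are associated with Lupus?")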


3. Instantiate and Run GraphRAG

The GraphRAG Python package makes it easy to instantiate and run GraphRAG pipelines via its GraphRAG class.

Create GraphRAG objects for both the vector retriever and the vector Cypher retriever:

from neo4j_graphrag.llm import OpenAILLM as LLM
from neo4j_graphrag.generation import RagTemplate
from neo4j_graphrag.generation.graphrag import GraphRAG

llm = LLM(model_name="gpt-4o",  model_params={"temperature": 0.0})

rag_template = RagTemplate(template='''Answer the Question using the following Context. Only respond with information mentioned in the Context. Do not inject any speculative information not mentioned. 

# Question:
{query_text}
 
# Context:
{context}

# Answer:
''', expected_inputs=['query_text', 'context'])

v_rag  = GraphRAG(llm=llm, retriever=vector_retriever, prompt_template=rag_template)
vc_rag = GraphRAG(llm=llm, retriever=vc_retriever, prompt_template=rag_template)

Ask a question and compare the different knowledge graph retrieval patterns:

q = "How is precision medicine applied to Lupus? provide in list format."

print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k':5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':5}).answer}")

Ask a more complex question that requires information from multiple text chunks, and compare again:

q = "Can you summarize systemic lupus erythematosus (SLE)? including common effects, biomarkers, and treatments? Provide in detailed list format."

v_rag_result = v_rag.search(q, retriever_config={'top_k': 5}, return_context=True)
vc_rag_result = vc_rag.search(q, retriever_config={'top_k': 5}, return_context=True)

print(f"Vector Response: \n{v_rag_result.answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag_result.answer}")