
Trying out MS GraphRAG

morioka

Trying out MS GraphRAG.

The article below leaves out the most important part: how the index is built.

https://hamaruki.com/graphrag-beginners-guide/
(It turns out I should have looked at the Colab notebook. Even so, it is little more than a Japanese translation of the official Getting Started guide.)

In the end I followed the official Getting Started guide and installed the package from PyPI.

https://microsoft.github.io/graphrag/posts/get_started/

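Building the index now. To start with, I'll call the OpenAI API even though it costs money.

If an endpoint is OpenAI API-compatible, shouldn't it be easy to use other LLMs as well? At least as long as tool choice or function calling isn't used.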
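As a rough sketch of what redirecting the API could look like: the generated settings.yaml (shown further down) has an `api_base` key in both its `llm` and `embeddings.llm` sections. Assuming GraphRAG honors `api_base` for the `openai_chat` and `openai_embedding` types (not verified here), pointing both sections at a compatible server might come down to something like the following. The helper name, the URL, and the model names are placeholders for illustration, not anything provided by GraphRAG.

```python
import copy


def point_at_endpoint(settings: dict, api_base: str, chat_model: str,
                      embedding_model: str) -> dict:
    """Return a copy of the settings with both LLM sections redirected."""
    updated = copy.deepcopy(settings)  # leave the caller's dict untouched
    updated["llm"].update(api_base=api_base, model=chat_model)
    updated["embeddings"]["llm"].update(api_base=api_base, model=embedding_model)
    return updated


# Mirrors only the relevant keys of the generated settings.yaml.
settings = {
    "llm": {"type": "openai_chat", "model": "gpt-4-turbo-preview"},
    "embeddings": {"llm": {"type": "openai_embedding",
                           "model": "text-embedding-3-small"}},
}

# Hypothetical local server and model names, for illustration only.
local = point_at_endpoint(settings, "http://localhost:8000/v1",
                          "my-chat-model", "my-embedding-model")
```

The embedding section is changed together with the chat section on purpose: a compatible server would have to serve an embeddings endpoint as well, and whether a given one does is a separate question.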

morioka@legion:~$ pyenv virtualenv 3.11.8 graphrag
morioka@legion:~$ mkdir graphgrag
morioka@legion:~$ cd graphgrag/
morioka@legion:~/graphgrag$ pyenv local graphrag

Download the source text and prepare for indexing.

(graphrag) morioka@legion:~/graphgrag$ curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  184k  100  184k    0     0   114k      0  0:00:01  0:00:01 --:--:--  114k
(graphrag) morioka@legion:~/graphgrag$ python -m graphrag.index --init --root ./ragtest
Initializing project at ./ragtest
⠋ GraphRAG Indexer (graphrag)
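Note that the `curl` command writes to `./ragtest/input/book.txt`, so the `ragtest/input` directory has to exist beforehand; that step does not appear in the transcript. A minimal Python sketch of the same preparation, using only the standard library. The function name `prepare_input` and its `download` switch are made up for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Same Project Gutenberg file that the curl command fetches.
BOOK_URL = "https://www.gutenberg.org/cache/epub/24022/pg24022.txt"


def prepare_input(root: str, url: str = BOOK_URL, download: bool = True) -> Path:
    """Create <root>/input and optionally fetch the sample text into it."""
    input_dir = Path(root) / "input"
    # Equivalent of `mkdir -p`: no error if the directory already exists.
    input_dir.mkdir(parents=True, exist_ok=True)
    target = input_dir / "book.txt"
    if download:
        # Requires network access.
        urlretrieve(url, str(target))
    return target
```

Calling `prepare_input("./ragtest")` would then leave the project in the state that `python -m graphrag.index --init --root ./ragtest` is run against above.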

Building the index... done.

(graphrag) morioka@legion:~/graphgrag$ python -m graphrag.index --root ./ragtest
🚀 Reading settings from ragtest/settings.yaml
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is
deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_base_text_units
                                   id                                              chunk  ...                        document_ids n_tokens
0    680dd6d2a970a49082fa4f34bf63a34e  The Project Gutenberg eBook of A Christmas Ca...  ...        300
1    95f1f8f5bdbf0bee3a2c6f2f4a4907f6   THE PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL...  ...        300
2    3a450ed2b7fb1e5fce66f92698c13824  1958,\n  1962, 1964, 1966, 1967, 1969, 1971, 1...  ...        300
3    95b143eba145d91eacae7be3e4ebaf0c  .\n  Mr. Fezziwig, a kind-hearted, jovial old ...  ...        300
4    c390f1b92e2888f78b58f6af5b12afa0   debtors.\n  Mrs. Cratchit, wife of Bob Cratch...  ...        300
..                                ...                                                ...  ...                                 ...      ...
226  972bb34ddd371530f06d006480526d3e   harmless from all liability, costs and expens...  ...        300
227  2f918cd94d1825eb5cbdc2a9d3ce094e  \nGutenberg Literary Archive Foundation was cr...  ...        300
228  eec5fc1a2be814473698e220b303dc1b  . Email contact links and up\nto date contact ...  ...        300
229  535f6bed392a62760401b1d4f2aa5e2f   compliance. To SEND\nDONATIONS or determine t...  ...        300
230  9e59af410db84b25757e3bf90e036f39   could be\nfreely shared with anyone. For fort...  ...        155

[231 rows x 5 columns]
⠙ GraphRAG Indexer
🚀 create_base_extracted_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_summarized_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_base_entity_graph
   level                                    clustered_graph
0      0  <graphml xmlns="http://graphml.graphdrawing.or...
1      1  <graphml xmlns="http://graphml.graphdrawing.or...
2      2  <graphml xmlns="http://graphml.graphdrawing.or...
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is
deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is
deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_final_entities
                                   id  ...                              description_embedding
0    b45241d70f0e43fca764df95b2b81f77  ...  [0.025451932102441788, 0.03978753089904785, -0...
1    4119fd06010c494caa07f439b333f4c5  ...  [0.021342700347304344, 0.024907737970352173, -...
2    d3835bf3dda84ead99deadbeac5d0d7d  ...  [0.0048078252002596855, 0.02126268297433853, -...
3    077d2820ae1845bcbb1803379a3d1eae  ...  [0.02204022742807865, -0.007841000333428383, -...
4    3671ea0dd4e84c1a9b02c5ab2c8f4bac  ...  [-0.024365561082959175, -0.006287779193371534,...
..                                ...  ...                                                ...
149  9a6f414210e14841a5b0e661aedc898d  ...  [-0.05559733510017395, -0.012709809467196465, ...
150  db541b7260974db8bac94e953009f60e  ...  [-0.003418784821406007, -0.01189712155610323, ...
151  f2ff8044718648e18acef16dd9a65436  ...  [-0.04298556223511696, 0.008937294594943523, 0...
152  00d785e7d76b47ec81b508e768d40584  ...  [-0.03399103134870529, -0.008763907477259636, ...
153  87915637da3e474c9349bd0ae604bd95  ...  [-0.02068764716386795, 0.0021940593142062426, ...

[154 rows x 8 columns]
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is
deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:72: FutureWarning: errors='ignore' is
deprecated and will raise in a future version. Use to_datetime without passing `errors` and catch exceptions explicitly instead
  datetime_column = pd.to_datetime(column, errors="ignore")
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:72: UserWarning: Could not infer
format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please
specify a format.
  datetime_column = pd.to_datetime(column, errors="ignore")
🚀 create_final_nodes
     level                       title            type  ...                 top_level_node_id  x  y
0        0              "BOB CRATCHIT"        "PERSON"  ...  b45241d70f0e43fca764df95b2b81f77  0  0
1        0            "PETER CRATCHIT"        "PERSON"  ...  4119fd06010c494caa07f439b333f4c5  0  0
2        0              "TIM CRATCHIT"        "PERSON"  ...  d3835bf3dda84ead99deadbeac5d0d7d  0  0
3        0              "MR. FEZZIWIG"        "PERSON"  ...  077d2820ae1845bcbb1803379a3d1eae  0  0
4        0                      "FRED"        "PERSON"  ...  3671ea0dd4e84c1a9b02c5ab2c8f4bac  0  0
..     ...                         ...             ...  ...                               ... .. ..
457      2  "INTERNAL REVENUE SERVICE"  "ORGANIZATION"  ...  9a6f414210e14841a5b0e661aedc898d  0  0
458      2               "MISSISSIPPI"           "GEO"  ...  db541b7260974db8bac94e953009f60e  0  0
459      2        "SALT LAKE CITY, UT"           "GEO"  ...  f2ff8044718648e18acef16dd9a65436  0  0
460      2                       "IRS"  "ORGANIZATION"  ...  00d785e7d76b47ec81b508e768d40584  0  0
461      2           "MICHAEL S. HART"        "PERSON"  ...  87915637da3e474c9349bd0ae604bd95  0  0

[462 rows x 14 columns]
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is
deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is
deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_final_communities
    id         title  ...                                   relationship_ids                                      text_unit_ids
0    6   Community 6  ...  [8f1eba29f39e411188200bf0d14628ec, 7282c73622b...  [0d6bc6e701a0025632e41dc3387c641d,13f70e4c705f...
1    1   Community 1  ...  [59c726a8792d443e84ab052cb7942b4a, 4f2c665decf...  [13f70e4c705fb134466c125b05af3440,3a450ed2b7fb...
2    0   Community 0  ...  [896d2a51e8de47de85ba8ced108c3d53, f97011b2a99...  [3a450ed2b7fb1e5fce66f92698c13824,694fe3a93b65...
3    4   Community 4  ...  [4517768fc4e24bd2a790be0e08a7856e, 586bccefb1e...  [1d5a3ea2bdc7eb02878c9733fae3924b,715dd9466e12...
4    9   Community 9  ...  [9376ce8940e647a99e5e087514b88fa4, 35489ca6a63...
5   11  Community 11  ...  [8169efeea3ce473d9fd2f1c688126a1c, d203efdbfb2...  [a31b7f9a68e95c6a014501ba8710513c,f567abf22400...
6   10  Community 10  ...  [c2d48b75af6a4d7989ccf9eceabd934e, 68e0c60d2e8...
7    5   Community 5  ...  [f9005e5c01b44bb489f7112322fd1162, d9ef0175497...  [320c285a98f252d567b2005902763e5c,ddc8697a7671...
8    7   Community 7  ...  [cf6115e69d6649cc99ef2bd11854ccfb, 496f17c2f74...  [8abfa46c9318e287361dc792381e06e5,9875af54dfa7...
9    2   Community 2  ...  [9ed7e3d187b94ab0a90830b17d66615e, b4c7432f712...  [eab0a98a24212548adc9252b20b29dce, db04de01e9b...
10  12  Community 12  ...  [40450f2c91944a81944621b94f190b49, ed559fb4ebd...
11   3   Community 3  ...  [71a0a8c1beb64da08124205e9a803d98, f84314943be...  [d6d510a8b60a7597b6b907023d156777,eea518e2ef1c...
12   8   Community 8  ...  [5c13c7d61e6c4bfe839f21e7ad3530a7, a621663edba...  [2b16778c9beeb9bbb5c770960d7bd492,7b4ad7c69598...
13  18  Community 18  ...  [59c726a8792d443e84ab052cb7942b4a, 4f2c665decf...  [13f70e4c705fb134466c125b05af3440,3a450ed2b7fb...
14  13  Community 13  ...  [896d2a51e8de47de85ba8ced108c3d53, f97011b2a99...  [3a450ed2b7fb1e5fce66f92698c13824,694fe3a93b65...
15  20  Community 20  ...  [14555b518e954637b83aa762dc03164e, b1f6164116d...  [0e13fd0aca5720eb614104772f20077b,0eb69b9f79f6...
16  17  Community 17  ...  [545edff337344e518f68d1301d745455, d405c3154d0...  [45ac76a7dea29addc4542c64d7eae68f,715dd9466e12...
17  19  Community 19  ...  [b38a636e86984600bb4b57c2e2df9747, 4bc7440b8f4...  [547563001cad1df48dfcd4ee4ecc8ee9,8abfa46c9318...
18  14  Community 14  ...  [222f0ea8a5684123a7045986640ec844, 668cf1fdfd6...  [1d5a3ea2bdc7eb02878c9733fae3924b,3dc28534d844...
19  15  Community 15  ...  [82b0446e7c9d4fc793f7b97f890e9049, 70634e10a5e...  [da8b22fcbea495d042facb17b364be42, 9c7e56ef067...
20  16  Community 16  ...  [5f1fc373a8f34050a5f7dbd8ac852c1b, c725babdb14...  [5cf14e5d111ef8cbfd7d32e30b6cdb8a,7348862d5fd2...
21  23  Community 23  ...  [9ed7e3d187b94ab0a90830b17d66615e, b4c7432f712...  [eab0a98a24212548adc9252b20b29dce, db04de01e9b...
22  21  Community 21  ...  [5b9fa6a959294dc29c8420b2d7d3096f, b84d71ed9c3...  [56649632e1d3a637a756905477c99002,6997e1ff5fab...
23  22  Community 22  ...  [0111777c4e9e4260ab2e5ddea7cbcf58, 785f7f32471...  [0d6bc6e701a0025632e41dc3387c641d,25666ca46011...
24  24  Community 24  ...  [bcfdc48e5f044e1d84c5d217c1992d4b, b232fb0f2ac...  [7ed8b64d3fcf6b96c9c86c53e3fb7ce7, 7ed8b64d3fc...
25  25  Community 25  ...  [896d2a51e8de47de85ba8ced108c3d53, f97011b2a99...  [3a450ed2b7fb1e5fce66f92698c13824,694fe3a93b65...
26  26  Community 26  ...  [c2999bdca08a478b84b10219875b285e, 351abba16e5...  [21acfa3b8dca20f03f4a2d7133013952,3dc28534d844...
27  27  Community 27  ...  [263d07354a1b4336b462024288f9bcd3, 50ea7d3b696...  [320c285a98f252d567b2005902763e5c,45ac76a7dea2...

[28 rows x 6 columns]
🚀 join_text_units_to_entity_ids
                        text_unit_ids                                         entity_ids                                id
0    0d6bc6e701a0025632e41dc3387c641d  [b45241d70f0e43fca764df95b2b81f77, de988724cfd...  0d6bc6e701a0025632e41dc3387c641d
1    13f70e4c705fb134466c125b05af3440  [b45241d70f0e43fca764df95b2b81f77, 3671ea0dd4e...  13f70e4c705fb134466c125b05af3440
2    25666ca46011e54363d13007959f45fb  [b45241d70f0e43fca764df95b2b81f77, de988724cfd...  25666ca46011e54363d13007959f45fb
3    2818d4194a37f4573f7a83b49cd59b21  [b45241d70f0e43fca764df95b2b81f77, de988724cfd...  2818d4194a37f4573f7a83b49cd59b21
4    3a450ed2b7fb1e5fce66f92698c13824  [b45241d70f0e43fca764df95b2b81f77, 4119fd06010...  3a450ed2b7fb1e5fce66f92698c13824
..                                ...                                                ...                               ...
99   9e59af410db84b25757e3bf90e036f39  [eeef6ae5c464400c8755900b4f1ac37a, 422433aa458...  9e59af410db84b25757e3bf90e036f39
100  da3ca9f93aac15c67f6acf3cca2fc229                   da3ca9f93aac15c67f6acf3cca2fc229
101  e8cf7d2eec5c3bcbeefc60d9f15941ed  [eeef6ae5c464400c8755900b4f1ac37a, 1af9faf341e...  e8cf7d2eec5c3bcbeefc60d9f15941ed
102  eec5fc1a2be814473698e220b303dc1b  [422433aa45804c7ebb973b2fafce5da6, 1af9faf341e...  eec5fc1a2be814473698e220b303dc1b
103  b3c35247f91923027d9bd7d476467f4f  [1af9faf341e14a5bbf4ddc9080e8dc0b, 353d91abc68...  b3c35247f91923027d9bd7d476467f4f

[104 rows x 3 columns]
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is
deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is
deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:65: FutureWarning: errors='ignore' is
deprecated and will raise in a future version. Use to_numeric without passing `errors` and catch exceptions explicitly instead
  column_numeric = cast(pd.Series, pd.to_numeric(column, errors="ignore"))
🚀 create_final_relationships
                                              source                      target  weight  ... source_degree target_degree rank
0                                     "BOB CRATCHIT"          "EBENEZER SCROOGE"     1.0  ...             9             4   13
1                                     "BOB CRATCHIT"            "PETER CRATCHIT"     2.0  ...             9             2   11
2                                     "BOB CRATCHIT"              "TIM CRATCHIT"     1.0  ...             9             1   10
3                                     "BOB CRATCHIT"                   "SCROOGE"     6.0  ...             9            69   78
4                                     "BOB CRATCHIT"                  "CORNHILL"     1.0  ...             9             1   10
..                                               ...                         ...     ...  ...           ...           ...  ...
157  "PROJECT GUTENBERG LITERARY ARCHIVE FOUNDATION"           "TRADEMARK OWNER"     1.0  ...             4             1    5
158  "PROJECT GUTENBERG LITERARY ARCHIVE FOUNDATION"                       "IRS"     1.0  ...             4             1    5
159          "GUTENBERG LITERARY ARCHIVE FOUNDATION"  "INTERNAL REVENUE SERVICE"     1.0  ...             4             1    5
160          "GUTENBERG LITERARY ARCHIVE FOUNDATION"               "MISSISSIPPI"     1.0  ...             4             1    5
161          "GUTENBERG LITERARY ARCHIVE FOUNDATION"        "SALT LAKE CITY, UT"     1.0  ...             4             1    5

[162 rows x 10 columns]
🚀 join_text_units_to_relationship_ids
                                  id                                   relationship_ids
0   3a450ed2b7fb1e5fce66f92698c13824  [8f1eba29f39e411188200bf0d14628ec, 7282c73622b...
1   d95d1ec14f9c4293fab4e36bbe5d9fd1  [7282c73622b8408e97289d959faff483, af7a1584dd1...
2   0d6bc6e701a0025632e41dc3387c641d  [af7a1584dd15492cb9a4940e285f57fc, 6090e736374...
3   13f70e4c705fb134466c125b05af3440  [af7a1584dd15492cb9a4940e285f57fc, 4f2c665decf...
4   25666ca46011e54363d13007959f45fb  [af7a1584dd15492cb9a4940e285f57fc, f422035f8b7...
..                               ...                                                ...
91  e8cf7d2eec5c3bcbeefc60d9f15941ed                 [089b9b9841714b8da043777e2cda3767]
92  10bab8e9773ee6dfbb465bfa45794c34                 [38f1e44579d0437dac1203c34678d3c3]
93  2f918cd94d1825eb5cbdc2a9d3ce094e  [1ca24718a96b47f3a8855550506c4b41, 4b8aa4587c7...
94  eec5fc1a2be814473698e220b303dc1b  [f23484b1b45d44c3b7847e1906dddd37, 4920fda0318...
95  b3c35247f91923027d9bd7d476467f4f                 [929f30875e1744b49e7b416eaf5a790c]

[96 rows x 2 columns]
🚀 create_final_community_reports
   community  ...                                    id
0         25  ...  88b49b31-f8d7-4e2d-8d1a-8d6cc623f120
1         26  ...  a8731b73-a194-42c2-acb5-46660cd47971
2         27  ...  d18abd4d-2c8f-4d11-8fb6-4955c3f43565
3         13  ...  c1730928-6a64-4958-b483-b88cbf2fc829
4         14  ...  7eadaa3d-599b-4f27-b8f0-eefc1591c972
5         15  ...  a35bd39e-9acb-4ed9-9c33-f5ecdebd8335
6         16  ...  aed0422f-ee68-45a6-aa66-bba3a0658720
7         17  ...  51e137a4-18fa-42ca-9b52-ec8cd8af44af
8         18  ...  3bbe1bbb-2486-42d7-9184-b8e8af4f4813
9         19  ...  bd33037c-2e6f-4ea1-829d-15f8801f75d5
10        20  ...  3369e9aa-c808-4009-a832-08f3a6cdb8e3
11        21  ...  60331a17-1ecf-48a5-8bfb-8808814d598f
12        23  ...  873fc6bc-1f5b-4a55-be5f-3c1b51d155b2
13        24  ...  c82ad56f-0734-4efb-9036-65c73eaa26e4
14         0  ...  8b08c146-dd6d-4e2a-bc97-3f581b731f22
15         1  ...  2f6d74c3-0d66-4a9b-875d-408ef7934815
16        10  ...  ccd21290-7825-4ea2-8d8b-7392f48e0352
17        11  ...  79fd9cbb-5854-49bc-a779-ffd157f53a6a
18        12  ...  fe0add06-b770-45dd-acc5-f308d3fc6698
19         2  ...  55d9c915-188e-475b-b7ac-79400fbe049b
20         3  ...  d3f034e6-4269-493b-beb0-b035c3e14907
21         4  ...  e2141b76-c398-436e-87da-bde2b892a939
22         5  ...  17b8c50f-9f53-4222-bdf0-87d6363a6029
23         6  ...  c46bcb13-36d2-451c-9961-0bda356d8337
24         7  ...  b8a65762-0c89-4d03-b35a-caf1072311cb
25         8  ...  bda48fd2-d260-42c7-9662-23374c0ccf10
26         9  ...  df636add-aa7d-444f-8af8-38793ecf73eb

[27 rows x 10 columns]
🚀 create_final_text_units
                                   id  ...                                   relationship_ids
0    3a450ed2b7fb1e5fce66f92698c13824  ...  [8f1eba29f39e411188200bf0d14628ec, 7282c73622b...
1    b6a337c6f91c648c7432dc9e9e01b797  ...
2    bcd3d11eb719b981ca5c674cbc9a123e  ...  [35489ca6a63b47d6a8913cf333818bc1, 5d3344f45e6...
3    547563001cad1df48dfcd4ee4ecc8ee9  ...  [5d3344f45e654d2c808481672f2f08dd, 68762e6f0d1...
4    da8b22fcbea495d042facb17b364be42  ...  [5d3344f45e654d2c808481672f2f08dd, 70634e10a5e...
..                                ...  ...                                                ...
226  01e84646075b255eab0a34d872336a89  ...                                               None
227  879b3fc36c9a2427cdb8d5d41b60e11b  ...                                               None
228  28f242c45159426edb8589f5ca3c10e6  ...                                               None
229  f96b5ddf7fae853edbc4d916f66c623f  ...                                               None
230  958e8453c6299cf980b3e6f962240699  ...                                               None

[231 rows x 6 columns]
/home/morioka/.pyenv/versions/graphrag/lib/python3.11/site-packages/datashaper/engine/verbs/convert.py:72: FutureWarning: errors='ignore' is
deprecated and will raise in a future version. Use to_datetime without passing `errors` and catch exceptions explicitly instead
  datetime_column = pd.to_datetime(column, errors="ignore")
🚀 create_base_documents
                                 id  ...     title
0  c305886e4aa2f6efcf64b57762777055  ...  book.txt

[1 rows x 4 columns]
🚀 create_final_documents
                                 id  ...     title
0  c305886e4aa2f6efcf64b57762777055  ...  book.txt

[1 rows x 4 columns]
⠦ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
├── create_final_community_reports
├── create_final_text_units
├── create_base_documents
└── create_final_documents
🚀 All workflows completed successfully.
(graphrag) morioka@legion:~/graphgrag$
(graphrag) morioka@legion:~/graphgrag$
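Each workflow in the log (`create_final_entities`, `create_final_relationships`, and so on) prints a table, and these tables are presumably also written to disk. The log does not say where, so the sketch below makes only one assumption, namely that they are stored as Parquet files somewhere under the project root, and simply searches for them. The helper name `list_artifacts` is hypothetical.

```python
from pathlib import Path


def list_artifacts(root: str) -> list[Path]:
    """Return every .parquet file found anywhere below <root>, sorted by path."""
    return sorted(Path(root).rglob("*.parquet"))


if __name__ == "__main__":
    for path in list_artifacts("./ragtest"):
        print(path)
```

If files turn up, one of them could be loaded with `pandas.read_parquet(path)` to look at the columns that the log abbreviates with `...`.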

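Running a query, part 1. Each query is run twice: once as given in the guide, and once with 日本語で回答して ("answer in Japanese") appended.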

"Here is an example using Global search to ask a high-level question"

(graphrag) morioka@legion:~/graphgrag$ python -m graphrag.query \
--root ./ragtest \
--method global \
"What are the top themes in this story?"


INFO: Reading settings from ragtest/settings.yaml
creating llm client with {'api_key': 'REDACTED,len=51', 'type': "openai_chat", 'model': 'gpt-4-turbo-preview', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response: ### Top Themes in the Story

The story weaves together a rich tapestry of themes, each contributing to the narrative's depth and the characters' journeys. Central to these is the **transformation and redemption** of Ebenezer Scrooge, who evolves from a figure of miserliness to one of generosity. This change is not just a personal journey but a reflection of the broader potential for human growth and redemption through relationships and self-reflection [Data: Reports (6, 18, 26, 27, 25, +more)].

**Family and social connections** emerge as another significant theme, illustrated through the interactions of characters like the Cratchit family and Scrooge's nephew. These relationships highlight the importance of community and the impact of personal bonds on individuals' lives [Data: Reports (6, 19, 20, 24, +more)].

The narrative places a strong emphasis on the **Christmas spirit**, which acts as a catalyst for reflection, joy, and change. This theme underscores the transformative power of the season and its ability to inspire generosity and kindness [Data: Reports (14, 19, 24)].

**Generosity versus greed** is explored through contrasting behaviors and attitudes, particularly through Scrooge's interactions with others. This theme delves into the moral implications of both sets of values, advocating for a life led by kindness and compassion [Data: Reports (6, 14, 18, 20)].

Lastly, the **role of supernatural guidance** in personal growth is highlighted through the visits of the Ghosts of Christmas Past, Present, and Yet to Come. These spectral figures guide Scrooge on a journey of self-discovery, emphasizing the importance of mentorship and positive influence in catalyzing change [Data: Reports (26, 27, 25)].

In summary, the story's themes are deeply interwoven, each playing a crucial role in the narrative's exploration of human nature, the importance of community, and the potential for personal transformation. Through the lens of Christmas and the supernatural, the story offers a timeless message of hope, redemption, and the enduring power of kindness and generosity.
(graphrag) morioka@legion:~/graphgrag$
(graphrag) morioka@legion:~/graphgrag$ python -m graphrag.query --root ./ragtest --method global "What are the top themes in this story? 日本語で回答して"


INFO: Reading settings from ragtest/settings.yaml
creating llm client with {'api_key': 'REDACTED,len=51', 'type': "openai_chat", 'model': 'gpt-4-turbo-preview', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response: この物語における主要なテーマは、希望と変容、家族とコミュニティの絆、贖罪と個人の成長、そして社会的責任と慈善です。これらのテーマは、エベネーザー・スクルージの人生と彼の周りの人々との関係を通じて探求されています。

### 希望と変容
物語の中心は、スクルージの人格と関係性の変化にあります。彼の変化やクラチット家族の経済的困難にもかかわらず見せる強さと希望が、このテーマを際立たせています。スクルージの変化を促す重要な要素として、ジェイコブ・マーリーとの関係や三人の霊の訪問が挙げられます [Data: Reports (25, 6, 20, 26, 27)]

### 家族とコミュニティの絆
家族とコミュニティの絆の重要性が強調されています。クラチット家族の結束力やスクルージの甥との関係が、このテーマを具体化しています。フェジウィグのクリスマスイブの祝賀会は、コミュニティの絆と喜びの価値を強調しており、物語における肯定的な影響力の例です [Data: Reports (6, 19, 20, 4)]

### 贖罪と個人の成長
贖罪と個人の成長のテーマが、スクルージの人生の変化を通じて探求されています。彼の過去、現在、未来を訪れることで、スクルージは自己反省と変化の旅を経験します [Data: Reports (26, 27, 18)]

### 社会的責任と慈善
社会的責任と慈善のテーマが、スクルージと他のキャラクターの間の対話や行動を通じて展開されます。特に貧困層への支援の必要性が強調されています。物語全体を通じて、慈善、家族、コミュニティの絆の重要性が強調されています [Data: Reports (11, 3, 25, 4)]。

これらのテーマは、物語を通じて織り交ぜられ、読者に深い印象を与える要素となっています。スクルージの人生とコミュニティへの影響は、彼の変化が個人だけでなく、広い社会にも良い影響を与えることを示しています [Data: Reports (25)]
(graphrag) morioka@legion:~/graphgrag$

Running a query, part 2.

"Here is an example using Local search to ask a more specific question about a particular character"


(graphrag) morioka@legion:~/graphgrag$ python -m graphrag.query \
--root ./ragtest \
--method local \
"Who is Scrooge, and what are his main relationships?"


INFO: Reading settings from ragtest/settings.yaml
[2024-07-13T06:19:41Z WARN  lance::dataset] No existing dataset at /home/morioka/graphgrag/lancedb/description_embedding.lance, it will be created
creating llm client with {'api_key': 'REDACTED,len=51', 'type': "openai_chat", 'model': 'gpt-4-turbo-preview', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
creating embedding llm client with {'api_key': 'REDACTED,len=51', 'type': "openai_embedding", 'model': 'text-embedding-3-small', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': None, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Local Search Response: ### Who is Scrooge?

Ebenezer Scrooge is a central character in a narrative that explores themes of redemption, compassion, and the transformative power of the Christmas spirit. Initially depicted as a miserly, solitary, and bitter individual, Scrooge is known for his disdain towards Christmas and his lack of empathy towards others. His life is characterized by gloom, isolation, and a meticulous nature that prioritizes wealth accumulation over human connections [Data: Entities (9, 11)].

### Scrooge's Main Relationships

#### Jacob Marley
Jacob Marley, Scrooge's late business partner, plays a pivotal role in initiating Scrooge's transformation. Marley's ghost visits Scrooge to warn him of the dire consequences of continuing his current path of greed and isolation, setting the stage for the visits from the three spirits [Data: Entities (42)].

#### The Three Spirits
The Ghosts of Christmas Past, Present, and Yet to Come are supernatural entities that guide Scrooge through a journey of self-reflection. Each spirit shows Scrooge the impact of his actions on himself and others, leading him to realize the importance of compassion and community. These spirits are instrumental in breaking down Scrooge's walls of indifference, showcasing the consequences of his miserliness, and ultimately guiding him towards redemption [Data: Entities (43, 48, 114)].

#### Bob Cratchit and Family
Bob Cratchit, Scrooge's underpaid and overworked clerk, represents the human face of Scrooge's harshness. Initially, Scrooge's relationship with Cratchit is emblematic of his miserliness. However, as Scrooge's transformation unfolds, he becomes a figure of generosity towards the Cratchit family, symbolizing his broader change from a life of isolation to one of engagement and benevolence. This shift is particularly highlighted by Scrooge's newfound concern for Tiny Tim, Cratchit's ailing son, and his actions to assist the family [Data: Entities (0, 95)].

#### Scrooge's Nephew, Fred
Fred, Scrooge's nephew, embodies the spirit of Christmas that Scrooge initially rejects. Fred's persistent invitations to Scrooge to join the Christmas celebrations and his optimistic outlook on life play a crucial role in highlighting the contrast between Scrooge's miserliness and the potential for joy and familial warmth. Scrooge's eventual acceptance of Fred's invitation symbolizes his reintegration into family and society [Data: Entities (4, 15, 131)].

#### The Broader Community
Scrooge's interactions with the broader community, including his charitable actions and his newfound willingness to engage with others, reflect the essence of his transformation. His change impacts not only his immediate circle but also the wider community, as he becomes a symbol of hope and generosity [Data: Entities (21, 104, 138)].

### Conclusion

Ebenezer Scrooge's journey from a miserly recluse to a generous benefactor is marked by significant relationships that catalyze his transformation. Through encounters with Jacob Marley, the three spirits, Bob Cratchit and his family, and his nephew Fred, Scrooge learns the value of compassion, generosity, and community. These relationships are central to the narrative, illustrating the profound impact of personal change on oneself and society at large.
(graphrag) morioka@legion:~/graphgrag$
(graphrag) morioka@legion:~/graphgrag$ python -m graphrag.query --root ./ragtest --method local "Who is Scrooge, and what are his main relationships? 日本語で回答して"


INFO: Reading settings from ragtest/settings.yaml
creating llm client with {'api_key': 'REDACTED,len=51', 'type': "openai_chat", 'model': 'gpt-4-turbo-preview', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
creating embedding llm client with {'api_key': 'REDACTED,len=51', 'type': "openai_embedding", 'model': 'text-embedding-3-small', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': None, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Local Search Response: スクルージは、チャールズ・ディケンズの『クリスマス・キャロル』に登場する架空のキャラクターであり、物語の主人公です。彼は最初、非常にケチで心の冷たい人物として描かれていますが、物語の進行と共に大きな変化を遂げます。スクルージの人生と変化において重要な役割を果たす主な関係性には、ジェイコブ・マーリー、ボブ・クラチットとその家族、そしてクリスマスの三霊が含まれます。

### ジェイコブ・マーリーとの関係

ジェイコブ・マーリーはスクルージの亡くなったビジネスパートナーで、物語の始まりで幽霊としてスクルージの前に現れます。マーリーの訪問は、スクルージの変化のきっかけとなります。彼はスクルージに対し、自分のように後悔と苦しみの鎖を担ぐ運命を避けるために、生き方を変えるよう警告します[Data: Entities (42)]

### ボブ・クラチットとその家族との関係

ボブ・クラチットはスクルージの事務員であり、彼の家族は物語の中で重要な役割を果たします。スクルージは当初、クラチット家に対して冷たく、ケチな態度を取っていましたが、クリスマスの霊たちの訪問を通じて彼らの苦労と温かさを知り、心を開いていきます。特に、クラチット家の息子であるタイニー・ティムの純粋さと弱さは、スクルージの心に深い影響を与えます[Data: Entities (0), Relationships (117, 119)]

### クリスマスの三霊との関係

クリスマスの過去の霊、現在の霊、そして未来の霊は、スクルージに自分の過去、現在、そして未来を見せることで、彼の人生と行動の影響を理解させます。これらの霊たちの訪問は、スクルージが自分自身と周りの世界に対する彼の態度を根本的に変えるきっかけとなります[Data: Entities (48, 43, 114)]

### まとめ

スクルージの物語は、彼の人生における重要な関係性を通じて語られます。ジェイコブ・マーリーの警告、ボブ・クラチットとその家族との絆、そしてクリスマスの三霊の教えは、スクルージを変えるための重要な要素です。これらの関係性は、スクルージが自己中心的で冷たい心を捨て、愛と慈悲の心を持つようになる過程を描いています。
(graphrag) morioka@legion:~/graphgrag$
moriokamorioka

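The query type (global or local) has to be passed explicitly, but couldn't it be inferred from the question instead?

If cost is not a concern, another option is to generate an answer with each type and pick one of the two.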
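
Both ideas can be sketched roughly as follows, assuming the same `python -m graphrag.query` CLI used above. `guess_method` is a hypothetical keyword heuristic (nothing GraphRAG provides), and its cue words are arbitrary; an LLM-based classifier could take its place. `answer_with_both` just runs the CLI once per method so the two answers can be compared by hand.

```python
import subprocess
import sys

# Hypothetical cue words suggesting a corpus-wide ("global") question.
# These are illustrative assumptions, not anything GraphRAG defines.
GLOBAL_CUES = ("theme", "overall", "summar", "main topics", "across")


def guess_method(question: str) -> str:
    """Crude heuristic: corpus-wide cues -> 'global', otherwise 'local'."""
    lowered = question.lower()
    return "global" if any(cue in lowered for cue in GLOBAL_CUES) else "local"


def build_query_command(question: str, method: str, root: str = "./ragtest") -> list[str]:
    """Build the same CLI invocation that was typed by hand above."""
    if method not in ("global", "local"):
        raise ValueError(f"unknown method: {method}")
    return [sys.executable, "-m", "graphrag.query",
            "--root", root, "--method", method, question]


def answer_with_both(question: str, root: str = "./ragtest") -> dict[str, str]:
    """Run the query once per method and return the raw stdout of each."""
    answers = {}
    for method in ("global", "local"):
        result = subprocess.run(build_query_command(question, method, root),
                                capture_output=True, text=True, check=True)
        answers[method] = result.stdout
    return answers


if __name__ == "__main__":
    question = "Who is Scrooge, and what are his main relationships?"
    method = guess_method(question)
    # Only prints the command; call answer_with_both() to actually spend tokens.
    print(method, build_query_command(question, method))
```

Running both methods doubles the query cost, so the heuristic route is the cheaper one to try first.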

moriokamorioka

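This is what settings.yaml looks like.

Redirecting the API to a different endpoint looks easy to do. The LLM aside, though, the open question is what to do about the embedding model.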

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4-turbo-preview
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional



chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
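
On the embedding question: before pointing `api_base` under `embeddings.llm` at another server, it may be worth checking that the server exposes an OpenAI-style `/embeddings` endpoint at all. The sketch below builds such a request with only the standard library. The base URL, API key, model name, and helper names are assumptions for illustration, and it has not been tested against any particular server.

```python
import json
import urllib.request


def build_embedding_request(api_base: str, api_key: str, model: str,
                            texts: list[str]) -> urllib.request.Request:
    """Build a POST request in the shape of the OpenAI embeddings API."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=api_base.rstrip("/") + "/embeddings",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def embedding_dimension(request: urllib.request.Request, timeout: float = 30.0) -> int:
    """Send the request and return the length of the first embedding vector."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return len(body["data"][0]["embedding"])


if __name__ == "__main__":
    # Hypothetical local server and model name; replace with real values.
    req = build_embedding_request("http://localhost:8000/v1", "dummy-key",
                                  "my-embedding-model", ["Scrooge"])
    # Only prints the target URL; call embedding_dimension(req) to hit the server.
    print(req.full_url)
```

If `embedding_dimension` returns a number, the server at least speaks the embeddings protocol. Whether the vectors work well with GraphRAG is a separate question.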