iTranslated by AI
OR Search Posting Traversal Algorithm and Implementation
Data Structures to be Handled
Consider the following data structure (inverted index). We assume that all arrays are sorted.
{
"word 1": [0, 1, 3, 4, ...], # List of words and document IDs
"word 2": [0, 1, 3, 5, ...], # List of words and document IDs
...
}
What is an Inverted Index?
An inverted index is a data structure used in search engines. It is specifically used to improve the efficiency of text searches. Here are the details:
-
Structure:
- An inverted index is a data structure where each word in the documents is a key, and the value is a list of documents where that word appears.
- Specifically, it records pairs of "words" and "document IDs."
-
Operation:
- While a standard index has a mapping of "document -> words," an inverted index has a mapping of "word -> documents."
- For example, if word A appears in Document 1 and Document 3, and word B appears in Document 2 and Document 3, the inverted index would look like this:
- A: [Document 1, Document 3]
- B: [Document 2, Document 3]
OR Search Algorithm
Consider retrieving IDs that contain any of word 1, word 2, or word 3.
{
"word 1": [0, 1, 2, 3, 4],
"word 2": [1, 3, 5, 6],
"word 3": [2, 3, 4]
}
The expected result is an array containing IDs 0-6, which are included in all the word arrays above.
{
[0, 1, 2, 3, 4, 5, 6]
}
Initialization is the same as an AND search. Compare the values, and if they don't match, add the minimum value to the return value. Here,
0 is added. Then, advance the cursor of the array that held the minimum value by one.
{
"word 1": ["0", 1, 2, 3, 4],
"word 2": ["1", 3, 5, 6],
"word 3": ["2", 3, 4]
}
Once the cursor is advanced, perform the comparison again. Similarly, add 1. This time, advance the two cursors that held the minimum value, word 1 and word 2, forward by one.
{
"word 1": [0, "1", 2, 3, 4],
"word 2": ["1", 3, 5, 6],
"word 3": ["2", 3, 4]
}
Once the cursor is advanced, perform the comparison again. Similarly, add 2. This time, advance the two cursors that held the minimum value, word 1 and word 3, forward by one.
{
"word 1": [0, 1, "2", 3, 4],
"word 2": [1, "3", 5, 6],
"word 3": ["2", 3, 4]
}
Proceed in the same manner. If all cursors match, advance all cursors by one.
{
"word 1": [0, 1, 2, "3", 4],
"word 2": [1, "3", 5, 6],
"word 3": [2, "3", 4]
}
Cursors for word 1 and word 3 have reached the end. Since these two terms no longer need scanning, they are removed. (In practice, a skip process is performed in the next loop iteration.)
{
"word 1": [0, 1, 2, 3, "4"],
"word 2": [1, 3, "5", 6],
"word 3": [2, 3, "4"]
}
Continue in the same manner thereafter.
{
"word 2": [1, 3, "5", 6], # Add 5 to the return value
}
{
"word 2": [1, 3, 5, "6"], # Add 6 to the return value
}
This completes the scan.
Implementing OR Search in PHP
The implementation assumes PHP 8.x. Try OR Search on PHP Playground.
Full Implementation
OR search is a simpler algorithm compared to AND search.
<?php
declare(strict_types=1);
/**
* Function to perform OR search
*
* @param array<string, int[]> $invertedIndex Inverted index
* @param string[] $terms Search terms
* @return int[] Array of search results
*/
function or_search(array $invertedIndex, array $terms): array
{
// Get posting lists for the search terms
$postingLists = get_posting_lists($invertedIndex, $terms);
// Initialize the cursor for each posting list
$cursors = array_fill(0, count($postingLists), 0);
// Execute the scan process
return scan_posting_lists($postingLists, $cursors);
}
/**
* Function to perform the scan process on posting lists
*
* @param int[][] $postingLists Array of posting lists
* @param int[] $cursors Cursors for each posting list
* @return int[] Array of scan results
*/
function scan_posting_lists(array $postingLists, array $cursors): array
{
$result = [];
while (true) {
// Find the minimum document ID and the keys of the terms that have that document ID
//
// ["word 1": [1, 3, 5, 6], "word 2": [2, 3, 4]]
// word 1 => key = 0, key => index = 1
// Minimum ID is 1, keys => [0]
// Minimum ID is 2, keys => [1]
// Minimum ID is 3, keys => [0, 1]
[$minDocumentId, $minKeys] = find_minimum_document_id($postingLists, $cursors);
// End when all pointers go out of range
if ($minDocumentId === PHP_INT_MAX) {
break;
}
// Add the minimum value to the result
$result[] = $minDocumentId;
// Advance the cursors pointing to the term with the minimum document ID
advance_cursors($cursors, $minKeys);
}
return $result;
}
/**
* Function to find the minimum value and its keys
*
* @param int[][] $postingLists Array of posting lists
* @param int[] $cursors Cursors for each posting list
* @return array{int, int[]} Array of the minimum value and its keys
*/
function find_minimum_document_id(array $postingLists, array $cursors): array
{
$minValue = PHP_INT_MAX;
$minKeys = [];
foreach ($postingLists as $i => $postingList) {
// Skip posting lists where scanning has finished
if ($cursors[$i] >= count($postingList)) {
continue;
}
$currentValue = $postingList[$cursors[$i]];
if ($currentValue < $minValue) {
// Initialize keys when the minimum value is updated
$minValue = $currentValue;
$minKeys = [$i];
} elseif ($currentValue === $minValue) {
// Add keys when it matches the minimum value
$minKeys[] = $i;
}
}
return [$minValue, $minKeys];
}
/**
* Function to advance the pointers for the minimum value
*
* @param int[] $cursors Cursors for each posting list
* @param int[] $minKeys Keys having the minimum value
* @return void
*/
function advance_cursors(array &$cursors, array $minKeys): void
{
foreach ($minKeys as $key) {
$cursors[$key]++;
}
}
/**
* Function to retrieve posting lists for the search terms
*
* @param array<string, int[]> $invertedIndex Inverted index
* @param string[] $terms Search terms
* @return int[][]
*/
function get_posting_lists(array $invertedIndex, array $terms): array
{
$postingLists = [];
foreach ($terms as $term) {
if (isset($invertedIndex[$term])) {
$postingLists[] = $invertedIndex[$term];
}
}
return $postingLists;
}
// Test data
$invertedIndex = [
"word 1" => [0, 1, 2, 3, 4],
"word 2" => [1, 3, 5, 6],
"word 3" => [2, 3, 4]
];
// Execute OR search
$terms = ["word 1", "word 2", "word 3"];
$result = or_search($invertedIndex, $terms); // [0, 1, 2, 3, 4, 5, 6]
print_r($result); // Display the results
or_search function
It doesn't do anything particularly difficult, but please note that the English name for the inverted index is "inverted," not "transpose." You can eventually obtain an array of IDs for the documents that contain the terms you want to search for.
/**
* Function to perform OR search
*
* @param array<string, int[]> $invertedIndex Inverted index
* @param string[] $terms Search terms
* @return int[] Array of search results
*/
function or_search(array $invertedIndex, array $terms): array
{
// Get posting lists for the search terms
$postingLists = get_posting_lists($invertedIndex, $terms);
// Initialize the cursor for each posting list
$cursors = array_fill(0, count($postingLists), 0);
// Execute the scan process
return scan_posting_lists($postingLists, $cursors);
}
scan_posting_lists function
This is the main body of the OR search process. Note that in the case of an OR search, the process cannot be completed until the scanning of all posting lists is finished.
/**
* Function to perform the scan process on posting lists
*
* @param int[][] $postingLists Array of posting lists
* @param int[] $cursors Cursors for each posting list
* @return int[] Array of scan results
*/
function scan_posting_lists(array $postingLists, array $cursors): array
{
$result = [];
while (true) {
// Find the minimum document ID and the keys of the terms that have that document ID
//
// ["word 1": [1, 3, 5, 6], "word 2": [2, 3, 4]]
// word 1 => key = 0, key => index = 1
// Minimum ID is 1, keys => [0]
// Minimum ID is 2, keys => [1]
// Minimum ID is 3, keys => [0, 1]
[$minDocumentId, $minKeys] = find_minimum_document_id($postingLists, $cursors);
// End when all pointers go out of range
if ($minDocumentId === PHP_INT_MAX) {
break;
}
// Add the minimum value to the result
$result[] = $minDocumentId;
// Advance the cursors pointing to the term with the minimum document ID
advance_cursors($cursors, $minKeys);
}
return $result;
}
get_posting_lists function
This function retrieves the posting lists corresponding to the terms in the search query from the inverted index.
/**
* Function to retrieve posting lists for the search terms
*
* @param array<string, int[]> $invertedIndex Inverted index
* @param string[] $terms Search terms
* @return int[][]
*/
function get_posting_lists(array $invertedIndex, array $terms): array
{
$postingLists = [];
foreach ($terms as $term) {
if (isset($invertedIndex[$term])) {
$postingLists[] = $invertedIndex[$term];
}
}
return $postingLists;
}
find_minimum_document_id function
This function performs the following processes on posting lists that have not yet completed scanning:
- Recreates the key list if the minimum value is updated.
- Adds to the key list if the value is equal to the minimum value.
/**
* Function to find the minimum value and its keys
*
* @param int[][] $postingLists Array of posting lists
* @param int[] $cursors Cursors for each posting list
* @return array{int, int[]} Array of the minimum value and its keys
*/
function find_minimum_document_id(array $postingLists, array $cursors): array
{
$minValue = PHP_INT_MAX;
$minKeys = [];
foreach ($postingLists as $i => $postingList) {
// Skip posting lists where scanning has finished
if ($cursors[$i] >= count($postingList)) {
continue;
}
$currentValue = $postingList[$cursors[$i]];
if ($currentValue < $minValue) {
// Initialize keys when the minimum value is updated
$minValue = $currentValue;
$minKeys = [$i];
} elseif ($currentValue === $minValue) {
// Add keys when it matches the minimum value
$minKeys[] = $i;
}
}
return [$minValue, $minKeys];
}
advance_cursors function
This uses pass-by-reference. Please refer to the official documentation for more about pass-by-reference. The word "Advance" means "to move forward." It's worth remembering as it often appears when creating scanners for programming languages.
/**
* Function to advance the pointers for the minimum value
*
* @param int[] $cursors Cursors for each posting list
* @param int[] $minKeys Keys having the minimum value
* @return void
*/
function advance_cursors(array &$cursors, array $minKeys): void
{
foreach ($minKeys as $key) {
$cursors[$key]++;
}
}
Speed of OR Search vs. AND Search
In an AND search, the process can be terminated as soon as any one of the posting lists reaches its end. On the other hand, in the case of an OR search, the process cannot be terminated until the end of all posting lists is reached.
Additionally, in an AND search, the cursor can be incremented multiple times at once if the posting list meets certain conditions. However, in an OR search, the cursor can only be advanced one at a time.
Therefore, OR search is fundamentally slower than AND search.
Discussion