The `voyager` Python API#

This module provides classes and functions for creating indices of vector data.

A quick example on how to get started:

import numpy as np
from voyager import Index, Space

# Create an empty Index object that can store vectors:
index = Index(Space.Euclidean, num_dimensions=5)
id_a = index.add_item([1, 2, 3, 4, 5])
id_b = index.add_item([6, 7, 8, 9, 10])

print(id_a)  # => 0
print(id_b)  # => 1

# Find the two closest elements:
neighbors, distances = index.query([1, 2, 3, 4, 5], k=2)
print(neighbors)  # => [0, 1]
print(distances)  # => [0.0, 125.0]

# Save the index to disk to reload later (or in Java!)
index.save("output_file.voy")

class voyager.Index(space: Space, num_dimensions: int, M: int = 12, ef_construction: int = 200, random_seed: int = 1, max_elements: int = 1, storage_data_type: StorageDataType = StorageDataType.Float32)#

A nearest-neighbor search index containing vector data (i.e. lists of floating-point values, each list labeled with a single integer ID).

Think of a Voyager Index as a Dict[int, List[float]] (a dictionary mapping integers to lists of floats), where a query() method allows finding the k nearest int keys to a provided List[float] query vector.

Warning

Voyager is an approximate nearest neighbor index package, which means that each call to the query() method may return results that are approximately correct. The metric used to gauge the accuracy of a Voyager query() is called recall. Recall measures the percentage of correct results returned per query().

Various parameters documented below (like M, ef_construction, and ef) may affect the recall of queries. Usually, increasing recall requires either increasing query latency, increasing memory usage, increasing index creation time, or all of the above.

Voyager indices support insertion, lookup, and deletion of vectors.

Calling the Index constructor will return an object of one of three classes:

FloatIndex, which uses 32-bit precision for all data
Float8Index, which uses 8-bit fixed-point precision and requires all vectors to be within the bounds [-1, 1]
E4M3Index, which uses 8-bit floating-point precision and requires all vectors to be within the bounds [-448, 448]

Parameters:

space –
The Space to use to calculate the distance between vectors, Space.Cosine (cosine distance).

The choice of distance to use usually depends on the kind of vector data being inserted into the index. If your vectors are usually compared by measuring the cosine distance between them, use Space.Cosine.
num_dimensions – The number of dimensions present in each vector added to this index. Each vector in the index must have the same number of dimensions.
M – The number of connections between nodes in the tree’s internal data structure. Larger values give better recall, but use more memory. This parameter cannot be changed after the index is instantiated.
ef_construction – The number of vectors to search through when inserting a new vector into the index. Higher values make index construction slower, but give better recall. This parameter cannot be changed after the index is instantiated.
random_seed – The seed (initial value) of the random number generator used when constructing this index. Byte-for-byte identical indices can be created by setting this value to a known value and adding vectors to the index one-at-a-time (to avoid nondeterministic ordering due to multi-threaded insertion).
max_elements – The maximum size of the index at construction time. Indices can be resized (and are automatically resized when add_item() or add_items() is called) so this value is only useful if the exact number of elements that will be added to this index is known in advance.

__contains__(id: int) → bool#

Check to see if a provided vector’s ID is present in this index.

Returns true iff the provided integer ID has a corresponding (non-deleted) vector in this index. Use the in operator to call this method:

1234 in index # => returns True or False

__len__() → int#

Returns the number of non-deleted vectors in this index.

Use the len operator to call this method:

len(index) # => 1234

Note

This value may differ from num_elements if elements have been deleted.

add_item(vector: ndarray[Any, dtype[float32]], id: int | None = None) → int#

Add a vector to this index.

Parameters:

vector –
A 32-bit floating-point NumPy array, with shape (num_dimensions,).

If using the Space.Cosine Space, this vector will be normalized before insertion. If using a StorageDataType other than StorageDataType.Float32, the vector will be converted to the lower-precision data type after normalization.
id – An optional ID to assign to this vector. If not provided, this vector’s ID will automatically be generated based on the number of elements already in this index.

Returns:

The ID that was assigned to this vector (either auto-generated or provided).

Warning

If calling add_item() in a loop, consider batching your calls by using add_items() instead, which will be faster.

add_items(vectors: ndarray[Any, dtype[float32]], ids: List[int] | None = None, num_threads: int = -1) → List[int]#

Add multiple vectors to this index simultaneously.

This method may be faster than calling add_items() multiple times, as passing a batch of vectors helps avoid Python’s Global Interpreter Lock.

Parameters:

vectors –
A 32-bit floating-point NumPy array, with shape (num_vectors, num_dimensions).

If using the Space.Cosine Space, these vectors will be normalized before insertion. If using a StorageDataType other than StorageDataType.Float32, these vectors will be converted to the lower-precision data type after normalization.
id – An optional list of IDs to assign to these vectors. If provided, this list must be identical in length to the first dimension of vectors. If not provided, each vector’s ID will automatically be generated based on the number of elements already in this index.
num_threads – Up to num_threads will be started to perform insertions in parallel. If vectors contains only one query vector, num_threads will have no effect. By default, one thread will be spawned per CPU core.

Returns:

The IDs that were assigned to the provided vectors (either auto-generated or provided), in the same order as the provided vectors.

as_bytes() → bytes#

Returns the contents of this index as a bytes object. The resulting object will contain the same data as if this index was serialized to disk and then read back into memory again.

Warning

This may be extremely large (many gigabytes) if the index is sufficiently large. To save to disk without allocating this entire bytestring, use save().

Note

This method can also be called by passing this object to the bytes(...) built-in function:

index: voyager.Index = ...
serialized_index = bytes(index)

get_distance(a: List[float], b: List[float]) → float#: Get the distance between two provided vectors. The vectors must share the dimensionality of the index.

get_vector(id: int) → ndarray[Any, dtype[float32]]#: Get the vector stored in this index at the provided integer ID. If no such vector exists, a KeyError will be thrown.

Note

This method can also be called by using the subscript operator (i.e. my_index[1234]).

Warning

If using the Cosine Space, this method will return a normalized version of the vector that was originally added to this index.

get_vectors(ids: List[int]) → ndarray[Any, dtype[float32]]#: Get one or more vectors stored in this index at the provided integer IDs. If one or more of the provided IDs cannot be found in the index, a KeyError will be thrown.

Warning

If using the Cosine Space, this method will return normalized versions of the vector that were originally added to this index.

static load(filename: str, space: Space, num_dimensions: int, storage_data_type: StorageDataType = StorageDataType.Float32) → Index#

static load(filename: str) → Index

static load(file_like: BinaryIO, space: Space, num_dimensions: int, storage_data_type: StorageDataType = StorageDataType.Float32) → Index

static load(file_like: BinaryIO) → Index

Load an index from a file on disk, or a Python file-like object.

If provided a string as a first argument, the string is assumed to refer to a file on the local filesystem. Loading of an index from this file will be done in native code, without holding Python’s Global Interpreter Lock (GIL), allowing for performant loading of multiple indices simultaneously.

If provided a file-like object as a first argument, the provided object must have read, seek, tell, and seekable methods, and must return binary data (i.e.: open(..., "rb") or io.BinaryIO, etc.).

The additional arguments for Space, num_dimensions, and StorageDataType allow for loading of index files created with versions of Voyager prior to v1.3.

Warning

Loading an index from a file-like object will not release the GIL. However, chunks of data of up to 100MB in size will be read from the file-like object at once, hopefully reducing the impact of the GIL.

mark_deleted(id: int) → None#

Mark an ID in this index as deleted.

Deleted IDs will not show up in the results of calls to query(), but will still take up space in the index, and will slow down queries.

Note

To delete one or more vectors from a Index without incurring any query-time performance penalty, recreate the index from scratch without the vectors you want to remove:

original: Index = ...
ids_to_remove: Set[int] = ...

recreated = Index(
  original.space,
  original.num_dimensions,
  original.M,
  original.ef_construction,
  max_elements=len(original),
  storage_data_type=original.storage_data_type
)
ordered_ids = list(set(original.ids) - ids_to_remove)
recreated.add_items(original.get_vectors(ordered_ids), ordered_ids)

Note

This method can also be called by using the del operator:

index: voyager.Index = ...
del index[1234]  # deletes the ID 1234

query(vectors: ndarray[Any, dtype[float32]], k: int = 1, num_threads: int = -1, query_ef: int = -1) → Tuple[ndarray[Any, dtype[uint64]], ndarray[Any, dtype[float32]]]#

Query this index to retrieve the k nearest neighbors of the provided vectors.

Parameters:

vectors – A 32-bit floating-point NumPy array, with shape (num_dimensions,) or (num_queries, num_dimensions).
k – The number of neighbors to return.
num_threads – If vectors contains more than one query vector, up to num_threads will be started to perform queries in parallel. If vectors contains only one query vector, num_threads will have no effect. Defaults to using one thread per CPU core.
query_ef – The depth of search to perform for this query. Up to query_ef candidates will be searched through to try to find up the k nearest neighbors per query vector.

Returns:

A tuple of (neighbor_ids, distances). If a single query vector was provided, both neighbor_ids and distances will be of shape (k,).

If multiple query vectors were provided, both neighbor_ids and distances will be of shape (num_queries, k), ordered such that the i-th result corresponds with the i-th query vector.

Examples

Query with a single vector:

neighbor_ids, distances = index.query(np.array([1, 2, 3, 4, 5]), k=10)
neighbor_ids.shape # => (10,)
distances.shape # => (10,)

for i, (neighbor_id, distance) in enumerate(zip(neighbor_ids, distances)):
    print(f"{i}-th closest neighbor is {neighbor_id}, {distance} away")

Query with multiple vectors simultaneously:

query_vectors = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])
all_neighbor_ids, all_distances = index.query(query_vectors, k=10)
all_neighbor_ids.shape # => (2, 10)
all_distances.shape # => (2, 10)

for query_vector, query_neighbor_ids, query_distances in \
        zip(query_vectors, all_neighbor_ids, all_distances):
    print(f"For query vector {query_vector}:")
    for i, (neighbor_ids, distances) in enumerate(query_neighbor_ids, query_distances):
        print(f"    {i}-th closest neighbor is {neighbor_id}, {distance} away")

Warning

If using E4M3 storage with the Cosine Space, some queries may return negative distances due to the reduced floating-point precision of the storage data type. While confusing, these negative distances still result in a correct ordering between results.

resize(new_size: int) → None#

Resize this index, allocating space for up to new_size elements to be stored. This changes the max_elements property and may cause this Index object to use more memory. This is a fairly expensive operation and prevents queries from taking place while the resize is in process.

Note that the size of an index is a soft limit; if a call to add_item() or add_items() would cause num_elements to exceed max_elements, the index will automatically be resized to accommodate the new elements.

Calling resize() once before adding vectors to the index will speed up index creation if the number of elements is known in advance, as subsequent calls to add_items() will not need to resize the index on-the-fly.

save(output_path: str) → None#

save(file_like: BinaryIO) → None

Save this index to the provided file path or file-like object.

If provided a file path, Voyager will release Python’s Global Interpreter Lock (GIL) and will write to the provided file.

If provided a file-like object, Voyager will not release the GIL, but will pass one or more chunks of data (of up to 100MB each) to the provided object for writing.

unmark_deleted(id: int) → None#

Unmark an ID in this index as deleted.

Once unmarked as deleted, an existing ID will show up in the results of calls to query() again.

property M: int#

The number of connections between nodes in the tree’s internal data structure.

Larger values give better recall, but use more memory. This parameter cannot be changed after the index is instantiated.

property ef: int#

The default number of vectors to search through when calling query().

Higher values make queries slower, but give better recall.

Warning

Changing this parameter affects all calls to query() made after this change is made.

This parameter can be overridden on a per-query basis when calling query() by passing the query_ef parameter, allowing finer-grained control over query speed and recall.

property ef_construction: int#: The number of vectors that this index searches through when inserting a new vector into the index. Higher values make index construction slower, but give better recall. This parameter cannot be changed after the index is instantiated.

property ids: LabelSetView#

A set-like object containing the integer IDs stored as ‘keys’ in this index.

Use these indices to iterate over the vectors in this index, or to check for inclusion of a specific integer ID in this index:

index: voyager.Index = ...

1234 in index.ids  # => true, this ID is present in the index
1234 in index  # => also works!

len(index.ids) # => returns the number of non-deleted vectors
len(index) # => also returns the number of valid labels

for _id in index.ids:
    print(_id) # print all labels

property max_elements: int#

The maximum number of elements that can be stored in this index.

If max_elements is much larger than num_elements, this index may use more memory than necessary. (Call resize() to reduce the memory footprint of this index.)

This is a soft limit; if a call to add_item() or add_items() would cause num_elements to exceed max_elements, the index will automatically be resized to accommodate the new elements.

Note that assigning to this property is functionally identical to calling resize().

property num_dimensions: int#: The number of dimensions in each vector stored by this index.

property num_elements: int#

The number of elements (vectors) currently stored in this index.

Note that the number of elements will not decrease if any elements are deleted from the index; those deleted elements simply become invisible.

property space: Space#: Return the Space used to calculate distances between vectors.

property storage_data_type: StorageDataType#: The StorageDataType used to store vectors in this Index.

Enums#

class voyager.Space(value)#

The method used to calculate the distance between vectors.

Euclidean = 0#: Euclidean distance; also known as L2 distance. The square root of the sum of the squared differences between each element of each vector.

InnerProduct = 1#: Inner product distance.

Cosine = 2#: Cosine distance; also known as normalized inner product.

class voyager.StorageDataType(value)#

The data type used to store vectors in memory and on-disk.

The StorageDataType used for an Index directly determines its memory usage, disk space usage, and recall. Both Float8 and E4M3 data types use 8 bits (1 byte) per dimension per vector, reducing memory usage and index size by a factor of 4 compared to Float32.

Float8 = 16#: 8-bit fixed-point decimal values. All values must be within [-1, 1.00787402].

Float32 = 32#: 32-bit floating point (default).

E4M3 = 48#: 8-bit floating point with a range of [-448, 448], from the paper “FP8 Formats for Deep Learning” by Micikevicius et al.

Warning

Using E4M3 with the Cosine Space may cause some queries to return negative distances due to the reduced floating-point precision. While confusing, these negative distances still result in a correct ordering between results.

Utilities#

class voyager.LabelSetView#: A read-only set-like object containing 64-bit integers. Use this object like a regular Python set object, by either iterating through it, or checking for membership with the in operator.

The voyager Python API#

Enums#

Utilities#

The `voyager` Python API#