The voyager
Python API#
This module provides classes and functions for creating indices of vector data.
A quick example on how to get started:
import numpy as np
from voyager import Index, Space
# Create an empty Index object that can store vectors:
index = Index(Space.Euclidean, num_dimensions=5)
id_a = index.add_item([1, 2, 3, 4, 5])
id_b = index.add_item([6, 7, 8, 9, 10])
print(id_a) # => 0
print(id_b) # => 1
# Find the two closest elements:
neighbors, distances = index.query([1, 2, 3, 4, 5], k=2)
print(neighbors) # => [0, 1]
print(distances) # => [0.0, 125.0]
# Save the index to disk to reload later (or in Java!)
index.save("output_file.voy")
- class voyager.Index(space: Space, num_dimensions: int, M: int = 12, ef_construction: int = 200, random_seed: int = 1, max_elements: int = 1, storage_data_type: StorageDataType = StorageDataType.Float32)#
A nearest-neighbor search index containing vector data (i.e. lists of floating-point values, each list labeled with a single integer ID).
Think of a Voyager
Index
as aDict[int, List[float]]
(a dictionary mapping integers to lists of floats), where aquery()
method allows finding the k nearestint
keys to a providedList[float]
query vector.Warning
Voyager is an approximate nearest neighbor index package, which means that each call to the
query()
method may return results that are approximately correct. The metric used to gauge the accuracy of a Voyagerquery()
is called recall. Recall measures the percentage of correct results returned perquery()
.Various parameters documented below (like
M
,ef_construction
, andef
) may affect the recall of queries. Usually, increasing recall requires either increasing query latency, increasing memory usage, increasing index creation time, or all of the above.Voyager indices support insertion, lookup, and deletion of vectors.
Calling the
Index
constructor will return an object of one of three classes:FloatIndex
, which uses 32-bit precision for all dataFloat8Index
, which uses 8-bit fixed-point precision and requires all vectors to be within the bounds [-1, 1]E4M3Index
, which uses 8-bit floating-point precision and requires all vectors to be within the bounds [-448, 448]
- Parameters:
space –
The
Space
to use to calculate the distance between vectors,Space.Cosine
(cosine distance).The choice of distance to use usually depends on the kind of vector data being inserted into the index. If your vectors are usually compared by measuring the cosine distance between them, use
Space.Cosine
.num_dimensions – The number of dimensions present in each vector added to this index. Each vector in the index must have the same number of dimensions.
M – The number of connections between nodes in the tree’s internal data structure. Larger values give better recall, but use more memory. This parameter cannot be changed after the index is instantiated.
ef_construction – The number of vectors to search through when inserting a new vector into the index. Higher values make index construction slower, but give better recall. This parameter cannot be changed after the index is instantiated.
random_seed – The seed (initial value) of the random number generator used when constructing this index. Byte-for-byte identical indices can be created by setting this value to a known value and adding vectors to the index one-at-a-time (to avoid nondeterministic ordering due to multi-threaded insertion).
max_elements – The maximum size of the index at construction time. Indices can be resized (and are automatically resized when
add_item()
oradd_items()
is called) so this value is only useful if the exact number of elements that will be added to this index is known in advance.
- __contains__(id: int) bool #
Check to see if a provided vector’s ID is present in this index.
Returns true iff the provided integer ID has a corresponding (non-deleted) vector in this index. Use the
in
operator to call this method:1234 in index # => returns True or False
- __len__() int #
Returns the number of non-deleted vectors in this index.
Use the
len
operator to call this method:len(index) # => 1234
Note
This value may differ from
num_elements
if elements have been deleted.
- add_item(vector: ndarray[Any, dtype[float32]], id: int | None = None) int #
Add a vector to this index.
- Parameters:
vector –
A 32-bit floating-point NumPy array, with shape
(num_dimensions,)
.If using the
Space.Cosine
Space
, this vector will be normalized before insertion. If using aStorageDataType
other thanStorageDataType.Float32
, the vector will be converted to the lower-precision data type after normalization.id – An optional ID to assign to this vector. If not provided, this vector’s ID will automatically be generated based on the number of elements already in this index.
- Returns:
The ID that was assigned to this vector (either auto-generated or provided).
Warning
If calling
add_item()
in a loop, consider batching your calls by usingadd_items()
instead, which will be faster.
- add_items(vectors: ndarray[Any, dtype[float32]], ids: List[int] | None = None, num_threads: int = -1) List[int] #
Add multiple vectors to this index simultaneously.
This method may be faster than calling
add_items()
multiple times, as passing a batch of vectors helps avoid Python’s Global Interpreter Lock.- Parameters:
vectors –
A 32-bit floating-point NumPy array, with shape
(num_vectors, num_dimensions)
.If using the
Space.Cosine
Space
, these vectors will be normalized before insertion. If using aStorageDataType
other thanStorageDataType.Float32
, these vectors will be converted to the lower-precision data type after normalization.id – An optional list of IDs to assign to these vectors. If provided, this list must be identical in length to the first dimension of
vectors
. If not provided, each vector’s ID will automatically be generated based on the number of elements already in this index.num_threads – Up to
num_threads
will be started to perform insertions in parallel. Ifvectors
contains only one query vector,num_threads
will have no effect. By default, one thread will be spawned per CPU core.
- Returns:
The IDs that were assigned to the provided vectors (either auto-generated or provided), in the same order as the provided vectors.
- as_bytes() bytes #
Returns the contents of this index as a
bytes
object. The resulting object will contain the same data as if this index was serialized to disk and then read back into memory again.Warning
This may be extremely large (many gigabytes) if the index is sufficiently large. To save to disk without allocating this entire bytestring, use
save()
.Note
This method can also be called by passing this object to the
bytes(...)
built-in function:index: voyager.Index = ... serialized_index = bytes(index)
- get_distance(a: List[float], b: List[float]) float #
Get the distance between two provided vectors. The vectors must share the dimensionality of the index.
- get_vector(id: int) ndarray[Any, dtype[float32]] #
Get the vector stored in this index at the provided integer ID. If no such vector exists, a
KeyError
will be thrown.Note
This method can also be called by using the subscript operator (i.e.
my_index[1234]
).Warning
If using the
Cosine
Space
, this method will return a normalized version of the vector that was originally added to this index.
- get_vectors(ids: List[int]) ndarray[Any, dtype[float32]] #
Get one or more vectors stored in this index at the provided integer IDs. If one or more of the provided IDs cannot be found in the index, a
KeyError
will be thrown.Warning
If using the
Cosine
Space
, this method will return normalized versions of the vector that were originally added to this index.
- static load(filename: str, space: Space, num_dimensions: int, storage_data_type: StorageDataType = StorageDataType.Float32) Index #
- static load(filename: str) Index
- static load(file_like: BinaryIO, space: Space, num_dimensions: int, storage_data_type: StorageDataType = StorageDataType.Float32) Index
- static load(file_like: BinaryIO) Index
Load an index from a file on disk, or a Python file-like object.
If provided a string as a first argument, the string is assumed to refer to a file on the local filesystem. Loading of an index from this file will be done in native code, without holding Python’s Global Interpreter Lock (GIL), allowing for performant loading of multiple indices simultaneously.
If provided a file-like object as a first argument, the provided object must have
read
,seek
,tell
, andseekable
methods, and must return binary data (i.e.:open(..., "rb")
orio.BinaryIO
, etc.).The additional arguments for
Space
,num_dimensions
, andStorageDataType
allow for loading of index files created with versions of Voyager prior to v1.3.Warning
Loading an index from a file-like object will not release the GIL. However, chunks of data of up to 100MB in size will be read from the file-like object at once, hopefully reducing the impact of the GIL.
- mark_deleted(id: int) None #
Mark an ID in this index as deleted.
Deleted IDs will not show up in the results of calls to
query()
, but will still take up space in the index, and will slow down queries.Note
To delete one or more vectors from a
Index
without incurring any query-time performance penalty, recreate the index from scratch without the vectors you want to remove:original: Index = ... ids_to_remove: Set[int] = ... recreated = Index( original.space, original.num_dimensions, original.M, original.ef_construction, max_elements=len(original), storage_data_type=original.storage_data_type ) ordered_ids = list(set(original.ids) - ids_to_remove) recreated.add_items(original.get_vectors(ordered_ids), ordered_ids)
Note
This method can also be called by using the
del
operator:index: voyager.Index = ... del index[1234] # deletes the ID 1234
- query(vectors: ndarray[Any, dtype[float32]], k: int = 1, num_threads: int = -1, query_ef: int = -1) Tuple[ndarray[Any, dtype[uint64]], ndarray[Any, dtype[float32]]] #
Query this index to retrieve the
k
nearest neighbors of the provided vectors.- Parameters:
vectors – A 32-bit floating-point NumPy array, with shape
(num_dimensions,)
or(num_queries, num_dimensions)
.k – The number of neighbors to return.
num_threads – If
vectors
contains more than one query vector, up tonum_threads
will be started to perform queries in parallel. Ifvectors
contains only one query vector,num_threads
will have no effect. Defaults to using one thread per CPU core.query_ef – The depth of search to perform for this query. Up to
query_ef
candidates will be searched through to try to find up thek
nearest neighbors per query vector.
- Returns:
A tuple of
(neighbor_ids, distances)
. If a single query vector was provided, bothneighbor_ids
anddistances
will be of shape(k,)
.If multiple query vectors were provided, both
neighbor_ids
anddistances
will be of shape(num_queries, k)
, ordered such that thei
-th result corresponds with thei
-th query vector.
Examples
Query with a single vector:
neighbor_ids, distances = index.query(np.array([1, 2, 3, 4, 5]), k=10) neighbor_ids.shape # => (10,) distances.shape # => (10,) for i, (neighbor_id, distance) in enumerate(zip(neighbor_ids, distances)): print(f"{i}-th closest neighbor is {neighbor_id}, {distance} away")
Query with multiple vectors simultaneously:
query_vectors = np.array([ [1, 2, 3, 4, 5], [6, 7, 8, 9, 10] ]) all_neighbor_ids, all_distances = index.query(query_vectors, k=10) all_neighbor_ids.shape # => (2, 10) all_distances.shape # => (2, 10) for query_vector, query_neighbor_ids, query_distances in \ zip(query_vectors, all_neighbor_ids, all_distances): print(f"For query vector {query_vector}:") for i, (neighbor_ids, distances) in enumerate(query_neighbor_ids, query_distances): print(f" {i}-th closest neighbor is {neighbor_id}, {distance} away")
Warning
If using E4M3 storage with the Cosine
Space
, some queries may return negative distances due to the reduced floating-point precision of the storage data type. While confusing, these negative distances still result in a correct ordering between results.
- resize(new_size: int) None #
Resize this index, allocating space for up to
new_size
elements to be stored. This changes themax_elements
property and may cause thisIndex
object to use more memory. This is a fairly expensive operation and prevents queries from taking place while the resize is in process.Note that the size of an index is a soft limit; if a call to
add_item()
oradd_items()
would causenum_elements
to exceedmax_elements
, the index will automatically be resized to accommodate the new elements.Calling
resize()
once before adding vectors to the index will speed up index creation if the number of elements is known in advance, as subsequent calls toadd_items()
will not need to resize the index on-the-fly.
- save(output_path: str) None #
- save(file_like: BinaryIO) None
Save this index to the provided file path or file-like object.
If provided a file path, Voyager will release Python’s Global Interpreter Lock (GIL) and will write to the provided file.
If provided a file-like object, Voyager will not release the GIL, but will pass one or more chunks of data (of up to 100MB each) to the provided object for writing.
- unmark_deleted(id: int) None #
Unmark an ID in this index as deleted.
Once unmarked as deleted, an existing ID will show up in the results of calls to
query()
again.
- property M: int#
The number of connections between nodes in the tree’s internal data structure.
Larger values give better recall, but use more memory. This parameter cannot be changed after the index is instantiated.
- property ef: int#
The default number of vectors to search through when calling
query()
.Higher values make queries slower, but give better recall.
- property ef_construction: int#
The number of vectors that this index searches through when inserting a new vector into the index. Higher values make index construction slower, but give better recall. This parameter cannot be changed after the index is instantiated.
- property ids: LabelSetView#
A set-like object containing the integer IDs stored as ‘keys’ in this index.
Use these indices to iterate over the vectors in this index, or to check for inclusion of a specific integer ID in this index:
index: voyager.Index = ... 1234 in index.ids # => true, this ID is present in the index 1234 in index # => also works! len(index.ids) # => returns the number of non-deleted vectors len(index) # => also returns the number of valid labels for _id in index.ids: print(_id) # print all labels
- property max_elements: int#
The maximum number of elements that can be stored in this index.
If
max_elements
is much larger thannum_elements
, this index may use more memory than necessary. (Callresize()
to reduce the memory footprint of this index.)This is a soft limit; if a call to
add_item()
oradd_items()
would causenum_elements
to exceedmax_elements
, the index will automatically be resized to accommodate the new elements.Note that assigning to this property is functionally identical to calling
resize()
.
- property num_dimensions: int#
The number of dimensions in each vector stored by this index.
- property num_elements: int#
The number of elements (vectors) currently stored in this index.
Note that the number of elements will not decrease if any elements are deleted from the index; those deleted elements simply become invisible.
- property storage_data_type: StorageDataType#
The
StorageDataType
used to store vectors in thisIndex
.
Enums#
- class voyager.Space(value)#
The method used to calculate the distance between vectors.
- Euclidean = 0#
Euclidean distance; also known as L2 distance. The square root of the sum of the squared differences between each element of each vector.
- InnerProduct = 1#
Inner product distance.
- Cosine = 2#
Cosine distance; also known as normalized inner product.
- class voyager.StorageDataType(value)#
The data type used to store vectors in memory and on-disk.
The
StorageDataType
used for anIndex
directly determines its memory usage, disk space usage, and recall. BothFloat8
andE4M3
data types use 8 bits (1 byte) per dimension per vector, reducing memory usage and index size by a factor of 4 compared toFloat32
.- Float8 = 16#
8-bit fixed-point decimal values. All values must be within [-1, 1.00787402].
- Float32 = 32#
32-bit floating point (default).
- E4M3 = 48#
8-bit floating point with a range of [-448, 448], from the paper “FP8 Formats for Deep Learning” by Micikevicius et al.
Warning
Using E4M3 with the Cosine
Space
may cause some queries to return negative distances due to the reduced floating-point precision. While confusing, these negative distances still result in a correct ordering between results.
Utilities#
- class voyager.LabelSetView#
A read-only set-like object containing 64-bit integers. Use this object like a regular Python
set
object, by either iterating through it, or checking for membership with thein
operator.