Annoy
Scio integrates with Spotify’s Annoy, an approximate nearest neighbors library, via annoy-java and annoy4s.
Write
A keyed SCollection with Int keys and Array[Float] vector values can be saved with asAnnoy:
import com.spotify.scio.values.SCollection
import com.spotify.scio.extra.annoy._
val metric: AnnoyMetric = ???
val numDimensions: Int = ???
val numTrees: Int = ???
val itemVectors: SCollection[(Int, Array[Float])] = ???
itemVectors.asAnnoy("gs://output-path", metric, numDimensions, numTrees)
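An Annoy index is built for a fixed number of dimensions, so it can help to check vector lengths before writing. The following is a minimal sketch in plain Scala, not part of Scio or Annoy: DimensionCheck, mismatched, and sample are hypothetical names introduced only for illustration.

```scala
// Hypothetical helper for illustration; not part of Scio or Annoy.
object DimensionCheck {

  /** Returns the ids of items whose vector length differs from `numDimensions`. */
  def mismatched(items: Seq[(Int, Array[Float])], numDimensions: Int): Seq[Int] =
    items.collect { case (id, vec) if vec.length != numDimensions => id }

  // Small in-memory sample; item 1 has only two components.
  val sample: Seq[(Int, Array[Float])] = Seq(
    0 -> Array(1.0f, 0.0f, 0.0f),
    1 -> Array(0.0f, 1.0f),
    2 -> Array(0.0f, 0.0f, 1.0f)
  )
}
```

Under the same assumption, the predicate vec.length == numDimensions could be applied with a filter on the SCollection upstream of asAnnoy.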
Side Input
An Annoy file can be read directly as a SideInput with annoySideInput:
import com.spotify.scio._
import com.spotify.scio.values.SideInput
import com.spotify.scio.extra.annoy._
val sc: ScioContext = ???
val metric: AnnoyMetric = ???
val numDimensions: Int = ???
val annoySI: SideInput[AnnoyReader] = sc.annoySideInput("gs://input-path", metric, numDimensions)
Alternatively, an SCollection can be converted directly to a SideInput with asAnnoySideInput:
import com.spotify.scio.values.{SCollection, SideInput}
import com.spotify.scio.extra.annoy._
val metric: AnnoyMetric = ???
val numDimensions: Int = ???
val numTrees: Int = ???
val itemVectors: SCollection[(Int, Array[Float])] = ???
val annoySI: SideInput[AnnoyReader] = itemVectors.asAnnoySideInput(metric, numDimensions, numTrees)
An AnnoyReader provides access to item vectors and their nearest neighbors:
import com.spotify.scio.values.{SCollection, SideInput}
import com.spotify.scio.extra.annoy._
val annoySI: SideInput[AnnoyReader] = ???
val elements: SCollection[Int] = ???
elements
  .withSideInputs(annoySI)
  .map { case (element, ctx) =>
    val annoyReader: AnnoyReader = ctx(annoySI)
    val vec: Array[Float] = annoyReader.getItemVector(element)
    element -> annoyReader.getNearest(vec, 1)
  }
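To make the meaning of a nearest-neighbor query concrete, the sketch below computes exact neighbors by brute force over a small in-memory set, ranking by cosine similarity. It is a reference for what an approximate index estimates, not how Annoy or AnnoyReader is implemented. ExactNearest and its members are hypothetical names, and the sketch assumes non-zero vectors of equal length.

```scala
// Hypothetical brute-force reference for illustration; not part of Scio or Annoy.
object ExactNearest {

  private def dot(a: Array[Float], b: Array[Float]): Double =
    a.indices.map(i => a(i).toDouble * b(i)).sum

  /** Cosine similarity; assumes equal lengths and non-zero vectors. */
  def cosine(a: Array[Float], b: Array[Float]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  /** Ids of the `n` items most similar to `query`, most similar first. */
  def nearest(items: Seq[(Int, Array[Float])], query: Array[Float], n: Int): Seq[Int] =
    items.sortBy { case (_, vec) => -cosine(vec, query) }.take(n).map(_._1)

  // Three two-dimensional item vectors keyed by Int id.
  val items: Seq[(Int, Array[Float])] = Seq(
    0 -> Array(1.0f, 0.0f),
    1 -> Array(0.9f, 0.1f),
    2 -> Array(0.0f, 1.0f)
  )
}
```

In this sketch, a query equal to a stored vector ranks that item first, so asking for a single result with an item's own vector returns the item itself. Brute force scans every item on each query, which is the cost an approximate index is designed to avoid.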