Annoy

Scio integrates with Spotify’s Annoy, an approximate nearest neighbors library, via annoy-java and annoy4s.

Write

A keyed SCollection with Int keys and Array[Float] vector values can be saved with asAnnoy:

import com.spotify.scio.values.SCollection
import com.spotify.scio.extra.annoy._

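val metric: AnnoyMetric = ??? // distance metric, e.g. Angular or Euclidean
val numDimensions: Int = ??? // length of every item vector
val numTrees: Int = ??? // more trees give higher query precision but a larger index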
val itemVectors: SCollection[(Int, Array[Float])] = ???
itemVectors.asAnnoy("gs://output-path", metric, numDimensions, numTrees)
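
Since asAnnoy expects (Int, Array[Float]) pairs, upstream records often need converting first. The following is a minimal sketch in plain Scala, not part of Scio: the Embedding record and the toAnnoyItem helper are hypothetical names, and it assumes that every vector must have exactly numDimensions entries, dropping any that do not.

```scala
object AnnoyItems {
  // Hypothetical input record: an item ID with a vector of doubles.
  final case class Embedding(id: Int, values: Seq[Double])

  // Returns Some((id, vector)) when the vector has exactly `dim` entries,
  // and None otherwise, so that malformed records can be filtered out.
  // Assumption: the index requires all vectors to share one dimension.
  def toAnnoyItem(e: Embedding, dim: Int): Option[(Int, Array[Float])] =
    if (e.values.length == dim) Some(e.id -> e.values.map(_.toFloat).toArray)
    else None
}
```

Applied with flatMap to a plain Seq, for example Seq(Embedding(0, Seq(1.0, 0.0)), Embedding(1, Seq(0.5))) with dim = 2, this keeps only item 0. The same function could be applied to the elements of an SCollection before calling asAnnoy.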

Side Input

An Annoy file can be read directly as a SideInput with annoySideInput:

import com.spotify.scio._
import com.spotify.scio.values.SideInput
import com.spotify.scio.extra.annoy._

val sc: ScioContext = ???

val metric: AnnoyMetric = ???
val numDimensions: Int = ???
val annoySI: SideInput[AnnoyReader] = sc.annoySideInput("gs://input-path", metric, numDimensions)

Alternatively, an SCollection can be converted directly to a SideInput with asAnnoySideInput:

import com.spotify.scio.values.{SCollection, SideInput}
import com.spotify.scio.extra.annoy._

val metric: AnnoyMetric = ???
val numDimensions: Int = ???
val numTrees: Int = ???
val itemVectors: SCollection[(Int, Array[Float])] = ???
val annoySI: SideInput[AnnoyReader] = itemVectors.asAnnoySideInput(metric, numDimensions, numTrees)

An AnnoyReader provides access to item vectors and their nearest neighbors:

import com.spotify.scio.values.{SCollection, SideInput}
import com.spotify.scio.extra.annoy._

val annoySI: SideInput[AnnoyReader] = ???
val elements: SCollection[Int] = ???
elements
  .withSideInputs(annoySI)
  .map { case (element, ctx) =>
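    val annoyReader: AnnoyReader = ctx(annoySI)
    // Look up the vector stored in the index for this item ID
    val vec: Array[Float] = annoyReader.getItemVector(element)
    // Query the index for the single nearest neighbor of that vector
    element -> annoyReader.getNearest(vec, 1)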
  }
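
To make the meaning of "nearest" concrete, the following sketch computes exact nearest neighbors by brute force in plain Scala. It is not Scio or Annoy code. The ExactNearest object, its methods, and the sample data are all hypothetical, and Annoy itself returns approximate results from its tree index rather than scanning every item. The angular distance follows the definition given in the Annoy documentation, sqrt(2 * (1 - cos(u, v))).

```scala
object ExactNearest {
  // Euclidean (L2) distance between two vectors of equal length.
  def euclidean(a: Array[Float], b: Array[Float]): Double =
    math.sqrt(a.zip(b).map { case (x, y) =>
      val d = (x - y).toDouble
      d * d
    }.sum)

  // Angular distance: the Euclidean distance between the normalized vectors,
  // which equals sqrt(2 * (1 - cosine similarity)).
  def angular(a: Array[Float], b: Array[Float]): Double = {
    val dot = a.zip(b).map { case (x, y) => x.toDouble * y }.sum
    val normA = math.sqrt(a.map(x => x.toDouble * x).sum)
    val normB = math.sqrt(b.map(x => x.toDouble * x).sum)
    // Treat a zero vector as orthogonal to everything to avoid dividing by zero.
    val cos = if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
    math.sqrt(math.max(0.0, 2.0 * (1.0 - cos)))
  }

  // IDs of the k items closest to `query`, nearest first, by scanning all items.
  def nearest(
    items: Seq[(Int, Array[Float])],
    query: Array[Float],
    k: Int,
    dist: (Array[Float], Array[Float]) => Double
  ): Seq[Int] =
    items.map { case (id, v) => id -> dist(query, v) }.sortBy(_._2).take(k).map(_._1)

  // Hypothetical sample data: four 2-dimensional item vectors.
  val sampleItems: Seq[(Int, Array[Float])] = Seq(
    0 -> Array(1f, 0f),
    1 -> Array(0f, 1f),
    2 -> Array(2f, 0.1f),
    3 -> Array(5f, 0.1f)
  )
}
```

For the query vector Array(1f, 0f), nearest(sampleItems, query, 2, euclidean) yields Seq(0, 2), while the angular metric yields Seq(0, 3), because item 3 points in almost the same direction despite being farther away. When the query is an item's own vector, as in the side-input example above, the closest result is typically the item itself; one way to skip it is to request k + 1 results and filter out the query ID.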