Packages

  • package root
    Definition Classes
    root
  • package com
    Definition Classes
    root
  • package spotify
    Definition Classes
    com
  • package scio
    Definition Classes
    spotify
  • package extra
    Definition Classes
    scio
  • package sparkey

    Main package for Sparkey side input APIs.

    Main package for Sparkey side input APIs. Import all.

    import com.spotify.scio.extra.sparkey._

    To save an SCollection[(String, String)] to a Sparkey fileset:

    val s = sc.parallelize(Seq("a" -> "one", "b" -> "two"))
    s.saveAsSparkey("gs://<bucket>/<path>/<sparkey-prefix>")
    
    // with multiple shards, sharded by MurmurHash3 of the key
    s.saveAsSparkey("gs://<bucket>/<path>/<sparkey-dir>", numShards=2)

    A previously-saved sparkey can be loaded as a side input:

    sc.sparkeySideInput("gs://<bucket>/<path>/<sparkey-prefix>")

    A sharded collection of Sparkey files can also be used as a side input by specifying a glob path:

    sc.sparkeySideInput("gs://<bucket>/<path>/<sparkey-dir>/part-*")

    When the sparkey is needed only temporarily, the save step can be elided:

    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput

    SparkeyReader can be used like a lookup table in a side input operation:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput
    
    main.withSideInputs(side)
      .map { (x, s) =>
        s(side).getOrElse(x, "unknown")
      }

    A SparkeyMap can store any types of keys and values, but can only be used as a SideInput:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val side: SideInput[SparkeyMap[String, Int]] = sc
      .parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
      .asLargeMapSideInput()
    
    val objects: SCollection[MyObject] = main
      .withSideInputs(side)
      .map { (x, s) => s(side).get(x) }
      .toSCollection

    To read a static Sparkey collection and use it as a typed SideInput, use TypedSparkeyReader. TypedSparkeyReader can also accept a Caffeine cache to reduce IO and deserialization load:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val cache: Cache[String, MyObject] = ...
    val side: SideInput[TypedSparkeyReader[MyObject]] = sc
      .typedSparkeySideInput("gs://<bucket>/<path>/<sparkey-prefix>", MyObject.decode, cache)
    
    val objects: SCollection[MyObject] = main
      .withSideInputs(side)
      .map { (x, s) => s(side).get(x) }
      .toSCollection
    Definition Classes
    extra
  • package instances
    Definition Classes
    sparkey
  • CachedStringSparkeyReader
  • MockByteArrayEntry
  • MockByteArraySparkeyReader
  • MockEntry
  • MockSparkeyReader
  • MockStringEntry
  • MockStringSparkeyReader
  • ShardedSparkeyReader
  • SparkeyMap
  • SparkeyMapBase
  • SparkeyReaderInstances
  • SparkeySet
  • SparkeySetBase
  • StringSparkeyReader
  • TypedSparkeyReader

package instances

Type Members

  1. class CachedStringSparkeyReader extends StringSparkeyReader

    A wrapper around SparkeyReader that includes an in-memory Caffeine cache.

  2. case class MockByteArrayEntry(k: Array[Byte], v: Array[Byte]) extends MockEntry[Array[Byte], Array[Byte]] with Product with Serializable
  3. case class MockByteArraySparkeyReader(data: Map[Array[Byte], Array[Byte]]) extends MockSparkeyReader with Product with Serializable
  4. trait MockEntry[K, V] extends Entry with Serializable
  5. trait MockSparkeyReader extends SparkeyReader with Serializable
  6. case class MockStringEntry(k: String, v: String) extends MockEntry[String, String] with Product with Serializable
  7. case class MockStringSparkeyReader(data: Map[String, String]) extends MockSparkeyReader with Product with Serializable
  8. class ShardedSparkeyReader extends SparkeyReader

    A wrapper class around SparkeyReader that allows the reading of multiple Sparkey files, sharded by their keys (via MurmurHash3).

    A wrapper class around SparkeyReader that allows the reading of multiple Sparkey files, sharded by their keys (via MurmurHash3). At most 32,768 Sparkey files are supported.

  9. class SparkeyMap[K, V] extends SparkeyMapBase[K, V]

    Enhanced version of SparkeyReader that assumes the underlying Sparkey is encoded with the given Coders, providing a very similar interface to Map[K, V].

  10. trait SparkeyMapBase[K, V] extends Map[K, V]
  11. trait SparkeyReaderInstances extends AnyRef
  12. class SparkeySet[T] extends SparkeySetBase[T]

    Enhanced version of SparkeyReader that assumes the underlying Sparkey is encoded with a given Coder, but contains no values (i.e.: only used as an on-disk HashSet).

  13. trait SparkeySetBase[T] extends Set[T]
  14. class StringSparkeyReader extends SparkeyMapBase[String, String]

    Enhanced version of SparkeyReader that mimics a Map.

  15. class TypedSparkeyReader[T] extends SparkeyMapBase[String, T]

    A wrapper around SparkeyReader that includes both a decoder (to map from each byte array to a JVM type) and an optional in-memory cache.

Ungrouped