Packages

  • package root
    Definition Classes
    root
  • package com
    Definition Classes
    root
  • package spotify
    Definition Classes
    com
  • package scio
    Definition Classes
    spotify
  • package extra
    Definition Classes
    scio
  • package sparkey

    Main package for Sparkey side input APIs. Import all:

    import com.spotify.scio.extra.sparkey._

    To save an SCollection[(String, String)] to a Sparkey fileset:

    val s = sc.parallelize(Seq("a" -> "one", "b" -> "two"))
    s.saveAsSparkey("gs://<bucket>/<path>/<sparkey-prefix>")
    
    // with multiple shards, sharded by MurmurHash3 of the key
    s.saveAsSparkey("gs://<bucket>/<path>/<sparkey-dir>", numShards=2)

    A previously-saved sparkey can be loaded as a side input:

    sc.sparkeySideInput("gs://<bucket>/<path>/<sparkey-prefix>")

    A sharded collection of Sparkey files can also be used as a side input by specifying a glob path:

    sc.sparkeySideInput("gs://<bucket>/<path>/<sparkey-dir>/part-*")

    When the sparkey is needed only temporarily, the save step can be elided:

    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput

    SparkeyReader can be used like a lookup table in a side input operation:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput
    
    main.withSideInputs(side)
      .map { (x, s) =>
        s(side).getOrElse(x, "unknown")
      }
      .toSCollection

    A SparkeyMap can store keys and values of any type, but can only be used as a SideInput:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val side: SideInput[SparkeyMap[String, Int]] = sc
      .parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
      .asLargeMapSideInput()
    
    val values: SCollection[Option[Int]] = main
      .withSideInputs(side)
      .map { (x, s) => s(side).get(x) }
      .toSCollection

    To read a static Sparkey collection and use it as a typed SideInput, use TypedSparkeyReader. TypedSparkeyReader can also accept a Caffeine cache to reduce IO and deserialization load:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val cache: Cache[String, MyObject] = ...
    val side: SideInput[TypedSparkeyReader[MyObject]] = sc
      .typedSparkeySideInput("gs://<bucket>/<path>/<sparkey-prefix>", MyObject.decode, cache)
    
    val objects: SCollection[Option[MyObject]] = main
      .withSideInputs(side)
      .map { (x, s) => s(side).get(x) }
      .toSCollection
    Definition Classes
    extra
  • package instances
    Definition Classes
    sparkey
  • InvalidNumShardsException
  • LargeHashSCollectionFunctions
  • PairLargeHashSCollectionFunctions
  • SparkeyIO
  • SparkeyPairSCollection
  • SparkeySCollection
  • SparkeyScioContext
  • SparkeySetSCollection
  • SparkeyUri
  • SparkeyWritable

com.spotify.scio.extra.sparkey

PairLargeHashSCollectionFunctions

class PairLargeHashSCollectionFunctions[K, V] extends AnyRef

Extra functions available on SCollections of (key, value) pairs for hash-based joins through an implicit conversion, using the Sparkey-backed LargeMapSideInput for dramatic speed increases over the in-memory versions for datasets >100MB. As long as the RHS fits on disk, these functions are usually much faster than regular joins and save on shuffling.

Note that these are nearly identical to the functions in PairHashSCollectionFunctions.scala, but we can't reuse the implementations there as SideInput[T] is not covariant over T.
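The join mechanics can be sketched in plain Scala, with no Scio or Sparkey involved: the right-hand side is materialized as a multimap (here an ordinary in-memory Map standing in for the Sparkey file) and probed once per left-hand pair. All names below are illustrative, not part of the API:

```scala
// Illustrative model of inner hash-join semantics (plain Scala, no Scio).
// `right` stands in for the Sparkey-backed side input: a multimap built
// once from the right-hand SCollection and replicated to every worker.
object HashJoinSketch {
  def innerJoin[K, V, W](
    left: Seq[(K, V)],
    right: Map[K, Seq[W]]
  ): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      w      <- right.getOrElse(k, Seq.empty) // left pairs with no match are dropped
    } yield (k, (v, w))

  def main(args: Array[String]): Unit = {
    val left  = Seq("a" -> 1, "b" -> 2, "c" -> 3)
    val right = Map("a" -> Seq("one"), "b" -> Seq("two", "deux"))
    // "c" has no match on the right, so it is dropped.
    println(innerJoin(left, right))
    // List((a,(1,one)), (b,(2,two)), (b,(2,deux)))
  }
}
```

Because the right side is fully materialized on each worker, it must fit on local disk; the left side only streams through, which is why no shuffle is needed.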

Source
PairLargeHashSCollectionFunctions.scala
Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new PairLargeHashSCollectionFunctions(self: SCollection[(K, V)])

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @native()
  6. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  7. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  8. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable])
  9. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  10. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  11. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  12. def largeHashFullOuterJoin[W](sideInput: SideInput[SparkeyMap[K, Iterable[W]]])(implicit arg0: Coder[W]): SCollection[(K, (Option[V], Option[W]))]

    Perform a full outer join with a SideInput[SparkeyMap[K, Iterable[W]]].

    Example:
    1. val si = pairSCollRight.asLargeMultiMapSideInput
      val joined1 = pairSColl1Left.largeHashFullOuterJoin(si)
      val joined2 = pairSColl2Left.largeHashFullOuterJoin(si)
  13. def largeHashFullOuterJoin[W](rhs: SCollection[(K, W)], numShards: Short = SparkeyIO.DefaultSideInputNumShards, compressionType: CompressionType = SparkeyIO.DefaultCompressionType, compressionBlockSize: Int = SparkeyIO.DefaultCompressionBlockSize): SCollection[(K, (Option[V], Option[W]))]

    Perform a full outer join by replicating rhs to all workers. The right side must fit on disk.
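    As a hedged, plain-Scala model of the full-outer-join semantics (illustrative names, no Scio or Sparkey involved):

```scala
// Illustrative model of full-outer-join semantics over in-memory collections.
object FullOuterJoinSketch {
  def fullOuterJoin[K, V, W](
    left: Seq[(K, V)],
    right: Map[K, Seq[W]]
  ): Seq[(K, (Option[V], Option[W]))] = {
    // Every left pair survives, padded with None when unmatched...
    val matched = left.flatMap { case (k, v) =>
      right.get(k) match {
        case Some(ws) => ws.map(w => (k, (Option(v), Option(w))))
        case None     => Seq((k, (Option(v), None)))
      }
    }
    // ...and right-side entries whose key never occurs on the left
    // come out with None on the left.
    val leftKeys = left.map(_._1).toSet
    val rightOnly = for {
      (k, ws) <- right.toSeq if !leftKeys(k)
      w       <- ws
    } yield (k, (None, Option(w)))
    matched ++ rightOnly
  }

  def main(args: Array[String]): Unit =
    println(fullOuterJoin(Seq("a" -> 1, "b" -> 2), Map("b" -> Seq("two"), "c" -> Seq("three"))))
    // List((a,(Some(1),None)), (b,(Some(2),Some(two))), (c,(None,Some(three))))
}
```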

  14. def largeHashIntersectByKey(sideInput: SideInput[SparkeySet[K]]): SCollection[(K, V)]

    Return an SCollection with the pairs from this whose keys are in the given SideInput[SparkeySet[K]].

    Unlike SCollection.intersection this preserves duplicates in this.

  15. def largeHashIntersectByKey(rhs: SCollection[K], numShards: Short = SparkeyIO.DefaultSideInputNumShards, compressionType: CompressionType = SparkeyIO.DefaultCompressionType, compressionBlockSize: Int = SparkeyIO.DefaultCompressionBlockSize): SCollection[(K, V)]

    Return an SCollection with the pairs from this whose keys are in rhs, provided rhs is small enough to fit on disk.

    Unlike SCollection.intersection this preserves duplicates in this.
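    The duplicate-preserving behavior can be modeled in plain Scala (illustrative sketch, no Scio or Sparkey involved):

```scala
// Illustrative model: intersect-by-key keeps every left pair whose key is
// present on the right, preserving duplicate keys in `this` (the left side).
object IntersectByKeySketch {
  def intersectByKey[K, V](left: Seq[(K, V)], rightKeys: Set[K]): Seq[(K, V)] =
    left.filter { case (k, _) => rightKeys(k) }

  def main(args: Array[String]): Unit =
    println(intersectByKey(Seq("a" -> 1, "a" -> 2, "b" -> 3), Set("a")))
    // List((a,1), (a,2)) -- both "a" pairs survive, unlike a set intersection
}
```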

  16. def largeHashJoin[W](sideInput: SideInput[SparkeyMap[K, Iterable[W]]])(implicit arg0: Coder[W]): SCollection[(K, (V, W))]

    Perform an inner join with a MultiMap SideInput[SparkeyMap[K, Iterable[W]]].

    The right side must fit on disk. The SideInput can be reused for multiple joins.

    Example:
    1. val si = pairSCollRight.asLargeMultiMapSideInput
      val joined1 = pairSColl1Left.largeHashJoin(si)
      val joined2 = pairSColl2Left.largeHashJoin(si)
  17. def largeHashJoin[W](rhs: SCollection[(K, W)], numShards: Short = SparkeyIO.DefaultSideInputNumShards, compressionType: CompressionType = SparkeyIO.DefaultCompressionType, compressionBlockSize: Int = SparkeyIO.DefaultCompressionBlockSize): SCollection[(K, (V, W))]

    Perform an inner join by replicating rhs to all workers. The right side should be <<10x smaller than the left side, and must fit on disk.

  18. def largeHashLeftOuterJoin[W](sideInput: SideInput[SparkeyMap[K, Iterable[W]]])(implicit arg0: Coder[W]): SCollection[(K, (V, Option[W]))]

    Perform a left outer join with a MultiMap SideInput[SparkeyMap[K, Iterable[W]]].

    Example:
    1. val si = pairSCollRight.asLargeMultiMapSideInput
      val joined1 = pairSColl1Left.largeHashLeftOuterJoin(si)
      val joined2 = pairSColl2Left.largeHashLeftOuterJoin(si)
  19. def largeHashLeftOuterJoin[W](rhs: SCollection[(K, W)], numShards: Short = SparkeyIO.DefaultSideInputNumShards, compressionType: CompressionType = SparkeyIO.DefaultCompressionType, compressionBlockSize: Int = SparkeyIO.DefaultCompressionBlockSize): SCollection[(K, (V, Option[W]))]

    Perform a left outer join by replicating rhs to all workers. The right side must fit on disk.

    rhs

    The SCollection[(K, W)] treated as right side of the join.

    Example:
    1. val joined = pairSCollLeft.largeHashLeftOuterJoin(pairSCollRight)
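    A hedged plain-Scala model of the left-outer-join semantics (illustrative names, no Scio or Sparkey involved):

```scala
// Illustrative model of left-outer-join semantics over in-memory collections.
object LeftOuterJoinSketch {
  def leftOuterJoin[K, V, W](
    left: Seq[(K, V)],
    right: Map[K, Seq[W]]
  ): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      right.get(k) match {
        case Some(ws) => ws.map(w => (k, (v, Option(w)))) // one row per match
        case None     => Seq((k, (v, None)))              // unmatched left pairs kept
      }
    }

  def main(args: Array[String]): Unit =
    println(leftOuterJoin(Seq("a" -> 1, "c" -> 3), Map("a" -> Seq("one"))))
    // List((a,(1,Some(one))), (c,(3,None)))
}
```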
  20. def largeHashSubtractByKey(sideInput: SideInput[SparkeySet[K]]): SCollection[(K, V)]

    Return an SCollection with the pairs from this whose keys are not in the given SideInput[SparkeySet[K]].

  21. def largeHashSubtractByKey(rhs: SCollection[K], numShards: Short = SparkeyIO.DefaultSideInputNumShards, compressionType: CompressionType = SparkeyIO.DefaultCompressionType, compressionBlockSize: Int = SparkeyIO.DefaultCompressionBlockSize): SCollection[(K, V)]

    Return an SCollection with the pairs from this whose keys are not in SCollection[K] rhs.

    rhs must be small enough to fit on disk.
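    The semantics can be modeled in plain Scala (illustrative sketch, no Scio or Sparkey involved; left-side duplicates pass through unless their key is subtracted):

```scala
// Illustrative model: subtract-by-key drops every left pair whose key is
// present on the right; all other pairs (including duplicates) pass through.
object SubtractByKeySketch {
  def subtractByKey[K, V](left: Seq[(K, V)], rightKeys: Set[K]): Seq[(K, V)] =
    left.filterNot { case (k, _) => rightKeys(k) }

  def main(args: Array[String]): Unit =
    println(subtractByKey(Seq("a" -> 1, "b" -> 2, "b" -> 3), Set("a")))
    // List((b,2), (b,3))
}
```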

  22. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  23. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  24. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  25. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  26. def toString(): String
    Definition Classes
    AnyRef → Any
  27. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  28. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  29. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
