class PairLargeHashSCollectionFunctions[K, V] extends AnyRef
Extra functions available on SCollections of (key, value) pairs for hash based joins through an implicit conversion, using the Sparkey-backed LargeMapSideInput for dramatic speed increases over the in-memory versions for datasets >100MB. As long as the RHS fits on disk, these functions are usually much much faster than regular joins and save on shuffling.
Note that these are nearly identical to the functions in PairHashSCollectionFunctions.scala, but we can't reuse the implementations there as SideInput[T] is not covariant over T.
- Alphabetic
- By Inheritance
- PairLargeHashSCollectionFunctions
- AnyRef
- Any
- Hide All
- Show All
- Public
- Protected
Instance Constructors
- new PairLargeHashSCollectionFunctions(self: SCollection[(K, V)])
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def largeHashFullOuterJoin[W](sideInput: SideInput[SparkeyMap[K, Iterable[W]]])(implicit arg0: Coder[W]): SCollection[(K, (Option[V], Option[W]))]
Perform a full outer join with a
SideInput[SparkeyMap[K, Iterable[W]]]
.Perform a full outer join with a
SideInput[SparkeyMap[K, Iterable[W]]]
.val si = pairSCollRight.asLargeMultiMapSideInput val joined1 = pairSColl1Left.hashFullOuterJoin(si) val joined2 = pairSColl2Left.hashFullOuterJoin(si)
Example: - def largeHashFullOuterJoin[W](rhs: SCollection[(K, W)], numShards: Short = SparkeyIO.DefaultSideInputNumShards, compressionType: CompressionType = SparkeyIO.DefaultCompressionType, compressionBlockSize: Int = SparkeyIO.DefaultCompressionBlockSize): SCollection[(K, (Option[V], Option[W]))]
Perform a full outer join by replicating
rhs
to all workers.Perform a full outer join by replicating
rhs
to all workers. The right side must fit on disk. - def largeHashIntersectByKey(sideInput: SideInput[SparkeySet[K]]): SCollection[(K, V)]
Return an SCollection with the pairs from
this
whose keys are in the SideSetrhs
.Return an SCollection with the pairs from
this
whose keys are in the SideSetrhs
.Unlike SCollection.intersection this preserves duplicates in
this
. - def largeHashIntersectByKey(rhs: SCollection[K], numShards: Short = SparkeyIO.DefaultSideInputNumShards, compressionType: CompressionType = SparkeyIO.DefaultCompressionType, compressionBlockSize: Int = SparkeyIO.DefaultCompressionBlockSize): SCollection[(K, V)]
Return an SCollection with the pairs from
this
whose keys are inrhs
givenrhs
is small enough to fit on disk.Return an SCollection with the pairs from
this
whose keys are inrhs
givenrhs
is small enough to fit on disk.Unlike SCollection.intersection this preserves duplicates in
this
. - def largeHashJoin[W](sideInput: SideInput[SparkeyMap[K, Iterable[W]]])(implicit arg0: Coder[W]): SCollection[(K, (V, W))]
Perform an inner join with a MultiMap
SideInput[SparkeyMap[K, Iterable[V]]
Perform an inner join with a MultiMap
SideInput[SparkeyMap[K, Iterable[V]]
The right side must fit on disk. The SideInput can be used reused for multiple joins.
val si = pairSCollRight.asLargeMultiMapSideInput val joined1 = pairSColl1Left.hashJoin(si) val joined2 = pairSColl2Left.hashJoin(si)
Example: - def largeHashJoin[W](rhs: SCollection[(K, W)], numShards: Short = SparkeyIO.DefaultSideInputNumShards, compressionType: CompressionType = SparkeyIO.DefaultCompressionType, compressionBlockSize: Int = SparkeyIO.DefaultCompressionBlockSize): SCollection[(K, (V, W))]
Perform an inner join by replicating
rhs
to all workers.Perform an inner join by replicating
rhs
to all workers. The right side should be <<10x smaller than the left side, and must fit on disk. - def largeHashLeftOuterJoin[W](sideInput: SideInput[SparkeyMap[K, Iterable[W]]])(implicit arg0: Coder[W]): SCollection[(K, (V, Option[W]))]
Perform a left outer join with a MultiMap
SideInput[SparkeyMap[K, Iterable[V]]
Perform a left outer join with a MultiMap
SideInput[SparkeyMap[K, Iterable[V]]
val si = pairSCollRight.asLargeMultiMapSideInput val joined1 = pairSColl1Left.hashLeftOuterJoin(si) val joined2 = pairSColl2Left.hashLeftOuterJoin(si)
Example: - def largeHashLeftOuterJoin[W](rhs: SCollection[(K, W)], numShards: Short = SparkeyIO.DefaultSideInputNumShards, compressionType: CompressionType = SparkeyIO.DefaultCompressionType, compressionBlockSize: Int = SparkeyIO.DefaultCompressionBlockSize): SCollection[(K, (V, Option[W]))]
Perform a left outer join by replicating
rhs
to all workers.Perform a left outer join by replicating
rhs
to all workers. The right side must fit on disk.- rhs
The SCollection[(K, W)] treated as right side of the join.
val si = pairSCollRight val joined = pairSColl1Left.largeHashLeftOuterJoin(pairSCollRight)
Example: - def largeHashSubtractByKey(sideInput: SideInput[SparkeySet[K]]): SCollection[(K, V)]
Return an SCollection with the pairs from
this
whose keys are not in SideInput[Set]rhs
. - def largeHashSubtractByKey(rhs: SCollection[K], numShards: Short = SparkeyIO.DefaultSideInputNumShards, compressionType: CompressionType = SparkeyIO.DefaultCompressionType, compressionBlockSize: Int = SparkeyIO.DefaultCompressionBlockSize): SCollection[(K, V)]
Return an SCollection with the pairs from
this
whose keys are not in SCollection[V]rhs
.Return an SCollection with the pairs from
this
whose keys are not in SCollection[V]rhs
.Rhs must be small enough to fit on disk.
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()