com.spotify.scio.values

PairSkewedSCollectionFunctions

class PairSkewedSCollectionFunctions[K, V] extends AnyRef

Extra functions available on SCollections of (key, value) pairs for skewed joins through an implicit conversion.

Source
PairSkewedSCollectionFunctions.scala

Instance Constructors

  1. new PairSkewedSCollectionFunctions(self: SCollection[(K, V)])

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @native()
  6. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  7. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  8. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable])
  9. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  10. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  11. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  12. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  13. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  14. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  15. val self: SCollection[(K, V)]
  16. def skewedFullOuterJoin[W](rhs: SCollection[(K, W)], cms: SCollection[TopCMS[K]]): SCollection[(K, (Option[V], Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.

    Performs a skewed full outer join where some keys on the left-hand side may be hot. With probability at least 1 - delta, the estimated frequency of a key is within eps * N of its true frequency:

    true frequency <= estimate <= true frequency + eps * N

    where N is the total size of the left hand side stream so far.

    cms

    A com.twitter.algebird.TopCMS of the left-hand side keys.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      import com.twitter.algebird.TopNCMS
      val keyAggregator = TopNCMS.aggregator[K](eps, delta, seed, count)
      // Build the sketch from the keys of the left-hand side (logs)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedFullOuterJoin(logMetadata, hotKeyCMS)

      Read more about TopCMS: com.twitter.algebird.TopCMS.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
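
The one-sided bound quoted above (true frequency <= estimate <= true frequency + eps * N) is the usual Count-Min Sketch guarantee. The following is a minimal, self-contained sketch of such a structure for String keys. TinyCms and TinyCmsDemo are hypothetical names used only for illustration; this is not Algebird's CMS or TopCMS, and the seeded MurmurHash3 row hashes are a simplification of truly pairwise independent hash functions.

```scala
import scala.util.hashing.MurmurHash3

// Minimal Count-Min Sketch for String keys (illustrative only, not Algebird's CMS).
final class TinyCms(eps: Double, delta: Double, seed: Int) {
  // Standard sizing: width = ceil(e / eps), depth = ceil(ln(1 / delta)).
  val width: Int = math.ceil(math.E / eps).toInt
  val depth: Int = math.ceil(math.log(1.0 / delta)).toInt
  private val table = Array.ofDim[Long](depth, width)
  private var total = 0L

  // One hash function per row, derived from the seed and the row index.
  private def bucket(row: Int, key: String): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(key, seed + row), width)

  // Increment one counter per row for this key.
  def add(key: String): Unit = {
    total += 1
    for (row <- 0 until depth) table(row)(bucket(row, key)) += 1
  }

  // Point query: the minimum over rows. Collisions only add counts,
  // so the estimate never falls below the true frequency.
  def estimate(key: String): Long =
    (0 until depth).map(row => table(row)(bucket(row, key))).min

  // N in the bound: the number of keys added so far.
  def totalCount: Long = total
}

object TinyCmsDemo {
  // Returns (estimate for the hot key, estimate for one rare key, N).
  def run(): (Long, Long, Long) = {
    val cms = new TinyCms(eps = 0.01, delta = 0.01, seed = 42)
    // A skewed key stream: one hot key plus many rare keys.
    val keys = Seq.fill(5000)("hot") ++ (1 to 5000).map(i => s"user-$i")
    keys.foreach(cms.add)
    (cms.estimate("hot"), cms.estimate("user-1"), cms.totalCount)
  }
}
```

Calling TinyCmsDemo.run() returns the estimates for the hot key and for one rare key together with N. Both estimates are at least the true counts (5000 and 1); how far above they land depends on hash collisions and is bounded by eps * N only with probability at least 1 - delta.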

  17. def skewedFullOuterJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]]): SCollection[(K, (Option[V], Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.

    Performs a skewed full outer join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. With probability at least 1 - delta, the estimated frequency of a key is within eps * N of its true frequency:

    true frequency <= estimate <= true frequency + eps * N

    where N is the total size of the left hand side stream so far.

    hotKeyThreshold

    Keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K while keeping the upper estimation error (eps * N) in mind.

    cms

    A com.twitter.algebird.CMS of the left-hand side keys.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      import com.twitter.algebird.CMS
      val keyAggregator = CMS.aggregator[K](eps, delta, seed)
      // Build the sketch from the keys of the left-hand side (logs)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedFullOuterJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
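
The advice to keep the upper estimation error in mind when choosing hotKeyThreshold can be made concrete with a little arithmetic. Because an estimate may exceed the true frequency by up to eps * N, a key can be reported at or above the threshold even though its true frequency is lower. The sketch below assumes that bound holds (it does with probability at least 1 - delta); HotKeyThresholdSketch and its methods are hypothetical helpers, not part of Scio or Algebird.

```scala
object HotKeyThresholdSketch {
  // Worst-case overestimate of a CMS point query: eps * N, rounded up.
  def maxOverestimate(eps: Double, n: Long): Long =
    math.ceil(eps * n).toLong

  // Smallest true frequency whose estimate could still reach the threshold,
  // assuming estimate <= true frequency + eps * N. Clamped at zero.
  def lowestTrueFrequencyFlagged(threshold: Long, eps: Double, n: Long): Long =
    math.max(0L, threshold - maxOverestimate(eps, n))
}
```

With eps = 0.001 and N = 1,000,000, for instance, the overestimate can reach roughly 1,000, so a threshold of 8,500 may flag keys whose true frequency is only about 7,500. Keys whose true frequency is at or above the threshold are never missed, since estimates do not undercount.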

  18. def skewedFullOuterJoin[W](rhs: SCollection[(K, W)], hotKeyMethod: HotKeyMethod = SkewedJoins.DefaultHotKeyMethod, hotKeyFanout: Int = SkewedJoins.DefaultHotKeyFanout, cmsEps: Double = SkewedJoins.DefaultCmsEpsilon, cmsDelta: Double = SkewedJoins.DefaultCmsDelta, cmsSeed: Int = SkewedJoins.DefaultCmsSeed, sampleFraction: Double = SkewedJoins.DefaultSampleFraction, sampleWithReplacement: Boolean = SkewedJoins.DefaultSampleWithReplacement)(implicit hasher: CMSHasher[K]): SCollection[(K, (Option[V], Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.

    Performs a skewed full outer join where some keys on the left-hand side may be hot. With probability at least 1 - delta, the estimated frequency of a key is within eps * N of its true frequency:

    true frequency <= estimate <= true frequency + eps * N

    where N is the total size of the left hand side stream so far.

    hotKeyMethod

    Method used to compute hot-keys from the left side collection.

    hotKeyFanout

    The number of intermediate keys that will be used during the CMS computation.

    cmsEps

    One-sided error bound on the error of each point query, i.e. frequency estimate. Must lie in (0, 1).

    cmsDelta

    A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).

    cmsSeed

    A seed to initialize the random number generator used to create the pairwise independent hash functions.

    sampleFraction

    Fraction of the left-hand side to sample.

    sampleWithReplacement

    Whether to sample with replacement, see SCollection.sample.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      val p = logs.skewedFullOuterJoin(logMetadata)

      Read more about CMS: com.twitter.algebird.CMS.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.

  19. def skewedJoin[W](rhs: SCollection[(K, W)], cms: SCollection[TopCMS[K]]): SCollection[(K, (V, W))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.join.

    Performs a skewed join where some keys on the left-hand side may be hot. With probability at least 1 - delta, the estimated frequency of a key is within eps * N of its true frequency:

    true frequency <= estimate <= true frequency + eps * N

    where N is the total size of the left hand side stream so far.

    cms

    A com.twitter.algebird.TopCMS of the left-hand side keys.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      import com.twitter.algebird.TopNCMS
      val keyAggregator = TopNCMS.aggregator[K](eps, delta, seed, count)
      // Build the sketch from the keys of the left-hand side (logs)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedJoin(logMetadata, hotKeyCMS)

      Read more about TopCMS: com.twitter.algebird.TopCMS.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.

  20. def skewedJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]]): SCollection[(K, (V, W))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.join.

    Performs a skewed join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. With probability at least 1 - delta, the estimated frequency of a key is within eps * N of its true frequency:

    true frequency <= estimate <= true frequency + eps * N

    where N is the total size of the left hand side stream so far.

    hotKeyThreshold

    Keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K while keeping the upper estimation error (eps * N) in mind.

    cms

    A com.twitter.algebird.CMS of the left-hand side keys.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      import com.twitter.algebird.CMS
      val keyAggregator = CMS.aggregator[K](eps, delta, seed)
      // Build the sketch from the keys of the left-hand side (logs)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)

      Read more about CMS: com.twitter.algebird.CMS.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
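
To see why hot keys get special treatment, it can help to model the join with plain collections. The sketch below assumes one common strategy for skew-proof joins: split both sides by a set of hot keys, join the hot part through a lookup table, and join the rest normally. This page does not describe the actual implementation, so treat this as a conceptual model only. SkewedJoinSketch is a hypothetical name, exact counts stand in for CMS estimates, and a key is treated as hot at or above the threshold.

```scala
object SkewedJoinSketch {
  // Joins each left record against a right-side lookup table keyed by K.
  private def lookupJoin[K, V, W](lhs: Seq[(K, V)], rhs: Seq[(K, W)]): Seq[(K, (V, W))] = {
    val table: Map[K, Seq[W]] =
      rhs.groupBy(_._1).map { case (k, kws) => k -> kws.map(_._2) }
    for {
      (k, v) <- lhs
      w <- table.getOrElse(k, Seq.empty[W])
    } yield (k, (v, w))
  }

  // Inner join that handles hot and non-hot ("chill") keys separately.
  def skewedJoin[K, V, W](
      lhs: Seq[(K, V)],
      rhs: Seq[(K, W)],
      hotKeyThreshold: Long
  ): Seq[(K, (V, W))] = {
    // Exact left-side counts stand in for the CMS frequency estimates.
    val counts: Map[K, Int] = lhs.groupBy(_._1).map { case (k, kvs) => k -> kvs.size }
    // Assumption: a key is hot when its count is at or above the threshold.
    val hotKeys: Set[K] = counts.filter { case (_, c) => c >= hotKeyThreshold }.keySet

    val (hotLhs, chillLhs) = lhs.partition { case (k, _) => hotKeys(k) }
    val (hotRhs, chillRhs) = rhs.partition { case (k, _) => hotKeys(k) }

    // In a distributed setting the hot part would avoid grouping by key;
    // in memory both parts use the same lookup join.
    lookupJoin(hotLhs, hotRhs) ++ lookupJoin(chillLhs, chillRhs)
  }
}
```

The result contains the same pairs as an ordinary inner join; only the way the work is split differs, which is the point of the "skew-proof" variants.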

  21. def skewedJoin[W](rhs: SCollection[(K, W)], hotKeyMethod: HotKeyMethod = SkewedJoins.DefaultHotKeyMethod, hotKeyFanout: Int = SkewedJoins.DefaultHotKeyFanout, cmsEps: Double = SkewedJoins.DefaultCmsEpsilon, cmsDelta: Double = SkewedJoins.DefaultCmsDelta, cmsSeed: Int = SkewedJoins.DefaultCmsSeed, sampleFraction: Double = SkewedJoins.DefaultSampleFraction, sampleWithReplacement: Boolean = SkewedJoins.DefaultSampleWithReplacement)(implicit hasher: CMSHasher[K]): SCollection[(K, (V, W))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.join.

    hotKeyMethod

    Method used to compute hot-keys from the left side collection.

    hotKeyFanout

    The number of intermediate keys that will be used during the CMS computation.

    cmsEps

    One-sided error bound on the error of each point query, i.e. frequency estimate. Must lie in (0, 1).

    cmsDelta

    A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).

    cmsSeed

    A seed to initialize the random number generator used to create the pairwise independent hash functions.

    sampleFraction

    Fraction of the left-hand side to sample.

    sampleWithReplacement

    Whether to sample with replacement, see SCollection.sample.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      val p = logs.skewedJoin(logMetadata)

      Read more about CMS: com.twitter.algebird.CMS.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.

  22. def skewedLeftOuterJoin[W](rhs: SCollection[(K, W)], cms: SCollection[TopCMS[K]]): SCollection[(K, (V, Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.

    Performs a skewed left outer join where some keys on the left-hand side may be hot. With probability at least 1 - delta, the estimated frequency of a key is within eps * N of its true frequency:

    true frequency <= estimate <= true frequency + eps * N

    where N is the total size of the left hand side stream so far.

    cms

    A com.twitter.algebird.TopCMS of the left-hand side keys.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      import com.twitter.algebird.TopNCMS
      val keyAggregator = TopNCMS.aggregator[K](eps, delta, seed, count)
      // Build the sketch from the keys of the left-hand side (logs)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedLeftOuterJoin(logMetadata, hotKeyCMS)

      Read more about TopCMS: com.twitter.algebird.TopCMS.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.

  23. def skewedLeftOuterJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]]): SCollection[(K, (V, Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.

    Performs a skewed left outer join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. With probability at least 1 - delta, the estimated frequency of a key is within eps * N of its true frequency:

    true frequency <= estimate <= true frequency + eps * N

    where N is the total size of the left hand side stream so far.

    hotKeyThreshold

    Keys with hotKeyThreshold values are considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K while keeping the upper estimation error (eps * N) in mind.

    cms

    A com.twitter.algebird.CMS of the left-hand side keys.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      import com.twitter.algebird.CMS
      val keyAggregator = CMS.aggregator[K](eps, delta, seed)
      // Build the sketch from the keys of the left-hand side (logs)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedLeftOuterJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)

      Read more about CMS: com.twitter.algebird.CMS.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.

  24. def skewedLeftOuterJoin[W](rhs: SCollection[(K, W)], hotKeyMethod: HotKeyMethod = SkewedJoins.DefaultHotKeyMethod, hotKeyFanout: Int = SkewedJoins.DefaultHotKeyFanout, cmsEps: Double = SkewedJoins.DefaultCmsEpsilon, cmsDelta: Double = SkewedJoins.DefaultCmsDelta, cmsSeed: Int = SkewedJoins.DefaultCmsSeed, sampleFraction: Double = SkewedJoins.DefaultSampleFraction, sampleWithReplacement: Boolean = SkewedJoins.DefaultSampleWithReplacement)(implicit hasher: CMSHasher[K]): SCollection[(K, (V, Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.

    Performs a skewed left outer join where some keys on the left-hand side may be hot. With probability at least 1 - delta, the estimated frequency of a key is within eps * N of its true frequency:

    true frequency <= estimate <= true frequency + eps * N

    where N is the total size of the left hand side stream so far.

    hotKeyMethod

    Method used to compute hot-keys from the left side collection.

    hotKeyFanout

    The number of intermediate keys that will be used during the CMS computation.

    cmsEps

    One-sided error bound on the error of each point query, i.e. frequency estimate. Must lie in (0, 1).

    cmsDelta

    A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).

    cmsSeed

    A seed to initialize the random number generator used to create the pairwise independent hash functions.

    sampleFraction

    Fraction of the left-hand side to sample.

    sampleWithReplacement

    Whether to sample with replacement, see SCollection.sample.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._
      val p = logs.skewedLeftOuterJoin(logMetadata)

      Read more about CMS: com.twitter.algebird.CMS.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
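
The three join families on this page differ mainly in their result types: (V, W) for skewedJoin, (V, Option[W]) for skewedLeftOuterJoin, and (Option[V], Option[W]) for skewedFullOuterJoin. A minimal sketch with plain Scala collections shows which side becomes optional in each case. JoinShapes and its methods are hypothetical stand-ins written for illustration; they are not Scio APIs and ignore skew entirely.

```scala
object JoinShapes {
  // Inner join: only keys present on both sides survive.
  def join[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- l
      (k2, w) <- r
      if k == k2
    } yield (k, (v, w))

  // Left outer join: every left record survives; the right value is optional.
  def leftOuterJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    l.flatMap { case (k, v) =>
      val ws = r.collect { case (k2, w) if k2 == k => w }
      if (ws.isEmpty) Seq((k, (v, Option.empty[W])))
      else ws.map(w => (k, (v, Option(w))))
    }

  // Full outer join: records from both sides survive; both values are optional.
  def fullOuterJoin[K, V, W](
      l: Seq[(K, V)],
      r: Seq[(K, W)]
  ): Seq[(K, (Option[V], Option[W]))] = {
    val matchedOrLeftOnly =
      leftOuterJoin(l, r).map { case (k, (v, w)) => (k, (Option(v), w)) }
    val leftKeys = l.map(_._1).toSet
    val rightOnly =
      r.collect { case (k, w) if !leftKeys(k) => (k, (Option.empty[V], Option(w))) }
    matchedOrLeftOnly ++ rightOnly
  }
}
```

For a left side of ("a", 1), ("b", 2) and a right side of ("a", "x"), ("c", "y"), the inner join keeps only key "a", the left outer join adds ("b", (2, None)), and the full outer join additionally adds ("c", (None, Some("y"))).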

  25. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  26. def toString(): String
    Definition Classes
    AnyRef → Any
  27. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  28. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  29. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
