com.spotify.scio.values

PairSkewedSCollectionFunctions

class PairSkewedSCollectionFunctions[K, V] extends AnyRef

Extra functions available on SCollections of (key, value) pairs for skewed joins, provided through an implicit conversion.

Linear Supertypes
  AnyRef, Any

Instance Constructors

  1. new PairSkewedSCollectionFunctions(self: SCollection[(K, V)])

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @native() @HotSpotIntrinsicCandidate()
  6. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  7. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  8. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  9. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  10. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  11. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  12. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  13. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  14. val self: SCollection[(K, V)]
  15. def skewedFullOuterJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]]): SCollection[(K, (Option[V], Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.

    Perform a skewed full outer join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. Key frequencies are estimated rather than counted exactly: with probability at least 1 - delta, true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left-hand side stream so far.

    hotKeyThreshold

    Number of values per key at which the key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error (eps * N) in mind.

    cms

    Count-min sketch (CMS) of the left-hand side keys; see com.twitter.algebird.CMSMonoid.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._

      val keyAggregator = CMS.aggregator[K](eps, delta, seed)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedFullOuterJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
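
    The frequency bound described for this join can be made concrete with a little arithmetic. The sketch below is plain Scala and does not call Scio or Algebird; the object CmsErrorBound, its method names, and the input size of 1,000,000 are hypothetical and chosen only for illustration.

```scala
object CmsErrorBound {
  // Worst-case overestimate of a single frequency estimate: eps * N.
  def maxOverestimate(eps: Double, totalCount: Long): Double = eps * totalCount

  // Lowest true frequency a key can have when its estimate equals `estimate`
  // (the bound holds with probability at least 1 - delta).
  def minTrueFrequency(estimate: Long, eps: Double, totalCount: Long): Double =
    math.max(0.0, estimate - maxOverestimate(eps, totalCount))

  def main(args: Array[String]): Unit = {
    val eps = 0.001
    val n = 1000000L // assumed number of left-hand side elements
    println(s"max overestimate: ${maxOverestimate(eps, n)}")
    println(s"min true frequency at estimate 8500: ${minTrueFrequency(8500L, eps, n)}")
  }
}
```

    Under these assumed numbers an estimate can exceed the true count by about 1,000, so a key whose estimate reaches 8500 may have appeared only about 7,500 times. A larger N or eps widens that gap, which is why the threshold is best chosen with the estimation error in mind.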

  16. def skewedFullOuterJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long = 9000, eps: Double = 0.001, seed: Int = 42, delta: Double = 1e-10, sampleFraction: Double = 1.0, withReplacement: Boolean = true)(implicit hasher: CMSHasher[K]): SCollection[(K, (Option[V], Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.

    Perform a skewed full outer join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. Key frequencies are estimated rather than counted exactly: with probability at least 1 - delta, true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left-hand side stream so far.

    hotKeyThreshold

    Number of values per key at which the key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error (eps * N) in mind. If you sample the input via sampleFraction, adjust hotKeyThreshold accordingly.

    eps

    One-sided bound on the error of each point query, i.e. each frequency estimate. Must lie in (0, 1).

    seed

    A seed to initialize the random number generator used to create the pairwise independent hash functions.

    delta

    A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).

    sampleFraction

    Fraction of the left-hand side to sample. The default of 1.0 means no sampling.

    withReplacement

    whether to use sampling with replacement, see SCollection.sample.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._

      val p = logs.skewedFullOuterJoin(logMetadata)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
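
    The advice to adjust hotKeyThreshold when sampling can be sketched as a simple rescaling. This assumes that key frequencies are estimated on the sampled data, so that observed counts shrink roughly in proportion to sampleFraction; the object SampledThreshold and its method are hypothetical helpers, not part of Scio.

```scala
object SampledThreshold {
  // Scale a threshold expressed in full-data counts down to the sampled data.
  // Assumption: counts on the sample are about sampleFraction times the full counts.
  def scaled(fullThreshold: Long, sampleFraction: Double): Long = {
    require(sampleFraction > 0.0 && sampleFraction <= 1.0,
      "sampleFraction must lie in (0, 1]")
    math.max(1L, math.round(fullThreshold * sampleFraction))
  }

  def main(args: Array[String]): Unit = {
    // With a 10% sample, a full-data threshold of 9000 becomes roughly 900.
    println(scaled(9000L, 0.1))
    // Without sampling the threshold is unchanged.
    println(scaled(9000L, 1.0))
  }
}
```

    The rescaled value would then be passed as hotKeyThreshold together with the same sampleFraction. Sampling adds its own variance on top of the eps * N estimation error, so a slightly lower threshold may be the safer choice.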

  17. def skewedJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]]): SCollection[(K, (V, W))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.join.

    Perform a skewed join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. Key frequencies are estimated rather than counted exactly: with probability at least 1 - delta, true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left-hand side stream so far.

    hotKeyThreshold

    Number of values per key at which the key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error (eps * N) in mind.

    cms

    Count-min sketch (CMS) of the left-hand side keys; see com.twitter.algebird.CMSMonoid.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._

      val keyAggregator = CMS.aggregator[K](eps, delta, seed)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.

  18. def skewedJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long = 9000, eps: Double = 0.001, seed: Int = 42, delta: Double = 1e-10, sampleFraction: Double = 1.0, withReplacement: Boolean = true)(implicit hasher: CMSHasher[K]): SCollection[(K, (V, W))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.join.

    Perform a skewed join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. Key frequencies are estimated rather than counted exactly: with probability at least 1 - delta, true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left-hand side stream so far.

    hotKeyThreshold

    Number of values per key at which the key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error (eps * N) in mind. If you sample the input via sampleFraction, adjust hotKeyThreshold accordingly.

    eps

    One-sided bound on the error of each point query, i.e. each frequency estimate. Must lie in (0, 1).

    seed

    A seed to initialize the random number generator used to create the pairwise independent hash functions.

    delta

    A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).

    sampleFraction

    Fraction of the left-hand side to sample. The default of 1.0 means no sampling.

    withReplacement

    whether to use sampling with replacement, see SCollection.sample.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._

      val p = logs.skewedJoin(logMetadata)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
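
    To give a feel for what eps and delta cost in memory, the sketch below applies the textbook Count-Min Sketch sizing rules (width = ceil(e / eps), depth = ceil(ln(1 / delta))) to the default parameter values of this method. It is plain Scala; the object CmsSizing is hypothetical, and that Algebird rounds in exactly this way is an assumption.

```scala
object CmsSizing {
  // Number of counters per row: a smaller eps means a wider table.
  def width(eps: Double): Int = {
    require(eps > 0.0 && eps < 1.0, "eps must lie in (0, 1)")
    math.ceil(math.E / eps).toInt
  }

  // Number of rows (hash functions): a smaller delta means more rows.
  def depth(delta: Double): Int = {
    require(delta > 0.0 && delta < 1.0, "delta must lie in (0, 1)")
    math.ceil(math.log(1.0 / delta)).toInt
  }

  def main(args: Array[String]): Unit = {
    val w = width(0.001)  // default eps
    val d = depth(1e-10)  // default delta
    println(s"width = $w, depth = $d, counters = ${w.toLong * d}")
  }
}
```

    Under these formulas, halving eps roughly doubles the table width, while tightening delta adds rows only logarithmically, so delta is usually the cheaper parameter to tighten.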

  19. def skewedLeftOuterJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]]): SCollection[(K, (V, Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.

    Perform a skewed left outer join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. Key frequencies are estimated rather than counted exactly: with probability at least 1 - delta, true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left-hand side stream so far.

    hotKeyThreshold

    Number of values per key at which the key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error (eps * N) in mind.

    cms

    Count-min sketch (CMS) of the left-hand side keys; see com.twitter.algebird.CMSMonoid.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._

      val keyAggregator = CMS.aggregator[K](eps, delta, seed)
      val hotKeyCMS = logs.keys.aggregate(keyAggregator)
      val p = logs.skewedLeftOuterJoin(logMetadata, hotKeyThreshold = 8500, cms = hotKeyCMS)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.

  20. def skewedLeftOuterJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long = 9000, eps: Double = 0.001, seed: Int = 42, delta: Double = 1e-10, sampleFraction: Double = 1.0, withReplacement: Boolean = true)(implicit hasher: CMSHasher[K]): SCollection[(K, (V, Option[W]))]

    N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.

    Perform a skewed left outer join where some keys on the left-hand side may be hot, i.e. appear more than hotKeyThreshold times. Key frequencies are estimated rather than counted exactly: with probability at least 1 - delta, true frequency <= estimate <= true frequency + eps * N, where N is the total size of the left-hand side stream so far.

    hotKeyThreshold

    Number of values per key at which the key is considered hot. Some runners have an inefficient GroupByKey implementation for groups with more than 10K values, so it is recommended to set hotKeyThreshold below 10K, keeping the upper estimation error (eps * N) in mind. If you sample the input via sampleFraction, adjust hotKeyThreshold accordingly.

    eps

    One-sided bound on the error of each point query, i.e. each frequency estimate. Must lie in (0, 1).

    seed

    A seed to initialize the random number generator used to create the pairwise independent hash functions.

    delta

    A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on eps) around the truth. Must lie in (0, 1).

    sampleFraction

    Fraction of the left-hand side to sample. The default of 1.0 means no sampling.

    withReplacement

    whether to use sampling with replacement, see SCollection.sample.

    Example:
      // Implicits that enable CMS hashing
      import com.twitter.algebird.CMSHasherImplicits._

      val p = logs.skewedLeftOuterJoin(logMetadata)

      Read more about CMS: com.twitter.algebird.CMSMonoid.

    Note

    Make sure to import com.twitter.algebird.CMSHasherImplicits before using this join.
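
    The three join flavors on this page differ only in how unmatched keys show up in the result type. The sketch below mirrors those result shapes on plain Scala Seq values standing in for SCollection; the object JoinShapes and its methods are hypothetical, run in quadratic time, and say nothing about how Scio executes the joins.

```scala
object JoinShapes {
  // Values of `r` that match key `k`, or a single None when nothing matches.
  private def matches[K, W](r: Seq[(K, W)], k: K): Seq[Option[W]] = {
    val ws = r.collect { case (k2, w) if k2 == k => Some(w) }
    if (ws.isEmpty) Seq(None) else ws
  }

  // Shape of skewedJoin: only keys present on both sides survive.
  def inner[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, W))] =
    for { (k, v) <- l; w <- matches(r, k).flatten } yield (k, (v, w))

  // Shape of skewedLeftOuterJoin: every left element survives, right side optional.
  def leftOuter[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    for { (k, v) <- l; w <- matches(r, k) } yield (k, (v, w))

  // Shape of skewedFullOuterJoin: both sides optional, right-only keys are kept too.
  def fullOuter[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (Option[V], Option[W]))] = {
    val leftKeys = l.map(_._1).toSet
    val fromLeft = leftOuter(l, r).map { case (k, (v, w)) => (k, (Some(v): Option[V], w)) }
    val rightOnly = r.collect {
      case (k, w) if !leftKeys(k) => (k, (None: Option[V], Some(w): Option[W]))
    }
    fromLeft ++ rightOnly
  }

  def main(args: Array[String]): Unit = {
    val logs = Seq("a" -> 1, "b" -> 2)
    val meta = Seq("a" -> "x", "c" -> "y")
    println(inner(logs, meta))
    println(leftOuter(logs, meta))
    println(fullOuter(logs, meta))
  }
}
```

    In this toy input, key "b" exists only on the left and key "c" only on the right, so "b" appears in the left outer and full outer results and "c" appears only in the full outer result.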

  21. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  22. def toString(): String
    Definition Classes
    AnyRef → Any
  23. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  24. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
  25. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable]) @Deprecated
    Deprecated
