class PairSkewedSCollectionFunctions[K, V] extends AnyRef
Extra functions available on SCollections of (key, value) pairs for skwed joins through an implicit conversion.
- Alphabetic
- By Inheritance
- PairSkewedSCollectionFunctions
- AnyRef
- Any
- Hide All
- Show All
- Public
- Protected
Instance Constructors
- new PairSkewedSCollectionFunctions(self: SCollection[(K, V)])
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native() @HotSpotIntrinsicCandidate()
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native() @HotSpotIntrinsicCandidate()
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native() @HotSpotIntrinsicCandidate()
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @HotSpotIntrinsicCandidate()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @HotSpotIntrinsicCandidate()
- val self: SCollection[(K, V)]
- def skewedFullOuterJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]]): SCollection[(K, (Option[V], Option[W]))]
N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.
N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.
Perform a skewed full outer join where some keys on the left hand may be hot, i.e.appear more than
hotKeyThreshold
times. Frequency of a key is estimated with1 - delta
probability, and the estimate is withineps * N
of the true frequency.true frequency <= estimate <= true frequency + eps * N
, where N is the total size of the left hand side stream so far.- hotKeyThreshold
key with
hotKeyThreshold
values will be considered hot. Some runners have inefficientGroupByKey
implementation for groups with more than 10K values. Thus it is recommended to sethotKeyThreshold
to below 10K, keep upper estimation error in mind.- cms
left hand side key com.twitter.algebird.CMSMonoid
// Implicits that enabling CMS-hashing import com.twitter.algebird.CMSHasherImplicits._ val keyAggregator = CMS.aggregator[K](eps, delta, seed) val hotKeyCMS = self.keys.aggregate(keyAggregator) val p = logs.skewedJoin(logMetadata, hotKeyThreshold = 8500, cms=hotKeyCMS)
Read more about CMS: com.twitter.algebird.CMSMonoid.
- Note
Make sure to
import com.twitter.algebird.CMSHasherImplicits
before using this join.
Example: - def skewedFullOuterJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long = 9000, eps: Double = 0.001, seed: Int = 42, delta: Double = 1e-10, sampleFraction: Double = 1.0, withReplacement: Boolean = true)(implicit hasher: CMSHasher[K]): SCollection[(K, (Option[V], Option[W]))]
N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.
N to 1 skew-proof flavor of PairSCollectionFunctions.fullOuterJoin.
Perform a skewed full join where some keys on the left hand may be hot, i.e. appear more than
hotKeyThreshold
times. Frequency of a key is estimated with1 - delta
probability, and the estimate is withineps * N
of the true frequency.true frequency <= estimate <= true frequency + eps * N
, where N is the total size of the left hand side stream so far.- hotKeyThreshold
key with
hotKeyThreshold
values will be considered hot. Some runners have inefficientGroupByKey
implementation for groups with more than 10K values. Thus it is recommended to sethotKeyThreshold
to below 10K, keep upper estimation error in mind. If you sample input viasampleFraction
make sure to adjusthotKeyThreshold
accordingly.- eps
One-sided error bound on the error of each point query, i.e. frequency estimate. Must lie in
(0, 1)
.- seed
A seed to initialize the random number generator used to create the pairwise independent hash functions.
- delta
A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on
eps
) around the truth. Must lie in(0, 1)
.- sampleFraction
left side sample fraction. Default is
1.0
- no sampling.- withReplacement
whether to use sampling with replacement, see SCollection.sample.
// Implicits that enabling CMS-hashing import com.twitter.algebird.CMSHasherImplicits._ val p = logs.skewedLeftJoin(logMetadata)
Read more about CMS: com.twitter.algebird.CMSMonoid.
- Note
Make sure to
import com.twitter.algebird.CMSHasherImplicits
before using this join.
Example: - def skewedJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]]): SCollection[(K, (V, W))]
N to 1 skew-proof flavor of PairSCollectionFunctions.join.
N to 1 skew-proof flavor of PairSCollectionFunctions.join.
Perform a skewed join where some keys on the left hand may be hot, i.e. appear more than
hotKeyThreshold
times. Frequency of a key is estimated with1 - delta
probability, and the estimate is withineps * N
of the true frequency.true frequency <= estimate <= true frequency + eps * N
, where N is the total size of the left hand side stream so far.- hotKeyThreshold
key with
hotKeyThreshold
values will be considered hot. Some runners have inefficientGroupByKey
implementation for groups with more than 10K values. Thus it is recommended to sethotKeyThreshold
to below 10K, keep upper estimation error in mind.- cms
left hand side key com.twitter.algebird.CMSMonoid
// Implicits that enabling CMS-hashing import com.twitter.algebird.CMSHasherImplicits._ val keyAggregator = CMS.aggregator[K](eps, delta, seed) val hotKeyCMS = self.keys.aggregate(keyAggregator) val p = logs.skewedJoin(logMetadata, hotKeyThreshold = 8500, cms=hotKeyCMS)
Read more about CMS: com.twitter.algebird.CMSMonoid.
- Note
Make sure to
import com.twitter.algebird.CMSHasherImplicits
before using this join.
Example: - def skewedJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long = 9000, eps: Double = 0.001, seed: Int = 42, delta: Double = 1e-10, sampleFraction: Double = 1.0, withReplacement: Boolean = true)(implicit hasher: CMSHasher[K]): SCollection[(K, (V, W))]
N to 1 skew-proof flavor of PairSCollectionFunctions.join.
N to 1 skew-proof flavor of PairSCollectionFunctions.join.
Perform a skewed join where some keys on the left hand may be hot, i.e. appear more than
hotKeyThreshold
times. Frequency of a key is estimated with1 - delta
probability, and the estimate is withineps * N
of the true frequency.true frequency <= estimate <= true frequency + eps * N
, where N is the total size of the left hand side stream so far.- hotKeyThreshold
key with
hotKeyThreshold
values will be considered hot. Some runners have inefficientGroupByKey
implementation for groups with more than 10K values. Thus it is recommended to sethotKeyThreshold
to below 10K, keep upper estimation error in mind. If you sample input viasampleFraction
make sure to adjusthotKeyThreshold
accordingly.- eps
One-sided error bound on the error of each point query, i.e. frequency estimate. Must lie in
(0, 1)
.- seed
A seed to initialize the random number generator used to create the pairwise independent hash functions.
- delta
A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on
eps
) around the truth. Must lie in(0, 1)
.- sampleFraction
left side sample fraction. Default is
1.0
- no sampling.- withReplacement
whether to use sampling with replacement, see SCollection.sample.
// Implicits that enabling CMS-hashing import com.twitter.algebird.CMSHasherImplicits._ val p = logs.skewedJoin(logMetadata)
Read more about CMS: com.twitter.algebird.CMSMonoid.
- Note
Make sure to
import com.twitter.algebird.CMSHasherImplicits
before using this join.
Example: - def skewedLeftOuterJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long, cms: SCollection[CMS[K]]): SCollection[(K, (V, Option[W]))]
N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.
N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.
Perform a skewed left join where some keys on the left hand may be hot, i.e. appear more than
hotKeyThreshold
times. Frequency of a key is estimated with1 - delta
probability, and the estimate is withineps * N
of the true frequency.true frequency <= estimate <= true frequency + eps * N
, where N is the total size of the left hand side stream so far.- hotKeyThreshold
key with
hotKeyThreshold
values will be considered hot. Some runners have inefficientGroupByKey
implementation for groups with more than 10K values. Thus it is recommended to sethotKeyThreshold
to below 10K, keep upper estimation error in mind.- cms
left hand side key com.twitter.algebird.CMSMonoid
// Implicits that enabling CMS-hashing import com.twitter.algebird.CMSHasherImplicits._ val keyAggregator = CMS.aggregator[K](eps, delta, seed) val hotKeyCMS = self.keys.aggregate(keyAggregator) val p = logs.skewedJoin(logMetadata, hotKeyThreshold = 8500, cms=hotKeyCMS)
Read more about CMS: com.twitter.algebird.CMSMonoid.
- Note
Make sure to
import com.twitter.algebird.CMSHasherImplicits
before using this join.
Example: - def skewedLeftOuterJoin[W](rhs: SCollection[(K, W)], hotKeyThreshold: Long = 9000, eps: Double = 0.001, seed: Int = 42, delta: Double = 1e-10, sampleFraction: Double = 1.0, withReplacement: Boolean = true)(implicit hasher: CMSHasher[K]): SCollection[(K, (V, Option[W]))]
N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.
N to 1 skew-proof flavor of PairSCollectionFunctions.leftOuterJoin.
Perform a skewed left join where some keys on the left hand may be hot, i.e. appear more than
hotKeyThreshold
times. Frequency of a key is estimated with1 - delta
probability, and the estimate is withineps * N
of the true frequency.true frequency <= estimate <= true frequency + eps * N
, where N is the total size of the left hand side stream so far.- hotKeyThreshold
key with
hotKeyThreshold
values will be considered hot. Some runners have inefficientGroupByKey
implementation for groups with more than 10K values. Thus it is recommended to sethotKeyThreshold
to below 10K, keep upper estimation error in mind. If you sample input viasampleFraction
make sure to adjusthotKeyThreshold
accordingly.- eps
One-sided error bound on the error of each point query, i.e. frequency estimate. Must lie in
(0, 1)
.- seed
A seed to initialize the random number generator used to create the pairwise independent hash functions.
- delta
A bound on the probability that a query estimate does not lie within some small interval (an interval that depends on
eps
) around the truth. Must lie in(0, 1)
.- sampleFraction
left side sample fraction. Default is
1.0
- no sampling.- withReplacement
whether to use sampling with replacement, see SCollection.sample.
// Implicits that enabling CMS-hashing import com.twitter.algebird.CMSHasherImplicits._ val p = logs.skewedLeftJoin(logMetadata)
Read more about CMS: com.twitter.algebird.CMSMonoid.
- Note
Make sure to
import com.twitter.algebird.CMSHasherImplicits
before using this join.
Example: - final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])