class PairSCollectionFunctions[K, V] extends AnyRef
Extra functions available on SCollections of (key, value) pairs through an implicit conversion.
- Alphabetic
- By Inheritance
- PairSCollectionFunctions
- AnyRef
- Any
- Hide All
- Show All
- Public
- Protected
Instance Constructors
- new PairSCollectionFunctions(self: SCollection[(K, V)])
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- def aggregateByKey[A, U](aggregator: MonoidAggregator[V, A, U])(implicit arg0: Coder[A], arg1: Coder[U]): SCollection[(K, U)]
Aggregate the values of each key with MonoidAggregator.
Aggregate the values of each key with MonoidAggregator. First each value
V
is mapped toA
, then we reduce with a Monoid ofA
, then finally we present the results asU
. This could be more powerful and better optimized in some cases. - def aggregateByKey[A, U](aggregator: Aggregator[V, A, U])(implicit arg0: Coder[A], arg1: Coder[U]): SCollection[(K, U)]
Aggregate the values of each key with Aggregator.
Aggregate the values of each key with Aggregator. First each value
V
is mapped toA
, then we reduce with a Semigroup ofA
, then finally we present the results asU
. This could be more powerful and better optimized in some cases. - def aggregateByKey[U](zeroValue: => U)(seqOp: (U, V) => U, combOp: (U, U) => U)(implicit arg0: Coder[U]): SCollection[(K, U)]
Aggregate the values of each key, using given combine functions and a neutral "zero value".
Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type,
U
, than the type of the values in this SCollection,V
. Thus, we need one operation for merging aV
into aU
and one operation for merging twoU
's. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new
U.
- def applyPerKeyDoFn[U](t: DoFn[KV[K, V], KV[K, U]])(implicit arg0: Coder[U]): SCollection[(K, U)]
Apply a DoFn that processes KV s and wrap the output in an SCollection.
- def approxQuantilesByKey(numQuantiles: Int)(implicit ord: Ordering[V]): SCollection[(K, Iterable[V])]
For each key, compute the values' data distribution using approximate
N
-tiles.For each key, compute the values' data distribution using approximate
N
-tiles.- returns
a new SCollection whose values are
Iterable
s of the approximateN
-tiles of the elements.
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def asMapSideInput: SideInput[Map[K, V]]
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a
Map[key, value]
, to be used with SCollection.withSideInputs.Convert this SCollection to a SideInput, mapping key-value pairs of each window to a
Map[key, value]
, to be used with SCollection.withSideInputs. It is required that each key of the input be associated with a single value.Note: the underlying map implementation is runner specific and may have performance overhead. Use asMapSingletonSideInput instead if the resulting map can fit into memory.
- def asMapSingletonSideInput: SideInput[Map[K, V]]
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a
Map[key, value]
, to be used with SCollection.withSideInputs.Convert this SCollection to a SideInput, mapping key-value pairs of each window to a
Map[key, value]
, to be used with SCollection.withSideInputs. It is required that each key of the input be associated with a single value.Currently, the resulting map is required to fit into memory. This is preferable to asMapSideInput if that's the case.
- def asMultiMapSideInput: SideInput[Map[K, Iterable[V]]]
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a
Map[key, Iterable[value]]
, to be used with SCollection.withSideInputs.Convert this SCollection to a SideInput, mapping key-value pairs of each window to a
Map[key, Iterable[value]]
, to be used with SCollection.withSideInputs. In contrast to asMapSideInput, it is not required that the keys in the input collection be unique.Note: the underlying map implementation is runner specific and may have performance overhead. Use asMultiMapSingletonSideInput instead if the resulting map can fit into memory.
- def asMultiMapSingletonSideInput: SideInput[Map[K, Iterable[V]]]
Convert this SCollection to a SideInput, mapping key-value pairs of each window to a
Map[key, Iterable[value]]
, to be used with SCollection.withSideInputs.Convert this SCollection to a SideInput, mapping key-value pairs of each window to a
Map[key, Iterable[value]]
, to be used with SCollection.withSideInputs. In contrast to asMapSingletonSideInput, it is not required that the keys in the input collection be unique.Currently, the resulting map is required to fit into memory. This is preferable to asMultiMapSideInput if that's the case.
- def batchByKey(batchSize: Long, maxBufferingDuration: Duration = Duration.ZERO): SCollection[(K, Iterable[V])]
Batches inputs to a desired batch size.
Batches inputs to a desired batch size. Batches will contain only elements of a single key.
Elements are buffered until there are batchSize elements buffered, at which point they are emitted to the output SCollection.
Windows are preserved (batches contain elements from the same window). Batches may contain elements from more than one bundle.
A time limit (in processing time) on how long an incomplete batch of elements is allowed to be buffered can be set. Once a batch is flushed to output, the timer is reset. The provided limit must be a positive duration or zero; a zero buffering duration effectively means no limit.
- def batchByteSizedByKey(batchByteSize: Long, maxBufferingDuration: Duration = Duration.ZERO): SCollection[(K, Iterable[V])]
Batches inputs to a desired batch of byte size.
Batches inputs to a desired batch of byte size. Batches will contain only elements of a single key.
The value coder is used to determine the byte size of each element.
Elements are buffered until there are an estimated batchByteSize bytes buffered, at which point they are emitted to the output SCollection.
Windows are preserved (batches contain elements from the same window). Batches may contain elements from more than one bundle.
A time limit (in processing time) on how long an incomplete batch of elements is allowed to be buffered can be set. Once a batch is flushed to output, the timer is reset. The provided limit must be a positive duration or zero; a zero buffering duration effectively means no limit.
- def batchWeightedByKey(weight: Long, cost: (V) => Long, maxBufferingDuration: Duration = Duration.ZERO): SCollection[(K, Iterable[V])]
Batches inputs to a desired weight.
Batches inputs to a desired weight. Batches will contain only elements of a single key.
The weight of each element is computer from the provided cost function.
Elements are buffered until the weight is reached, at which point they are emitted to the output SCollection.
Windows are preserved (batches contain elements from the same window). Batches may contain elements from more than one bundle.
A time limit (in processing time) on how long an incomplete batch of elements is allowed to be buffered can be set. Once a batch is flushed to output, the timer is reset. The provided limit must be a positive duration or zero; a zero buffering duration effectively means no limit.
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- def cogroup[W1, W2, W3](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)], rhs3: SCollection[(K, W3)]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
For each key k in
this
orrhs1
orrhs2
orrhs3
, return a resulting SCollection that contains a tuple with the list of values for that key inthis
,rhs1
,rhs2
andrhs3
. - def cogroup[W1, W2](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
For each key k in
this
orrhs1
orrhs2
, return a resulting SCollection that contains a tuple with the list of values for that key inthis
,rhs1
andrhs2
. - def cogroup[W](rhs: SCollection[(K, W)]): SCollection[(K, (Iterable[V], Iterable[W]))]
For each key k in
this
orrhs
, return a resulting SCollection that contains a tuple with the list of values for that key inthis
as well asrhs
. - def combineByKey[C](createCombiner: (V) => C)(mergeValue: (C, V) => C)(mergeCombiners: (C, C) => C)(implicit arg0: Coder[C]): SCollection[(K, C)]
Generic function to combine the elements for each key using a custom set of aggregation functions.
Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an
SCollection[(K, V)]
into a result of typeSCollection[(K, C)]
, for a "combined type"C
Note thatV
andC
can be different -- for example, one might group an SCollection of type(Int, Int)
into an SCollection of type(Int, Seq[Int])
. Users provide three functions:createCombiner
, which turns aV
into aC
(e.g., creates a one-element list)mergeValue
, to merge aV
into aC
(e.g., adds it to the end of a list)mergeCombiners
, to combine twoC
's into a single one.
Both
mergeValue
andmergeCombiners
are allowed to modify and return their first argument instead of creating a newU
to avoid memory allocation. - def countApproxDistinctByKey(estimator: ApproxDistinctCounter[V]): SCollection[(K, Long)]
Return a new SCollection of (key, value) pairs where value is estimated distinct count(as Long) per each unique key.
Return a new SCollection of (key, value) pairs where value is estimated distinct count(as Long) per each unique key. Correctness of the estimation is depends on the given ApproxDistinctCounter estimator.
- returns
a key valued SCollection where value type is Long and hold the approximate distinct count.
val input: SCollection[(K, V)] = ... val distinctCount: SCollection[(K, Long)] = input.approximateDistinctCountPerKey(ApproximateUniqueCounter(sampleSize))
There are two known subclass of ApproxDistinctCounter available for HLL++ implementations in the
scio-extra
module.- com.spotify.scio.extra.hll.sketching.SketchingHyperLogLogPlusPlus
- com.spotify.scio.extra.hll.zetasketch.ZetasketchHll_Counter HyperLogLog++: Google HLL++ paper
Example: - def countApproxDistinctByKey(maximumEstimationError: Double = 0.02): SCollection[(K, Long)]
Count approximate number of distinct values for each key in the SCollection.
Count approximate number of distinct values for each key in the SCollection.
- maximumEstimationError
the maximum estimation error, which should be in the range
[0.01, 0.5]
.
- def countApproxDistinctByKey(sampleSize: Int): SCollection[(K, Long)]
Count approximate number of distinct values for each key in the SCollection.
Count approximate number of distinct values for each key in the SCollection.
- sampleSize
the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be
>= 16
.
- def countByKey: SCollection[(K, Long)]
Count the number of elements for each key.
Count the number of elements for each key.
- returns
a new SCollection of (key, count) pairs
- def distinctByKey: SCollection[(K, V)]
Return a new SCollection of (key, value) pairs without duplicates based on the keys.
Return a new SCollection of (key, value) pairs without duplicates based on the keys. The value is taken randomly for each key.
- returns
a new SCollection of (key, value) pairs
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def filterValues(f: (V) => Boolean): SCollection[(K, V)]
Return a new SCollection of (key, value) pairs whose values satisfy the predicate.
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- def flatMapValues[U](f: (V) => TraversableOnce[U])(implicit arg0: Coder[U]): SCollection[(K, U)]
Pass each value in the key-value pair SCollection through a
flatMap
function without changing the keys. - def flattenValues[U](implicit arg0: Coder[U], ev: <:<[V, TraversableOnce[U]]): SCollection[(K, U)]
Return an SCollection having its values flattened.
- def foldByKey(implicit mon: Monoid[V]): SCollection[(K, V)]
Fold by key with Monoid, which defines the associative function and "zero value" for
V
.Fold by key with Monoid, which defines the associative function and "zero value" for
V
. This could be more powerful and better optimized in some cases. - def foldByKey(zeroValue: => V)(op: (V, V) => V): SCollection[(K, V)]
Merge the values for each key using an associative function and a neutral "zero value" which may be added to the result an arbitrary number of times, and must not change the result (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
Merge the values for each key using an associative function and a neutral "zero value" which may be added to the result an arbitrary number of times, and must not change the result (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.). The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
- def fullOuterJoin[W](rhs: SCollection[(K, W)]): SCollection[(K, (Option[V], Option[W]))]
Perform a full outer join of
this
andrhs
.Perform a full outer join of
this
andrhs
. For each element (k, v) inthis
, the resulting SCollection will either contain all pairs (k, (Some(v), Some(w))) for w inrhs
, or the pair (k, (Some(v), None)) if no elements inrhs
have key k. Similarly, for each element (k, w) inrhs
, the resulting SCollection will either contain all pairs (k, (Some(v), Some(w))) for v inthis
, or the pair (k, (None, Some(w))) if no elements inthis
have key k. - final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def groupByKey: SCollection[(K, Iterable[V])]
Group the values for each key in the SCollection into a single sequence.
Group the values for each key in the SCollection into a single sequence. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting SCollection is evaluated.
Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairSCollectionFunctions.aggregateByKey or PairSCollectionFunctions.reduceByKey will provide much better performance.
Note: As currently implemented,
groupByKey
must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in anOutOfMemoryError
. - def groupWith[W1, W2, W3](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)], rhs3: SCollection[(K, W3)]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
Alias for
cogroup
. - def groupWith[W1, W2](rhs1: SCollection[(K, W1)], rhs2: SCollection[(K, W2)]): SCollection[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
Alias for
cogroup
. - def groupWith[W](rhs: SCollection[(K, W)]): SCollection[(K, (Iterable[V], Iterable[W]))]
Alias for
cogroup
. - def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def hashPartitionByKey(numPartitions: Int): Seq[SCollection[(K, V)]]
Partition this SCollection using
K.##
inton
partitions.Partition this SCollection using
K.##
inton
partitions. Note that K should provide consistent hash code accross different JVM.- numPartitions
number of output partitions
- returns
partitioned SCollections in a
Seq
- def intersectByKey(rhs: SCollection[K]): SCollection[(K, V)]
Return an SCollection with the pairs from
this
whose keys are inrhs
.Return an SCollection with the pairs from
this
whose keys are inrhs
.Unlike SCollection.intersection this preserves duplicates in
this
. - final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def join[W](rhs: SCollection[(K, W)]): SCollection[(K, (V, W))]
Return an SCollection containing all pairs of elements with matching keys in
this
andrhs
.Return an SCollection containing all pairs of elements with matching keys in
this
andrhs
. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is inthis
and (k, v2) is inrhs
. - implicit lazy val keyCoder: Coder[K]
- def keys: SCollection[K]
Return an SCollection with the keys of each tuple.
- def leftOuterJoin[W](rhs: SCollection[(K, W)]): SCollection[(K, (V, Option[W]))]
Perform a left outer join of
this
andrhs
.Perform a left outer join of
this
andrhs
. For each element (k, v) inthis
, the resulting SCollection will either contain all pairs (k, (v, Some(w))) for w inrhs
, or the pair (k, (v, None)) if no elements inrhs
have key k. - def mapKeys[U](f: (K) => U)(implicit arg0: Coder[U]): SCollection[(U, V)]
Pass each key in the key-value pair SCollection through a
map
function without changing the values. - def mapValues[U](f: (V) => U)(implicit arg0: Coder[U]): SCollection[(K, U)]
Pass each value in the key-value pair SCollection through a
map
function without changing the keys. - def maxByKey(implicit ord: Ordering[V]): SCollection[(K, V)]
Return the max of values for each key as defined by the implicit
Ordering[T]
.Return the max of values for each key as defined by the implicit
Ordering[T]
.- returns
a new SCollection of (key, maximum value) pairs
- def minByKey(implicit ord: Ordering[V]): SCollection[(K, V)]
Return the min of values for each key as defined by the implicit
Ordering[T]
.Return the min of values for each key as defined by the implicit
Ordering[T]
.- returns
a new SCollection of (key, minimum value) pairs
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def reduceByKey(op: (V, V) => V): SCollection[(K, V)]
Merge the values for each key using an associative reduce function.
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
- def reifyAsMapInGlobalWindow: SCollection[Map[K, V]]
Returns an SCollection consisting of a single
Map[K, V]
element. - def reifyAsMultiMapInGlobalWindow: SCollection[Map[K, Iterable[V]]]
Returns an SCollection consisting of a single
Map[K, Iterable[V]]
element. - def rightOuterJoin[W](rhs: SCollection[(K, W)]): SCollection[(K, (Option[V], W))]
Perform a right outer join of
this
andrhs
.Perform a right outer join of
this
andrhs
. For each element (k, w) inrhs
, the resulting SCollection will either contain all pairs (k, (Some(v), w)) for v inthis
, or the pair (k, (None, w)) if no elements inthis
have key k. - def sampleByKey(withReplacement: Boolean, fractions: Map[K, Double]): SCollection[(K, V)]
Return a subset of this SCollection sampled by key (via stratified sampling).
Return a subset of this SCollection sampled by key (via stratified sampling).
Create a sample of this SCollection using variable sampling rates for different keys as specified by
fractions
, a key to sampling rate map, via simple random sampling with one pass over the SCollection, to produce a sample of size that's approximately equal to the sum ofmath.ceil(numItems * samplingRate)
over all key values.- withReplacement
whether to sample with or without replacement
- fractions
map of specific keys to sampling rates
- returns
SCollection containing the sampled subset
- def sampleByKey(sampleSize: Int): SCollection[(K, Iterable[V])]
Return a sampled subset of values for each key of this SCollection.
Return a sampled subset of values for each key of this SCollection.
- returns
a new SCollection of (key, sampled values) pairs
- val self: SCollection[(K, V)]
- def sparseFullOuterJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit funnel: Funnel[K]): SCollection[(K, (Option[V], Option[W]))]
Full outer join for cases when the left collection (
this
) is much larger than the right collection (rhs
) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e.Full outer join for cases when the left collection (
this
) is much larger than the right collection (rhs
) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs
) is used to splitthis
into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates.Import
magnolify.guava.auto._
to get common instances of Guava Funnel s.Read more about Bloom Filter: com.google.common.hash.BloomFilter.
- rhsNumKeys
An estimate of the number of keys in the right collection
rhs
. This estimate is used to find the size and number of BloomFilters rhs Scio would use to split the left collection (this
) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.- fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
- def sparseIntersectByKey(rhs: SCollection[K], rhsNumKeys: Long, computeExact: Boolean = false, fpProb: Double = 0.01)(implicit funnel: Funnel[K]): SCollection[(K, V)]
Return an SCollection with the pairs from
this
whose keys are inrhs
when the cardinality ofthis
>>rhs
, but neither can fit in memory (see PairHashSCollectionFunctions.hashIntersectByKey).Return an SCollection with the pairs from
this
whose keys are inrhs
when the cardinality ofthis
>>rhs
, but neither can fit in memory (see PairHashSCollectionFunctions.hashIntersectByKey).Unlike SCollection.intersection this preserves duplicates in
this
.Import
magnolify.guava.auto._
to get common instances of Guava Funnel s.- rhsNumKeys
An estimate of the number of keys in
rhs
. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filterthis
in a "map" step before any join. Having a value close to the actual number improves the false positives in output. WhencomputeExact
is set to true, a more accurate estimate of the number of keys inrhs
would mean less shuffle when finding the exact value.- computeExact
Whether or not to directly pass through bloom filter results (with a small false positive rate) or perform an additional inner join to confirm exact result set. By default this is set to false.
- fpProb
A fraction in range (0, 1) which would be the accepted false positive probability for this transform. By default when
computeExact
is set tofalse
, this reflects the probability that an output element is an incorrect intersect (meaning it may not be present inrhs
) WhencomputeExact
is set totrue
, this fraction is used to find the acceptable false positive in the intermediate step before computing exact. Note: having fpProb = 0 doesn't mean an exact computation. This value along withrhsNumKeys
is used for creating a BloomFilter.
- def sparseIntersectByKey[AF <: ApproxFilter[K]](sideInput: SideInput[AF]): SCollection[(K, V)]
Return an SCollection with the pairs from
this
whose keys might be present in the SideInput.Return an SCollection with the pairs from
this
whose keys might be present in the SideInput.The
SideInput[ApproxFilter]
can be used reused for multiple sparse operations across multiple SCollections.val si = pairSCollRight.asApproxFilterSideInput(BloomFilter, 1000000) val filtered1 = pairSColl1.sparseIntersectByKey(si) val filtered2 = pairSColl2.sparseIntersectByKey(si)
Example: - def sparseJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit funnel: Funnel[K]): SCollection[(K, (V, W))]
Inner join for cases when the left collection (
this
) is much larger than the right collection (rhs
) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e.Inner join for cases when the left collection (
this
) is much larger than the right collection (rhs
) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs
) is used to splitthis
into 2 partitions. Only those with keys in the filter go through the join and the rest are filtered out before the join.Import
magnolify.guava.auto._
to get common instances of Guava Funnel s.Read more about Bloom Filter: com.google.common.hash.BloomFilter.
- rhsNumKeys
An estimate of the number of keys in the right collection
rhs
. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this
) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.- fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
- def sparseLeftOuterJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit funnel: Funnel[K]): SCollection[(K, (V, Option[W]))]
Left outer join for cases when the left collection (
this
) is much larger than the right collection (rhs
) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e.Left outer join for cases when the left collection (
this
) is much larger than the right collection (rhs
) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs
) is used to splitthis
into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates.Import
magnolify.guava.auto._
to get common instances of Guava Funnel s.Read more about Bloom Filter: com.google.common.hash.BloomFilter.
- rhsNumKeys
An estimate of the number of keys in the right collection
rhs
. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this
) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.- fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
- def sparseLookup[A, B](rhs1: SCollection[(K, A)], rhs2: SCollection[(K, B)], thisNumKeys: Long)(implicit funnel: Funnel[K]): SCollection[(K, (V, Iterable[A], Iterable[B]))]
Look up values from
rhs
whererhs
is much larger and keys fromthis
wont fit in memory, and is sparse inrhs
.Look up values from
rhs
whererhs
is much larger and keys fromthis
wont fit in memory, and is sparse inrhs
. A Bloom Filter of keys inthis
is used to filter out irrelevant keys inrhs
. This is useful when searching for a limited number of values from one or more very large tables.Import
magnolify.guava.auto._
to get common instances of Guava Funnel s.Read more about Bloom Filter: com.google.common.hash.BloomFilter.
- thisNumKeys
An estimate of the number of keys in
this
. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filterrhs
before doing a co-group. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
- def sparseLookup[A, B](rhs1: SCollection[(K, A)], rhs2: SCollection[(K, B)], thisNumKeys: Long, fpProb: Double)(implicit funnel: Funnel[K]): SCollection[(K, (V, Iterable[A], Iterable[B]))]
Look up values from
rhs
whererhs
is much larger and keys fromthis
wont fit in memory, and is sparse inrhs
.Look up values from
rhs
whererhs
is much larger and keys fromthis
wont fit in memory, and is sparse inrhs
. A Bloom Filter of keys inthis
is used to filter out irrelevant keys inrhs
. This is useful when searching for a limited number of values from one or more very large tables.Import
magnolify.guava.auto._
to get common instances of Guava Funnel s.Read more about Bloom Filter: com.google.common.hash.BloomFilter.
- thisNumKeys
An estimate of the number of keys in
this
. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filterrhs1
andrhs2
before doing a co-group. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.- fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when discarding elements of
rhs1
andrhs2
in the pre-filter step.
- def sparseLookup[A](rhs: SCollection[(K, A)], thisNumKeys: Long)(implicit funnel: Funnel[K]): SCollection[(K, (V, Iterable[A]))]
Look up values from
rhs
whererhs
is much larger and keys fromthis
wont fit in memory, and is sparse inrhs
.Look up values from
rhs
whererhs
is much larger and keys fromthis
wont fit in memory, and is sparse inrhs
. A Bloom Filter of keys inthis
is used to filter out irrelevant keys inrhs
. This is useful when searching for a limited number of values from one or more very large tables. Read more about Bloom Filter: com.google.common.hash.BloomFilter.- thisNumKeys
An estimate of the number of keys in
this
. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filterrhs
before doing a co-group. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.
- def sparseLookup[A](rhs: SCollection[(K, A)], thisNumKeys: Long, fpProb: Double)(implicit funnel: Funnel[K]): SCollection[(K, (V, Iterable[A]))]
Look up values from
rhs
whererhs
is much larger and keys fromthis
wont fit in memory, and is sparse inrhs
.Look up values from
rhs
whererhs
is much larger and keys fromthis
wont fit in memory, and is sparse inrhs
. A Bloom Filter of keys inthis
is used to filter out irrelevant keys inrhs
. This is useful when searching for a limited number of values from one or more very large tables. Read more about Bloom Filter: com.google.common.hash.BloomFilter.- thisNumKeys
An estimate of the number of keys in
this
. This estimate is used to find the size and number of BloomFilters that Scio would use to pre-filterrhs
before doing a co-group. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.- fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when discarding elements of
rhs
in the pre-filter step.
- def sparseRightOuterJoin[W](rhs: SCollection[(K, W)], rhsNumKeys: Long, fpProb: Double = 0.01)(implicit funnel: Funnel[K]): SCollection[(K, (Option[V], W))]
Right outer join for cases when the left collection (
this
) is much larger than the right collection (rhs
) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e.Right outer join for cases when the left collection (
this
) is much larger than the right collection (rhs
) which cannot fit in memory, but contains a mostly overlapping set of keys as the left collection, i.e. when the intersection of keys is sparse in the left collection. A Bloom Filter of keys from the right collection (rhs
) is used to splitthis
into 2 partitions. Only those with keys in the filter go through the join and the rest are concatenated. This is useful for joining historical aggregates with incremental updates.Import
magnolify.guava.auto._
to get common instances of Guava Funnel s.Read more about Bloom Filter: com.google.common.hash.BloomFilter.
- rhsNumKeys
An estimate of the number of keys in the right collection
rhs
. This estimate is used to find the size and number of BloomFilters that Scio would use to split the left collection (this
) into overlap and intersection in a "map" step before an exact join. Having a value close to the actual number improves the false positives in intermediate steps which means less shuffle.- fpProb
A fraction in range (0, 1) which would be the accepted false positive probability when computing the overlap. Note: having fpProb = 0 doesn't mean that Scio would calculate an exact overlap.
- def subtractByKey(rhs: SCollection[K]): SCollection[(K, V)]
Return an SCollection with the pairs from
this
whose keys are not inrhs
. - def sumByKey(implicit sg: Semigroup[V]): SCollection[(K, V)]
Reduce by key with Semigroup.
Reduce by key with Semigroup. This could be more powerful and better optimized than reduceByKey in some cases.
- def swap: SCollection[(V, K)]
Swap the keys with the values.
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- def topByKey(num: Int)(implicit ord: Ordering[V]): SCollection[(K, Iterable[V])]
Return the top
num
(largest) values for each key from this SCollection as defined by the specified implicitOrdering[T]
.Return the top
num
(largest) values for each key from this SCollection as defined by the specified implicitOrdering[T]
.- returns
a new SCollection of (key, top
num
values) pairs
- implicit lazy val valueCoder: Coder[V]
- def values: SCollection[V]
Return an SCollection with the values of each tuple.
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- def withHotKeyFanout(hotKeyFanout: Int): SCollectionWithHotKeyFanout[K, V]
Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.
Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.
- hotKeyFanout
constant value for every key
- def withHotKeyFanout(hotKeyFanout: (K) => Int): SCollectionWithHotKeyFanout[K, V]
Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.
Convert this SCollection to an SCollectionWithHotKeyFanout that uses an intermediate node to combine "hot" keys partially before performing the full combine.
- hotKeyFanout
a function from keys to an integer N, where the key will be spread among N intermediate nodes for partial combining. If N is less than or equal to 1, this key will not be sent through an intermediate node.