sealed trait SCollection[T] extends PCollectionWrapper[T]
A Scala wrapper for PCollection. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all SCollections, such as map, filter, and sum.
In addition, PairSCollectionFunctions contains operations available only on SCollections of key-value pairs, such as groupByKey and join; DoubleSCollectionFunctions contains operations available only on SCollections of Doubles.
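A minimal sketch of basic usage, assuming a locally created ScioContext and Scio's automatically derived Coder instances:
import com.spotify.scio.ScioContext

val sc = ScioContext()
val xs = sc.parallelize(Seq(1, 2, 3, 4))

val doubled = xs.map(_ * 2)          // SCollection[Int]
val evens = xs.filter(_ % 2 == 0)    // SCollection[Int]
val total = xs.sum                   // single-element SCollection[Int]

sc.run()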
- Self Type
- SCollection[T]
- Source
- SCollection.scala
- By Inheritance
- SCollection
- PCollectionWrapper
- TransformNameable
- AnyRef
- Any
Abstract Value Members
- abstract val context: ScioContext
The ScioContext associated with this PCollection.
- Definition Classes
- PCollectionWrapper
- abstract val internal: PCollection[T]
The PCollection being wrapped internally.
- Definition Classes
- PCollectionWrapper
Concrete Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- def ++(that: SCollection[T]): SCollection[T]
Return the union of this SCollection and another one. Any identical elements will appear multiple times (use distinct to eliminate them).
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- def aggregate[A, U](aggregator: MonoidAggregator[T, A, U])(implicit arg0: Coder[A], arg1: Coder[U]): SCollection[U]
Aggregate with MonoidAggregator. First each item T is mapped to A, then we reduce with a Monoid of A, then finally we present the results as U. This could be more powerful and better optimized in some cases.
- def aggregate[A, U](aggregator: Aggregator[T, A, U])(implicit arg0: Coder[A], arg1: Coder[U]): SCollection[U]
Aggregate with Aggregator. First each item T is mapped to A, then we reduce with a Semigroup of A, then finally we present the results as U. This could be more powerful and better optimized in some cases.
- def aggregate[U](zeroValue: => U)(seqOp: (U, T) => U, combOp: (U, U) => U)(implicit arg0: Coder[U]): SCollection[U]
Aggregate the elements using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of this SCollection, T. Thus, we need one operation for merging a T into an U and one operation for merging two U's. Both of these functions are allowed to modify and return their first argument instead of creating a new U to avoid memory allocation.
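A minimal sketch of two of the aggregate variants, assuming a ScioContext named sc and Algebird's Aggregator on the classpath:
import com.spotify.scio.ScioContext
import com.twitter.algebird.Aggregator

val sc = ScioContext()
val xs = sc.parallelize(Seq(1, 2, 3, 4, 5))

// MonoidAggregator variant: count elements via Algebird's size aggregator
val size = xs.aggregate(Aggregator.size)

// zeroValue/seqOp/combOp variant: build a (sum, count) pair
val sumAndCount = xs.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),  // merge a T into the accumulator U
  (a, b) => (a._1 + b._1, a._2 + b._2)   // merge two partial accumulators
)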
- def applyKvTransform[K, V](name: String, transform: PTransform[_ >: PCollection[T], PCollection[KV[K, V]]])(implicit arg0: Coder[K], arg1: Coder[V]): SCollection[KV[K, V]]
Apply a PTransform and wrap the output in an SCollection. This is a special case of applyTransform for transforms with KV output.
- name
default transform name
- transform
PTransform to be applied
- def applyKvTransform[K, V](transform: PTransform[_ >: PCollection[T], PCollection[KV[K, V]]])(implicit arg0: Coder[K], arg1: Coder[V]): SCollection[KV[K, V]]
Apply a PTransform and wrap the output in an SCollection. This is a special case of applyTransform for transforms with KV output.
- def applyTransform[U](name: String, transform: PTransform[_ >: PCollection[T], PCollection[U]])(implicit arg0: Coder[U]): SCollection[U]
Apply a PTransform and wrap the output in an SCollection.
- name
default transform name
- transform
PTransform to be applied
- def applyTransform[U](transform: PTransform[_ >: PCollection[T], PCollection[U]])(implicit arg0: Coder[U]): SCollection[U]
Apply a PTransform and wrap the output in an SCollection.
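A minimal sketch of wrapping a plain Beam transform, assuming Beam's Distinct transform as the example PTransform:
import com.spotify.scio.ScioContext
import org.apache.beam.sdk.transforms.Distinct

val sc = ScioContext()
val words = sc.parallelize(Seq("a", "b", "a"))

// Apply a raw Beam PTransform; the resulting PCollection is wrapped back into an SCollection
val deduped = words.applyTransform("DistinctWords", Distinct.create[String]())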
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def asIterableSideInput: SideInput[Iterable[T]]
Convert this SCollection to a SideInput, mapping each window to an Iterable, to be used with withSideInputs. The values of the Iterable for a window are not required to fit in memory, but they may also not be effectively cached. If it is known that every window fits in memory, and stronger caching is desired, use asListSideInput.
- def asListSideInput: SideInput[Seq[T]]
Convert this SCollection to a SideInput, mapping each window to a Seq, to be used with withSideInputs. The resulting Seq is required to fit in memory.
- def asSetSingletonSideInput: SideInput[Set[T]]
Convert this SCollection to a SideInput, mapping each window to a Set[T], to be used with withSideInputs. The resulting SideInput is a one element singleton which is a Set of all elements in the SCollection for the given window. The complete Set must fit in memory of the worker.
- def asSingletonSideInput(defaultValue: T): SideInput[T]
Convert this SCollection of a single value per window to a SideInput with a default value, to be used with withSideInputs.
- def asSingletonSideInput: SideInput[T]
Convert this SCollection of a single value per window to a SideInput, to be used with withSideInputs.
- def batch(batchSize: Long, maxLiveWindows: Int = BatchDoFn.DEFAULT_MAX_LIVE_WINDOWS): SCollection[Iterable[T]]
Batches elements for amortized processing. Elements are batched per-window and batches are emitted in the window corresponding to their contents. Batches are emitted even if the maximum size is not reached when the bundle finishes or when there are too many live windows.
- batchSize
desired number of elements in a batch
- maxLiveWindows
maximum number of windows to buffer
- def batchByteSized(batchByteSize: Long, maxLiveWindows: Int = BatchDoFn.DEFAULT_MAX_LIVE_WINDOWS): SCollection[Iterable[T]]
Batches elements for amortized processing. Elements are batched per-window and batches are emitted in the window corresponding to their contents. Batches are emitted even if the maximum size is not reached when the bundle finishes or when there are too many live windows.
- batchByteSize
desired batch size in bytes, estimated using the Coder
- maxLiveWindows
maximum number of windows to buffer
- def batchWeighted(batchWeight: Long, cost: (T) => Long, maxLiveWindows: Int = BatchDoFn.DEFAULT_MAX_LIVE_WINDOWS): SCollection[Iterable[T]]
Batches elements for amortized processing. Elements are batched per-window and batches are emitted in the window corresponding to their contents. Batches are emitted even if the maximum size is not reached when the bundle finishes or when there are too many live windows.
- batchWeight
desired batch weight
- cost
function that associates a weight with each element
- maxLiveWindows
maximum number of windows to buffer
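A minimal sketch of batching, assuming a ScioContext named sc:
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection

val sc = ScioContext()
val events = sc.parallelize(1 to 1000)

// Collect elements into batches of up to 100, then process each batch at once
val batched: SCollection[Iterable[Int]] = events.batch(100L)
val batchSums = batched.map(_.sum)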
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- implicit def coder: Coder[T]
- Definition Classes
- PCollectionWrapper
- def collect[U](pfn: PartialFunction[T, U])(implicit arg0: Coder[U]): SCollection[U]
Filter the elements for which the given PartialFunction is defined, and then map.
- def combine[C](createCombiner: (T) => C)(mergeValue: (C, T) => C)(mergeCombiners: (C, C) => C)(implicit arg0: Coder[C]): SCollection[C]
Generic function to combine the elements using a custom set of aggregation functions. Turns an SCollection[T] into a result of type SCollection[C], for a "combined type" C. Note that T and C can be different -- for example, one might combine an SCollection of type Int into an SCollection of type Seq[Int]. Users provide three functions:
- createCombiner, which turns a T into a C (e.g., creates a one-element list)
- mergeValue, to merge a T into a C (e.g., adds it to the end of a list)
- mergeCombiners, to combine two C's into a single one.
Both mergeValue and mergeCombiners are allowed to modify and return their first argument instead of creating a new C to avoid memory allocation.
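A minimal sketch of combine, assuming a ScioContext named sc and using List[Int] as the combined type:
import com.spotify.scio.ScioContext

val sc = ScioContext()
val xs = sc.parallelize(Seq(1, 2, 3, 4))

// createCombiner, mergeValue and mergeCombiners correspond to the three functions above
val asList = xs.combine(x => List(x))((acc, x) => x :: acc)((a, b) => a ::: b)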
- def contravary[U <: T]: SCollection[U]
lifts this SCollection to the specified type
- def count: SCollection[Long]
Count the number of elements in the SCollection.
- returns
a new SCollection with the count
- def countApproxDistinct(estimator: ApproxDistinctCounter[T]): SCollection[Long]
Returns a single valued SCollection with estimated distinct count. Correctness depends on the ApproxDistinctCounter estimator.
- def countApproxDistinct(maximumEstimationError: Double = 0.02): SCollection[Long]
Count approximate number of distinct elements in the SCollection.
- maximumEstimationError
the maximum estimation error, which should be in the range [0.01, 0.5]
- def countApproxDistinct(sampleSize: Int): SCollection[Long]
Count approximate number of distinct elements in the SCollection.
- sampleSize
the number of entries in the statistical sample; the higher this number, the more accurate the estimate will be; should be >= 16
- def countByValue: SCollection[(T, Long)]
Count of each unique value in this SCollection as an SCollection of (value, count) pairs.
- def covary[U >: T]: SCollection[U]
lifts this SCollection to the specified type
- def covary_[U](implicit ev: <:<[T, U]): SCollection[U]
lifts this SCollection to the specified type
- def cross[U](that: SCollection[U]): SCollection[(T, U)]
Return the cross product with another SCollection by replicating that to all workers. The right side should be tiny and fit in memory.
- def debug(out: () => PrintStream = () => Console.out, prefix: String = "", enabled: Boolean = true): SCollection[T]
Print content of an SCollection to out().
- out
where to write the debug information. Default: stdout
- prefix
prefix for each logged entry. Default: empty string
- enabled
if debugging is enabled or not. Default: true. It can be useful to set this to sc.isTest to avoid debugging when running in production.
- def distinct: SCollection[T]
Return a new SCollection containing the distinct elements in this SCollection.
- def distinctBy[U](f: (T) => U)(implicit arg0: Coder[U]): SCollection[T]
Returns a new SCollection with distinct elements, using the given function to obtain a representative value for each input element.
- U
The type of representative values used to dedup.
- f
The function to use to get representative values.
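A minimal sketch of distinctBy, assuming a ScioContext named sc:
import com.spotify.scio.ScioContext

val sc = ScioContext()
val users = sc.parallelize(Seq(("alice", 1), ("bob", 2), ("alice", 3)))

// Keep one element per user name, using the name as the representative value
val deduped = users.distinctBy(_._1)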
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def filter(f: (T) => Boolean): SCollection[T]
Return a new SCollection containing only the elements that satisfy a predicate.
- def filterNot(f: (T) => Boolean): SCollection[T]
Return a new SCollection containing only the elements that don't satisfy a predicate.
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- def flatMap[U](f: (T) => TraversableOnce[U])(implicit arg0: Coder[U]): SCollection[U]
Return a new SCollection by first applying a function to all elements of this SCollection, and then flattening the results.
- def flatten[U](implicit ev: (T) => TraversableOnce[U], coder: Coder[U]): SCollection[U]
Return a new SCollection[U] by flattening each element of an SCollection[Traversable[U]].
- def fold(implicit mon: Monoid[T]): SCollection[T]
Fold with Monoid, which defines the associative function and "zero value" for T. This could be more powerful and better optimized in some cases.
- def fold(zeroValue: => T)(op: (T, T) => T): SCollection[T]
Aggregate the elements using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
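A minimal sketch of both fold variants, assuming a ScioContext named sc and Algebird's Monoid on the classpath:
import com.spotify.scio.ScioContext
import com.twitter.algebird.Monoid

val sc = ScioContext()
val xs = sc.parallelize(Seq(1, 2, 3))

// Explicit zero value and associative function
val sum1 = xs.fold(0)(_ + _)

// Monoid variant: the Monoid supplies both the zero and the plus operation
val sum2 = xs.fold(Monoid.intMonoid)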
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def groupBy[K](f: (T) => K)(implicit arg0: Coder[K]): SCollection[(K, Iterable[T])]
Return an SCollection of grouped items. Each group consists of a key and a sequence of elements mapping to that key. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting SCollection is evaluated.
Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairSCollectionFunctions.aggregateByKey or PairSCollectionFunctions.reduceByKey will provide much better performance.
- def groupMap[K, U](f: (T) => K)(g: (T) => U)(implicit arg0: Coder[K], arg1: Coder[U]): SCollection[(K, Iterable[U])]
Return an SCollection of grouped items. Each group consists of a key and a sequence of elements transformed into a value of type U. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting SCollection is evaluated. It is equivalent to groupBy(key).mapValues(_.map(f)), but more efficient.
- def groupMapReduce[K](f: (T) => K)(g: (T, T) => T)(implicit arg0: Coder[K]): SCollection[(K, T)]
Return an SCollection of grouped items. Each group consists of a key and the result of an associative reduce function. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting SCollection is evaluated.
The associative function is performed locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
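A minimal sketch contrasting the three grouping variants, assuming a ScioContext named sc:
import com.spotify.scio.ScioContext

val sc = ScioContext()
val words = sc.parallelize(Seq("apple", "avocado", "banana", "blueberry"))

// Group whole words by their first letter
val grouped = words.groupBy(_.take(1))

// Group by first letter but keep only word lengths in each group
val lengths = words.groupMap(_.take(1))(_.length)

// Group by first letter and reduce each group to its longest word
val longest = words.groupMapReduce(_.take(1))((a, b) => if (a.length >= b.length) a else b)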
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def hashFilter(sideInput: SideInput[Set[T]]): SCollection[T]
Return a new SCollection containing only the elements that also exist in the SideInput.
- def hashLookup[V](that: SCollection[(T, V)]): SCollection[(T, Iterable[V])]
Look up values in an SCollection[(T, V)] for each element T in this SCollection by replicating that to all workers. The right side should be tiny and fit in memory.
- def hashPartition(numPartitions: Int): Seq[SCollection[T]]
Partition this SCollection using T.## into n partitions.
- numPartitions
number of output partitions
- returns
partitioned SCollections in a Seq
- def intersection(that: SCollection[T]): SCollection[T]
Return the intersection of this SCollection and another one. The output will not contain any duplicate elements, even if the input SCollections did.
Note that this method performs a shuffle internally.
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def keyBy[K](f: (T) => K)(implicit arg0: Coder[K]): SCollection[(K, T)]
Create tuples of the elements in this SCollection by applying f.
- def map[U](f: (T) => U)(implicit arg0: Coder[U]): SCollection[U]
Return a new SCollection by applying a function to all elements of this SCollection.
- def materialize: ClosedTap[T]
Extract data from this SCollection as a closed Tap. The Tap will be available once the pipeline completes successfully. .materialize() must be called before the ScioContext is run, as its implementation modifies the current pipeline graph.
val closedTap = sc.parallelize(1 to 10).materialize
sc.run().waitUntilDone().tap(closedTap)
- def max(implicit ord: Ordering[T]): SCollection[T]
Return the max of this SCollection as defined by the implicit Ordering[T].
- returns
a new SCollection with the maximum element
- def mean(implicit ev: Numeric[T]): SCollection[Double]
Return the mean of this SCollection as defined by the implicit Numeric[T].
- returns
a new SCollection with the mean of elements
- def min(implicit ord: Ordering[T]): SCollection[T]
Return the min of this SCollection as defined by the implicit Ordering[T].
- returns
a new SCollection with the minimum element
- def name: String
A friendly name for this SCollection.
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def partition(p: (T) => Boolean): (SCollection[T], SCollection[T])
Partition this SCollection into a pair of SCollections according to a predicate.
- p
predicate on which to partition
- returns
a pair of SCollections: the first SCollection consists of all elements that satisfy the predicate p and the second consists of all elements that do not.
- def partition(numPartitions: Int, f: (T) => Int): Seq[SCollection[T]]
Partition this SCollection with the provided function.
- numPartitions
number of output partitions
- f
function that assigns an output partition to each element, should be in the range [0, numPartitions - 1]
- returns
partitioned SCollections in a Seq
- def partitionByKey[U](partitionKeys: Set[U])(f: (T) => U): Map[U, SCollection[T]]
Partition this SCollection into a map from possible key values to an SCollection of corresponding elements based on the provided function.
- partitionKeys
The keys for the output partitions
- f
function that assigns an output partition to each element, should be in the range of partitionKeys
- returns
partitioned SCollections in a Map
- def quantilesApprox(numQuantiles: Int)(implicit ord: Ordering[T]): SCollection[Iterable[T]]
Compute the SCollection's data distribution using approximate N-tiles.
- returns
a new SCollection whose single value is an Iterable of the approximate N-tiles of the elements
- def randomSplit(weightA: Double, weightB: Double): (SCollection[T], SCollection[T], SCollection[T])
Randomly splits this SCollection into three parts. Note: 0 < weightA + weightB < 1
- weightA
weight for first SCollection, should be in the range (0, 1)
- weightB
weight for second SCollection, should be in the range (0, 1)
- returns
split SCollections in a Tuple3
- def randomSplit(weight: Double): (SCollection[T], SCollection[T])
Randomly splits this SCollection into two parts.
- weight
weight for left hand side SCollection, should be in the range (0, 1)
- returns
split SCollections in a Tuple2
- def randomSplit(weights: Array[Double]): Array[SCollection[T]]
Randomly splits this SCollection with the provided weights.
- weights
weights for splits, will be normalized if they don't sum to 1
- returns
split SCollections in an array
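A minimal sketch of randomSplit, assuming a ScioContext named sc:
import com.spotify.scio.ScioContext

val sc = ScioContext()
val data = sc.parallelize(1 to 100)

// 80% / 20% split: the remaining weight goes to the second part
val (train, test) = data.randomSplit(0.8)

// Arbitrary weights; they are normalized if they don't sum to 1
val parts = data.randomSplit(Array(0.7, 0.2, 0.1))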
- def readFiles[A](filesTransform: PTransform[_ >: PCollection[ReadableFile], PCollection[A]], directoryTreatment: DirectoryTreatment = DirectoryTreatment.SKIP, compression: Compression = Compression.AUTO)(implicit arg0: Coder[A], ev: <:<[T, String]): SCollection[A]
Reads each file, represented as a pattern, in this SCollection.
- directoryTreatment
Controls how to handle directories in the input.
- compression
Reads files using the given org.apache.beam.sdk.io.Compression.
- def readFiles[A](desiredBundleSizeBytes: Long, directoryTreatment: DirectoryTreatment, compression: Compression)(f: (String) => FileBasedSource[A])(implicit arg0: Coder[A], ev: <:<[T, String]): SCollection[A]
Reads each file, represented as a pattern, in this SCollection. Files are split into multiple offset ranges and read with the FileBasedSource.
- desiredBundleSizeBytes
Desired size of bundles read by the sources.
- directoryTreatment
Controls how to handle directories in the input.
- compression
Reads files using the given org.apache.beam.sdk.io.Compression.
- def readFiles[A](directoryTreatment: DirectoryTreatment, compression: Compression)(f: (ReadableFile) => A)(implicit arg0: Coder[A], ev: <:<[T, String]): SCollection[A]
Reads each file, represented as a pattern, in this SCollection.
- directoryTreatment
Controls how to handle directories in the input.
- compression
Reads files using the given org.apache.beam.sdk.io.Compression.
- def readFiles[A](f: (ReadableFile) => A)(implicit arg0: Coder[A], ev: <:<[T, String]): SCollection[A]
Reads each file, represented as a pattern, in this SCollection.
- def readFilesAsBytes(implicit ev: <:<[T, String]): SCollection[Array[Byte]]
Reads each file, represented as a pattern, in this SCollection.
- returns
each file fully read as Array[Byte].
- def readFilesAsString(implicit ev: <:<[T, String]): SCollection[String]
Reads each file, represented as a pattern, in this SCollection.
- returns
each file fully read as String.
- def readFilesWithPath[A](desiredBundleSizeBytes: Long = FileSCollectionFunctions.DefaultBundleSizeBytes, directoryTreatment: DirectoryTreatment = DirectoryTreatment.SKIP, compression: Compression = Compression.AUTO)(f: (String) => FileBasedSource[A])(implicit arg0: Coder[A], ev: <:<[T, String]): SCollection[(String, A)]
Reads each file, represented as a pattern, in this SCollection. Files are split into multiple offset ranges and read with the FileBasedSource.
- desiredBundleSizeBytes
Desired size of bundles read by the sources.
- directoryTreatment
Controls how to handle directories in the input.
- compression
Reads files using the given org.apache.beam.sdk.io.Compression.
- returns
origin file name paired with read element.
- def readTextFiles(implicit ev: <:<[T, String]): SCollection[String]
Reads each file, represented as a pattern, in this SCollection.
- returns
each line of the input files.
- def readTextFilesWithPath(desiredBundleSizeBytes: Long = FileSCollectionFunctions.DefaultBundleSizeBytes, directoryTreatment: DirectoryTreatment = DirectoryTreatment.SKIP, compression: Compression = Compression.AUTO)(implicit ev: <:<[T, String]): SCollection[(String, String)]
Reads each file, represented as a pattern, in this SCollection. Files are split into multiple offset ranges and read with the FileBasedSource.
- desiredBundleSizeBytes
Desired size of bundles read by the sources.
- directoryTreatment
Controls how to handle directories in the input.
- compression
Reads files using the given org.apache.beam.sdk.io.Compression.
- returns
origin file name paired with read line.
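A minimal sketch of reading file patterns held in an SCollection[String], assuming a ScioContext named sc and a hypothetical bucket path:
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection

val sc = ScioContext()

// Hypothetical file pattern; any matching text files will be read
val patterns = sc.parallelize(Seq("gs://my-bucket/logs/*.txt"))

// Each line of every matched file
val lines: SCollection[String] = patterns.readTextFiles

// Each matched file read fully as a single String
val contents: SCollection[String] = patterns.readFilesAsString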
- def reduce(op: (T, T) => T): SCollection[T]
Reduce the elements of this SCollection using the specified commutative and associative binary operator.
- def reifyAsIterableInGlobalWindow: SCollection[Iterable[T]]
Returns an SCollection consisting of a single Iterable[T] element.
- def reifyAsListInGlobalWindow: SCollection[Seq[T]]
Returns an SCollection consisting of a single Seq[T] element.
- def reifySideInputAsValues[U](side: SideInput[U])(implicit arg0: Coder[U]): SCollection[(T, U)]
Pairs each element with the value of the provided SideInput in the element's window.
Reify as List:
val other: SCollection[Int] = sc.parallelize(Seq(1))
val coll: SCollection[(Int, Seq[Int])] = sc.parallelize(Seq(1, 2)).reifySideInputAsValues(other.asListSideInput)
Reify as Iterable:
val other: SCollection[Int] = sc.parallelize(Seq(1))
val coll: SCollection[(Int, Iterable[Int])] = sc.parallelize(Seq(1, 2)).reifySideInputAsValues(other.asIterableSideInput)
Reify as Map:
val other: SCollection[(Int, Int)] = sc.parallelize(Seq((1, 1)))
val coll: SCollection[(Int, Map[Int, Int])] = sc.parallelize(Seq(1, 2)).reifySideInputAsValues(other.asMapSideInput)
Reify as Multimap:
val other: SCollection[(Int, Int)] = sc.parallelize(Seq((1, 1)))
val coll: SCollection[(Int, Map[Int, Iterable[Int]])] = sc.parallelize(Seq(1, 2)).reifySideInputAsValues(other.asMultiMapSideInput)
- def sample(withReplacement: Boolean, fraction: Double): SCollection[T]
Return a sampled subset of this SCollection. Does not trigger shuffling.
- withReplacement
if true the same element can be produced more than once, otherwise the same element will be sampled only once
- fraction
the sampling fraction
- def sample(sampleSize: Int): SCollection[Iterable[T]]
Return a sampled subset of this SCollection containing exactly sampleSize items. Involves combine operation resulting in shuffling. All the elements of the output should fit into main memory of a single worker machine.
- returns
a new SCollection whose single value is an Iterable of the samples
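A minimal sketch of both sample variants, assuming a ScioContext named sc:
import com.spotify.scio.ScioContext

val sc = ScioContext()
val data = sc.parallelize(1 to 1000)

// ~10% sample without replacement; no shuffle is triggered
val tenPercent = data.sample(withReplacement = false, fraction = 0.1)

// Exactly 5 elements collected into a single Iterable (involves a shuffle)
val five = data.sample(5)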
- def sampleByteSized(totalByteSize: Long): SCollection[Iterable[T]]
- def sampleWeighted(totalWeight: Long, cost: (T) => Long): SCollection[Iterable[T]]
- def saveAsBinaryFile(path: String, numShards: Int = BinaryIO.WriteParam.DefaultNumShards, prefix: String = BinaryIO.WriteParam.DefaultPrefix, suffix: String = BinaryIO.WriteParam.DefaultSuffix, compression: Compression = BinaryIO.WriteParam.DefaultCompression, header: Array[Byte] = BinaryIO.WriteParam.DefaultHeader, footer: Array[Byte] = BinaryIO.WriteParam.DefaultFooter, shardNameTemplate: String = BinaryIO.WriteParam.DefaultShardNameTemplate, framePrefix: (Array[Byte]) => Array[Byte] = BinaryIO.WriteParam.DefaultFramePrefix, frameSuffix: (Array[Byte]) => Array[Byte] = BinaryIO.WriteParam.DefaultFrameSuffix, tempDirectory: String = BinaryIO.WriteParam.DefaultTempDirectory, filenamePolicySupplier: FilenamePolicySupplier = BinaryIO.WriteParam.DefaultFilenamePolicySupplier)(implicit ev: <:<[T, Array[Byte]]): ClosedTap[Nothing]
Save this SCollection as raw bytes. Note that elements must be of type Array[Byte].
- def saveAsCustomOutput[O <: POutput](name: String, transform: PTransform[PCollection[T], O]): ClosedTap[Nothing]
Save this SCollection with a custom output transform. The transform should have a unique name.
- def saveAsTextFile(path: String, numShards: Int = TextIO.WriteParam.DefaultNumShards, suffix: String = TextIO.WriteParam.DefaultSuffix, compression: Compression = TextIO.WriteParam.DefaultCompression, header: Option[String] = TextIO.WriteParam.DefaultHeader, footer: Option[String] = TextIO.WriteParam.DefaultFooter, shardNameTemplate: String = TextIO.WriteParam.DefaultShardNameTemplate, tempDirectory: String = TextIO.WriteParam.DefaultTempDirectory, filenamePolicySupplier: FilenamePolicySupplier = TextIO.WriteParam.DefaultFilenamePolicySupplier, prefix: String = TextIO.WriteParam.DefaultPrefix)(implicit ct: ClassTag[T]): ClosedTap[String]
Save this SCollection as a text file. Note that elements must be of type String.
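A minimal sketch of saveAsTextFile, assuming a ScioContext named sc and a hypothetical output path:
import com.spotify.scio.ScioContext

val sc = ScioContext()
val lines = sc.parallelize(Seq("hello", "world"))

// Write a single shard of .txt output under the (hypothetical) path
lines.saveAsTextFile("gs://my-bucket/output", numShards = 1, suffix = ".txt")

sc.run()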
- def saveAsZstdDictionary(path: String, zstdDictSizeBytes: Int = ZstdDictIO.WriteParam.DefaultZstdDictSizeBytes, numElementsForSizeEstimation: Long = ZstdDictIO.WriteParam.DefaultNumElementsForSizeEstimation, trainingBytesTarget: Option[Int] = ZstdDictIO.WriteParam.DefaultTrainingBytesTarget): ClosedTap[Nothing]
Creates a Zstd dictionary based on this SCollection targeting a dictionary of size zstdDictSizeBytes to be trained with approximately trainingBytesTarget bytes. The exact training size is determined by estimating the average element size with numElementsForSizeEstimation encoded elements and sampling this SCollection at an appropriate rate.
- path
The path to which the trained dictionary should be written.
- zstdDictSizeBytes
The size of the dictionary to train in bytes. Recommended dictionary sizes are in hundreds of KB. Over 10MB is not recommended and you may hit resource limits if the dictionary size is near 20MB.
- numElementsForSizeEstimation
The number of elements of the SCollection to use to estimate the average element size.
- trainingBytesTarget
The target number of bytes on which to train. Memory usage for training can be 10x this. None to infer from zstdDictSizeBytes. Must be able to fit in the memory of a single worker.
- def setCoder(coder: Coder[T]): SCollection[T]
Assign a Coder to this SCollection.
- def setSchema(schema: Schema[T])(implicit ct: ClassTag[T]): SCollection[T]
- def subtract(that: SCollection[T]): SCollection[T]
Return an SCollection with the elements from this that are not in other.
- def sum(implicit sg: Semigroup[T]): SCollection[T]
Reduce with Semigroup.
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def take(num: Long): SCollection[T]
Return a sampled subset of any num elements of the SCollection.
- def tap(f: (T) => Any): SCollection[T]
Applies f to each element of this SCollection, and returns the original value.
- def timestampBy(f: (T) => Instant, allowedTimestampSkew: Duration = Duration.ZERO): SCollection[T]
Assign timestamps to values, with an optional skew.
- def toString(): String
- Definition Classes
- AnyRef → Any
- def toWindowed: WindowedSCollection[T]
Convert this SCollection to a WindowedSCollection.
- def top(num: Int)(implicit ord: Ordering[T]): SCollection[Iterable[T]]
Return the top k (largest) elements from this SCollection as defined by the specified implicit Ordering[T].
- returns
a new SCollection whose single value is an Iterable of the top k
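A minimal sketch of top and quantilesApprox, assuming a ScioContext named sc:
import com.spotify.scio.ScioContext

val sc = ScioContext()
val scores = sc.parallelize(Seq(3, 9, 1, 7, 5))

// Largest 3 elements, returned as a single Iterable value
val top3 = scores.top(3)

// Approximate quartiles (4-tiles) of the distribution
val quartiles = scores.quantilesApprox(4)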
- def transform[U](name: String)(f: (SCollection[T]) => SCollection[U]): SCollection[U]
- def transform[U](f: (SCollection[T]) => SCollection[U]): SCollection[U]
Apply a transform.
- def union(that: SCollection[T]): SCollection[T]
Return the union of this SCollection and another one. Any identical elements will appear multiple times (use distinct to eliminate them).
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- def windowByDays(number: Int, options: WindowOptions = WindowOptions()): SCollection[T]
Window values by days.
- def windowByMonths(number: Int, options: WindowOptions = WindowOptions()): SCollection[T]
Window values by months.
- def windowByWeeks(number: Int, startDayOfWeek: Int, options: WindowOptions = WindowOptions()): SCollection[T]
Window values by weeks.
- def windowByYears(number: Int, options: WindowOptions = WindowOptions()): SCollection[T]
Window values by years.
- def withFanout(fanout: Int): SCollectionWithFanout[T]
Convert this SCollection to an SCollectionWithFanout that uses an intermediate node to combine parts of the data to reduce load on the final global combine step.
- fanout
the number of intermediate keys that will be used
- def withFixedWindows(duration: Duration, offset: Duration = Duration.ZERO, options: WindowOptions = WindowOptions()): SCollection[T]
Window values into fixed windows.
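A minimal sketch of event-time windowing, assuming a ScioContext named sc and joda-time on the classpath:
import com.spotify.scio.ScioContext
import org.joda.time.{Duration, Instant}

val sc = ScioContext()
val events = sc.parallelize(Seq(("a", 1000L), ("b", 65000L)))

// Assign event-time timestamps from the payload, then apply fixed 1-minute windows
val windowed = events
  .timestampBy { case (_, millis) => new Instant(millis) }
  .withFixedWindows(Duration.standardMinutes(1))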
- def withGlobalWindow(options: WindowOptions = WindowOptions()): SCollection[T]
Group values into a single global window.
- def withName(name: String): SCollection.this.type
Set a custom name for the next transform to be applied.
- Definition Classes
- TransformNameable
- def withPaneInfo: SCollection[(T, PaneInfo)]
Convert values into pairs of (value, pane info).
- def withSessionWindows(gapDuration: Duration, options: WindowOptions = WindowOptions()): SCollection[T]
Window values based on sessions.
- def withSideInputs(sides: SideInput[_]*): SCollectionWithSideInput[T]
Convert this SCollection to an SCollectionWithSideInput with one or more SideInputs, similar to Spark broadcast variables. Call SCollectionWithSideInput.toSCollection when done with side inputs.
val s1: SCollection[Int] = // ...
val s2: SCollection[String] = // ...
val s3: SCollection[(String, Double)] = // ...
val s4: SCollection[(String, Double)] = // ...
// Prepare side inputs
val side1 = s1.asSingletonSideInput
val side2 = s2.asIterableSideInput
val side3 = s3.asMapSideInput
val side4 = s4.asMultiMapSideInput
val p: SCollection[MyRecord] = // ...
p.withSideInputs(side1, side2, side3, side4).map { (x, s) =>
  // Extract side inputs from context
  val s1: Int = s(side1)
  val s2: Iterable[String] = s(side2)
  val s3: Map[String, Double] = s(side3)
  val s4: Map[String, Iterable[Double]] = s(side4)
  // ...
}
- def withSideOutputs(sides: SideOutput[_]*): SCollectionWithSideOutput[T]
Convert this SCollection to an SCollectionWithSideOutput with one or more SideOutputs, so that a single transform can write to multiple destinations.
// Prepare side outputs
val side1 = SideOutput[String]()
val side2 = SideOutput[Int]()
val p: SCollection[MyRecord] = // ...
p.withSideOutputs(side1, side2).map { (x, s) =>
  // Write to side outputs via context
  s.output(side1, "word").output(side2, 1)
  // ...
}
- def withSlidingWindows(size: Duration, period: Duration = null, offset: Duration = Duration.ZERO, options: WindowOptions = WindowOptions()): SCollection[T]
Window values into sliding windows.
- def withTimestamp: SCollection[(T, Instant)]
Convert values into pairs of (value, timestamp).
- def withWindow[W <: BoundedWindow](implicit arg0: Coder[W]): SCollection[(T, W)]
Convert values into pairs of (value, window).
- W
window type, must be BoundedWindow or one of its sub-types, e.g. GlobalWindow if this SCollection is not windowed or IntervalWindow if it is windowed.
- def withWindowFn[W <: BoundedWindow](fn: WindowFn[_, W], options: WindowOptions = WindowOptions()): SCollection[T]
Window values with the given function.
- def write(io: ScioIO[T] { type WriteP = Unit }): ClosedTap[T]
- def write(io: ScioIO[T])(params: WriteP): ClosedTap[T]
Generic write method for all ScioIO[T] implementations. If this is a test pipeline, it will evaluate the pre-registered output IO implementation that matches the passed ScioIO[T] implementation; otherwise it will invoke the com.spotify.scio.io.ScioIO[T]#write method with the write configurations passed in.
- io
an implementation of the ScioIO[T] trait
- params
configurations needed to perform the underlying write implementation
Deprecated Value Members
- def readFiles(implicit ev: <:<[T, String]): SCollection[String]
- Annotations
- @deprecated
- Deprecated
(Since version 0.14.5) Use readTextFiles