package smb
Type Members
- class AvroBucketMetadata[K1, K2, V <: IndexedRecord] extends BucketMetadata[K1, K2, V]
org.apache.beam.sdk.extensions.smb.BucketMetadata for Avro IndexedRecord records.
- class AvroFileOperations[ValueT] extends FileOperations[ValueT]
org.apache.beam.sdk.extensions.smb.FileOperations implementation for Avro files.
- class AvroSortedBucketIO extends AnyRef
API for reading and writing Avro sorted-bucket files (see the write sketch below).
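As a point of reference, here is a hedged write-side sketch in scio's Scala syntax. It is a sketch only: `saveAsSortedBucket`, the `UserEvent` record type, and the `"userId"` key field are illustrative assumptions, not details taken from this page.

```scala
import com.spotify.scio.values.SCollection
import com.spotify.scio.smb._ // assumed: brings saveAsSortedBucket into scope
import org.apache.beam.sdk.extensions.smb.AvroSortedBucketIO

object AvroSmbWriteSketch {
  // `UserEvent` stands in for a real Avro SpecificRecord with a string "userId" field.
  def writeBuckets(events: SCollection[UserEvent], output: String) =
    events.saveAsSortedBucket(
      AvroSortedBucketIO
        .write(classOf[CharSequence], "userId", classOf[UserEvent]) // key class, key field, record class
        .to(output)
        .withNumBuckets(128) // must be a power of two
        .withNumShards(2)    // extra shards per bucket mitigate hot keys
    )
}
```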
- abstract class BucketMetadata[K1, K2, V] extends Serializable with HasDisplayData
Represents metadata in a JSON-serializable format to be stored alongside sorted-bucket files in a SortedBucketSink transform, and read and checked for compatibility in a SortedBucketSource transform.
Key encoding
BucketMetadata controls how values are mapped to a key type (see: `[[#extractKeySecondary(Object)]]`) and how those keys are encoded into byte[] (see: `[[#getKeyBytesPrimary(Object)]]`, `[[#getKeyBytesSecondary(Object)]]`). Therefore, for two sources to be compatible in a `[[SortedBucketSource]]` transform, the `[[#keyClass]]` does not have to be the same, as long as the final byte encoding is equivalent. A `[[#coderOverrides()]]` method is provided for any encoding overrides: for example, in Avro sources the `[[CharSequence]]` type should be encoded as a UTF-8 string (see the sketch below).
- Annotations
- @JsonTypeInfo()
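To illustrate the byte-equivalence point, grounded only in the statement above that CharSequence keys are encoded as UTF-8, here is a small self-contained sketch that is independent of the SMB API:

```scala
import java.nio.charset.StandardCharsets
import java.util.Arrays
import org.apache.avro.util.Utf8

object KeyEncodingSketch {
  def main(args: Array[String]): Unit = {
    // The same logical key held as a java.lang.String, encoded as UTF-8 bytes ...
    val fromString: Array[Byte] = "user-123".getBytes(StandardCharsets.UTF_8)
    // ... and held as Avro's CharSequence implementation (Utf8), also as UTF-8 bytes.
    val utf8 = new Utf8("user-123")
    val fromAvroUtf8: Array[Byte] = Arrays.copyOf(utf8.getBytes, utf8.getByteLength)
    // Different key classes, identical byte encoding: compatible key bytes for an SMB join.
    assert(Arrays.equals(fromString, fromAvroUtf8))
  }
}
```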
- class BucketMetadataUtil extends AnyRef
- abstract class BucketShardId extends AnyRef
Abstracts bucket and shard id in SortedBucketIO.
Null keys
Null keys are assigned a bucket ID of -1, separate from the rest of the bucketed data. Note that this does not necessarily reflect the final filename: SMBFilenamePolicy can special-case this bucket ID.
- Annotations
- @AutoValue()
- class CoGbkResultUtil extends AnyRef
- Annotations
- @PatchedFromBeam()
- abstract class FileOperations[V] extends Serializable with HasDisplayData
Abstracts I/O operations for file-based formats.
Since the SMB algorithm doesn't support org.apache.beam.sdk.io.Source splitting, I/O operations must be abstracted at a per-record granularity. Reader and Writer must be Serializable to be used in SortedBucketSource and SortedBucketSink transforms.
- final class IcebergEncoder extends AnyRef
- class JsonBucketMetadata[K1, K2] extends BucketMetadata[K1, K2, TableRow]
org.apache.beam.sdk.extensions.smb.BucketMetadata for BigQuery TableRow JSON records.
- class JsonFileOperations extends FileOperations[TableRow]
org.apache.beam.sdk.extensions.smb.FileOperations implementation for text files with BigQuery TableRow JSON records.
- class JsonSortedBucketIO extends AnyRef
API for reading and writing BigQuery TableRow JSON sorted-bucket files.
- class MissingImplementationException extends RuntimeException
- class MultiSourceKeyGroupReader[KeyType] extends AnyRef
- Annotations
- @SuppressWarnings()
- class ParquetAvroFileOperations[ValueT] extends FileOperations[ValueT]
org.apache.beam.sdk.extensions.smb.FileOperations implementation for Parquet files with Avro records.
- class ParquetAvroSortedBucketIO extends AnyRef
API for reading and writing Parquet sorted-bucket files as Avro.
- class ParquetBucketMetadata[K1, K2, V] extends BucketMetadata[K1, K2, V]
- class ParquetInputFile extends InputFile
- class ParquetOutputFile extends OutputFile
- case class ParquetTypeFileOperations[T](compression: CompressionCodecName, conf: SerializableConfiguration, predicate: FilterPredicate)(implicit pt: ParquetType[T], coder: Coder[T]) extends FileOperations[T] with Product with Serializable
- final class SMBFilenamePolicy extends Serializable
Naming policy for SMB files, similar to org.apache.beam.sdk.io.FileBasedSink.FilenamePolicy.
File names are assigned uniquely per BucketShardId. This class functions differently for the initial write to temp files and for the move of those files to their final destination: temp writes must be idempotent in case of bundle failure, so they are timestamped to ensure an uncorrupted write result when a bundle succeeds.
- class SortedBucketIO extends AnyRef
Sorted-bucket files are PCollections written with SortedBucketSink that can be efficiently merged, without shuffling, by SortedBucketSource. When writing, values are grouped by key into buckets, sorted by key within each bucket, and written to files. When reading, key-values in matching buckets are read in a merge-sort style, reducing shuffle.
- trait SortedBucketOptions extends PipelineOptions
- Annotations
- @Description()
- class SortedBucketOptionsRegistrar extends PipelineOptionsRegistrar
- Annotations
- @AutoService()
- class SortedBucketPrimaryAndSecondaryKeyedSource[K1, K2] extends SortedBucketSource[KV[K1, K2]]
- class SortedBucketPrimaryKeyedSource[K] extends SortedBucketSource[K]
- class SortedBucketSink[K1, K2, V] extends PTransform[PCollection[V], WriteResult]
A PTransform for writing a PCollection to a file-based sink, where files represent "buckets" of elements deterministically assigned by BucketMetadata based on a key extraction function. The elements in each bucket are written in sorted order according to the same key.
This transform is intended to be used in conjunction with the SortedBucketSource transform. Any two datasets written with SortedBucketSink using the same bucketing scheme can be joined by simply reading and merging files sequentially, thus eliminating the shuffle required by GroupByKey-based transforms. This is ideal for datasets that are written once and read many times with a predictable join key, e.g. user event data.
Transform steps
SortedBucketSink maps over each element, extracts a byte[] representation of its sorting key using BucketMetadata#extractKeyPrimary(Object), and assigns it to an Integer bucket using BucketMetadata#getBucketId(byte[]). Next, a GroupByKey transform is applied to create a PCollection of N elements, where N is the number of buckets specified by BucketMetadata#getNumBuckets(); then a SortBucketShard transform sorts elements within each bucket group, optionally also sorting by the secondary key bytes from BucketMetadata#getKeyBytesSecondary(Object). Finally, the write operation is performed, where each bucket is first written to a SortedBucketSink#tempDirectory and then copied to its final destination. (A simplified sketch of the bucket assignment follows this entry.)
A JSON-serialized form of BucketMetadata is also written, which is required in order to join SortedBucketSinks using the SortedBucketSource transform.
Bucketing properties and hot keys
Bucketing properties are specified in BucketMetadata. The number of buckets, N, must be a power of two and should be chosen such that each bucket can fit in a worker node's memory. Note that the SortValues transform will try to sort in-memory and fall back to an ExternalSorter if needed. Each bucket can be further sharded to reduce the impact of hot keys by specifying BucketMetadata#getNumShards().
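To make the transform steps concrete, here is a simplified, self-contained sketch of the bucket-assignment idea. It is not the library's implementation: the hash function below is a stand-in, and the only grounded details are that the bucket count is a power of two, assignment is deterministic from the encoded key bytes, and null keys are routed to the reserved bucket ID -1 described under BucketShardId.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical, simplified bucketing: the real SMB code derives the bucket from
// BucketMetadata's configured hash of the encoded key bytes.
object SimplifiedBucketing {
  val NullBucketId: Int = -1 // reserved bucket for null keys, cf. BucketShardId

  def keyBytes(key: String): Array[Byte] =
    key.getBytes(StandardCharsets.UTF_8)

  def bucketId(key: String, numBuckets: Int): Int = {
    require(Integer.bitCount(numBuckets) == 1, "number of buckets must be a power of two")
    if (key == null) NullBucketId
    else {
      // Stand-in hash; the library's hash function differs.
      val hash = java.util.Arrays.hashCode(keyBytes(key))
      // A power-of-two bucket count lets the hash be reduced with a bitmask.
      hash & (numBuckets - 1)
    }
  }

  def main(args: Array[String]): Unit = {
    println(bucketId("user-123", 128)) // deterministic bucket in [0, 128)
    println(bucketId(null, 128))       // -1
  }
}
```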
- abstract class SortedBucketSource[KeyType] extends BoundedSource[KV[KeyType, CoGbkResult]]
A PTransform for co-grouping sources written using compatible SortedBucketSink transforms. It differs from org.apache.beam.sdk.transforms.join.CoGroupByKey because no shuffle step is required, since the source files are written in pre-sorted order. Instead, matching buckets' files are sequentially read in a merge-sort style, and the resulting value groups are output as org.apache.beam.sdk.transforms.join.CoGbkResult (see the join sketch after this entry).
Source compatibility
Each of the BucketedInput sources must use the same key function and hashing scheme. Since SortedBucketSink writes an additional file representing BucketMetadata, SortedBucketSource begins by reading each metadata file and using BucketMetadata#isCompatibleWith(BucketMetadata) to check compatibility.
The number of buckets, N, does not have to match across sources. Since that value is required to be a power of 2, all values of N are compatible, albeit requiring a fan-out from the source with the smallest N.
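For orientation on the read side, a hedged join sketch in scio's Scala syntax; `sortMergeJoin`, the `UserEvent` and `UserProfile` record types, and the paths are illustrative assumptions, not details taken from this page.

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.smb._ // assumed: brings sortMergeJoin into scope
import org.apache.beam.sdk.extensions.smb.AvroSortedBucketIO
import org.apache.beam.sdk.values.TupleTag

object AvroSmbJoinSketch {
  // `UserEvent` and `UserProfile` stand in for Avro records previously written
  // with compatible BucketMetadata keyed on a CharSequence "userId".
  def joinBuckets(sc: ScioContext, eventsPath: String, profilesPath: String) =
    sc.sortMergeJoin(
      classOf[CharSequence],
      AvroSortedBucketIO.read(new TupleTag[UserEvent]("events"), classOf[UserEvent]).from(eventsPath),
      AvroSortedBucketIO.read(new TupleTag[UserProfile]("profiles"), classOf[UserProfile]).from(profilesPath)
    ) // key groups from matching buckets are merged in sorted order, with no shuffle step
}
```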
- class SortedBucketTransform[FinalKeyT, FinalValueT] extends PTransform[PBegin, WriteResult]
A PTransform that encapsulates both a SortedBucketSource and a SortedBucketSink operation, with a user-supplied transform function mapping merged CoGbkResults to their final writable outputs.
- class TFRecordCodec extends AnyRef
Codec for the TFRecords file format. See https://www.tensorflow.org/api_guides/python/python_io#TFRecords_Format_Details
- Annotations
- @PatchedFromBeam()
- abstract class TargetParallelism extends Serializable
Represents the desired parallelism of an SMB read operation. For a given set of sources, targetParallelism can be set to any number between the least and greatest numbers of buckets among sources. This can be dynamically configured using TargetParallelism#min() or TargetParallelism#max(), which at graph construction time will determine the least or greatest amount of parallelism based on the sources. Alternatively, TargetParallelism#of(int) can be used to statically configure a custom value, or TargetParallelism#auto() can be used to let the runner decide how to split the SMB read at runtime based on the combined byte size of the inputs. If no value is specified, SMB read operations use auto parallelism.
When selecting a target parallelism for your SMB operation, there are tradeoffs to consider (a small construction sketch follows this list):
- Minimal parallelism means fewer workers merging data from potentially many buckets. For example, if source A has 4 buckets and source B has 64, a minimally parallel SMB read would have 4 workers, each one merging 1 bucket from source A and 16 buckets from source B. This read may have low throughput.
- Maximal parallelism means that each bucket is read by at least one worker. For example, if source A has 4 buckets and source B has 64, a maximally parallel SMB read would have 64 workers, each one merging 1 bucket from source B and 1 bucket from source A, replicated 16 times. This may have better throughput than the minimal example, but it is more expensive because every key group from the replicated sources must be re-hashed to avoid emitting duplicate records.
- A custom parallelism between these bounds may be the best balance of speed and computing cost.
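Since the four factory methods are named in this entry, a minimal construction sketch is easy to give. How the chosen value is attached to a particular read (for example, as an argument to a source or a join helper) is not documented here and is left out.

```scala
import org.apache.beam.sdk.extensions.smb.TargetParallelism

object TargetParallelismExamples {
  // The four documented factory methods; the right choice depends on the
  // bucket counts of the sources being merged (see the tradeoffs above).
  val minimal: TargetParallelism = TargetParallelism.min()  // least parallelism across sources
  val maximal: TargetParallelism = TargetParallelism.max()  // every bucket read by at least one worker
  val fixed: TargetParallelism = TargetParallelism.of(16)   // statically configured custom value
  val auto: TargetParallelism = TargetParallelism.auto()    // runner decides at runtime from input size
}
```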
- class TensorFlowBucketIO extends AnyRef
API for reading and writing sorted-bucket TensorFlow TFRecord files with TensorFlow Example records.
- class TensorFlowBucketMetadata[K1, K2] extends BucketMetadata[K1, K2, Example]
BucketMetadata for TensorFlow Example records.
- class TensorFlowFileOperations extends FileOperations[Example]
org.apache.beam.sdk.extensions.smb.FileOperations implementation for TensorFlow TFRecord files with TensorFlow Example records.
Value Members
- object ParquetTypeFileOperations extends Serializable
- object ParquetTypeSortedBucketIO
- object SortedBucketIOUtil