package smb
Type Members
- class AvroBucketMetadata[K1, K2, V <: IndexedRecord] extends BucketMetadata[K1, K2, V]
org.apache.beam.sdk.extensions.smb.BucketMetadata for Avro IndexedRecord records.
- class AvroFileOperations[ValueT] extends FileOperations[ValueT]
org.apache.beam.sdk.extensions.smb.FileOperations implementation for Avro files.
- class AvroSortedBucketIO extends AnyRef
API for reading and writing Avro sorted-bucket files (see the write sketch below).
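As a point of reference, here is a hedged write-side sketch in scio's Scala syntax. It is a sketch only: `saveAsSortedBucket`, the `UserEvent` record type, and the `"userId"` key field are illustrative assumptions, not details taken from this page.

```scala
import com.spotify.scio.values.SCollection
import com.spotify.scio.smb._ // assumed: brings saveAsSortedBucket into scope
import org.apache.beam.sdk.extensions.smb.AvroSortedBucketIO

object AvroSmbWriteSketch {
  // `UserEvent` stands in for a real Avro SpecificRecord with a string "userId" field.
  def writeBuckets(events: SCollection[UserEvent], output: String) =
    events.saveAsSortedBucket(
      AvroSortedBucketIO
        .write(classOf[CharSequence], "userId", classOf[UserEvent]) // key class, key field, record class
        .to(output)
        .withNumBuckets(128) // must be a power of two
        .withNumShards(2)    // extra shards per bucket mitigate hot keys
    )
}
```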
- abstract class BucketMetadata[K1, K2, V] extends Serializable with HasDisplayData
Represents metadata in a JSON-serializable format to be stored alongside sorted-bucket files in a SortedBucketSink transform, and read and checked for compatibility in a SortedBucketSource transform.
Key encoding
BucketMetadata controls how values are mapped to a key type (see: `[[#extractKeySecondary(Object)]]`) and how those keys are encoded into byte[] (see: `[[#getKeyBytesPrimary(Object)]]`, `[[#getKeyBytesSecondary(Object)]]`). Therefore, for two sources to be compatible in a `[[SortedBucketSource]]` transform, the `[[#keyClass]]` does not have to be the same, as long as the final byte encoding is equivalent. A `[[#coderOverrides()]]` method is provided for any encoding overrides: for example, in Avro sources the `[[CharSequence]]` type should be encoded as a UTF-8 string (see the sketch below).
- Annotations
- @JsonTypeInfo()
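To illustrate the byte-equivalence point, grounded only in the statement above that CharSequence keys are encoded as UTF-8, here is a small self-contained sketch that is independent of the SMB API:

```scala
import java.nio.charset.StandardCharsets
import java.util.Arrays
import org.apache.avro.util.Utf8

object KeyEncodingSketch {
  def main(args: Array[String]): Unit = {
    // The same logical key held as a java.lang.String, encoded as UTF-8 bytes ...
    val fromString: Array[Byte] = "user-123".getBytes(StandardCharsets.UTF_8)
    // ... and held as Avro's CharSequence implementation (Utf8), also as UTF-8 bytes.
    val utf8 = new Utf8("user-123")
    val fromAvroUtf8: Array[Byte] = Arrays.copyOf(utf8.getBytes, utf8.getByteLength)
    // Different key classes, identical byte encoding: compatible key bytes for an SMB join.
    assert(Arrays.equals(fromString, fromAvroUtf8))
  }
}
```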
- class BucketMetadataUtil extends AnyRef
- abstract class BucketShardId extends AnyRef
Abstracts bucket and shard id in SortedBucketIO.
Null keys
Null keys are assigned a bucket ID of -1, separate from the rest of the bucketed data. Note that this does not necessarily reflect the final filename: SMBFilenamePolicy can special-case this bucket ID.
- Annotations
- @AutoValue()
- class CoGbkResultUtil extends AnyRef
- Annotations
- @PatchedFromBeam()
- abstract class FileOperations[V] extends Serializable with HasDisplayData
Abstracts I/O operations for file-based formats.
Since the SMB algorithm doesn't support org.apache.beam.sdk.io.Source splitting, I/O operations must be abstracted at a per-record granularity. Reader and Writer must be Serializable to be used in SortedBucketSource and SortedBucketSink transforms.
- final class IcebergEncoder extends AnyRef
- class JsonBucketMetadata[K1, K2] extends BucketMetadata[K1, K2, TableRow]
org.apache.beam.sdk.extensions.smb.BucketMetadata for BigQuery TableRow JSON records.
- class JsonFileOperations extends FileOperations[TableRow]
org.apache.beam.sdk.extensions.smb.FileOperations implementation for text files with BigQuery TableRow JSON records.
- class JsonSortedBucketIO extends AnyRef
API for reading and writing BigQuery TableRow JSON sorted-bucket files.
- class MissingImplementationException extends RuntimeException
- class MultiSourceKeyGroupReader[KeyType] extends AnyRef
- Annotations
- @SuppressWarnings()
- class ParquetAvroFileOperations[ValueT] extends FileOperations[ValueT]
org.apache.beam.sdk.extensions.smb.FileOperations implementation for Parquet files with Avro records.
- class ParquetAvroSortedBucketIO extends AnyRef
API for reading and writing Parquet sorted-bucket files as Avro.
- class ParquetBucketMetadata[K1, K2, V] extends BucketMetadata[K1, K2, V]
- class ParquetInputFile extends InputFile
- class ParquetOutputFile extends OutputFile
- case class ParquetTypeFileOperations[T](compression: CompressionCodecName, conf: SerializableConfiguration, predicate: FilterPredicate)(implicit pt: ParquetType[T], coder: Coder[T]) extends FileOperations[T] with Product with Serializable
- final class SMBFilenamePolicy extends Serializable
Naming policy for SMB files, similar to org.apache.beam.sdk.io.FileBasedSink.FilenamePolicy.
File names are assigned uniquely per BucketShardId. This class functions differently for the initial write to temp files and for the move of those files to their final destination: temp writes must be idempotent in case of bundle failure, so they are timestamped to ensure an uncorrupted write result when a bundle succeeds.
- class SortedBucketIO extends AnyRef
Sorted-bucket files are PCollections written with SortedBucketSink that can be efficiently merged, without shuffling, by SortedBucketSource. When writing, values are grouped by key into buckets, sorted by key within each bucket, and written to files. When reading, key-values in matching buckets are read in a merge-sort style, reducing shuffle.
- trait SortedBucketOptions extends PipelineOptions
- Annotations
- @Description()
- class SortedBucketOptionsRegistrar extends PipelineOptionsRegistrar
- Annotations
- @AutoService()
- class SortedBucketPrimaryAndSecondaryKeyedSource[K1, K2] extends SortedBucketSource[KV[K1, K2]]
- class SortedBucketPrimaryKeyedSource[K] extends SortedBucketSource[K]
- class SortedBucketSink[K1, K2, V] extends PTransform[PCollection[V], WriteResult]
A PTransform for writing a PCollection to a file-based sink, where files represent "buckets" of elements deterministically assigned by BucketMetadata based on a key extraction function. The elements in each bucket are written in sorted order according to the same key.
This transform is intended to be used in conjunction with the SortedBucketSource transform. Any two datasets written with SortedBucketSink using the same bucketing scheme can be joined by simply reading and merging files sequentially, thus eliminating the shuffle required by GroupByKey-based transforms. This is ideal for datasets that are written once and read many times with a predictable join key, e.g. user event data.
Transform steps
SortedBucketSink maps over each element, extracts a byte[] representation of its sorting key using BucketMetadata#extractKeyPrimary(Object), and assigns it to an Integer bucket using BucketMetadata#getBucketId(byte[]). Next, a GroupByKey transform is applied to create a PCollection of N elements, where N is the number of buckets specified by BucketMetadata#getNumBuckets(); then a SortBucketShard transform sorts elements within each bucket group, optionally also sorting by the secondary key bytes from BucketMetadata#getKeyBytesSecondary(Object). Finally, the write operation is performed, where each bucket is first written to a SortedBucketSink#tempDirectory and then copied to its final destination. (A simplified sketch of the bucket assignment follows this entry.)
A JSON-serialized form of BucketMetadata is also written, which is required in order to join SortedBucketSinks using the SortedBucketSource transform.
Bucketing properties and hot keys
Bucketing properties are specified in BucketMetadata. The number of buckets, N, must be a power of two and should be chosen such that each bucket can fit in a worker node's memory. Note that the SortValues transform will try to sort in-memory and fall back to an ExternalSorter if needed. Each bucket can be further sharded to reduce the impact of hot keys by specifying BucketMetadata#getNumShards().
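To make the transform steps concrete, here is a simplified, self-contained sketch of the bucket-assignment idea. It is not the library's implementation: the hash function below is a stand-in, and the only grounded details are that the bucket count is a power of two, assignment is deterministic from the encoded key bytes, and null keys are routed to the reserved bucket ID -1 described under BucketShardId.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical, simplified bucketing: the real SMB code derives the bucket from
// BucketMetadata's configured hash of the encoded key bytes.
object SimplifiedBucketing {
  val NullBucketId: Int = -1 // reserved bucket for null keys, cf. BucketShardId

  def keyBytes(key: String): Array[Byte] =
    key.getBytes(StandardCharsets.UTF_8)

  def bucketId(key: String, numBuckets: Int): Int = {
    require(Integer.bitCount(numBuckets) == 1, "number of buckets must be a power of two")
    if (key == null) NullBucketId
    else {
      // Stand-in hash; the library's hash function differs.
      val hash = java.util.Arrays.hashCode(keyBytes(key))
      // A power-of-two bucket count lets the hash be reduced with a bitmask.
      hash & (numBuckets - 1)
    }
  }

  def main(args: Array[String]): Unit = {
    println(bucketId("user-123", 128)) // deterministic bucket in [0, 128)
    println(bucketId(null, 128))       // -1
  }
}
```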
- abstract class SortedBucketSource[KeyType] extends BoundedSource[KV[KeyType, CoGbkResult]]
A PTransform for co-grouping sources written using compatible SortedBucketSink transforms. It differs from org.apache.beam.sdk.transforms.join.CoGroupByKey because no shuffle step is required, since the source files are written in pre-sorted order. Instead, matching buckets' files are sequentially read in a merge-sort style, and the resulting value groups are output as org.apache.beam.sdk.transforms.join.CoGbkResult (see the join sketch after this entry).
Source compatibility
Each of the BucketedInput sources must use the same key function and hashing scheme. Since SortedBucketSink writes an additional file representing BucketMetadata, SortedBucketSource begins by reading each metadata file and using BucketMetadata#isCompatibleWith(BucketMetadata) to check compatibility.
The number of buckets, N, does not have to match across sources. Since that value is required to be a power of 2, all values of N are compatible, albeit requiring a fan-out from the source with the smallest N.
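For orientation on the read side, a hedged join sketch in scio's Scala syntax; `sortMergeJoin`, the `UserEvent` and `UserProfile` record types, and the paths are illustrative assumptions, not details taken from this page.

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.smb._ // assumed: brings sortMergeJoin into scope
import org.apache.beam.sdk.extensions.smb.AvroSortedBucketIO
import org.apache.beam.sdk.values.TupleTag

object AvroSmbJoinSketch {
  // `UserEvent` and `UserProfile` stand in for Avro records previously written
  // with compatible BucketMetadata keyed on a CharSequence "userId".
  def joinBuckets(sc: ScioContext, eventsPath: String, profilesPath: String) =
    sc.sortMergeJoin(
      classOf[CharSequence],
      AvroSortedBucketIO.read(new TupleTag[UserEvent]("events"), classOf[UserEvent]).from(eventsPath),
      AvroSortedBucketIO.read(new TupleTag[UserProfile]("profiles"), classOf[UserProfile]).from(profilesPath)
    ) // key groups from matching buckets are merged in sorted order, with no shuffle step
}
```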
- class SortedBucketTransform[FinalKeyT, FinalValueT] extends PTransform[PBegin, WriteResult]
A PTransform that encapsulates both a SortedBucketSource and a SortedBucketSink operation, with a user-supplied transform function mapping merged CoGbkResults to their final writable outputs.
- class TFRecordCodec extends AnyRef
Codec for the TFRecords file format. See https://www.tensorflow.org/api_guides/python/python_io#TFRecords_Format_Details
- Annotations
- @PatchedFromBeam()
- abstract class TargetParallelism extends Serializable
Represents the desired parallelism of an SMB read operation. For a given set of sources, targetParallelism can be set to any number between the least and greatest numbers of buckets among sources. This can be dynamically configured using TargetParallelism#min() or TargetParallelism#max(), which at graph construction time will determine the least or greatest amount of parallelism based on the sources. Alternatively, TargetParallelism#of(int) can be used to statically configure a custom value, or TargetParallelism#auto() can be used to let the runner decide how to split the SMB read at runtime based on the combined byte size of the inputs. If no value is specified, SMB read operations use auto parallelism.
When selecting a target parallelism for your SMB operation, there are tradeoffs to consider (a small construction sketch follows this list):
- Minimal parallelism means fewer workers merging data from potentially many buckets. For example, if source A has 4 buckets and source B has 64, a minimally parallel SMB read would have 4 workers, each one merging 1 bucket from source A and 16 buckets from source B. This read may have low throughput.
- Maximal parallelism means that each bucket is read by at least one worker. For example, if source A has 4 buckets and source B has 64, a maximally parallel SMB read would have 64 workers, each one merging 1 bucket from source B and 1 bucket from source A, replicated 16 times. This may have better throughput than the minimal example, but it is more expensive because every key group from the replicated sources must be re-hashed to avoid emitting duplicate records.
- A custom parallelism between these bounds may be the best balance of speed and computing cost.
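Since the four factory methods are named in this entry, a minimal construction sketch is easy to give. How the chosen value is attached to a particular read (for example, as an argument to a source or a join helper) is not documented here and is left out.

```scala
import org.apache.beam.sdk.extensions.smb.TargetParallelism

object TargetParallelismExamples {
  // The four documented factory methods; the right choice depends on the
  // bucket counts of the sources being merged (see the tradeoffs above).
  val minimal: TargetParallelism = TargetParallelism.min()  // least parallelism across sources
  val maximal: TargetParallelism = TargetParallelism.max()  // every bucket read by at least one worker
  val fixed: TargetParallelism = TargetParallelism.of(16)   // statically configured custom value
  val auto: TargetParallelism = TargetParallelism.auto()    // runner decides at runtime from input size
}
```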
- class TensorFlowBucketIO extends AnyRef
API for reading and writing sorted-bucket TensorFlow TFRecord files with TensorFlow Example records.
- class TensorFlowBucketMetadata[K1, K2] extends BucketMetadata[K1, K2, Example]
BucketMetadata for TensorFlow Example records.
- class TensorFlowFileOperations extends FileOperations[Example]
org.apache.beam.sdk.extensions.smb.FileOperations implementation for TensorFlow TFRecord files with TensorFlow Example records.
Value Members
- object ParquetTypeFileOperations extends Serializable
- object ParquetTypeSortedBucketIO
- object SortedBucketIOUtil