# Scio, Beam and Dataflow
Check out the Beam Programming Guide first for a detailed explanation of the Beam programming model and concepts. Also see this comparison between Scio, Scalding and Spark APIs.
Scio aims to be a thin wrapper on top of Beam while offering an idiomatic Scala-style API.
- `PTransform`s are implemented as idiomatic Scala methods on `SCollection`, e.g. `map`, `flatMap`, `filter`, `reduce`.
- `PairSCollectionFunctions` and `DoubleSCollectionFunctions` are specialized versions of `SCollection` implemented via the Scala “pimp my library” pattern.
- An `SCollection[(K, V)]` is automatically converted to `PairSCollectionFunctions`, which provides key-value operations, e.g. `groupByKey` and `join`.
- An `SCollection[Double]` is automatically converted to `DoubleSCollectionFunctions`, which provides statistical operations, e.g. `mean` and `variance` (see the sketch below).
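A minimal sketch of these specializations in action; `wordStats` and its input are hypothetical, but the methods shown are standard Scio API:

```scala
import com.spotify.scio.values.SCollection

// Hypothetical helper illustrating the implicit specializations above.
def wordStats(lines: SCollection[String]): SCollection[Double] = {
  // Plain SCollection methods: each one is a PTransform under the hood.
  val words: SCollection[String] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

  // SCollection[(K, V)] picks up key-value methods from PairSCollectionFunctions.
  val counts: SCollection[(String, Int)] =
    words.map(w => (w, 1)).reduceByKey(_ + _)

  // SCollection[Double] picks up statistical methods from DoubleSCollectionFunctions.
  counts.map(_._2.toDouble).mean
}
```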
## ScioContext, PipelineOptions, Args and ScioResult
- Beam/Dataflow uses `PipelineOptions` and its subclasses to parse command line arguments. Users have to extend the interface for their application-level arguments.
- Scalding uses `Args` to parse application arguments in a more generic and boilerplate-free style.
- `ScioContext` has a `parseArguments` method that takes an `Array[String]` of command line arguments, parses Beam/Dataflow specific ones into a `PipelineOptions`, application specific ones into an `Args`, and returns both.
- `ContextAndArgs` is a shortcut to create a (`ScioContext`, `Args`) pair (see the sketch below).
- `ScioResult` can be used to access accumulator values and job state.
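A minimal sketch of a job entry point using `ContextAndArgs`, assuming a recent Scio version (0.8+), where `sc.run().waitUntilFinish()` replaces the older `sc.close()`; the `--input`/`--output` argument names are illustrative:

```scala
import com.spotify.scio._

object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    // Splits Beam/Dataflow options (e.g. --runner) from application arguments.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))
      .flatMap(_.split("\\s+"))
      .countByValue
      .map { case (word, count) => s"$word: $count" }
      .saveAsTextFile(args("output"))

    // Runs the pipeline; waitUntilFinish() returns a ScioResult.
    val result: ScioResult = sc.run().waitUntilFinish()
    println(result.state)
  }
}
```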
## IO

- `IO` Read transforms are implemented as methods on `ScioContext`, e.g. `textFile` and `avroFile`.
- `IO` Write transforms are implemented as methods on `SCollection`, e.g. `saveAsTextFile` and `saveAsAvroFile`.
- These IO operations also detect when the `ScioContext` is running in a `JobTest` and manage test IO in memory.
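A minimal sketch of such a test for the hypothetical `WordCount` job above, using Scio's `PipelineSpec` and `JobTest`:

```scala
import com.spotify.scio.io.TextIO
import com.spotify.scio.testing._

class WordCountTest extends PipelineSpec {
  "WordCount" should "count words" in {
    JobTest[WordCount.type]
      .args("--input=in.txt", "--output=out.txt")
      // Inputs and outputs are kept in memory; no file system access happens.
      .input(TextIO("in.txt"), Seq("a b", "b"))
      .output(TextIO("out.txt")) { coll =>
        coll should containInAnyOrder(Seq("a: 1", "b: 2"))
      }
      .run()
  }
}
```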
- Write transforms also return a `ClosedTap`. Once the job completes you can open the `Tap`. `Tap` abstracts away the logic of reading the dataset, either directly as an `Iterator[T]` or by re-opening it in another `ScioContext`. The underlying `Future` completes once the job finishes, which can be used for lightweight pipeline orchestration, e.g. WordCountOrchestration.scala.
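A minimal sketch of the `ClosedTap`/`Tap` round trip, again assuming Scio 0.8+ and illustrative paths:

```scala
import com.spotify.scio._

object TapExample {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    // Write transforms return a ClosedTap handle for the not-yet-written dataset.
    val closedTap = sc
      .textFile(args("input"))
      .saveAsTextFile(args("output"))

    val result = sc.run().waitUntilFinish()

    // Once the job completes, materialize the Tap from the result...
    val tap = result.tap(closedTap)
    // ...and read it back directly,
    val lines: Iterator[String] = tap.value
    // or re-open it as an SCollection in a fresh ScioContext.
    val (sc2, _) = ContextAndArgs(cmdlineArgs)
    val reread = tap.open(sc2)
  }
}
```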
## ByKey operations

- Beam/Dataflow `ByKey` transforms require `PCollection[KV[K, V]]` inputs, while Scio uses `SCollection[(K, V)]`.
- Hence every `ByKey` operation in `PairSCollectionFunctions` converts Scala `(K, V)` to `KV[K, V]` beforehand and vice versa afterwards. However these are lightweight wrappers and the JVM should be able to optimize them.
- `PairSCollectionFunctions` also converts Java `Iterable[V]` results to `scala.Iterable[V]` in some cases.
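For example, a sketch of what a single `groupByKey` hides, summarizing the conversions described above:

```scala
import com.spotify.scio.values.SCollection

// Scala types on both sides; the KV conversions happen inside Scio.
def group(pairs: SCollection[(String, Int)]): SCollection[(String, Iterable[Int])] =
  // Internally: (K, V) is wrapped as KV[K, V], Beam's GroupByKey runs,
  // and KV[K, java.lang.Iterable[V]] is unwrapped to (K, scala.Iterable[V]).
  pairs.groupByKey
```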
## Coders

- Beam/Dataflow uses `Coder` for (de)serializing elements in a `PCollection` during shuffle. There are built-in coders for Java primitive types, collections, and common types in GCP like Avro, Protobuf and BigQuery `TableRow`.
- `PCollection` uses `TypeToken` from Guava to work around Java type erasure and retrieve type information of elements. This may not always work, but there is a `PCollection#setCoder` method to override it.
- Twitter’s chill library uses Kryo to (de)serialize data. Chill includes serializers for common Scala types and can also automatically derive serializers for arbitrary objects. Scio falls back to `KryoAtomicCoder` when a built-in coder isn’t available.
- A coder may be non-deterministic if `Coder#verifyDeterministic` throws an exception. Any data type with such a coder cannot be used as a key in `ByKey` operations. However `KryoAtomicCoder` assumes all types are deterministic for simplicity, so it is up to the user to avoid non-deterministic types, e.g. tuples or case classes with doubles, as keys (see the sketch after this list).
- Avro `GenericRecord` requires a schema during deserialization (which is available via `GenericRecord#getSchema` for serialization), and `AvroCoder` requires it during initialization too. This is not possible in `KryoAtomicCoder`, i.e. when nesting a `GenericRecord` inside a Scala type. Instead `KryoAtomicCoder` serializes the schema before every record so that records can round trip safely. This is not optimal but it is the only way short of requiring the user to handcraft a custom coder.
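A minimal sketch of the determinism caveat; `Point` is a made-up type that falls back to the Kryo coder described above:

```scala
import com.spotify.scio.values.SCollection

// Hypothetical key type: Kryo-coded, and floating point encodings are
// not guaranteed to be deterministic in Beam.
case class Point(x: Double, y: Double)

def sumByPoint(xs: SCollection[(Point, Long)]): SCollection[(Point, Long)] =
  // KryoAtomicCoder never throws in verifyDeterministic, so this compiles
  // and runs, but grouping on Point keys is at the user's own risk.
  xs.reduceByKey(_ + _)
```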