Scio’s API is heavily influenced by Spark with a lot of ideas from Scalding.
The Dataflow programming model is fundamentally different from that of Spark. Read this Google blog article for more details.
The Scio API is heavily influenced by Spark but there are some minor differences.
- `SCollection` is equivalent to Spark’s `RDD`.
- `DoubleSCollectionFunctions` are specialized versions of `SCollection` and equivalent to Spark’s `DoubleRDDFunctions`.
- Execution planning is static and happens before the job is submitted. There is no driver node in a Dataflow cluster and one can only perform the equivalent of Spark transformations (`RDD` → `RDD`) but not actions (`RDD` → driver local memory).
- There is no broadcast either, but the pattern of `RDD` → driver via action and driver → `RDD` via broadcast can be replaced with side inputs.
- There is no `DStream` (continuous series of `RDD`s) like in Spark Streaming. Values in an `SCollection` are windowed based on timestamp and windowing operation. The same API works regardless of batch (single global window by default) or streaming mode. Aggregation-type transformations that produce `SCollection`s of a single value under the global window will produce one value per window when a non-global window is defined.
- `SCollection` has extra methods for side input, side output, and windowing.
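The broadcast replacement mentioned above can be sketched roughly as follows. This is a minimal illustration rather than canonical usage: it assumes a recent Scio release (one where pipelines are started with `sc.run()`), and the object name `SideInputSketch`, the helper `ratio`, and the sample data are all hypothetical.

```scala
import com.spotify.scio._
import com.spotify.scio.values.{SCollection, SideInput}

object SideInputSketch {
  // Hypothetical pure helper, kept separate so it can be checked without a pipeline.
  def ratio(value: Double, total: Double): Double =
    if (total == 0.0) 0.0 else value / total

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    val values: SCollection[Double] = sc.parallelize(Seq(1.0, 2.0, 5.0))

    // In Spark one might collect the total to the driver and broadcast it back.
    // Here the aggregate stays inside the pipeline as a single-value side input.
    val total: SideInput[Double] = values.reduce(_ + _).asSingletonSideInput

    val normalized: SCollection[Double] = values
      .withSideInputs(total)
      .map { (v, ctx) => ratio(v, ctx(total)) } // ctx(total) reads the side input
      .toSCollection

    normalized.debug() // prints elements where the transform runs
    sc.run().waitUntilDone()
  }
}
```

The aggregate never leaves the pipeline, which is why no driver-side action is needed.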
Scio has much simpler abstract data types compared to Scalding.
- Scalding has many abstract data types like `TypedPipe`, `Grouped`, `CoGrouped`, and `SortedGrouped`.
- Many of them are intermediate and enable some optimizations or wrap around Cascading’s data model.
- As a result, many Scalding operations are lazily evaluated. For example, when a `groupBy` is followed by a reduce with `mergeFn`, `mergeFn` is lifted into `groupBy` to operate on the map side as well.
- Scio, on the other hand, has only one main data type, `SCollection[T]`, and `SCollection[(K, V)]` is a specialized variation when the elements are key-value pairs.
- All Scio operations are strictly evaluated. For example, `groupBy` produces `(K, Iterable[T])` where the values are immediately grouped, whereas `p.reduceByKey(_ + _)` groups `(K, V)` pairs on `K` and reduces values.
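The difference between grouping and reducing in Scio can be sketched as below. This is an illustrative example under assumptions: a recent Scio release is assumed, and `GroupingSketch`, the key function `firstLetter`, and the sample words are hypothetical names introduced here.

```scala
import com.spotify.scio._
import com.spotify.scio.values.SCollection

object GroupingSketch {
  // Hypothetical key function: key each word by its first letter ("" for empty input).
  def firstLetter(word: String): String = word.take(1)

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)
    val words: SCollection[String] =
      sc.parallelize(Seq("apple", "avocado", "banana"))

    // groupBy yields every value for a key as an Iterable.
    val grouped: SCollection[(String, Iterable[String])] =
      words.groupBy(w => firstLetter(w))

    // reduceByKey groups (K, V) pairs on K and reduces the values to one per key.
    val counts: SCollection[(String, Int)] =
      words.map(w => (firstLetter(w), 1)).reduceByKey(_ + _)

    grouped.debug(prefix = "grouped: ")
    counts.debug(prefix = "counts: ")
    sc.run().waitUntilDone()
  }
}
```

When only an aggregate per key is needed, the `reduceByKey` form avoids holding all values for a key in an `Iterable`.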
Some features may look familiar to Scalding users.
- `Args` is a simple command line argument parser similar to the one in Scalding.
- Powerful transforms are possible with `MultiJoin` and coGroup of up to 22 sources.
- `JobTest` for end to end pipeline testing.
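As a rough sketch of two of the features listed above, the following combines `Args` with a three-way `MultiJoin`. The `--minAge` argument, the object name `ArgsAndJoinSketch`, and the sample data are assumptions made for illustration; a recent Scio release is assumed.

```scala
import com.spotify.scio._
import com.spotify.scio.util.MultiJoin

object ArgsAndJoinSketch {
  def main(cmdlineArgs: Array[String]): Unit = {
    // Hypothetical invocation: --minAge=18
    val (sc, args) = ContextAndArgs(cmdlineArgs)
    val minAge = args.int("minAge", 0) // falls back to 0 when the flag is absent

    val names = sc.parallelize(Seq(("u1", "Ann"), ("u2", "Bob")))
    val ages = sc.parallelize(Seq(("u1", 31), ("u2", 17)))
    val cities = sc.parallelize(Seq(("u1", "Oslo"), ("u2", "Lima")))

    // Inner join of three keyed collections into (key, (name, age, city)).
    MultiJoin(names, ages, cities)
      .filter { case (_, (_, age, _)) => age >= minAge }
      .map { case (id, (name, age, city)) => s"$id,$name,$age,$city" }
      .debug()

    sc.run().waitUntilDone()
  }
}
```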
SCollection has a few variations.
- `SCollectionWithSideInput` for replicating small `SCollection`s to all left-hand side values in a large `SCollection`.
- `SCollectionWithSideOutput` for output to multiple `SCollection`s.
- `WindowedSCollection` for accessing window information.
- `SCollectionWithHotKeyFanout` for fanout of skewed data.
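Accessing window information through `WindowedSCollection` might look roughly like this. It is a sketch under assumptions: a recent Scio release is assumed, and `WindowSketch`, the `Click` case class, `eventTime`, and the timestamps are hypothetical.

```scala
import com.spotify.scio._
import org.joda.time.{Duration, Instant}

object WindowSketch {
  // Hypothetical event type carrying its own event time in milliseconds.
  case class Click(user: String, timestampMillis: Long)

  def eventTime(c: Click): Instant = new Instant(c.timestampMillis)

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    val clicks = sc.parallelize(
      Seq(Click("a", 0L), Click("b", 30000L), Click("a", 90000L))
    )

    clicks
      .timestampBy(c => eventTime(c))                // assign event-time timestamps
      .withFixedWindows(Duration.standardMinutes(1)) // replace the global window
      .map(_.user)
      .countByValue                                  // counts are computed per window
      .toWindowed                                    // expose window metadata
      .map(wv => wv.copy(value = s"${wv.window}: ${wv.value}"))
      .toSCollection
      .debug()

    sc.run().waitUntilDone()
  }
}
```

Because a non-global window is applied before `countByValue`, the counts are emitted per window rather than once for the whole input.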
Scio also offers some additional features.
- Each worker can pull files from Google Cloud Storage via `DistCache` to be used in transforms locally, similar to Hadoop distributed cache. See DistCacheExample.scala.
- Type safe BigQuery IO via Scala macros. Case classes and converters are generated at compile time based on BQ schema. This eliminates the error prone process of handling generic JSON objects. See TypedBigQueryTornadoes.scala.
- Sinks (`saveAs*` methods) return a `ClosedTap[T]` that can be opened either in another pipeline as `SCollection[T]` or directly as `Iterator[T]` once the current pipeline completes. This enables complex pipeline orchestration. See WordCountOrchestration.scala.
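A minimal sketch of the tap pattern described above follows. It assumes a Scio release in which sinks return `ClosedTap[T]` and pipelines are started with `sc.run()`; the `--output` argument, the object name `TapSketch`, and the helper `normalize` are hypothetical.

```scala
import com.spotify.scio._
import com.spotify.scio.io.{ClosedTap, Tap}

object TapSketch {
  // Hypothetical transform, kept pure so it can be checked on its own.
  def normalize(s: String): String = s.trim.toUpperCase

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc1, args) = ContextAndArgs(cmdlineArgs)
    val output = args("output") // hypothetical --output=<path> argument

    // First pipeline: the sink hands back a ClosedTap instead of Unit.
    val closed: ClosedTap[String] =
      sc1.parallelize(Seq(" a", "b ", "c")).map(normalize).saveAsTextFile(output)

    // The tap can only be opened once the first pipeline has completed.
    val result = sc1.run().waitUntilDone()
    val tap: Tap[String] = result.tap(closed)

    // Option 1: read the written data directly as an Iterator[String].
    tap.value.foreach(println)

    // Option 2: open the same data as an SCollection in a second pipeline.
    val (sc2, _) = ContextAndArgs(cmdlineArgs)
    tap.open(sc2).map(_.toLowerCase).debug()
    sc2.run().waitUntilDone()
  }
}
```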