Scio, Spark and Scalding
Check out the Beam Programming Guide first for a detailed explanation of the Beam programming model and concepts.
Scio’s API is heavily influenced by Spark with a lot of ideas from Scalding.
Scio and Spark
The Dataflow programming model is fundamentally different from that of Spark. Read this Google blog article for more details.
The Scio API is heavily influenced by Spark but there are some minor differences.
- `SCollection` is equivalent to Spark's `RDD`.
- `PairSCollectionFunctions` and `DoubleSCollectionFunctions` are specialized versions of `SCollection` and equivalent to Spark's `PairRDDFunctions` and `DoubleRDDFunctions`.
- Execution planning is static and happens before the job is submitted. There is no driver node in a Dataflow cluster and one can only perform the equivalent of Spark transformations (`RDD` → `RDD`) but not actions (`RDD` → driver local memory).
- There is no broadcast either, but the pattern of `RDD` → driver via action and driver → `RDD` via broadcast can be replaced with `SCollection.asSingletonSideInput` and `SCollection.withSideInputs`.
- There is no `DStream` (continuous series of `RDD`s) like in Spark Streaming. Values in an `SCollection` are windowed based on timestamp and windowing operation. The same API works regardless of batch (single global window by default) or streaming mode. Aggregation transformations that produce `SCollection`s of a single value under the global window will produce one value per window when a non-global window is defined.
- `SCollection` has extra methods for side input, side output, and windowing.
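The broadcast-replacement pattern above can be sketched as follows. This is a minimal illustration of `asSingletonSideInput` and `withSideInputs` with made-up data; the object and value names are placeholders:

```scala
import com.spotify.scio._

object SideInputExample {
  def main(cmdlineArgs: Array[String]): Unit = {
    val sc = ScioContext()

    // A small dataset reduced to a single value and replicated to all workers,
    // playing the role of a Spark broadcast variable.
    val maxSide = sc.parallelize(Seq(1, 5, 3)).max.asSingletonSideInput

    // A large dataset; each element can read the side input via the context.
    sc.parallelize(1 to 100)
      .withSideInputs(maxSide)
      .map { case (x, ctx) => x * ctx(maxSide) }
      .toSCollection

    sc.run()
  }
}
```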
Scio and Scalding
Scio has a much simpler set of abstract data types than Scalding.
- Scalding has many abstract data types like `TypedPipe`, `Grouped`, `CoGrouped` and `SortedGrouped`. Many of them are intermediate and enable some optimizations or wrap around Cascading's data model.
- As a result many Scalding operations are lazily evaluated; for example in `pipe.groupBy(keyFn).reduce(mergeFn)`, `mergeFn` is lifted into `groupBy` to operate on the map side as well.
- Scio, on the other hand, has only one main data type, `SCollection[T]`; `SCollection[(K, V)]` is a specialized variation when the elements are key-value pairs.
- All Scio operations are strictly evaluated; for example `p.groupBy(keyFn)` returns `(K, Iterable[T])` pairs where the values are immediately grouped, whereas `p.reduceByKey(_ + _)` groups `(K, V)` pairs on `K` and reduces the values.
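The contrast between grouping and reducing can be sketched as below; the data and object name are made up for illustration:

```scala
import com.spotify.scio._

object StrictEvalExample {
  def main(cmdlineArgs: Array[String]): Unit = {
    val sc = ScioContext()
    val p = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // groupBy materializes full groups:
    // SCollection[(String, Iterable[(String, Int)])]
    val grouped = p.groupBy(_._1)

    // reduceByKey merges values per key and can combine on the map side:
    // SCollection[(String, Int)]
    val reduced = p.reduceByKey(_ + _)

    sc.run()
  }
}
```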
Some features may look familiar to Scalding users.
- `Args` is a simple command line argument parser similar to the one in Scalding.
- Powerful transforms are possible with `sum`, `sumByKey`, `aggregate` and `aggregateByKey` using Algebird `Semigroup`s and `Aggregator`s.
- `MultiJoin` and `coGroup` of up to 22 sources.
- `JobTest` for end-to-end pipeline testing.
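A small job in the familiar Scalding word-count style can tie these together. This is a sketch; the `input` and `output` argument names are illustrative:

```scala
import com.spotify.scio._

object WordCountJob {
  def main(cmdlineArgs: Array[String]): Unit = {
    // ContextAndArgs wraps Args, the Scalding-style argument parser.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))
      .flatMap(_.split("\\s+").filter(_.nonEmpty))
      .map((_, 1L))
      .sumByKey // uses an Algebird Semigroup[Long] under the hood
      .saveAsTextFile(args("output"))

    sc.run()
  }
}
```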
SCollection
`SCollection` has a few variations.

- `SCollectionWithSideInput` for replicating small `SCollection`s to all left-hand side values in a large `SCollection`.
- `SCollectionWithSideOutput` for output to multiple `SCollection`s.
- `WindowedSCollection` for accessing window information.
- `SCollectionWithFanout` and `SCollectionWithHotKeyFanout` for fanout of skewed data.
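For instance, `SCollectionWithSideOutput` lets one pass over the data feed multiple outputs. A minimal sketch with made-up data, splitting evens and odds:

```scala
import com.spotify.scio._
import com.spotify.scio.values.SideOutput

object SideOutputExample {
  def main(cmdlineArgs: Array[String]): Unit = {
    val sc = ScioContext()

    // Declare a side output, then split one pass over the data in two.
    val evens = SideOutput[Int]()
    val (odds, sideOutputs) = sc
      .parallelize(1 to 10)
      .withSideOutputs(evens)
      .flatMap { (x, ctx) =>
        if (x % 2 == 0) { ctx.output(evens, x); None }
        else Some(x)
      }
    val evenValues = sideOutputs(evens) // SCollection[Int] of the evens

    sc.run()
  }
}
```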
Additional features
Scio also offers some additional features.
- Each worker can pull files from Google Cloud Storage via `DistCache` to be used in transforms locally, similar to the Hadoop distributed cache. See DistCacheExample.scala.
- Type safe BigQuery IO via Scala macros. Case classes and converters are generated at compile time based on the BigQuery schema. This eliminates the error-prone process of handling generic JSON objects. See TypedBigQueryTornadoes.scala.
- Sinks (`saveAs*` methods) return `ClosedTap[T]` that can be opened either in another pipeline as `SCollection[T]` or directly as `Iterator[T]` once the current pipeline completes. This enables complex pipeline orchestration. See WordCountOrchestration.scala.
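The tap-based orchestration can be sketched as follows. This is an illustrative outline and the exact tap API may differ between Scio versions; the output path and object names are made up:

```scala
import com.spotify.scio._

object OrchestrationExample {
  def main(cmdlineArgs: Array[String]): Unit = {
    val sc = ScioContext()
    // Sinks return a ClosedTap handle rather than writing eagerly.
    val closedTap = sc.parallelize(Seq("a", "b", "c")).saveAsTextFile("out")
    val result = sc.run().waitUntilDone()

    // Once the pipeline completes, re-open the written data...
    val tap = result.tap(closedTap)
    val values: Iterator[String] = tap.value // ...directly in memory,

    val sc2 = ScioContext()
    val coll = tap.open(sc2) // ...or as an SCollection in a new pipeline.
    sc2.run()
  }
}
```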