# Scio, Spark and Scalding
Check out the Beam Programming Guide first for a detailed explanation of the Beam programming model and concepts.
Scio’s API is heavily influenced by Spark with a lot of ideas from Scalding.
## Scio and Spark
The Dataflow programming model is fundamentally different from that of Spark. Read this Google blog article for more details.
The Scio API is heavily influenced by Spark but there are some minor differences.
- `SCollection` is equivalent to Spark's `RDD`.
- `PairSCollectionFunctions` and `DoubleSCollectionFunctions` are specialized versions of `SCollection` and equivalent to Spark's `PairRDDFunctions` and `DoubleRDDFunctions`.
- Execution planning is static and happens before the job is submitted. There is no driver node in a Dataflow cluster, so one can only perform the equivalent of Spark transformations (`RDD` → `RDD`) but not actions (`RDD` → driver local memory).
- There is no broadcast either, but the pattern of `RDD` → driver via action and driver → `RDD` via broadcast can be replaced with `SCollection.asSingletonSideInput` and `SCollection.withSideInputs`.
- There is no `DStream` (continuous series of `RDD`s) as in Spark Streaming. Values in an `SCollection` are windowed based on timestamp and windowing operation. The same API works regardless of batch (single global window by default) or streaming mode. Aggregation-type transformations that produce `SCollection`s of a single value under the global window instead produce one value per window when a non-global window is defined.
- `SCollection` has extra methods for side input, side output, and windowing.
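The broadcast replacement described above can be sketched as follows (the function name `joinWithLookup` and the inputs `large` and `small` are hypothetical, for illustration only):

```scala
import com.spotify.scio.values.SCollection

// A sketch, assuming `small` holds a single lookup Map and `large` holds keys.
// The small SCollection is wrapped as a singleton side input and replicated
// to all workers, replacing Spark's driver-side collect + broadcast pattern.
def joinWithLookup(large: SCollection[String],
                   small: SCollection[Map[String, Int]]): SCollection[Int] = {
  val side = small.asSingletonSideInput
  large
    .withSideInputs(side)
    .map { (key, ctx) => ctx(side).getOrElse(key, 0) }
    .toSCollection
}
```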
## Scio and Scalding
Scio has much simpler abstract data types than Scalding.
- Scalding has many abstract data types like `TypedPipe`, `Grouped`, `CoGrouped`, and `SortedGrouped`. Many of them are intermediate types that enable optimizations or wrap around Cascading's data model.
- As a result, many Scalding operations are lazily evaluated. For example, in `pipe.groupBy(keyFn).reduce(mergeFn)`, `mergeFn` is lifted into `groupBy` to operate on the map side as well.
- Scio, on the other hand, has only one main data type, `SCollection[T]`; `SCollection[(K, V)]` is a specialized variation for when the elements are key-value pairs.
- All Scio operations are strictly evaluated. For example, `p.groupBy(keyFn)` returns `(K, Iterable[T])` pairs where the values are immediately grouped, whereas `p.reduceByKey(_ + _)` groups `(K, V)` pairs on `K` and reduces the values.
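The strict-evaluation contrast can be sketched like this (the `words` input and the function names are illustrative, not from the original):

```scala
import com.spotify.scio.values.SCollection

// Sketch, assuming `words` is an SCollection[String] from some source.

// reduceByKey combines values per key and can merge on the map side,
// so the full value list per key is never materialized.
def counts(words: SCollection[String]): SCollection[(String, Long)] =
  words
    .map(w => (w, 1L))
    .reduceByKey(_ + _)

// groupBy immediately materializes the whole Iterable per key.
def byFirstChar(words: SCollection[String]): SCollection[(Char, Iterable[String])] =
  words.groupBy(_.head)
```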
Some features may look familiar to Scalding users.

- `Args` is a simple command line argument parser similar to the one in Scalding.
- Powerful transforms are possible with `sum`, `sumByKey`, `aggregate`, and `aggregateByKey` using Algebird `Semigroup`s and `Aggregator`s.
- `MultiJoin` and `coGroup` of up to 22 sources.
- `JobTest` for end-to-end pipeline testing.
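As a sketch of `aggregateByKey` with an Algebird `Aggregator` (the per-key mean computation and the `scores` input are assumptions for illustration):

```scala
import com.spotify.scio.values.SCollection
import com.twitter.algebird.Aggregator

// Sketch, assuming `scores` is an SCollection[(String, Double)].
// The Aggregator accumulates (sum, count) per key via the tuple Monoid,
// then presents the mean once all values for a key are combined.
def meanByKey(scores: SCollection[(String, Double)]): SCollection[(String, Double)] = {
  val mean = Aggregator
    .prepareMonoid[Double, (Double, Long)](x => (x, 1L))
    .andThenPresent { case (sum, n) => sum / n }
  scores.aggregateByKey(mean)
}
```

Because `Aggregator`s compose out of `Semigroup`/`Monoid` instances, the reduction can run map-side before the shuffle, like `reduceByKey`.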
## SCollection

`SCollection` has a few variations.

- `SCollectionWithSideInput` for replicating small `SCollection`s to all left-hand side values in a large `SCollection`.
- `SCollectionWithSideOutput` for output to multiple `SCollection`s.
- `WindowedSCollection` for accessing window information.
- `SCollectionWithFanout` and `SCollectionWithHotKeyFanout` for fanout of skewed data.
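A sketch of `SCollectionWithSideOutput`, routing malformed records away from the main output (the `parse` function and `lines` input are hypothetical):

```scala
import com.spotify.scio.values.{SCollection, SideOutput}

// Sketch, assuming `lines` is an SCollection[String] of numeric strings.
// Valid lines go to the main output; unparseable ones to a side output.
def parse(lines: SCollection[String]): (SCollection[Int], SCollection[String]) = {
  val bad = SideOutput[String]()
  val (good, sideOutputs) = lines
    .withSideOutputs(bad)
    .flatMap { (line, ctx) =>
      try Some(line.trim.toInt)
      catch {
        case _: NumberFormatException =>
          ctx.output(bad, line) // emit to the side output
          None
      }
    }
  (good, sideOutputs(bad))
}
```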
## Additional features

Scio also offers some additional features.

- Each worker can pull files from Google Cloud Storage via `DistCache` to be used in transforms locally, similar to the Hadoop distributed cache. See DistCacheExample.scala.
- Type safe BigQuery IO via Scala macros. Case classes and converters are generated at compile time based on the BigQuery schema. This eliminates the error-prone process of handling generic JSON objects. See TypedBigQueryTornadoes.scala.
- Sinks (`saveAs*` methods) return a `ClosedTap[T]` that can be opened either in another pipeline as an `SCollection[T]` or directly as an `Iterator[T]` once the current pipeline completes. This enables complex pipeline orchestration. See WordCountOrchestration.scala.