Scio, Spark and Scalding

Check out the Beam Programming Guide first for a detailed explanation of the Beam programming model and concepts. Also read more about the relationship between Scio, Beam and Dataflow.

Scio’s API is heavily influenced by Spark with a lot of ideas from Scalding.

Scio and Spark

The Dataflow programming model is fundamentally different from that of Spark. Read this Google blog article for more details.

The Scio API is heavily influenced by Spark but there are some minor differences.

  • SCollection is equivalent to Spark’s RDD.
  • PairSCollectionFunctions and DoubleSCollectionFunctions are specialized versions of SCollection and equivalent to Spark’s PairRDDFunctions and DoubleRDDFunctions.
  • Execution planning is static and happens before the job is submitted. There is no driver node in a Dataflow cluster and one can only perform the equivalent of Spark transformations (RDDRDD) but not actions (RDD → driver local memory).
  • There is no broadcast either but the pattern of RDD → driver via action and driver → RDD via broadcast can be replaced with SCollection.asSingletonSideInput and SCollection.withSideInputs.
  • There is no DStream (continuous series of RDDs) like in Spark Streaming. Values in a SCollection are windowed based on timestamp and windowing operation. The same API works regardless of batch (single global window by default) or streaming mode. Aggregation type transformations that produce SCollections of a single value under global window will produce one value each window when a non-global window is defined.
  • SCollection has extra methods for side input, side output, and windowing.

Scio and Scalding

Scio has a much simpler abstract data types compared to Scalding.

  • Scalding has many abstract data types like TypedPipe, Grouped, CoGrouped, SortedGrouped.
  • Many of them are intermediate and enable some optimizations or wrap around Cascading’s data model.
  • As a result many Scalding operations are lazily evaluated, for example in pipe.groupBy(keyFn).reduce(mergeFn), mergeFn is lifted into groupBy to operate on the map side as well.
  • Scio on the other hand, has only one main data type SCollection[T] and SCollection[(K, V)] is a specialized variation when the elements are key-value pairs.
  • All Scio operations are strictly evaluated, for example p.groupBy(keyFn) returns (K, Iterable[T]) where the values are immediately grouped, whereas p.reduceByKey(_ + _) groups (K, V) pairs on K and reduces values.

Some features may look familiar to Scalding users.

  • Args is a simple command line argument parser similar to the one in Scalding.
  • Powerful transforms are possible with sum, sumByKey, aggregate, aggregrateByKey using Algebird Semigroups and Aggregators.
  • MultiJoin and coGroup of up to 22 sources.
  • JobTest for end to end pipeline testing.

SCollection

SCollection has a few variations.

Additional features

Scio also offers some additional features.

  • Each worker can pull files from Google Cloud Storage via DistCache to be used in transforms locally, similar to Hadoop distributed cache. See DistCacheExample.scala.
  • Type safe BigQuery IO via Scala macros. Case classes and converters are generated at compile time based on BQ schema. This eliminates the error prone process of handling generic JSON objects. See TypedBigQueryTornadoes.scala.
  • Sinks (saveAs* methods) return ClosedTap[T] that can be opened either in another pipeline as SCollection[T] or directly as Iterator[T] once the current pipeline completes. This enables complex pipeline orchestration. See WordCountOrchestration.scala.