We learned a lot building and improving Scio. The project was inspired by Spark and Scalding from the beginning, and we improved it over time working with customers of diverse background, including backend, data and ML. The design philosophy behind Scio can be summarized in a few points.
- Make it easy to do the right thing
- Scala made this possible for the most part. We have a fluent API and it’s easy to find the right transformation without going through lengthy documentation or source code. The most obvious thing is usually the best.
.countByValueis clearer and more efficient than
- Case classes and
Options are much safer and easier than JSON-based
Objects and nulls, despite the effort we went through to make it work.
- One can
.sumtypes with built-in
Semigroups easily and correctly.
- Conversely there is no
.groupAllsince it could incur huge performance penalty and is essentially
.groupBy(_ => ()). It’s easier to ask than making the wrong assumption and use it wrong (_“because it’s there”_).
Make common use cases simple
- We have syntactic sugar for most common IO modules e.g.
SCollection#saveAsBigQuerybut don’t cover all possible parameters. There’s a trade-off between covering more use cases and keeping the API simple.
- We opted for a more flexible boilerplate free
Argsinstead of the more type-safe
PipelineOptionsfor command line arguments parsing. Mistakes in these parts of the code are easier to catch and less damaging than those in the computation logic. Another trade-off we made.
- We have syntactic sugars for various types of joins (hash, inner, outer, sketch) and side input operations (cross, lookup) that can be easily swapped to fine tune a pipeline.
Make complex use cases possible
- We wrap complex internal APIs but don’t hide them away from users.
- Most low level Beam APIs (
PTransform) are still easily accessible.
- There are shorthands for integrating native Beam API, e.g.
- Pipelines can be submitted from
main, another process, a backend service, or chained with