Algebird is Twitter’s abstract algebra library. It has a lot of reusable modules for parallel aggregation and approximation. One can use any Algebird Aggregator or Semigroup with:
- aggregate and sum on SCollection[T]
- aggregateByKey and sumByKey on SCollection[(K, V)]

See AlgebirdSpec.scala and the Algebird wiki for more details. Also see these slides on semigroups.
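
To make the two abstractions concrete, here is a minimal local sketch that uses only Algebird, with no Scio pipeline involved. The object name and the sample values are illustrative assumptions. The Semigroup instances and the Aggregator built here are the kind of values one would hand to sum or aggregate on an SCollection.

```scala
import com.twitter.algebird._

object AlgebirdLocalSketch {
  // A Semigroup supplies an associative `plus`; Algebird ships instances
  // for numbers, tuples, maps and more.
  val total: Int = Semigroup.plus(1, 2)
  val merged: Map[String, Int] =
    Semigroup.plus(Map("a" -> 1, "b" -> 2), Map("b" -> 3))

  // sumOption combines a whole collection with the semigroup (None if empty).
  val summed: Option[(Int, Int)] =
    Semigroup.sumOption(Seq((1, 10), (2, 20), (3, 30)))

  // An Aggregator bundles three steps: prepare each input, reduce the
  // prepared values, and present the final result.
  val maxLength: Aggregator[String, Int, Int] =
    Aggregator.max[Int].composePrepare[String](_.length)
  val longest: Int = maxLength(Seq("scio", "algebird", "cms"))
}
```

Because `plus` is associative, partial results can be combined in any grouping, which is what makes these values suitable for parallel aggregation.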

Algebird in REPL

scio> import com.twitter.algebird._
scio> import com.twitter.algebird.CMSHasherImplicits._
scio> val words = sc.textFile("").
     | flatMap(_.split("[^a-zA-Z0-9]+")).
     | filter(_.nonEmpty).
     | aggregate(CMS.aggregator[String](0.001, 1E-10, 1)).
     | materialize
scio> sc.close()
scio> val cms = words.waitForResult().value.next
scio> cms.frequency("scio").estimate
res2: Long = 19

scio> // let's validate:
scio> import sys.process._
scio> "grep -o scio"  #| "wc -l"!
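
The same Count-Min Sketch aggregator can be tried without a pipeline, since an Algebird Aggregator can be applied directly to a local collection. The sketch below is a rough illustration under that assumption; the object name and the sample sentence are hypothetical stand-ins for the text file used in the REPL session above.

```scala
import com.twitter.algebird._
import com.twitter.algebird.CMSHasherImplicits._

object CmsLocalSketch {
  // Same parameters as the REPL session: eps, delta, seed.
  val cmsAggregator = CMS.aggregator[String](0.001, 1E-10, 1)

  // Hypothetical in-memory input standing in for the text file.
  val text = "scio wraps beam; scio uses algebird, and scio has a repl"
  val words: Seq[String] =
    text.split("[^a-zA-Z0-9]+").filter(_.nonEmpty).toSeq

  // Applying the aggregator locally builds one sketch over all words.
  val cms: CMS[String] = cmsAggregator(words)

  // Approximate count from the sketch, and the exact count for comparison.
  val estimate: Long = cms.frequency("scio").estimate
  val exact: Long = words.count(_ == "scio").toLong
}
```

A Count-Min Sketch can overestimate a frequency but does not underestimate it, so `estimate` should be at least `exact`; comparing the two plays the same role as the grep check in the session.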