Packages

  • package root
    Definition Classes
    root
  • package com
    Definition Classes
    root
  • package spotify
    Definition Classes
    com
  • package scio
    Definition Classes
    spotify
  • package annotations
    Definition Classes
    scio
  • package avro

    Main package for Avro APIs.

    Main package for Avro APIs. Import all.

    import com.spotify.scio.avro._
    Definition Classes
    scio
  • package bigquery

    Main package for BigQuery APIs.

    Main package for BigQuery APIs. Import all.

    import com.spotify.scio.bigquery._

    There are two BigQuery dialects, legacy and standard. APIs that take a BigQuery query string as argument, e.g. com.spotify.scio.bigquery.client.BigQuery.query.rows, com.spotify.scio.bigquery.client.BigQuery.query.schema, com.spotify.scio.bigquery.client.BigQuery.getTypedRows and BigQueryType.fromQuery, automatically detects the query's dialect. To override this, start the query with either #legacysql or #standardsql comment line.

    Definition Classes
    scio
  • package bigtable

    Main package for Bigtable APIs.

    Main package for Bigtable APIs. Import all.

    import com.spotify.scio.bigtable._
    Definition Classes
    scio
  • package cassandra

    Main package for Cassandra APIs.

    Main package for Cassandra APIs. Import all.

    import com.spotify.scio.cassandra._
    Definition Classes
    scio
  • package coders
    Definition Classes
    scio
  • package datastore
    Definition Classes
    scio
  • package elasticsearch

    Main package for Elasticsearch APIs.

    Main package for Elasticsearch APIs. Import all.

    import com.spotify.scio.elasticsearch._
    Definition Classes
    scio
  • package estimators
    Definition Classes
    scio
  • package extra
    Definition Classes
    scio
  • package annoy

    Main package for Annoy side input APIs.

    Main package for Annoy side input APIs. Import all.

    import com.spotify.scio.extra.annoy._

    Two metrics are available, Angular and Euclidean.

    To save an SCollection[(Int, Array[Float])] to an Annoy file:

    val s = sc.parallelize(Seq( 1-> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))

    Save to a temporary location:

    val s1 = s.asAnnoy(Angular, 40, 10)

    Save to a specific location:

    val s1 = s.asAnnoy(Angular, 40, 10, "gs://<bucket>/<path>")

    SCollection[AnnoyUri] can be converted into a side input:

    val s = sc.parallelize(Seq( 1-> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))
    val side = s.asAnnoySideInput(metric, dimension, numTrees)

    There's syntactic sugar for saving an SCollection and converting it to a side input:

    val s = sc
      .parallelize(Seq( 1-> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))
      .asAnnoySideInput(metric, dimension, numTrees)

    An existing Annoy file can be converted to a side input directly:

    sc.annoySideInput(metric, dimension, numTrees, "gs://<bucket>/<path>")

    AnnoyReader provides nearest neighbor lookups by vector as well as item lookups:

    val data = (0 until 1000).map(x => (x, Array.fill(40)(r.nextFloat())))
    val main = sc.parallelize(data)
    val side = main.asAnnoySideInput(metric, dimension, numTrees)
    
    main.keys.withSideInput(side)
      .map { (i, s) =>
        val annoyReader = s(side)
    
        // get vector by item id, allocating a new Array[Float] each time
        val v1 = annoyReader.getItemVector(i)
    
        // get vector by item id, copy vector into pre-allocated Array[Float]
        val v2 = Array.fill(dim)(-1.0f)
        annoyReader.getItemVector(i, v2)
    
        // get 10 nearest neighbors by vector
        val results = annoyReader.getNearest(v2, 10)
      }
  • package bigquery
  • package csv
  • package hll
  • package json

    Main package for JSON APIs.

    Main package for JSON APIs. Import all.

    This package uses Circe for JSON handling under the hood.

    import com.spotify.scio.extra.json._
    
    // define a type-safe JSON schema
    case class Record(i: Int, d: Double, s: String)
    
    // read JSON as case classes
    sc.jsonFile[Record]("input.json")
    
    // write case classes as JSON
    sc.parallelize((1 to 10).map(x => Record(x, x.toDouble, x.toString))
      .saveAsJsonFile("output")
  • package rollup
  • package sorter
  • package sparkey

    Main package for Sparkey side input APIs.

    Main package for Sparkey side input APIs. Import all.

    import com.spotify.scio.extra.sparkey._

    To save an SCollection[(String, String)] to a Sparkey fileset:

    val s = sc.parallelize(Seq("a" -> "one", "b" -> "two"))
    
    // temporary location
    val s1: SCollection[SparkeyUri] = s.asSparkey
    
    // specific location
    val s1: SCollection[SparkeyUri] = s.asSparkey("gs://<bucket>/<path>/<sparkey-prefix>")

    // with multiple shards, sharded by MurmurHash3 of the key val s1: SCollection[SparkeyUri] = s.asSparkey("gs://<bucket>/<path>/<sparkey-dir>", numShards=2) }}}

    The result SCollection[SparkeyUri] can be converted to a side input:

    val s: SCollection[SparkeyUri] = sc.parallelize(Seq("a" -> "one", "b" -> "two")).asSparkey
    val side: SideInput[SparkeyReader] = s.asSparkeySideInput

    These two steps can be done with a syntactic sugar:

    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput

    An existing Sparkey file can also be converted to a side input directly:

    sc.sparkeySideInput("gs://<bucket>/<path>/<sparkey-prefix>")

    A sharded collection of Sparkey files can also be used as a side input by specifying a glob path:

    sc.sparkeySideInput("gs://<bucket>/<path>/<sparkey-dir>/part-*")

    SparkeyReader can be used like a lookup table in a side input operation:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput
    
    main.withSideInputs(side)
      .map { (x, s) =>
        s(side).getOrElse(x, "unknown")
      }

    A SparkeyMap can store any types of keys and values, but can only be used as a SideInput:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val side: SideInput[SparkeyMap[String, Int]] = sc
      .parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
      .asLargeMapSideInput()
    
    val objects: SCollection[MyObject] = main
      .withSideInputs(side)
      .map { (x, s) => s(side).get(x) }
      .toSCollection

    To read a static Sparkey collection and use it as a typed SideInput, use TypedSparkeyReader. TypedSparkeyReader can also accept a Caffeine cache to reduce IO and deserialization load:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val cache: Cache[String, MyObject] = ...
    val side: SideInput[TypedSparkeyReader[MyObject]] = sc
      .typedSparkeySideInput("gs://<bucket>/<path>/<sparkey-prefix>", MyObject.decode, cache)
    
    val objects: SCollection[MyObject] = main
      .withSideInputs(side)
      .map { (x, s) => s(side).get(x) }
      .toSCollection
  • Breeze
  • Collections
  • Iterators
  • package hash

    Main package for hash APIs.

    Main package for hash APIs. Import all.

    import com.spotify.scio.hash._
    Definition Classes
    scio
  • package io
    Definition Classes
    scio
  • package jdbc

    Main package for JDBC APIs.

    Main package for JDBC APIs. Import all.

    import com.spotify.scio.jdbc._
    Definition Classes
    scio
  • package metrics

    This package contains the schema types for metrics collected during a pipeline run.

    This package contains the schema types for metrics collected during a pipeline run.

    Definition Classes
    scio
  • package options
    Definition Classes
    scio
  • package parquet
    Definition Classes
    scio
  • package pubsub
    Definition Classes
    scio
  • package runners
    Definition Classes
    scio
  • package schemas
    Definition Classes
    scio
  • package smb
    Definition Classes
    scio
  • package spanner
    Definition Classes
    scio
  • package tensorflow
    Definition Classes
    scio
  • package testing
    Definition Classes
    scio
  • package transforms

    Main package for transforms APIs.

    Main package for transforms APIs. Import all.

    Definition Classes
    scio
  • package util
    Definition Classes
    scio
  • package values
    Definition Classes
    scio

package extra

Ordering
  1. Alphabetic
Visibility
  1. Public
  2. Protected

Package Members

  1. package annoy

    Main package for Annoy side input APIs.

    Main package for Annoy side input APIs. Import all.

    import com.spotify.scio.extra.annoy._

    Two metrics are available, Angular and Euclidean.

    To save an SCollection[(Int, Array[Float])] to an Annoy file:

    val s = sc.parallelize(Seq( 1-> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))

    Save to a temporary location:

    val s1 = s.asAnnoy(Angular, 40, 10)

    Save to a specific location:

    val s1 = s.asAnnoy(Angular, 40, 10, "gs://<bucket>/<path>")

    SCollection[AnnoyUri] can be converted into a side input:

    val s = sc.parallelize(Seq( 1-> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))
    val side = s.asAnnoySideInput(metric, dimension, numTrees)

    There's syntactic sugar for saving an SCollection and converting it to a side input:

    val s = sc
      .parallelize(Seq( 1-> Array(1.2f, 3.4f), 2 -> Array(2.2f, 1.2f)))
      .asAnnoySideInput(metric, dimension, numTrees)

    An existing Annoy file can be converted to a side input directly:

    sc.annoySideInput(metric, dimension, numTrees, "gs://<bucket>/<path>")

    AnnoyReader provides nearest neighbor lookups by vector as well as item lookups:

    val data = (0 until 1000).map(x => (x, Array.fill(40)(r.nextFloat())))
    val main = sc.parallelize(data)
    val side = main.asAnnoySideInput(metric, dimension, numTrees)
    
    main.keys.withSideInput(side)
      .map { (i, s) =>
        val annoyReader = s(side)
    
        // get vector by item id, allocating a new Array[Float] each time
        val v1 = annoyReader.getItemVector(i)
    
        // get vector by item id, copy vector into pre-allocated Array[Float]
        val v2 = Array.fill(dim)(-1.0f)
        annoyReader.getItemVector(i, v2)
    
        // get 10 nearest neighbors by vector
        val results = annoyReader.getNearest(v2, 10)
      }
  2. package bigquery
  3. package csv
  4. package hll
  5. package json

    Main package for JSON APIs.

    Main package for JSON APIs. Import all.

    This package uses Circe for JSON handling under the hood.

    import com.spotify.scio.extra.json._
    
    // define a type-safe JSON schema
    case class Record(i: Int, d: Double, s: String)
    
    // read JSON as case classes
    sc.jsonFile[Record]("input.json")
    
    // write case classes as JSON
    sc.parallelize((1 to 10).map(x => Record(x, x.toDouble, x.toString))
      .saveAsJsonFile("output")
  6. package rollup
  7. package sorter
  8. package sparkey

    Main package for Sparkey side input APIs.

    Main package for Sparkey side input APIs. Import all.

    import com.spotify.scio.extra.sparkey._

    To save an SCollection[(String, String)] to a Sparkey fileset:

    val s = sc.parallelize(Seq("a" -> "one", "b" -> "two"))
    
    // temporary location
    val s1: SCollection[SparkeyUri] = s.asSparkey
    
    // specific location
    val s1: SCollection[SparkeyUri] = s.asSparkey("gs://<bucket>/<path>/<sparkey-prefix>")

    // with multiple shards, sharded by MurmurHash3 of the key val s1: SCollection[SparkeyUri] = s.asSparkey("gs://<bucket>/<path>/<sparkey-dir>", numShards=2) }}}

    The result SCollection[SparkeyUri] can be converted to a side input:

    val s: SCollection[SparkeyUri] = sc.parallelize(Seq("a" -> "one", "b" -> "two")).asSparkey
    val side: SideInput[SparkeyReader] = s.asSparkeySideInput

    These two steps can be done with a syntactic sugar:

    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput

    An existing Sparkey file can also be converted to a side input directly:

    sc.sparkeySideInput("gs://<bucket>/<path>/<sparkey-prefix>")

    A sharded collection of Sparkey files can also be used as a side input by specifying a glob path:

    sc.sparkeySideInput("gs://<bucket>/<path>/<sparkey-dir>/part-*")

    SparkeyReader can be used like a lookup table in a side input operation:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val side: SideInput[SparkeyReader] = sc
      .parallelize(Seq("a" -> "one", "b" -> "two"))
      .asSparkeySideInput
    
    main.withSideInputs(side)
      .map { (x, s) =>
        s(side).getOrElse(x, "unknown")
      }

    A SparkeyMap can store any types of keys and values, but can only be used as a SideInput:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val side: SideInput[SparkeyMap[String, Int]] = sc
      .parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
      .asLargeMapSideInput()
    
    val objects: SCollection[MyObject] = main
      .withSideInputs(side)
      .map { (x, s) => s(side).get(x) }
      .toSCollection

    To read a static Sparkey collection and use it as a typed SideInput, use TypedSparkeyReader. TypedSparkeyReader can also accept a Caffeine cache to reduce IO and deserialization load:

    val main: SCollection[String] = sc.parallelize(Seq("a", "b", "c"))
    val cache: Cache[String, MyObject] = ...
    val side: SideInput[TypedSparkeyReader[MyObject]] = sc
      .typedSparkeySideInput("gs://<bucket>/<path>/<sparkey-prefix>", MyObject.decode, cache)
    
    val objects: SCollection[MyObject] = main
      .withSideInputs(side)
      .map { (x, s) => s(side).get(x) }
      .toSCollection

Value Members

  1. object Breeze

    Utilities for Breeze.

    Utilities for Breeze.

    Includes Semigroups for Breeze data types like DenseVectors and DenseMatrixs.

    import com.spotify.scio.extra.Breeze._
    
    val vectors: SCollection[DenseVector[Double]] = // ...
    vectors.sum  // implicit Semigroup[T]
  2. object Collections

    Utilities for Scala collection library.

    Utilities for Scala collection library.

    Adds a top method to Array[T] and Iterable[T] and a topByKey method to Array[(K, V)] and Iterable[(K, V)].

    import com.spotify.scio.extra.Collections._
    
    val xs: Array[(String, Int)] = // ...
    xs.top(5)(Ordering.by(_._2))
    xs.topByKey(5)
  3. object Iterators

    Utilities for Scala iterators.

    Utilities for Scala iterators.

    Adds a timeSeries method to Iterator[T] so that it can be windowed with different logic.

    import com.spotify.scio.extra.Iterators._
    
    case class Event(user: String, action: String, timestamp: Long)
    val i: Iterator[Event] = // ...
    
    // 60 minutes fixed windows offset by 30 minutes
    // E.g. minute [30, 90), [90, 120), [120, 150), [150, 180) ...
    i.timeSeries(_.timestamp).fixed(3600000, 1800000)
    
    // session windows with 60 minute gaps between windows
    i.timeSeries(_.timestamp).session(3600000)
    
    // 60 minutes sliding windows, one every 10 minutes, offset by 5 minutes
    // E.g. minute [5, 65), [15, 75), [25, 85), [35, 95) ...
    i.timeSeries(_.timestamp).session(3600000, 600000, 300000)

Ungrouped