Protobuf

Read Protobuf files

Scio comes with custom and efficient support for reading Protobuf files via protobufFile method, for example:

import com.spotify.scio.avro._

// FooBarProto is a Protobuf generated class (must be a subclass of Protobuf's `Message`)
sc.protobufFile[FooBarProto]("gs://path-to-data/lake/part-*.protobuf.avro")
  .map( message => ??? )
// `message` is of type FooBarProto

Important: in most cases the input files should have been previously written by Scio. The reason is that Scio assumes that serialized Protobuf message is stored inside bytes Avro record.

If you want to read serialized protobuf messages directly from a file, one solution is to use textFile followed by a map to parse your messages.

Write Protobuf files

Scio comes with custom and efficient support for writing Protobuf files via saveAsProtobufFile method, for example:

// FooBarProto is a Protobuf generated class (must be a subclass of Protobuf's `Message`)
val data: SCollection[FooBarProto] = ???
data.saveAsProtobufFile("gs://path-to-data/lake/protos-out")

File format

Scio’s Protobuf file is backed by Avro with the following schema:

{
  "type" : "record",
  "name" : "AvroBytesRecord",
  "fields" : [ {
    "name" : "bytes",
    "type" : "bytes"
  } ]
}

Avro gives us a block based file format with compression, split and combine support. Protobuf binary is stored in the bytes field of AvroBytesRecord.

Starting with Scio 0.2.6, the Protobuf schema also is stored as a JSON string in the Avro file metadata under the key protobuf.generic.schema. You can get the schema or JSON records using the proto-tools command line tool from gcs-tools (available in our homebrew tap). Conversion between Protobuf schema, binary and JSON is done via the protobuf-generic library.

brew tap spotify/public
brew install gcs-proto-tools
proto-tools getschema data.protobuf.avro
proto-tools tojson data.protobuf.avro

Common issues

ScalaPB

If you end up using ScalaPB, make sure to use java based message class as input/output type, Scala based message class does not inherit from Protobuf’s Message class. To generate both Scala and Java classes add (to your build.sbt):

import com.trueaccord.scalapb.{ScalaPbPlugin => PB}
PB.javaConversions in PB.protobufConfig := true