# Parquet

`ParquetType[T]` provides read and write support between Scala type `T` and the Parquet columnar storage format. Custom support for type `T` can be added with an implicit instance of `ParquetField[T]`.
```scala
import java.net.URI

case class Inner(long: Long, str: String, uri: URI)
case class Outer(inner: Inner)
val record = Outer(Inner(1L, "hello", URI.create("https://www.spotify.com")))

import magnolify.parquet._

// Encode custom type URI as String
implicit val uriField: ParquetField[URI] = ParquetField.from[String](b => URI.create(b))(_.toString)

val parquetType = ParquetType[Outer]

// Parquet schema
val schema = parquetType.schema
```
Use `ParquetType#readBuilder` and `ParquetType#writeBuilder` to create new file reader and writer instances. See `HadoopSuite.scala` for examples with Hadoop IO.
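A full round trip through a file might look like the following. This is a sketch, assuming magnolify-parquet and the Hadoop client libraries are on the classpath; the record type, field names, and file path are illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileWriter
import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}

import magnolify.parquet._

case class Event(id: Long, name: String)

val conf = new Configuration()
// Illustrative path; any Hadoop-compatible file system works
val path = new Path("/tmp/events.parquet")

val pt = ParquetType[Event]

// Write a few records
val writer = pt
  .writeBuilder(HadoopOutputFile.fromPath(path, conf))
  .withWriteMode(ParquetFileWriter.Mode.OVERWRITE) // replace the file if it exists
  .build()
writer.write(Event(1L, "created"))
writer.write(Event(2L, "deleted"))
writer.close()

// Read them back; `read()` returns null once the file is exhausted
val reader = pt.readBuilder(HadoopInputFile.fromPath(path, conf)).build()
val first = reader.read()
reader.close()
```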
## Case Mapping

To use a different field case format in target records, add an optional `CaseMapper` argument to `ParquetType`. The following example maps `firstName` & `lastName` to `first_name` & `last_name`.
```scala
import magnolify.shared.CaseMapper
import com.google.common.base.CaseFormat

import magnolify.parquet._

case class LowerCamel(firstName: String, lastName: String)

val toSnakeCase = CaseFormat.LOWER_CAMEL.converterTo(CaseFormat.LOWER_UNDERSCORE).convert _
val parquetType = ParquetType[LowerCamel](CaseMapper(toSnakeCase))
```
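The mapper applies to the derived Parquet schema, so the mapped names can be checked directly on the schema's fields. A self-contained sketch of that check:

```scala
import com.google.common.base.CaseFormat
import magnolify.shared.CaseMapper
import magnolify.parquet._

import scala.jdk.CollectionConverters._

case class LowerCamel(firstName: String, lastName: String)

val toSnakeCase = CaseFormat.LOWER_CAMEL.converterTo(CaseFormat.LOWER_UNDERSCORE).convert _
val pt = ParquetType[LowerCamel](CaseMapper(toSnakeCase))

// Field names in the derived schema use the mapped snake_case form
val names = pt.schema.getFields.asScala.map(_.getName).toList
```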
## Enums

Enum-like types map to strings. See EnumType for more details. Additional `ParquetField[T]` instances for `Char` and `UnsafeEnum[T]` are available from `import magnolify.parquet.unsafe._`. These conversions are unsafe due to potential overflow.
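As a minimal sketch of using the unsafe instances, deriving a `ParquetType` for a record with a `Char` field only requires the extra import (the record type here is illustrative):

```scala
import magnolify.parquet._
import magnolify.parquet.unsafe._

// Char round-trips through a numeric Parquet type; integer values outside
// the Char range would overflow on read, hence "unsafe"
case class CharRecord(initial: Char)

val charType = ParquetType[CharRecord]
```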
## Logical Types

The Parquet `decimal` logical type maps to `BigDecimal` and supports the following encodings:
```scala
import magnolify.parquet._

// decimal32: INT32-backed, precision up to 9 digits
val pfDecimal32 = ParquetField.decimal32(9, 0)
// decimal64: INT64-backed, precision up to 18 digits
val pfDecimal64 = ParquetField.decimal64(18, 0)
// decimalFixed: FIXED_LEN_BYTE_ARRAY-backed, with an explicit byte length
val pfDecimalFixed = ParquetField.decimalFixed(8, 18, 0)
// decimalBinary: BINARY-backed
val pfDecimalBinary = ParquetField.decimalBinary(20, 0)
```
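To select an encoding, bring the chosen instance into implicit scope before deriving the `ParquetType`. A sketch, with the record type and the precision/scale values chosen for illustration:

```scala
import magnolify.parquet._

case class Price(amount: BigDecimal)

// INT64-backed decimal; precision 18, scale 2 (illustrative values)
implicit val bigDecimalField: ParquetField[BigDecimal] = ParquetField.decimal64(18, 2)

val priceType = ParquetType[Price]
```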
For a full specification of Date/Time mappings in Parquet, see Type Mappings.
## Avro Compatibility

The official Parquet format specification supports the `REPEATED` modifier to denote array types. By default, magnolify-parquet conforms to this specification:
```scala
import magnolify.parquet._

case class MyRecord(listField: List[Int])

ParquetType[MyRecord].schema
// res3: org.apache.parquet.schema.MessageType = message repl.MdocSession.MdocApp.MyRecord {
//   repeated int32 listField (INTEGER(32,true));
// }
//
```
However, the parquet-avro API encodes array types differently: as a nested array inside a required group.
```scala
import org.apache.avro.Schema

val avroSchema = new Schema.Parser().parse("{\"type\":\"record\",\"name\":\"MyRecord\",\"fields\":[{\"name\": \"listField\", \"type\": {\"type\": \"array\", \"items\": \"string\"}}]}")
// avroSchema: Schema = {"type":"record","name":"MyRecord","fields":[{"name":"listField","type":{"type":"array","items":"string"}}]}

import org.apache.parquet.avro.AvroSchemaConverter

new AvroSchemaConverter().convert(avroSchema)
// res4: org.apache.parquet.schema.MessageType = message MyRecord {
//   required group listField (LIST) {
//     repeated binary array (STRING);
//   }
// }
//
```
Due to this discrepancy, by default, a repeated type (i.e. a `List` or `Seq`) written by parquet-avro isn't readable by magnolify-parquet, and vice versa.

To address this, magnolify-parquet supports an "Avro compatibility mode" that, when enabled, will:

- Use the same repeated schema format as parquet-avro
- Write an additional metadata key, `parquet.avro.schema`, to the Parquet file footer, containing the equivalent Avro schema
### Enabling Avro Compatibility Mode

You can enable this mode by importing `magnolify.parquet.ParquetArray.AvroCompat._`:
```scala
import magnolify.parquet._
import magnolify.parquet.ParquetArray.AvroCompat._

case class MyRecord(listField: List[Int])

// List schema matches the parquet-avro spec
ParquetType[MyRecord].schema
// res6: org.apache.parquet.schema.MessageType = message repl.MdocSession.MdocApp5.MyRecord {
//   required group listField (LIST) {
//     repeated int32 array (INTEGER(32,true));
//   }
// }
//

// The String value of this schema will be written to the Parquet metadata key `parquet.avro.schema`
ParquetType[MyRecord].avroSchema
// res7: org.apache.avro.Schema = {"type":"record","name":"MyRecord","namespace":"repl.MdocSession.MdocApp5","fields":[{"name":"listField","type":{"type":"array","items":"int"}}]}
```
## Field Descriptions

The top-level class and all fields (including nested class fields) can be annotated with the `@doc` annotation. Note that `@doc` annotations on nested classes themselves are ignored.
```scala
import magnolify.shared._

@doc("This is ignored")
case class NestedClass(@doc("nested field annotation") i: Int)

@doc("Top level annotation")
case class TopLevelType(@doc("field annotation") pd: NestedClass)
```
Note that field descriptions are not natively supported by the Parquet format. Instead, the `@doc` annotation ensures that in Avro compat mode, the generated Avro schema written to the metadata key `parquet.avro.schema` will contain the specified field description:
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}

import magnolify.parquet._
// AvroCompat is required to write the `parquet.avro.schema` key to file metadata
import magnolify.parquet.ParquetArray.AvroCompat._
import magnolify.shared._

@doc("Top level annotation")
case class MyRecord(@doc("field annotation") listField: List[Int])

// `path` is assumed to be a Hadoop Path pointing at the output file
val writer = ParquetType[MyRecord]
  .writeBuilder(HadoopOutputFile.fromPath(path, new Configuration()))
  .build()
// writer: org.apache.parquet.hadoop.ParquetWriter[MyRecord] = org.apache.parquet.hadoop.ParquetWriter@3271c43a

writer.write(MyRecord(List(1, 2, 3)))
writer.close()

// Note that the Parquet MessageType schema doesn't contain the descriptions, but the serialized Avro schema does
ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration())).getFileMetaData
// res12: org.apache.parquet.hadoop.metadata.FileMetaData = FileMetaData{schema: message repl.MdocSession.MdocApp9.MyRecord {
//   required group listField (LIST) {
//     repeated int32 array (INTEGER(32,true));
//   }
// }
// , metadata: {writer.model.name=magnolify, parquet.avro.schema={"type":"record","name":"MyRecord","namespace":"repl.MdocSession.MdocApp9","doc":"Top level annotation","fields":[{"name":"listField","type":{"type":"array","items":"int"},"doc":"field annotation"}]}}}
```
Therefore, enabling Avro compatibility mode via the `AvroCompat` import is required to use the `@doc` annotation with `ParquetType`.