ReadFiles
Scio can read the files matched by the paths or glob patterns in an SCollection[String] and decode their contents into various formats.
Read as text lines
Reading to String text lines via readTextFiles:
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
val sc: ScioContext = ???
val paths: SCollection[String] = ???
val lines: SCollection[String] = paths.readTextFiles
Read entire file as String
Reading entire files to String via readFilesAsString:
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
val sc: ScioContext = ???
val paths: SCollection[String] = ???
val files: SCollection[String] = paths.readFilesAsString
Read entire file as binary
Reading entire files to binary Array[Byte] via readFilesAsBytes:
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
val sc: ScioContext = ???
val paths: SCollection[String] = ???
val files: SCollection[Array[Byte]] = paths.readFilesAsBytes
Read entire file as a custom type
Reading entire files into a custom type via readFiles, using a user-defined function from FileIO.ReadableFile to the output type:
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.{io => beam}
case class A(i: Int, s: String)
val sc: ScioContext = ???
val paths: SCollection[String] = ???
val userFn: beam.FileIO.ReadableFile => A = ???
val parsed: SCollection[A] = paths.readFiles(userFn)
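As an illustration of what such a user-defined function might look like, the sketch below reads each file fully as UTF-8 and records its character count and path. FileSummary, summarize, ReadFilesSketch and the input pattern are hypothetical names chosen for this example; readFullyAsUTF8String and getMetadata are methods of Beam's FileIO.ReadableFile. It assumes each file is small enough to fit in memory.

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.{io => beam}

// Hypothetical output type: character count plus the originating path.
case class FileSummary(chars: Int, path: String)

object ReadFilesSketch {
  // Pure helper, kept separate so it can be checked without running a pipeline.
  def summarize(path: String, content: String): FileSummary =
    FileSummary(content.length, path)

  // Invoked once per matched file; loads the whole file into memory.
  val userFn: beam.FileIO.ReadableFile => FileSummary = { file =>
    summarize(file.getMetadata.resourceId.toString, file.readFullyAsUTF8String())
  }

  def main(args: Array[String]): Unit = {
    val sc = ScioContext()
    // Placeholder pattern (an assumption); replace with real paths.
    val paths: SCollection[String] = sc.parallelize(Seq("/tmp/input/*.txt"))
    val summaries: SCollection[FileSummary] = paths.readFiles(userFn)
    summaries.debug() // print each element for inspection
    sc.run().waitUntilFinish()
  }
}
```

Keeping the parsing logic in a plain function such as summarize makes it straightforward to unit test independently of Beam.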
Read with a Beam transform
Files can also be read with a Beam PTransform from PCollection[FileIO.ReadableFile] to PCollection[T] (for example, Beam’s TextIO.readFiles()) via another variant of readFiles:
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.{io => beam}
import org.apache.beam.sdk.transforms.PTransform
import org.apache.beam.sdk.values.PCollection
case class Record(i: Int, s: String)
val sc: ScioContext = ???
val paths: SCollection[String] = ???
val userTransform: PTransform[PCollection[beam.FileIO.ReadableFile], PCollection[Record]] = ???
val records: SCollection[Record] = paths.readFiles(userTransform)
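For a concrete instance of this variant, the sketch below passes Beam's TextIO.readFiles(), the transform mentioned above, so that each matched file is emitted line by line. ReadWithTransformSketch, linesTransform and the input pattern are hypothetical names introduced for this example only.

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.{io => beam}
import org.apache.beam.sdk.transforms.PTransform
import org.apache.beam.sdk.values.PCollection

object ReadWithTransformSketch {
  // Beam's built-in transform from readable files to text lines.
  def linesTransform: PTransform[PCollection[beam.FileIO.ReadableFile], PCollection[String]] =
    beam.TextIO.readFiles()

  def main(args: Array[String]): Unit = {
    val sc = ScioContext()
    // Placeholder pattern (an assumption); replace with real paths.
    val paths: SCollection[String] = sc.parallelize(Seq("/tmp/input/*.txt"))
    val lines: SCollection[String] = paths.readFiles(linesTransform)
    lines.debug() // print each line for inspection
    sc.run().waitUntilFinish()
  }
}
```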
Read with a Beam source
Files can also be read with a Beam FileBasedSource[T] (for example, Beam’s TextSource) via another variant of readFiles.
A source is created for each given file path and then split into sub-ranges according to the desired bundle size.
When using readFilesWithPath, the origin file path is emitted alongside every element produced by the source.
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.{io => beam}
case class Record(i: Int, s: String)
val sc: ScioContext = ???
val paths: SCollection[String] = ???
val desiredBundleSizeBytes: Long = ???
val directoryTreatment: beam.FileIO.ReadMatches.DirectoryTreatment = ???
val compression: beam.Compression = ???
val createSource: String => beam.FileBasedSource[Record] = ???
val records: SCollection[Record] = paths.readFiles(
  desiredBundleSizeBytes,
  directoryTreatment,
  compression
) { file => createSource(file) }
val recordsWithPath: SCollection[(String, Record)] = paths.readFilesWithPath(
  desiredBundleSizeBytes,
  directoryTreatment,
  compression
) { file => createSource(file) }