Snakebite is a python package that provides:
- A pure python HDFS client library that uses protobuf messages over Hadoop RPC to communicate with HDFS.
- A command line interface (CLI) for HDFS that uses the pure python client library.
- A hadoop minicluster wrapper.
- Hadoop RPC specification.
Since the ‘normal’ Hadoop HDFS client (hadoop fs) is written in Java and has a lot of dependencies on Hadoop jars, startup times are quite high (> 3 secs). This isn’t ideal for integrating Hadoop commands in python projects.
At Spotify we use the luigi job scheduler that relies on doing a lot of existence checks and moving data around in HDFS. And since calling hadoop from python is expensive, we decided to write a pure python HDFS client that only relies on protobuf. The current snakebite.client library uses protobuf messages and implements the Hadoop RPC protocol for talking to the NameNode.
During development, we needed to verify snakebite.client behavior against the real client and for that we implemented a minicluster that wraps a Hadoop Java mini cluster. Obviously this minicluster can be used in different projects, so we made it a part of snakebite.
all methods that read data from a data node are able to check the CRC during transfer, but this is disabled by default because of performance reasons. This is the opposite behaviour from the stock Hadoop client.
Copyright (c) 2013 - 2014 Spotify AB
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Code in channel, logger and service was borrowed from https://code.google.com/p/protobuf-socket-rpc/ and carries it’s respective license.