Working with Snakebite in Python

Snakebite is a Python package, created by Spotify, that provides a Python client library, allowing HDFS to be accessed programmatically from Python applications. The client library uses protobuf messages to communicate directly with the NameNode. The Snakebite package also includes a command-line interface for HDFS that is based on the client library.

This section describes how to install and configure the Snakebite package. Snakebite’s client library is explained in detail with multiple examples, and Snakebite’s built-in CLI is introduced as a Python alternative to the hdfs dfs command.

Installation

Snakebite requires Python 2 and python-protobuf 2.4.1 or higher. Python 3 is currently not supported. Snakebite is distributed through PyPI and can be installed using pip:

$ pip install snakebite

Client Library

The client library is written in Python, uses protobuf messages, and implements the Hadoop RPC protocol for talking to the NameNode. This enables Python applications to communicate with HDFS directly, rather than having to make a system call to hdfs dfs.

List Directory Contents

Example 1-1 uses the Snakebite client library to list the contents of the root directory in HDFS.

from snakebite.client import Client

client = Client('localhost', 9000)
for x in client.ls(['/']):
    print x

The most important line of this program, and every program that uses the client library, is the line that creates a client connection to the HDFS NameNode:

client = Client('localhost', 9000)

The Client() constructor accepts the following parameters (a complete constructor call is sketched after the list):

host (string)
Hostname or IP address of the NameNode

port (int)
RPC port of the NameNode

hadoop_version (int)
The Hadoop protocol version to be used (default: 9)

use_trash (boolean)
Use trash when removing files

effective_user (string)
Effective user for the HDFS operations (default: None or current user)
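
Putting these parameters together, the following is a minimal sketch of a fully specified connection. The use_trash and effective_user values here are illustrative assumptions, not requirements of the examples in this section:

from snakebite.client import Client

# host and port are required; the keyword arguments show the optional
# parameters described above. effective_user='hduser' is an assumption
# that matches the file owner shown in the example output later on.
client = Client('localhost', 9000,
                hadoop_version=9,
                use_trash=False,
                effective_user='hduser')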

The host and port parameters are required, and their values depend on the HDFS configuration. The values for these parameters can be found in the hadoop/conf/core-site.xml configuration file under the fs.defaultFS property:

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
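
If the configuration file is available locally, the host and port can also be extracted programmatically rather than copied by hand. The sketch below is not part of Snakebite, and the file path is an assumption that should be adjusted to match the local Hadoop installation:

import xml.etree.ElementTree as ET
from urlparse import urlparse  # Python 2; use urllib.parse on Python 3

# Read fs.defaultFS from core-site.xml and split it into host and port.
# The path below is an assumption; adjust it for your Hadoop installation.
tree = ET.parse('hadoop/conf/core-site.xml')
for prop in tree.getroot().findall('property'):
    if prop.find('name').text == 'fs.defaultFS':
        uri = urlparse(prop.find('value').text)  # e.g. hdfs://localhost:9000
        print uri.hostname, uri.port             # -> localhost 9000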

For the examples in this section, the values used for host and port are localhost and 9000, respectively. After the client connection is created, the HDFS filesystem can be accessed. The remainder of the previous application uses the ls() method to list the contents of the root directory in HDFS:

for x in client.ls(['/']):
    print x

It is important to note that many of the methods in Snakebite return generators, which must be consumed for the requested operations to execute. The ls method takes a list of paths and returns a generator of maps (dictionaries) containing the file information.
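
Because of this lazy behavior, a method call whose generator is never iterated performs no work at all. As a hedged illustration (the path here is hypothetical), delete() behaves the same way:

# delete(), like ls(), returns a generator: no RPC traffic is sent and no
# files are removed until the generator is consumed.
gen = client.delete(['/tmp/stale'], recurse=True)  # nothing happens yet
for result in gen:
    print result  # iterating triggers the actual deletes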

Executing the list_directory.py application yields the following results:

$ python list_directory.py
{'group': u'supergroup', 'permission': 448, 'file_type': 'd',
'access_time': 0L, 'block_replication': 0,
'modification_time': 1442752574936L, 'length': 0L, 'blocksize': 0L,
'owner': u'hduser', 'path': '/tmp'}
{'group': u'supergroup', 'permission': 493, 'file_type': 'd',
'access_time': 0L, 'block_replication': 0,
'modification_time': 1442742056276L, 'length': 0L, 'blocksize': 0L,
'owner': u'hduser', 'path': '/user'}
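
Because each entry is a plain Python dictionary, individual fields can be picked out directly. A small follow-on sketch using the same client:

# Print just the path, owner, and file type of each entry under the root.
for entry in client.ls(['/']):
    print entry['path'], entry['owner'], entry['file_type']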

Reproduced from the free ebook Hadoop with Python.
