Interacting with HDFS

Interacting with Hadoop Distributed File System (HDFS) is primarily performed from the command line using the script named hdfs. The hdfs script has the following usage:

$ hdfs COMMAND [-option <arg>]

The COMMAND argument instructs which functionality of HDFS will be used. The -option argument is the name of a specific option for the specified command, and <arg> is one or more arguments that that are specified for this option.

Common File Operations

To perform basic file manipulation operations on HDFS, use the dfs command with the hdfs script. The dfs command supports many of the same file operations found in the Linux shell.

It is important to note that the hdfs command runs with the permissions of the system user running the command. The following examples are run from a user named “hduser.”

List Directory Contents

To list the contents of a directory in HDFS, use the -ls command:

$ hdfs dfs -ls

Running the -ls command on a new cluster will not return any results. This is because the -ls command, without any arguments, will attempt to display the contents of the user’s home directory on HDFS. This is not the same home directory on the host machine (e.g., /home/$USER), but is a directory within HDFS.

Providing -ls with the forward slash (/) as an argument displays the contents of the root of HDFS:

$ hdfs dfs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2015-09-20 14:36 /hadoop
drwx------ - hadoop supergroup 0 2015-09-20 14:36 /tmp

The output provided by the hdfs dfs command is similar to the output on a Unix filesystem. By default, -ls displays the file and folder permissions, owners, and groups. The two folders displayed in this example are automatically created when HDFS is formatted.

The hadoop user is the name of the user under which the Hadoop daemons were started (e.g., NameNode and DataNode), and the supergroup is the name of the group of superusers in HDFS (e.g.,hadoop).

Reproduced from Hadoop with Python

Leave a Comment