Pig provides several modes that determine how Pig scripts and statements are executed.
Pig has two execution modes: local and MapReduce. Running Pig in local mode only requires a single machine. Pig will run on the local host and access the local filesystem. To run Pig in local mode, use the -x local flag:
$ pig -x local ...
Running Pig in MapReduce mode requires access to a Hadoop cluster. MapReduce mode executes Pig statements and jobs on the cluster and accesses HDFS. To run Pig in MapReduce mode, simply call Pig from the command line or use the -x mapreduce flag:
$ pig ...
or
$ pig -x mapreduce ...
Pig can be run interactively in the Grunt shell. To invoke the Grunt shell, simply call Pig from the command line and specify the desired execution mode. The following example starts the Grunt shell in local mode:
$ pig -x local
...
grunt>
Once the Grunt shell is initialized, Pig Latin statements can be entered and executed in an interactive manner. Running Pig interactively is a great way to learn Pig.
The following example reads /etc/passwd and displays the usernames from within the Grunt shell:
grunt> A = LOAD '/etc/passwd' using PigStorage(':');
grunt> B = FOREACH A GENERATE $0 as username;
grunt> DUMP B;
Batch mode allows Pig to execute Pig scripts in local or MapReduce mode.
The Pig Latin statements in Example 3-3 read a file named passwd and use the STORE operator to store the results in a directory called user_id.out. Before executing this script, ensure that /etc/passwd is copied to the current working directory if Pig will be run in local mode, or to HDFS if Pig will be executed in MapReduce mode.
Example 3-3. user_id.pig
A = LOAD 'passwd' using PigStorage(':');
B = FOREACH A GENERATE $0 as username;
STORE B INTO 'user_id.out';
Use the following command to execute the user_id.pig script on the local machine:
$ pig -x local user_id.pig
This section describes the basic concepts of the Pig Latin language, allowing those new to the language to understand and write basic Pig scripts. For a more comprehensive overview of the language, visit the Pig online documentation.
All of the examples in this section load and process data from the tab-delimited file, resources/students (Example 3-4).
Example 3-4. resources/students
john	21	3.89
sally	19	2.56
alice	22	3.76
doug	19	1.98
susan	26	3.25
Statements are the basic constructs used to process data in Pig. Each statement is an operator that takes a relation as an input, performs a transformation on that relation, and produces a relation as an output. Statements can span multiple lines, but all statements must end with a semicolon (;).
The general form of each Pig script is as follows:
- A LOAD statement that reads the data from the filesystem
- One or more statements to transform the data
- A DUMP or STORE statement to view or store the results, respectively
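The three-part form above can be sketched against the resources/students file. This is a minimal illustration rather than a listing from the text: the relation names (students, older, result), the field names, the age threshold, and the output directory mentioned in the comment are all assumptions chosen for the example. The only fact taken from the source about the data is that the file is tab-delimited with a name, an age, and a GPA on each line.

```pig
-- Minimal sketch of the LOAD / transform / DUMP-or-STORE pattern.
-- Assumes the script is run from the directory containing resources/students.
-- Relation names, field names, and the age threshold are illustrative only.

-- 1. LOAD: read the tab-delimited file (tab is PigStorage's default delimiter).
--    A statement may span several lines; the semicolon ends it.
students = LOAD 'resources/students'
           AS (name:chararray, age:int, gpa:float);

-- 2. Transform: keep students older than 20, then project two fields.
older = FILTER students BY age > 20;
result = FOREACH older GENERATE name, gpa;

-- 3. DUMP: print the resulting relation to the screen.
--    To write to disk instead, one could use:
--    STORE result INTO 'older_students.out';
DUMP result;
```

Saved under a hypothetical name such as students.pig, the script could be run on a single machine with pig -x local students.pig. Each intermediate relation is the output of one statement and the input of the next, which is why the order of the three parts matters.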