Spark is a cluster computing framework that uses in-memory primitives to enable programs to run up to a hundred times faster than Hadoop MapReduce applications. Spark applications consist of a driver program that controls the execution of parallel operations across a cluster. The main programming abstraction provided by Spark is known as Resilient Distributed Datasets (RDDs). RDDs are collections of elements partitioned across the nodes of the cluster that can be operated on in parallel.
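Conceptually, an RDD's elements are divided into partitions, and each partition can be processed independently on a different node. The following pure-Python sketch (no Spark required; the function names `partition` and `map_partitions` are illustrative, not part of any Spark API) models that idea:

```python
# Illustrative sketch: model an RDD as a list of partitions and apply a
# function to every element of every partition, as Spark would do in
# parallel across the nodes of a cluster.

def partition(data, num_partitions):
    """Split a collection into roughly equal-sized partitions."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partitions(partitions, fn):
    """Apply fn to every element of every partition."""
    return [[fn(x) for x in part] for part in partitions]

rdd_like = partition([1, 2, 3, 4, 5, 6], num_partitions=3)
result = map_partitions(rdd_like, lambda x: x * 2)
print(rdd_like)  # [[1, 2], [3, 4], [5, 6]]
print(result)    # [[2, 4], [6, 8], [10, 12]]
```

In a real Spark application, the partitions live on different machines and the function is shipped to the data, but the dataflow is the same.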
Spark was designed to run on many platforms and to support applications written in many languages. Currently, Spark can run on Hadoop 1.0, Hadoop 2.0, Apache Mesos, or a standalone Spark cluster. Spark also natively supports Scala, Java, Python, and R. In addition to these features, Spark can be used interactively from a command-line shell.
This chapter begins with an example Spark script. PySpark is then introduced, and RDDs are described in detail with examples. The chapter concludes with example Spark programs written in Python.
WordCount in PySpark
The code in Example 4-1 implements the WordCount algorithm in PySpark. It assumes that a data file, input.txt, is loaded in HDFS under /user/hduser/input, and the output will be placed in HDFS under /user/hduser/output.
Example 4-1. python/Spark/word_count.py
from pyspark import SparkContext
def main():
    sc = SparkContext(appName='SparkWordCount')
    input_file = sc.textFile('/user/hduser/input/input.txt')
    counts = input_file.flatMap(lambda line: line.split()) \
                       .map(lambda word: (word, 1)) \
                       .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile('/user/hduser/output')
    sc.stop()

if __name__ == '__main__':
    main()
To execute the Spark application, pass the name of the file to the spark-submit script:
$ spark-submit --master local word_count.py
While the job is running, Spark prints verbose log output to the console. The results of the word_count.py script are displayed in Example 4-2 and can be found in HDFS under /user/hduser/output/part-00000.
Example 4-2. /user/hduser/output/part-00000
(u'be', 2)
(u'jumped', 1)
(u'over', 1)
(u'candlestick', 1)
(u'nimble', 1)
(u'jack', 3)
(u'quick', 1)
(u'the', 1)
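Each line of the part file is the printed form of a Python tuple. If the results need to be read back into a program, they can be parsed safely with ast.literal_eval; a minimal sketch, using sample lines that mirror Example 4-2:

```python
import ast

# Sample lines as they appear in part-00000 (from Example 4-2)
lines = ["(u'jack', 3)", "(u'be', 2)", "(u'quick', 1)"]

# literal_eval parses each line into a (word, count) tuple
counts = dict(ast.literal_eval(line) for line in lines)
print(counts)  # {'jack': 3, 'be': 2, 'quick': 1}
```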
This section describes the transformations being applied in the word_count.py Spark script.
The first statement creates a SparkContext object. This object tells Spark how and where to access a cluster:
sc = SparkContext(appName='SparkWordCount')
The second statement uses the SparkContext to load a file from HDFS and store it in the variable input_file:
input_file = sc.textFile('/user/hduser/input/input.txt')
The third statement performs multiple transformations on the input data. Spark automatically parallelizes these transformations to run across multiple machines:
counts = input_file.flatMap(lambda line: line.split()) \
                   .map(lambda word: (word, 1)) \
                   .reduceByKey(lambda a, b: a + b)
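To see what each transformation produces, the same dataflow can be traced in plain Python. This is a sketch only (Spark evaluates these transformations lazily and in parallel, and the sample input lines are invented for illustration):

```python
from collections import defaultdict

lines = ['jack be nimble', 'jack be quick']  # illustrative input

# flatMap: split each line into words, flattening the results
# into a single collection
words = [word for line in lines for word in line.split()]
# -> ['jack', 'be', 'nimble', 'jack', 'be', 'quick']

# map: pair each word with an initial count of 1
pairs = [(word, 1) for word in words]
# -> [('jack', 1), ('be', 1), ('nimble', 1), ...]

# reduceByKey: sum the counts for each distinct word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n
print(dict(counts))  # {'jack': 2, 'be': 2, 'nimble': 1, 'quick': 1}
```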
The fourth statement stores the results to HDFS:

counts.saveAsTextFile('/user/hduser/output')
The fifth statement shuts down the SparkContext:

sc.stop()