Spark is a cluster computing framework that uses in-memory primitives to enable programs to run up to a hundred times faster than Hadoop MapReduce applications. Spark applications consist of a driver program that controls the execution of parallel operations across a cluster. The main programming abstraction provided by Spark is known as Resilient Distributed Datasets (RDDs). RDDs are collections of elements partitioned across the nodes of the cluster that can be operated on in parallel.
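Conceptually, an RDD's elements are divided into partitions, and each partition can be processed independently on a different node. The following pure-Python sketch (no Spark required; the function names `partition` and `map_partitions` are illustrative, not part of any Spark API) models that idea:

```python
# Illustrative sketch: model an RDD as a list of partitions and apply a
# function to every element of every partition, as Spark would do in
# parallel across the nodes of a cluster.

def partition(data, num_partitions):
    """Split a collection into roughly equal-sized partitions."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partitions(partitions, fn):
    """Apply fn to every element of every partition."""
    return [[fn(x) for x in part] for part in partitions]

rdd_like = partition([1, 2, 3, 4, 5, 6], num_partitions=3)
result = map_partitions(rdd_like, lambda x: x * 2)
print(rdd_like)  # [[1, 2], [3, 4], [5, 6]]
print(result)    # [[2, 4], [6, 8], [10, 12]]
```

In a real Spark application, the partitions live on different machines and the function is shipped to the data, but the dataflow is the same.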
Spark was designed to run on many platforms and to support applications written in many languages. Currently, Spark can run on Hadoop 1.0, Hadoop 2.0, Apache Mesos, or a standalone Spark cluster. Spark also natively supports Scala, Java, Python, and R. In addition to these features, Spark can be used interactively from a command-line shell.
This chapter begins with an example Spark script. PySpark is then introduced, and RDDs are described in detail with examples. The chapter concludes with example Spark programs written in Python.
WordCount in PySpark
The code in Example 4-1 implements the WordCount algorithm in PySpark. It assumes that a data file, input.txt, is loaded in HDFS under /user/hduser/input, and the output will be placed in HDFS under /user/hduser/output.
Example 4-1. python/Spark/word_count.py
from pyspark import SparkContext
def main():
    sc = SparkContext(appName='SparkWordCount')
    input_file = sc.textFile('/user/hduser/input/input.txt')
    counts = input_file.flatMap(lambda line: line.split()) \
                       .map(lambda word: (word, 1)) \
                       .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile('/user/hduser/output')
    sc.stop()

if __name__ == '__main__':
    main()
To execute the Spark application, pass the name of the file to the spark-submit script:
$ spark-submit --master local word_count.py
While the job is running, Spark prints verbose log output to the console. The results of the word_count.py script are displayed in Example 4-2 and can be found in HDFS under /user/hduser/output/part-00000.
Example 4-2. /user/hduser/output/part-00000
(u'be', 2)
(u'jumped', 1)
(u'over', 1)
(u'candlestick', 1)
(u'nimble', 1)
(u'jack', 3)
(u'quick', 1)
(u'the', 1)
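Each line of the part file is the printed form of a Python tuple. If the results need to be read back into a program, they can be parsed safely with ast.literal_eval; a minimal sketch, using sample lines that mirror Example 4-2:

```python
import ast

# Sample lines as they appear in part-00000 (from Example 4-2)
lines = ["(u'jack', 3)", "(u'be', 2)", "(u'quick', 1)"]

# literal_eval parses each line into a (word, count) tuple
counts = dict(ast.literal_eval(line) for line in lines)
print(counts)  # {'jack': 3, 'be': 2, 'quick': 1}
```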
This section describes the transformations being applied in the word_count.py Spark script.
The first statement creates a SparkContext object. This object tells Spark how and where to access a cluster:
sc = SparkContext(appName='SparkWordCount')
The second statement uses the SparkContext to load a file from HDFS and store it in the variable input_file:
input_file = sc.textFile('/user/hduser/input/input.txt')
The third statement performs multiple transformations on the input data. Spark automatically parallelizes these transformations to run across multiple machines:
counts = input_file.flatMap(lambda line: line.split()) \
                   .map(lambda word: (word, 1)) \
                   .reduceByKey(lambda a, b: a + b)
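To see what each transformation produces, the same dataflow can be traced in plain Python. This is a sketch only (Spark evaluates these transformations lazily and in parallel, and the sample input lines are invented for illustration):

```python
from collections import defaultdict

lines = ['jack be nimble', 'jack be quick']  # illustrative input

# flatMap: split each line into words, flattening the results
# into a single collection
words = [word for line in lines for word in line.split()]
# -> ['jack', 'be', 'nimble', 'jack', 'be', 'quick']

# map: pair each word with an initial count of 1
pairs = [(word, 1) for word in words]
# -> [('jack', 1), ('be', 1), ('nimble', 1), ...]

# reduceByKey: sum the counts for each distinct word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n
print(dict(counts))  # {'jack': 2, 'be': 2, 'nimble': 1, 'quick': 1}
```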
The fourth statement stores the results to HDFS:

counts.saveAsTextFile('/user/hduser/output')
The fifth statement shuts down the SparkContext:

sc.stop()