Although beginners are usually advised to learn and use the High Level API (DataFrame, SQL, Dataset), the Low Level API, the resilient distributed dataset (RDD), is the foundation of Spark programming. In essence, an RDD is a collection of elements partitioned across the nodes (workers) of a cluster so that it can be operated on in parallel.
RDDs can be created by parallelizing an existing collection in your driver program, by referencing a dataset in external storage that offers a Hadoop InputFormat (HDFS, HBase, Cassandra, ...), or by transforming an already created RDD.
Creating RDD And SparkContext
With a SparkContext, there are two ways to create an RDD from scratch.
The first is SparkContext’s textFile() method, which takes the URI of a file and reads it as a collection of lines:
Dataset = sc.textFile("mydata.txt")
The second is SparkContext’s parallelize() method, which uses an existing iterable or collection in your program:
data = [1, 2, 3, 4, 5, 6]
myData = sc.parallelize(data)
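To make the idea of "partitioned" a little more concrete, here is a plain-Python sketch (not Spark code) of how a collection might be cut into chunks that different workers could process independently. The helper name split_into_partitions and the even, contiguous slicing strategy are assumptions made for illustration; Spark decides the real layout itself.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous, near-equal chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


data = [1, 2, 3, 4, 5, 6]
partitions = split_into_partitions(data, 3)
print(partitions)  # [[1, 2], [3, 4], [5, 6]]
```

Each inner list plays the role of one partition; a transformation would then run on every partition separately.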
Spark RDDs support only two types of operation: transformations, which create a new RDD from an existing one, and actions, which run a computation on the RDD and return a value to the driver program.
All Spark transformations are lazily evaluated, which is why they are often called “lazy transformations”. As the name implies, Spark does not execute a transformation as soon as it is declared. Instead, it waits until an action needs a result for the driver program and only then runs the recorded transformations, which lets Spark work more efficiently.
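The laziness described above can be imitated in a few lines of plain Python. The sketch below is only a toy model, not how Spark is implemented: the class name LazyList and its methods are hypothetical, and it simply records each transformation and runs the whole chain when an action (collect()) is called.

```python
class LazyList:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # pending transformations, not yet executed

    def map(self, fn):
        # Transformation: return a new object with one more recorded step.
        return LazyList(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyList(self._data, self._steps + (("filter", fn),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        result = iter(self._data)
        for kind, fn in self._steps:
            result = map(fn, result) if kind == "map" else filter(fn, result)
        return list(result)


calls = []


def double(x):
    calls.append(x)  # remember every time the function really runs
    return x * 2


lazy = LazyList([1, 2, 3]).map(double).filter(lambda x: x > 2)
print(len(calls))      # 0: nothing has run yet
print(lazy.collect())  # [4, 6]
print(len(calls))      # 3: the work happened only inside collect()
```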
Now it is time to get our hands dirty :) Let’s create our first RDD and try some transformations on it:
mydata=[1, 2, 2, 3, 4, 5, 5, 5, 6]
rdd = sc.parallelize(mydata)
Now let’s play with our first RDD using some basic and important transformations that are used heavily in Spark:
The map() transformation applies a function (here a lambda) to every element of the RDD and returns a new RDD:
Note: As mentioned above, the result of a transformation is not returned to the driver program, so we need an action to see it. One such action is take(n), which fetches the first n elements of an RDD.
new_RDD = rdd.map(lambda x: x*2)
[2, 4, 4, 6, 8, 10, 10, 10, 12]
The filter() transformation applies a function (here a lambda) to every element of the RDD and returns a new RDD containing only the elements for which the function returns true:
new_RDD = rdd.filter(lambda x: x >= 4)
[4, 5, 5, 5, 6]
The distinct() transformation returns a new RDD that contains only the unique elements of the original RDD:
new_RDD = rdd.distinct()
[1, 2, 3, 4, 5, 6]
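If it helps to see what these three transformations compute without a cluster, the following plain-Python sketch reproduces the same results on an ordinary list. It is only an analogy: the variable names are made up for illustration, and Spark itself does not promise that distinct() returns elements in sorted order.

```python
mydata = [1, 2, 2, 3, 4, 5, 5, 5, 6]

# map(): apply a function to every element
doubled = [x * 2 for x in mydata]

# filter(): keep the elements for which the predicate is true
at_least_four = [x for x in mydata if x >= 4]

# distinct(): drop duplicates (sorted here only to get a stable printout)
unique = sorted(set(mydata))

print(doubled)        # [2, 4, 4, 6, 8, 10, 10, 10, 12]
print(at_least_four)  # [4, 5, 5, 5, 6]
print(unique)         # [1, 2, 3, 4, 5, 6]
```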
The union() transformation returns a new RDD which contains all elements from two RDDs:
rdd2 = sc.parallelize([2, 2, 3, 5, 6, 6, 7, 8, 9])
new_RDD = rdd.union(rdd2)
[1, 2, 2, 3, 4, 5, 5, 5, 6, 2, 2, 3, 5, 6, 6, 7, 8, 9]
The intersection() transformation returns a new RDD which contains an intersection of the elements in both RDDs:
new_RDD = rdd.intersection(rdd2)
[2, 3, 5, 6]
The subtract() transformation returns a new RDD which has elements present in the first RDD but not in second RDD:
new_RDD = rdd.subtract(rdd2)
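The three set-like transformations treat duplicates differently, which is easy to miss. A plain-Python sketch (not Spark code) of the same logic on the two lists used above: union() keeps every element including duplicates, intersection() returns each common value once, and subtract() keeps the elements of the first collection whose value never appears in the second. The sorted order below is only for a stable printout; Spark makes no ordering promise for these results.

```python
first = [1, 2, 2, 3, 4, 5, 5, 5, 6]
second = [2, 2, 3, 5, 6, 6, 7, 8, 9]

# union(): simple concatenation, duplicates are kept
union = first + second

# intersection(): common values, each reported once
intersection = sorted(set(first) & set(second))

# subtract(): elements of `first` whose value does not occur in `second`
second_values = set(second)
subtract = [x for x in first if x not in second_values]

print(union)         # [1, 2, 2, 3, 4, 5, 5, 5, 6, 2, 2, 3, 5, 6, 6, 7, 8, 9]
print(intersection)  # [2, 3, 5, 6]
print(subtract)      # [1, 4]
```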
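The sample() transformation returns a new RDD containing a random sample of the existing RDD. The first argument says whether to sample with replacement, and the second is the expected fraction of elements to keep, so the output below will differ from run to run: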
new_RDD = rdd.sample(False,0.5)
[2, 3, 5, 5]
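Sampling without replacement at a fraction of 0.5 does not mean "exactly half of the elements". A rough plain-Python model is that every element is kept independently with probability 0.5, so the size of the result varies. The function name bernoulli_sample and the seeds below are assumptions for illustration, not part of the Spark API.

```python
import random


def bernoulli_sample(data, fraction, seed):
    """Keep each element independently with probability `fraction`."""
    rng = random.Random(seed)  # seeded so a given run is repeatable
    return [x for x in data if rng.random() < fraction]


data = [1, 2, 2, 3, 4, 5, 5, 5, 6]
for seed in (1, 2, 3):
    # Different seeds usually give samples of different lengths.
    print(seed, bernoulli_sample(data, 0.5, seed))
```

Passing the same seed twice gives the same sample, which is also why Spark’s sample() accepts an optional seed argument.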
PAIR RDD Transformations
A pair RDD is simply an RDD whose elements are key/value pairs, such as two-element tuples. Pair RDD transformations operate per key, in parallel, whereas ordinary RDD transformations (like map()) operate on every element of the collection. Because they behave much like dictionaries of key-value pairs, pair RDDs are widely used.
Now let’s look at some important and commonly used pair RDD transformations in Spark:
The groupByKey() transformation groups the values of each key, turning key-value pairs into key-ResultIterable pairs in PySpark:
Note: As mentioned before, the result of a transformation is not returned to the driver program, so we need an action to see it. One such action is collect(), which fetches all the elements of an RDD.
rdd = sc.parallelize([(1, 2), (1, 5), (3, 4), (3, 6)])
rdd.groupByKey().collect()
[(1, <pyspark.resultiterable.ResultIterable at 0x2218cd4b430>),
(3, <pyspark.resultiterable.ResultIterable at 0x2218ccb9c70>)]
As the example shows, we get key-ResultIterable pairs in PySpark. To see what the ResultIterable objects contain, we need to convert them into lists with the help of the map() transformation:
rdd.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[(1, [2, 5]), (3, [4, 6])]
The reduceByKey() transformation merges the values of each key using the given function and returns a new pair RDD. With an addition lambda, the result contains the sum of the values for each key:
rdd.reduceByKey(lambda x,y: x+y).collect()
[(1, 7), (3, 10)]
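The difference between groupByKey() and reduceByKey() is easier to see side by side. The plain-Python sketch below (hypothetical helper names, not Spark code) builds both results from the same list of pairs: grouping collects every value per key, while reducing folds the values of a key into one as it goes.

```python
def group_by_key(pairs):
    """Collect all values that share a key into a list."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return list(grouped.items())


def reduce_by_key(pairs, fn):
    """Fold the values of each key into a single value using fn."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return list(merged.items())


pairs = [(1, 2), (1, 5), (3, 4), (3, 6)]
print(group_by_key(pairs))                       # [(1, [2, 5]), (3, [4, 6])]
print(reduce_by_key(pairs, lambda x, y: x + y))  # [(1, 7), (3, 10)]
```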
The sortByKey() transformation sorts a pair RDD by key in ascending or descending order. With ascending=False we get descending order:
rdd.sortByKey(ascending=False).collect()
[(3, 4), (3, 6), (1, 2), (1, 5)]
The subtractByKey() transformation returns each key-value pair of the first RDD whose key has no match in the other RDD:
[(3, 4), (3, 6)]
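The countByKey() operation (strictly speaking an action, since it returns its result to the driver) counts the number of elements for each key of an RDD and returns the result as a dictionary, whose contents we can view with items():
rdd.countByKey().items()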
dict_items([(1, 2), (3, 2)])
The join() transformation returns an RDD containing all pairs of elements with matching keys in the first and the other RDD, like an inner join in SQL:
[(3, (4, 9)), (3, (6, 9))]
Note: There are also rightOuterJoin() and leftOuterJoin() transformations, which follow the same logic as right and left joins in SQL.
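For readers who want to trace the key-matching logic by hand, here is a plain-Python sketch of the three operations. The helper names and the contents of `other` and `blocked` are hypothetical, chosen only so that the printed results resemble the outputs shown above; they are not necessarily the second RDD used in the original examples.

```python
def index_by_key(pairs):
    """Map each key to the list of its values."""
    index = {}
    for key, value in pairs:
        index.setdefault(key, []).append(value)
    return index


def inner_join(left, right):
    """Keep only keys present on both sides; pair up every combination."""
    right_index = index_by_key(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_index.get(k, [])]


def left_outer_join(left, right):
    """Keep every left pair; use None when the key is missing on the right."""
    right_index = index_by_key(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_index.get(k, [None])]


def subtract_by_key(left, right):
    """Keep left pairs whose key does not occur on the right."""
    right_keys = {k for k, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]


pairs = [(1, 2), (1, 5), (3, 4), (3, 6)]
other = [(3, 9)]    # hypothetical second dataset for the joins
blocked = [(1, 0)]  # hypothetical second dataset for subtract_by_key

print(inner_join(pairs, other))         # [(3, (4, 9)), (3, (6, 9))]
print(left_outer_join(pairs, other))    # [(1, (2, None)), (1, (5, None)), (3, (4, 9)), (3, (6, 9))]
print(subtract_by_key(pairs, blocked))  # [(3, 4), (3, 6)]
```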
As mentioned earlier, because transformations are lazy, an RDD does not return a final result to the driver program until we call an action. In short, actions are RDD operations that return non-RDD values. Let’s look at some of the common actions used heavily in Spark core:
The collect() action returns all elements in an RDD as a list:
rdd = sc.parallelize([2, 2, 3, 5, 6, 6, 7, 8, 9, 0])
rdd.collect()
[2, 2, 3, 5, 6, 6, 7, 8, 9, 0]
The take() action returns the first n elements of an RDD:
rdd.take(4)
[2, 2, 3, 5]
The count() action returns the number of elements in an RDD:
rdd.count()
10
The top() action returns the n largest elements of an RDD in descending order:
rdd.top(3)
[9, 8, 7]
The reduce() action combines the elements of an RDD into a single value by repeatedly applying a function that takes two elements and returns one:
rdd.reduce(lambda x, y: x + y)
48
The first() action returns the first element of an RDD:
rdd.first()
2
The sum() action returns the sum of the elements of an RDD:
rdd.sum()
48
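The aggregate() action takes a zero value and two functions to produce the final result. In the example below the accumulator is a (sum, count) tuple:
new_rdd = sc.parallelize([1, 2, 3, 4])
new_rdd.aggregate((0, 0), (lambda x, y: (x[0] + y, x[1] + 1)), (lambda x, y: (x[0] + y[0], x[1] + y[1])))
(10, 4)
The general form of the call summarizes the operation step by step: zeroValue is the starting accumulator, seqOp folds each element into the accumulator within a partition, and combOp merges the accumulators of different partitions:
aggregate(zeroValue, seqOp, combOp)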
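The two-stage nature of aggregate() can be traced in plain Python. The sketch below is a model under stated assumptions, not Spark’s implementation: the data is assumed to sit in two partitions, [1, 2] and [3, 4], and the helper function is written purely for illustration. seq_op runs inside each partition starting from the zero value, and comb_op then merges the per-partition results.

```python
from functools import reduce


def aggregate(partitions, zero_value, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partial results with comb_op."""
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    return reduce(comb_op, per_partition, zero_value)


partitions = [[1, 2], [3, 4]]  # assumed partition layout


def seq_op(acc, x):
    # acc is a (sum, count) tuple, x is a single element
    return (acc[0] + x, acc[1] + 1)


def comb_op(a, b):
    # a and b are both (sum, count) tuples from different partitions
    return (a[0] + b[0], a[1] + b[1])


result = aggregate(partitions, (0, 0), seq_op, comb_op)
print(result)  # (10, 4)

total, count = result
print(total / count)  # 2.5, the mean of the elements
```

Carrying the sum and the count together is what makes it possible to compute a mean in a single pass, which a plain reduce() over the numbers cannot do directly.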
In this article I have tried to give a brief introduction to Apache Spark’s RDD, the unstructured part of Spark core, through examples of commonly used RDD transformations and actions in PySpark. You can always deepen your knowledge by consulting the Spark RDD Programming Guide and the Python API docs for PySpark in the Apache Spark documentation.
In the next article, I will talk about the High Level API (Spark SQL and DataFrames), which is the structured part of Spark core.
I will be happy to hear any comments or questions from you. May the data be with you!