What is the exact difference between transform and map on a Spark DStream?
A DStream contains several RDDs, since every batch interval produces a different RDD.
So by using transform(), you get the chance to apply an RDD operation across the
entire DStream.
Example from Spark Docs: http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
map is an element-wise transformation, while transform is an RDD-to-RDD transformation.
map
map(func) : Return a new DStream by passing each element of the source DStream through a function func.
Here is an example that demonstrates both the map and transform operations on a DStream:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingTransformExample")
val ssc = new StreamingContext(conf, Seconds(5))
val rdd1 = ssc.sparkContext.parallelize(Array(1,2,3))
val rdd2 = ssc.sparkContext.parallelize(Array(4,5,6))
val rddQueue = new Queue[RDD[Int]]
rddQueue.enqueue(rdd1)
rddQueue.enqueue(rdd2)
val numsDStream = ssc.queueStream(rddQueue, true)
val plusOneDStream = numsDStream.map(x => x+1)
plusOneDStream.print()
The map operation adds 1 to each element in every RDD within the DStream, producing output like the following:
-------------------------------------------
Time: 1501135220000 ms
-------------------------------------------
2
3
4
-------------------------------------------
Time: 1501135225000 ms
-------------------------------------------
5
6
7
-------------------------------------------
transform
transform(func) : Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
val commonRdd = ssc.sparkContext.parallelize(Array(0))
val combinedDStream = numsDStream.transform(rdd => rdd.union(commonRdd))
combinedDStream.print()
ssc.start()
ssc.awaitTermination()
transform allows you to perform RDD operations such as join, union, etc. on the RDDs within the DStream; the example code given here will produce output as below:
-------------------------------------------
Time: 1501135490000 ms
-------------------------------------------
1
2
3
0
-------------------------------------------
Time: 1501135495000 ms
-------------------------------------------
4
5
6
0
-------------------------------------------
Time: 1501135500000 ms
-------------------------------------------
0
-------------------------------------------
Time: 1501135505000 ms
-------------------------------------------
0
-------------------------------------------
Here commonRdd, which contains the single element 0, is unioned with every underlying RDD within the DStream.
The transform function in Spark Streaming allows you to perform any transformation on the underlying RDDs in the stream. For example, you can join two RDDs in streaming using transform, where one RDD is built from a text file or a parallelized collection and the other RDD comes from a stream of text files, a socket, etc.
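A sketch of that join pattern is below. This follows the example in the Spark docs; the blacklist contents, the socket host/port, and all variable names are illustrative assumptions, not part of the answer above.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical example: filter a stream of (word, count) pairs against a
// static blacklist RDD by joining inside transform().
val conf = new SparkConf().setMaster("local[*]").setAppName("TransformJoinExample")
val ssc = new StreamingContext(conf, Seconds(5))

// Static RDD built once from a parallelized collection (could also come from a text file).
val blacklist = ssc.sparkContext.parallelize(Seq(("spam", true)))

// Streaming RDDs: word counts from a socket source (host/port are placeholders).
val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// transform() exposes each batch's RDD, so ordinary RDD joins are available.
val cleaned = pairs.transform { rdd =>
  rdd.leftOuterJoin(blacklist)                        // (word, (count, Option[flag]))
    .filter { case (_, (_, flag)) => flag.isEmpty }   // drop blacklisted words
    .map { case (word, (count, _)) => (word, count) }
}
cleaned.print()
```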
map works on each element of the RDD in a particular batch and returns the RDD that results from applying the passed function.
The transform function in Spark Streaming allows one to use any of Apache Spark's transformations on the underlying RDDs for the stream. map is used for an element-to-element transform, and could be implemented using transform. Essentially, map works on the elements of the DStream and transform allows you to work with the RDDs of the DStream. You may find http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams to be useful.
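To make the "map could be implemented using transform" point concrete, here is a minimal sketch (reusing numsDStream from the example earlier in this thread); the two DStreams below compute the same result:

```scala
// map operates element-by-element across every RDD in the DStream:
val viaMap = numsDStream.map(x => x + 1)

// The same computation expressed with transform, which hands you each
// batch's RDD and lets you call the ordinary RDD map on it:
val viaTransform = numsDStream.transform(rdd => rdd.map(x => x + 1))
```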