Splitting strings in Apache Spark using Scala
So... In Spark you work with a distributed data structure called an RDD. RDDs provide functionality similar to Scala's collection types.
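For intuition, the same map/flatMap pipeline can be run on a plain Scala List first; here is a minimal local sketch with hypothetical sample lines (tab-separated title and text):

// Hypothetical sample data: each line is "title<TAB>text".
val lines = List("title1\tsome words here", "title2\tmore words")
val localPairs = lines
  .map(line => line.split("\t"))                 // List[Array[String]]
  .flatMap { arr =>
    val title = arr(0)
    arr(1).split(" ").map(word => (word, title)) // one (word, title) pair per word
  }                                              // List[(String, String)]

The exact same chain of operations on an RDD: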
val fileRdd = sc.textFile("s3n://file.txt")
// RDD[String]
val splitRdd = fileRdd.map(line => line.split("\t"))
// RDD[Array[String]]
val yourRdd = splitRdd.flatMap { arr =>
  val title = arr(0)
  val text = arr(1)
  val words = text.split(" ")
  words.map(word => (word, title))
}
// RDD[(String, String)]
// Now, if you want to print this...
yourRdd.foreach { case (word, title) => println(s"{ $word, $title }") }
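One caveat: on a cluster, foreach runs on the executors, so this println output lands in the executor logs, not on the driver console. A common pattern is to pull a small sample back to the driver first, for example:

// take(n) brings at most n elements back to the driver (safer than
// collect() for large RDDs); the println then runs locally.
yourRdd.take(20).foreach { case (word, title) => println(s"{ $word, $title }") }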
// If you want a count (note: this counts non-unique words):
val countRdd = yourRdd
  .groupBy { case (word, title) => title }          // group by title
  .map { case (title, iter) => (title, iter.size) } // word count per title
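As an aside, groupBy shuffles every (word, title) pair and materializes each group just to take its size; here is a sketch of the same per-title count using reduceByKey, which combines partial counts on each partition before the shuffle:

// Same result as countRdd, but partial sums are combined map-side,
// so far less data crosses the network.
val countRdd2 = yourRdd
  .map { case (word, title) => (title, 1) }
  .reduceByKey(_ + _) // RDD[(String, Int)]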
Here is how it can be solved using the newer DataFrame API. First, read the data using "\t" as the delimiter:
import org.apache.spark.sql.functions.{count, explode, split}
import spark.implicits._

val df = spark.read
  .option("delimiter", "\t")
  .option("header", false)
  .csv("s3n://file.txt")
  .toDF("title", "text")
Then, split the text column on spaces and explode to get one word per row:
val df2 = df.select($"title", explode(split($"text", " ")).as("words"))
Finally, group on the title column and count the number of words for each:
val countDf = df2.groupBy($"title").agg(count($"words"))
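To inspect the result, show() prints a few rows on the driver. And, mirroring the non-unique-words note from the RDD version, countDistinct (also in org.apache.spark.sql.functions) can be swapped in to count each word only once per title:

import org.apache.spark.sql.functions.countDistinct

countDf.show()

// Counts each distinct word once per title instead of every occurrence.
val distinctCountDf = df2.groupBy($"title").agg(countDistinct($"words"))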