Reading data from a URL using Spark on the Databricks platform

The answer above works but can be error prone: sometimes SparkFiles.get returns null.

Option 1 is the preferred way of getting a file from any URL or public S3 location.

Option 1:

IOUtils.toString will do the trick (see the Apache Commons IO docs). The commons-io jar is already present on any Spark cluster, whether it is Databricks or any other Spark installation.

Below is the Scala way of doing this. I have used a raw GitHub CSV file for this example; change the URL to suit your requirements.

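import org.apache.commons.io.IOUtils // commons-io ships with Spark, so no extra dependency is needed
import java.net.URL
import spark.implicits._ // needed for toDS(); typically already in scope in spark-shell and notebooks

val urlfile = new URL("https://raw.githubusercontent.com/lrjoshi/webpage/master/public/post/c159s.csv")
// Download the file on the driver and turn its lines into a Dataset[String].
// linesIterator avoids the clash with java.lang.String.lines on JDK 11+.
val testcsvgit = IOUtils.toString(urlfile, "UTF-8").linesIterator.toList.toDS()
// Parse the in-memory lines as CSV.
val testcsv = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv(testcsvgit)
testcsv.show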

Result:

+-----------+------+----+----+---+-----+
|Experiment |Virus |Cell| MOI|hpi|Titer|
+-----------+------+----+----+---+-----+
|      EXP I| C159S|OFTu| 0.1|  0| 4.75|
|      EXP I| C159S|OFTu| 0.1|  6| 2.75|
|      EXP I| C159S|OFTu| 0.1| 12| 2.75|
|      EXP I| C159S|OFTu| 0.1| 24|  5.0|
|      EXP I| C159S|OFTu| 0.1| 48|  5.5|
|      EXP I| C159S|OFTu| 0.1| 72|  7.0|
|      EXP I| C159S| STU| 0.1|  0| 4.75|
|      EXP I| C159S| STU| 0.1|  6| 3.75|
|      EXP I| C159S| STU| 0.1| 12|  4.0|
|      EXP I| C159S| STU| 0.1| 24| 3.75|
|      EXP I| C159S| STU| 0.1| 48| 3.25|
|      EXP I| C159S| STU| 0.1| 72| 3.25|
|      EXP I| C159S|OFTu|10.0|  0|  6.5|
|      EXP I| C159S|OFTu|10.0|  6| 4.75|
|      EXP I| C159S|OFTu|10.0| 12| 4.75|
|      EXP I| C159S|OFTu|10.0| 24| 6.25|
|      EXP I| C159S|OFTu|10.0| 48|  6.5|
|      EXP I| C159S|OFTu|10.0| 72|  7.0|
|      EXP I| C159S| STU|10.0|  0|  7.0|
|      EXP I| C159S| STU|10.0|  6| 4.75|
+-----------+------+----+----+---+-----+
only showing top 20 rows
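
If you work in PySpark rather than Scala, the same idea as Option 1 (download the text on the driver, then parse the in-memory lines as CSV) can be sketched roughly as follows. This is only a sketch and not part of the original answer: fetch_lines and csv_url_to_df are hypothetical helper names, a SparkSession is assumed to be passed in as spark, and the CSV is assumed to have no quoted fields that span several lines.

```python
import urllib.request
from typing import List


def fetch_lines(url: str, encoding: str = "utf-8") -> List[str]:
    """Download the URL on the driver and return its text split into lines."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode(encoding)
    return text.splitlines()


def csv_url_to_df(spark, url: str):
    """Build a DataFrame from CSV text fetched on the driver.

    `spark` is assumed to be an existing SparkSession.
    """
    lines = fetch_lines(url)
    # DataFrameReader.csv also accepts an RDD of strings holding CSV rows.
    rdd = spark.sparkContext.parallelize(lines)
    return spark.read.csv(rdd, header=True, inferSchema=True)
```

In a notebook you would call csv_url_to_df(spark, url) and then .show() on the result. Because the whole file is held in driver memory first, this suits small files only.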

Option 2: in Scala

import org.apache.spark.SparkFiles

val urlfile = "https://raw.githubusercontent.com/lrjoshi/webpage/master/public/post/c159s.csv"
// Download the file and register it with Spark under its file name (c159s.csv).
spark.sparkContext.addFile(urlfile)

val df = spark.read
  .option("inferSchema", true)
  .option("header", true)
  .csv("file://" + SparkFiles.get("c159s.csv")) // SparkFiles.get resolves the local path by file name
df.show

Result: the same as for Option 1, as shown below.

+-----------+------+----+----+---+-----+
|Experiment |Virus |Cell| MOI|hpi|Titer|
+-----------+------+----+----+---+-----+
|      EXP I| C159S|OFTu| 0.1|  0| 4.75|
|      EXP I| C159S|OFTu| 0.1|  6| 2.75|
|      EXP I| C159S|OFTu| 0.1| 12| 2.75|
|      EXP I| C159S|OFTu| 0.1| 24|  5.0|
|      EXP I| C159S|OFTu| 0.1| 48|  5.5|
|      EXP I| C159S|OFTu| 0.1| 72|  7.0|
|      EXP I| C159S| STU| 0.1|  0| 4.75|
|      EXP I| C159S| STU| 0.1|  6| 3.75|
|      EXP I| C159S| STU| 0.1| 12|  4.0|
|      EXP I| C159S| STU| 0.1| 24| 3.75|
|      EXP I| C159S| STU| 0.1| 48| 3.25|
|      EXP I| C159S| STU| 0.1| 72| 3.25|
|      EXP I| C159S|OFTu|10.0|  0|  6.5|
|      EXP I| C159S|OFTu|10.0|  6| 4.75|
|      EXP I| C159S|OFTu|10.0| 12| 4.75|
|      EXP I| C159S|OFTu|10.0| 24| 6.25|
|      EXP I| C159S|OFTu|10.0| 48|  6.5|
|      EXP I| C159S|OFTu|10.0| 72|  7.0|
|      EXP I| C159S| STU|10.0|  0|  7.0|
|      EXP I| C159S| STU|10.0|  6| 4.75|
+-----------+------+----+----+---+-----+
only showing top 20 rows

Try this.

from pyspark import SparkFiles

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
# Download the file and register it with Spark under its file name (adult.csv).
spark.sparkContext.addFile(url)

df = spark.read.csv("file://" + SparkFiles.get("adult.csv"), header=True, inferSchema=True)

Selecting just a few columns from the CSV at your URL:

>>> df.select("age", "workclass", "fnlwgt", "education").show(10)
+---+----------------+------+---------+
|age|       workclass|fnlwgt|education|
+---+----------------+------+---------+
| 39|       State-gov| 77516|Bachelors|
| 50|Self-emp-not-inc| 83311|Bachelors|
| 38|         Private|215646|  HS-grad|
| 53|         Private|234721|     11th|
| 28|         Private|338409|Bachelors|
| 37|         Private|284582|  Masters|
| 49|         Private|160187|      9th|
| 52|Self-emp-not-inc|209642|  HS-grad|
| 31|         Private| 45781|  Masters|
| 42|         Private|159449|Bachelors|
+---+----------------+------+---------+

SparkFiles.get returns the absolute path of the file, which is local to your driver or worker. That is why the file could not be found.
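
Note that in both answers SparkFiles.get is given the bare file name ("c159s.csv", "adult.csv") and not the full URL. A minimal sketch of deriving that name from the URL is shown here. file_name_from_url is a hypothetical helper, and it assumes addFile stores the download under the last path segment of the URL, which matches the examples above.

```python
import posixpath
from urllib.parse import urlparse


def file_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
name = file_name_from_url(url)
# `name` is what you would pass on: SparkFiles.get(name)
```

With this, the hard-coded "adult.csv" in the read call could be replaced by name, so the URL only has to be changed in one place.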

