Apache Spark-SQL vs Sqoop benchmarking while transferring data from RDBMS to HDFS

You are using the wrong tools for the job.

Sqoop will launch a slew of processes (on the datanodes) that will each make a connection to your database (see --num-mappers), and each will extract a part of the dataset. I don't think you can achieve that kind of read parallelism with Spark.

Get the dataset with Sqoop and then process it with Spark.
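
If it helps, here is a minimal sketch of such an import; the connection string, credentials, split column, and target directory are placeholders, not values from your setup:

    sqoop import \
        --connect jdbc:netezza://hostname:port/dbname \
        --username user \
        --password password \
        --table POC_TEST \
        --num-mappers 8 \
        --split-by id \
        --target-dir /user/hadoop/poc_test

Each mapper opens its own database connection and pulls its slice of the table, so --split-by should name a column whose values are spread reasonably evenly.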


You can try the following:

  1. Read the data from Netezza without any partitioning, increasing the fetch size to a million rows so the single JDBC connection streams more rows per round trip.

    val df2 = sqlContext.read.format("jdbc")
      .option("url", "jdbc:netezza://hostname:port/dbname")
      .option("dbtable", "POC_TEST")
      .option("user", "user")
      .option("password", "password")
      .option("driver", "org.netezza.Driver")
      .option("fetchSize", "1000000")
      .load()
    df2.registerTempTable("POC")
    
  2. Repartition the data before writing the final output, to control the number of output files and the write parallelism.

    val df3 = df2.repartition(10) // spread the single JDBC partition across 10 tasks for the write
    
  3. Columnar formats such as ORC and Parquet are considerably better optimized for analytical reads than plain text, so write the final output as ORC (or Parquet); a quick read-back check follows the list.

    df3.write.format("orc").save("hdfs://Hostname/test")
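
As a quick sanity check, you can read the files back and count rows. This is just a minimal sketch; it reuses the placeholder path from above and assumes the ORC data source is available in your Spark build:

    // re-read the ORC output and verify it is non-empty
    val check = sqlContext.read.format("orc").load("hdfs://Hostname/test")
    println(s"rows written: ${check.count()}")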