Apache Spark-SQL vs Sqoop benchmarking while transferring data from RDBMS to HDFS

You are using the wrong tools for the job.

Sqoop will launch a slew of processes (on the datanodes) that will each make a connection to your database (see --num-mappers), and each will extract a part of the dataset. I don't think you can achieve that kind of read parallelism with Spark.

Get the dataset with Sqoop and then process it with Spark.
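
If it helps, here is a minimal sketch of such an import; the connection string, credentials, split column, and target directory are placeholders, not values from your setup:

    sqoop import \
        --connect jdbc:netezza://hostname:port/dbname \
        --username user \
        --password password \
        --table POC_TEST \
        --num-mappers 8 \
        --split-by id \
        --target-dir /user/hadoop/poc_test

Each mapper opens its own database connection and pulls its slice of the table, so --split-by should name a column whose values are spread reasonably evenly.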


You can try the following:

  1. Read the data from Netezza without any partitioning, increasing the fetch size to a million rows so the single JDBC connection streams more rows per round trip.

    val df2 = sqlContext.read.format("jdbc")
      .option("url", "jdbc:netezza://hostname:port/dbname")
      .option("dbtable", "POC_TEST")
      .option("user", "user")
      .option("password", "password")
      .option("driver", "org.netezza.Driver")
      .option("fetchSize", "1000000")
      .load()
    df2.registerTempTable("POC")
    
  2. Repartition the data before writing the final output, to control the number of output files and the write parallelism.

    val df3 = df2.repartition(10) // spread the single JDBC partition across 10 tasks for the write
    
  3. Columnar formats such as ORC and Parquet are considerably better optimized for analytical reads than plain text, so write the final output as ORC (or Parquet); a quick read-back check follows the list.

    df3.write.format("orc").save("hdfs://Hostname/test")
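
As a quick sanity check, you can read the files back and count rows. This is just a minimal sketch; it reuses the placeholder path from above and assumes the ORC data source is available in your Spark build:

    // re-read the ORC output and verify it is non-empty
    val check = sqlContext.read.format("orc").load("hdfs://Hostname/test")
    println(s"rows written: ${check.count()}")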