Why does a Spark application fail with “ClassNotFoundException: Failed to find data source: kafka” when run as an uber-jar built with sbt assembly?
The issue is the following section in build.sbt:
// META-INF discarding
assemblyMergeStrategy in assembly := {
  {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
  }
}
It says that all META-INF entries should be discarded, including the "code" that makes data source aliases (e.g. kafka) work.
But the META-INF files are very important for kafka (and other aliases of streaming data sources) to work.
For the kafka alias to work, Spark SQL uses META-INF/services/org.apache.spark.sql.sources.DataSourceRegister with the following entry:
org.apache.spark.sql.kafka010.KafkaSourceProvider
KafkaSourceProvider is responsible for registering the kafka alias with the proper streaming data source, i.e. KafkaSource.
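For illustration, the alias lookup boils down to Java's ServiceLoader scanning those META-INF/services registrations and matching the requested name against each provider's shortName(). A minimal sketch of that mechanism (simplified, not Spark's exact code):
import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Roughly what Spark SQL does when resolving format("kafka"):
// load every DataSourceRegister listed under META-INF/services
// and match the requested alias against shortName().
val loader    = Thread.currentThread().getContextClassLoader
val providers = ServiceLoader.load(classOf[DataSourceRegister], loader).asScala

providers.find(_.shortName().equalsIgnoreCase("kafka")) match {
  case Some(p) => println(s"kafka alias resolves to ${p.getClass.getName}")
  case None    => println("no DataSourceRegister advertises the 'kafka' alias on this classpath")
}
If the service files were discarded during assembly, this lookup finds nothing, which is exactly the "Failed to find data source: kafka" situation.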
Just to check that the real code is indeed available, while the "code" that registers the alias is not, you could use the kafka data source by its fully-qualified name (not the alias) as follows:
spark.readStream.
  format("org.apache.spark.sql.kafka010.KafkaSourceProvider").
  load
You will see other problems due to missing options like kafka.bootstrap.servers, but... we're digressing.
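For completeness, here is a sketch of the same check with the options the source would actually need (the broker address and topic name are placeholders):
// Hypothetical broker and topic values; replace with your own.
val df = spark.readStream.
  format("org.apache.spark.sql.kafka010.KafkaSourceProvider").
  option("kafka.bootstrap.servers", "localhost:9092").
  option("subscribe", "my-topic").
  load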
A solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files (that would create an uber-jar with all data sources, incl. the kafka data source).
case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
I tried it like this and it's working for me. Submit like this and let me know if you have any issues:
./spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 --class com.inndata.StructuredStreaming.Kafka --master local[*] /Users/apple/.m2/repository/com/inndata/StructuredStreaming/0.0.1SNAPSHOT/StructuredStreaming-0.0.1-SNAPSHOT.jar
In my case I also got this error while compiling with sbt, and the cause was that sbt assembly was not including the spark-sql-kafka-0-10_2.11 artifact as part of the fat jar.
(Comments are very welcome here. The dependency was not given a scope, so it should not be assumed to be "provided".)
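For reference, the dependency I mean would be declared in build.sbt roughly like this (the version is only an example; match it to your Spark version):
// build.sbt -- example coordinates; %% picks the _2.11 suffix for Scala 2.11
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.0"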
So I changed to deploying a normal (slim) jar and including the dependencies with the --jars parameter to spark-submit.
In order to gather all dependencies in one place, you can add retrieveManaged := true to your sbt project settings, or you can issue the following in the sbt console:
> set retrieveManaged := true
> package
That should bring all dependencies into the lib_managed folder.
Then you can pass all those jars to spark-submit; in bash you can, for example, use something like this (note that --jars has to come before the application jar, otherwise spark-submit treats it as an application argument):
cd /path/to/your/project
JARLIST=$(find lib_managed -name '*.jar' | paste -sd , -)
spark-submit [other-args] --jars "$JARLIST" target/your-app-1.0-SNAPSHOT.jar