How to get rid of derby.log, metastore_db from Spark Shell
The use of the hive.metastore.warehouse.dir property is deprecated since Spark 2.0.0 in favor of spark.sql.warehouse.dir; see the docs.
As hinted by this answer, the real culprit for both the metastore_db directory and the derby.log file being created in every working subdirectory is the derby.system.home property defaulting to . (the current working directory).
Thus, a default location for both can be specified by adding the following line to spark-defaults.conf:
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby
where /tmp/derby can be replaced by the directory of your choice.
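If you would rather not edit spark-defaults.conf, the same JVM option can be passed directly on the command line; a minimal sketch, with /tmp/derby again standing in for your directory of choice:

$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.system.home=/tmp/derby"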
For spark-shell, you cannot set this in code, since the context/session is already created by the time your code runs and you won't stop it and recreate it with the new configuration each time. To avoid the metastore_db directory there, you have to set its location in a hive-site.xml file and copy this file into Spark's conf directory.
A sample hive-site.xml file that puts metastore_db in /tmp (refer to my answer here):
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/tmp/</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>
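The file then needs to end up in Spark's conf directory; for example, assuming SPARK_HOME points at your Spark installation:

$ cp hive-site.xml "$SPARK_HOME"/conf/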
After that you can start spark-shell as follows to get rid of derby.log as well:
$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp/derby.log"

(Note that derby.stream.error.file expects a file path, not a directory.)
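If nothing is set in spark-defaults.conf, the two JVM options can also be combined in a single invocation; a sketch using the same example paths:

$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.system.home=/tmp/derby -Dderby.stream.error.file=/tmp/derby.log"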
Try setting derby.system.home to some other directory as a system property before firing up the spark shell; Derby will create new databases there. The default value for this property is . (the current working directory).
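One way to pass that system property is the SPARK_SUBMIT_OPTS environment variable, which spark-shell (via spark-submit) appends to the driver JVM options in client mode; a sketch, again with /tmp/derby as the example directory:

$ export SPARK_SUBMIT_OPTS="-Dderby.system.home=/tmp/derby"
$ spark-shell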
Reference: https://db.apache.org/derby/integrate/plugin_help/properties.html