Apache Drill vs Spark

Here's an article I came across that discusses some of these SQL-on-Hadoop technologies: http://www.zdnet.com/article/sql-and-hadoop-its-complicated/

Drill is fundamentally different from Spark in both user experience and architecture. For example:

  • Drill is a schema-free query engine. For instance, you can point it at a directory of JSON or Parquet log files (on your local box, an NFS share, S3, HDFS, MapR-FS, etc.) and run a query. You don't have to load data, create or manage schemas, or pre-process the data (a sketch follows this list).
  • Drill uses a JSON document model internally, which allows it to query data of any structure. A lot of modern data is complex, meaning a record can contain nested structures and arrays, and field names may actually encode values such as timestamps or web page URLs. Drill allows normal BI tools to operate seamlessly on such data without requiring it to be flattened in advance.
  • Drill works with a variety of non-relational datastores, including Hadoop, NoSQL databases (MongoDB, HBase) and cloud storage. Additional datastores will be added.
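
To make the schema-free point concrete, here's a minimal sketch that sends a query to a local Drill instance over its REST API (Drill listens on port 8047 by default, and its /query.json endpoint accepts plain SQL). The log directory, field names, and nested structure are made up for illustration, and I'm assuming the response carries a "rows" list as in Drill's documented JSON results:

    import requests

    # Assumes Drill is running locally, e.g. in embedded mode.
    DRILL_URL = "http://localhost:8047/query.json"

    # `dfs` is Drill's default filesystem storage plugin. No table was
    # created and no schema was loaded; Drill infers the structure at
    # query time, including nested fields like `request.url` below.
    sql = """
        SELECT t.`timestamp`, t.request.url AS url
        FROM dfs.`/var/logs/web` t
        LIMIT 10
    """

    resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    for row in resp.json()["rows"]:
        print(row)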

Drill 1.0 was just released (May 19, 2015). You can easily download it onto your laptop and play with it without any infrastructure (Hadoop, NoSQL, etc.).


Drill lets you query different kinds of datasets with ANSI SQL, which makes it great for ad-hoc data exploration and for connecting BI tools to datasets via ODBC. You can even use Drill to SQL JOIN different kinds of datasets: for example, you could join records in a MySQL table with rows in a JSON file, or a CSV file, or OpenTSDB, or MapR-DB... the list goes on. Drill can connect to lots of different types of data.
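
Here's a hedged sketch of such a cross-source join, again over the REST API; the `mysql` storage plugin name, the schema and table, and the file path are all assumptions for illustration:

    import requests

    # One SQL statement joining a MySQL table against a JSON file on
    # disk. Each prefix (`mysql`, `dfs`) is a Drill storage plugin.
    join_sql = """
        SELECT u.name, e.event_type
        FROM mysql.shop.users u
        JOIN dfs.`/data/events.json` e
          ON u.id = e.user_id
    """

    resp = requests.post("http://localhost:8047/query.json",
                         json={"queryType": "SQL", "query": join_sql})
    print(resp.json()["rows"][:5])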

When I reach for Spark, it's typically for RDDs (resilient distributed datasets). RDDs make it easy to process a lot of data quickly, and Spark also has a bunch of libraries for ML and streaming. Drill doesn't do that kind of processing at all; it just gets you access to the data. You could use Drill to pull data into Spark, or TensorFlow, or PySpark, or Tableau, etc.
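
For contrast, a minimal PySpark sketch of the kind of processing I mean, assuming a local Spark installation (the log path and the filter are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # An RDD partitions the data and runs the filter in parallel
    # across the cluster; this is processing, not just access.
    lines = sc.textFile("/var/logs/web/*.log")  # hypothetical path
    errors = lines.filter(lambda line: "ERROR" in line)
    print(errors.count())

    spark.stop()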


Apache Spark SQL:

  • You need to write code (Scala, Java, or Python) to access the data and process it.
  • SQL queries can be executed against DataFrames (see the sketch after this list).
  • Execution can be done in a distributed fashion (cluster).
  • Almost every data store has a Spark driver or connector.
  • Used for massively parallel computing and data analytics.
  • Supports stream processing.
  • Has a larger support community.
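
A minimal sketch of the first two points, assuming PySpark and a made-up JSON file: you have to write code to load a DataFrame and register a view before any SQL can run.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    # Code comes first: read the data into a DataFrame and register
    # it as a temporary view. Only then can SQL run against it.
    df = spark.read.json("/data/events.json")  # hypothetical path
    df.createOrReplaceTempView("events")

    spark.sql("""
        SELECT event_type, COUNT(*) AS n
        FROM events
        GROUP BY event_type
        ORDER BY n DESC
    """).show()

    spark.stop()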

Apache Drill:

  • No need to write code: Drill explores the data source and discovers its structure on the fly (a contrast sketch follows this list).
  • Easier to use: just SQL.
  • Execution can be done in a distributed fashion (cluster).
  • It can read data from many data sources, such as MongoDB, Parquet files, MySQL, and any JDBC database.
  • Used for ad-hoc data exploration.
  • It does not support stream processing.
  • It has a smaller support community.
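
For contrast with the Spark SQL sketch above, the same aggregation in Drill needs no loading step and no view registration; it's one SQL statement aimed straight at the file (same hypothetical names and REST endpoint as before):

    import requests

    # No DataFrame, no temp view: the file is queried in place.
    sql = """
        SELECT e.event_type, COUNT(*) AS n
        FROM dfs.`/data/events.json` e
        GROUP BY e.event_type
        ORDER BY n DESC
    """

    resp = requests.post("http://localhost:8047/query.json",
                         json={"queryType": "SQL", "query": sql})
    print(resp.json()["rows"])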