SQL4ES – a JDBC driver for Elasticsearch
For one of our projects we explored the possibility to replace a datawarehouse with an Elasticsearch cluster. It seemed a perfect fit but there was one catch… the front-end it had to be integrated with only supported JDBC drivers. Thus began the search for a JDBC driver for ES. We found a number of (partial) solutions which will be discussed briefly in this blog post. Since none met all of the requirements we started developing our own JDBC driver called SQL4ES. The first release was published on Github last week!
A couple of months ago Annhormen got the question if it was possible to replace an existing datawarehouse (DWH) with a NoSQL solution. Primary goal was to improve scalability in order to handle the increasing amount of data this customer had to deal with. The DWH provides data for a large number of dashboards used by thousands of analysts. Smooth loading and performance of those dashboards was the primary requirement. This meant the NoSQL alternative we had to find should be able to return results, even when aggregating millions of records, within a single second.
One promising alternative was Elasticsearch; good (aggregation) performance for lots of users and scales horizontally. Unfortunately, ES does not have a JDBC driver which was needed to connect it to the BI front-end. This challenge is not rare, several NoSQL databases have JDBC / ODBC drivers. Examples are Phoenix (JDBC driver for HBase), Apache Drill (SQL engine for Hadoop) and Presto (distributed SQL query engine for various databases). Hence it is no surprise there are several SQL solutions for Elasticsearch as well.
Whenever you google for Elasticsearch and SQL the Elasticsearch-sql plugin is bound to be one of the top hits. It provides a webservice and GUI used to execute SQL SELECT statements and supports a wide range of SQL and Elasticsearch specific functions. It offers an easy way to access data through familiar SQL statements. Unfortunately, it is a webservice and not a JDBC driver. Although the webservice could of course be used by a driver…
The open source project Crabs actually is a JDBC driver but has not been worked on in over a year. We managed to upgrade it to Elasticsearch 2.x but encountered numerous challenges when loading the driver in SQLWorkbenchJ or other clients. Given its ‘age’ and the single developer that worked on it we decided to abandon this project.
When we started the search for an Elasticsearch JDBC driver we came across this pull request for a Presto connector for Elasticsearch. At the time we were looking into it the connector did not look ‘very alive’ and not worthwhile looking into. But at the time of writing this blog we found that it is alive again and people continued the work to create a Presto connector for Elasticsearch. The status of the connector is a bit unclear but it is definitely something to keep an eye on!
Most of our time was spend reviewing Crate, a SQL engine (with JDBC driver) build on top of Elasticsearch. It is very easy to setup locally as well as on AWS and scales automatically when adding or removing nodes the same way as Elasticsearch. The current release even contains the ability to perform inner- and cross joins (which we did not test). Crate really feels like a distributed relational database leveraging Elasticsearch to scale out and Lucene as storage engine. It does however have its own query execution engine which we found to be slower than Elasticsearch when aggregating data (the ES API can still be used on Crate). We performed a number of evaluations, using different cluster configurations, on a single table (type) containing 250M records. Performance was measured using Apache JMeter with 24 queries, each aggregating 2M-5M rows (counts, sums, averages). Exactly the same queries were executed against the ES API of the same Crate cluster. This benchmark showed that queries on Crate take around 7 times longer than the Elasticsearch counterparts. Crate’s query engine, designed to support relational features like joins, added a significant amount of time on aggregations. Unfortunately, Crate’s aggregation performance was not acceptable and we had to abandon it. Crate is however a great solution and now that it starts to support joins it would be interesting to see how it compares with other distributed relational databases such as MemSQL or Redshift.
Because we did not find a JDBC driver suiting our needs we decided to build one our selves. It proved to be an educating experience on the JDBC interface, Elasticsearch API as well as on the Presto SQL parser we use. The result is a JDBC driver implementing most of the common SQL statements like Select (distinct, where, group by, having, sort, limit), Insert, Delete and Create. We tried to keep the driver simple in the sense that it should translate SQL into ES and vice-versa and not support SQL features that are not natural for Elasticseach. Hence the driver does not support joins!
The first beta release of the driver for Elasticsearch 2.X can be found on Github. It includes a readme describing the SQL features supported, limitations and a number of examples. The driver can be loaded in SQLWorkbenchJ (see screenshot below) and SquirrelSQL after which these clients can be used to query or update Elasticsearch indexes. Feel free to try it out and let us know what you think!