Elasticsearch as a Pentaho source through sql4es
In one of my previous posts, I described the Elasticsearch JDBC driver we developed, called SQL4ES. I put the SQL4ES driver to the test by using it to expose Elasticsearch as a data source for Pentaho.
For this effort I used Elasticsearch version 2.2.1, Pentaho Business Analytics 18.104.22.168 Trial and SQL4ES driver version 0.8.2.2. The following sections describe the data used, how the driver was loaded and configured within Pentaho and used to generate reports.
I created a typical BI sales dataset containing a single fact (daily point of sales) with a number of dimensions:
- Time: containing the date, month, week and day fields
- Product: name, category, subcategory and price
- Store: name, country, currency (see below)
- Currency: name and exchange rate (dollar)
This data was denormalized, each row containing the joined data with all dimensions, using Apache Spark. The flattened dataset was indexed in Elasticsearch without strings being analysed. This resulted in roughly 10M rows with random sales volumes per day. Obviously a very small dataset but good enough for the proof of concept running on a single laptop. The entire data process is shown in the figure below.
Dimension data is duplicated in order to avoid joining which makes it faster to aggregate data. This might seem very expensive with respect to storage and memory usage but this is not necessarily so, at least not when using Elasticsearch. Under the hood Elasticsearch maintains a vocabulary with identifiers for all terms it encounters. For this type of data, with a very controlled vocabulary of product names, categories, and other fields with low cardinality, etc. this structure is relatively small.
Since Elasticsearch usually stores the document source, denormalization of data will require more storage. It is however possible to increase document compression or disable _sourcedocument storage altogether (see this page). When _sourcestorage is disabled it is no longer possible to retrieve the original documents, but . Tthis is not a problem when you just want to aggregate data (as in the use case described in this post). Storage requirements for this dataset was cut in half when document storage was disabled!
I added the driver to Pentaho’s BI Server as described here. In short it means adding the driver jar file to the appropriate lib directory and restarting the service. Once the driver is exposed to Pentaho it can be used to configure a Data Source. I used the ‘Manage Data Source’ wizard to create a new Data Source for my sales dataset index in Elasticsearch. These steps are required to get this done:
- Open ‘Manage Data Source’
- Source Type à Database Table(s)
- Click the ‘+’ sign to add a new Database Connection (see screenshot below)
- Give the connection a name (ElasticJDBC for example)
- Specify the url of your Elasticsearch cluster such as jdbc:sql4es://localhost:9300/myindex (also see the SQL4ES page on Github)
- Specify the driver class: jdbc:sql4es://localhost:9300/myindex
- Test connection and finish by clicking ‘Ok’
- Finish the Database connection by giving it a name and finish by clicking ‘Ok’ (or test it first )
- Select the newly created connection
- Select: ‘Create Datasource for reporting and analysis’
- Although it is a flat structure the data resembles a ‘star schema’!
- Click ‘Next’
- Pentaho now fetches the tables (Elasticsearch types) and allows you to select these
- Select the table (which is an Elasticsearch Type) you want to use
- Specify this table to be the ‘Fact Table’
- Specify the model and types if needed
- Click Finish
The dummy data looked like the screenshot shown below showing two facts and the dimensions I created. Since these dimensions are indexed in a nested structure they show up in dotted notation within Pentaho’s model management window.
With the Elasticsearch index/type configured as a Data Source it was time to create the reports and dashboards I was aiming for. Essentially this is basic Pentaho functionality: select the Data Source and start dragging and dropping facts and dimensions in order to create tables and visualisations. I created a number of visualisations shown below. Please note that all the data is fictional and do not resemble any real stores or products!
The example below shows the table created by combining two product dimensions with time.
The following screenshot shows sales volume per category per month (clearly showing the not-so-random data generator I used when creating the dataset… business is booming!).
It is also possible to add filters to be able to to remove certain data points from the aggregation/visualization. The SQL4ES driver translates the SQL statements into Elasticsearch queries and puts the returned information in the tabular format Pentaho expects. Since Elasticsearch is very fast, both simple selects as aggregations, these visualisations are generated very quickly and it is very easy to drill down in the data. Aggregating the 10 million data on 4 dimensions took 500ms on average on an i5 laptop.
This little exercise showed how the SQL4ES driver can be used to let a system interact with Elasticsearch using the well known JDBC interfaces. The ‘data warehouse use case’ described is just an example and requires the data to be denormalized in order to work. In the same way the driver can be within other JDBC supporting systems to access, for example timeseries data or documents. Together with MapR we are looking into integration with Apache Drill which would allow data from Elasticsearch to be joined with other data sources as well.