Data Explorer (ADX) – Azure’s hidden gem
Although it has been around for many years, there is a service within Azure that does not get the attention it deserves. The Azure website describes it as a ‘fully managed data analytics service for real-time analysis on large volumes of data streaming from applications, websites, IoT devices, and more.’ Even though this is accurate, I feel it does not do justice to a service that is very feature-rich and mature. The service I am referring to is Azure Data Explorer (ADX), formerly known as Kusto. For me, the name ‘Data Explorer’ implies it is a visualization tool of some kind, and the description from Microsoft hints a bit in that direction. But that is not the case. ADX is better described as a ‘streaming (timeseries) optimized data warehouse’. It is used within Azure itself for all monitoring, so whenever you look at metrics or query logs you are actually working with ADX. This is an indication of its scalability and maturity. In this post I will briefly explain what ADX is and its main features, and at the end I will walk through a small benchmark I performed using a 75 billion record dataset on an 8-core ADX cluster.
What is Azure Data Explorer (ADX)?
As I said, I find the name rather misleading as it is not just an explorer but a timeseries database optimized for analytics. Its architecture, features, and query capabilities are strongly focused on timeseries data using a hot/cold storage architecture. Hot data, defined as the most recent X days of data, is stored on SSDs to facilitate fast access, while older data is stored on cheap blob storage. The hot window, measured in days, can be defined per database or table and enables a trade-off between performance and cost. The data itself is by default stored co-located on datetime in a columnar file format. This makes it very suitable for timeseries analytics, and as a result it is mostly promoted within this domain. Although focused on streaming data, the actual storage and transformation within ADX is done in micro-batches. Typically, sources are read at a 10-30 second interval, which makes sense given the analytical nature of ADX but also makes it less suitable for use cases that require very low end-to-end latency. ADX does support a more event-driven ingestion mode called streaming ingestion, but this has limitations and should be enabled with caution.
Some of the top features of Azure Data Explorer (ADX) are:
- Its own query language, KQL (Kusto Query Language), which is very powerful and user friendly. I find it a lot more flexible than SQL, and it avoids many of the sub-queries typically required in SQL. It should be noted that ADX supports SQL as well and can be accessed using standard SQL Server drivers, which makes it easy to integrate with.
- A lot of ways to ingest data, for example directly from Blob / ADLS storage and Event Hubs. While most databases require some external process like an Azure Function or Stream Analytics job to read a stream and write data into tables, ADX can do this by itself. It can for example stream an Event Hub ‘topic’ into a table (typically in 10-30 second micro-batches). Data Factory also provides integration with ADX to copy data and execute statements.
- Data mappings, update policies and materialized views allow new records from source tables to be parsed, transformed, joined, and loaded into target tables. This includes options to parse semi-structured and unstructured data (XML, JSON, CSV and text) as well as joins with other tables. Joining streams with each other or with reference data is typically a hard scenario but can be done rather efficiently in ADX from local SSDs. Given its micro-batch architecture, most operations can be done very efficiently. Tables can have a retention of 0, which means data is not persisted and the table simply acts as a ‘stream’ between transformations. No external ETL/ELT tooling is required, although for big transformation jobs it might be more performant to use a scalable external ‘ETL’ solution like Databricks to do the heavy lifting.
- A number of specific timeseries functions like datetime aggregation, line fitting (regression), seasonality detection, missing-value filling, anomaly detection and forecasting. Although not commonly used, they can come in handy, as I have noticed in my own projects. The screenshot below shows the query and chart to perform forecasting and outlier detection for a meter with a daily pattern.
- A dashboarding function, currently in preview. This allows data to be visualized in various ways and exposed to others. It is by no means a substitute for something like Power BI, but it does provide an easy way to expose data and insights to (business) users. In my experience it is comparable to Databricks dashboards but without the need for a cluster to start (which will no longer be an issue once Databricks Serverless SQL becomes available in Azure). Grafana can also be used to provide a richer dashboarding experience. The screenshot below shows a simple dashboard with airplane information (location, altitude, etc.) from the OpenSky network.
- External tables on Blob, ADLS gen2 or SQL Server are supported, and ADX can sync to these tables automatically. These tables are not indexed though, so query performance is much lower than on native ADX tables. This provides a way for ADX to perform (streaming) data curation and write the results to a data lake for bulk processing / analytics.
- All kinds of ‘query plugins’ used to execute code (R or Python), perform machine learning like clustering, and get data from other services like Azure Digital Twins, Cosmos DB, SQL, MySQL and Web APIs. This can be used to combine ADX data with data in other databases / services, for example as part of an update policy. I have used the http_request plugin once to combine ADX data with data stored in Azure Table Storage. Some plugins need to be enabled, and any connections they make to other systems need to be allowed in a callout policy.
- Options to scale the cluster manually (up and out) and automatically (out). The automatic scaling can be reactive and even predictive based on periodicity patterns, although I must admit I have never used this feature.
- Cross-tenant access allows users/groups from another tenant to access ADX databases and tables without those users needing to be ‘invited’ into the AAD. This is a very flexible way to provide access to external users.
- Although not a feature of ADX itself, it is worth mentioning that other Azure services like Databricks, Data Factory, Stream Analytics, etc. support it. This means that, for example, big or complex transformations can be taken out of ADX and performed by Databricks or Data Flows.
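To give an impression of the timeseries functions mentioned above, the sketch below runs anomaly detection and a one-week forecast over hourly averages. The `Measurements` table and its `MeterId`/`Timestamp`/`Power` columns are hypothetical names I use for illustration; the functions themselves (`make-series`, `series_decompose_anomalies`, `series_decompose_forecast`) are standard KQL.

```kusto
// Hypothetical table: Measurements(MeterId: string, Timestamp: datetime, Power: real)
// Detect outliers in the hourly average of the last 30 days for one meter
Measurements
| where MeterId == "meter-001" and Timestamp > ago(30d)
| make-series AvgPower = avg(Power) on Timestamp step 1h
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(AvgPower, 1.5)
| render anomalychart with (anomalycolumns=Anomalies)

// Forecast the next 7 days (168 hourly points): extend the series with empty
// future bins and let series_decompose_forecast fill them in
Measurements
| where MeterId == "meter-001" and Timestamp > ago(30d)
| make-series AvgPower = avg(Power) on Timestamp from ago(30d) to now() + 7d step 1h
| extend Forecast = series_decompose_forecast(AvgPower, 168)
| render timechart
```

Both functions automatically detect the seasonality (here the daily pattern) before decomposing the series, which is what makes them convenient for meter-style data.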
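The combination of a retention-0 ‘stream’ table and an update policy from the feature list can be sketched with management commands. All table, column and function names below are hypothetical; the pattern is a staging table that receives raw events, a parsing function, and an update policy that appends the parsed records to a target table on every micro-batch.

```kusto
// Staging table receiving raw JSON (e.g. streamed in from an Event Hub)
.create table RawEvents (RawJson: dynamic)

// Retention of 0: records are not persisted, the table only acts as a 'stream'
.alter-merge table RawEvents policy retention softdelete = 0d

// Target table with the parsed schema
.create table Measurements (MeterId: string, Timestamp: datetime, Power: real)

// Parsing/transformation applied to each micro-batch of new records
.create function ParseRawEvents() {
    RawEvents
    | project MeterId = tostring(RawJson.meterId),
              Timestamp = todatetime(RawJson.ts),
              Power = toreal(RawJson.power)
}

// Update policy: every ingestion into RawEvents triggers the function
// and appends its output to Measurements
.alter table Measurements policy update
@'[{"IsEnabled": true, "Source": "RawEvents", "Query": "ParseRawEvents()", "IsTransactional": true}]'
```

The parsing function can also join with reference tables, which is how the stream-enrichment scenario mentioned above is typically implemented.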
All great so far, so what are the drawbacks?
- Issues with bulk processing – ADX is optimized for stream processing and as a result not particularly performant when it comes to bulk processing, like reloading a big dataset. An ‘ingestion’ like this is mostly done on a single core with a timeout of 60 minutes. The advocated approach is therefore to split the job into small parts, for example loading a historic dataset per day or month. Depending on the cluster size, multiple ingestions can run in parallel, but you have to manage this yourself or use Data Factory/Synapse pipelines to orchestrate it. This makes it a lot more complicated than a solution like Databricks Spark, in which it is simple to ‘ETL’ a massive dataset as a whole, or at least in very big pieces, without core or timeout limitations. Using Databricks is actually a good way to (re)load a large amount of data into ADX, with Databricks doing the heavy lifting while ADX simply writes the data in small batches.
- Updating records – It is possible to delete records, but it is not possible to update them, which can make life difficult. Effectively, an update must be made by appending fully updated records and deleting or filtering out ‘old’ records when requesting data. This can be made easier by using a (materialized) view to get the latest state based on a (multi-column) id and timestamp. Obviously, updating very large datasets must be avoided, so it is important to think about the data model and carefully distinguish dimensional data, which is updated frequently, from fact data, which is rarely or never updated. In my experience updates typically come in as full records from source systems and are modelled as a type of ‘slowly changing dimension’ to keep track of changes. In a data warehouse or data lake this would typically require an update on the ‘old’ record (for example setting an ‘end_date’) and appending the ‘new’ record with an is_active flag or an end_date of NULL. Within ADX this can be done by simply appending the new record and using an arg_max(timestamp, *) by ID aggregation to get the latest record based on an explicit timestamp or ingestion_time. When the dimension table is not extremely large and is partitioned / sorted on the ID field(s), this will perform well. The latest records can also be maintained in a materialized view to avoid the overhead of an arg_max(…) operation on every request.
- Defining what data is stored on ‘hot’ SSD can only be done based on datetime. By default the hot definition looks like ‘now – X days’, but it is also possible to define other time periods, called hot windows. This makes sense, but in practice it would be handy to define hot windows on other aspects of the data as well, like a status flag. One way to handle this is to split a single logical dataset into separate tables with different policies. For example, store data with status ‘important’ in a table with a caching policy of 365 days and data with status ‘less important’ in a table with a caching policy of 90 days. These tables can be ‘unioned’ in a view to expose a single dataset to data consumers. Although possible, this is less than ideal as it adds serious complexity to the data model and transformation flows.
- ADX pricing – pricing is based on the VMs used, with an additional markup of 0,11 euro per core per hour. A production setup always has at least two data VMs and two small ‘management’ VMs. A cluster consisting of 2x (4 vCPU, 28GB memory, 160GB SSD) data VMs costs ~1270,- euro per month, without any discounts, in the West Europe region. This is very similar to Databricks Premium pricing for the same setup, with the major difference that Databricks clusters can be turned on and off while ADX is always on. But in a streaming scenario something must always be on anyway. A development cluster consisting of a single VM (2 cores, 14GB memory, 80GB SSD) costs 200 euros.
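The append-plus-arg_max pattern for ‘updates’ described above looks roughly like this in KQL (the `Meters` table and its `MeterId`/`Updated` columns are hypothetical names for illustration):

```kusto
// Dimension table to which full updated records are simply appended;
// Updated is an explicit change timestamp. Latest record per meter:
Meters
| summarize arg_max(Updated, *) by MeterId

// The same logic maintained incrementally as a materialized view,
// avoiding the arg_max cost on every request
.create materialized-view MetersLatest on table Meters
{
    Meters
    | summarize arg_max(Updated, *) by MeterId
}
```

Consumers then query `MetersLatest` instead of `Meters`, and historic versions remain available in the base table for slowly-changing-dimension style analysis.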
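The workaround of splitting a dataset over tables with different caching policies and unioning them back into one logical dataset can be sketched as follows (all names are hypothetical):

```kusto
// Two tables with identical schemas but different hot windows
.alter table MeasurementsImportant policy caching hot = 365d
.alter table MeasurementsLessImportant policy caching hot = 90d

// A stored function acting as a view that exposes the union
// as a single dataset to data consumers
.create function MeasurementsAll() {
    union MeasurementsImportant, MeasurementsLessImportant
}
```

The catch is that the ingestion/transformation flow now has to route each record to the right table based on its status, which is the extra complexity mentioned above.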
ADX within a data platform
A modern data platform supports both batch and streaming data and typically consists of architectural components like storage, batch processing, stream processing, analytics capabilities and reporting, as described here. Although ADX focuses on ‘stream storage and analytics’, it can perform all these functions, including some basic machine learning inference (not the training itself). It can easily be combined with a data lake setup in which ADX syncs tables to external tables on the data lake and/or reads data from the data lake.
Databricks is an alternative service which can perform all of these functions as well, and in most cases offers more features and is more scalable. However, the Databricks advantages come at a cost: increased complexity. I really love Databricks and the Spark framework but have also struggled with things like Spark coding, job optimization and table optimization (partitioning, clustering, the ‘small files problem’). ADX has fewer features than Databricks but is powerful because it can still do a lot while being easy to use. So ADX can be a good fit for smaller, less complex data platforms. And ADX can always be used alongside Databricks and a data lake to provide the best of both worlds.
Microsoft reference architectures for streaming tend to default quickly to Cosmos DB. For use by applications and/or event-driven architectures this makes a lot of sense, but Cosmos DB is ill-suited and expensive for analytics cases due to Request Unit limitations and the associated costs. The Cosmos DB analytical store can be used for analytical queries, but last time I checked it was not very performant, especially compared with ADX. So when analytics are involved, one should really consider using ADX instead of, or next to, Cosmos DB for cost-effective long-term retention and analytical queries.
As preparation for a project plan, I performed an ADX benchmark to evaluate the performance of some specific types of queries and to estimate ADX sizing. The PoC was performed on a production cluster consisting of 2 servers, each with 4 vCores, 28GB memory and 160GB SSD (hot storage). A public Smart Meter dataset was used as a basis, which contains ~2 years of smart meter data (one measurement value, taken every 30 minutes) for 5000 London meters. Given its relatively small size, the data was expanded to make it more realistic:
- the number of unique meters, with their master data and measurements, was duplicated from ~5000 to 450.000
- the 30 minute measurements were ‘duplicated’ to 15 minute measurements, with minor random changes to the numeric values, and the time range was extended from two to five years. The resulting dataset contained 75 billion measurement records, each having an id, a timestamp and 4 numeric values. One year of this data was kept in hot storage (127GB) while the rest was stored in cold storage (508GB). Due to compression the average record size was 9,05 bytes.
- an additional materialized view was created to store daily statistics like the count, average and standard deviation of all measurements per smart meter (per day). This materialized view contained 784 million records and was stored in hot storage (34GB).
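The daily-statistics materialized view used in the benchmark could look roughly like this. The actual definition is not shown in this post, so the table, view and column names below are hypothetical; only the aggregations (count, average, standard deviation per meter per day) follow the description above.

```kusto
// Hypothetical source table: Measurements(MeterId: string, Timestamp: datetime, Power: real)
.create materialized-view DailyMeterStats on table Measurements
{
    Measurements
    | summarize Count = count(),
                AvgPower = avg(Power),
                StdPower = stdev(Power)
        by MeterId, Day = bin(Timestamp, 1d)
}
```

ADX maintains the view incrementally as new measurements are ingested, which is why querying it is so much cheaper than aggregating the 75 billion raw records on the fly.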
In total, ~160GB of hot SSD storage was used, which is roughly 50% of the available SSD capacity. This means there is plenty of room for other datasets to be stored in hot.
The benchmark was performed with JMeter using the REST endpoint of the ADX cluster. Different types of queries which get data for one smart meter ID were executed multiple times with different parallelism (i.e. numbers of threads) to simulate multiple users hitting the database simultaneously. It should be noted that these are ‘OLTP queries’ which simply fetch a number of records given an ID, which is typically not the sweet spot of analytics databases like ADX.
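A single-meter query of the kind benchmarked here (fetch all records for one ID within a time window) is a plain filter in KQL. The table and column names are again hypothetical:

```kusto
// 'OLTP style' lookup: all records for one meter over the last 3 months
Measurements
| where MeterId == "meter-000042"
| where Timestamp between (ago(90d) .. now())
```

Because the data is co-located on datetime, the time filter prunes most extents up front, and the remaining scan for the ID runs either from SSD (hot) or blob storage (cold), which explains the response-time gap in the table below.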
The average response times in milliseconds and the standard deviation for the queries are shown in the table below. These times measure both query execution and data transfer time (the full round trip from client to ADX and back). No query result caching was used.
| Query | threads | AVG (ms) | STD (ms) |
|---|---|---|---|
| load all records for an ID and 3M (Hot) | 1 | 191 | 130 |
| load all records for an ID and YEAR (Hot) | 1 | 767 | 162 |
| load all daily aggregated data for one ID (Hot) | 1 | 729 | 142 |
| load all records for an ID and 3M (Cold) | 1 | 1.520 | 1.040 |
| load all records for an ID and YEAR (Cold) | 1 | 4.187 | 2.520 |
| load all records for an ID and 5 YEARS (Cold) | 1 | 20.334 | 13.944 |
It is clear that frequently requested data must be put on SSD to achieve sub-second query response times, while queries on cold storage are relatively slow (no real surprise). For 10 concurrent users the hot queries return within a second on average, though some take longer. Cold data performance is notably worse, but that is no surprise when, for example, 31K records must be selected from 75 billion stored on blob storage.
The JMeter output for a ‘10 thread run’ is shown below.
Besides the queries on a single meter ID, we also executed a number of analytical queries in which data is aggregated. These queries were executed 50 times using one thread on the daily aggregated data (the materialized view).
| Query | #records scanned | #results | AVG (ms) | STD (ms) |
|---|---|---|---|---|
| Aggregate all columns for one year | ~157 Million | 12 | 3487 | 768 |
| Aggregate all columns for 5 years | ~785 Million | 60 | 10308 | 869 |
| Aggregate one profile for 5 years incl. histogram on 100 watt | ~785 Million | 93K | 11055 | 1165 |
JMeter output for analytical queries using 1 thread
Analytical queries on a single year of original measurements (15 billion records) in hot storage were not included in the benchmark, but experiments aggregating all that data took between 4 and 7 minutes depending on aggregation complexity.
From my perspective the performance is certainly respectable given the data size, cluster size, variety of queries and query parallelism. For sure there are solutions like Cosmos DB, Elasticsearch, SQL DWH/Pools or Snowflake which are more suitable for some of these workloads. But these come with additional costs, complexity and/or limitations. Cosmos DB for example is great for the ‘OLTP queries’ (sub-10ms, I reckon) but will become expensive on storage and will not be able to perform the aggregations. Elasticsearch can handle both the single-ID and the analytical queries at blazing speed but requires significant storage and memory to do so.
From my perspective, ADX is a relatively easy-to-use, can-do-it-all data service with a cost-effective data architecture and good all-round performance. It can handle batch and streaming workloads, offers a simple yet amazing query language, has some (very) basic ML capabilities and comes with built-in dashboarding. For sure it is not the best at everything, but it is certainly decent and relatively easy to work with, which is something I value more and more.
To help people get started with ADX, we will publish a blog describing how to set up an entire streaming data pipeline to load, analyze and view real-time airplane locations. This pipeline uses an Azure Function to fetch airplane locations from a public API, distributes this data using an Event Hub, and uses ADX to ingest, store, analyze and visualize it. The dashboard shown in this post is actually taken from this pipeline. So stay tuned for this new post so you can set it up yourself!