Azure Services: Synapse Workspace part 2
Where is it now?
Roughly a year ago, I looked at Azure Synapse Workspaces, which were still in preview at the time. My observations and conclusions from back then can be found in this post. I called it a ‘diamond in the rough’: something that showed a lot of potential but was also lacking in many key ways.
Since Microsoft continuously improves its Azure services, I thought it was time to take another look at Synapse workspaces and see what has changed. One of the additions is Synapse Link for Cosmos DB, which is described in its own section of this post. As last year, my focus will be on the Data Lake and Spark subjects rather than the SQL Pools.
Comparison with last year
Given the state of Synapse Workspaces a year ago, a lot has improved:
- Git support for all Synapse artifacts, like SQL scripts, pipelines, and notebooks, has been added. Synapse users can now safely make mistakes and recover using a repository.
- CI/CD is also supported (see documentation) using the same approach as Data Factory deployment. Basically, this means that the ARM template from the source workspace is deployed into a target workspace. However, resources like Spark and SQL pools are not deployed, and neither are workspace items like notebooks, SQL scripts, dataflows, etc. For deployment of these workspace items, it is possible to use a solution from the DevOps marketplace. Putting everything together to deploy a Synapse Workspace including code is a bit complex and monolithic: it is not possible to deploy pipelines or other items separately.
- Synapse now supports 10 different roles to define what users can do within a workspace. A year ago there were only 3 roles, each granting a relatively large set of permissions.
- More control over Spark:
- It is possible to stop or kill individual Spark sessions, but not Spark pools. This is typically good enough, as it allows users to intervene when a Spark session breaks down.
- Users can now manage libraries in different ways, and these can be managed within the Synapse workspace. This is easy to do for Python using a yml file. Java libraries (jar files) must be put on Synapse storage first. As far as I know, Synapse does not support repositories like Maven, which makes library management (other than Python) still very difficult.
- Spark session configuration can now be done at Spark Pool or even Spark Session level. This is done by uploading a text file containing Spark configuration, similar to a spark-defaults.conf file. It is even possible to configure a Spark session from within the session itself using code.
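As an illustration, such a configuration file follows the same key-value layout as a spark-defaults.conf file; the properties below are standard Spark settings chosen as an example, not taken from an actual pool:

```
spark.sql.shuffle.partitions  16
spark.executor.memory         28g
spark.driver.maxResultSize    4g
```

Within a running session, runtime-modifiable settings can also be changed in code, e.g. `spark.conf.set("spark.sql.shuffle.partitions", "16")` from a notebook cell.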
At the same time a lot still has to be improved:
- Very limited Spark Pool configuration options compared to Databricks.
- Only one VM type is supported while it would be great to pick different VM types based on workloads (CPU heavy vs Memory heavy vs balanced).
- Only Spark version 2.4 is supported, so it is not possible to run older versions (if required by libraries) or newer versions such as Spark 3.
- A minimum of 3 nodes, which can be on the heavy side for small data and simple exploration. Databricks supports a single-node option (driver and worker in one), which is perfect for the small jobs that are often needed.
- Still unable to ALTER TABLE / MSCK REPAIR Spark tables, which is vital when data is altered outside of Spark.
- SQL on demand, now called Serverless SQL, is largely unchanged. It supports Parquet and CSV files, but not yet VIEWs or the Delta format. A ‘workaround’ to query Delta using manifest files is described here, but it is pretty cumbersome and I would not use it in production.
The SQL Side
Unlike last year, I wanted to take a better look at Serverless SQL with the new ‘Synapse Link for Cosmos DB’ feature and SQL Pools to see what has changed there. There’s quite a lot to mention!
Synapse ‘SQL-on-demand’ has been rebranded to ‘Serverless SQL’ but is largely unchanged. It allows execution of SQL SELECT statements on data stored on Azure Blob or Data Lake storage. Databases and tables can be created via SQL scripts or Spark notebooks. When using Spark, any table that uses CSV or Parquet as its storage format is automatically synced with Serverless SQL. From my perspective, it would be a real improvement if other popular formats like Avro and Delta were supported as well.
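To illustrate what such a query looks like, below is a minimal Serverless SQL sketch using OPENROWSET; the storage account, container, and folder names are hypothetical:

```sql
-- Ad-hoc query over Parquet files in the data lake (hypothetical path)
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [sales];
```

Tables synced from Spark can be queried the same way by name, without spelling out the storage path.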
From a functional perspective it is pretty clear what Serverless SQL provides, but I was also interested in the actual performance, so I benchmarked Serverless SQL against a single-node Databricks cluster (4 cores, 32 GB memory) with Delta caching enabled. The dataset contained 988 million records across 131 files and was ~14 GB in size. It was a vacuumed Delta dataset, so Serverless SQL could query it while Databricks could still use its Delta optimizations. On more complex queries, Databricks clearly outperforms Serverless SQL, especially once the Delta cache is used. Having the data cached locally on SSD has its benefits!
Some queries, such as the one shown below, resulted in the error ‘The query cannot be executed due to current resource constraints’. A quick Google search led to this best practices documentation, which indicated that the data was ‘over-partitioned’, leading to lots of small files. Most of my files were 100 MB or bigger, but Serverless SQL still did not like it and errored out. Although I understand why this happens, throwing an error does not feel like an elegant way of handling it. Just run the query, be slow about it, and point users to the relevant documentation.
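Following that best-practices documentation, one mitigation is to let a query touch only the partition folders it needs via the filepath() function; a sketch assuming a hypothetical year=/month= folder layout:

```sql
-- Prune partitions via filepath(); filepath(n) returns the n-th wildcard match
SELECT COUNT(*)
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/year=*/month=*/*.parquet',
    FORMAT = 'PARQUET'
) AS [sales]
WHERE [sales].filepath(1) = '2020'
  AND [sales].filepath(2) = '12';
```

This does not fix over-partitioned data, but it reduces the number of small files a single query has to open.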
Synapse Link for Cosmos DB
Microsoft has also added ‘Synapse Link for Cosmos DB’, which provides good analytical performance on up-to-date Cosmos DB tables. When enabled, all changes to a transactional (OLTP-oriented) Cosmos DB table are replicated to an analytical (OLAP-oriented) table optimized for analytical queries. This enables users to perform analytics on (near) real-time data, for example using Power BI dashboards. The data replication as well as the analytical queries introduce additional costs (€0.017 per GB/month for storage, €0.046 per 10K writes and €0.005 per 10K reads).
This feature seems built solely for Synapse, as this is the only service that can query the analytical tables, via Spark or Serverless SQL. I tested Synapse Link using both Spark and Serverless SQL on a single table containing 6.5 million records. A screenshot of the Serverless SQL query is shown below.
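For reference, a Serverless SQL query against a Cosmos DB analytical store goes through OPENROWSET with the CosmosDB provider; a minimal sketch in which the account, database, and container names are hypothetical and the key is a placeholder:

```sql
-- Query the Cosmos DB analytical store from Serverless SQL (hypothetical names)
SELECT TOP 10 *
FROM OPENROWSET(
    'CosmosDB',
    'Account=myaccount;Database=mydb;Key=<account-key>',
    mycontainer
) AS [rows];
```

The query reads from the analytical (OLAP) replica, so it does not consume request units from the transactional store.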
The OLAP option works as expected and executes analytical queries a lot faster than its OLTP counterpart. I compared Cosmos OLAP to Databricks performance (1 node, 4 cores, 32 GB memory) on the same dataset stored in Delta. Overall, the performance of both options was very similar, until Delta caching kicked in. Sometimes a Cosmos query would take a lot longer than a previous execution; I have seen the exact same query go from 3–4 seconds to 18–45 seconds on repeated executions. Honestly, I expected a bit more from Cosmos DB, but it was a good experience nonetheless, especially given the ease with which analytical tables can be enabled and updated automatically without any code. The fact that this feature is exclusive to Synapse makes it bitter-sweet for me. I would have loved it to be available in Databricks as well, but the same goes for some Databricks-specific features like improved Delta and other things on the horizon. Both Microsoft and Databricks try to lure and lock in customers with these kinds of features, and customers needing both will have to put both services into their solutions, including all the overhead that comes with it… I know it’s the way of the world, but it still makes me sad.
SQL Pools
Honestly, I am not an expert on relational databases or warehouses, but I did want to take a look at the Synapse Analytics SQL Pools. In essence, an SQL Pool is an SQL DWH scaled with a number of Data Warehouse Units (DWUs). A single Synapse workspace may have multiple SQL Pools, just as it can have multiple Spark Pools. While a Spark pool can auto-scale, an SQL Pool cannot; it can, however, auto-pause based on the number of minutes it has been idle.
Loading data into an SQL Pool via Data Factory (pipelines) or Spark is very easy, and PolyBase or bulk loading can be used to load data very quickly. Optimizing tables for specific loads using data distribution and indexing requires some tuning and testing, but can be done relatively quickly using CTAS statements (~5 minutes for the 123 million records I was testing with). As can be expected from such a mature product, SQL pools are rich in features and perform very well.
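As a sketch of such a CTAS statement, the example below copies a hypothetical staging table into a new table with a hash distribution and a clustered columnstore index; all table and column names are made up:

```sql
-- Re-distribute a staging table onto a hash distribution via CTAS (hypothetical names)
CREATE TABLE dbo.FactSales
WITH (
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.FactSales_Staging;
```

Because CTAS rewrites the whole table in parallel, trying out a different distribution column is mostly a matter of re-running the statement and comparing query times.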
Conclusion
Looking from the ‘big data’ domain, my conclusion from last year still stands. Although Synapse Analytics has made a lot of improvements, I feel its Spark and Serverless SQL functions are still lacking in many ways. From a ‘big data’ perspective, the whole (Synapse) is still smaller than the sum of its parts (Storage, Data Factory, and Databricks). Admittedly, all these separate services are harder to manage than a single Synapse workspace, but their features make up for it. This is very similar to the findings of this post answering the question ‘Is Azure Synapse Analytics ready for production?’.
I guess Synapse is a good fit for those with a Data Warehouse background, as I feel it is very strong and flexible in that domain. In that case, starting with Synapse and adding a bit of Data Lake and Spark makes total sense. I also see value in Synapse Link, as it allows (near) real-time analytics on streaming data written to Cosmos DB tables; it performs reasonably and is very low-maintenance.
While working with Serverless SQL, Cosmos analytical tables, and SQL Pools, I started wondering how these technologies compare to the competition (Databricks Delta/Parquet, Snowflake, and Elastic) in terms of costs and performance. That is now on my TODO list and a likely topic for my next post!