Azure Services: Synapse Analytics Workspace
Overview of the new data-oriented Azure service
Synapse Analytics Workspace is a new data-oriented service within Azure which brings together a number of existing services in a unified workspace. It can be used for data preparation, data management, data warehousing, big data, and AI tasks. Because Synapse Workspace overlaps with our Data Hub, I wanted to see what it’s all about and how it relates to our best practices when working with data within Azure.
So, I setup a workspace to put it to the test.
Since my background is in big data, I focused on Spark, SQL-on-demand (SQL-OD), architecture and security. In short, I followed Microsoft guides to create the workspace and used Spark to load Airline CSV data (airports, carriers and flights) into Parquet tables and query them using SQL-OD.
Workspace setup, Pipelines (basically Data Factory), Synapse DWH and SQL Pools are not covered in this post. For those interested in a complete overview and Workspace setup I advise this blog post which covers all of this. Another post focuses more on the background and road map of Synapse Workspace based on information from Microsoft and taking into account partnerships. Integration with SAP (via Qlik), Event Hub and Databricks seem to be on the horizon, which I find very exciting.
My opinion in short: Synapse Workspace is a diamond in the rough; showing lots of promise but not yet ready for serious use. I will explain why in the Evaluation and Conclusion sections below. But before going there, let’s start with a brief Synapse introduction and its security.
Synapse workspace introduction and architecture
Before diving into the evaluation, I want to explain the new (preview) Synapse Analytics Workspace service and my thoughts on it. My perspective comes from the Big Data domain and I guess that will be a bit different from someone with a Data Warehousing / BI background.
This is what Microsoft has to say about Azure Synapse Analytics Workspace:
Azure Synapse is an integrated analytics service that accelerates time to insight from all data at any scale, across data warehouses and big data analytics systems. It brings together the best of the SQL technologies used in enterprise data warehousing, Spark technologies used in big data analytics, and Pipelines to orchestrate activities and data movement.
As stated, it’s the integration of several existing and new data-oriented services ‘aiming to accelerate time to insight’. From my perspective, it brings together the following existing services: SQL Databases, Data Lake storage, Cosmos DB (‘NO-SQL’ database), Data Factory and Databricks (Spark). As far as I know, Databricks is not actually used and Microsoft has built their own Spark service. All of these services have some (small) integration and can be used from a single workbench. This sounds great as the ‘the whole should be greater than the sum of its parts’. But this statement is only true if the parts are properly integrated and as good as the original parts. I come back to this in the evaluation section.
There are different ways to look at Synapse Workspace and I put my architectural view in the image below. It shows a single workbench governing multiple ‘processing engines‘ allowing for integration and analytics across different ‘storage services’. Data Lake Storage gen 2 and Cosmos DB seem to be first class citizens for Synapse as they have direct support but almost every storage service can be connected via Data Factory and Spark as well. The general approach would be to load data from source systems into SQL or Data lake and perform your analytics and data pipelines from there.
It must be noted that the storage layer is not actually part of the Synapse Workspace service but the processing services are. This means you have to setup your own storage separately from Synapse. A Data Lake Storage gen 2 account is required as Synapse uses it to store its own state and temporary results. Such a storage account can be created when a Synapse Workspace is being setup but you can also bring your own.
Synapse Link for Cosmos DB is very interesting as it will allow near real-time analytics on data stored in Cosmos DB. It leverages Cosmos DB’s analytical tables which are OLAP versions of OLTP tables. Real time ‘changed data capture’ is used to replicate changes on operation data to the OLAP table which can be used for analytics from Synapse. I did not test this yet but it’s very interesting for streaming cases for which we tend to use Cosmos DB to store hot operational data. Near real-time analytical access to that data without affecting any operational applications would really be great.
Because of the fact that Synapse Workspace consists of several different processing and storage services this topic is actually very hard to pin down. I can’t say I fully comprehend it, but I will share what I know below. Feel free to reach out if you feel something is not right here and I will rectify (if I agree of course 😉 )
First of all, Synapse provides security on the networking layer of the service. It is possible to enable a ‘Managed VNET’ which will put your Synapse processing services in isolated networks. As far as I know it is not possible to define the VNET’s to use nor can you see or manage them. This is different from Databricks for example, which allows you to put clusters in a VNET of your own and really control the traffic using ‘Network Security Groups’ and ‘User Defined Routes’. Similar to most Azure services, Synapse Workspace also support Private Endpoints and Firewall rules. I did not test these but I assume they apply on the Workspace and SQL endpoints which are part of the service.
Synapse workspaces have their own simplified RBAC which is not yet linked with AAD. It supports three roles:
- Workspace admin – able to manage every aspect of the workspace
- SQL admin – able to manage SQL aspects of the workspace
- Apache Spark admin – able to manage the Spark aspects of the workspace
It is advised to create corresponding AAD groups and assign them to the groups above in Synapse (see this documentation). Doing this allows users to be managed in AAD instead of using Synapse itself. Microsoft advises to make these groups ‘Storage Blob Data Contributor’ on the Storage Accounts that Synapse works on. Really Microsoft… avoiding the POSIX like security of the storage you promote so much? I understand this is handy because Synapse stores its own configuration and temporary files here as well and users will need full access to that. But doing this gives users full access to all data within that Data Lake account which honestly must be avoided.
I performed some testing and suggest an alternative approach:
- Manage the Synapse groups as suggested in the documentation but do not make them Storage Blob Data Contributor
- Give these AAD groups RWX permissions on the /synapse storage container recursively in order to work with synapse as this is also the location where notebooks and temporary (hive) data are stored.
- Manage data access using AAD groups and ACL’s on appropriate directory hierarchies as described in the storage documentation
Doing this separates service access from data access which should be the default in my opinion.
As already highlighted above, data access is mostly managed within the storage layer itself. Access to data in Data Lake storage is managed via its ACL system and permissions on SQL objects are ‘granted’ within the SQL-OD engine.
When using SQL-on-demand it’s possible to secure objects/data in the SQL layer and in the storage layer:
- Databases, tables, views etc. can be secured in SQL and then the Synapse Managed Identity is used to access data in storage.
- It is also possible to pass the user identity to the storage service and let it determine if the user is allowed to access data or not. This is the case when Spark Parquet tables are replicated to the SQL-OD engine.
My evaluation – The Good
I used a common data loading and denormalization scenario to evaluate my Synapse Workspace. I copied the Airline CSV data to storage, two small CSV files (carriers and airports) and ~10GB of CSV files containing flight data (120 million records). A small Spark pool and notebook were used to convert the CSV files into parquet, create tables on them and use a CTAS statement to create a denormalized version in which all data was joined. A number of Spark SQL statements were used to perform some simple analytics like determining the average delay per carrier per mile, most popular airports to fly from etc.
The creation of the Spark pool is very simple but offers few choices in terms of VM sizes and Spark versions compared to Databricks. For most workloads the VM sizes have properly balanced CPU vs memory but some workloads require different setups which is not possible. Choosing a large Driver VM to run some expensive (non-distributed) Python / R code is not really possible. I must admit this example is not really best practice but can be very handy at times and some flexibility here would be nice.
When starting from paused mode it takes around 3 minutes for a Spark pool to start which is the same or even faster than Databricks. The Synapse notebooks provide a good experience allowing users to code and visualize results (see screenshot below). Progress of Spark jobs and stages can be tracked within the notebook and it is possible to access the Spark History server for more details. I could use my existing PySpark code I used before in Databricks without having to do any changes. So, loading and transforming the CSV files worked as expected. Once I created tables on the Parquet data I could use Spark SQL to analyze data and transform it further. Simple versioning is supported allowing for undo and redo but there is no repo integration or a way to view revisions made.
Besides using Spark SQL, I also used SQL-on-demand to perform the same queries (from Azure Data Studio, see screenshot below). A bit to my surprise this worked very well. The Spark Parquet tables were automatically synced to the SQL-OD engine and I could access the data as if it resided in an MS SQL Database. Performance seemed to be good and sometimes faster than Spark. It’s hard to compare as it is not possible to see how much compute power the SQL-OD engine is actually using. The SQL-OD engine seemed to require a little bit of warm up as subsequent queries on the same data typically were faster which implies the use of caching.
This really makes it easier to integrate a Data Lake with other applications such as Power BI, R-Studio, etc. Currently we use Databricks JDBC/ODBC for this scenario but that has the downside of cluster startup. The first one connecting may wait several minutes which is a bit of a bummer (and leaving clusters on all the time is often too expensive). Synapse SQL-OD does not have this downside.
My evaluation – The Bad
Although the overall experience was OK, I did encounter a number of stability issues. My notebooks got stuck a couple of times with a cell just in ‘waiting’ status. A session restart seemed to fix it most of the time but not always… Since it’s not possible to restart the Spark pool yourself there is no logical way to handle this situation. I ended up changing the Spark Pool auto pause to 1 minute to trigger it to restart and get my notebook working again. More control over the Spark Pool would be really nice to handle these kinds of situations which unfortunately can occur with Spark.
The same basically holds for Spark configuration and library management. Neither can really be managed at the moment. Only python libraries can be managed on a cluster level but not for the other supported languages Java/Scala, R and .NET. I didn’t need libraries for my case but that is rare. Often enough I use Java libraries to connect with sources or provide Spark with the JDBC driver. As this is not possible it really limits Synapse Spark usability. A decent number of libraries are loaded (see here) but in reality people will require to manage their own libs. We often build our own Java / Python libraries with common code and use that within Databricks but that scenario doesn’t seem possible yet with Synapse.
Last but not least, my Notebooks went out-of-sync with the internal Metastore resulting in file not found exceptions or missing data. This can happen with Spark if files are managed directly in the storage layer without Spark, or the Metastore knowing about it. This is exactly what I did by removing and adding files in the Data lake. There are several tricks to remedy this situation like REFRESH TABLES, FSCK REPAIR TABLE and ADD/REMOVE PARTITION statements. Unfortunately none of them seem to work and a session restart was needed to fix the issue.
I also tried a Synapse Workspace with ‘Managed VNET’ enabled which puts Synapse resources within isolated VNET’s. Often this is a must have from security perspective. Unfortunately I did not get my Spark notebooks to run in this mode as they could not connect to the Livy service. I did not have time to retry it with another Workspace so it might have been a glitch with my specific deployment.
As stated in the introduction I think Synapse Analytics Workspaces are ‘a diamond in the rough’. This is OK as it is very new and still in Preview. When it matures it may very well be my go-to service for all data use cases. But for now the Spark side has too many issues and I stick with Databricks even if it is 25% more expensive. Things I believe really need to be fixed are:
- DevOps integration (pipeline, notebook, SQL script versioning and CI/CD to another Synapse Workspace)
- More control over Spark:
- Spark Pool lifecycle, i.e. being able to start, stop and restart them manually
- Library management. Users must be able to add, update and remove their own libraries for any language.
- Spark configuration which is needed for certain tricks and optimizations
SQL-OD on Spark tables really shows promise and I might start using that soon to provide SQL on Data lake capabilities to our customers. Support for DML, SQL views and file formats other than Parquet like CSV, JSON and Delta would really be the cherry on the cake though. One can only dream…