
The rather long-named Azure Databricks Synapse Connector (DSC) allows you to connect to Synapse Dedicated (and only Dedicated) SQL Databases and query them. MS has a nice outline on how to do it here. After working with it for a while now, there are a few things that you need to consider.
First up, you don’t have to install the connector; it’s a JDBC connection to the Synapse DB. It supports the usual runtime suspects, Scala, Python, SQL and R, and can be used for reading and writing data from/to Synapse. You can copy the data over to your Databricks/Datalake, or use it in a kind of Direct Query mode, where it sends a query to the source and returns the data.
I’ve been using the SQL runtime to create a set of tables that the user(s) can access in Databricks.
Creating Tables in Databricks
So for these DSC tables, you don’t have to do a full ‘CREATE TABLE’ statement. It is more like a view than a table, in that it just sits over the source table in Synapse.
CREATE TABLE dsc_sandbox_tablename
USING com.databricks.spark.sqldw
OPTIONS ( url 'jdbc:sqlserver://sqlsrv_name.azuresynapse.windows.net:1433;database=sqldw_name;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.azuresynapse.windows.net;loginTimeout=30;'
, useAzureMSI 'true'
, enableServicePrincipalAuth 'true'
, dbtable 'schema.sqldw_tablename'
, tempDir 'abfss://dsc_temp@mydatalake.dfs.core.windows.net/schema/tablename');
Let’s break this down:
CREATE TABLE dsc_sandbox_tablename
USING com.databricks.spark.sqldw
So this creates the table; as you can see, there’s no setting up of column names and data types.
For the options:
url 'jdbc:sqlserver://sqlsrv_name.azuresynapse.windows.net:1433;database=sqldw_name;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.azuresynapse.windows.net;loginTimeout=30;'
This sets up the connection string to the database, and you need to specify the instance name and the database you would like to connect to.
useAzureMSI 'true'
enableServicePrincipalAuth 'true'
In this case, I’ve got a service principal that has been added to Synapse and to the Azure Data Lake storage account, and which can also be used by Databricks. This service principal needs to allow the Azure Synapse DB to write to the storage account that you will be using in a later step. The ‘useAzureMSI’ option is also used in querying the SQL DB. Have a look at the MS Documentation for a full list of authentication methods.
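As a rough guide, adding the service principal to the dedicated SQL pool and giving it read access looks something like the T-SQL below, run against the Synapse DB (the principal name ‘my-databricks-sp’ is just a placeholder):
-- Run against the Synapse dedicated SQL pool; the principal name is a placeholder
CREATE USER [my-databricks-sp] FROM EXTERNAL PROVIDER;
-- Read access so queries sent from Databricks can run against the tables
EXEC sp_addrolemember 'db_datareader', 'my-databricks-sp';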
dbtable 'schema.sqldw_tablename'
So you do have two options here. You can have this Databricks table just sit over your Synapse table, or you can use the query option. The query option isn’t shown in the MS examples for SQL connections (only for Scala, Python and R), but it can be used.
query 'SELECT column1, column2 FROM schema.sqldw_tablename'
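Putting that together, a connected table built on the query option looks roughly like this; it is the same pattern as before, just with query in place of dbtable (the table name dsc_sandbox_query_tablename is a made-up example):
CREATE TABLE dsc_sandbox_query_tablename
USING com.databricks.spark.sqldw
OPTIONS ( url 'jdbc:sqlserver://sqlsrv_name.azuresynapse.windows.net:1433;database=sqldw_name;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.azuresynapse.windows.net;loginTimeout=30;'
, useAzureMSI 'true'
, enableServicePrincipalAuth 'true'
, query 'SELECT column1, column2 FROM schema.sqldw_tablename'
, tempDir 'abfss://dsc_temp@mydatalake.dfs.core.windows.net/schema/tablename');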
The next bit is for the temporary storage: the Synapse DB will send the results from the query back to a storage location for Databricks to then read. You have to declare the folder names yourself; it doesn’t derive them from the schema and table name of the table I’ve named in the connection.
, tempDir 'abfss://dsc_temp@mydatalake.dfs.core.windows.net/schema/tablename'
As I mentioned, this connected table is more like a view than a table. All the code does is create the connection; there is no structure to define, and nothing is stored until you query this connected item. So in Databricks I’ll just run a normal SQL query on this table. What happens when I do?
SELECT * FROM dsc_sandbox_tablename
The query hits the connected table, which passes it to the Azure Synapse DB set in the JDBC connection.
The Azure Synapse DB then runs the query and writes the results to the blob storage as parquet files, which Databricks then reads and uses/displays.
There is no caching! So every time you query that connected table it will write something to the blob storage. What does it write? Well, using the above example of container ‘dsc_temp’ and folder ‘schema/tablename’:
dsc_temp/schema/tablename/2023-01-31/10-09-01-962/b90ff91f-482f-45ac-8f45-679263d6a838
‘2023-01-31/10-09-01-962/b90ff91f-482f-45ac-8f45-679263d6a838’ is in the format of the date, then the time, then a unique id.
In that folder will be a bunch of snappy-compressed parquet files with the data in them.
As mentioned, there is no caching! If you run the same or a new query again, it will create a new folder with the new query run date, time and id, which can lead to your blob storage having a number of folders and items lingering about. So set up a rule in the lifecycle management section of your storage account to delete blobs older than 1 day, or whatever time period you like.
Optimising Performance
Some things you may need to consider in terms of performance. What does not seem to be a good approach is setting up a number of connected tables in Databricks and then writing a complicated view on top of them. So for example, say you have in Databricks a view that sits over the connected tables:
SELECT t1.columnA
, t2.columnB
, t3.columnC
FROM dbx_tableA AS t1
LEFT JOIN dbx_tableB AS t2
ON t1.columnD = t2.columnD
LEFT JOIN dbx_tableC AS t3
ON t2.columnE = t3.columnE
This will perform significantly more slowly than creating a single connected table that uses the query option with that query in it. It looks like Databricks sends a query down for each of those tables and then does the joins itself, rather than sending one optimised query to the Synapse DB instance. So for complicated views, let the Synapse DB do all the hard work via the query option. (Note: I do have to test this fully by profiling what queries are actually being sent down to the Synapse DB.)
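As a rough sketch of that approach (with made-up Synapse table names standing in for whatever dbx_tableA/B/C point at), the same result set can be defined as a single connected table so the joins run inside Synapse:
CREATE TABLE dsc_sandbox_joined
USING com.databricks.spark.sqldw
OPTIONS ( url 'jdbc:sqlserver://sqlsrv_name.azuresynapse.windows.net:1433;database=sqldw_name;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.azuresynapse.windows.net;loginTimeout=30;'
, useAzureMSI 'true'
, enableServicePrincipalAuth 'true'
, query 'SELECT t1.columnA, t2.columnB, t3.columnC
         FROM schema.sqldw_tableA AS t1
         LEFT JOIN schema.sqldw_tableB AS t2 ON t1.columnD = t2.columnD
         LEFT JOIN schema.sqldw_tableC AS t3 ON t2.columnE = t3.columnE'
, tempDir 'abfss://dsc_temp@mydatalake.dfs.core.windows.net/schema/joined');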
The Tricky Bits
Most of the issues experienced in creating the connection have been permissions issues: Databricks being able to run a query on the Synapse DB, and the Synapse DB being able to write to the temp storage.
In the above examples, a service principal has been used to authenticate to the various moving parts. Ensure that the SP can run a query on the SQL server, and that it has the ‘Storage Blob Data Contributor’ role on the storage account so it can write back to the storage.
If possible, make your Databricks cluster storage aware by adding the SP details in an ini file or cluster setting; if not, you’ll have to define the service principal in the notebook (the MS Docs site has examples). Depending on your setup, you may also need to create a database scoped credential to allow the SQL DB to write back to the storage account.
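If you do need the database scoped credential, it looks something like the T-SQL below, run against the dedicated SQL pool; this assumes the database doesn’t already have a master key, and ‘msi_cred’ is just a placeholder name:
-- Needed once per database before a scoped credential can be created
CREATE MASTER KEY;
-- Lets the dedicated SQL pool use its managed identity when writing to the tempDir storage
CREATE DATABASE SCOPED CREDENTIAL msi_cred WITH IDENTITY = 'Managed Service Identity';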
The other main issue is when you define an SQL query: if your Synapse DB schema and/or table name has, for example, a ‘-’ in it, you can’t write the query with [square brackets] around them; it has to be “double quotes” around the schema and table names. It also doesn’t seem to like CTEs when used in the query option, so if you can, rewrite them as a join.
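As a rough illustration with made-up names, the query option wants the identifiers quoted like this, with the CTE folded into a join on a subquery instead:
query 'SELECT s.column1, a.avg_value
       FROM "my-schema"."my-table" AS s
       LEFT JOIN (SELECT columnD, AVG(columnB) AS avg_value
                  FROM "my-schema"."other-table"
                  GROUP BY columnD) AS a
       ON s.columnD = a.columnD'
That is, "my-schema"."my-table" in place of [my-schema].[my-table], and a subquery join in place of a WITH … AS (…) CTE.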