
The Problem
It didn’t match. Damn it. Why not? The issue was having to move a workload from SQL Server to Spark/Databricks. One of the columns on the table used a SHA-256 hash to create a key column. SQL Server output this:
SELECT HASHBYTES('SHA2_256', CAST('248' AS NVARCHAR(8))) AS [Key]
Result: 0x301DB0B8C94E22CC954E83F47964983011330F98A2A48BBC108FCC30F8445EBF
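Worth noting: HASHBYTES hashes the raw bytes of whatever you pass it, and you can see exactly which bytes that NVARCHAR cast produces with a quick cast to VARBINARY (the alias is just for illustration):
SELECT CAST(CAST('248' AS NVARCHAR(8)) AS VARBINARY(8)) AS KeyBytes
Result: 0x320034003800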
Spark/Databricks output this:
SELECT SHA2(ENCODE('248','UTF-8'), 256) AS Key
Result: 766cb53c753baedac5dc782593e04694b3bae3aed057ac2ff98cc1aef6413137
Ah right, I can see the issue: I should use UTF-16.
SELECT SHA2(ENCODE('248','UTF-16'), 256) AS Key
Result: cfa15dd8fe9e7262a7769904aff8fa34e3e494ad62eaa1c9eaeff0e729c1eb2e
Right, OK, that is not right either. It was looking like I would have to regenerate the whole column, and then the foreign keys etc. throughout the data warehouse. Not an easy thing to do, as it was massive.
After some messing around with encodings I found the answer… by asking a colleague! Thanks Sean!
To match SQL Server’s encoding you have to use a variant of UTF-16: UTF-16LE (little-endian).
SELECT SHA2(ENCODE('248','UTF-16LE'), 256) AS Key
Result: 301db0b8c94e22cc954e83f47964983011330f98a2a48bbc108fcc30f8445ebf
(Spark returns lowercase hex with no 0x prefix, but the bytes match SQL Server’s output.)
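SQL Server and Spark format the result differently (VARBINARY with a 0x prefix versus a lowercase hex string), so if you need the keys to agree as strings you have to normalise one side. A quick sketch using built-in functions on both platforms, with the aliases just for illustration:
SQL Server:
SELECT CONVERT(VARCHAR(64), HASHBYTES('SHA2_256', CAST('248' AS NVARCHAR(8))), 2) AS [Key]
Spark:
SELECT UPPER(SHA2(ENCODE('248', 'UTF-16LE'), 256)) AS Key
Result (both): 301DB0B8C94E22CC954E83F47964983011330F98A2A48BBC108FCC30F8445EBF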
Right, so now it matches, but why? UTF-16LE is normally used in Windows environments, so I suspect that this has been carried over to SQL Server: NVARCHAR columns are stored as UTF-16LE under the hood, and HASHBYTES hashes those stored bytes. Spark’s underlying cluster doesn’t seem to follow that convention, maybe something to do with Java or it running on Linux. I haven’t found a good answer as to why, but if you know, let me know!
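One way to see what was actually going on is to hex-dump the three encodings in Spark (HEX and ENCODE are both built-in functions; the aliases are just for illustration):
SELECT HEX(ENCODE('248', 'UTF-8')) AS UTF8Bytes
Result: 323438 (one byte per character)
SELECT HEX(ENCODE('248', 'UTF-16')) AS UTF16Bytes
Result: FEFF003200340038 (a byte-order mark, then big-endian characters)
SELECT HEX(ENCODE('248', 'UTF-16LE')) AS UTF16LEBytes
Result: 320034003800 (little-endian, no byte-order mark)
Only the UTF-16LE bytes are identical to the bytes SQL Server stores for NVARCHAR, which is why only that variant produces a matching hash. Java’s UTF-16 charset (which Spark’s ENCODE uses under the hood) writes a big-endian byte-order mark first, so even the plain UTF-16 attempt was hashing different bytes.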