How Synapse handles duplicate records with PRIMARY KEY and UNIQUE constraints
Having a primary key and/or a unique key allows the dedicated SQL pool engine to generate an optimal execution plan for a query, but it does not guarantee uniqueness or perform any uniqueness checks during insert operations. Users need to make sure that all values in a primary key column or a unique constraint column are unique themselves.
In Synapse, PRIMARY KEY is only supported when NONCLUSTERED and NOT ENFORCED are both used.
A UNIQUE constraint is only supported when NOT ENFORCED is used.
Creating an index with a unique constraint is not supported in Synapse.
By default, a dedicated SQL pool creates a clustered columnstore index when no index options are specified on a table.
Here is an example comparing the behavior with and without a unique constraint.
Without a unique constraint:
CREATE TABLE t1 (a1 INT, b1 INT)
INSERT INTO t1 VALUES (1, 100)
INSERT INTO t1 VALUES (1, 1000)
SELECT a1, COUNT(*) AS total FROM t1 GROUP BY a1
Result:
a1  total
1   2
With a unique constraint:
DROP TABLE t1
CREATE TABLE t1 (a1 INT UNIQUE NOT ENFORCED, b1 INT)
INSERT INTO t1 VALUES (1, 100)
INSERT INTO t1 VALUES (1, 1000)
SELECT a1, COUNT(*) AS total FROM t1 GROUP BY a1
Result:
a1  total
1   1
1   1
Here, with the unique constraint, both rows with a1 = 1 are treated as unique records, so the query returns an inaccurate count.
With a primary key:
DROP TABLE t1
CREATE TABLE t1 (a1 INT NOT NULL, b1 INT)
ALTER TABLE t1 ADD CONSTRAINT PK_t1_a1 PRIMARY KEY NONCLUSTERED (a1) NOT ENFORCED
INSERT INTO t1 VALUES (1, 100)
INSERT INTO t1 VALUES (1, 1000)
SELECT a1, COUNT(*) AS total FROM t1 GROUP BY a1
Result:
a1  total
1   1
1   1
Please note: in both cases, users need to make sure all values in those columns are unique. A violation may cause queries to return inaccurate results.
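Before declaring a NOT ENFORCED constraint on a column, a quick sanity check is a GROUP BY query that surfaces any duplicates. The sketch below uses the t1 table from the examples above:

-- Find any a1 values that appear more than once.
-- An empty result means the column is currently safe
-- to declare as UNIQUE NOT ENFORCED or PRIMARY KEY NOT ENFORCED.
SELECT a1, COUNT(*) AS occurrences
FROM t1
GROUP BY a1
HAVING COUNT(*) > 1

Running this check periodically (or after each load) helps catch violations before they silently corrupt query results.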
But the question is: how do we enforce data uniqueness while inserting records into the tables?
Unfortunately, Synapse can't enforce data uniqueness during inserts, because PRIMARY KEY and UNIQUE constraints are only supported when NOT ENFORCED is used.
These are called informational constraints (used for building an optimal execution plan). Informational constraints are not enforced by the database engine and are not used for additional verification of data; rather, they are used to improve query performance.
However, we can use other methods, such as hashing in a data flow, to detect duplicate rows before inserting them into Synapse.
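As a rough sketch of the hashing idea expressed directly in T-SQL (the staging table name stg_t1 is an assumption, standing in for data landed by a pipeline), a row hash can be computed with HASHBYTES and only rows whose hash is not already present get inserted:

-- stg_t1 is a hypothetical staging table loaded by the pipeline.
-- Hash the columns that define row identity and skip rows whose
-- hash already exists in the target table t1.
INSERT INTO t1 (a1, b1)
SELECT s.a1, s.b1
FROM stg_t1 AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM t1 AS t
    WHERE HASHBYTES('SHA2_256', CONCAT(t.a1, '|', t.b1))
        = HASHBYTES('SHA2_256', CONCAT(s.a1, '|', s.b1))
)

In an ADF mapping data flow, the equivalent approach would be a derived column computing a hash over the key columns, followed by an Exists transformation against the target table to filter out rows already present.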
Below is a video tutorial explaining hashing with ADF and Synapse.