How synapse handles duplicate records with Primary and Unique constraints

March 20, 2023 0 By Bhargava

Having primary key and/or unique key allows dedicated SQL pool engine to generate an optimal execution plan for a query. but it does not guarantee uniqueness or perform record uniqueness checks during insert operations. users needs to make sure primary key column or a unique constraint column values are unique.

In Synapse, PRIMARY KEY is only supported when NONCLUSTERED and NOT ENFORCED are both used.

UNIQUE constraint is only supported when NOT ENFORCED is used.

The creation of indexes with unique constraint is not supported in Synapse.

By default, a dedicated SQL pool creates a clustered columnstore index when no index options are specified on a table.

Here is an example with a unique constraint:

Without unique constraint

CREATE TABLE t1 (a1 INT, b1 INT)

INSERT INTO t1 VALUES (1, 100)

INSERT INTO t1 VALUES (1, 1000)

SELECT a1, COUNT(*) AS total FROM t1 GROUP BY a1

result:

1 2

with unique constraint

drop table t1

CREATE TABLE t1 (a1 INT UNIQUE NOT ENFORCED, b1 INT)

INSERT INTO t1 VALUES (1, 100)

INSERT INTO t1 VALUES (1, 1000)

SELECT a1, COUNT(*) AS total FROM t1 GROUP BY a1

result:

1 1

1 1

Here, with the unique constraint, both a1 values are considered as unique records.

With Primary Key:

drop table t1

CREATE TABLE t1 (a1 INT NOT NULL, b1 INT)

ALTER TABLE t1 add CONSTRAINT PK_t1_a1 PRIMARY KEY NONCLUSTERED (a1) NOT ENFORCED

INSERT INTO t1 VALUES (1, 100)

INSERT INTO t1 VALUES (1, 1000)

SELECT a1, COUNT(*) AS total FROM t1 GROUP BY a1

Result:

1 1

1 1

Please note: In both cases, Users need to make sure all values in those columns are unique. A violation of that may cause the query to return inaccurate result.

But the question is, how to enforce data uniqueness while inserting the records into the tables.

 Unfortunately, synapse can’t enforce data uniqueness while inserting the records into the tables.

The reason is Primary key, and Unique constraints are only supported when NOT ENFORCED is used.

This is called Information Constraints(for the optimal execution plan).

Informational constraints are not enforced by the database engine, and are not used for additional verification of data; rather, they are used to improve query performance.

But we can use other methods like Hashing in dataflow for detecting duplicate rows before inserting them into the synapse.

Below is a video tutorial explaining Hashing with ADF and synapse.