Distinct on Postgresql JSON data column

If you use PG 9.4 , using JSONB rather than JSON solves this problem Example :

-- JSON datatype test 

create table t1 (id int, val json);
insert into t1 (id,val) values (1,'{"name":"value"}');
insert into t1 (id,val) values (1,'{"name":"value"}');
insert into t1 (id,val) values (2,'{"key":"value"}');
select * from t1 order by id;
select distinct * from t1 order by id;

-- JSONB datatype test 

create table t2 (id int, val jsonb);
insert into t2 (id,val) values (1,'{"name":"value"}');
insert into t2 (id,val) values (1,'{"name":"value"}');
insert into t2 (id,val) values (2,'{"key":"value"}');

select * from t2 order by id;

select distinct * from t2 order by id;

Result of running the above script :

CREATE TABLE
INSERT 0 1
INSERT 0 1
INSERT 0 1
1 | {"name":"value"}
1 | {"name":"value"}
2 | {"key":"value"}

ERROR:  could not identify an equality operator for type json
LINE 1: select distinct * from t1 order by id;
                    ^
CREATE TABLE
INSERT 0 1
INSERT 0 1
INSERT 0 1
1 | {"name": "value"}
1 | {"name": "value"}
2 | {"key": "value"}

1 | {"name": "value"}
2 | {"key": "value"}

As you can see PG succeeded to imply DISTINCT on a JSONB column while it fails on a JSON column !

Try also the following to see that actually keys in the JSONB are sorted :

insert into t2 values (3, '{"a":"1", "b":"2"}');
insert into t2 values (3, '{"b":"2", "a":"1"}');
select * from t2;

1 | {"name": "value"}
1 | {"name": "value"}
2 | {"key": "value"}
3 | {"a": "1", "b": "2"}
3 | {"a": "1", "b": "2"}

note that '{"b":"2", "a":"1"}' was inserted as '{"a":"1", "b":"2"}' therefor PG identifies that as the same record :

select distinct * from t2;
3 | {"a": "1", "b": "2"}
2 | {"key": "value"}
1 | {"name": "value"}

Yeah, unfortunately postgres json doesn't implement equality, but jsonb does. So migrating json columns to jsonb and it should work okay.


Sorry I'm late on this answer, but it might help others.

As I understand your query, you're only getting possible duplicates on profiles because of the many-to-many join to integrations (which you're using to determine which profiles to access).

Because of that, you can use a new GROUP BY feature as of 9.1:

When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or if the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.

So in your case, you could get Ruby to create the query (sorry, I don't know the Ruby syntax you're using)...

SELECT profiles.* 
FROM "profiles" 
  INNER JOIN "integration_profiles" ON "profiles"."id" = "integration_profiles"."profile_id" 
  INNER JOIN "integrations" ON "integration_profiles"."integration_id" = "integrations"."id" 
WHERE "integrations"."user_id" = $1
GROUP BY "profiles"."id"

I only removed the DISTINCT from your SELECT clause and added the GROUP BY.

By referring ONLY to the id in the GROUP BY, you take advantage of that new feature because all the remaining profiles columns are "functionally dependent" on that id primary key.

Somehow, wonderfully that avoids the need for Postgres to do equality checks on the dependent columns (ie your json column in this case).

The DISTINCT ON solution is also great, and clearly sufficient in your case, but you can't use aggregate functions like array_agg with it. You CAN with this GROUP BY approach. Happy days! :)


The reason behind this, is that in PostgreSQL (up to 9.3) there is no equality operator defined for json (i.e. val1::json = val2::json will always throw this exception) -- in 9.4 there will be one for the jsonb type.

One workaround is, you can cast your json field to text. But that won't cover all json equalitions. f.ex. {"a":1,"b":2} should be equal to {"b":2,"a":1}, but won't be equal if casted to text.

Another workaround is (if you have a primary key for that table -- which should be) you can use the DISTINCT ON (<expressions>) form:

u.profiles.select("DISTINCT ON (profiles.id) profiles.*")

Note: One known caveat for DISTINCT ON:

The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.