Collect all Similar Persons to One Group
It will be better to make such calculations in iterative manner.
The "issue" of such solution is that, in case of large source table, the output table will contains all the rows. But, actually, this is the benefit also: in case of large source table, such code will work much more faster, that recursions.
if object_id('tempdb..#src') is not null drop table #src;
create table #src (id1 int not null, id2 int)
if object_id('tempdb..#rez') is not null drop table #rez;
create table #rez (id int not null, chainid int not null);
declare @chainId int;
insert #src
values (10, 20), (10, 30), (30, 30), (10, 40), (50, 70), (60, 50), (70, 70)
--values (4457745,255714),(4457745,2540222),(2540222,4457745),(255714,4457745)
set @chainId = 1
while 1 = 1
begin
insert #rez(id, chainid)
select top 1 id1, @chainId
from #src
where
Id1 NOT IN (select Id from #rez)
and id2 NOT IN (select Id from #rez)
if @@rowcount = 0 break
while 1 = 1
begin
insert #rez(id, chainid)
select id1, @chainid
from #src
where
id2 in (select id from #rez where chainid = @chainId)
and id1 not in (select id from #rez where chainid = @chainId)
union
select id2, @chainId
from #src
where
id1 in (select id from #rez where chainid = @chainId)
and id2 not in (select id from #rez where chainid = @chainId)
if @@rowcount = 0 break
end
select @chainId = @chainId + 1
end
select [NewId] = chainid, OldID = id
from #rez
order by 1, 2;
In some details, the "logic" of the script is the following:
set new ID (chainid) initial value (ex. 1)
get the first entry which was not found yet (external cycle)
- find and store all related to it (in both columns) (internal cycle)
increment the new ID
find next entry which is not in the result table
continue till have data to work
return sorted result
Here I adapted my old answer from a similar SO question How to find all connected subgraphs of an undirected graph.
This variant doesn't use a cursor or explicit loop, but uses a single recursive query.
Essentially, it treats the data as edges in a graph and traverses recursively all edges of the graph, stopping when the loop is detected. Then it puts all found loops in groups and gives each group a number.
See the detailed explanations of how it works below. I recommend you to run the query CTE-by-CTE and examine each intermediate result to understand what it does.
Sample data
DECLARE @T TABLE (ID1 int NOT NULL, ID2 int NOT NULL);
INSERT INTO @T (ID1, ID2) VALUES
(10, 20),
(10, 30),
(30, 30),
(10, 40),
(50, 70),
(60, 50),
(70, 70),
(4457745, 255714),
(4457745,2540222),
(2540222,4457745),
( 255714,4457745);
Query
WITH
CTE_Ids
AS
(
SELECT ID1 AS ID
FROM @T
UNION
SELECT ID2 AS ID
FROM @T
)
,CTE_Pairs
AS
(
SELECT ID1, ID2
FROM @T
WHERE ID1 <> ID2
UNION
SELECT ID2 AS ID1, ID1 AS ID2
FROM @T
WHERE ID1 <> ID2
)
,CTE_Recursive
AS
(
SELECT
CAST(CTE_Ids.ID AS varchar(8000)) AS AnchorID
,ID1
,ID2
,CAST(
',' + CAST(ID1 AS varchar(8000)) +
',' + CAST(ID2 AS varchar(8000)) +
',' AS varchar(8000)) AS IdPath
,1 AS Lvl
FROM
CTE_Pairs
INNER JOIN CTE_Ids ON CTE_Ids.ID = CTE_Pairs.ID1
UNION ALL
SELECT
CTE_Recursive.AnchorID
,CTE_Pairs.ID1
,CTE_Pairs.ID2
,CAST(
CTE_Recursive.IdPath +
CAST(CTE_Pairs.ID2 AS varchar(8000)) +
',' AS varchar(8000)) AS IdPath
,CTE_Recursive.Lvl + 1 AS Lvl
FROM
CTE_Pairs
INNER JOIN CTE_Recursive ON CTE_Recursive.ID2 = CTE_Pairs.ID1
WHERE
CTE_Recursive.IdPath NOT LIKE
CAST('%,' + CAST(CTE_Pairs.ID2 AS varchar(8000)) + ',%' AS varchar(8000))
)
,CTE_RecursionResult
AS
(
SELECT AnchorID, ID1, ID2
FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
SELECT AnchorID, ID1 AS ID
FROM CTE_RecursionResult
UNION
SELECT AnchorID, ID2 AS ID
FROM CTE_RecursionResult
)
SELECT
DENSE_RANK() OVER (ORDER BY CA_Data.XML_Value) AS [NewID]
,CTE_Ids.ID AS [OldID]
,CA_Data.XML_Value AS GroupMembers
FROM
CTE_Ids
CROSS APPLY
(
SELECT CAST(CTE_CleanResult.ID AS nvarchar(max)) + ','
FROM CTE_CleanResult
WHERE CTE_CleanResult.AnchorID = CTE_Ids.ID
ORDER BY CTE_CleanResult.ID FOR XML PATH(''), TYPE
) AS CA_XML(XML_Value)
CROSS APPLY
(
SELECT CA_XML.XML_Value.value('.', 'NVARCHAR(MAX)')
) AS CA_Data(XML_Value)
ORDER BY [NewID], [OldID];
Result
+-------+---------+-------------------------+
| NewID | OldID | GroupMembers |
+-------+---------+-------------------------+
| 1 | 10 | 10,20,30,40, |
| 1 | 20 | 10,20,30,40, |
| 1 | 30 | 10,20,30,40, |
| 1 | 40 | 10,20,30,40, |
| 2 | 255714 | 255714,2540222,4457745, |
| 2 | 2540222 | 255714,2540222,4457745, |
| 2 | 4457745 | 255714,2540222,4457745, |
| 3 | 50 | 50,60,70, |
| 3 | 60 | 50,60,70, |
| 3 | 70 | 50,60,70, |
+-------+---------+-------------------------+
How it works
CTE_Ids
CTE_Ids
gives the list of all Identifiers that appear in both ID1
and ID2
columns.
Since they can appear in any order we UNION
both columns together. UNION
also removes any duplicates.
+---------+
| ID |
+---------+
| 10 |
| 20 |
| 30 |
| 40 |
| 50 |
| 60 |
| 70 |
| 255714 |
| 2540222 |
| 4457745 |
+---------+
CTE_Pairs
CTE_Pairs
gives the list of all edges of the graph in both directions. Again, UNION
is used to remove any duplicates.
+---------+---------+
| ID1 | ID2 |
+---------+---------+
| 10 | 20 |
| 10 | 30 |
| 10 | 40 |
| 20 | 10 |
| 30 | 10 |
| 40 | 10 |
| 50 | 60 |
| 50 | 70 |
| 60 | 50 |
| 70 | 50 |
| 255714 | 4457745 |
| 2540222 | 4457745 |
| 4457745 | 255714 |
| 4457745 | 2540222 |
+---------+---------+
CTE_Recursive
CTE_Recursive
is the main part of the query that recursively traverses the graph starting from each unique Identifier.
These starting rows are produced by the first part of UNION ALL
.
The second part of UNION ALL
recursively joins to itself linking ID2
to ID1
.
Since we pre-made CTE_Pairs
with all edges written in both directions, we can always link only ID2
to ID1
and we'll get all paths in the graph.
At the same time the query builds IdPath
- a string of comma-delimited Identifiers that have been traversed so far.
It is used in the WHERE
filter:
CTE_Recursive.IdPath NOT LIKE
CAST('%,' + CAST(CTE_Pairs.ID2 AS varchar(8000)) + ',%' AS varchar(8000))
As soon as we come across the Identifier that had been included in the Path before, the recursion stops as the list of connected nodes is exhausted.
AnchorID
is the starting Identifier for the recursion, it will be used later to group results.
Lvl
is not really used, I included it for better understanding of what is going on.
+----------+---------+---------+--------------------------+-----+
| AnchorID | ID1 | ID2 | IdPath | Lvl |
+----------+---------+---------+--------------------------+-----+
| 10 | 10 | 20 | ,10,20, | 1 |
| 10 | 10 | 30 | ,10,30, | 1 |
| 10 | 10 | 40 | ,10,40, | 1 |
| 20 | 20 | 10 | ,20,10, | 1 |
| 30 | 30 | 10 | ,30,10, | 1 |
| 40 | 40 | 10 | ,40,10, | 1 |
| 50 | 50 | 60 | ,50,60, | 1 |
| 50 | 50 | 70 | ,50,70, | 1 |
| 60 | 60 | 50 | ,60,50, | 1 |
| 70 | 70 | 50 | ,70,50, | 1 |
| 255714 | 255714 | 4457745 | ,255714,4457745, | 1 |
| 2540222 | 2540222 | 4457745 | ,2540222,4457745, | 1 |
| 4457745 | 4457745 | 255714 | ,4457745,255714, | 1 |
| 4457745 | 4457745 | 2540222 | ,4457745,2540222, | 1 |
| 2540222 | 4457745 | 255714 | ,2540222,4457745,255714, | 2 |
| 255714 | 4457745 | 2540222 | ,255714,4457745,2540222, | 2 |
| 70 | 50 | 60 | ,70,50,60, | 2 |
| 60 | 50 | 70 | ,60,50,70, | 2 |
| 40 | 10 | 20 | ,40,10,20, | 2 |
| 40 | 10 | 30 | ,40,10,30, | 2 |
| 30 | 10 | 20 | ,30,10,20, | 2 |
| 30 | 10 | 40 | ,30,10,40, | 2 |
| 20 | 10 | 30 | ,20,10,30, | 2 |
| 20 | 10 | 40 | ,20,10,40, | 2 |
+----------+---------+---------+--------------------------+-----+
CTE_CleanResult
CTE_CleanResult
leaves only relevant parts from CTE_Recursive
and again merges both ID1
and ID2
using UNION
.
+----------+---------+
| AnchorID | ID |
+----------+---------+
| 10 | 10 |
| 10 | 20 |
| 10 | 30 |
| 10 | 40 |
| 20 | 10 |
| 20 | 20 |
| 20 | 30 |
| 20 | 40 |
| 2540222 | 255714 |
| 2540222 | 2540222 |
| 2540222 | 4457745 |
| 255714 | 255714 |
| 255714 | 2540222 |
| 255714 | 4457745 |
| 30 | 10 |
| 30 | 20 |
| 30 | 30 |
| 30 | 40 |
| 40 | 10 |
| 40 | 20 |
| 40 | 30 |
| 40 | 40 |
| 4457745 | 255714 |
| 4457745 | 2540222 |
| 4457745 | 4457745 |
| 50 | 50 |
| 50 | 60 |
| 50 | 70 |
| 60 | 50 |
| 60 | 60 |
| 60 | 70 |
| 70 | 50 |
| 70 | 60 |
| 70 | 70 |
+----------+---------+
Final SELECT
Now we need to build a string of comma-separated ID
values for each AnchorID
.
CROSS APPLY
with FOR XML
does it.
DENSE_RANK()
calculates the NewID
numbers for each AnchorID
.
This query builds a full path for each identifier, which is not really necessary.
It is possible to make the overall solution more efficient by running this query once for each group/subgraph, in a loop.
Pick a single starting Identifier (add WHERE ID = 10
to CTE_Ids
and CTE_Pairs
), then delete from the source table all identifiers that were appended to this group. Repeat until the big table is empty.