Is there a SQL Server implementation of the Longest Common Substring problem?
This can be done rather easily as a SQLCLR User-Defined Aggregate (UDA). An aggregate operates over a set of rows, so you would be able to operate over all rows or just a subset, based on a WHERE
condition and optional GROUP BY
(if wanting to operate over separate sets of rows).
BUT, whether or not you should do this depends on what you are planning on doing with the result. If this is a one-off project to do some research that won't be repeated, then it is probably best to just make a small Console App to read in the rows and process accordingly.
However, if you do have some need to use the returned value within a database-centered process, then SQLCLR should be fine (assuming you follow the two recommendations mentioned under "The following tricks can be used to reduce the memory usage of an implementation" in the Pseudocode section. You will just need to find a "creative" way of dealing with the situation of having multiple results for what is considered the longest common substring (i.e. if 2 or more common substrings tie for "first place"). Using the example from the Wikipedia page (which shows 2 matches for the 2 strings):
ABAB
BABA
Returns both:
- BAB
- ABA
Perhaps returning an XML document of matches since that is parsable and can contain any string (when properly escaped).
UPDATE (updated, and updated again)
The .NET C# source code for this can be found on Pastebin.com at:
SQLCLR UDA for Longest Common Substring - Source Code
And for anyone that wants to play with this UDA without compiling it, an installation T-SQL script (no external DLL) can be found on Pastebin.com at:
SQLCLR UDA for Longest Common Substring - Installer
The UDA should be fairly memory-efficient as it only stores substrings that match the current string and all previously encountered strings. As a new row calls the UDA, any substrings that are not found in the new string are removed from the collection.
It is also CPU-efficient in that, if at any point the number of "common" substrings goes to zero, it sets a flag indicating that no substrings are even possible and short-cicruits all future executions to simply exit upon being called. This happens immediately if an empty string is encountered. In any of these cases, an empty XML document (i.e. root element only) is returned. The meaning of the empty document is equivalent to an empty string, since the only thing the input strings have in common is that they were non-NULL
strings.
NULL
, in my interpretation, is ignored and does not indicate no possible matches like an empty string does.
A NULL
is returned in the following two cases:
- All inputs are
NULL
- Only a single non-
NULL
row was in the set and hence there was no other string to compare to, hence nothing to be considered "common".
I added a second input parameter which controls whether or not the return is just the longest common substrings or all common substrings. In the case of returning all, an attribute is added to each "item" to indicate whether or not it is one of the "longest" substrings:
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test1a]
FROM (VALUES (N'ABAB'), (N'BABA')) tab(col);
Returns:
<Items Merged="False">
<Item>ABA</Item>
<Item>BAB</Item>
</Items>
And
SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test1b]
FROM (VALUES (N'ABAB'), (N'BABA')) tab(col);
Returns:
<Items Merged="False">
<Item IsLongest="True">ABA</Item>
<Item IsLongest="True">BAB</Item>
<Item IsLongest="False">AB</Item>
<Item IsLongest="False">BA</Item>
<Item IsLongest="False">A</Item>
<Item IsLongest="False">B</Item>
</Items>
Also, the comparisons are now case-InSensitive to match the typical Collation.
Below are 16 more test cases that check functionality only, not performance. I will post additional tests later that operate over many rows of much longer strings. I intentionally left out Combining Characters and Supplementary Characters, for now, as they are bit a more complicated.
SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test2]
FROM (VALUES (N'ABAB'), (N'BABA'), (N'2BAB5')) tab(col);
-- <Items><Item>BAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test3]
FROM (VALUES (N'ABAB'), (N'BABA'), (NULL), (N'2BAB5')) tab(col);
-- <Items><Item>BAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test4]
FROM (VALUES (NULL), (NULL), (NULL)) tab(col);
-- NULL
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test5]
FROM (VALUES (N'ABAB'), (N'BABA'), (N''), (N'2BAB5')) tab(col);
-- <Items />
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test6]
FROM (VALUES (N'ABAB'), (N'BABA'), (N'L'), (N'2BAB5')) tab(col);
-- <Items />
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test7]
FROM (VALUES (N'ABAB')) tab(col);
-- NULL
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8a-DuplicatesAcross2Rows]
FROM (VALUES (N'ABAB'), (N'ABAB')) tab(col);
-- <Items><Item>ABAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8b-DuplicatesAcross3Rows]
FROM (VALUES (N'ABAB'), (N'ABAB'), (N'ABAB')) tab(col);
-- <Items><Item>ABAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8c-DuplicatesAcross4Rows]
FROM (VALUES (N'ABAB'), (N'ABAB'), (N'ABAB'), (N'ABAB')) tab(col);
-- <Items Merged="False"><Item>ABAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test9-DuplicatesWithinOneString]
FROM (VALUES (N'ABAB'), (N'zABABh2348923ABABf')) tab(col);
-- <Items Merged="False"><Item>ABAB</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test10-XmlEncodableCharacters]
FROM (VALUES (N'ABA&B'), (N'zABA&Bh2348923ABA&Bf')) tab(col);
-- <Items Merged="False"><Item>ABA&B</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11a-FinalMatchesShorterThanInitialSet]
FROM (VALUES (N'ABCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'512tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11b-FinalMatchesShorterThanInitialSet]
FROM (VALUES (N'BCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'512tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11c-FinalMatchesShorterThanInitialSet]
FROM (VALUES (N'ABCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'5123tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item><Item>123</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11d-FinalMatchesShorterThanInitialSet]
FROM (VALUES (N'BCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'5123tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item><Item>123</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test11e-FinalMatchesShorterThanInitialSet]
FROM (VALUES (N'BCDq1234g'), (N'1234qABCDg'), (N'uiyuiuy1234qBCDg'), (N'123tttrtrtBCDdfdfgdg')) tab(col);
-- <Items Merged="False"><Item>BCD</Item><Item>123</Item></Items>
SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test12-CaseInSensivity]
FROM (VALUES (N'AbAB'), (N'BAbA')) tab(col);
/*
<Items Merged="False">
<Item IsLongest="True">AbA</Item>
<Item IsLongest="True">bAB</Item>
<Item IsLongest="False">Ab</Item>
<Item IsLongest="False">bA</Item>
<Item IsLongest="False">A</Item>
<Item IsLongest="False">b</Item>
</Items>
*/
Final Update (hopefully)
I made a couple of minor changes to the test code provided in @MisterMagoo's answer so that it would: a) return ties for "longest" common substring, and b) include the last character in each string as part of the search. I also changed, slightly, how the initial test rows are gathered such that there would be just over 1 million rows, and re-ran the tests. The results were that the T-SQL version took 1 minute and 13 seconds while the SQLCLR UDA took only 12 seconds (and even has the option of returning all common substrings, not just the longest).
I then modified the test data to include another common substring of the same length as the current winner (7 characters), a shorter but still common substring of 4 characters, and 3 random characters. This increased the max test string size from 32 characters to 46 characters. Running the test again (same code), the T-SQL version had to be killed at 23 minutes and had only tested the first 3 lengths: 27, 26, and 25. The SQLCLR UDA returned in about 1 minute and 10 seconds.
Is it time to celebrate? Has SQLCLR saved the day? Hold on..
On a hunch, I decided to see if the optimization I added to the SQLCLR version would help the T-SQL, namely:
Grab the shortest two strings (shortest would have the fewest number of possible substrings and would be the longest possible match anyway).
Extract all possible common substrings from the two short strings (any substrings that are "common" across all rows necessarily must be in the set derived from just these two strings, and no new substrings from other rows can be introduced as they wouldn't be "common").
Starting with the longest common substring, test to see if it is found in all rows. I used
IF (NOT EXISTS (WHERE CHARINDEX(substring, test_row) > 0))
because theEXISTS
clause will exit on the first row that returns a 0 (meaning substring not present) and hence not need to test all rows. This part is accomplished with aCURSOR
(shh, don't tell anyone) because it allows for starting at the top of the list and just picking each new row off without needing to re-scan the list each time to find the next entry.Once the first substring is found, save its length in a variable, and save the substring itself into a table variable.
Keep testing, but only for strings that are the same length as the first common substring found. as soon as the length of the next substring to search for is less than the length of the first substring to match, exit the loop and clean up the cursor.
All of this cursor stuff must make it painfully slow, right? Well, keeping in mind that the prior T-SQL version would have taken several hours to finish, and the SQLCLR UDA took 1 minute and 10 seconds, you might guess this updated version would take, what? A few minutes? 10 minutes? More? After all, just mentioning "cursor" is an automatic 5 minute hit, right? The actual time for the modified version:
22 seconds!!!
Of course, that is largely due to the substrings being on the longer side (compared to the size of the strings) and so the loop was able to exit earlier than if the longest common substring was only 3 or 4 characters long (i.e. it had less to test, but that is still a valid case and is not cheating). Also, one benefit of working in T-SQL is that you can test the entire set against a single value, whereas SQLCLR only has the current row and cannot see the entire set, hence the SQLCLR cannot short-circuit upon finding the longest common substring (because it won't know what is "common" in the first place until it has been executed across all rows).
A final change was to allow the new T-SQL version to return all "common" substrings, not just the longest ones, while indicating which ones were the longest in a second column of datatype BIT
. When returning all substrings, mirroring the functionality of the SQLCLR UDA (even when the UDA only returns the longest common substrings, it still has the full list of all common substrings stored since it, again, has no ability to short-circuit), the T-SQL version returns in 2 minutes and 41 seconds.
So, there are conditions were the T-SQL can be faster, even at 1.2 million rows. But the performance is far more stable for the SQLCLR version, and definitely faster when you want all common substrings.
The test script, containing all 3 tests along with their results, can be found on Pastebin at:
SQLCLR UDA for Longest Common Substring - Testing
P.S. Even though the test was done over 1.2 million rows, the longest string tested was 46 characters. I am not sure how the performance of either approach would be affected by operating over strings that are much longer. That testing will have to wait until there is time to do it ;-).
Solomon is probably right, but until a CLR solution shows up, here's a T-SQL version to play with.
/* Create some test data : on SQL 2016 this creates about 470K rows to test (with duplicates) */
if object_id('tempdb..#Strings') is not null drop table #Strings
select a.Name as String, len(a.Name)+1 as StringLength
into #Strings
from sys.all_columns a, sys.all_columns b
where a.Name like '%refer%';
set nocount on;
/* Any nulls mean there is not a "longest common string" */
if exists(select 1 from #Strings where String is null)
begin
return;
end;
/* We need to know number of rows in the sample and the length of the shortest string
as the longest common substring cannot be longer than the shortest string in the set */
declare @totalrows int;
declare @minlen tinyint;
declare @result varchar(50);
select @minlen = min(StringLength-1), @totalrows = count(distinct String) from #Strings;
raiserror(N'Maximum Possible Length: %d Total Distinct Rows: %d',0,0,@minlen,@totalrows) with nowait;
/* Check backwards from the longest possible string to the shortest and break when we find a match */
/* You might want to check the air conditioner is switched on here */
while @minlen>1 and @result is null
begin
raiserror(N'Processing strings of length: %d',0,0,@minlen) with nowait;
/* this method is "brute force"
1. find all substrings for each input string
2. pick the first substring that appears in every input string
we find this by grouping, counting and comparing to the number of input strings
*/
select top(1) @result=match
from (
select String, substring(String, T.N, @minlen) match
from #Strings
cross apply ( select StringLength - @minlen ) a(L)
cross apply (
select top(a.L) V.N
from (
values(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50)
) V(N)
) T(N)
) matches(String, match)
group by match
having count(distinct String) = @totalrows;
-- order by match;
/* Decrement so next time we check for a shorter match */
set @minlen = @minlen -1;
end
/* display the result */
select 'Longest Common Substring: '+isnull(@result,'*** no match found ***');
UPDATED on 20171127 at 11:58PM CST TO HANDLE UNIGRAMS
I know I'm a bit late here (by more than a year) but, my newest t-sql Longest Common Substring function is several thousand times faster than anything I've seen posted anywhere, including the aforementioned CLR (I just tested it).
It's a semi brute force technique I came up that:
Breaks up the shorter of the two strings and searches the longer substring for a match.
Accepts a 3rd parameter called a "window" (it's more like an NTile parameter but I use the term "window" because few people understand Ntile.) This is the secret sauce that makes this bad dog so fast
The routine then uses a pure brute force to see if the substrings size 20 or less in the short string also exist in the longer string. Using a tally table - the brute force approach for substrings size 20 or less will be instantaneous (0 ms).
After that it searches the longer string for substrings that exist in the shorter string whose size is evenly divisible by @window. For example, if the @window = 100, it will search for matching substrings that are of length 100, 200.... up to the length of the longer string (e.g. if the longer string is 515 characters long it will search for matching substrings that are 100, 200, 300, 400 and 500 characters long.
Once we have identified which "window" the longest substring lives in, I use a tally table and a "gaps & islands on strings" trick I learned from Chris Morris (discussed here) to compare each string for matching substrings.
TOP 1 with ties is used to identify the longest substring.
The function(s):
CREATE FUNCTION dbo.getshortstring8k(@s1 varchar(8000), @s2 varchar(8000))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
s1 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s1 ELSE @s2 END,
s2 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s2 ELSE @s1 END;
GO
CREATE FUNCTION dbo.lcssWindowAB(@s1 varchar(8000), @s2 varchar(8000), @window int)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
/*****************************************************************************************
Purpose
Calculates the longest common substring between two varchar(n) strings up to 8000
characters each.
Developer Notes:
1. Optimal performance gains will be seen on longer strings. All Longest Common Substring
functions I have seen in SQL Server begin to choke at 500-1000. Set based Brute force
solutions that use a tally table are very fast for up to 1000 characters but begin to
slow dramatically after.
With N as the length of a string, The number of substrings is: N(N+1)/2
For 1,000 character string: 1000(1000+1)/2 = 500,500 substrings
For 2,000 characters: 2000(2000+1)/2 = 2,001,000 substrings
2. Requires a materialized tally table beginning with 1 containing at least 8000 numbers;
as written, this function will slow to a crawl using a cte tally table. This is due to
the WHERE x.n BETWEEN 2 AND 10 clause. A CTE tally table will struggle with this but a
correctly indexed materialized tally table will not.
For optimal performance your tally table should have a unique nonclustered index.
THE DDL TO BUILD THE REQUIRED TALLY TABLE IS BELOW.
3. Performance optimizations:
3.1. The first major optimization is that the *shorter* of the two strings is broken up
into substrings which are then searched for in the larger string using CHARINDEX.
This reduces the number of substrings generated by:
abs(len(@s1)-len(@s2)) * (abs(len(@s1)-len(@s2)) + 1) / 2
For example, if one is 20 characters longer than the other, 210 fewer substrings
will be evaluated; for 100 it's 5,050 fewer substrings, etc.
3.2. The second optimization is a technique I developed that I call "windowing". I use
a @window parameter to break up the search specific windows for the presense of a
matching substring. 1-10 legnth substrings in the smaller string are searched for
in the longer string. After that, only tokens with sizes evenly divisible by the
@window parameter are evaluated. For example, say the short string length is 100
and I set @window 20. All tokens sized 1-10 are searched for, the 20-grams, then
40, 60... 100. This reduces the number of substrings from 5,050 to somewhere
beween 480 & 500 (depending on the size of the longest common substring.)
4. The window parameter is for optimization only! It does not affect the final result set
in any way. I strongly suggest that testing the function using different window sizes
for different scenarios; the optimal size will vary. I have seen queries execute 3-10
faster when changing the window size. Start with 100, then try windows sizes of
20, 200, 300, 1000...
5. This function does not see a performance gain when run in parallel;
use option (maxdop 1) unless you're testing shows performance gains when running under
a parallel execution plan.
Required Tally Table DDL (this will build exactly 8000 rows):
------------------------------------------------------------------------------------------
-- (1) Create tally table
IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;
CREATE TABLE dbo.tally (N int not null);
-- (2) Add numbers to the tally table
WITH DummyRows(V) AS (
SELECT 1 FROM (VALUES -- 100 Dummy Rows
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($)) t(N))
--INSERT dbo.tally
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c;
-- (3) Required constraints (and indexes) for performance
ALTER TABLE dbo.tally ADD CONSTRAINT pk_cl_tally PRIMARY KEY CLUSTERED(N)
WITH FILLFACTOR = 100;
ALTER TABLE dbo.tally ADD CONSTRAINT uq_nc_tally UNIQUE NONCLUSTERED(N);
GO
------------------------------------------------------------------------------------------
Usage Examples:
SELECT * FROM dbo.lcssWindowAB('abcxxx', 'yyyabczzz', 10);
SELECT * FROM dbo.lcssWindowAB('123xxx', '!!123!!xxx!!', 10);
History:
20171126 - Initial Development - Developed by Alan Burstein
20171127 - updated to return unigrams when longest common substring is 1;
updated to use a materialized tally table; -- Alan Burstein
*****************************************************************************************/
SELECT TOP (1) WITH TIES itemIndex, itemLen = itemLen+addThis, item
FROM dbo.getshortstring8k(@s1, @s2) xs
CROSS APPLY
(
SELECT
itemIndex = MIN(position) over (partition by grouper order by (select $)),
itemLen = itemLen,
addThis = position-MIN(position) over (partition by grouper order by (select $)),
item = SUBSTRING
(
xs.s1,
MIN(position) over (partition by grouper order by (select $)),
itemLen+position-MIN(position) over (partition by grouper order by (select $))
)
FROM
(
SELECT position - ROW_NUMBER() OVER (ORDER BY position), position, itemLen
FROM
(
SELECT TOP (1) WITH TIES t.N, x.N -- Get the "longest" (including ties)
FROM dbo.tally t -- all positions within the shorter string (s.s1)
CROSS JOIN dbo.tally x -- all sizes of substrings within the shorter string
WHERE (t.N <= LEN(xs.s1) AND x.N <= LEN(xs.s1) AND LEN(xs.s1) - t.N + 1 - x.N >= 0)
AND (x.N BETWEEN 2 AND 10 OR x.N % @window = 0) -- only 2-20 & @window-sized tokens
AND CHARINDEX(SUBSTRING(xs.s1, t.N, x.N), xs.s2) > 0 -- only tokens matched in both strings
ORDER BY -x.N
) longSubstrings (position, itemLen)
UNION ALL
SELECT -position-1, position, 1
FROM
(
SELECT t.N, 1 -- unigrams only
FROM dbo.tally t -- all positions within the shorter string (s.s1)
WHERE (t.N <= LEN(xs.s1)) -- all valid unigrams
AND CHARINDEX(SUBSTRING(xs.s1, t.N, 1), xs.s2) > 0 -- only unigrams matched in both strings
) unigrams (position, itemLen)
) addGrouper (grouper, position, itemLen)
) lcssWindow (itemIndex, itemLen, addthis, item)
WHERE @window >= 10 AND @window%10 = 0 -- must be greater than 10 and divisible by 10
AND CHARINDEX(item, xs.s2) > 0
ORDER BY -itemLen, -addThis;
GO
Here's an example of how the function calculating the longest common substring between two strings about 7700 characters long. It returns the correct answer in 16 milliseconds.
set statistics time on;
declare
@s1 varchar(8000) = replicate('x',50)+replicate('Wow!',1900)+'!'+'junk,junk,junk...',
@s2 varchar(8000) = replicate('z',95)+replicate('Wow!',1900)+'!'+'zzzzzzzzzzz......',
@window int = 100;
select len(@s1), len(@s2);
select * from dbo.lcssWindowAB(@s1, @s2, 100);
set statistics time off;
This is part of an article I'm working on. More to come!