More efficient algorithm for shortest superstring search
This should do it.
import itertools as it
SEQUENCES = ['AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG']
LONGEST_SUPERSTRING = ''.join(SEQUENCES)
def find_shortest_superstring():
current_shortest = LONGEST_SUPERSTRING
trim = len(current_shortest)-1
seen_prefixes = set()
for perm in it.permutations(SEQUENCES):
candidate_string = ''.join(perm)[:trim]
if candidate_string in seen_prefixes:
continue
seen_prefixes.add(candidate_string)
while is_superstring(candidate_string):
current_shortest = candidate_string
candidate_string = candidate_string[:-1]
trim = len(current_shortest)-1
return current_shortest
def is_superstring(s):
return all(seq in s for seq in SEQUENCES)
def main():
print 'Searching for shortest superstring containing all strings.'
ss = find_shortest_superstring()
print 'Found shortest superstring containing all strings:\n{}'.format(ss)
if __name__ == '__main__':
main()
The code takes about 15 seconds to run and produces the following output:
Searching for shortest superstring containing all strings.
Found shortest superstring containing all strings:
CCGTAGGTGGAGT
Just backtracking, but always check most overlapped first. After get a good candidate answer, later when current path result in a string has length big or equal to this candidate answer, we do not need to go further with this path.
Tested in my Jupyter notebook. It seems to be much faster than the other two answers here (11/18/2018)
def shortestSuperstring(A):
"""
:type A: List[str]
:rtype: str
"""
if len(A)==1:
return A[0]
dic={}
for i in xrange(len(A)):
for j in xrange(len(A)):
if i!=j:
ol=0
for k in xrange(1,min(len(A[i]),len(A[j]))):
if A[j][:k]==A[i][-k:]:
ol=k
dic[(i,j)]=ol
if max(dic.values())==0:
return "".join(A)
else:
ret="".join(A)
l=len(ret)
stack=[]
for i,wd in enumerate(A):
tmp=set(range(len(A)))
tmp.remove(i)
stack.append((wd,i,tmp))
while stack:
ans,cur,remain=stack.pop()
if len(ans)<l:
if not remain:
ret=ans
l=len(ret)
else:
tmp=[[dic[cur,idx],idx] for idx in remain] # [#overlap,idx]
tmp.sort()
for ol,idx in tmp:
nans=ans+A[idx][ol:]
nremain=set(remain)
nremain.remove(idx)
stack.append((nans,idx,nremain))
return ret
The test case in the problem takesL
1.93 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
to run and gives the answer:
'CCGTGGTAGGAGT'
Some other test cases (longer strings and that start to beat the other two methods, all about 1~5 seconds):
****************************************************************************************************
case:
['mftpvodataplkewcouz', 'krrgsoxpsnmzlhprsl', 'qhbfymytxzbmqma', 'hunjgeaolcuznhpodi', 'kewcouzbwlftz', 'xzbmqmahunjgeaolcu', 'zlhprslqurnqbhsjr', 'rrgsoxpsnmzlhprslqur', 'diqukrrgsoxpsnmz', 'sjrxzavamftpvoda']
****************************************************************************************************
ans: qhbfymytxzbmqmahunjgeaolcuznhpodiqukrrgsoxpsnmzlhprslqurnqbhsjrxzavamftpvodataplkewcouzbwlftz
****************************************************************************************************
case:
['cedefifgstkyxfcuajfa', 'ooncedefifgstkyxfcua', 'assqjfwarvjcjedqtoz', 'fcuajfassqjfwarvjc', 'fwarvjcjedqtozctcd', 'zppedxfumcfsngp', 'kyxfcuajfassqjfwa', 'fumcfsngphjyfhhwkqa', 'fassqjfwarvjcjedq', 'ppedxfumcfsngphjyf', 'dqtozctcdk']
****************************************************************************************************
ans: zppedxfumcfsngphjyfhhwkqaooncedefifgstkyxfcuajfassqjfwarvjcjedqtozctcdk
****************************************************************************************************
case:
['ekpijtseahvmprvefkgn', 'yyevvcmeekpijtseahvm', 'vsfcyyevvcmeekp', 'xwmkoqhxvrovlmmvsfcy', 'cmeekpijtseahvmpr', 'oqhxvrovlmmvsfcyy', 'zpuemtclxbxwsypfxevx', 'clxbxwsypfxevxw', 'fkgnjgdvfygnlckyiju', 'xevxwmkoqhxvrovlmm']
****************************************************************************************************
ans: zpuemtclxbxwsypfxevxwmkoqhxvrovlmmvsfcyyevvcmeekpijtseahvmprvefkgnjgdvfygnlckyiju
****************************************************************************************************
case:
['ppgortnmsy', 'czmysoeeyugbiylso', 'nbfzpppvhbjydtx', 'rnzynedhoiunkpon', 'ornzynedhoiunkpo', 'ylsomoktkyfgljcf', 'jtvkrornzynedhoiunk', 'hvhhihwdffmxnczmyso', 'ktkyfgljcfbkqcpp', 'nzynedhoiunkponbfz', 'nedhoiunkponbfzpppvh']
****************************************************************************************************
ans: hvhhihwdffmxnczmysoeeyugbiylsomoktkyfgljcfbkqcppgortnmsyjtvkrornzynedhoiunkponbfzpppvhbjydtx
****************************************************************************************************
case:
['amefulhsdgvjvoab', 'giqxpqszaitzfzvtalx', 'cyqeolfgkihssycmiodg', 'glhhcfuprwazet', 'cmiodgiqxpqszaitzf', 'lhsdgvjvoabdviglhhcf', 'ssycmiodgiqxpqsza', 'bxtdqnamefulhsdg', 'namefulhsdgvjvo', 'ihssycmiodgiqxp', 'itzfzvtalxfybxtdqn']
****************************************************************************************************
ans: cyqeolfgkihssycmiodgiqxpqszaitzfzvtalxfybxtdqnamefulhsdgvjvoabdviglhhcfuprwazet
****************************************************************************************************
case:
['yobbobwqymlordokxka', 'jllfoebgbsrguls', 'rgulsnatnpuuwiyba', 'ordokxkamymamofefr', 'wqymlordokxkamy', 'fycxifzsjllfoebgbsrg', 'lordokxkamymamofe', 'kxkamymamofefrmfycx', 'frmfycxifzsjllf', 'srgulsnatnpuuwiy']
****************************************************************************************************
ans: yobbobwqymlordokxkamymamofefrmfycxifzsjllfoebgbsrgulsnatnpuuwiyba
****************************************************************************************************
case:
['jnbbbbsczcscxawcze', 'bsczcscxawczeumyyr', 'lyofvbhvjmquhkgz', 'quhkgzyzdwtjnbbb', 'kgzyzdwtjnbbbbsczc', 'uouxnfplptpkgnronf', 'pqgyfqglyofvbhv', 'kgnronftgswvpqgy', 'marvhdxtbmkcpnli', 'qgyfqglyofvbhvjmquhk', 'xtbmkcpnliz']
****************************************************************************************************
ans: marvhdxtbmkcpnlizuouxnfplptpkgnronftgswvpqgyfqglyofvbhvjmquhkgzyzdwtjnbbbbsczcscxawczeumyyr
****************************************************************************************************
case:
['qrwpawefqzfjsan', 'jsanzdukfkdlmyox', 'neaxnkedjxbpgsyq', 'nqjvzryhfjdsxmwolwo', 'hfjdsxmwolwomeeewvi', 'lmyoxbpvkneaxnkedjxb', 'qbhpqrwpawefqzfjsa', 'pawefqzfjsanzdukfk', 'bqbhpqrwpawefqzfj', 'dlmyoxbpvkneaxnk', 'xnkedjxbpgsyqovvh']
****************************************************************************************************
ans: bqbhpqrwpawefqzfjsanzdukfkdlmyoxbpvkneaxnkedjxbpgsyqovvhnqjvzryhfjdsxmwolwomeeewvi
****************************************************************************************************
case:
['vgrikrnwezryimj', 'umwgwvzpsfpmctzt', 'pjourlpgeemdjor', 'urlpgeemdjorpzbkbz', 'jorpzbkbzcqyewih', 'xuwkzvoczozhhvf', 'ihbumoogibirbsvch', 'nwezryimjivvpjourlp', 'kzvoczozhhvfwgeplv', 'ezryimjivvpjourlpgee', 'zhhvfwgeplvqngglu', 'rikrnwezryimjivvp']
****************************************************************************************************
ans: xuwkzvoczozhhvfwgeplvqngglumwgwvzpsfpmctztvgrikrnwezryimjivvpjourlpgeemdjorpzbkbzcqyewihbumoogibirbsvch
****************************************************************************************************
case:
['nbsgonqmpreelpbr', 'hnysjajtiguehrokus', 'udgzbzmevnkzzba', 'axtbmcpbmoubyoscn', 'vqnbsgonqmpreel', 'xvqnbsgonqmpree', 'ajtiguehrokustktudgz', 'brgkgihuetpqrhhbhn', 'dgzbzmevnkzzbaxtbmcp', 'ehrokustktudgzbzmevn', 'uetpqrhhbhnysjaj', 'vnkzzbaxtbmcpbmo']
****************************************************************************************************
ans: xvqnbsgonqmpreelpbrgkgihuetpqrhhbhnysjajtiguehrokustktudgzbzmevnkzzbaxtbmcpbmoubyoscn
****************************************************************************************************
case:
['orugbsuuxowmhjh', 'zjyxzmpduthlsioor', 'qtxocgehmhfqnstl', 'tlrlcnnrsyryfrywuebq', 'hozjyxzmpduthlsio', 'hjhdmnqtxocgehm', 'mjhzwdudlnbfkjawqacf', 'hfqnstlrlcnnrsyryfry', 'yfrywuebqhvwewzmq', 'zzieemjhzwdudlnbfkj', 'nnrsyryfrywuebqhvw', 'acfgaihbhozjyxzmpdut']
****************************************************************************************************
ans: zzieemjhzwdudlnbfkjawqacfgaihbhozjyxzmpduthlsioorugbsuuxowmhjhdmnqtxocgehmhfqnstlrlcnnrsyryfrywuebqhvwewzmq
****************************************************************************************************
case:
['phuutlgczfspygaljkv', 'fspygaljkvahvuii', 'csywjodtnkynkjckq', 'poyykqyrhbvcwvjl', 'xijupvzzwphuutlg', 'aljkvahvuiivqbqrw', 'vahvuiivqbqrwryd', 'wjodtnkynkjckqurgu', 'ecdmbshotqbxjqgbou', 'hvuiivqbqrwrydgnr', 'ivqbqrwrydgnrubcsywj', 'wphuutlgczfspyga']
****************************************************************************************************
ans: ecdmbshotqbxjqgbouxijupvzzwphuutlgczfspygaljkvahvuiivqbqrwrydgnrubcsywjodtnkynkjckqurgupoyykqyrhbvcwvjl
Also see the dynamic programming approach: https://leetcode.com/problems/find-the-shortest-superstring/solution/
I applied the Dijkstra algorithm (width-search) and have a solution giving an answer to this task in less than a second. I optimized it a bit in terms of memory usage, but I think concerning the algorithm this is a better approach than the one in the other answer. Unless we run out of memory this should be a better solution.
from collections import defaultdict
def dijkSuperstring(originalSeqs):
paths = defaultdict(set)
paths[0] = { '' }
while paths:
minLength = min(paths.keys())
while paths[minLength]:
candidate = paths[minLength].pop()
seqAdded = False
for seq in originalSeqs:
if seq in candidate:
continue
seqAdded = True
for i in reversed(range(len(seq)+1)):
if candidate.endswith(seq[:i]):
newCandidate = candidate + seq[i:]
paths[len(newCandidate)].add(newCandidate)
if not seqAdded: # nothing added, so all present?
return candidate
del paths[minLength]
print dijkSuperstring(
[ 'AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG' ])
I also tried using random sequences as input:
seqs = [ ''.join(random.choice('GATC')
for i in range(3))
for j in range(11) ]
print dijkSuperstring(deqs)
I soon found out that the solving time greatly depends on the size of the result(!) not of the input's size (so it isn't predictable). This isn't too surprising, but it makes comparing different algorithms a little difficult as others don't necessarily also have this property. In particular, the set of sequences from the OP seems to pose a comparatively lightweight problem. Other sets of 11 sequences of 3 characters are much harder to solve.
So I made some statistical measurements; I solved 1000 sets of 8 sequences. This I did for sequences of 3 and of 4 characters. Then I grouped the durations in 100 groups (equally spaced from 0s to the highest duration) and counted how many fell into each group. To smoothen the graph I always used the sum of three neighboring groups.
The diagrams below each show two such experiments, performed with an earlier (non-optimized) version of my algorithm (but the shape of the curves are the same as now); I did it twice to at least have an idea whether a strange ditch in the graph could have a reason or was just by pure chance.
I'd be interested to see similar graphs for the same kind of input for other algorithms. This could be interesting because my algorithm clearly has a memory issue. Solving 11 sequences of 3 characters stalled my machine several times due to memory exhaustion, so having another algorithm could make sense even if it is slower.