More efficient algorithm for shortest superstring search

This should do it.

import itertools as it

SEQUENCES = ['AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG']
LONGEST_SUPERSTRING = ''.join(SEQUENCES)

def find_shortest_superstring():
    current_shortest = LONGEST_SUPERSTRING
    trim = len(current_shortest)-1
    seen_prefixes = set()
    for perm in it.permutations(SEQUENCES):
        candidate_string = ''.join(perm)[:trim]
        if candidate_string in seen_prefixes:
            continue
        seen_prefixes.add(candidate_string)
        while is_superstring(candidate_string):
            current_shortest = candidate_string
            candidate_string = candidate_string[:-1]
            trim = len(current_shortest)-1
    return current_shortest

def is_superstring(s):
    return all(seq in s for seq in SEQUENCES)

def main():
    print 'Searching for shortest superstring containing all strings.'
    ss = find_shortest_superstring()
    print 'Found shortest superstring containing all strings:\n{}'.format(ss)

if __name__ == '__main__':
    main()

The code takes about 15 seconds to run and produces the following output:

Searching for shortest superstring containing all strings.
Found shortest superstring containing all strings:
CCGTAGGTGGAGT

Just backtracking, but always check most overlapped first. After get a good candidate answer, later when current path result in a string has length big or equal to this candidate answer, we do not need to go further with this path.

Tested in my Jupyter notebook. It seems to be much faster than the other two answers here (11/18/2018)

def shortestSuperstring(A):
    """
    :type A: List[str]
    :rtype: str
    """

    if len(A)==1:
        return A[0]
    dic={}
    for i in xrange(len(A)):
        for j in xrange(len(A)):
            if i!=j:
                ol=0
                for k in xrange(1,min(len(A[i]),len(A[j]))):
                    if A[j][:k]==A[i][-k:]:
                        ol=k
                dic[(i,j)]=ol
    if max(dic.values())==0:
        return "".join(A)
    else:
        ret="".join(A)
        l=len(ret)
        stack=[]
        for i,wd in enumerate(A):
            tmp=set(range(len(A)))
            tmp.remove(i)
            stack.append((wd,i,tmp))
        while stack:
            ans,cur,remain=stack.pop()
            if len(ans)<l:
                if not remain:
                    ret=ans
                    l=len(ret)
                else:
                    tmp=[[dic[cur,idx],idx] for idx in remain] # [#overlap,idx]
                    tmp.sort()
                    for ol,idx in tmp:
                        nans=ans+A[idx][ol:]
                        nremain=set(remain)
                        nremain.remove(idx)
                        stack.append((nans,idx,nremain))
        return ret

The test case in the problem takesL

1.93 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

to run and gives the answer:

'CCGTGGTAGGAGT'

Some other test cases (longer strings and that start to beat the other two methods, all about 1~5 seconds):

    ****************************************************************************************************


    case: 

     ['mftpvodataplkewcouz', 'krrgsoxpsnmzlhprsl', 'qhbfymytxzbmqma', 'hunjgeaolcuznhpodi', 'kewcouzbwlftz', 'xzbmqmahunjgeaolcu', 'zlhprslqurnqbhsjr', 'rrgsoxpsnmzlhprslqur', 'diqukrrgsoxpsnmz', 'sjrxzavamftpvoda']


    ****************************************************************************************************


    ans:  qhbfymytxzbmqmahunjgeaolcuznhpodiqukrrgsoxpsnmzlhprslqurnqbhsjrxzavamftpvodataplkewcouzbwlftz


    ****************************************************************************************************


    case: 

     ['cedefifgstkyxfcuajfa', 'ooncedefifgstkyxfcua', 'assqjfwarvjcjedqtoz', 'fcuajfassqjfwarvjc', 'fwarvjcjedqtozctcd', 'zppedxfumcfsngp', 'kyxfcuajfassqjfwa', 'fumcfsngphjyfhhwkqa', 'fassqjfwarvjcjedq', 'ppedxfumcfsngphjyf', 'dqtozctcdk']


    ****************************************************************************************************


    ans:  zppedxfumcfsngphjyfhhwkqaooncedefifgstkyxfcuajfassqjfwarvjcjedqtozctcdk


    ****************************************************************************************************


    case: 

     ['ekpijtseahvmprvefkgn', 'yyevvcmeekpijtseahvm', 'vsfcyyevvcmeekp', 'xwmkoqhxvrovlmmvsfcy', 'cmeekpijtseahvmpr', 'oqhxvrovlmmvsfcyy', 'zpuemtclxbxwsypfxevx', 'clxbxwsypfxevxw', 'fkgnjgdvfygnlckyiju', 'xevxwmkoqhxvrovlmm']


    ****************************************************************************************************


    ans:  zpuemtclxbxwsypfxevxwmkoqhxvrovlmmvsfcyyevvcmeekpijtseahvmprvefkgnjgdvfygnlckyiju


    ****************************************************************************************************


    case: 

     ['ppgortnmsy', 'czmysoeeyugbiylso', 'nbfzpppvhbjydtx', 'rnzynedhoiunkpon', 'ornzynedhoiunkpo', 'ylsomoktkyfgljcf', 'jtvkrornzynedhoiunk', 'hvhhihwdffmxnczmyso', 'ktkyfgljcfbkqcpp', 'nzynedhoiunkponbfz', 'nedhoiunkponbfzpppvh']


    ****************************************************************************************************


    ans:  hvhhihwdffmxnczmysoeeyugbiylsomoktkyfgljcfbkqcppgortnmsyjtvkrornzynedhoiunkponbfzpppvhbjydtx


    ****************************************************************************************************


    case: 

     ['amefulhsdgvjvoab', 'giqxpqszaitzfzvtalx', 'cyqeolfgkihssycmiodg', 'glhhcfuprwazet', 'cmiodgiqxpqszaitzf', 'lhsdgvjvoabdviglhhcf', 'ssycmiodgiqxpqsza', 'bxtdqnamefulhsdg', 'namefulhsdgvjvo', 'ihssycmiodgiqxp', 'itzfzvtalxfybxtdqn']


    ****************************************************************************************************


    ans:  cyqeolfgkihssycmiodgiqxpqszaitzfzvtalxfybxtdqnamefulhsdgvjvoabdviglhhcfuprwazet


    ****************************************************************************************************


    case: 

     ['yobbobwqymlordokxka', 'jllfoebgbsrguls', 'rgulsnatnpuuwiyba', 'ordokxkamymamofefr', 'wqymlordokxkamy', 'fycxifzsjllfoebgbsrg', 'lordokxkamymamofe', 'kxkamymamofefrmfycx', 'frmfycxifzsjllf', 'srgulsnatnpuuwiy']


    ****************************************************************************************************


    ans:  yobbobwqymlordokxkamymamofefrmfycxifzsjllfoebgbsrgulsnatnpuuwiyba


    ****************************************************************************************************


    case: 

     ['jnbbbbsczcscxawcze', 'bsczcscxawczeumyyr', 'lyofvbhvjmquhkgz', 'quhkgzyzdwtjnbbb', 'kgzyzdwtjnbbbbsczc', 'uouxnfplptpkgnronf', 'pqgyfqglyofvbhv', 'kgnronftgswvpqgy', 'marvhdxtbmkcpnli', 'qgyfqglyofvbhvjmquhk', 'xtbmkcpnliz']


    ****************************************************************************************************


    ans:  marvhdxtbmkcpnlizuouxnfplptpkgnronftgswvpqgyfqglyofvbhvjmquhkgzyzdwtjnbbbbsczcscxawczeumyyr


    ****************************************************************************************************


    case: 

     ['qrwpawefqzfjsan', 'jsanzdukfkdlmyox', 'neaxnkedjxbpgsyq', 'nqjvzryhfjdsxmwolwo', 'hfjdsxmwolwomeeewvi', 'lmyoxbpvkneaxnkedjxb', 'qbhpqrwpawefqzfjsa', 'pawefqzfjsanzdukfk', 'bqbhpqrwpawefqzfj', 'dlmyoxbpvkneaxnk', 'xnkedjxbpgsyqovvh']


    ****************************************************************************************************


    ans:  bqbhpqrwpawefqzfjsanzdukfkdlmyoxbpvkneaxnkedjxbpgsyqovvhnqjvzryhfjdsxmwolwomeeewvi


    ****************************************************************************************************


    case: 

     ['vgrikrnwezryimj', 'umwgwvzpsfpmctzt', 'pjourlpgeemdjor', 'urlpgeemdjorpzbkbz', 'jorpzbkbzcqyewih', 'xuwkzvoczozhhvf', 'ihbumoogibirbsvch', 'nwezryimjivvpjourlp', 'kzvoczozhhvfwgeplv', 'ezryimjivvpjourlpgee', 'zhhvfwgeplvqngglu', 'rikrnwezryimjivvp']


    ****************************************************************************************************


    ans:  xuwkzvoczozhhvfwgeplvqngglumwgwvzpsfpmctztvgrikrnwezryimjivvpjourlpgeemdjorpzbkbzcqyewihbumoogibirbsvch


    ****************************************************************************************************


    case: 

     ['nbsgonqmpreelpbr', 'hnysjajtiguehrokus', 'udgzbzmevnkzzba', 'axtbmcpbmoubyoscn', 'vqnbsgonqmpreel', 'xvqnbsgonqmpree', 'ajtiguehrokustktudgz', 'brgkgihuetpqrhhbhn', 'dgzbzmevnkzzbaxtbmcp', 'ehrokustktudgzbzmevn', 'uetpqrhhbhnysjaj', 'vnkzzbaxtbmcpbmo']


    ****************************************************************************************************


    ans:  xvqnbsgonqmpreelpbrgkgihuetpqrhhbhnysjajtiguehrokustktudgzbzmevnkzzbaxtbmcpbmoubyoscn


    ****************************************************************************************************


    case: 

     ['orugbsuuxowmhjh', 'zjyxzmpduthlsioor', 'qtxocgehmhfqnstl', 'tlrlcnnrsyryfrywuebq', 'hozjyxzmpduthlsio', 'hjhdmnqtxocgehm', 'mjhzwdudlnbfkjawqacf', 'hfqnstlrlcnnrsyryfry', 'yfrywuebqhvwewzmq', 'zzieemjhzwdudlnbfkj', 'nnrsyryfrywuebqhvw', 'acfgaihbhozjyxzmpdut']


    ****************************************************************************************************


    ans:  zzieemjhzwdudlnbfkjawqacfgaihbhozjyxzmpduthlsioorugbsuuxowmhjhdmnqtxocgehmhfqnstlrlcnnrsyryfrywuebqhvwewzmq


    ****************************************************************************************************


    case: 

     ['phuutlgczfspygaljkv', 'fspygaljkvahvuii', 'csywjodtnkynkjckq', 'poyykqyrhbvcwvjl', 'xijupvzzwphuutlg', 'aljkvahvuiivqbqrw', 'vahvuiivqbqrwryd', 'wjodtnkynkjckqurgu', 'ecdmbshotqbxjqgbou', 'hvuiivqbqrwrydgnr', 'ivqbqrwrydgnrubcsywj', 'wphuutlgczfspyga']


    ****************************************************************************************************


    ans:  ecdmbshotqbxjqgbouxijupvzzwphuutlgczfspygaljkvahvuiivqbqrwrydgnrubcsywjodtnkynkjckqurgupoyykqyrhbvcwvjl

Also see the dynamic programming approach: https://leetcode.com/problems/find-the-shortest-superstring/solution/

I applied the Dijkstra algorithm (width-search) and have a solution giving an answer to this task in less than a second. I optimized it a bit in terms of memory usage, but I think concerning the algorithm this is a better approach than the one in the other answer. Unless we run out of memory this should be a better solution.

from collections import defaultdict

def dijkSuperstring(originalSeqs):
  paths = defaultdict(set)
  paths[0] =  { '' }
  while paths:
    minLength = min(paths.keys())
    while paths[minLength]:
      candidate = paths[minLength].pop()
      seqAdded = False
      for seq in originalSeqs:
        if seq in candidate:
          continue
        seqAdded = True
        for i in reversed(range(len(seq)+1)):
          if candidate.endswith(seq[:i]):
            newCandidate = candidate + seq[i:]
            paths[len(newCandidate)].add(newCandidate)
      if not seqAdded:  # nothing added, so all present?
        return candidate
    del paths[minLength]

print dijkSuperstring(
  [ 'AGG', 'AGT', 'CCG', 'CGT', 'GAG', 'GGA', 'GGT', 'GTA', 'GTG', 'TAG', 'TGG' ])

I also tried using random sequences as input:

seqs = [ ''.join(random.choice('GATC')
  for i in range(3))
    for j in range(11) ]
print dijkSuperstring(deqs)

I soon found out that the solving time greatly depends on the size of the result(!) not of the input's size (so it isn't predictable). This isn't too surprising, but it makes comparing different algorithms a little difficult as others don't necessarily also have this property. In particular, the set of sequences from the OP seems to pose a comparatively lightweight problem. Other sets of 11 sequences of 3 characters are much harder to solve.

So I made some statistical measurements; I solved 1000 sets of 8 sequences. This I did for sequences of 3 and of 4 characters. Then I grouped the durations in 100 groups (equally spaced from 0s to the highest duration) and counted how many fell into each group. To smoothen the graph I always used the sum of three neighboring groups.

The diagrams below each show two such experiments, performed with an earlier (non-optimized) version of my algorithm (but the shape of the curves are the same as now); I did it twice to at least have an idea whether a strange ditch in the graph could have a reason or was just by pure chance.

I'd be interested to see similar graphs for the same kind of input for other algorithms. This could be interesting because my algorithm clearly has a memory issue. Solving 11 sequences of 3 characters stalled my machine several times due to memory exhaustion, so having another algorithm could make sense even if it is slower.

More efficient algorithm for shortest superstring search

8 Sequences of 3 Characters

8 Sequences of 4 Characters

Tags:

Python

Algorithm

Optimization

Related

Recent Posts