Algorithm to find the most repetitive (not the most common) sequence in a string (aka tandem repeats)

With combination of re.findall() (using specific regex patten) and max() functions:

import re

#  extended sample string
s = 'asdfewfUBAUBAUBAUBAUBAasdkjnfencsADADADAD sometext'

def find_longest_rep(s):
    result = max(re.findall(r'((\w+?)\2+)', s), key=lambda t: len(t[0]))
    return result[0]

print(find_longest_rep(s))

The output:

UBAUBAUBAUBAUBA

The crucial pattern:

((\w+?)\2+):
- (....) - the outermost captured group which is the 1st captured group
- (\w+?) - any non-whitespace character sequence enclosed into the 2nd captured group; +? - quantifier, matches between one and unlimited times, as few times as possible, expanding as needed
- \2+ - matches the same text as most recently matched by the 2nd capturing group

Here is the solution based on ((\w+?)\2+) regex but with additional improvements:

import re
from itertools import chain


def repetitive(sequence, rep_min_len=1):
    """Find the most repetitive sequence in a string.

    :param str sequence: string for search
    :param int rep_min_len: minimal length of repetitive substring
    :return the most repetitive substring or None
    """
    greedy, non_greedy = re.compile(r'((\w+)\2+)'), re.compile(r'((\w+?)\2+)')

    all_rep_seach = lambda regex: \
        (regex.search(sequence[shift:]) for shift in range(len(sequence)))

    searched = list(
        res.groups()
        for res in chain(all_rep_seach(greedy), all_rep_seach(non_greedy))
        if res)

    if not sequence:
        return None

    cmp_key = lambda res: res[0].count(res[1]) if len(res[1]) >= rep_min_len else 0
    return max(searched, key=cmp_key)[0]

You can test it like so:

def check(seq, expected, rep_min_len=1):
    result = repetitive(seq, rep_min_len)
    print('%s => %s' % (seq, result))
    assert result == expected, expected


check('asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs', 'UBAUBAUBAUBAUBA')
check('some noisy spacerABABABABABsome noisy spacer_ABCDEF_ABCDEFsome noisy spacerABABAB', 'ABABABABAB')
check('aaabcabc', 'aaa')
check('aaabcabc', 'abcabc', rep_min_len=2)
check('ababcababc', 'ababcababc')
check('ababcababcababc', 'ababcababcababc')

Key features:

used greedy ((\w+)\2+) and non-greedy ((\w+)\2+?) regex;
search repetitive substring in all substrings with the shift from the beginning (e.g.'string' => ['string', 'tring', 'ring', 'ing', 'ng', 'g']);
selection is based on the number of repetitions not on the length of subsequence (e.g. for 'ABABABAB_ABCDEF_ABCDEF' result will be 'ABABABAB', not '_ABCDEF_ABCDEF');
the minimum length of a repeating sequence is matters (see 'aaabcabc' check).

What you are searching for is an algorithm to find the 'largest' primitive tandem repeat in a string. Here is a paper describing a linear time algorithm to find all tandem repeats in a string and by extension all primitive tandem repeats. Gusfield. Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String

Algorithm to find the most repetitive (not the most common) sequence in a string (aka tandem repeats)

Tags:

Python

Algorithm

String

Regex

Python 3.X

Related

Recent Posts