Is str.replace(..).replace(..) ad nauseam a standard idiom in Python?

Do you have an application that is running too slow and you profiled it to find that a line like this snippet is causing it to be slow? Bottlenecks occur at unexpected places.

The current snippet traverses the string 5 times, doing one thing each time. You are suggesting traversing it once, probably doing doing five things each time (or at least doing something each time). It isn't clear that this will automatically do a better job to me. Currently the algorithm used is O(n*m) (assuming the length of the string is longer than the stuff in the rules), where n is the length of the string and m is the number of substitution rules. You could, I think, reduce the algorithmic complexity to something like O(n*log(m)) and in the specific case we're in—where the original things are all only one character (but not in the case of multiple calls to replace in general)—O(n), but this doesn't matter since m is 5 but n is unbounded.

If m is held constant, then, the complexity of both solutions really goes to O(n). It is not clear to me that it is going to be a worthy task to try to turn five simple passes into one complex one, the actual time of which I cannot guess at the current moment. If there was something about it that could make it scale better, I would have thought it was much more worthwhile task.

Doing everything on one pass rather than consecutive passes also demands questions be answered about what to do about conflicting rules and how they are applied. The resolution to these questions is clear with a chain of replace.

How about we just test various ways of doing this and see which comes out faster (assuming we are only caring about the fastest way to do it).

def escape1(input):
        return input.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

translation_table = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    "'": '&#39;',
    '"': '&quot;',
}

def escape2(input):
        return ''.join(translation_table.get(char, char) for char in input)

import re
_escape3_re = re.compile(r'[&<>\'"]')
def _escape3_repl(x):
    s = x.group(0)
    return translation_table.get(s, s)
def escape3(x):
    return _escape3_re.sub(_escape3_repl, x)

def escape4(x):
    return unicode(x).translate(translation_table)

test_strings = (
    'Nothing in there.',
    '<this is="not" a="tag" />',
    'Something & Something else',
    'This one is pretty long. ' * 50
)

import time

for test_i, test_string in enumerate(test_strings):
    print repr(test_string)
    for func in escape1, escape2, escape3, escape4:
        start_time = time.time()
        for i in xrange(1000):
            x = func(test_string)
        print '\t%s done in %.3fms' % (func.__name__, (time.time() - start_time))
    print

Running this gives you:

'Nothing in there.'
    escape1 done in 0.002ms
    escape2 done in 0.009ms
    escape3 done in 0.001ms
    escape4 done in 0.005ms

'<this is="not" a="tag" />'
    escape1 done in 0.002ms
    escape2 done in 0.012ms
    escape3 done in 0.009ms
    escape4 done in 0.007ms

'Something & Something else'
    escape1 done in 0.002ms
    escape2 done in 0.012ms
    escape3 done in 0.003ms
    escape4 done in 0.007ms

'This one is pretty long. <snip>'
    escape1 done in 0.008ms
    escape2 done in 0.386ms
    escape3 done in 0.011ms
    escape4 done in 0.310ms

Looks like just replacing them one after another goes the fastest.

Edit: Running the tests again with 1000000 iterations gives the following for the first three strings (the fourth would take too long on my machine for me to wait =P):

'Nothing in there.'
    escape1 done in 0.001ms
    escape2 done in 0.008ms
    escape3 done in 0.002ms
    escape4 done in 0.005ms

'<this is="not" a="tag" />'
    escape1 done in 0.002ms
    escape2 done in 0.011ms
    escape3 done in 0.009ms
    escape4 done in 0.007ms

'Something & Something else'
    escape1 done in 0.002ms
    escape2 done in 0.011ms
    escape3 done in 0.003ms
    escape4 done in 0.007ms

The numbers are pretty much the same. In the first case they are actually even more consistent as the direct string replacement is fastest now.

Is str.replace(..).replace(..) ad nauseam a standard idiom in Python?

Tags:

Python

Performance

Replace

Idioms

Related

Recent Posts