How do you remove duplicates from a list whilst preserving order?
Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark
Fastest one:
def f7(seq):
seen = set()
seen_add = seen.add
return [x for x in seq if not (x in seen or seen_add(x))]
Why assign seen.add
to seen_add
instead of just calling seen.add
? Python is a dynamic language, and resolving seen.add
each iteration is more costly than resolving a local variable. seen.add
could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.
If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/
O(1) insertion, deletion and member-check per operation.
(Small additional note: seen.add()
always returns None
, so the or
above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
Edit 2020
As of CPython/PyPy 3.6 (and as a language guarantee in 3.7), plain dict
is insertion ordered, and even more efficient than the (also C implemented) collections.OrderedDict
. So the fastest solution, by far, is also the simplest:
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(dict.fromkeys(items))
[1, 2, 0, 3]
Like list(set(items))
this pushes all the work to the C layer (on CPython), but since dict
s are insertion ordered, dict.fromkeys
doesn't lose ordering. It's slower than list(set(items))
(takes 50-100% longer typically), but much faster than any other order-preserving solution (takes about half the time of hacks involving use of set
s in a listcomp).
Edit 2016
As Raymond pointed out, in python 3.5+ where OrderedDict
is implemented in C, the list comprehension approach will be slower than OrderedDict
(unless you actually need the list at the end - and even then, only if the input is very short). So the best solution for 3.5+ is OrderedDict
.
Important Edit 2015
As @abarnert notes, the more_itertools
library (pip install more_itertools
) contains a unique_everseen
function that is built to solve this problem without any unreadable (not seen.add
) mutations in list comprehensions. This is also the fastest solution too:
>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]
Just one simple library import and no hacks.
This comes from an implementation of the itertools recipe unique_everseen
which looks like:
def unique_everseen(iterable, key=None):
"List unique elements, preserving order. Remember all elements ever seen."
# unique_everseen('AAAABBBCCDAABBB') --> A B C D
# unique_everseen('ABBCcAD', str.lower) --> A B C D
seen = set()
seen_add = seen.add
if key is None:
for element in filterfalse(seen.__contains__, iterable):
seen_add(element)
yield element
else:
for element in iterable:
k = key(element)
if k not in seen:
seen_add(k)
yield element
In Python 2.7+
the accepted common idiom (which works but isn't optimized for speed, I would now use unique_everseen
) for this uses collections.OrderedDict
:
Runtime: O(N)
>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]
This looks much nicer than:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
and doesn't utilize the ugly hack:
not seen.add(x)
which relies on the fact that set.add
is an in-place method that always returns None
so not None
evaluates to True
.
Note however that the hack solution is faster in raw speed though it has the same runtime complexity O(N).
In CPython 3.6+ (and all other Python implementations starting with Python 3.7+), dictionaries are ordered, so the way to remove duplicates from an iterable while keeping it in the original order is:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.5 and below (including Python 2.7), use the OrderedDict
. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']