clone element with beautifulsoup

There is no native clone function in BeautifulSoup in versions before 4.4 (released July 2015); you'd have to create a deep copy yourself, which is tricky as each element maintains links to the rest of the tree.
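To see why a copy is needed in the first place, here is a minimal sketch (the markup and variable names are made up for illustration, and it assumes the built-in html.parser): appending an element that already lives in one tree to another tree moves it rather than duplicating it.

```python
from bs4 import BeautifulSoup

# Two hypothetical documents, parsed with the built-in html.parser.
document1 = BeautifulSoup('<div id="someid">hi</div>', 'html.parser')
document2 = BeautifulSoup('<html><body></body></html>', 'html.parser')

tag = document1.find('div', id='someid')

# append() detaches the tag from its current parent before inserting it,
# so this is a move, not a copy.
document2.body.append(tag)

still_in_document1 = document1.find('div', id='someid') is not None
now_in_document2 = document2.body.find('div', id='someid') is tag
print(still_in_document1, now_in_document2)
```

After the append, the div is gone from document1, which is usually not what you want when you meant to duplicate it.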

To clone an element and all its children, you'd have to copy all attributes and reset their parent-child relationships, and this has to happen recursively. This is best done by not copying the relationship attributes at all and instead re-seating each recursively cloned element in the new tree:


from bs4 import Tag, NavigableString

def clone(el):
    if isinstance(el, NavigableString):
        return type(el)(el)

    copy = Tag(None, el.builder, el.name, el.namespace, el.nsprefix)
    # work around bug where there is no builder set
    # https://bugs.launchpad.net/beautifulsoup/+bug/1307471
    copy.attrs = dict(el.attrs)
    for attr in ('can_be_empty_element', 'hidden'):
        setattr(copy, attr, getattr(el, attr))
    for child in el.contents:
        copy.append(clone(child))
    return copy

This method is somewhat sensitive to the BeautifulSoup version in use; I tested it with 4.3, and future versions may add attributes that need to be copied too.

You could also monkeypatch this functionality into BeautifulSoup:


from bs4 import Tag, NavigableString


def tag_clone(self):
    copy = type(self)(None, self.builder, self.name, self.namespace, 
                      self.nsprefix)
    # work around bug where there is no builder set
    # https://bugs.launchpad.net/beautifulsoup/+bug/1307471
    copy.attrs = dict(self.attrs)
    for attr in ('can_be_empty_element', 'hidden'):
        setattr(copy, attr, getattr(self, attr))
    for child in self.contents:
        copy.append(child.clone())
    return copy


Tag.clone = tag_clone
NavigableString.clone = lambda self: type(self)(self)

letting you call .clone() on elements directly:


# note: the keyword is id, not id_ (only class needs the trailing underscore)
document2.body.append(document1.find('div', id='someid').clone())

My feature request to the BeautifulSoup project was accepted and tweaked to use the copy.copy() function; now that BeautifulSoup 4.4 is released you can use that version (or newer) and do:


import copy

# note: the keyword is id, not id_ (only class needs the trailing underscore)
document2.body.append(copy.copy(document1.find('div', id='someid')))
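As a self-contained sketch of the copy.copy() approach (the sample markup is invented for illustration, and it assumes BeautifulSoup 4.4 or newer with the built-in html.parser), the following shows that the copy starts out detached and that its children are independent of the original's:

```python
import copy

from bs4 import BeautifulSoup

# Hypothetical source and target documents.
document1 = BeautifulSoup(
    '<div id="someid"><p>Hello <b>world</b></p></div>', 'html.parser')
document2 = BeautifulSoup('<html><body></body></html>', 'html.parser')

source = document1.find('div', id='someid')
cloned = copy.copy(source)

# A fresh copy is not attached to any tree yet.
detached = cloned.parent is None

document2.body.append(cloned)

# Editing a child of the copy does not touch the original tree.
cloned.b.string = 'there'

print(source)
print(document2.body)
```

The original div stays in document1 unchanged, while document2 holds the edited copy.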

It may not be the fastest solution, but it is short and seems to work...

clonedtag = BeautifulSoup(str(sourcetag)).body.contents[0]

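BeautifulSoup wraps the cloned tag in an extra <html><body>...</body></html> (in order to make the "soup" a sane HTML document); .body.contents[0] strips those wrapping tags again. This holds for parsers that build a full document, such as lxml and html5lib; html.parser adds no wrapper, so with it there is no .body to go through.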
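A variant of the same round-trip idea, sketched here under the assumption that you pass the built-in html.parser explicitly (the sample markup and names are illustrative): since that parser adds no wrapping tags, the re-parsed tag is simply the first item in the new soup's contents.

```python
from bs4 import BeautifulSoup

# Hypothetical source tree.
soup = BeautifulSoup('<div id="a"><span>text</span></div>', 'html.parser')
sourcetag = soup.find('span')

# Serialize the tag and parse it again; html.parser adds no <html>/<body>
# wrapper, so the clone is the first top-level node of the new soup.
clonedtag = BeautifulSoup(str(sourcetag), 'html.parser').contents[0]

print(clonedtag is sourcetag)
print(clonedtag)
```

Note that the clone's parent here is the throwaway soup object, not the original div; appending it elsewhere moves it out of that temporary document.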

This idea was derived from Peter Woods' comment above and Clemens Klein-Robbenhaar's comment below.
