How to delete a tag from a node in lxml without removing its tail?
While the accepted answer from phlou will work, there are easier ways to remove tags without also removing their tails.
If you want to remove a specific element, then the lxml method you are looking for is drop_tree (available on lxml.html elements).
From the docs:
Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.
If you want to remove all instances of a specific tag, you can use lxml.etree.strip_elements (also reachable as lxml.html.etree.strip_elements) with with_tail=False.
Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.
So, for the example in the original post:
>>> from lxml.html import fragment_fromstring, tostring
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> for bad in html.xpath('.//b'):
...     bad.drop_tree()
...
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'
or
>>> from lxml.html import fragment_fromstring, tostring, etree
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> etree.strip_elements(html, 'b', with_tail=False)
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'
Edit:
Please see @Joshmakers' answer at https://stackoverflow.com/a/47946748/8055036, which is clearly the better one.
I did the following to save the tail text by appending it to the previous sibling or, if there is none, to the parent's text.
def remove_keeping_tail(self, element):
    """Save the tail text and then delete the element"""
    self._preserve_tail_before_delete(element)
    element.getparent().remove(element)

def _preserve_tail_before_delete(self, node):
    if node.tail:  # preserve the tail
        previous = node.getprevious()
        if previous is not None:  # a previous sibling, if any, gets the tail
            if previous.tail is None:
                previous.tail = node.tail
            else:
                previous.tail = previous.tail + node.tail
        else:  # otherwise the parent gets the tail as text
            parent = node.getparent()
            if parent.text is None:
                parent.text = node.tail
            else:
                parent.text = parent.text + node.tail
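The two methods above are written to live inside a class. As a rough sketch, the same idea can be written as a standalone function for plain lxml.etree trees, where drop_tree is not available. The sample XML and the ValueError for a parentless element are my own additions for illustration.

```python
from lxml import etree


def remove_keeping_tail(element):
    """Remove element from its parent, re-attaching its tail text first."""
    parent = element.getparent()
    if parent is None:
        # Assumption for this sketch: refuse to remove a root element.
        raise ValueError("cannot remove an element that has no parent")
    if element.tail:
        previous = element.getprevious()
        if previous is not None:
            # A previous sibling exists: append the tail to its tail.
            previous.tail = (previous.tail or "") + element.tail
        else:
            # No previous sibling: append the tail to the parent's text.
            parent.text = (parent.text or "") + element.tail
    parent.remove(element)


# Illustrative input, not from the original question.
root = etree.fromstring("<a>Head<b>Text</b>Text2<c/>Text3</a>")

remove_keeping_tail(root.find("c"))  # tail goes to the previous sibling <b>
after_c = etree.tostring(root, encoding="unicode")

remove_keeping_tail(root.find("b"))  # tail goes to the parent's text
after_b = etree.tostring(root, encoding="unicode")

print(after_c)
print(after_b)
```

The `(x or "") + tail` form is just a shorter way of writing the None checks from the original methods.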
HTH