How to ignore empty lines while using .next_sibling in BeautifulSoup4 in python

Improving a bit on neurosnap's answer by making it more general:

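def next_elem(element, func):
    # func is the name of a navigation attribute, e.g. 'next_sibling'
    new_elem = getattr(element, func)
    # skip any whitespace-only string, not just a single "\n"
    if isinstance(new_elem, str) and not new_elem.strip():
        return next_elem(new_elem, func)
    else:
        return new_elem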

Now you can use it with any navigation attribute, for example:

next_elem(element, 'previous_sibling')

Use find_next_sibling() instead of next_sibling, and find_previous_sibling() instead of previous_sibling.

Reason: next_sibling does not necessarily return the next HTML tag; it returns the next "soup element", whatever that happens to be. Usually that is the whitespace between tags, but it can be other non-tag content too. find_next_sibling(), on the other hand, returns the next HTML tag, ignoring whitespace and other cruft between the tags.

I restructured your code a bit for this demonstration; I hope it is semantically the same.

Code with next_sibling, demonstrating the behaviour you described (works for data but not for data2):

from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
"""
soup = BeautifulSoup(data, 'html.parser')
string = 'method-removed-here'
for p in soup.find_all("p"):
    while True:
        ns = p.next_sibling
        if isinstance(ns, Tag) and ns.name == 'p' and p.text == string:
            ns.decompose()
        else:
            break
print(soup)

Code with find_next_sibling(), which works for both data and data2:

soup = BeautifulSoup(data, 'html.parser')
string = 'method-removed-here'
for p in soup.find_all("p"):
    while True:
        ns = p.find_next_sibling()
        if isinstance(ns, Tag) and ns.name == 'p' and p.text == string:
            ns.decompose()
        else:
            break
print(soup)

The same behaviour (returning every soup element, including unwanted whitespace) shows up in other parts of BeautifulSoup; see "BeautifulSoup .children or .content without whitespace between tags".
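For .children, one way to drop the whitespace nodes is to keep only Tag objects, or to ask find_all() for direct children. A small sketch with made-up sample HTML, assuming bs4 is installed:

```python
from bs4 import BeautifulSoup, Tag

html = "<div>\n<p>one</p>\n\n<p>two</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# .children yields the whitespace text nodes as well as the tags.
all_children = list(div.children)

# Keeping only Tag instances drops the whitespace strings.
tag_children = [child for child in div.children if isinstance(child, Tag)]

# find_all(True, recursive=False) matches any tag among the direct children.
direct_tags = div.find_all(True, recursive=False)

print(len(all_children), len(tag_children), len(direct_tags))
```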

I was able to solve this issue with a workaround. The problem is described in the Google Group for BeautifulSoup, where they suggest using a preprocessor for HTML files:

import re

def bs_preprocess(html):
    """remove distracting whitespaces and newline characters"""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)        # remove leading and trailing whitespaces
    html = re.sub('\n', ' ', html)      # convert newlines to spaces
                                        # this preserves newline delimiters
    html = re.sub(r'[\s]+<', '<', html) # remove whitespaces before opening tags
    html = re.sub(r'>[\s]+', '>', html) # remove whitespaces after closing tags
    return html

That's not the best solution, but it is one.
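If only the whitespace-only gaps between tags are a problem, a narrower variant of the same preprocessing idea may be enough. This is a sketch, not the Google Group's code: strip_between_tags is a hypothetical name, and like any regex approach it would also touch whitespace between tags inside a pre element.

```python
import re

def strip_between_tags(html):
    """Drop whitespace-only runs that sit between a '>' and the next '<'."""
    return re.sub(r">\s+<", "><", html)

# Sample input shaped like the question's data2.
data2 = "<p>a</p>\n\n<p>b</p>\n"
cleaned = strip_between_tags(data2).strip()
print(cleaned)  # <p>a</p><p>b</p>
```

The cleaned string can then be passed to BeautifulSoup as usual, after which next_sibling no longer lands on whitespace between the tags.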

Also not a great solution, but this worked for me:

def get_sibling(element):
    sibling = element.next_sibling
    # skip any whitespace-only string, not just a single "\n"
    if isinstance(sibling, str) and not sibling.strip():
        return get_sibling(sibling)
    else:
        return sibling
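The same idea can be written as a loop over the .next_siblings generator instead of recursion. A sketch, assuming bs4 is installed; next_non_blank is a hypothetical name:

```python
from bs4 import BeautifulSoup

def next_non_blank(element):
    """Return the first following sibling that is not whitespace-only text, or None."""
    for sibling in element.next_siblings:
        # NavigableString subclasses str, so a plain str check covers text nodes.
        if isinstance(sibling, str) and not sibling.strip():
            continue
        return sibling
    return None

soup = BeautifulSoup("<p>a</p>\n\n  \n<p>b</p>", "html.parser")
first = soup.find("p")
result = next_non_blank(first)
print(result)  # the second paragraph tag
```

Unlike find_next_sibling(), this still returns non-empty text nodes, which may or may not be what you want.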

