BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

get_text

htmldata = getdata("https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed") 
soup = BeautifulSoup(htmldata, 'html.parser') 
data = '' 
for data in soup.find_all("p"): 
    print(data.get_text())

This works well for specific articles where the text is all wrapped in <p> tags. Since the web is an ugly place, it's not always the case.

Often, websites will have text scattered all over, wrapped in different types of tags (e.g. maybe in a <span> or a <div>, or an <li>).

To find all text nodes in the DOM, you can use soup.find_all(text=True).

This is going to return some undesired text, like the contents of <script> and <style> tags. You'll need to filter out the text contents of elements you don't want.

blocklist = [
  'style',
  'script',
  # other elements,
]

text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blocklist]

If you are working with a known set of tags, you can tag the opposite approach:

allowlist = [
  'p'
]

text_elements = [t for t in soup.find_all(text=True) if t.parent.name in allowlist]

You are getting close!

# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()

Using find (as you've noticed) stops after finding one result. You need find_all if you want all the paragraphs. If the pages are formatted consistently ( just looked over one), you could also use something like

soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to zero in on the body of the article.

BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

Tags:

Python

Python 2.7

Beautifulsoup

Related

Recent Posts