BeautifulSoup getText from between <p>, not picking up subsequent paragraphs
get_text
htmldata = getdata("https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed")
soup = BeautifulSoup(htmldata, 'html.parser')
data = ''
for data in soup.find_all("p"):
print(data.get_text())
This works well for specific articles where the text is all wrapped in <p>
tags. Since the web is an ugly place, it's not always the case.
Often, websites will have text scattered all over, wrapped in different types of tags (e.g. maybe in a <span>
or a <div>
, or an <li>
).
To find all text nodes in the DOM, you can use soup.find_all(text=True)
.
This is going to return some undesired text, like the contents of <script>
and <style>
tags. You'll need to filter out the text contents of elements you don't want.
blocklist = [
'style',
'script',
# other elements,
]
text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blocklist]
If you are working with a known set of tags, you can tag the opposite approach:
allowlist = [
'p'
]
text_elements = [t for t in soup.find_all(text=True) if t.parent.name in allowlist]
You are getting close!
# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()
Using find (as you've noticed) stops after finding one result. You need find_all if you want all the paragraphs. If the pages are formatted consistently ( just looked over one), you could also use something like
soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})
to zero in on the body of the article.