Identifying large bodies of text via BeautifulSoup or other Python-based extractors

You might look at the python-readability package, which does exactly this for you.


I'd say you're not really going about this the right way, as the comments above attest.

That said, this does what you're looking for.

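from bs4 import BeautifulSoup as BS
import requests

# Fetch the raw HTML of the article page
html = requests.get('http://www.cnn.com/2013/01/04/justice/ohio-rape-online-video/index.html?hpt=hp_c2').text
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BS(html, 'html.parser')
# Find the main story container, then join the text of the <p> tags inside it
print('\n\n'.join(p.text for p in soup.find(class_='cnn_strycntntlft').find_all('p')))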

It pulls out only the text: it first finds the main container of all the <p> tags, then selects only the <p> tags themselves to get their text, ignoring the <script> tags and other irrelevant ones.
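
To see the container-then-paragraph selection in isolation, here is a minimal offline sketch. The inline HTML, the class name `story`, and the helper `paragraphs_in` are all made up for illustration; only the `find` and `find_all` calls mirror the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real article page.
SAMPLE_HTML = """
<html><body>
  <div class="nav"><p>Home | World | Sports</p></div>
  <div class="story">
    <p>First paragraph of the story.</p>
    <script>var tracking = 1;</script>
    <p>Second paragraph of the story.</p>
  </div>
</body></html>
"""


def paragraphs_in(html, container_class):
    """Return the text of each <p> inside the first element with the given class."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find(class_=container_class)
    if container is None:
        # The class is not on this page, so there is nothing to extract.
        return []
    # find_all('p') skips the <script> tag and any other non-<p> elements.
    return [p.get_text(strip=True) for p in container.find_all('p')]


story = paragraphs_in(SAMPLE_HTML, 'story')
print('\n\n'.join(story))
```

Returning an empty list when the container is missing avoids the AttributeError you would otherwise get from calling find_all on None, which is what happens when a site changes its class names.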

As was mentioned in the comments, this will only work for CNN, and possibly only for this page. You might need a different strategy for every new webpage.
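
One way to avoid hard-coding a class name per site is a rough heuristic: score each element by how much text its direct <p> children hold, and keep the highest scorer. The sketch below is only an illustration under that assumption. It is not how python-readability works internally, and `largest_text_block` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page with a menu, an article body, and a footer.
SAMPLE_PAGE = """
<html><body>
  <div id="menu"><p>Home</p><p>About</p></div>
  <div id="article">
    <p>This is the first long paragraph of the article body.</p>
    <p>This is the second long paragraph, with a <a href="#">link</a> inside.</p>
  </div>
  <div id="footer"><p>Copyright notice</p></div>
</body></html>
"""


def largest_text_block(html):
    """Return the paragraph texts of the element whose direct <p> children hold the most text.

    Returns an empty list when the page contains no <p> tags.
    """
    soup = BeautifulSoup(html, 'html.parser')
    best_paragraphs, best_score = [], 0
    seen = set()  # ids of parents already scored, so each is scored once
    for p in soup.find_all('p'):
        parent = p.parent
        if id(parent) in seen:
            continue
        seen.add(id(parent))
        # Only direct children count, so nested containers are scored separately.
        texts = [child.get_text(' ', strip=True)
                 for child in parent.find_all('p', recursive=False)]
        score = sum(len(text) for text in texts)
        if score > best_score:
            best_paragraphs, best_score = texts, score
    return best_paragraphs


body = largest_text_block(SAMPLE_PAGE)
print('\n\n'.join(body))
```

A heuristic like this will still misfire on pages where comments or sidebars outweigh the article, so treat it as a starting point rather than a replacement for a per-site check.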