Extract News article content from stored .html pages

Newspaper is becoming increasingly popular. I've only used it superficially, but it looks good. It's Python 3 only.

The quickstart only shows loading from a URL, but you can also load from an HTML string:

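import newspaper

# LOAD HTML INTO STRING FROM FILE...

article = newspaper.Article('')  # a string is required as the `url` argument, but it is not used
article.set_html(html)
article.parse()  # parse the supplied HTML; the extracted body should then be in article.text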
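
The "load HTML into string from file" step is left open above. A minimal sketch of one way to do it with only the standard library follows; the helper name `load_html` and the list of fallback encodings are assumptions for illustration, not part of Newspaper.

```python
from pathlib import Path


def load_html(path, encodings=("utf-8", "cp1252")):
    """Read a stored .html file and return its contents as a string.

    Hypothetical helper: tries each encoding in turn, because saved pages
    are not always UTF-8.
    """
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters instead of failing.
    return raw.decode("utf-8", errors="replace")
```

The returned string can then be passed to `article.set_html(...)` as in the snippet above.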

There are libraries for this in Python too :)

Since you mentioned Java, there's a Python wrapper for boilerpipe that lets you use it directly inside a Python script: https://github.com/misja/python-boilerpipe

If you want to use pure-Python libraries, there are two options:

https://github.com/buriy/python-readability

and

https://github.com/grangier/python-goose

Of the two, I prefer Goose. Be aware, however, that recent versions sometimes fail to extract text for reasons I haven't tracked down, so my recommendation is to use version 1.0.22 for now.

EDIT: here's some sample code using Goose:

from goose import Goose
from requests import get

# Fetch the page; the same raw_html argument also accepts HTML read from a stored file
response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text  # the extracted article body as plain text
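
Since the question is about pages already saved to disk, the same extraction can be applied to a folder of files instead of a live URL. This is only a sketch under stated assumptions: the function name `extract_stored_pages`, the `*.html` glob, and the UTF-8 decoding are all illustrative choices, and the extractor is passed in as a callable so the loop itself does not depend on any particular library.

```python
from pathlib import Path


def extract_stored_pages(directory, extract):
    """Apply `extract` (a callable taking an HTML string) to every .html file.

    Hypothetical helper: returns a dict mapping file name to whatever
    `extract` returns, with files visited in sorted order.
    """
    results = {}
    for path in sorted(Path(directory).glob("*.html")):
        # Assumes UTF-8; undecodable bytes become replacement characters.
        html = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = extract(html)
    return results
```

With Goose, the callable could be something like `lambda html: extractor.extract(raw_html=html).cleaned_text`, reusing the `extractor` from the sample above.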


