Extract News article content from stored .html pages
Newspaper is becoming increasingly popular. I've only used it superficially, but it looks good. It's Python 3 only.
The quickstart only shows loading from a URL, but you can load from an HTML string with:
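import newspaper
# Load the stored HTML into a string ('article.html' is just a placeholder name)
with open('article.html', encoding='utf-8') as f:
    html = f.read()
article = newspaper.Article('')  # a string is required as the `url` argument, but it is not used
article.set_html(html)
article.parse()  # parse the supplied HTML; the extracted text is then in article.text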
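If you have a whole folder of stored pages, the same steps can be wrapped in a small helper. This is only a rough sketch, not something Newspaper provides: `read_html`, `extract_text` and `extract_folder` are names I made up, the files are assumed to be UTF-8, and it assumes that calling `article.parse()` after `set_html()` fills in `article.text`.

```python
from pathlib import Path


def read_html(path, encoding="utf-8"):
    """Read a stored page into a string, replacing undecodable bytes."""
    return Path(path).read_text(encoding=encoding, errors="replace")


def extract_text(html):
    """Extract the article body from an HTML string with Newspaper."""
    import newspaper  # imported here so the file helpers work without it

    article = newspaper.Article("")  # url argument is required but unused
    article.set_html(html)
    article.parse()  # assumption: parse() works on HTML given via set_html()
    return article.text


def extract_folder(folder, extractor=extract_text):
    """Map each *.html file name in `folder` to its extracted text."""
    return {
        path.name: extractor(read_html(path))
        for path in sorted(Path(folder).glob("*.html"))
    }
```

Calling `extract_folder('pages')` (where `pages` is a placeholder directory name) would then give you a dict of file name to article text. The `extractor` argument is there so you can swap in a different library without touching the file handling.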
There are libraries for this in Python too :)
Since you mentioned Java, there's a Python wrapper for boilerpipe that lets you use it directly inside a Python script: https://github.com/misja/python-boilerpipe
If you want to use pure-Python libraries, there are two options:
https://github.com/buriy/python-readability
and
https://github.com/grangier/python-goose
Of the two, I prefer Goose. Be aware, however, that recent versions of it sometimes fail to extract text for some reason, so my recommendation is to use version 1.0.22 for now.
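For comparison, here is a rough sketch of the python-readability option. As far as I know its `Document.summary()` returns cleaned-up HTML rather than plain text, so the sketch strips the tags afterwards with the standard library. `html_to_text` and `readability_text` are hypothetical helper names, and the tag stripping is deliberately naive (it just joins the text nodes with newlines).

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def html_to_text(fragment):
    """Return the text of an HTML fragment, one text node per line."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return "\n".join(collector.chunks)


def readability_text(html):
    """Return (title, plain text) for a page using python-readability."""
    from readability import Document  # imported lazily; needs python-readability

    doc = Document(html)
    return doc.title(), html_to_text(doc.summary())
```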
EDIT: here's some sample code using Goose:
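from goose import Goose
from requests import get
# Fetch the page; response.content holds the raw HTML
response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
# Pass the HTML in directly instead of letting Goose download the URL
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text  # the extracted article body as plain text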
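Since the question is about pages already stored on disk, the same `raw_html` argument should work with the contents of a file instead of a downloaded response. A minimal sketch, assuming the file holds the page exactly as it was downloaded; `read_stored_page` and `goose_text` are hypothetical helper names, and the file is read as bytes to mirror `response.content` above.

```python
def read_stored_page(path):
    """Return the raw bytes of a stored page, like response.content."""
    with open(path, "rb") as f:
        return f.read()


def goose_text(raw_html):
    """Extract the article text from raw HTML with Goose."""
    from goose import Goose  # imported lazily; python-goose must be installed

    article = Goose().extract(raw_html=raw_html)
    return article.cleaned_text
```

Usage would be something like `text = goose_text(read_stored_page('saved_article.html'))`, where the file name is a placeholder.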