lxml / BeautifulSoup parser warning
When using BeautifulSoup, we usually write something like this:
[variable] = BeautifulSoup([contents you want to analyze])
Here is the problem:
If "lxml" is installed, BeautifulSoup automatically picks it as the parser and prints a message telling you so. It's not an error, just a notification.
So how do you get rid of it?
Pass the parser explicitly, like this:
[variable] = BeautifulSoup([contents you want to analyze], features = "lxml")
(Based on BeautifulSoup 4.6.3, the latest version when this was written.)
Note that the syntax for specifying the parser can differ between BeautifulSoup versions, so read the notification message carefully.
Good luck!
If you create the soup like this:
soup = BeautifulSoup(html_doc)
Use
soup = BeautifulSoup(html_doc, 'html.parser')
instead
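For anyone who wants a complete snippet to paste, here is a minimal sketch of the same fix. It assumes the bs4 package is installed; the sample markup and the variable names are made up for illustration. It uses Python's built-in "html.parser", so lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html_doc = (
    "<html><head><title>Example</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

# Naming the parser explicitly means BeautifulSoup does not have to guess one.
soup = BeautifulSoup(html_doc, features="html.parser")

title_text = soup.title.string
intro_text = soup.find("p", class_="intro").get_text()
print(title_text, intro_text)
```

Passing the parser positionally, as in BeautifulSoup(html_doc, 'html.parser'), should behave the same way, since features is the second parameter of the constructor.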
I had to read lxml's and BeautifulSoup's source code to figure this out.
I'm posting my own answer here, in case someone else may need it in the future.
The fromstring function in question is defined like so:
def fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):
The **bsargs arguments end up being forwarded to the BeautifulSoup constructor, which is called like so (in another function, _parse):
tree = beautifulsoup(source, **bsargs)
The BeautifulSoup constructor is defined so:
def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             **kwargs):
Now, back to the warning in the question, which recommends adding the argument "html.parser" to BeautifulSoup's constructor. According to the signature above, that would be the argument named features.
Since the fromstring function passes named arguments on to BeautifulSoup's constructor, we can specify the parser by naming the argument in the call to fromstring, like so:
root = fromstring(clean, features='html.parser')
Poof. The warning disappears.
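If the forwarding is hard to picture, here is a rough sketch of the mechanism using only the standard library. None of this is lxml's or BeautifulSoup's actual code: fake_soup, fake_fromstring and count_warnings are hypothetical stand-ins that only mimic the two signatures quoted above, and the warning text is a placeholder.

```python
import warnings


def fake_soup(markup="", features=None, **kwargs):
    """Hypothetical stand-in for the BeautifulSoup constructor."""
    if features is None:
        # Mimic the notification issued when no parser is named.
        warnings.warn("no parser specified (stand-in message)", UserWarning)
        features = "guessed"
    return {"markup": markup, "features": features}


def fake_fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):
    """Hypothetical stand-in for fromstring: forwards **bsargs unchanged."""
    if beautifulsoup is None:
        beautifulsoup = fake_soup
    return beautifulsoup(data, **bsargs)


def count_warnings(func, *args, **kwargs):
    """Call func and return (result, number of warnings it issued)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, len(caught)


without_parser = count_warnings(fake_fromstring, "<p>hi</p>")
with_parser = count_warnings(fake_fromstring, "<p>hi</p>", features="html.parser")
print(without_parser[1], with_parser[1])  # prints: 1 0
```

The keyword features is not a parameter of the outer function, so it lands in **bsargs and reaches the inner constructor untouched, which is why naming it in the fromstring call is enough.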