Best library to parse HTML with Python 3 and example?

Web-scraping in Python 3 is currently very poorly supported; all the decent libraries work only with Python 2. If you must web scrape in Python, use Python 2.

Although Beautiful Soup is oft recommended (every question regarding web scraping with Python in Stack Overflow suggests it), it's not as good for Python 3 as it is for Python 2; I couldn't even install it as the installation code was still Python 2.

As for adequate and simple-to-install solutions for Python 3, you can try the library's HTML parser, although quite barebones, it comes with Python 3.

If your HTML is well formed, you have many options, such as sax and dom. If it is not well formed you need a fault tolerant parser such as Beautiful soup, element tidy, or lxml's HTML parser. No parser is perfect, when presented with a variety of broken HTML, sometimes I have to try more then one. Lxml and Elementree use a mostly compatible api that is more of a standard than Beautiful soup.

In my opinion, lxml is the best module for working with xml documents, but the ElementTree included with python is still pretty good. In the past I have used Beautiful soup to convert HTML to xml and construct ElementTree for processing the data.

Best library to parse HTML with Python 3 and example?

Tags:

Python 3.X

Related

Recent Posts