Build a super fast web scraper with Python, 100x faster than BeautifulSoup
Web scraping is a technique for extracting structured information from web pages. With Python, you can build an efficient web scraper using BeautifulSoup, requests and other libraries. However, these solutions are not fast enough. In this article, I will show you some tips to build a super fast web scraper with Python.
Don't use BeautifulSoup4 #
BeautifulSoup4 is flexible and user-friendly, but it is not fast. Even if you use an external parser such as lxml for HTML parsing, or cchardet to detect the encoding, it is still slow.
Use selectolax instead of BeautifulSoup4 for HTML parsing #
selectolax is a Python binding to the Modest and Lexbor engines. To install selectolax with pip:
pip install selectolax
The usage of selectolax is similar to BeautifulSoup4.
from selectolax.parser import HTMLParser
html = """
<body>
<h1>Welcome to selectolax tutorial</h1>
<div id="text">
<p class='p3'>Lorem ipsum</p>
<p class='p3'>Lorem ipsum 2</p>
</div>
<div>
<p id='stext'>Lorem ipsum dolor sit amet, ea quo modus meliore platonem.</p>
</div>
</body>
"""
# Select all elements with class 'p3'
parser = HTMLParser(html)
parser.css('p.p3')
# Select first match
parser.css_first('p.p3')
# Iterate over all nodes on the current level
for node in parser.css('div'):
    for cnode in node.iter():
        print(cnode.tag, cnode.html)
For more information, please visit the selectolax walkthrough tutorial.
Use httpx instead of requests #
Python requests is an HTTP client for humans. It is easy to use, but it is not fast, and it only supports synchronous requests.
httpx is a fully featured HTTP client for Python 3, with support for both HTTP/1.1 and HTTP/2. It offers a standard synchronous API by default, but also gives you the option of an async client when you need it.
To install httpx with pip:
pip install httpx
httpx offers a similar API to requests:
import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.status_code)
        print(response.json())

asyncio.run(main())
For examples and usage, please visit the httpx home page.
Use aiofiles for file IO #
aiofiles is a Python library for asyncio-based file I/O. It provides a high-level API for working with files.
To install aiofiles with pip:
pip install aiofiles
Basic usage:
import asyncio
import aiofiles

async def main():
    async with aiofiles.open('test.txt', 'w') as f:
        await f.write('Hello world!')
    async with aiofiles.open('test.txt', 'r') as f:
        print(await f.read())

asyncio.run(main())
For more information, please visit the aiofiles repository.