Crawl specific pages and data and make it searchable
Crawling and indexing can take a while, but you won't be crawling the same site every 2 minutes, so you can split the work: one algorithm that puts more effort into crawling and indexing your data, and a separate one that keeps searches fast.
You can keep crawling your data all the time and update the rest of the tables in the background (every X minutes/hours), so your search results will be fresh all the time but you won't have to wait for the crawl to end.
Crawling
Just get all the data you can (probably all the HTML code) and store it in a simple table. You'll need this data for the indexing analysis. This table might be big, but it doesn't need good performance because it is only used by background jobs and is never exposed to user searches.
ALL_DATA
____________________________________________
| Url | Title | Description | HTML_Content |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
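A minimal sketch of that storage step, assuming SQLite through Python's sqlite3 module. The table mirrors the ALL_DATA layout above; the function names and the in-memory database are assumptions made for illustration, and the actual page fetching is left out.

```python
import sqlite3

def create_all_data_table(conn):
    # One row per crawled URL; the raw HTML is kept for later indexing.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS all_data (
               url TEXT PRIMARY KEY,
               title TEXT,
               description TEXT,
               html_content TEXT
           )"""
    )

def store_page(conn, url, title, description, html_content):
    # INSERT OR REPLACE so that re-crawling a URL overwrites the old copy.
    conn.execute(
        "INSERT OR REPLACE INTO all_data "
        "(url, title, description, html_content) VALUES (?, ?, ?, ?)",
        (url, title, description, html_content),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # assumption: swap in a file path for real use
create_all_data_table(conn)
store_page(conn, "http://example.com", "Example", "An example page", "<html>v1</html>")
store_page(conn, "http://example.com", "Example", "An example page", "<html>v2</html>")
```

Storing the same URL twice leaves a single row holding the newest HTML, which is what you want when the crawler revisits a page.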
Tables and Indexing
Create a big table that contains URLs and keywords
KEYWORDS
_________________
| URL | Keyword |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
This table will contain most of the words in each URL's content (I would remove words like "the", "on", "with", "a", etc.).
Create a table with keywords. For each occurrence of a keyword in a URL's content, add 1 to the Occurrences column
KEYWORDS
_______________________________
| URL | Keyword | Occurrences |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
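The keyword counting could look roughly like this. It is only a sketch: the stop-word set extends the four examples above with a few more common words as an assumption, and `keyword_occurrences` and `keyword_rows` are made-up names. A real indexer would strip the HTML tags first.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "on", "with", "a", "an", "and", "of", "in", "to", "is"}

def keyword_occurrences(text):
    # Lowercase, split into alphanumeric words, drop stop words, count the rest.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)

def keyword_rows(url, text):
    # One (URL, Keyword, Occurrences) tuple per keyword, ready to insert.
    counts = keyword_occurrences(text)
    return [(url, keyword, n) for keyword, n in sorted(counts.items())]

rows = keyword_rows("http://example.com", "The cat sat on the mat with a cat")
```

Each tuple in `rows` maps directly onto one row of the KEYWORDS table.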
Create another, much smaller table with "hot" keywords
HOT_KEYWORDS
_________________
| URL | Keyword |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
This table's content will be loaded later according to search queries.
The most common search words will be stored in the HOT_KEYWORDS table.
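One way to decide which words are "hot" is to count the words in incoming queries and keep the top few. This is just a sketch; the `QueryLog` class and the plain whitespace split are assumptions, not something the table design requires.

```python
from collections import Counter

class QueryLog:
    """Counts words seen in search queries (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, query):
        # Count every word of the query once per appearance.
        for word in query.lower().split():
            self.counts[word] += 1

    def hot_keywords(self, n):
        # The n most frequently searched words, most common first.
        return [word for word, _ in self.counts.most_common(n)]

log = QueryLog()
for q in ["python crawler", "python search", "python crawler tips"]:
    log.record(q)
hot = log.hot_keywords(2)
```

A background job could run something like `hot_keywords` every few minutes and copy the matching rows from KEYWORDS into HOT_KEYWORDS.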
Another table will hold cached search results
CACHED_RESULTS
_________________
| Keyword | Url |
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Searching algorithm
First, search the CACHED_RESULTS table. If it has enough results, return them. If it doesn't, search the bigger KEYWORDS table. Your data is not that big, so searching by the keyword index won't take too long. If you find more relevant results, add them to the cache for later use.
Note: You have to pick an eviction policy to keep your CACHED_RESULTS table small (for example, record when each entry was last used and remove the least recently used entry when the cache is full).
This way the cache table will reduce the load on the keyword tables and give you very fast results for the common searches.
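Here is a small in-memory sketch of that lookup order with a least-recently-used eviction policy. The `CachedSearch` class is an assumption for illustration: a plain dict stands in for the KEYWORDS table, an OrderedDict stands in for CACHED_RESULTS, and "enough results" is simplified to "any cached entry".

```python
from collections import OrderedDict

class CachedSearch:
    def __init__(self, keyword_index, max_entries=1000):
        self.keyword_index = keyword_index  # keyword -> list of URLs (stand-in for KEYWORDS)
        self.cache = OrderedDict()          # keyword -> list of URLs (stand-in for CACHED_RESULTS)
        self.max_entries = max_entries

    def search(self, keyword):
        # 1) Try the cache first and mark the entry as most recently used.
        if keyword in self.cache:
            self.cache.move_to_end(keyword)
            return self.cache[keyword]
        # 2) Fall back to the bigger keyword index.
        urls = list(self.keyword_index.get(keyword, []))
        if urls:
            # 3) Remember the result, evicting the least recently used entry if full.
            self.cache[keyword] = urls
            if len(self.cache) > self.max_entries:
                self.cache.popitem(last=False)
        return urls

index = {"a": ["u1"], "b": ["u2"], "c": ["u3"]}
searcher = CachedSearch(index, max_entries=2)
for word in ["a", "b", "a", "c"]:
    searcher.search(word)
```

After those four searches the cache holds "a" and "c": "b" was the least recently used entry when "c" arrived, so it was evicted.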
I have experience building large-scale web scrapers and can testify that there will always be big challenges to overcome when undertaking this task. Web scrapers run into problems ranging from CPU issues to storage to network problems, and any custom scraper needs to be modular enough that changes in one part don't break the application as a whole. In my projects I have taken the following approach:
Figure out where your application can be logically split up
For me this meant building 3 distinct sections:
Web Scraper Manager
Web Scraper
HTML Processor
The work could then be divided up like so:
1) The Web Scraper Manager
The Web Scraper Manager pulls URLs to be scraped and spawns Web Scrapers. The Web Scraper Manager needs to flag all URLs that have been sent to the web scrapers as being "actively scraped" and know not to pull them down again while they are in that state. Upon receiving a message from the scrapers, the manager will either delete the row or leave it in the "actively scraped" state if no errors occurred; otherwise it will reset it back to "inactive".
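The state handling might be sketched like this, assuming a SQLite table named `urls` with a `state` column and a single manager process. All names here are hypothetical, and this version always deletes the row on success, which is one of the two options mentioned above.

```python
import sqlite3

def create_url_table(conn):
    conn.execute(
        "CREATE TABLE urls (url TEXT PRIMARY KEY, "
        "state TEXT NOT NULL DEFAULT 'inactive')"
    )

def claim_urls(conn, limit):
    # Pick up to `limit` inactive URLs and flag them as actively scraped.
    with conn:  # commits on success, rolls back on error
        rows = conn.execute(
            "SELECT url FROM urls WHERE state = 'inactive' ORDER BY url LIMIT ?",
            (limit,),
        ).fetchall()
        urls = [row[0] for row in rows]
        conn.executemany(
            "UPDATE urls SET state = 'active' WHERE url = ?",
            [(url,) for url in urls],
        )
    return urls

def report_result(conn, url, ok):
    # Called when a scraper reports back: drop the row on success,
    # reset it to inactive on failure so it gets picked up again.
    with conn:
        if ok:
            conn.execute("DELETE FROM urls WHERE url = ?", (url,))
        else:
            conn.execute("UPDATE urls SET state = 'inactive' WHERE url = ?", (url,))
```

With several manager processes you would need stronger locking than this, for example a database that supports row-level locks.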
2) The Web Scraper
The Web Scraper receives a URL to scrape, fetches it (with cURL, for example) and downloads the HTML. All of this HTML can then be stored in a relational database with the following structure:
ID | URL | HTML (BLOB) | PROCESSING
Processing is an integer flag which indicates whether or not the data is currently being processed. This lets other parsers know not to pull the data if it is already being looked at.
3) The HTML Processor
The HTML Processor will continually read from the HTML table, marking rows as active every time it pulls a new entry. The HTML Processor has the freedom to operate on the HTML for as long as needed to parse out any data. This can include links to other pages on the site, which could be placed back in the URL table to start the process again, as well as any relevant data (meta tags, etc.), images, and so on.
Once all relevant data has been parsed out, the HTML Processor would send all this data into an Elasticsearch cluster. Elasticsearch provides fast full-text search, which could be made even faster by splitting the data into separate fields:
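As a rough idea of the parsing step, here is a sketch built on Python's standard html.parser module. The `PageExtractor` class and `extract` function are assumptions; in practice you might prefer a more forgiving HTML library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageExtractor(HTMLParser):
    """Collects the title, meta tags, links and images of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.meta = {}
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links so they can go back into the URL table.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract(base_url, html):
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser

page = extract(
    "http://example.com",
    '<html><head><title>Hi</title><meta name="description" content="Desc"></head>'
    '<body><a href="/about">About</a><img src="image1.png"></body></html>',
)
```

The collected `page.links` are the candidates to push back into the URL table, while the title, meta tags and images feed the search document.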
{
    "url" : "http://example.com",
    "meta" : {
        "title" : "The meta title from the page",
        "description" : "The meta description from the page",
        "keywords" : "the,keywords,for,this,page"
    },
    "body" : "The body content in its entirety",
    "images" : [
        "image1.png",
        "image2.png"
    ]
}
Now your website/service can have access to the latest data in real time. The parser needs to handle errors robustly so that it can reset the processing flag if it cannot pull data out, or at least log the failure somewhere so it can be reviewed.
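A sketch of that error handling, under the assumption that a row is a plain dict with `url`, `html` and `processing` keys and that `parse` is whatever parsing function you use. Both `process_row` and the logger name are made up for this example.

```python
import logging

logger = logging.getLogger("html_processor")

def process_row(row, parse):
    # Mark the row as being processed so other processors skip it.
    row["processing"] = 1
    try:
        return parse(row["url"], row["html"])
    except Exception:
        # Log the failure for later review and release the row again.
        logger.exception("could not parse %s", row["url"])
        row["processing"] = 0
        return None

def broken_parser(url, html):
    # Stand-in for a parser that chokes on some input.
    raise ValueError("unexpected markup")

good_row = {"url": "http://example.com", "html": "<html></html>", "processing": 0}
bad_row = {"url": "http://example.org", "html": "<html", "processing": 0}
good_doc = process_row(good_row, lambda url, html: {"url": url, "body": html})
bad_doc = process_row(bad_row, broken_parser)
```

In a real setup the flag would live in the HTML table rather than in a dict, but the flow stays the same: claim the row, try to parse it, and release it on failure.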
What are the advantages?
The advantage of this approach is that if you ever want to change the way you are pulling, processing or storing data, you can change just that piece without having to re-architect the entire application. Further, if one part of the scraper/application breaks, the rest can continue to run without any data loss and without stopping other processes.
What are the disadvantages?
It's a big, complex system. Any time you have a big, complex system you are asking for big, complex bugs. Unfortunately, web scraping and data processing are complex undertakings, and in my experience there is no way around having a complex solution to this particularly complex problem.