How often do Google's web spiders crawl the web?

Google's spiders are constantly crawling the web. They run many machines that continuously re-crawl the pages in their massive index and add new pages to it all the time.

Reasons it's fast:

  • They have tons of machines doing the crawling at ridiculous speeds
  • They have tons of bandwidth available
  • They already have a giant index of pages, which saves time when hunting for new content: they can re-request pages they've already indexed and parse them for new links to crawl (see the sketch after this list).
  • They have been doing this for years and have fine-tuned their crawling algorithm, and they continue to improve it to this day.
  • Some sites are indexed more often than others depending on several factors, PR (PageRank) being a big one. If your site has a high PR, you'll see it updated quickly. That's why you'll often see Superuser questions turn up in search results minutes after they've been asked.
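Here's a minimal sketch of the re-crawl idea from the list above: fetch pages that are already known and collect any links that aren't in the index yet. It's plain Python with a toy known_urls set, nothing remotely like Google's actual pipeline, just the "parse old pages for new links" loop:

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))

def recrawl(known_urls):
    """Re-fetch already-indexed pages and return any links not yet known."""
    discovered = set()
    for url in known_urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        parser = LinkExtractor(url)
        parser.feed(html)
        discovered |= parser.links - known_urls
    return discovered

# Seed with a page already in the "index" and harvest links not yet seen.
new_links = recrawl({"https://example.com/"})
print(new_links)
```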

Edit:

[image: chart of additional factors that affect how often a site gets crawled]

...among many other factors.

Google has an abundance of space and bandwidth. Don't you worry about them! As of January 2008, Google was sorting (on average) 20PB a day. 20PB (petabytes) is 20,000 terabytes, or 20 million gigabytes. And that's just the data they sort; it isn't all of their data, only a fraction of it.
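If you want to double-check that conversion yourself, it's just decimal unit arithmetic (1 PB = 1,000 TB = 1,000,000 GB):

```python
# Decimal units: 1 PB = 1,000 TB = 1,000,000 GB
petabytes_per_day = 20
terabytes = petabytes_per_day * 1_000        # 20,000 TB
gigabytes = petabytes_per_day * 1_000_000    # 20,000,000 GB
print(f"{petabytes_per_day} PB/day = {terabytes:,} TB/day = {gigabytes:,} GB/day")
```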

An interesting question came up while running experiments at such a scale: Where do you put 1PB of sorted data? We were writing it to 48,000 hard drives (we did not use the full capacity of these disks, though), and every time we ran our sort, at least one of our disks managed to break (this is not surprising at all given the duration of the test, the number of disks involved, and the expected lifetime of hard disks). To make sure we kept our sorted petabyte safe, we asked the Google File System to write three copies of each file to three different disks.
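The "three copies on three different disks" trick is easy to picture with a toy sketch. This is not GFS and the "disks" below are just local directories I made up for illustration; the point is simply that the same bytes land on three distinct devices, so losing any single one never loses the file:

```python
import os
import random

def write_with_replicas(filename, data, disk_paths, copies=3):
    """Write the same bytes to `copies` distinct disks (modelled as directories).

    Losing any single disk then still leaves (copies - 1) intact replicas.
    """
    if len(disk_paths) < copies:
        raise ValueError("need at least as many disks as replicas")
    chosen = random.sample(disk_paths, copies)   # three different disks
    for disk in chosen:
        os.makedirs(disk, exist_ok=True)
        with open(os.path.join(disk, filename), "wb") as f:
            f.write(data)
    return chosen

# Toy example: four "disks" stood in for by local directories.
disks = ["replica-disk-0", "replica-disk-1", "replica-disk-2", "replica-disk-3"]
print(write_with_replicas("sorted-chunk-000", b"...sorted records...", disks))
```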

Simply incredible.