Does any open, easily extensible web crawler exist?

I used Nutch extensively when I was building the open source project index for my Krugle startup. It's hard to customize because the design is fairly monolithic. There is a plug-in architecture, but the interaction between plug-ins and the core system is tricky and fragile.
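
To give a sense of what that plug-in work looks like, here is a minimal sketch of a Nutch URL-filter plug-in, assuming the `org.apache.nutch.net.URLFilter` extension point from Nutch 1.x (class and package names may vary between versions). The filter class itself is hypothetical and only half the job:

```java
// Hypothetical example: a Nutch URLFilter plug-in that keeps only HTTPS URLs.
// Assumes the Nutch 1.x org.apache.nutch.net.URLFilter extension point.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class HttpsOnlyUrlFilter implements URLFilter {

    private Configuration conf;

    // Return the URL unchanged to keep it, or null to drop it from the crawl.
    @Override
    public String filter(String urlString) {
        if (urlString != null && urlString.startsWith("https://")) {
            return urlString;
        }
        return null;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }
}
```

The class compiles against the Nutch and Hadoop jars, but it doesn't do anything until it's packaged as a plug-in with its own plugin.xml descriptor and enabled via the plugin.includes property, and that registration layer is where the tricky, fragile interactions tend to show up.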

As a result of that experience, and because I needed something more flexible, I started the Bixo project, a web mining toolkit: http://openbixo.org.

Whether it's right for you depends on how you weight factors such as these, where (+) marks a point in Bixo's favor and (-) a point against it:

  1. How much flexibility you need (+)
  2. How mature you need it to be (-)
  3. Whether you need the ability to scale (+)
  4. Whether you're comfortable with Java/Hadoop (+)

I heartily recommend Heritrix. It is very flexible, and I'd argue it's the most battle-tested freely available open source crawler, since it's the one the Internet Archive uses.


You should be able to find something that fits your needs here.