Top 50 open source web crawlers for data mining

A web crawler (also known as an ant, automatic indexer, bot, web spider, web robot, or web scutter) is an automated program or script that methodically scans, or “crawls”, web pages to build an index of the data it is set to look for. This process is called web crawling or spidering.
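
To make the definition concrete, here is a minimal sketch of that scan-and-index loop, using only the Python standard library. It is an illustration, not how any crawler in this list is implemented. The names `LinkAndTextParser`, `fetch`, and `crawl` are hypothetical, and the sketch assumes UTF-8 pages, stays on a single host, and treats every whitespace-separated token as a word.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects href targets and text tokens from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Naive tokenizer: lowercases and splits on whitespace only.
        self.words.extend(data.lower().split())


def fetch(url):
    """Download a page as text (assumes UTF-8, replaces bad bytes)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=20):
    """Breadth-first crawl from start_url, staying on the same host.

    Returns a dict mapping each word to the set of URLs that contain it.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    index = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        fetched += 1
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

A call such as `crawl("https://example.com/")` (a placeholder URL) would return the word-to-pages index for up to 20 pages on that host. A production crawler would also honor robots.txt, throttle its requests, and skip script and style content, all of which this sketch leaves out.
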
Web crawlers have many uses, but essentially they collect (mine) data from the Internet. Most search engines use them to keep their data up to date and to find what’s new on the Internet. Analytics companies and market researchers use them to determine customer and market trends in a given geography. In this article, we present the top 50 open source web crawlers available on the web for data mining.

Name | Language | Platform
Heritrix | Java | Linux
Nutch | Java | Cross-platform
Scrapy | Python | Cross-platform
DataparkSearch | C++ | Cross-platform
GNU Wget | C | Linux
GRUB | C#, C, Python, Perl | Cross-platform
ht://Dig | C++ | Unix
HTTrack | C/C++ | Cross-platform
ICDL Crawler | C++ | Cross-platform
mnoGoSearch | C | Windows
Norconex HTTP Collector | Java | Cross-platform
Open Search Server | C/C++, Java, PHP | Cross-platform
PHP-Crawler | PHP | Cross-platform
YaCy | Java | Cross-platform
WebSPHINX | Java | Cross-platform
WebLech | Java | Cross-platform
Arale | Java | Cross-platform
JSpider | Java | Cross-platform
HyperSpider | Java | Cross-platform
Arachnid | Java | Cross-platform
Spindle | Java | Cross-platform
Spider | Java | Cross-platform
LARM | Java | Cross-platform
Metis | Java | Cross-platform
SimpleSpider | Java | Cross-platform
Grunk | Java | Cross-platform
CAPEK | Java | Cross-platform
Aperture | Java | Cross-platform
Smart and Simple Web Crawler | Java | Cross-platform
Web Harvest | Java | Cross-platform
Aspseek | C++ | Linux
Bixo | Java | Cross-platform
crawler4j | Java | Cross-platform
Ebot | Erlang | Linux
Hounder | Java | Cross-platform
Hyper Estraier | C/C++ | Cross-platform
OpenWebSpider | C#, PHP | Cross-platform
Pavuk | C | Linux
Sphider | PHP | Cross-platform
Xapian | C++ | Cross-platform
(name missing) | C# | Windows
Crawwwler | C++ | Java
Distributed Web Crawler | C, Java, Python | Cross-platform
iCrawler | Java | Cross-platform
pycreep | Java | Cross-platform
Opese | C++ | Linux
Andjing | Java |
Ccrawler | C# | Windows
WebEater | Java | Cross-platform
JoBo | Java | Cross-platform

  1. me 4 years ago

    This is web scraping, not data mining.

    • corey 10 months ago

      Web scraping can be the first step in creating a database, which can then be data mined. The two are not mutually exclusive.

  2. Vijay 3 years ago

    You should also check out some customised web crawlers like &

    • Sankar Prasanth 11 months ago

      We can build customized web crawlers without paying them. For example, we can create one with storm-crawler.

  3. Napoleon 3 years ago

    How can one use any of the crawlers to extract website indexes?

    • Andyj 1 year ago

      DNS caching?

  4. 3 years ago

    Should update the list with CasperJS and PhantomJS

    • Wildan Fathan 2 years ago

      They are headless browsers, not scrapers. But they can be used for scraping.

  5. Harald Hanche-Olsen 3 years ago

    Misspelling alert: Ebot is written in Erlang, not in Erland.

    • Rickety Janes 2 years ago

      Erland is the country where Erlang is spoken.

    • Editor / BDMS 2 years ago

      It is corrected. Thanks.

  6. Katrin Pudikova 3 years ago

    Sorry, but could you please provide the sources from which you gathered this list? And could you kindly explain how this ranking was built? I can’t see which metrics were used to sort these crawlers this way!
    What’s more:
    1) Open Source Server is actually called Open SEARCH Server
    2) mnoGoSearch runs under UNIX
    3) many of the mentioned crawlers are out of date; their development simply stopped several years ago.
    This article provides only a list of some crawlers found on the internet and nothing more.

  7. sunstate DState 2 years ago

    Maybe my desktop crawler will satisfy somebody’s little need.

    • Sankar Prasanth 11 months ago

      No use for that. We can customize like that without any payment. Waste of money.

  8. Matt 2 years ago

    Take a look at nohodo proxy network for crawling. Best I’ve ever used!

  9. Andrew brown 2 years ago

    Can anyone create one for me?

  10. clasher 1 year ago

    How do I run any of these crawlers?

  11. Let's Talk 1 year ago

    Please add to the list

  12. Brian Wilkie 1 year ago

    Best part of this article? It’s not an article because it provides absolutely no support for why these apps are included in the list and why others are not. This is click bait because it provides no value to the reader. A simple Google search of “web crawlers” gives you the same value.

  13. Statistrix 1 year ago

    Great source of knowledge. Moreover, data scraping could be a great source of data for analysis.
    Site recommended by

  14. Data Meets Media 1 year ago

    Cool. This is really helpful. Thanks Baiju NT. I’ve been meaning to do some web crawling of my own, and it seems Scrapy or GRUB is my best bet since Python is my language of choice.

  15. Sercan Turna 1 year ago

    is a better solution, I think.

  16. Juan Yang 11 months ago

    Can I ask how to use the above for my wear-resistant alumina ceramic website:

  17. Or Vibes 10 months ago

    What can a crawler site do for me, please?

  18. Ralf Ritter 10 months ago

    Great List ! Thanks 🙂

  19. Mike Trxx 7 months ago

    Why not build web scrapers in robust Go without knowing Go? Diggernaut made it possible: Oh, and yes, it’s completely free if you don’t use their cloud.

  20. Salim Khalil 7 months ago

    If you are using R, then I recommend the RCrawler package for crawling and data collection.
    tutorials :

    • juwa 6 months ago

      Hi, Salim. I am working on a simple web crawling assignment in Java and am stuck on it. I have limited knowledge of programming, so I have spent a long time on the assignment without making any progress. I am seeking your help with completing some sections of the program.

  21. asqueu 6 months ago

    Try uCrawler, a cloud platform that lets you create your own news aggregator based on artificial intelligence technologies.

  22. Ogudu 5 months ago

    Unimpressed! Came here looking for a simple, lightweight tool that can search the content of an entire website for a particular word on the site’s pages, like an email, a name, etc. But all I see is web scrapers, SEO tools, and other similarly annoying apps.
