Top 50 open source web crawlers for data mining

A web crawler (also known by other names such as ants, automatic indexers, bots, web spiders, web robots or web scutters) is an automated program, or script, that methodically scans or “crawls” through web pages to build an index of the data it is set to look for. This process is called web crawling or spidering.
There are various uses for web crawlers, but essentially a web crawler is used to collect or mine data from the Internet. Most search engines use crawlers as a means of providing up-to-date data and finding what’s new on the Internet. Analytics companies and market researchers use web crawlers to determine customer and market trends in a given geography. In this article, we present the top 50 open source web crawlers available on the web for data mining.
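As a rough illustration of the process described above, here is a minimal breadth-first crawler sketch in pure Python (standard library only). All names (`crawl`, `LinkExtractor`, `fetch_url`) are illustrative and not taken from any of the tools listed below; the fetcher is injectable so the traversal logic can be exercised without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_url(url):
    """Default fetcher: download a page over HTTP(S)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_url, max_pages=10):
    """Breadth-first crawl from start_url; returns {url: html}."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip unreachable pages
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

A production crawler also needs politeness controls (robots.txt, rate limiting), URL canonicalization, and far more robust error handling, which is exactly what the tools listed below provide.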

Name Language Platform
Heritrix Java Linux
Nutch Java Cross-platform
Scrapy Python Cross-platform
DataparkSearch C++ Cross-platform
GNU Wget C Linux
GRUB C#, C, Python, Perl Cross-platform
ht://Dig C++ Unix
HTTrack C/C++ Cross-platform
ICDL Crawler C++ Cross-platform
mnoGoSearch C Unix
Norconex HTTP Collector Java Cross-platform
Open Search Server C/C++, Java, PHP Cross-platform
PHP-Crawler PHP Cross-platform
YaCy Java Cross-platform
WebSPHINX Java Cross-platform
WebLech Java Cross-platform
Arale Java Cross-platform
JSpider Java Cross-platform
HyperSpider Java Cross-platform
Arachnid Java Cross-platform
Spindle Java Cross-platform
Spider Java Cross-platform
LARM Java Cross-platform
Metis Java Cross-platform
SimpleSpider Java Cross-platform
Grunk Java Cross-platform
CAPEK Java Cross-platform
Aperture Java Cross-platform
Smart and Simple Web Crawler Java Cross-platform
Web Harvest Java Cross-platform
Aspseek C++ Linux
Bixo Java Cross-platform
crawler4j Java Cross-platform
Ebot Erlang Linux
Hounder Java Cross-platform
Hyper Estraier C/C++ Cross-platform
OpenWebSpider C#, PHP Cross-platform
Pavuk C Linux
Sphider PHP Cross-platform
Xapian C++ Cross-platform
Arachnode.net C# Windows
Crawwwler C++ Java
Distributed Web Crawler C, Java, Python Cross-platform
iCrawler Java Cross-platform
pycreep Java Cross-platform
Opese C++ Linux
Andjing Java
Ccrawler C# Windows
WebEater Java Cross-platform
JoBo Java Cross-platform
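Whichever tool you choose, a well-behaved crawler honors a site’s robots.txt before fetching pages. As a small illustration (not tied to any specific tool above), Python’s standard library ships a parser for it; the sample robots.txt below is parsed inline so the example runs offline:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Against a live site you would instead do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch("mybot", "https://example.com/index.html"))  # True
print(rp.can_fetch("mybot", "https://example.com/private/x"))   # False
```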
32 Comments
  1. me 4 years ago

    This is web scraping, not data mining.

    • corey 10 months ago

      web scraping can be the first step to create a database, which can then be data mined… Not mutually exclusive.

  2. Vijay 3 years ago

    You should also check out some customised web crawlers like https://www.promptcloud.com/
    http://www.80legs.com & Import.io

    • Sankar Prasanth 11 months ago

      We can build customized web crawlers without paying them, for example with StormCrawler.

  3. Napoleon 3 years ago

    How can one use any of these crawlers to extract website indexes?

    • Andyj 1 year ago

      DNS caching?

  4. Baixa.la 3 years ago

    Should update the list with CasperJS and PhantomJS

    • Wildan Fathan 2 years ago

      They’re headless browsers, not scrapers, but they can be used for scraping.

  5. Harald Hanche-Olsen 3 years ago

    Misspelling alert: Ebot is written in Erlang, not in Erland.

    • Rickety Janes 2 years ago

      Erland is the country where Erlang is spoken.

    • Editor / BDMS 2 years ago

      It is corrected. Thanks.

  6. Katrin Pudikova 3 years ago

    Sorry, but could you please provide the sources from which you gathered this list? And could you kindly explain how this ranking was built? I can’t work out the metrics by which these crawlers were sorted!
    Even more:
    1) Open Source Server is actually called Open SEARCH Server
    2) mnoGoSearch runs under UNIX
    3) many of the mentioned crawlers are no longer current; their development stopped several years ago.
    This article provides only a list of crawlers found on the internet and nothing more.

  7. sunstate DState 2 years ago

    Maybe my desktop crawler, http://www.sqrbox.in/scrawler, will satisfy somebody’s little need.

    • Sankar Prasanth 11 months ago

      No use for that. We can customize something like that without any payment. A waste of money.

  8. Matt 2 years ago

    Take a look at nohodo proxy network for crawling. Best I’ve ever used!

  9. Andrew brown 2 years ago

    can anyone create one for me?

  10. clasher 1 year ago

    How do I run any of these crawlers?

  11. Let's Talk 1 year ago

    Please add http://stormcrawler.net/ to the list

  12. Brian Wilkie 1 year ago

    Best part of this article? It’s not an article because it provides absolutely no support for why these apps are included in the list and why others are not. This is click bait because it provides no value to the reader. A simple Google search of “web crawlers” gives you the same value.

  13. Statistrix 1 year ago

    Great source of knowledge. Moreover, data scraping can be a great source of data for analysis.
    Site recommended by statistrix.com

  14. Data Meets Media 1 year ago

    Cool. This is really helpful. Thanks Baiju NT. I’ve been meaning to do some web crawling of my own, and it seems Scrapy or GRUB is my best bet since Python is my language of choice.
    http://datameetsmedia.com/

  15. Sercan Turna 1 year ago

    http://www.analysemysite.com is a better solution, I think.

  16. Juan Yang 11 months ago

    Can I ask how to use the above for my wear-resistant alumina ceramic website: http://www.chemshun.com

  17. Or Vibes 10 months ago

    I have https://orvibes.com/. What can a crawler do for me, please?

  18. Ralf Ritter 10 months ago

    Great List ! Thanks 🙂
    Ralf
    http://www.ethereumkurs.de

  19. Mike Trxx 7 months ago

    Why not build web scrapers in robust Go without knowing Go? Diggernaut makes it possible: https://www.diggernaut.com. Oh, and yes, it’s completely free if you don’t use their cloud.

  20. Salim Khalil 7 months ago

    If you are using R, then I recommend the RCrawler package for crawling and data collection
    tutorials : https://github.com/salimk/Rcrawler/

    • juwa 6 months ago

      Hi, Salim. I am working on a simple web crawling assignment in Java and am stuck on it. I have limited knowledge of programming, so it has taken me a long time without any progress on the assignment. I am seeking your help with completing some sections of the program.

  21. asqueu 6 months ago

    Try uCrawler, a cloud platform that lets you create your own news aggregator based on artificial intelligence technologies. https://ucrawler.newsbot.press/index-en.html

  22. Ogudu 5 months ago

    Unimpressed! I came here looking for a simple, lightweight tool that can search the content of an entire website for a particular word on the site’s pages, like an email or a name. But all I see is web scrapers, SEO tools, and other similar annoying apps.
