
Top 50 open source web crawlers for data mining

A web crawler (also known as an ant, automatic indexer, bot, web spider, web robot, or web scutter) is an automated program or script that methodically scans, or “crawls”, web pages to build an index of the data it is set to look for. This process is called web crawling or spidering.
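
To make the definition above concrete, here is a minimal sketch of a crawl loop using only the Python standard library. It is an illustration, not how any crawler in the list below is implemented. The names PageParser, fetch_over_http, and crawl, and the max_pages limit, are assumptions made for this example. The sketch fetches a page, records which words appear on it, extracts its links, and queues any link it has not yet seen.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects link targets and text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Split text nodes into lowercase words for the index.
        self.words.extend(data.lower().split())


def fetch_over_http(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_over_http, max_pages=20):
    """Breadth-first crawl from start_url; returns {word: set of URLs}."""
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        fetched += 1
        parser = PageParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

The fetch parameter is injectable so the loop can be tried against an in-memory dictionary of pages instead of the live web. A production crawler would also need to honor robots.txt, limit its request rate, and restrict which hosts it follows, all of which this sketch leaves out.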

Web crawlers have many uses, but at their core they collect, or mine, data from the Internet. Most search engines use them to keep their data up to date and to discover new content. Analytics companies and market researchers use web crawlers to identify customer and market trends in a given geography. In this article, we present the top 50 open source web crawlers available for data mining.

Top 50 open source web crawlers

Name | Language | Platform
Heritrix | Java | Linux
Nutch | Java | Cross-platform
Scrapy | Python | Cross-platform
DataparkSearch | C++ | Cross-platform
GNU Wget | C | Linux
GRUB | C#, C, Python, Perl | Cross-platform
ht://Dig | C++ | Unix
HTTrack | C/C++ | Cross-platform
ICDL Crawler | C++ | Cross-platform
mnoGoSearch | C | Windows
Norconex HTTP Collector | Java | Cross-platform
Open Source Server | C/C++, Java, PHP | Cross-platform
PHP-Crawler | PHP | Cross-platform
YaCy | Java | Cross-platform
WebSPHINX | Java | Cross-platform
WebLech | Java | Cross-platform
Arale | Java | Cross-platform
JSpider | Java | Cross-platform
HyperSpider | Java | Cross-platform
Arachnid | Java | Cross-platform
Spindle | Java | Cross-platform
Spider | Java | Cross-platform
LARM | Java | Cross-platform
Metis | Java | Cross-platform
SimpleSpider | Java | Cross-platform
Grunk | Java | Cross-platform
CAPEK | Java | Cross-platform
Aperture | Java | Cross-platform
Smart and Simple Web Crawler | Java | Cross-platform
Web Harvest | Java | Cross-platform
Aspseek | C++ | Linux
Bixo | Java | Cross-platform
crawler4j | Java | Cross-platform
Ebot | Erlang | Linux
Hounder | Java | Cross-platform
Hyper Estraier | C/C++ | Cross-platform
OpenWebSpider | C#, PHP | Cross-platform
Pavuk | C | Linux
Sphider | PHP | Cross-platform
Xapian | C++ | Cross-platform
Arachnode.net | C# | Windows
Crawwwler | C++ | Java
Distributed Web Crawler | C, Java, Python | Cross-platform
iCrawler | Java | Cross-platform
pycreep | Java | Cross-platform
Opese | C++ | Linux
Andjing | Java | (not listed)
Ccrawler | C# | Windows
WebEater | Java | Cross-platform
JoBo | Java | Cross-platform