<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Big Data Made Simple - One source. Many perspectives. &#187; Data Mining</title>
	<atom:link href="http://bigdata-madesimple.com/category/tech-and-tools/data-mining/feed/" rel="self" type="application/rss+xml" />
	<link>http://bigdata-madesimple.com</link>
	<description>One source. Many perspectives.</description>
	<lastBuildDate>Sat, 08 Jul 2017 05:11:57 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.2</generator>
		<item>
		<title>Top 50 open source web crawlers for data mining</title>
		<link>http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/</link>
		<comments>http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/#comments</comments>
		<pubDate>Thu, 15 Jun 2017 05:30:00 +0000</pubDate>
		<dc:creator>Baiju NT</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://www.bigdata-madesimple.com/?p=12407</guid>
		<description><![CDATA[<p>A web crawler (also known by other names such as ants, automatic indexers, bots, web spiders, web robots or...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/">Top 50 open source web crawlers for data mining</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>A web crawler (also known by other names such as ants, automatic indexers, bots, web spiders, web robots or web scutters) is an automated program, or script, that methodically scans or &#8220;crawls&#8221; through web pages to create an index of the data it is set to look for. This process is called web crawling or spidering.</p>
<p>There are various uses for web crawlers, but essentially a web crawler is used to collect/mine data from the Internet. Most search engines use them as a means of providing up-to-date data and finding what’s new on the Internet. Analytics companies and market researchers use web crawlers to determine customer and market trends in a given geography. In this article, we present the top 50 open source web crawlers available on the web for data mining.</p>
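<p>To make the idea concrete, here is a minimal, illustrative crawler sketch in Python. It uses only the standard library, the seed URL is a placeholder, and it is a simplification of what the tools below do rather than a description of any particular project.</p>
<pre><code># Minimal breadth-first crawler sketch (illustrative only; the seed URL is a placeholder).
import re
import urllib.request
from collections import deque

def crawl(seed, max_pages=20):
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid revisits
    index = {}                 # URL mapped to raw page content
    # Stop once the frontier is empty or max_pages pages have been indexed.
    while frontier and len(index) != max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                page = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue           # skip pages that fail to load
        index[url] = page
        # Very naive link extraction; real crawlers use a proper HTML parser.
        for link in re.findall(r'href="(https?://[^"]+)"', page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

pages = crawl("http://example.com/")
print(len(pages), "pages fetched")
</code></pre>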
<table>
<tbody>
<tr>
<td><strong>Name</strong></td>
<td><strong>Language</strong></td>
<td><strong>Platform</strong></td>
</tr>
<tr>
<td><a href="http://crawler.archive.org/" target="_blank">Heritrix</a></td>
<td>Java</td>
<td>Linux</td>
</tr>
<tr>
<td><a href="http://nutch.apache.org/" target="_blank">Nutch</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://scrapy.org/" target="_blank">Scrapy</a></td>
<td>Python</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.dataparksearch.org/" target="_blank">DataparkSearch</a></td>
<td>C++</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="https://www.gnu.org/software/wget/" target="_blank">GNU Wget</a></td>
<td>C</td>
<td>Linux</td>
</tr>
<tr>
<td><a href="http://freecode.com/projects/grubng" target="_blank">GRUB</a></td>
<td>C#, C, Python, Perl</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.htdig.org/" target="_blank">ht://Dig</a></td>
<td>C++</td>
<td>Unix</td>
</tr>
<tr>
<td><a href="http://www.httrack.com/" target="_blank">HTTrack</a></td>
<td>C/C++</td>
<td>Cross-platform</td>
</tr>
<tr>
<td>ICDL Crawler</td>
<td>C++</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.mnogosearch.org/" target="_blank">mnoGoSearch</a></td>
<td>C</td>
<td>Windows</td>
</tr>
<tr>
<td><a href="http://www.norconex.com/collectors/" target="_blank">Norconex HTTP Collector</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.opensearchserver.com/" target="_blank">Open Source Server</a></td>
<td>C/C++, Java PHP</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://phpcrawl.cuab.de/" target="_blank">PHP-Crawler</a></td>
<td>PHP</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://yacy.net/en/index.html" target="_blank">YaCy</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.cs.cmu.edu/~rcm/websphinx/" target="_blank">WebSPHINX</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://weblech.sourceforge.net/" target="_blank">WebLech</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://flavio.tordini.org/arale" target="_blank">Arale</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://sourceforge.net/projects/j-spider/" target="_blank">JSpider</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://hyperspider.sourceforge.net/" target="_blank">HyperSpider</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://arachnid.sourceforge.net/" target="_blank">Arachnid</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.bitmechanic.com/projects/spindle/" target="_blank">Spindle</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.tempeststrings.com/spider/index.shtml" target="_blank">Spider</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://larm.sourceforge.net/" target="_blank">LARM</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.severus.org/sacha/metis/" target="_blank">Metis</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://staff.develop.com/halloway/SimpleSpider.html" target="_blank">SimpleSpider</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://dlt.ncsa.uiuc.edu/archive/emerge/components_grunk.html" target="_blank">Grunk</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.egothor.org/c124.html" target="_blank">CAPEK</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://aperture.sourceforge.net/" target="_blank">Aperture</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="https://crawler.dev.java.net/" target="_blank">Smart and Simple Web Crawler</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://web-harvest.sourceforge.net/" target="_blank">Web Harvest</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.findbestopensource.com/product/aspseek" target="_blank">Aspseek</a></td>
<td>C++</td>
<td>Linux</td>
</tr>
<tr>
<td><a href="http://openbixo.org/" target="_blank">Bixo</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="https://code.google.com/p/crawler4j/" target="_blank">crawler4j</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="https://github.com/matteoredaelli/ebot" target="_blank">Ebot</a></td>
<td>Erlang</td>
<td>Linux</td>
</tr>
<tr>
<td><a href="https://code.google.com/p/hounder/" target="_blank">Hounder</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://sourceforge.net/projects/hyperestraier/" target="_blank">Hyper Estraier</a></td>
<td>C/C++</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.openwebspider.org/" target="_blank">OpenWebSpider</a></td>
<td>C#, PHP</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.pavuk.org/man.html" target="_blank">Pavuk</a></td>
<td>C</td>
<td>Linux</td>
</tr>
<tr>
<td><a href="http://www.sphider.eu/index.php" target="_blank">Sphider</a></td>
<td>PHP</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://xapian.org/" target="_blank">Xapian</a></td>
<td>C++</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.findbestopensource.com/product/arachnode" target="_blank">Arachnode.net</a></td>
<td>C#</td>
<td>Windows</td>
</tr>
<tr>
<td><a href="http://www.findbestopensource.com/product/crawwwler" target="_blank">Crawwwler</a></td>
<td>C++</td>
<td>Java</td>
</tr>
<tr>
<td><a href="http://www.findbestopensource.com/product/distributed-web-crawler" target="_blank">Distributed Web Crawler</a></td>
<td>C, Java, Python</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.findbestopensource.com/product/iwebcrawler" target="_blank">iCrawler</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.findbestopensource.com/product/pycreep" target="_blank">pycreep</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.findbestopensource.com/product/opese" target="_blank">Opese</a></td>
<td>C++</td>
<td>Linux</td>
</tr>
<tr>
<td><a href="https://code.google.com/p/andjing/" target="_blank">Andjing</a></td>
<td>Java</td>
<td></td>
</tr>
<tr>
<td><a href="http://www.findbestopensource.com/product/ccrawler" target="_blank">Ccrawler</a></td>
<td>C#</td>
<td>Windows</td>
</tr>
<tr>
<td><a href="http://webeater.sourceforge.net/" target="_blank">WebEater</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
<tr>
<td><a href="http://www.matuschek.net/jobo/" target="_blank">JoBo</a></td>
<td>Java</td>
<td>Cross-platform</td>
</tr>
</tbody>
</table>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/">Top 50 open source web crawlers for data mining</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Top 20 web crawler tools to scrape the websites</title>
		<link>http://bigdata-madesimple.com/top-20-web-crawler-tools-scrape-websites/</link>
		<comments>http://bigdata-madesimple.com/top-20-web-crawler-tools-scrape-websites/#comments</comments>
		<pubDate>Sat, 03 Jun 2017 05:30:52 +0000</pubDate>
		<dc:creator>Baiju NT</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=21474</guid>
		<description><![CDATA[<p>Web crawling (also known as web scraping) is a process in which a program or automated script browses...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/top-20-web-crawler-tools-scrape-websites/">Top 20 web crawler tools to scrape the websites</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p align="left">Web crawling (also known as web scraping) is a process in which a program or automated script browses the World Wide Web in a methodical, automated manner with the aim of fetching new or updated data from websites and storing it for easy access. Web crawler tools are very popular these days because they simplify and automate the entire crawling process, making web data accessible to everyone. In this post, we will look at the top 20 popular web crawlers around the web.</p>
<p><strong>1. <span style="text-decoration: underline;"><a href="https://www.cyotek.com/cyotek-webcopy">Cyotek WebCopy</a></span></strong></p>
<p>WebCopy is a free website crawler that allows you to copy partial or full websites locally onto your hard disk for offline reading.</p>
<p>It will scan the specified website before downloading its content onto your hard disk and auto-remap the links to resources like images and other web pages so that they match their local paths. You can also exclude a section of the website, and additional options are available, such as downloading a URL to include in the copy without crawling it.</p>
<p>There are many settings for configuring how your website will be crawled; in addition to the rules and forms mentioned above, you can also configure domain aliases, user agent strings, default documents and more.</p>
<p>However, WebCopy does not include a virtual DOM or any form of JavaScript parsing. If a website makes heavy use of JavaScript, it is unlikely WebCopy will be able to make a true copy, since it cannot discover all of the website when JavaScript is used to dynamically generate links.</p>
<p><strong>2.  </strong><strong><a href="https://www.httrack.com/">HTTrack</a></strong></p>
<p>As website crawler freeware, HTTrack provides functions well suited for downloading an entire website from the Internet to your PC. Versions are available for Windows, Linux, Sun Solaris and other Unix systems. It can mirror one site, or more than one site together (with shared links). Under “Set options” you can decide how many connections to open concurrently while downloading web pages. You can get the photos, files and HTML code from entire directories, update a currently mirrored website and resume interrupted downloads.</p>
<p>Plus, proxy support is available with HTTrack to maximize speed, with optional authentication.</p>
<p>HTTrack works as a command-line program, or through a shell, for both private (capture) and professional (on-line web mirror) use. That said, HTTrack is best suited to people with advanced programming skills.</p>
<p><strong>3.</strong><b> <strong><a href="http://www.octoparse.com/">Octoparse</a></strong></b></p>
<p>Octoparse is a free and powerful website crawler used for extracting almost all kinds of data you need from websites. You can use Octoparse to rip a website with its extensive functionalities and capabilities. There are two kinds of learning mode &#8211; Wizard Mode and Advanced Mode &#8211; that help non-programmers quickly get used to Octoparse. After downloading the freeware, its point-and-click UI allows you to grab all the text from a website, so you can download almost all the website content and save it in a structured format like Excel, TXT or HTML, or to your databases.</p>
<p>More advanced, it provides Scheduled Cloud Extraction, which enables you to refresh a website and get its latest information on a schedule.</p>
<p>You can also extract data from many tough websites with difficult data block layouts using its built-in Regex tool, and locate web elements precisely using the XPath configuration tool. You will not be bothered by IP blocking any more, since Octoparse offers IP proxy servers that automate IP rotation so your crawls are not detected by aggressive websites.</p>
<p>To conclude, Octoparse should be able to satisfy users’ most common crawling needs, both basic and high-end, without any coding skills.</p>
<p><strong>4</strong>. <strong><span style="text-decoration: underline;"><a href="https://sourceforge.net/projects/getleftdown/">Getleft</a></span></strong></p>
<p>Getleft is a free and easy-to-use website grabber that can be used to rip a website. It downloads an entire website with its easy-to-use interface and multiple options. After you launch Getleft, you can enter a URL and choose the files that should be downloaded before it begins downloading the website. As it goes, it changes the original pages so that all links become relative links, for local browsing. Additionally, it offers multilingual support; at present Getleft supports 14 languages. However, it only provides limited FTP support: it will download files but not recursively. Overall, Getleft should satisfy users’ basic crawling needs without requiring more advanced skills.</p>
<p><strong>5</strong>. <strong><span style="text-decoration: underline;"><a href="https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd">Scraper</a></span></strong></p>
<p>Scraper is a Chrome extension with limited data extraction features, but it’s helpful for online research and for exporting data to Google Spreadsheets. This tool is intended for beginners as well as experts, who can easily copy data to the clipboard or store it in spreadsheets using OAuth. Scraper is a free web crawler tool, which works right in your browser and auto-generates smaller XPaths for defining URLs to crawl. It may not offer all-inclusive crawling services, but novices needn’t tackle messy configurations either.</p>
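<p>As a rough illustration of what evaluating such an XPath looks like outside the browser, here is a small Python sketch using the lxml library; the saved file name and the expression are placeholders for this example, not output produced by Scraper itself.</p>
<pre><code># Illustrative XPath extraction with lxml; the file name and expression are placeholders.
from lxml import html

with open("saved_page.html", "r", encoding="utf-8") as f:
    tree = html.fromstring(f.read())

# Pull the text of every link sitting inside a second-level heading.
for title in tree.xpath("//h2/a/text()"):
    print(title.strip())
</code></pre>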
<p><strong>6</strong>. <strong><span style="text-decoration: underline;"><a href="https://addons.mozilla.org/en-US/firefox/addon/outwit-hub/">OutWit Hub</a></span></strong></p>
<p>OutWit Hub is a Firefox add-on with dozens of data extraction features to simplify your web searches. This web crawler tool can browse through pages and store the extracted information in a proper format.</p>
<p>OutWit Hub offers a single interface for scraping tiny or huge amounts of data per needs. OutWit Hub lets you scrape any web page from the browser itself and even create automatic agents to extract data and format it per settings.</p>
<p>It is one of the simplest web scraping tools, which is free to use and offers you the convenience to extract web data without writing a single line of code.</p>
<p><strong>7. </strong><strong><span style="text-decoration: underline;"><a href="https://www.parsehub.com/">ParseHub</a></span></strong></p>
<p>Parsehub is a great web crawler that supports collecting data from websites that use AJAX technologies, JavaScript, cookies etc. Its machine learning technology can read, analyze and then transform web documents into relevant data.</p>
<p>The desktop application of Parsehub supports systems such as Windows, Mac OS X and Linux, or you can use the web app that is built within the browser.</p>
<p>As a freeware, you can set up no more than five public projects in Parsehub. The paid subscription plans allow you to create at least 20 private projects for scraping websites.</p>
<p><strong>8</strong>.<strong> <span style="text-decoration: underline;"><a href="http://visualscraper.blogspot.hk/">Visual Scraper</a></span></strong></p>
<p>VisualScraper is another great free, non-coding web scraper with a simple point-and-click interface that can be used to collect data from the web. You can get real-time data from several web pages and export the extracted data as CSV, XML, JSON or SQL files. Besides the SaaS, VisualScraper offers web scraping services such as data delivery and the creation of software extractors.</p>
<p>Visual Scraper enables users to schedule their projects to run at a specific time or to repeat the sequence every minute, day, week, month or year. Users could use it to frequently extract news, updates and forum posts.</p>
<p><strong>9.</strong> <strong><span style="text-decoration: underline;"><a href="https://scrapinghub.com/">Scrapinghub</a></span></strong></p>
<p>Scrapinghub is a cloud-based data extraction tool that helps thousands of developers fetch valuable data. Its open source visual scraping tool allows users to scrape websites without any programming knowledge.</p>
<p>Scrapinghub uses Crawlera, a smart proxy rotator that supports bypassing bot counter-measures to crawl huge or bot-protected sites easily. It enables users to crawl from multiple IPs and locations without the pain of proxy management through a simple HTTP API.</p>
<p>Scrapinghub converts the entire web page into organized content. Its team of experts is available to help in case its crawl builder can’t meet your requirements.</p>
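<p>To give a sense of the proxy-based approach in general terms, the sketch below routes requests through a rotating proxy endpoint using Python’s requests library. The proxy host, port and API key are placeholders for illustration, not Crawlera’s actual endpoint or credentials.</p>
<pre><code># Rough sketch of sending requests through a rotating proxy service.
# The proxy host, port and API key below are placeholders, not real credentials.
import requests

proxies = {
    "http": "http://YOUR_API_KEY:@proxy.example.com:8010",
    "https": "http://YOUR_API_KEY:@proxy.example.com:8010",
}

response = requests.get("http://example.com/", proxies=proxies, timeout=30)
print(response.status_code)
</code></pre>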
<p><strong>10.</strong><span style="text-decoration: underline;"> <strong><a href="https://dexi.io/">Dexi.io</a></strong></span></p>
<p>As a browser-based web crawler, Dexi.io allows you to scrape data from any website based on your browser and provides three types of robots for creating a scraping task &#8211; Extractor, Crawler and Pipes. The freeware provides anonymous web proxy servers for your web scraping, and your extracted data is hosted on Dexi.io’s servers for two weeks before being archived, or you can directly export the extracted data to JSON or CSV files. It offers paid services to meet your needs for real-time data.</p>
<p><strong>11.</strong> <strong><a href="https://webhose.io/">Webhose.io</a></strong></p>
<p>Webhose.io enables users to get real-time data from crawling online sources from all over the world into various, clean formats. This web crawler enables you to crawl data and further extract keywords in many different languages using multiple filters covering a wide array of sources.</p>
<p>You can save the scraped data in XML, JSON and RSS formats, and access historical data from its archive. Webhose.io supports up to 80 languages in its crawled results, and users can easily index and search the structured data crawled by Webhose.io.</p>
<p>Overall, Webhose.io could satisfy users’ elementary crawling requirements.</p>
<p><strong>12</strong>. <strong><span style="text-decoration: underline;"><a href="https://www.import.io/">Import.io</a></span></strong></p>
<p>Users can form their own datasets by simply importing the data from a web page and exporting the data to CSV.</p>
<p>You can easily scrape thousands of web pages in minutes without writing a single line of code and build 1000+ APIs based on your requirements. Its public APIs provide powerful and flexible capabilities for controlling Import.io programmatically and gaining automated access to the data, and Import.io has made crawling easier by integrating web data into your own app or website with just a few clicks.</p>
<p>To better serve users&#8217; crawling requirements, it also offers a free app for Windows, Mac OS X and Linux to build data extractors and crawlers, download data and sync with the online account. Plus, users can schedule crawling tasks weekly, daily or hourly.</p>
<p><strong>13</strong>. <strong><span style="text-decoration: underline;"><a href="http://80legs.com/">80legs</a></span></strong></p>
<p>80legs is a powerful web crawling tool that can be configured based on customized requirements. It supports fetching huge amounts of data along with the option to download the extracted data instantly. 80legs provides high-performance web crawling that works rapidly and fetches required data in mere seconds.</p>
<p><strong>14</strong>. <span style="text-decoration: underline;"><strong><a href="https://www.spinn3r.com/">Spinn3r</a></strong></span></p>
<p>Spinn3r allows you to fetch entire data from blogs, news &amp; social media sites and RSS &amp; ATOM feeds. Spinn3r is distributed with a firehose API that manages 95% of the indexing work. It offers advanced spam protection, which removes spam and inappropriate language use, thus improving data safety.</p>
<p>Spinn3r indexes content like Google and saves the extracted data in JSON files. The web scraper constantly scans the web and finds updates from multiple sources to get you real-time publications. Its admin console lets you control crawls and full-text search allows making complex queries on raw data.</p>
<p><strong>15. </strong><span style="text-decoration: underline;"><strong><a href="https://contentgrabber.com/">Content Grabber</a></strong></span></p>
<p>Content Grabber is a web crawling software targeted at enterprises. It allows you to create stand-alone web crawling agents. It can extract content from almost any website and save it as structured data in a format of your choice, including Excel reports, XML, CSV and most databases.</p>
<p>It is more suitable for people with advanced programming skills, since it offers powerful script editing and debugging interfaces for those who need them. Users can use C# or VB.NET to debug or write scripts to control the crawling process. For example, Content Grabber can integrate with Visual Studio 2013 for powerful script editing, debugging and unit testing of an advanced, customized crawler tailored to users’ particular needs.</p>
<p><strong>16.</strong> <strong><span style="text-decoration: underline;"><a href="http://www.heliumscraper.com/en/index.php?p=home">Helium Scraper</a></span></strong></p>
<p>Helium Scraper is a visual web data crawling software that works well when the association between elements is small. It requires no coding or configuration, and users can access online templates for various crawling needs. Basically, it can satisfy users’ crawling needs at an elementary level.</p>
<p><strong>17.</strong> <strong><span style="text-decoration: underline;"><a href="http://www.uipath.com/">UiPath</a></span></strong></p>
<p>UiPath is a robotic process automation software for free web scraping. It automates web and desktop data crawling from most third-party apps. You can install the robotic process automation software if you run Windows. UiPath can extract tabular and pattern-based data across multiple web pages.</p>
<p>UiPath provides built-in tools for further crawling, which is very effective when dealing with complex UIs. The Screen Scraping Tool can handle individual text elements, groups of text and blocks of text, such as data extracted in table format.</p>
<p>Plus, no programming is needed to create intelligent web agents, but the .NET hacker inside you will have complete control over the data.</p>
<p><strong>18</strong>. <strong><span style="text-decoration: underline;"><a href="http://scrape.it/">Scrape.it</a></span></strong></p>
<p>Scrape.it is a node.js web scraping software for humans. It’s a cloud-based web data extraction tool designed for those with advanced programming skills, since it offers both public and private packages to discover, reuse, update and share code with millions of developers worldwide. Its powerful integration will help you build a customized crawler based on your needs.</p>
<p><strong>19.</strong> <strong><span style="text-decoration: underline;"><a href="https://www.webharvy.com/">WebHarvy</a></span></strong></p>
<p>WebHarvy is a point-and-click web scraping software. It’s designed for non-programmers. WebHarvy can automatically scrape Text, Images, URLs &amp; Emails from websites, and save the scraped content in various formats. It also provides a built-in scheduler and proxy support, which enables anonymous crawling and prevents the web scraping software from being blocked by web servers; you have the option to access target websites via proxy servers or a VPN.</p>
<p>Users can save the data extracted from web pages in a variety of formats. The current version of WebHarvy Web Scraper allows you to export the scraped data as an XML, CSV, JSON or TSV file. Users can also export the scraped data to an SQL database.</p>
<p><strong>20.</strong> <strong><span style="text-decoration: underline;"><a href="http://www.connotate.com/">Connotate</a></span></strong></p>
<p>Connotate is an automated web crawler designed for enterprise-scale web content extraction. Business users can easily create extraction agents in as little as minutes – without any programming – simply by pointing and clicking.</p>
<p>It can automatically extract over 95% of sites without programming, including complex JavaScript-based dynamic site technologies, such as Ajax. Connotate supports any language for data crawling from most sites.</p>
<p>Additionally, Connotate also offers the function to integrate webpage and database content, including content from SQL databases and MongoDB for database extraction.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/top-20-web-crawler-tools-scrape-websites/">Top 20 web crawler tools to scrape the websites</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/top-20-web-crawler-tools-scrape-websites/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The ultimate guide to web data extraction</title>
		<link>http://bigdata-madesimple.com/ultimate-guide-web-data-extraction/</link>
		<comments>http://bigdata-madesimple.com/ultimate-guide-web-data-extraction/#comments</comments>
		<pubDate>Sat, 27 May 2017 05:30:13 +0000</pubDate>
		<dc:creator>Baiju NT</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=21419</guid>
		<description><![CDATA[<p>Web data extraction (also known as web scraping, web harvesting, screen scraping, etc.) is a technique for extracting...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/ultimate-guide-web-data-extraction/">The ultimate guide to web data extraction</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Web data extraction (also known as web scraping, web harvesting, screen scraping, etc.) is a technique for extracting huge amounts of data from websites on the internet. The data available on websites is generally not easy to download and can only be accessed using a web browser. However, the web is the largest repository of open data, and this data has been growing at exponential rates since the inception of the internet.</p>
<p>Web data is of great use to ecommerce portals, media companies, research firms, data scientists and governments, and can even help the healthcare industry with ongoing research and making predictions on the spread of diseases.</p>
<p>Imagine the data available on classifieds sites, real estate portals, social networks, retail sites and online shopping websites being easily available in a structured format, ready to be analyzed. Most of these sites don’t provide the functionality to save their data to a local or cloud storage. Some sites provide APIs, but they typically come with restrictions and aren’t reliable enough. Although it’s technically possible to copy and paste data from a website to your local storage, this is inconvenient and out of the question when it comes to practical use cases for businesses.</p>
<p>Web scraping helps you do this in an automated fashion and does it far more efficiently and accurately. A web scraping setup interacts with websites in a way similar to a web browser, but instead of displaying it on a screen, it saves the data to a storage system.</p>
<h3><b>Applications of web data extraction</b></h3>
<p><b>1. Pricing intelligence</b></p>
<p>Pricing intelligence is an application that’s gaining popularity by each passing day given the tightening of competition in the online space. E-commerce portals are always watching out for their competitors using web crawling to have real time pricing data from them and to fine tune their own catalogues with competitive pricing. This is done by deploying web crawlers that are programmed to pull product details like product name, price, variant and so on. This data is plugged into an automated system that assigns ideal prices for every product after analyzing the competitors’ prices.</p>
<p>Pricing intelligence is also used in cases where there is a need for consistency in pricing across different versions of the same portal. The capability of web crawling techniques to extract prices in real time makes such applications a reality.</p>
<p><b>2. Cataloging</b></p>
<p>Ecommerce portals typically have a huge number of product listings. It’s not easy to update and maintain such a big catalog. This is why many companies depend on web data extraction services for gathering the data required to update their catalogs. This helps them discover new categories they haven’t been aware of or update existing catalogs with new product descriptions, images or videos.</p>
<p><b>3. Market research</b></p>
<p>Market research is incomplete unless the amount of data at your disposal is huge. Given the limitations of traditional methods of data acquisition and considering the volume of relevant data available on the web, web data extraction is by far the easiest way to gather data required for market research. The shift of businesses from brick and mortar stores to online spaces has also made web data a better resource for market research.</p>
<p><b>4. Sentiment analysis</b></p>
<p>Sentiment analysis requires data extracted from websites where people share their reviews, opinions or complaints about services, products, movies, music or any other consumer focused offering. Extracting this user generated content would be the first step in any sentiment analysis project and web scraping serves the purpose efficiently.</p>
<p><b>5. Competitor analysis</b></p>
<p>The possibility of monitoring competition was never this accessible until web scraping technologies came along. By deploying web spiders, it’s now easy to closely monitor the activities of your competitors like the promotions they’re running, social media activity, marketing strategies, press releases, catalogs etc. in order to have the upper hand in competition. Near real time crawls take it a level further and provides businesses with real time competitor data.</p>
<p><b>6. Content aggregation</b></p>
<p>Media websites need instant access to breaking news and other trending information on the web on a continuous basis. Being quick at reporting news is critical for these companies. Web crawling makes it possible to monitor or extract data from popular news portals, forums or similar sites for trending topics or keywords that you want to monitor. Low latency web crawling is used for this use case as the update speed should be very high.</p>
<p><b>7. Brand monitoring</b></p>
<p>Every brand now understands the importance of customer focus for business growth. It would be in their best interests to have a clean reputation for their brand if they want to survive in this competitive market. Most companies are now using web crawling solutions to monitor popular forums, reviews on ecommerce sites and social media platforms for mentions of their brand and product names. This in turn can help them stay tuned to the voice of the customer and fix issues that could ruin brand reputation at the earliest. There’s no doubt that a customer-focused business moves up the growth curve.</p>
<h3><b>Different approaches to web data extraction</b></h3>
<p>There are businesses that function solely based on data, others use it for business intelligence, competitor analysis and market research among other countless use cases. However, extracting massive amounts of data from the web is still a major roadblock for many companies, more so because they are not going through the optimal route. Here is a detailed overview of different ways by which you can extract data from the web.</p>
<p><b>1. DaaS</b></p>
<p>Outsourcing your web data extraction project to a DaaS provider is by far the best way to extract data from the web. When depending on a data provider, you are completely relieved from the responsibility of crawler setup, maintenance and quality inspection of the data being extracted. Since DaaS companies would have the necessary expertise and infrastructure required for a smooth and seamless data extraction, you can avail their services at a much lower cost than what you’d incur by doing it yourself.</p>
<p>Providing the DaaS provider with your exact requirements is all you need to do, and the rest is taken care of. You would have to send across details like the data points, source websites, frequency of crawl, data format and delivery methods. With DaaS, you get the data exactly the way you want, and you can rather focus on utilizing the data to improve your business bottom lines, which should ideally be your priority. Since they are experienced in scraping and possess domain knowledge to get the data efficiently and at scale, going with a DaaS provider is the right option if your requirement is large and recurring.</p>
<p>One of the biggest benefits of outsourcing is the data quality assurance. Since the web is highly dynamic in nature, data extraction requires constant monitoring and maintenance to work smoothly. Web data extraction services tackle all these challenges and deliver noise-free data of high quality.</p>
<p>Another benefit of going with a data extraction service is the customization and flexibility. Since these services are meant for enterprises, the offering is completely customizable according to your specific requirements.</p>
<p><b>Pros:</b></p>
<ul>
<li>Completely customisable for your requirement</li>
<li>Takes complete ownership of the process</li>
<li>Quality checks to ensure high quality data</li>
<li>Can handle dynamic and complicated websites</li>
<li>More time to focus on your core business</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li>Might have to enter into a long-term contract</li>
<li>Slightly costlier than DIY tools</li>
</ul>
<p><b>2. In house data extraction</b></p>
<p>You can go with in house data extraction if your company is technically rich. Web scraping is a technically niche process and demands a team of skilled programmers to code the crawlers, deploy them on servers, debug, monitor and do the post processing of extracted data. Apart from a team, you would also need high end infrastructure to run the crawling jobs.</p>
<p>Maintaining the in-house crawling setup can be a bigger challenge than building it. Web crawlers tend to be very fragile. They break even with small changes or updates in the target websites. You would have to set up a monitoring system to know when something goes wrong with the crawling task, so that it can be fixed to avoid data loss. You will have to dedicate time and labour to the maintenance of the in-house crawling setup.</p>
<p>Apart from this, the complexity associated with building an in-house crawling setup would go up significantly if the number of websites you need to scrape is high or the target sites are using dynamic coding practices. An in-house crawling setup would also take a toll on the focus and dilute your results as web scraping itself is something that needs specialization. If you aren’t cautious, it could easily hog your resources and cause friction in your operational workflow.</p>
<p><b>Pros:</b></p>
<ul>
<li>Total ownership and control over the process</li>
<li>Ideal for simpler requirements</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li>Maintenance of crawlers is a headache</li>
<li>Increased cost</li>
<li>Hiring, training and managing a team might be hectic</li>
<li>Might hog company resources</li>
<li>Could affect the core focus of the organisation</li>
<li>Infrastructure is costly</li>
</ul>
<p><b>3. Vertical specific solutions</b></p>
<p>There are data providers that cater to only a specific industry vertical. Vertical specific data extraction solutions are great if you could find one that’s catering to the domain you are targeting and covers all your necessary data points. The benefit of going with a vertical specific solution is the comprehensiveness of data that you would get. Since these solutions cater to only one specific domain, their expertise in that domain would be very high.</p>
<p>The schema of data sets you would get from vertical specific data extraction solutions is typically fixed and won’t be customizable. Your data project will be limited to the data points provided by such solutions, but this may or may not be a deal breaker depending on your requirements. These solutions typically give you datasets that are already extracted and ready to use. A good example of a vertical specific data extraction solution is JobsPikr, which is a<a href="https://www.jobspikr.com/?utm_source=ultimate-guide"> job listings data</a> solution that extracts data directly from career pages of company websites from across the world.</p>
<p><b>Pros:</b></p>
<ul>
<li>Comprehensive data from the industry</li>
<li>Faster access to data</li>
<li>No need to handle the complicated aspects of extraction</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li>Lack of customisation options</li>
<li>Data is not exclusive</li>
</ul>
<p><b>4. DIY data extraction tools</b></p>
<p>If you don’t have the budget for building an in-house crawling setup or outsourcing your data extraction process to a vendor, you are left with DIY tools. These tools are easy to learn and often provide a point and click interface to make data extraction simpler than you could ever imagine. These tools are an ideal choice if you are just starting out with no budgets for data acquisition. DIY web scraping tools are usually priced very low and some are even free to use.</p>
<p>However, there are serious downsides to using a DIY tool to extract data from the web. Since these tools wouldn’t be able to handle complex websites, they are very limited in terms of functionality, scale, and the efficiency of data extraction. Maintenance will also be a challenge with DIY tools as they are made in a rigid and less flexible manner. You will have to make sure that the tool is working and even make changes from time to time.</p>
<p>The only good side is that it doesn’t take much technical expertise to configure and use such tools, which might be right for you if you aren’t a technical person. Since the solution is readymade, you will also save the costs associated with building your own infrastructure for scraping. Those downsides apart, DIY tools can cater to simple and small scale data requirements.</p>
<p><b>Pros:</b></p>
<ul>
<li>Full control over the process</li>
<li>Prebuilt solution</li>
<li>You can avail support for the tools</li>
<li>Easier to configure and use</li>
</ul>
<p><b>Cons:</b></p>
<ul>
<li>They get outdated often</li>
<li>More noise in the data</li>
<li>Less customization options</li>
<li>Learning curve can be high</li>
<li>Interruption in data flow in case of structural changes</li>
</ul>
<h3><b>How web data extraction works</b></h3>
<p>There are several different methods and technologies that can be used to build a crawler and extract data from the web.</p>
<p><b>1. The seed</b></p>
<p>A seed URL is where it all starts. A crawler would start its journey from the seed URL and start looking for the next URL in the data that’s fetched from the seed. If the crawler is programmed to traverse the entire website, the seed URL would be the same as the root of the domain. The seed URL is programmed into the crawler at the time of setup and would remain the same throughout the extraction process.</p>
<p><b>2. Setting directions</b></p>
<p>Once the crawler fetches the seed URL, it would have different options to proceed further. These options would be hyperlinks on the page that it just loaded by querying the seed URL. The second step is to program the crawler to identify and take different routes by itself from this point. At this point, the bot knows where to start and where to go from there.</p>
<p><b>3. Queueing</b></p>
<p>Now that the crawler knows how to get into the depths of a website and reach the pages where the data to be extracted is, the next step is to compile all these destination pages into a repository from which it can pick the URLs to crawl. Once this is complete, the crawler starts fetching the URLs from the repository and saves these pages as HTML files on either a local or cloud based storage space. The final scraping happens at this repository of HTML files.</p>
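<p>A bare-bones sketch of this queue-and-save step in Python might look like the following; the URL list and output directory are placeholders rather than part of any particular crawler.</p>
<pre><code># Sketch of the queueing step: fetch each queued URL and save the raw HTML to disk.
# The URL list and output directory are placeholders.
import pathlib
import urllib.request
from collections import deque

queue = deque([
    "http://example.com/listing/page-1",
    "http://example.com/listing/page-2",
])
out_dir = pathlib.Path("html_repository")
out_dir.mkdir(exist_ok=True)

while queue:
    url = queue.popleft()
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
    # Derive a simple file name from the last segment of the URL path.
    name = url.rstrip("/").rsplit("/", 1)[-1] + ".html"
    (out_dir / name).write_bytes(body)
</code></pre>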
<p><b>4. Data extraction</b></p>
<p>Now that the crawler has saved all the pages that need to be scraped, it’s time to extract only the required data points from these pages. The schema used will be in accordance with your requirement. Now is the time to instruct the crawler to pick only the relevant data points from these HTML files and ignore the rest. The crawler can be taught to identify data points based on the HTML tags or class names associated with the data points.</p>
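<p>As an illustration of picking data points by tag and class name, the sketch below uses the BeautifulSoup parser on a saved page. The file name and class name are hypothetical examples, not a prescription for any particular site.</p>
<pre><code># Illustrative extraction step: pull fields from a saved page by tag and class name.
# The file name and the CSS class name are hypothetical.
from bs4 import BeautifulSoup

with open("html_repository/page-1.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

record = {
    "title": soup.find("h1").get_text(strip=True),
    "price": soup.find("span", class_="price").get_text(strip=True),
}
print(record)
</code></pre>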
<p><b>5. Deduplication and cleansing</b></p>
<p>Deduplication is a process done on the extracted records to eliminate the chances of duplicates in the extracted data. This will require a separate system that can look for duplicate records and remove them to make the data concise. The data could also have noise in it, which needs to be cleaned too. Noise here refers to unwanted HTML tags or text that got scraped along with the relevant data.</p>
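<p>A simple, hedged way to implement the deduplication step is to hash each extracted record and keep only the first occurrence of every hash, as in this sketch:</p>
<pre><code># Deduplication sketch: drop records whose content hash has been seen before.
import hashlib
import json

def deduplicate(records):
    seen = set()
    unique = []
    for record in records:
        key = hashlib.sha1(json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

rows = [{"title": "Blue shirt", "price": "19.99"},
        {"title": "Blue shirt", "price": "19.99"},
        {"title": "Red shirt", "price": "24.99"}]
print(deduplicate(rows))
</code></pre>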
<p><b>6. Structuring</b></p>
<p>Structuring is what makes the data compatible with databases and analytics systems by giving it a proper, machine readable syntax. This is the final process in data extraction and post this, the data is ready for delivery. With structuring done, the data is ready to be consumed either by importing it to a database or plugging it to an analytics system.</p>
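<p>In practice, the structuring step often comes down to writing the cleaned records out in a machine readable format such as CSV or JSON; a minimal sketch follows, with made-up field names.</p>
<pre><code># Structuring sketch: write cleaned records to CSV and JSON for downstream systems.
import csv
import json

records = [{"title": "Blue shirt", "price": "19.99"},
           {"title": "Red shirt", "price": "24.99"}]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(records)

with open("products.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)
</code></pre>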
<h3><b>Best practices in web data extraction </b></h3>
<p>As a great tool for deriving powerful insights, web data extraction has become imperative for businesses in this competitive market. As is the case with most powerful things, web scraping must be used responsibly. Here is a compilation of the best practices that you must follow while scraping websites.</p>
<p><b>1. Respect the robots.txt</b></p>
<p>You should always check the robots.txt file of a website you are planning to extract data from. Websites set rules on how bots should interact with the site in their robots.txt file. Some sites even block crawler access completely in their robots file. Extracting data from sites that disallow crawling can lead to legal ramifications and should be avoided. Apart from outright blocking, every site sets rules on good behavior in its robots.txt, and you are bound to follow these rules while extracting data from the target site.</p>
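<p>Python’s standard library ships a robots.txt parser, so checking whether a URL may be crawled takes only a few lines; the site and user agent in this sketch are placeholders.</p>
<pre><code># Check robots.txt before fetching; the site and user agent here are placeholders.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("http://example.com/robots.txt")
parser.read()

if parser.can_fetch("MyCrawlerBot", "http://example.com/listing/page-1"):
    print("Allowed to fetch this page")
else:
    print("Disallowed by robots.txt; skip this page")
</code></pre>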
<p><b>2. Do not hit the servers too frequently</b></p>
<p>Web servers are susceptible to downtime if the load is very high. Just like human users, bots can also add load to the website’s server. If the load exceeds a certain limit, the server might slow down or crash, rendering the website unresponsive for its users. This creates a bad user experience for the human visitors on the website, which defeats the whole purpose of the site. It should be noted that human visitors are of higher priority for the website than bots. To avoid such issues, you should set your crawler to hit the target site at a reasonable interval and limit the number of parallel requests. This will give the website some breathing space, which it should indeed have.</p>
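<p>In code, this politeness is as simple as pausing between requests and capping concurrency; the delay and URL list in this sketch are illustrative values, not universal recommendations.</p>
<pre><code># Politeness sketch: fetch URLs one at a time with a fixed delay between requests.
# The delay and URL list are illustrative values, not universal recommendations.
import time
import urllib.request

urls = ["http://example.com/page-1", "http://example.com/page-2"]
DELAY_SECONDS = 5

for url in urls:
    with urllib.request.urlopen(url, timeout=10) as resp:
        print(url, resp.status)
    time.sleep(DELAY_SECONDS)   # give the server breathing room before the next request
</code></pre>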
<p><b>3. Scrape during off peak hours</b></p>
<p>To make sure that the target website doesn’t slow down due to a high traffic from humans as well as bots, it is better to schedule your web crawling tasks to run in the off-peak hours. The off-peak hours of the site can be determined by the geo location of where the site’s majority of traffic is from. You can avoid possible overload on the website’s servers by scraping during off-peak hours. This will also have a positive effect on the speed of your data extraction process as the server would respond faster during this time.</p>
<p><b>4. Use the scraped data responsibly</b></p>
<p>Extracting data from the web has become an important business process. However, this doesn’t mean you own the data you extracted from a website on the internet. Publishing the data elsewhere without the consent of the website you are scraping can be considered unethical and you could be violating copyright laws. Using the data responsibly and in line with the target website’s policies is something you should practice while extracting data from the web.</p>
<h3><b>Finding reliable sources</b></h3>
<p><b>1. Avoid sites with too many broken links</b></p>
<p>Links are like the connecting tissue of the internet. A website that has too many broken links is a bad choice for a web data extraction project. This is an indicator of poor maintenance, and crawling such a site won’t be a good experience for you. For one, a scraping setup can come to a halt if it encounters a broken link during the fetching process. This would eventually tamper with the data quality, which should be a deal breaker for anyone who’s serious about the data project. You are better off with a different source website that has similar data and better housekeeping.</p>
<p><b>2. Avoid sites with highly dynamic coding practices</b></p>
<p>This might not always be an option; however, it is better to avoid sites with complex and dynamic practices to have a stable crawling job running. Since dynamic sites tend to be difficult to extract data from and change very frequently, maintenance could become a huge bottleneck. It’s always better to find less complex sites when it comes to web crawling.</p>
<p><b>3. Quality and freshness of the Data</b></p>
<p>The quality and freshness of data must be one of your most important criteria while choosing sources for data extraction. The data that you acquire should be fresh and relevant to the current time period for it to be of any use at all. Always look for sites that are updated frequently with fresh and relevant data when selecting sources for your data extraction project. You could check the last modified date in the site’s source code to get an idea of how fresh the data is.</p>
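<p>One quick, hedged way to gauge freshness is to ask the server for the Last-Modified header on a page, when it provides one; the URL in this sketch is a placeholder.</p>
<pre><code># Freshness check sketch: ask the server when a page was last modified.
# Not every site sends this header, so treat a missing value as unknown.
import urllib.request

req = urllib.request.Request("http://example.com/listing/page-1", method="HEAD")
with urllib.request.urlopen(req, timeout=10) as resp:
    last_modified = resp.headers.get("Last-Modified")

print(last_modified or "No Last-Modified header provided")
</code></pre>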
<h3><b>Legal aspects of web crawling</b></h3>
<p>Web data extraction is sometimes viewed with suspicion by people who aren’t very familiar with the concept. To clear the air, web scraping/crawling is not an unethical or illegal activity. The way a crawler bot fetches information from a website is no different from a human visitor consuming the content on a webpage. Google search, for example, runs on web crawling, and we don’t see anyone accusing Google of doing something even remotely illegal. However, there are some ground rules you should follow while scraping websites. If you follow these rules and operate as a good bot on the internet, you aren’t doing anything illegal. Here are the rules to follow:</p>
<ol>
<li>  Respect the robots.txt file of the target site</li>
<li>  Make sure you are staying compliant with the TOS page</li>
<li>  Do not reproduce the data elsewhere, online or offline without prior permission from the site</li>
</ol>
<p>If you follow these rules while crawling a website, you are completely in the safe zone.</p>
<h3><strong>Conclusion</strong></h3>
<p>We covered the important aspects of web data extraction here, like the different routes you can take to web data, best practices, various business applications and the legal aspects of the process. As the business world is rapidly moving towards a data-centric operational model, it’s high time to evaluate your data requirements and get started with extracting relevant data from the web to improve your business efficiency and boost revenues. This guide should help you get going in case you get stuck during the journey.</p>
<p>Source: <a href="https://www.promptcloud.com/blog/ultimate-web-data-extraction-guide" target="_blank">promptcloud.com</a></p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/ultimate-guide-web-data-extraction/">The ultimate guide to web data extraction</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/ultimate-guide-web-data-extraction/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Automating workflow management and processes through streamlined data analysis</title>
		<link>http://bigdata-madesimple.com/automating-workflow-management-and-processes-through-streamlined-data-analysis-2/</link>
		<comments>http://bigdata-madesimple.com/automating-workflow-management-and-processes-through-streamlined-data-analysis-2/#comments</comments>
		<pubDate>Tue, 13 Dec 2016 11:04:35 +0000</pubDate>
		<dc:creator>Ahamed Meeran</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://54.179.177.208/?p=20679</guid>
		<description><![CDATA[<p>Spending on business process management (BPM) software was predicted to grow by 4.4% in 2015 to reach a...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/automating-workflow-management-and-processes-through-streamlined-data-analysis-2/">Automating workflow management and processes through streamlined data analysis</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Spending on business process management (BPM) software was predicted to grow by <a href="http://www.gartner.com/newsroom/id/3064717">4.4% in 2015 to reach a worldwide spend of $2.7 billion</a>.</p>
<p>Even with these bullish figures, Rob Dunie, Gartner Research director, still stated that,</p>
<p><em>Managing business processes effectively is a difficult challenge for today&#8217;s business leaders, because many of the systems that are used within processes are rigid and difficult to change rapidly.</em></p>
<p>He further stated that,</p>
<p><em>The ability to provide more &#8216;joined up&#8217; insight into business processes through the use of analytics — combined with support for the people involved in processes, allowing them to take advantage of this insight — is what differentiates today&#8217;s iBPMS (Intelligent Business Process Management) market from earlier BPMS technology markets.</em></p>
<p>Undeniably, any process that hinders innovation is bound to have a difficult time sticking around, let alone succeeding, in this age of social, mobile and cloud technologies.</p>
<p><strong>How can on-demand fulfillment be satiated?</strong></p>
<p>To avoid rigid and slow business processes, speed, adaptability, insightfulness and better engagement are fundamental.</p>
<p><a href="http://54.179.177.208/wp-content/uploads/2017/03/On-demand-business-framework.png"><img class="alignnone size-full wp-image-20680" alt="On-demand-business-framework" src="http://54.179.177.208/wp-content/uploads/2017/03/On-demand-business-framework.png" width="460" height="466" /></a></p>
<p><a href="http://www.businessinsider.com/the-on-demand-economy-2014-7">The on-demand business framework</a></p>
<p>Today’s reality is that advancements in and increased adoption of technologies including social, mobile, analytics and cloud <a href="http://www.slideshare.net/lovegod1/smac-and-innovation-transformation">(SMAC) which fuel innovation</a>, continue to be the reason why today’s businesses have a greater competitive advantage and therefore a better chance of success in their endeavors.</p>
<p>What’s more, when businesses (especially rigid and slow ones) don’t maximize and leverage the use of big data and analytics, they handicap themselves and forfeit a huge competitive advantage. In an on-demand economy where instant customer satisfaction is increasingly expected, instantaneous insights, through big data analytics, can make all the difference.</p>
<p>Big data analytics enhances a business’s ability to:</p>
<ul>
<li>Gain almost instantaneous insights from information.</li>
<li>Quickly and appropriately adjust business rules and processes and adapt them to changing circumstances and</li>
<li>Provide more engagement and an overall better experience for their customers whether internally or externally.</li>
</ul>
<p><strong>How does data analytics relate to workflow management?</strong></p>
<p>Workflow management or business process management (BPM) can be broadly understood as the automation of business processes, administrative tasks and the management of user interactions, with a view to improving an organization’s processing efficiency.</p>
<p>Effective automated workflows allow businesses, for example, to:</p>
<ul>
<li>Assign and apportion tasks to workers while providing the means to monitor and track the state of all assignments,</li>
<li>Send notifications when material is modified and</li>
<li>Confirm that documentation has been reviewed and approved by appropriate workers before it’s published</li>
</ul>
<p>Through SMAC technologies, business processes are transformed from being ends in themselves to being a means of providing a more sophisticated system of engagement.</p>
<p>In essence, SMAC technologies allow businesses to understand how workers connect, share and interact with data, their coworkers and their customers. Therefore, <a href="http://www.ebizq.net/blogs/bpm_theory/">data and process are interrelated and management should be about both</a>.</p>
<p>When an organization uses workflow management to understand data and its stakeholders, (including how the data is used to make informed decisions; easily, rapidly, measurably and routinely) business rules and processes can then be adapted almost instantaneously to provide better customer experiences.</p>
<p><strong>Data analytics, the cornerstone for automatic workflow management</strong></p>
<p>By analyzing collected data, rules can be developed and these rules are the basis for developing intelligent business processes that allow for better execution of interactions with customers in social, mobile and cloud environments.</p>
<p>In essence, since information is used to make decisions in workflows and processes, big data analytics is fundamental in filtering that information and adding value to it, especially at the decision steps of business processes.</p>
<p>For example, it is only when data from social media is processed through data analytics that it begins to make sense and allows business rules and processes to be executed. Without the insights from analytically supported business processes, all you have is an unfiltered, disparate data repository that can hardly be used to understand the customer or take appropriate action.</p>
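<p>To make this concrete, here is a minimal, hypothetical sketch in Python of a decision step in a workflow driven by an analytics output. The field names, thresholds and queue names are purely illustrative and are not taken from any particular BPM suite.</p>
<pre>
# Minimal sketch: an analytics-derived score feeding a decision step in a workflow.
# The field names, thresholds and queue names are illustrative only.

def route_request(request):
    """Route a customer request based on a score attached by an upstream analytics step."""
    score = request.get("sentiment_score", 0.0)   # e.g. derived from social media data
    if score &lt; -0.5:
        return "escalation_queue"                 # very unhappy customer: senior agent
    if request.get("channel") == "mobile" and score &lt; 0.0:
        return "priority_queue"                   # mobile users expect fast turnaround
    return "standard_queue"

if __name__ == "__main__":
    sample = {"customer_id": 42, "channel": "mobile", "sentiment_score": -0.3}
    print(route_request(sample))                  # prints: priority_queue
</pre>
<p>The point is not the rules themselves but where they come from: the thresholds would be derived from, and continually adjusted by, the analytics layer rather than hard-coded once and forgotten.</p>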
<p><strong>Which technologies and assets are used in workflow management? </strong></p>
<p>Although most organizations have gone digital, many workflows are still quite manual. For example, documents are still manually uploaded to the cloud and then manually attached to emails for sharing.</p>
<p>Tools like SharePoint and <a href="http://blog.templafy.com/microsofts-new-world-of-work-how-office-365-shapes-the-future-of-workstyles-in-a-world-of-mobile">Office 365 are shaping the future of workstyles in a world of mobile</a>, and are helping to bridge this gap. However, there is still a lot of room for tools to bring such technologies together for a more streamlined workflow.</p>
<p>A good workflow management suite or BPM suite can vary considerably according to the technologies and assets that a business decides to integrate. However, typically, most BPM suites are used for digital workflows, system monitoring and reporting.</p>
<p>Capturing data from social media, bar codes, digital forms on websites, emails, line-of-business software (e.g. accounting packages), IoT devices and other sources will determine the breadth and scope of the process automation.</p>
<p>All these technologies are touchpoints for how data is gathered and then analyzed in order to develop processes, adapt them and use them to gain insights to serve customers better.</p>
<p><strong>Conclusion</strong></p>
<p>When social, mobile, on premise, and cloud technologies are put through the lens of streamlined data analytics, workflows can be more easily automated. Through workflow automation, businesses can efficiently and effectively connect their workers to any job or project that needs attention, in order to deliver the best results to customers.</p>
<p>Through streamlined data analysis, better processes can be put in place to take advantage of workflows that bring all stakeholders together across different devices, diverse networks and different locations, almost instantly.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/automating-workflow-management-and-processes-through-streamlined-data-analysis-2/">Automating workflow management and processes through streamlined data analysis</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/automating-workflow-management-and-processes-through-streamlined-data-analysis-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Mining: commodity or necessity of the 21st century?</title>
		<link>http://bigdata-madesimple.com/data-mining-commodity-or-necessity-of-the-21st-century/</link>
		<comments>http://bigdata-madesimple.com/data-mining-commodity-or-necessity-of-the-21st-century/#comments</comments>
		<pubDate>Mon, 30 May 2016 12:38:04 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=18614</guid>
		<description><![CDATA[<p>Forbes once reported that even the slightest increase of investments in Big Data related projects (~ 10%) improves average...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/data-mining-commodity-or-necessity-of-the-21st-century/">Data Mining: commodity or necessity of the 21st century?</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
<content:encoded><![CDATA[<p><a href="http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/">Forbes once reported</a> that even a slight increase in investment in Big Data related projects (~10%) improves average net income by $65 million for a typical Fortune 1000 company.</p>
<p>Knowledge is power. Fortunately, the amount of power in this world is immense. All that’s left for you to do is to find efficient ways of using that power.</p>
<p>A typical business acquires lots of information and is obliged to process it correctly. Data volumes skyrocket, so more analytical power is required. You simply can&#8217;t afford to lose your grip on adequate analysis, because you’ll miss out on profits.</p>
<p>Here is some food for thought: Red Roof Inn, a well-known hotel chain, applied data mining methods to a diversified set of available content that included weather information, flight cancellations, airport locations, hotel locations and more.</p>
<p>This set of data allowed them to find clients that needed a room since their flights were canceled.</p>
<p>Usually, bad weather is a bad sign for hotels, as it reduces traveling. However, Red Roof Inn found a way to improve their income. <a href="http://www.crmsearch.com/retail-big-data.php">Reportedly, their business increased by 10% in 2014</a>! Go data!</p>
<p><b>Yes, you have to invest in Big Data, but just look at the returns!</b></p>
<p>Data mining is a term that many specialists claim is “misused”. It is often taken to mean simply gathering data, which is not entirely correct. When we say data mining, we usually mean a plethora of data curation and management processes performed to summarize the data into something meaningful.</p>
<p>In simple words: you have assorted information about your customers. They perform various actions and leave a digital trail via your technical support system or in other ways. Many believe that this information is the quintessence of Big Data.</p>
<p>Yet, sadly, all that data is just a dumpster of random facts unless it’s properly categorized.</p>
<p><b>Alas, there’s simply too much data to handle without efficient computing power.</b></p>
<p>Businesses in every sector face the same challenge:</p>
<ul>
<li>Manufacturers have to consider various machinery readings to avoid unnecessary maintenance expenses;</li>
<li>Retailers need to track purchase histories to develop new marketing methods;</li>
<li>Websites keep an eye on various sources of useful information to improve their monetization mechanisms.</li>
</ul>
<p>Don’t waste information: systemize it and see which products are more popular and what technical issues they cause.</p>
<p>The total amount of data has grown to an unimaginable scale and it doesn’t plan to stop growing. The “dumpsters” appear more often and become more complicated. Fortunately, our computing capabilities are keeping up with that growth. There are various solutions such as DMP (Data Management Platforms). The best Big Data management platforms operate with an immense amount of information coming from a wide variety of sources.</p>
<p><b>Big Data analysis: proven benefits of the smarter approach</b></p>
<p>Data mining has a set of goals, and one of them is to identify patterns in data and make those patterns visible. This method of research is helpful for clustering and systemizing purchasing behaviors. I cannot name a single industry where such information is useless. Au contraire, I believe that any company will benefit from data mining.</p>
<p>There are numerous examples illustrating how big data and proper data management immensely improved business efficiency.</p>
<p>Every single successful business nowadays uses big data analysis and <a href="http://www.gartner.com/newsroom/id/3130817">some surveys claim that more than 70% of companies are ready to invest</a> more in their data management projects!</p>
<p>In fact, there are many inspiring stories about Big Data and how it affects us. For many people, these are just vague numbers or big figures to marvel at. Smart businesses, however, look at the big picture from a different angle.</p>
<p>There are millions of stunning examples, but don’t take my word for it. Here’s a shining gem that could never have been revealed without appropriate analysis.</p>
<p>Not long ago, the Midwest Grocery network used Oracle software’s computing capabilities to get a more comprehensive picture of local buying behaviors. They identified a couple of mind-boggling details about local purchasing habits.</p>
<p>A curious discovery improved beer sales. Yeah, people love beer. What’s the big news? <b>The new beer sales pattern emerged from careful analysis of&#8230; diaper sales</b>.</p>
<p>Oracle highlighted an interesting connection between beer and diapers. Men who shopped for diapers on Thursdays and Saturdays bought some ice-cold beverages as well.</p>
<p>Deeper analysis showed that these men usually made more purchases on Saturdays and fewer on Thursdays. The Midwest Grocery chain used that information: they moved their beer displays closer to the diapers and never discounted either item on Thursdays or Saturdays.</p>
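<p>As a rough illustration of the arithmetic behind stories like this, the Python sketch below computes support, confidence and lift for a “diapers → beer” rule over a handful of made-up baskets; real retail analysis runs the same calculation over millions of transactions.</p>
<pre>
# Toy association-rule arithmetic for the diapers-and-beer pattern.
# The baskets are invented for illustration only.

transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"milk", "chips"},
    {"diapers", "milk"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"diapers", "beer"} &lt;= t)
diapers = sum(1 for t in transactions if "diapers" in t)
beer = sum(1 for t in transactions if "beer" in t)

support = both / n              # how often the pair appears together
confidence = both / diapers     # P(beer | diapers)
lift = confidence / (beer / n)  # above 1 means the pair co-occurs more often than chance

print(f"support={support:.2f}  confidence={confidence:.2f}  lift={lift:.2f}")
</pre>
<p>With these toy baskets the rule has support 0.60, confidence 0.75 and lift 1.25, so baskets containing diapers are noticeably more likely than average to also contain beer: exactly the kind of signal that justifies moving the displays together.</p>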
<p><b>Should YOU Use Data Mining?</b></p>
<p><a href="https://qarea.us/expertise/big-data">Bid Data development services</a> include a wide array of offerings. Most of them relate to analysis and curation of various information types. Or, at least the ones that actually affect your business do.</p>
<p>Companies that do not use data for improvement lag behind their competitors. Often, their business model stagnates and quickly becomes less profitable than it could be.</p>
<p>Now it’s time to talk about the reality of data mining instead of wannabe, “good enough” fiction. Specialists have named five of its core elements:</p>
<ul>
<li>Storage. You need to gather, modify, and transmit data from various sources to one central location in order to work with it.</li>
<li>Sorting. The information in our storage must be properly sorted and systemized within multilayered databases.</li>
<li>Assessment. Raw data should be accessible to both technologists and business analysts.</li>
<li>Computing Tools. The information is still assorted and you will use it inefficiently without comprehensive methods of analysis. This is where smart businesses use the best big data management platforms and contextualize their data.</li>
<li>Representation. We want our information to be clean and presented in a usable format like tables and graphs.</li>
</ul>
<p><i>Pro tip: <a href="https://qarea.us/articles/15-things-about-big-data">Great specialists</a> think that business understanding is also a necessary part of data mining.</i></p>
<p>As you see, data goes a long way before it can be transformed into something useful.</p>
<p>Just think about your own business. If you are not using the latest technological solutions, you quickly become irrelevant.</p>
<p>Data mining is not only a time-tested method in business; it is also improving constantly. It helps you understand how seemingly irrelevant information can directly affect your income. With proper data management, you will be able to see the context of your customers’ actions!</p>
<p>This will help you to develop a better marketing strategy!</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/data-mining-commodity-or-necessity-of-the-21st-century/">Data Mining: commodity or necessity of the 21st century?</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/data-mining-commodity-or-necessity-of-the-21st-century/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Data Mining tips for financial analysis of the existing business</title>
		<link>http://bigdata-madesimple.com/data-mining-tips-for-financial-analysis-of-the-existing-business/</link>
		<comments>http://bigdata-madesimple.com/data-mining-tips-for-financial-analysis-of-the-existing-business/#comments</comments>
		<pubDate>Fri, 27 May 2016 11:27:25 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=18597</guid>
		<description><![CDATA[<p>Data mining drills the static data deeper and examines the historic business activities. Ad hoc reporting spotlights analysis...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/data-mining-tips-for-financial-analysis-of-the-existing-business/">Data Mining tips for financial analysis of the existing business</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
<content:encoded><![CDATA[<p>Data mining drills deeper into static data and examines historical business activities, while ad hoc reporting spotlights the analysis of both. In this way, patterns and trends are tracked. Mining software then applies algorithms to the results, so that previously unknown business strategies are identified and a clearer picture of business intelligence emerges.</p>
<p>Consider these examples. Mined data assists in discovering <a href="http://10ecommercetrends.com/" target="_blank">ecommerce trends</a>, avoiding customer attrition and attracting loyal customers. Tracking patterns reveals complexities in manufacturing and profiles the audience accurately.</p>
<p>Likewise, an intensive study of the collected data can help in understanding the health of a business: comprehending the financial status of any commercial entity quickly indicates whether it will be profitable or not. Take a roundup of any company’s financial activities through these data mining strategies:</p>
<ol>
<li><b>Inventory check:</b> Inventory stands for stock, and it mirrors the true state of a product-based company. Thoroughly check the entire stock: it may be obsolete, in which case the investor must bear the cost of storing it, and investing in such a business means being stranded in a financial crisis. A stockpiled inventory does not always paint a rosy picture; it may instead reflect unsatisfied customers whose orders are lagging. If the company is service based, examine the invoices instead: they present a crystal-clear picture of the company’s health.</li>
<li><strong>Dive into the receivables:</strong> Before inking the deal, opt for data mining services. Receivables account for business growth, so check the accounts receivable turnover, credit policies and the history of loans and cash. <a href="http://www.eminenture.com/blog/how-ad-hoc-analysis-of-data-mining-helps-in-business-intelligence/" target="_blank">Ad hoc analysis of data mining</a> sheds light on this aspect, and hence the company’s past income can be estimated.</li>
<li><strong>Net income:</strong> Examine the ratio of gross profit to net sales, which indicates the company’s earning power. The study is not complete until the ratio of net income to net worth is also understood. During this examination, prospective interest appreciation, total purchase price and other similar factors are considered, and the whole study concludes with a productive or non-productive verdict via ROI.</li>
<li><strong>Working capital:</strong> Deducting current liabilities from current assets gives working capital, the fuel that keeps a business running. Work out how the working capital is being utilized: if current liabilities exceed current assets, the business can face bankruptcy. Working capital therefore shows how efficiently the company performs and how much it gains in the short term, and forecasts built on mined data help identify the path to operational efficiency. (A small worked sketch of these figures follows this list.)</li>
<li><strong>Learn about sales:</strong> Sales represent profit-earning capacity; they generate revenue, and consistent growth in sales converts into profit. Study the sales in detail and identify the reason behind any increase; it can be either higher volume or higher prices. Finally, don’t forget to examine the market as well: a mature market results in static sales, so keep that fact in mind. Keenly observe the prices of substitute and complementary goods, since a rise in their prices can help in projecting million-dollar marketing strategies.</li>
</ol>
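<p>For readers who prefer numbers to prose, here is the small worked sketch promised above, in Python. The figures are invented purely for illustration.</p>
<pre>
# Worked sketch of the working capital and ratio checks above, with made-up figures.

current_assets = 250_000.0
current_liabilities = 180_000.0
gross_profit = 120_000.0
net_sales = 400_000.0
net_income = 45_000.0
net_worth = 300_000.0

working_capital = current_assets - current_liabilities   # capital for day-to-day operations
gross_margin = gross_profit / net_sales                  # ratio of gross profit to net sales
return_on_net_worth = net_income / net_worth             # ratio of net income to net worth

print(f"Working capital:     {working_capital:,.0f}")
print(f"Gross margin:        {gross_margin:.1%}")
print(f"Return on net worth: {return_on_net_worth:.1%}")

if working_capital &lt; 0:
    print("Warning: current liabilities exceed current assets - possible bankruptcy risk.")
</pre>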
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/data-mining-tips-for-financial-analysis-of-the-existing-business/">Data Mining tips for financial analysis of the existing business</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/data-mining-tips-for-financial-analysis-of-the-existing-business/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to tell if correlation implies causation</title>
		<link>http://bigdata-madesimple.com/how-to-tell-if-correlation-implies-causation/</link>
		<comments>http://bigdata-madesimple.com/how-to-tell-if-correlation-implies-causation/#comments</comments>
		<pubDate>Tue, 10 May 2016 11:14:22 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=18468</guid>
		<description><![CDATA[<p>You’ve probably heard the admonition: Correlation Does Not Imply Causation. Everyone agrees that correlation is not the same...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/how-to-tell-if-correlation-implies-causation/">How to tell if correlation implies causation</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>You’ve probably heard the admonition:</p>
<p><a href="http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation">Correlation Does Not Imply Causation</a>.</p>
<p>Everyone agrees that correlation is not the same as causation. However, those two words — correlation and causation — have generated quite a bit of discussion.</p>
<p><strong>Why Causality Matters</strong></p>
<p>No one gets perturbed if you say two conditions or events are correlated, but even suggest that causation is possible and you’ll get the clichéd admonition, perhaps with even harsher criticism. It’s not easy to prove causality, though, so there must be a reason for putting in the effort. For example, if you can figure out what causes a condition or event, you can:</p>
<ul>
<li><em>Promote</em> the relationship to reap benefits, such as between agricultural methods and crop production or pharmaceuticals and recovery from illnesses.</li>
<li><em>Prevent</em> the cause to avoid harmful consequences, such as airline crashes and manufacturing defects.</li>
<li><em>Prepare</em> for unavoidable harmful consequences, such as natural disasters, like floods.</li>
<li><em>Prosecute</em> the perpetrator of the cause, as in law, or lay blame, as in politics.</li>
<li><em>Pontificate</em> about what might happen in the future if the same relationship occurs, such as in economics.</li>
<li><em>Probe</em> for knowledge based on nothing more than curiosity, such as how cats purr.</li>
</ul>
<p>So how can you tell if correlation does in fact imply causation?</p>
<p><img class="aligncenter size-full wp-image-18470" alt="correlation" src="http://bigdata-madesimple.com/wp-content/uploads/2016/05/correlation.png" width="459" height="185" /></p>
<p><strong>Criteria for Causality</strong></p>
<p>Sometimes it’s next to impossible to convince skeptics of a causal relationship. Sometimes it’s even tough to convince your supporters. Developing criteria for causality has been a topic of concern in medicine for centuries. Several sets of criteria have been proffered over those years, the most widely cited of which are the criteria described in 1965 by Austin Bradford Hill, a British medical statistician. <a href="http://www.drabruzzi.com/hills_criteria_of_causation.htm">Hill’s criteria for causation</a> specify the minimal conditions necessary to accept the likelihood of a causal relationship between two measures as:</p>
<ol>
<li><em><b>Strength</b></em>: A relationship is more likely to be causal if the correlation coefficient is large and statistically significant.</li>
<li><em><b>Consistency</b></em>: A relationship is more likely to be causal if it can be replicated.</li>
<li><em><b>Specificity</b></em>: A relationship is more likely to be causal if there is no other likely explanation.</li>
<li><em><b>Temporality</b></em>: A relationship is more likely to be causal if the effect always occurs after the cause.</li>
<li><em><b>Gradient</b></em>: A relationship is more likely to be causal if a greater exposure to the suspected cause leads to a greater effect.</li>
<li><em><b>Plausibility</b></em>: A relationship is more likely to be causal if there is a plausible mechanism between the cause and the effect.</li>
<li><em><b>Coherence</b></em>: A relationship is more likely to be causal if it is compatible with related facts and theories.</li>
<li><em><b>Experiment</b></em>: A relationship is more likely to be causal if it can be verified experimentally.</li>
<li><em><b>Analogy</b></em>: A relationship is more likely to be causal if there are proven relationships between similar causes and effects.</li>
</ol>
<p>These criteria are sound principles for establishing whether some condition or event causes another condition or event. No individual criterion is foolproof, however. That’s why it’s important to meet as many of the criteria as is possible. Still, sometimes causality is unprovable.</p>
<h2>Three Steps to Decide if Correlation Implies Causation</h2>
<p>Hill’s criteria can be thought of as aspects of the process of <a href="https://statswithcats.wordpress.com/2012/07/14/the-best-super-power-of-all/">critical thinking</a> or considerations in the <a href="http://en.wikipedia.org/wiki/Scientific_method">scientific method</a> or a <a href="https://statswithcats.wordpress.com/2010/08/08/the-zen-of-modeling/">model</a> for deciding if a relationship involves causation. The criteria don’t all have to be met to suggest causality and some may not even be possible to meet in every case. The important point is to consider the criteria in a careful and unbiased process.</p>
<p><strong>Step 1 — Check the Metrics</strong></p>
<p>The admonition that <em>correlation does not imply causation</em> is used to remind everyone that a correlation coefficient may actually be characterizing a non-causal <a href="https://statswithcats.wordpress.com/2014/12/26/types-and-patterns-of-data-relationships/">influence or association</a> rather than a causal relationship. A large correlation coefficient does not necessarily indicate that a relationship is causal. On the other hand, saying that correlation is a <a href="http://en.wikipedia.org/wiki/Necessity_and_sufficiency">necessary but not sufficient</a> condition for causality, or in other words, causation cannot occur without correlation, is also not necessarily true. There are quite a few reasons for a <a href="https://statswithcats.wordpress.com/2014/11/02/why-you-dont-always-get-the-correlation-you-expect/">lack of correlation</a>.</p>
<p>So, before you get too excited about some causal relationship, make sure the correlation is statistically legitimate. You can’t assess the relationship’s <em>gradient</em> (i.e., the sign of the correlation coefficient) and <em>strength</em> (i.e., the value of the correlation coefficient) if the correlation is erroneous. Make sure to:</p>
<ul>
<li>Use metrics (variables) that are appropriate for quantifying the relationship. For example, don’t use an index that is a ratio of the other metric in the relationship.</li>
<li>Use an appropriate <a href="https://statswithcats.files.wordpress.com/2010/11/types-of-correlations.jpg">correlation coefficient</a> based on the scales of the relationship metrics.</li>
<li>Confirm that the samples are representative of the population being analyzed and that the relationship is linear (or you are using non-linear methods for analysis).</li>
<li>Make sure that there are no outliers or excessive uncontrolled variance.</li>
</ul>
<p>The gradient of most causal relationships is positive. Inverse relationships will have a negative gradient. The strength of causal relationships could be almost anything; it depends on what you expect. If you don’t know what to expect, look at the square of the correlation coefficient, called the coefficient of determination, R-square, or R<sup>2</sup>. R-square is an estimate of the proportion of variance shared by two variables. It is used commonly to interpret the <em>strength</em> of the relationship between variables. Be aware, though, that even causal relationships may show <a href="https://statswithcats.wordpress.com/2014/11/02/why-you-dont-always-get-the-correlation-you-expect/">smaller than expected correlations</a>.</p>
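<p>As a minimal sketch of this first step, the snippet below uses NumPy and SciPy on synthetic data to compute the Pearson correlation coefficient, its p-value and R-square; substitute your own paired measurements.</p>
<pre>
# Minimal sketch: check the gradient and strength of a correlation before arguing causation.
# The data is synthetic; replace it with your own paired measurements.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
exposure = rng.uniform(0, 10, size=50)                # the suspected cause
effect = 2.0 * exposure + rng.normal(0, 3, size=50)   # the measured effect, plus noise

r, p_value = stats.pearsonr(exposure, effect)         # gradient (sign) and strength (magnitude)
r_squared = r ** 2                                    # proportion of variance shared

print(f"r = {r:.2f}, p = {p_value:.4f}, R-square = {r_squared:.2f}")
</pre>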
<p><strong>Step 2 — Explain the Relationship</strong></p>
<p>If you are comfortable with the <em>gradient</em> and <em>strength</em> of the correlation coefficient, the next step is to define the pattern of the relationship. The correlation may not be of any help in exploring the pattern of the relationship because data plots for different patterns can look similar. Nonetheless, there’s no sense expending more effort if the correlation is in any manner suspect.</p>
<p>First, check for <em>temporality</em> in the data. If the cause doesn’t always precede the effect then either the relationship is a <a href="https://statswithcats.wordpress.com/2014/11/02/why-you-dont-always-get-the-correlation-you-expect/">feedback relationship</a> or is not causal. If cause and effect are not measured simultaneously, <em>temporality </em>may be obscured.</p>
<p>Next, try to determine what pattern of relationship is likely. This is not easy but it’s also not a permanent determination. If you are uncertain, start with either a direct or an inverse relationship, which can be determined from data plots. Then as you study the relationship further, you can assess whether the relationship may be based on feedback, common-source, mediation, stimulation, suppression, threshold, or multiple complexities.</p>
<p>Consider your relationship in terms of Hill’s criteria of <em>Plausibility</em>, <em>Coherence</em>, <em>Analogy</em>, and <em>Specificity</em>. <em>Plausibility</em> and <em>Coherence</em> are perhaps the easiest of the criteria to meet because it is all too easy to rationalize explanations for observed phenomena. They may also rely on <em>related facts and theories</em> that can change over time. <em>Analogy</em> is a bit more difficult to meet, but not impossible for a fertile mind. However, analogous relationships may appear to be similar yet be attributable to very different underlying mechanisms. Narrow-minded people rely on <em>Specificity</em> in their arguments. Then again, a relationship may have no other likely explanation simply because the phenomenon is not well understood.</p>
<p><strong>Step 3 — Validate the Explanation</strong></p>
<p>Perhaps the most important of Hill’s criteria are <em>Experiment</em> and <em>Consistency</em>. If you’re serious about proving there is a causal relationship between two conditions or events, you have to verify the relationship using an effective research design. Such an experiment usually requires a model of the relationship, a testable hypothesis based on the model, incorporation of variance control measures, collection of suitable metrics for the relationship, and an appropriate analysis. An appropriate analysis may be statistical (using multiple samples from a well-defined population and analyses like ANOVA to assess effects) or deterministic (using a representative example of a component of the relationship to demonstrate the effect). If the experiment verifies the relationship, especially if it can be consistently replicated by independent parties, there will be solid proof of causality and any spurious relationships will be disproved. The two problems are that this validation can involve considerable effort and that not every relationship can be verified experimentally.</p>
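<p>To give a flavour of the statistical route, here is a hedged sketch of a one-way ANOVA in Python using SciPy; the three groups of measurements are invented and stand in for the outcome of a designed experiment.</p>
<pre>
# Rough sketch: a one-way ANOVA comparing an outcome across experimental groups.
# The group values are invented for illustration.

from scipy import stats

control     = [4.1, 3.8, 4.5, 4.0, 3.9]
treatment_a = [5.2, 5.6, 4.9, 5.4, 5.1]
treatment_b = [4.3, 4.0, 4.6, 4.2, 4.4]

f_stat, p_value = stats.f_oneway(control, treatment_a, treatment_b)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")   # a small p suggests at least one group mean differs
</pre>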
<p><img class="aligncenter size-full wp-image-18473" alt="lewontin-quote" src="http://bigdata-madesimple.com/wp-content/uploads/2016/05/lewontin-quote.png" width="300" height="207" /></p>
<p>There are two types of research studies — <a href="http://en.wikipedia.org/wiki/Experiment">experimental</a> and <a href="http://en.wikipedia.org/wiki/Observational_study">observational</a>. In an experimental study, researchers decide what conditions the subjects (the entities being experimented on) will be exposed to and then measure variables of interest. In an observational study, researchers observe subjects that possess the conditions being assessed and then measure variables of interest. Both types of study have their challenges. Researchers may not be able to manipulate the conditions under study in an experiment because of cost, logistical, or ethical issues. Observational studies may be subject to <a href="http://en.wikipedia.org/wiki/Confounding">confounding</a>, conditions that interfere with the interpretation of results. Consequently, verifying that a relationship is causal is often easier said than done.</p>
<p><strong>Implying Causality</strong></p>
<p>Hill’s criteria were developed for medicine. Medical research may start with anecdotal observations and progress to statistical observations of occurrence. Add demographics, and patterns of occurrence may become apparent. The patterns are then assessed to look for coherent, plausible explanations and analogues. Some medical hypotheses can be tested and analyzed statistically; pharmaceutical effectiveness is an example. Psychological and agricultural relationships can often be tested. Other relationships can’t be manipulated, so they must be analyzed based on observations; epidemiological studies are examples. Without being able to rely on the <em>Experiment</em> and <em>Consistency</em> criteria, causality can only be argued using the weaker <em>Plausibility</em>, <em>Coherence</em>, <em>Analogy</em>, and <em>Specificity</em> criteria. This is also true of natural phenomena, like landslides and earthquakes. Some conditions are unique, or the underlying knowledge base is insufficient to explain the phenomenon convincingly, so even the <em>Plausibility</em>, <em>Coherence</em>, <em>Analogy</em>, and <em>Specificity</em> criteria aren’t useful. Economic and political relationships often fall into this category.</p>
<p>So, if you hear someone claim that a relationship is causal, consider how Hill’s criteria might apply before you believe the assertion.</p>
<p><img class="aligncenter size-full wp-image-18474" alt="Correlation and causation" src="http://bigdata-madesimple.com/wp-content/uploads/2016/05/Correlation-and-causation.gif" width="640" height="199" /></p>
<p>Originally appeared on <a href="https://statswithcats.wordpress.com/2015/01/01/how-to-tell-if-correlation-implies-causation/" target="_blank">Stats with cats</a>.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/how-to-tell-if-correlation-implies-causation/">How to tell if correlation implies causation</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/how-to-tell-if-correlation-implies-causation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Defining your data quality problems</title>
		<link>http://bigdata-madesimple.com/defining-data-quality-problems/</link>
		<comments>http://bigdata-madesimple.com/defining-data-quality-problems/#comments</comments>
		<pubDate>Wed, 04 May 2016 13:41:20 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=18428</guid>
		<description><![CDATA[<p>To tackle any problem in a systematic and effective way, you must be able to break it down...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/defining-data-quality-problems/">Defining your data quality problems</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>To tackle any problem in a systematic and effective way, you must be able to break it down into parts. After all, understanding the problem is the first step to finding the solution.  From there, you can develop a strategic battle plan. With data quality, the same applies: every initiative features many stages and many different angles of attack.</p>
<p>When starting a data quality improvement program, it’s not enough to count the number of records that are incorrect, or duplicated, in your database. Quantity only goes so far. You also need to know what <em>kind</em> of errors exist in order to allocate the right resources.</p>
<p>In an interesting blog post, Jim Barker breaks data quality problems down into two types. In this article, we’ll look closely at defining these ‘types’ and at how we can use the distinction to our advantage when developing a budget.</p>
<p><b>Types of Data</b></p>
<p>Jim Barker – known as ‘Dr Data’ to some – has borrowed a simple medical concept to define data quality problems. <a href="http://drdata16.com/2015/07/13/type-i-and-type-ii-data-quality/">His blog explains</a> just how these two types fit together, and will be of interest to anyone who has struggled to find the data quality gremlins in their machine.</p>
<p>On the one hand, there’s the Type I data quality problem: things we can detect using automated tools. On the other hand, Type II is more enigmatic. You know the data quality problem is there, but it’s more difficult to detect and deal with, because it needs to be contextualised to be detected.</p>
<p>The key differences can be simply and quickly defined:</p>
<ul>
<li>Type I data quality problems require “know what” to identify: completeness, consistency, uniqueness and validity. These <a href="https://www.dqglobal.com/products/">attributes can be picked up using data quality software</a>, or even manually. You don’t need a lot of background knowledge, or a track record of working with that data. It’s there, it’s wrong and you can track it down. For example, if we insert a 3 into a gender field, we can be sure that it is not a valid entry (see the small sketch after this list).</li>
</ul>
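<p>As a minimal sketch of how a Type I check can be automated (the records, field names and rules below are hypothetical), a few lines of Python are enough to flag invalid, missing and duplicate values:</p>
<pre>
# Minimal sketch of an automated Type I check: validity, completeness and uniqueness
# can be tested without any business context. Records and rules are hypothetical.

records = [
    {"id": 1, "gender": "F", "email": "a@example.com"},
    {"id": 2, "gender": "3", "email": ""},                # invalid gender, missing email
    {"id": 2, "gender": "M", "email": "b@example.com"},   # duplicate id
]

VALID_GENDERS = {"M", "F"}
seen_ids = set()

for rec in records:
    problems = []
    if rec["gender"] not in VALID_GENDERS:
        problems.append("invalid gender")
    if not rec["email"]:
        problems.append("missing email")
    if rec["id"] in seen_ids:
        problems.append("duplicate id")
    seen_ids.add(rec["id"])
    if problems:
        print(f"record {rec['id']}: {', '.join(problems)}")
</pre>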
<ul>
<li>Type II data quality problems require “know how” to detect: timeliness, congruence and accuracy. They call for research, insight and experience, and are not as simple or straightforward to detect. These datasets may appear free of problems, at least on the surface. The devil is in the detail, and it takes time to correct. Jim’s example is an employee record for someone who has retired. Without knowing the date of retirement, their data would otherwise appear to be correct.</li>
</ul>
<p>The key takeaway is that data quality problems require a complex, strategic approach that is not uniform across a database. Once we break the data down, we start to see that it requires human <em>and</em> automated intervention – a dual attack.</p>
<p><b>Cost to Fix</b></p>
<p>So, how do we deal with Type I and Type II data quality problems? Are the costs comparable, or are they different beasts entirely?</p>
<p>The important thing to remember is that a Type I data validation or verification problem can be logically defined, and that means we can write software to find it and display it. Automated fixes are fast, inexpensive and can be completed with only occasional manual review. Think of Type I data quality problems as form field validation. Once valid, the problem disappears.</p>
<p>We could estimate that Type I data represents 80 per cent of our data quality problems, yet consumes only 20 per cent of our budget.</p>
<p>Type II data needs the input of multiple parties so that it can be discovered, flagged and eradicated. While every person in our CRM may have a date of purchase, that purchase date may be incorrect or may not tally with an invoice or shipping manifest. Only specialists will be able to weed out problems and manually improve the CRM by carefully verifying its contents.</p>
<p>Often, businesses find it difficult to allocate the necessary resource – particularly if they have grown rapidly, or have high employee churn. While these Type II problems are fewer – perhaps the remaining 20 per cent of the database – they could require 80 per cent of our data quality budget, or more. If you continually lose staff who have that knowledge, and you fail to retain any of it over time, you will find Type II data much more difficult to deal with because the human detection element is lost.</p>
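<p>A Type II congruence check, by contrast, needs a second source of truth. The hedged pandas sketch below (with hypothetical column names) cross-references CRM purchase dates against invoice dates and flags the rows that need an expert’s judgement:</p>
<pre>
# Sketch of a Type II congruence check: each date looks valid on its own, so only
# cross-referencing against invoices exposes the problem. Column names are hypothetical.

import pandas as pd

crm = pd.DataFrame({
    "order_id": [101, 102, 103],
    "purchase_date": pd.to_datetime(["2016-04-01", "2016-04-03", "2016-04-07"]),
})
invoices = pd.DataFrame({
    "order_id": [101, 102, 103],
    "invoice_date": pd.to_datetime(["2016-04-01", "2016-04-12", "2016-04-07"]),
})

merged = crm.merge(invoices, on="order_id")
merged["mismatch"] = merged["purchase_date"] != merged["invoice_date"]
print(merged[merged["mismatch"]])   # order 102 needs an expert to decide which date is right
</pre>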
<p><b>Improving Accuracy</b></p>
<p>In order to improve data accuracy, we must work on Type I and Type II data as separate, but conjoined, problems. Fixing Type I data quality challenges can present quick wins, but Type II presents a challenge that human expertise can solve.</p>
<p>Over time, a <a href="https://www.dqglobal.com/the-longer-you-delay-the-more-the-data-decay/">database will always drift out of date</a>, and this requires ongoing and sustained effort. Data can be cleansed in situ, or validated at the point of entry, but Type I errors will still occur for a number of reasons: import/export, corruption, manual edits, human error. Type II data problems will occur naturally, of their own accord; data that validates and looks correct may now be incorrect, simply because someone’s circumstances have changed.</p>
<p><b>Ensuring Data Integrity</b></p>
<p>Data informs business decisions and helps us get a clear picture of the world. Detecting Type I data quality problems is simple, inexpensive and quick. If your business has not yet adopted some kind of data quality software, there’s no doubt that it should be implemented to avoid waste, brand damage and inaccuracy.</p>
<p>As for Type II, the key is to understand that it exists and to implement new processes to prevent it from occurring. Workarounds and employee deviations from business processes will drag the data down. A failure to allocate subject matter experts could increase the amount of Type II data over time. And as that proportion increases, so does the price of fixing it, because you need expert eyes on the data to weed it out. See the <a href="https://www.dqglobal.com/why-data-should-be-a-business-asset-the-1-10-100-rule/">1:10:100 Rule</a> article.</p>
<p>Detecting and eradicating both types of problem is not impossible. One is easier than the other. Data quality vendors are continually looking at new ways to make high quality data simpler to achieve.</p>
<p>Originally appeared on <a href="https://www.dqglobal.com/2015/08/05/defining-your-data-quality-problems/" target="_blank">DqGlobal</a>.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/defining-data-quality-problems/">Defining your data quality problems</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/defining-data-quality-problems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to write data analysis reports. Lesson 3—know your route.</title>
		<link>http://bigdata-madesimple.com/write-data-analysis-reports-lesson-3-know-your-route/</link>
		<comments>http://bigdata-madesimple.com/write-data-analysis-reports-lesson-3-know-your-route/#comments</comments>
		<pubDate>Mon, 21 Mar 2016 05:19:16 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=17773</guid>
		<description><![CDATA[<p>You’ve been taught since high school to start with an outline. Nothing has changed with that. However, there are...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/write-data-analysis-reports-lesson-3-know-your-route/">How to write data analysis reports. Lesson 3—know your route.</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p style="text-align: center;"><img class="aligncenter size-full wp-image-17778" alt="route-cat-in-a-maze" src="http://bigdata-madesimple.com/wp-content/uploads/2016/03/route-cat-in-a-maze.jpg" width="242" height="299" /></p>
<p>You’ve been taught since high school to start with an outline. Nothing has changed with that. However, there are many possible outlines you can follow depending on your audience and what they expect. The first thing you have to decide is what the packaged report will look like.</p>
<p>Will your report be an executive brief (not to be confused with a legal brief), a letter report, a summary report, a comprehensive report, an Internet article or blog, a professional journal article, or a white paper, to name a few? Each has its own type of audience, content, and writing style. Here’s a summary of the differences.</p>
<p><img class="aligncenter size-full wp-image-17776" alt="report-table" src="http://bigdata-madesimple.com/wp-content/uploads/2016/03/report-table.png" width="640" height="362" /></p>
<p>Writing a report is like taking a trip. The <i>message</i> is the asset you want to deliver to the ultimate destination, the <i>audience</i>. The <i>package</i> is the vehicle that holds the <i>message</i>. Now you need a map for how to reach your destination. That’s the <i>outline</i>.</p>
<p>Just as there are several possible routes you could take with a map, there are several possible outline strategies you could use to write your report. Here are six.</p>
<ul>
<li><b><i>The Whatever-Feels-Right Approach</i></b>. This is what inexperienced report writers do when they have no guidelines. They do what they might have done in college or just make it up as they go along. This might work out just fine or be as confusing as <a href="http://en.wikipedia.org/wiki/Maury_(TV_series)#Paternity_tests">The Maury Show</a> on Father’s Day. Considering that the report involves statistics, you can guess which it would be.</li>
<li><b><i>The Historical Approach</i></b>. This is another approach that inexperienced report writers use. They do what was done the last time a similar report was produced. This also might work out fine. Then again, the last report may have been a failure, ineffective in communicating its message.</li>
<li><b><i>The “Standard” Approach</i></b>. Sometimes companies or organizations have standard guidelines for all their reports, even requiring the completion of a formal review process before the report is released. Many academic and professional journals use such a prescriptive approach. The results may or may not be good, but at least they look like all the other reports.</li>
<li><b><i>The Military Approach</i></b>. You tell ‘em what you’re going to tell ‘em, you tell ‘em, and then you tell ‘em what you told ‘em. The military approach may be redundant and boring, but some professions live by it. It works well if you have a critical message that can get lost in details.</li>
<li><b><i>The Follow-the-Data Approach</i></b>. If you have a very structured data analysis it can be advantageous to report on each piece of data in sequence. Surveys often fall into this category. This approach makes it easy to write the report because sections can be segregated and doled out to other people to write, before being reassembled in the original order. The disadvantage is that there usually is no overall synthesis of the results. Readers are left on their own to figure out what it all means.</li>
</ul>
<p><img class="aligncenter size-full wp-image-17777" alt="cat-on-a-map" src="http://bigdata-madesimple.com/wp-content/uploads/2016/03/cat-on-a-map.jpg" width="300" height="179" /></p>
<ul>
<li><b><i>The Tell-a-Story Approach</i></b>. This approach assumes that reading a statistical report shouldn’t be as monotonous as mowing the lawn. Instead, you should pique the reader’s curiosity by exposing the findings like a murder mystery, piece by piece, so that everything fits together when you announce the conclusion. This is almost the opposite of the follow-the-data approach. In the tell-a-story approach, the report starts with the simplest data analyses and builds, section by section, to the great climax—the message of the analysis. Analyses that are not relevant to the message are omitted. There are usually arcs, in which a previously introduced analytical result is reiterated in subsequent sections to show how it supports the story line. <a href="https://statswithcats.wordpress.com/2012/08/18/the-foundation-of-professional-graphs/">Graphics </a>are critical in this approach; outlines are more like storyboards. There may be the equivalent of one page of graphics for every page of text. Telling a story usually takes longer to write than the other approaches but the results are more memorable if your audience has the patience to read everything (i.e., don’t try to tell a story to a Bypasser.)</li>
</ul>
<p>So, be sure that you have an appropriate outline but don’t let it constrain you. Having a map doesn’t mean you can’t change your route along the way, you just need to get to the destination. In building the outline, try to balance sections so the reader has periodic resting points. Within each section, though, make the lengths of subsections correspond to their importance.</p>
<p>Originally appeared on <a href="https://statswithcats.wordpress.com/2013/09/21/how-to-write-data-analysis-reports-lesson-3/" target="_blank">Stats with Cats</a>.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/write-data-analysis-reports-lesson-3-know-your-route/">How to write data analysis reports. Lesson 3—know your route.</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/write-data-analysis-reports-lesson-3-know-your-route/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to write data analysis reports. Lesson -2 know your audience.</title>
		<link>http://bigdata-madesimple.com/write-data-analysis-reports-lesson-2-know-audience/</link>
		<comments>http://bigdata-madesimple.com/write-data-analysis-reports-lesson-2-know-audience/#comments</comments>
		<pubDate>Fri, 11 Mar 2016 05:10:59 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=17692</guid>
		<description><![CDATA[<p>Every self-help article about technical writing starts by telling readers to consider their audience. Even so, probably few...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/write-data-analysis-reports-lesson-2-know-audience/">How to write data analysis reports. Lesson -2 know your audience.</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Every self-help article about technical writing starts by telling readers to consider their audience. Even so, probably few report writers do.</p>
<p>In a statistical analysis, you usually start by considering the characteristics of the population about which you want to make inferences. Similarly, when you begin to write a report on an analysis, you usually start by considering the characteristics of the audience with which you want to communicate. You have to think about the <i>who</i>, <i>what</i>, <i>why</i>, <i>where</i>, <i>when</i>, and <i>how</i> of the key people who will be reading your report. Here are some things to consider about your audience.</p>
<p><strong>Who<br />
</strong>Audience is often defined by the role a reader plays relative to the report. Some readers will use the report to make decisions. Some will learn new information from the report. Others will critique the report in terms of what they already know. Thus, the audience for a statistical report is often defined as decision makers, stakeholders, reviewers, or generally interested individuals.</p>
<p>Some reports are read by only a single individual, but most are read by many. All kinds of people may read your report. As a consequence, there can be primary, secondary, and even more levels of audience participation. This is problematic; <a href="https://statswithcats.wordpress.com/2010/10/24/tales-of-the-unprojected/">you can’t please everyone</a>. So in defining your audience, focus first on the most important people to receive your message and second on the largest group of people in the audience.</p>
<p><strong>What<br />
</strong>Once you define who you are targeting with your report, you should try to understand their characteristics. Perhaps the most important audience characteristic for a technical report writer is the audience’s understanding of both the subject matter of the report and the statistical techniques being described. You may not be able to do much about their subject matter knowledge but you can adjust how you present statistical information. For example, audiences a data analyst might encounter include:</p>
<ul>
<li><b><i>Mathphobes</i></b>. Fear numbers but may listen to concepts. Don’t use any statistical <a href="https://statswithcats.wordpress.com/2010/07/03/it%E2%80%99s-all-greek/">jargon</a>. Don’t show formulas. Use numbers sparingly. For example, substitute “about half” for any percentage around 50%. The extra precision won’t be important to a Mathphobe.</li>
<li><b><i>Bypassers</i></b>. Understand some but have little interest. Don’t worry about Bypassers; they won’t read past the summary. Be sure to make the summary pithy and highlight the most important finding, otherwise they might key on something relatively inconsequential.</li>
<li><b><i>Tourists</i></b><i>.</i> Understand some and are interested. Be gentle. Use only essential jargon that you define clearly. Using numbers is fine just don’t use too many in a single table. Round off values so you’re not implying false precision. Stick with nothing more sophisticated than pie charts, bar graphs, and maybe an occasional scatter chart. Don’t use any formulas.</li>
<li><b><i>Hot Dogs</i></b>. Know less than they think and want to show it. Using jargon is fine so long as you define what you mean. Even a Hot Dog may learn something. In the same vein, using numbers, statistical graphics, and formulas is fine so long as you clearly explain their meanings. Hot Dogs may come to erroneous conclusions if not guided.</li>
<li><b><i>Associates</i></b><i>.</i> Other analysts who understand the basic jargon. Anything is fine so long as you clearly explain what you mean.</li>
<li><b><i>Peers</i></b>. Other data analysts who understand all the jargon. Anything goes.</li>
</ul>
<p>These audience characteristics provide guidance for report length, tone, and writing style.</p>
<p><b>Why<br />
</b>Are readers likely to be very interested in your report or just curious about it (if they have no interest, they won’t be readers)? Be honest with yourself. Why would anyone be interested in reading your report? What is the <a href="https://statswithcats.wordpress.com/2010/10/10/perspectives-on-objectives/">objective </a>of the <i>who</i> you defined as your audience? What will they do with your findings? Will they get informed? Will they make a decision or take an action? Is this a big thing for them or just something they have to tune in to?</p>
<p><strong>Where<br />
</strong>Is the report aimed at a finite, confined group, like the organization the analysis was conducted for, or will anyone be able to read it? Is the report aimed at the upper levels of the organization or the rank-and-file (i.e., bottom up or top down)? Are there any concerns for security or confidentiality, either on the individual or organizational levels?</p>
<p><strong>When<br />
</strong>When does the population need to see your report? Who has to review the report and how long might they take before the report is released? How firm are the deadlines? How much time does this leave you to write the report? Will there be enough time to think through what you need to write? Will there be time to conduct additional analyses needed to fill in gaps in the report outline? Will you be outraged when the time taken to review your report is twice as long as the time you took to write it?</p>
<p>Here’s some advice you should take to heart. Never, never, never submit a draft report for review that isn’t your fully complete, edited, masterpiece. I tell myself to follow this rule with every report I write. Unfortunately, like most people, I don’t listen to what I say.</p>
<p><strong>How<br />
</strong>Finally, consider how the report should be presented so that the audience will get the most out of it. Here are five considerations:</p>
<p><b><i>Package</i></b><i>.</i> How will your writing be <i>packaged</i> (i.e., assembled into a product for distribution)? Will it be a short letter report, a comprehensive report, a blog or an Internet article, a professional journal article, a white paper, or will your writing be included as part of another document?</p>
<p><b><i>Format</i></b><i>.</i> Will your report be distributed as an electronic file or as a paper document? If it will be an electronic document, will it be available on the Internet? Will it be editable? Will it be restricted somehow, such as with a password?</p>
<p><b><i>Appearance</i></b><i>.</i> Will the report be limited to black-and-white or will color be included? What will be the ratio of graphics to text? Will the report be conventional or glitzy, like a marketing brochure? Will there be 11”x17” foldout pages or oversized inserts like maps?</p>
<p><b><i>Specialty items</i></b><i>.</i> Will you need to provide some items apart from the report, such as electronic data files, analysis scripts or program codes, and outputs? Will you have to create a presentation from the contents of the report? Will your graphics be used for courtroom or public presentations?</p>
<p><b><i>Accessibility</i></b>. Do you need to follow the guidelines of Section 508 of the Rehabilitation Act of 1973, which may affect your use of headings, tables, graphic objects, and special characters? Should you account for common forms of color blindness in your color graphics?</p>
<p><b>Take a Few Moments<br />
</b>You won’t have to address all of these details in evaluating your audience and many will only require a few moments of thought. But, if you think through these considerations, you’ll have a much better idea of who you are writing the report for and how you should write it.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/write-data-analysis-reports-lesson-2-know-audience/">How to write data analysis reports. Lesson -2 know your audience.</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/write-data-analysis-reports-lesson-2-know-audience/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to scrape data from web using python</title>
		<link>http://bigdata-madesimple.com/scrape-data-web-using-python/</link>
		<comments>http://bigdata-madesimple.com/scrape-data-web-using-python/#comments</comments>
		<pubDate>Fri, 19 Feb 2016 07:18:03 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=17380</guid>
		<description><![CDATA[<p>Can you guess a simple way you can get data from a web page? It’s through a technique...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/scrape-data-web-using-python/">How to scrape data from web using python</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Can you guess a simple way you can get data from a web page? It’s through a technique called web scraping.</p>
<p>In case you are not familiar with web scraping, here is an explanation:</p>
<p>“Web scraping is a computer software technique of extracting information from websites”</p>
<p>“Web scraping focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.”</p>
<p>Some web pages make your life easier: they offer an API, an interface that you can use to download data. Websites like <a href="http://developer.rottentomatoes.com/member/register" target="_blank">Rotten Tomatoes</a> and <a href="https://twittercommunity.com/t/how-to-get-my-api-key/7033" target="_blank">Twitter</a> provide APIs to access their data. But if a web page doesn’t provide an API, you can use Python to scrape data from it.</p>
<p>I will be using two Python modules for scraping data.</p>
<ul>
<li>urllib2</li>
<li>BeautifulSoup (bs4)</li>
</ul>
<p>So, are you ready to scrape a webpage? All you have to do to get started is follow the steps given below:</p>
<p><strong>Understanding HTML Basics</strong></p>
<p>Scraping is all about HTML tags, so you need to understand HTML in order to scrape data.</p>
<p>This is an example of a minimal webpage defined in HTML tags. The root tag is <i>&lt;html&gt;</i>, and inside it you have the <i>&lt;head&gt;</i> tag, which contains the title of the page and may also hold other meta information such as keywords. The <i>&lt;body&gt;</i> tag contains the actual content of the page. &lt;h1&gt;, &lt;h2&gt;, &lt;h3&gt;, &lt;h4&gt;, &lt;h5&gt; and &lt;h6&gt; are the different header levels.</p>
<p><img class="alignnone size-full wp-image-17381" alt="data-science" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/data-science.png" width="293" height="183" /></p>
<p>These are some useful HTML tags you should know.</p>
<p><img class="alignnone size-full wp-image-17382" alt="Useful tags" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/Useful-tags.png" width="562" height="172" /></p>
<p>I encourage you to inspect a web page and view its <a href="http://www.wikihow.com/View-Source-Code" target="_blank">source code</a> to understand more about html.</p>
<p><strong>Scraping A Web Page Using Beautiful Soup</strong></p>
<p>I will be scraping data from bigdataexaminer.com. I am importing urllib2, Beautiful Soup (bs4), pandas and NumPy.</p>
<pre><code>import urllib2
import bs4
import pandas as pd
import numpy as np
</code></pre>
<p><img class="alignnone size-full wp-image-17383" alt="url lib" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/url-lib.png" width="339" height="68" /></p>
<p><img class="alignnone size-full wp-image-17384" alt="url lib 2" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/url-lib-2.png" width="743" height="110" /></p>
<p>What <i>beautiful = urllib2.urlopen(url).read()</i> does is go to <i>bigdataexaminer.com</i> and fetch the whole HTML text, which I then store in a variable called <i>beautiful</i>.</p>
<p>Now I have to parse and clean the HTML code. <a href="http://www.crummy.com/software/BeautifulSoup/" target="_blank">BeautifulSoup</a> is a really useful Python module for parsing HTML and XML files. Beautiful Soup gives a <i>BeautifulSoup</i> object, which represents the document as a nested data structure.</p>
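<p>For reference, here is a minimal sketch of the fetch-and-parse step shown in the screenshots above. It assumes Python 2 with <em>urllib2</em> (on Python 3, <em>urllib.request.urlopen</em> plays the same role), and the variable names simply mirror the ones used in this article.</p>
<pre><code>import urllib2
import bs4

url = 'http://bigdataexaminer.com'        # the site scraped throughout this article
beautiful = urllib2.urlopen(url).read()   # download the raw HTML as one string
soup = bs4.BeautifulSoup(beautiful)       # parse it into a navigable tree of tags
</code></pre>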
<p><strong>Prettify</strong></p>
<p>You can use the <i>prettify()</i> function to display different levels of the HTML code.</p>
<p><img class="alignnone size-full wp-image-17385" alt="beautiful soup" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/beautiful-soup.png" width="294" height="37" /></p>
<p><img class="alignnone size-full wp-image-17386" alt="html language" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/html-language.png" width="349" height="187" /></p>
<p class="MsoNormal"><span style="line-height: 1.71429; font-size: 1rem;">The simplest way to navigate the </span><span style="color: blue;"><a href="http://en.wikipedia.org/wiki/Parse_tree" target="_blank">parse tree</a></span><span style="line-height: 1.71429; font-size: 1rem;"> is to say the name of the tag you want. If you want the </span><i style="line-height: 1.71429; font-size: 1rem;">&lt;h1&gt;</i><span style="line-height: 1.71429; font-size: 1rem;"> tag, just say </span><i style="line-height: 1.71429; font-size: 1rem;">soup.h1.prettify()</i><span style="line-height: 1.71429; font-size: 1rem;">:</span></p>
<p class="MsoNormal"><img class="alignnone size-full wp-image-17388" alt="soup" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/soup.png" width="447" height="154" /></p>
<p><strong>Contents</strong></p>
<p><i>soup.tag.contents</i> will return the contents of a tag as a list.</p>
<p>In[18] : soup.head.contents</p>
<p><img class="alignnone size-full wp-image-17389" alt="meta char set" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/meta-char-set.png" width="602" height="138" /></p>
<p>The following returns the <i>title</i> tag nested inside the <i>head</i> tag.</p>
<p>In[45] : <i>x = soup.head.title</i></p>
<p>Out [45]: &lt;title&gt;&lt;/title&gt;</p>
<p><i>.string</i> will return the string present inside the <i>title</i> tag. As bigdataexaminer.com doesn’t have a title, the value returned is None.</p>
<p><img class="alignnone size-full wp-image-17390" alt="string" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/string.png" width="298" height="78" /></p>
<p><strong>Descendants</strong></p>
<p>The <i>.descendants</i> generator lets you iterate over all of a tag’s children, recursively.</p>
<p><img class="alignnone size-full wp-image-17391" alt="descendants" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/descendants.png" width="283" height="53" /></p>
<p><img class="alignnone size-full wp-image-17392" alt="meta" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/meta.png" width="582" height="167" /></p>
<p>You can also look at the strings using the <i>.strings</i> generator.</p>
<p><img class="alignnone size-full wp-image-17393" alt="soup strings" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/soup-strings.png" width="225" height="42" /></p>
<p><img class="alignnone size-full wp-image-17394" alt="text string" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/text-string.png" width="967" height="158" /></p>
<p>In[56]: <i>soup.get_text()</i> extracts all the text from bigdataexaminer.com.</p>
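<p>The same ideas in sketch form; again, the setup line recreates the <em>soup</em> object and the output depends on the page being scraped.</p>
<pre><code>import urllib2
import bs4

soup = bs4.BeautifulSoup(urllib2.urlopen('http://bigdataexaminer.com').read())

# Walk every descendant of the head tag, however deeply nested.
for child in soup.head.descendants:
    print(child)

# Iterate over just the text nodes of the whole document.
for text in soup.strings:
    print(repr(text))

# Or pull all of the visible text out in one go.
all_text = soup.get_text()
</code></pre>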
<p><strong>FindAll</strong></p>
<p>You can use <i>find_all()</i> to find all the <i>‘a’</i> tags on the page.</p>
<p><img class="alignnone size-full wp-image-17395" alt="find all" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/find-all.png" width="606" height="112" /></p>
<p>To get just the first four <i>‘a’</i> tags, you can use the <i>limit</i> argument.</p>
<p><img class="alignnone size-full wp-image-17396" alt="soup-findall" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/soup-findall.png" width="532" height="121" /></p>
<p>To find particular text on a web page, you can use the <i>text</i> argument along with <i>find_all()</i>. Here I am searching for the term ‘data’ on bigdataexaminer.com.</p>
<p><img class="alignnone size-full wp-image-17397" alt="a tag" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/a-tag.png" width="321" height="65" /></p>
<p>To get the attributes of the second <i>‘a’</i> tag on bigdataexaminer.com:</p>
<p><img class="alignnone size-full wp-image-17398" alt="big data exam" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/big-data-exam.png" width="439" height="78" /></p>
<p>You can also use a list comprehension to get the attributes of the first four <i>‘a’</i> tags on bigdataexaminer.com.</p>
<p><img class="alignnone size-full wp-image-17399" alt="big data examiner" src="http://bigdata-madesimple.com/wp-content/uploads/2016/02/big-data-examiner.png" width="559" height="127" /></p>
<p><strong>Conclusion</strong></p>
<p>A data scientist should know how to scrape data from websites, and I hope you have found this article useful as an introduction to web scraping with Python. Apart from Beautiful Soup, there is another useful Python library called <a href="http://www.kdnuggets.com/2011/02/pattern-python-web-mining-module.html" target="_blank">pattern</a> for web scraping. I also found a good tutorial on web scraping using <a href="https://www.youtube.com/watch?v=3xQTJi2tqgk" target="_blank">Python</a>.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/scrape-data-web-using-python/">How to scrape data from web using python</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/scrape-data-web-using-python/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>How to implement these 5 powerful probability distributions in Python</title>
		<link>http://bigdata-madesimple.com/how-to-implement-these-5-powerful-probability-distributions-in-python/</link>
		<comments>http://bigdata-madesimple.com/how-to-implement-these-5-powerful-probability-distributions-in-python/#comments</comments>
		<pubDate>Mon, 14 Dec 2015 07:35:18 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=16576</guid>
		<description><![CDATA[<p>R is considered as the de facto programming language for statistical analysis right? But In this post, I...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/how-to-implement-these-5-powerful-probability-distributions-in-python/">How to implement these 5 powerful probability distributions in Python</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>R is considered the de facto programming language for statistical analysis, right? But in this post, I will show you how to easily implement statistical concepts using Python.</p>
<p>I will implement discrete and continuous probability distributions using Python. I won’t get into the mathematical details of these distributions, but I will mention some of the best resources to learn the math concepts involved in these methods.</p>
<p>Before we jump into these probability distributions, I want to give a glimpse of what a random variable is. A random variable assigns a numerical value to each outcome of a random experiment.</p>
<p>For example, a random variable for a coin flip can be represented as</p>
<p>X = { 1 if heads, 2 if tails }</p>
<p>A random variable is a variable that takes on a set of possible values (<strong>discrete</strong> or <strong>continuous</strong>) and is subject to <em>randomness</em>. Each possible value the random variable can take on is associated with a probability. The set of possible values and their associated probabilities is known as a <a href="http://en.wikipedia.org/wiki/Probability_distribution" target="_blank">probability distribution</a>.</p>
<p>I encourage you to go through scipy.stats <a href="http://docs.scipy.org/doc/scipy/reference/stats.html" target="_blank">module</a>.</p>
<p>There are two types of probability distributions, discrete and continuous probability distributions.</p>
<p>Discrete probability distributions are also called <a href="http://en.wikipedia.org/wiki/Probability_mass_function" target="_blank">probability mass functions</a>. Some examples of discrete probability distributions are the Bernoulli distribution, the binomial distribution, the Poisson distribution and the geometric distribution.</p>
<p>Continuous probability distributions, also known as <a href="http://en.wikipedia.org/wiki/Probability_density_function" target="_blank">probability density functions</a>, are functions that take on continuous values (e.g. values on the real line). Examples include the normal distribution, the exponential distribution and the beta distribution.</p>
<p>To understand more about discrete and continuous random variables, watch Khan Academy’s probability distribution <a href="https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/discrete-and-continuous-random-variables" target="_blank">videos</a>.</p>
<p><strong>Binomial Distribution</strong></p>
<p>A random variable X that has a binomial distribution represents the number of successes in a sequence of n independent yes/no trials, each of which yields success with probability p.</p>
<p>E(X) = np, Var(X) = np(1−p)</p>
<p>If you want to know how each function works, you can use the help command in your IPython notebook. E(X) is the expected value, or mean, of the distribution.</p>
<p>Type <em>stats.binom?</em> to learn about the binom function.</p>
<p><em>Example of binomial distribution: what is the probability of getting 2 heads out of 10 flips of a coin that comes up heads with probability 0.3?</em></p>
<p>In this experiment the probability of getting a head is 0.3, which means that on average you can expect 3 of the 10 flips to be heads. I define all the possible values the number of heads can take, k = <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html" target="_blank">np.arange</a>(0,11): you can observe zero heads, one head, all the way up to ten heads. I am using <a href="http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.binom.html" target="_blank"><em>stats.binom.pmf</em></a> to calculate the probability mass function for each observation. It returns a list of 11 elements, each representing the probability associated with that count of heads.</p>
<p><img class="alignnone size-full wp-image-16577" alt="binomial distribution in python" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/binomial-distribution-in-python.png" width="480" height="228" /></p>
<p><img class="alignnone size-full wp-image-16578" alt="binomial distribution graphs in python" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/binomial-distribution-graphs-in-python.png" width="631" height="499" /></p>
<p>You can simulate a binomial random variable using <em>.rvs</em>. The parameter size specifies how many simulations you want to do. I ask Python to return 10000 binomial random variables with parameters n and p. I am printing the mean and standard deviation of these 10000 random variables. Then I am going to plot the histogram of all the random variables that I simulated.</p>
<p><img class="alignnone size-full wp-image-16579" alt="binomial simulation" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/binomial-simulation.png" width="509" height="182" /></p>
<p><img class="alignnone size-full wp-image-16580" alt="binomial simulation 1" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/binomial-simulation-1.png" width="645" height="389" /></p>
<p><strong>Poisson Distribution</strong></p>
<p>A random variable X that has a <a href="https://www.statstodo.com/Poisson_Exp.php" target="_blank">Poisson distribution</a> represents the number of events occurring in a fixed time interval with a rate parameters λ. λ tells you the rate at which the number of events occur.  The average and variance is λ.</p>
<p><img class="alignnone size-full wp-image-16581" alt="poisson distribution" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/poisson-distribution.png" width="148" height="62" /></p>
<p>E(X) = λ, Var(X) = λ</p>
<p><img class="alignnone size-full wp-image-16583" alt="poisson distribution one" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/poisson-distribution-one.png" width="481" height="158" /></p>
<p><img class="alignnone size-full wp-image-16584" alt="poisson distribution two" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/poisson-distribution-two.png" width="393" height="105" /></p>
<p><img class="alignnone size-full wp-image-16585" alt="poisson distribution three" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/poisson-distribution-three.png" width="650" height="403" /></p>
<p>You can notice that the number of accidents peaks around the mean. On average you can expect λ events. Try different values of lambda and n, then see how the shape of the distribution changes.</p>
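<p>Here is a sketch of the Poisson calculation; the rate of 2 events per interval is an illustrative choice, not necessarily the one in the screenshots.</p>
<pre><code>import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rate = 2                              # lambda: average number of events per interval
n = np.arange(0, 10)
poisson = stats.poisson.pmf(n, rate)  # probability of observing each count

plt.plot(n, poisson, 'o-')
plt.title('Poisson: lambda=%i' % rate)
plt.xlabel('Number of events')
plt.ylabel('Probability')
plt.show()
</code></pre>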
<p>Now I am going to simulate 1000 random variables from a Poisson distribution.</p>
<p><img class="alignnone size-full wp-image-16586" alt="poisson random variables" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/poisson-random-variables.png" width="438" height="181" /></p>
<p><img class="alignnone size-full wp-image-16587" alt="simulating poisson random variables" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/simulating-poisson-random-variables.png" width="612" height="438" /></p>
<p><strong>Normal Distribution</strong></p>
<p>The <a href="http://www.mathsisfun.com/data/standard-normal-distribution.html" target="_blank">normal distribution</a> is a continuous distribution or a function that can take on values anywhere on the real line. The normal distribution is parameterized by two parameters: the mean of the distribution μ and the variance σ2.</p>
<p><img class="alignnone size-full wp-image-16588" alt="normal distribution" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/normal-distribution.png" width="504" height="68" /></p>
<p><img class="alignnone size-full wp-image-16589" alt="Normal distribution in python" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/Normal-distribution.png" width="685" height="175" /></p>
<p><img class="alignnone size-full wp-image-16590" alt="Plotting normal distribution in python" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/Plotting-normal-distribution-in-python.png" width="636" height="396" /></p>
<p>The normal distribution can take values from minus infinity to plus infinity. Notice that I am using <em>stats.norm.pdf</em>, since the normal distribution is a probability density function.</p>
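<p>As a sketch, the standard normal density (mean 0, standard deviation 1) can be computed and plotted like this; the grid of x values is an arbitrary choice.</p>
<pre><code>import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

mu, sigma = 0, 1                       # mean and standard deviation
x = np.arange(-5, 5, 0.1)
normal = stats.norm.pdf(x, mu, sigma)  # density values, not probabilities

plt.plot(x, normal)
plt.title('Normal: mu=%i, sigma=%i' % (mu, sigma))
plt.xlabel('x')
plt.ylabel('Density')
plt.show()
</code></pre>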
<p><strong>Beta Distribution</strong></p>
<p>The <a href="http://stats.stackexchange.com/questions/47771/what-is-the-intuition-behind-beta-distribution" target="_blank">beta distribution</a> is a continuous distribution which can take values between 0 and 1. This distribution is parameterized by two shape parameters α and β.</p>
<p><img class="alignnone size-full wp-image-16591" alt="beta distribution" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/beta-distribution.png" width="815" height="107" /></p>
<p>The shape of the beta distribution depends on the values of the α and β parameters. The beta distribution is predominantly used in Bayesian analysis.</p>
<p><img class="alignnone size-full wp-image-16592" alt="beta distribution using python" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/beta-distribution-using-python.png" width="325" height="162" /></p>
<p><img class="alignnone size-full wp-image-16593" alt="beta distribution plotting using python" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/beta-distribution-plotting-using-python.png" width="625" height="399" /></p>
<p><strong>Exponential Distribution</strong></p>
<p>The exponential distribution represents a process in which events occur continuously and independently at a constant average rate.</p>
<p><img class="alignnone size-full wp-image-16594" alt="exponential distribution" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/exponential-distribution.png" width="378" height="49" /></p>
<p><img class="alignnone size-full wp-image-16595" alt="exponential distribution one" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/exponential-distribution-one.png" width="174" height="26" /></p>
<p>I set the lambda parameter to 0.5 and define x over a range of values:</p>
<p><img class="alignnone size-full wp-image-16596" alt="lambda parameter in exponential distribution" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/lambda-parameter.png" width="495" height="139" /></p>
<p><img class="alignnone size-full wp-image-16597" alt="exponential distribution two" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/exponential-distribution-two.png" width="641" height="393" /></p>
<p>Then I simulate 1000 random variables from an exponential distribution. <em>scale</em> is the inverse of the lambda parameter, and setting <em>ddof=1</em> in <em>np.std</em> makes it divide by n−1 instead of n.</p>
<p><img class="alignnone size-full wp-image-16598" alt="exponential random variables" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/exponential-random-variables.png" width="431" height="167" /></p>
<p><img class="alignnone size-full wp-image-16600" alt="simulating exponential random variables" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/simulating-exponential-random-variables.png" width="621" height="420" /></p>
<p><strong>Conclusion</strong></p>
<p>Distributions are like a blueprint for building a house, and a random variable is a summary of what happens in an experiment. I recommend watching the lecture from the <a href="http://cm.dce.harvard.edu/2014/01/14328/L05/index_H264SingleHighBandwidth-16x9.shtml" target="_blank">Harvard data science course</a>; Professor Joe Blitzstein gives a summary of everything you need to know about statistical models and distributions.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/how-to-implement-these-5-powerful-probability-distributions-in-python/">How to implement these 5 powerful probability distributions in Python</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/how-to-implement-these-5-powerful-probability-distributions-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
