<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Big Data Made Simple - One source. Many perspectives. &#187; Hadoop</title>
	<atom:link href="http://bigdata-madesimple.com/category/tech-and-tools/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://bigdata-madesimple.com</link>
	<description>One source. Many perspectives.</description>
	<lastBuildDate>Sat, 08 Jul 2017 05:11:57 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.2</generator>
		<item>
		<title>What are Hadoop alternatives and should you look for one?</title>
		<link>http://bigdata-madesimple.com/what-are-hadoop-alternatives-and-should-you-look-for-one/</link>
		<comments>http://bigdata-madesimple.com/what-are-hadoop-alternatives-and-should-you-look-for-one/#comments</comments>
		<pubDate>Thu, 01 Jun 2017 02:30:20 +0000</pubDate>
		<dc:creator>Baiju NT</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=21452</guid>
		<description><![CDATA[<p>Hadoop’s development from a batch-oriented, large-scale analytics tool to an entire ecosystem comprised of various application, tools, services...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/what-are-hadoop-alternatives-and-should-you-look-for-one/">What are Hadoop alternatives and should you look for one?</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Hadoop’s development from a batch-oriented, large-scale analytics tool into an entire ecosystem of applications, tools, services and vendors has gone hand in hand with the rise of the big data marketplace. It is predominantly used for large-scale data analysis by companies such as Facebook and eBay, and although <a href="http://bigdata-madesimple.com/reasons-why-hadoop-as-a-service-is-recommended-for-your-business/">Hadoop is considered to be synonymous with big data analysis</a>, it is far from the only suitable option. While Hadoop may be an adequate tool for large data sets, it is far inferior to SQL when it comes to expressing computations.</p>
<p>Additionally, Hadoop does not offer a viable indexing option and only offers full table scans. Not to mention its tendency for leaky abstractions, including <a href="https://library.netapp.com/ecmdocs/ECMP1552961/html/GUID-6B3E55C3-BD76-4B3D-A9CC-FA0128B60D19.html">cluster contention</a>, file fragmentation, and Java memory errors. Analyzing large data sets is a piece of cake for Hadoop, but when it comes to streaming calculations in real time, it falls a bit short. Fortunately, there are solutions available that deal with this particular issue, including Apache’s very own Spark and Storm, Google’s BigQuery and DataTorrent’s RTS tools, to name a few.</p>
<p><strong>1. Apache Spark</strong></p>
<p>Hailed as the de facto successor to the already popular Hadoop, <a href="http://spark.apache.org/">Apache Spark</a> is used as a computational engine for Hadoop data. Compared with Hadoop’s MapReduce, Spark delivers a marked increase in computational speed and fully supports the various applications in the ecosystem.</p>
<p>Although Spark is normally associated with various Hadoop implementations, it can actually be used with various other data stores. Not only does Spark not have to rely on Hadoop, it can be run as a completely independent tool.</p>
<p>Numerous companies, including IBM, have started to align their analytics around Spark. It offers a flexibility with different data stores that is useful and more importantly, practical when compared to Hadoop.</p>
<p>This is reflected particularly well in the fact that Spark is an open-source platform offering real-time data processing at speeds up to 100 times faster than Hadoop’s MapReduce, which makes it an excellent option for <a href="http://bigdata-madesimple.com/top-10-machine-learning-frameworks/">machine learning</a> based processing. It can be executed on virtually any platform, including Apache Mesos, EC2 and Hadoop, either in the cloud or in standalone cluster mode.</p>
<p><strong>2. Apache Storm</strong></p>
<p><a href="http://storm.apache.org/">Apache Storm</a> is another excellent open-source tool for processing large quantities of analytics data. Whereas Hadoop can only process data in batches, Storm can do it in real time. The biggest difference between Hadoop and Apache Storm is the way each handles and distributes data.</p>
<p>Hadoop writes incoming data to the HDFS file system, distributes it across nodes for processing, and, once the job completes, writes the results back to HDFS, where they can finally be used.</p>
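<p>That batch flow can be sketched in miniature. The following is a hedged, pure-Python illustration of the map, shuffle and reduce phases of a word count, not Hadoop’s actual Java API:</p>

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all counts for one word.
    return key, sum(values)

lines = ["big data made simple", "big data with hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

<p>The whole input is read, processed and written out as a single job, which is why batch results only become available once the job finishes.</p>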
<p>Storm, on the other hand, is a system whose processes have no clear beginning or end. Data processing is organized as a topology through which a continuous stream of incoming entries is transformed and analyzed as it arrives. This makes Storm a system that can be used for CEP, or complex event processing.</p>
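<p>The contrast with the batch model can be sketched with plain Python generators; this is a hedged illustration of the spout/bolt idea, not Storm’s actual API:</p>

```python
from collections import Counter

def spout(source):
    # Spout: an unbounded source of tuples; a finite list stands in here.
    for event in source:
        yield event

def counting_bolt(stream):
    # Bolt: update a running total as each tuple arrives, so results
    # are current after every event rather than after a finished batch.
    totals = Counter()
    for event in stream:
        totals[event] += 1
        yield event, totals[event]

snapshots = list(counting_bolt(spout(["click", "view", "click", "click"])))
print(snapshots[-1])
```

<p>After every event the downstream consumer sees an up-to-date count, which is the essence of stream processing versus batch processing.</p>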
<p>Apache Storm can be viewed as a useful solution for companies, allowing them to react appropriately to both continuous and sudden influxes of data. Let’s not forget that Apache Storm is fault-tolerant, fully scalable and extremely easy to set up and operate.</p>
<p>Apache Storm was developed in Clojure, a Lisp dialect that runs on the Java Virtual Machine (JVM). This is one of its greatest strengths, as it offers compatibility with applications and components written in programming languages such as C#, Java, Python, Perl, PHP and Scala.</p>
<p>As an Apache project, Storm also has ties to <a href="https://flink.apache.org/">Flink</a>, which offers support for state management and event-time processing. This gives <a href="http://picnet.com.au/software-engineering/">software development more room</a> for flexibility compared with frameworks that aren’t nearly as versatile.</p>
<p><strong>3. Google BigQuery</strong></p>
<p><a href="https://cloud.google.com/bigquery/what-is-bigquery">Google’s BigQuery</a> is a fully-fledged platform used for big data analysis. It allows its users to use SQL without being bothered with database or infrastructure management. The web service relies heavily on Google storage in order to provide users with an interactive analysis of large sets of data.</p>
<p>This means that you don’t have to invest in additional hardware which would otherwise be needed to process such large quantities of data. Its data mining algorithms are extremely useful for discovering specific user-behavior patterns in the raw data which are normally very difficult to discern using standard reporting.</p>
<p>What makes BigQuery such a strong contender against Hadoop is the fact that it works very well alongside the <a href="https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/">MapReduce</a> tool. Google also takes a proactive approach to improving existing features and adding new ones, all in order to provide users with a superior data analysis tool. This is particularly evident in its efforts to make importing custom data sets, and using them with services such as Google Analytics, a walk in the park.</p>
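<p>As a rough sketch of the interface, analysis in BigQuery is expressed as ordinary SQL rather than hand-written jobs; the table and column names below are purely illustrative, not a real dataset:</p>

```sql
-- Top ten users by session count; dataset, table and columns are hypothetical.
SELECT user_id, COUNT(*) AS sessions
FROM analytics.page_views
GROUP BY user_id
ORDER BY sessions DESC
LIMIT 10;
```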
<p><strong>4. DataTorrent RTS</strong></p>
<p>DataTorrent RTS is another open-source solution for the analysis and processing of big data, both in batches and in real time. This all-in-one tool is designed to revolutionize the workings of Hadoop’s MapReduce environment and further improve on the performance offered by Apache’s Spark and Storm tools.</p>
<p><a href="https://www.datatorrent.com/products-services/datatorrent-rts/">DataTorrent RTS</a> can process billions of individual events every second and recover node outages without losing data and without human intervention. It is completely scalable, easy to execute and offers guaranteed event processing and significantly higher in-memory performance.</p>
<p><strong>5. Hydra</strong></p>
<p>The last addition to our Hadoop alternatives list is <a href="https://projecthydra.org/">Hydra</a>, a task processing system that provides its users with real-time analytics. It was created out of the need for a scalable distributed solution and is released under the Apache License.</p>
<p>Hydra’s tree-based configuration allows it to easily perform both batch and streaming operations by storing and processing user data across multiple clusters, some of which could potentially have thousands of individual nodes. It features a management system which automatically distributes new jobs between clusters, balances out all the existing jobs, replicates data and even handles node failures.</p>
<p>For all the tremendous power and numerous benefits Hadoop has to offer, it still has some significant drawbacks. Its data distribution process is far too complex, and it lacks efficiency when processing unstructured data. Fortunately, alternatives are available, some of which offer significant speed increases and utilize hardware more efficiently by relying on advanced streaming operations. Make sure to test them thoroughly before deciding on the one tool that best meets your specific requirements.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/what-are-hadoop-alternatives-and-should-you-look-for-one/">What are Hadoop alternatives and should you look for one?</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/what-are-hadoop-alternatives-and-should-you-look-for-one/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The business of transferring data from Salesforce to Hadoop</title>
		<link>http://bigdata-madesimple.com/the-business-of-transferring-data-from-salesforce-to-hadoop/</link>
		<comments>http://bigdata-madesimple.com/the-business-of-transferring-data-from-salesforce-to-hadoop/#comments</comments>
		<pubDate>Fri, 28 Apr 2017 06:40:10 +0000</pubDate>
		<dc:creator>Baiju NT</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=21346</guid>
		<description><![CDATA[<p>The sustained success of Hadoop has brought about a radical change in big data management. This highly popular...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/the-business-of-transferring-data-from-salesforce-to-hadoop/">The business of transferring data from Salesforce to Hadoop</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>The sustained success of Hadoop has brought about a radical change in big data management. This highly popular open-source MapReduce technology allows easy access and provides reliable answers to advanced data questions. Data management has been taken to the next level by Hadoop.</p>
<p>Salesforce is a cloud-based Customer Relationship Management (CRM) suite that boasts wide-ranging customization options and business-process support, and it is embraced by large organizations across the globe. It is remarkably efficient at following a business-to-business pipeline and offers comprehensive packages for analytics, marketing and customer service. Several special licenses for partner and customer communities provide web portals integrated directly with the CRM. This is very valuable because, with Communities, one can build a whole platform offering and collect data from different customers in a reasonably short period of time.</p>
<p><b>The Hadoop &amp; Salesforce Integration</b></p>
<p>Thanks to the recent integration between Salesforce and major Hadoop distributions such as Hortonworks and Cloudera, data management has become much more trustworthy and a lot easier. This integration marks the beginning of ease and precision in handling huge data entries, making bulky databases and files far more convenient to manage.</p>
<p>Salesforce is regarded as super-efficient software for organizing business processes and data. However, its multi-tenant structure imposes limits on the amount of data that can be imported and on the time available for running complicated algorithms. In this context, integrating Salesforce CRM with Hadoop is a robust choice: Salesforce can generate transactional data that is then stored and analyzed in Hadoop.</p>
<p>Today the biggest challenge is making this integration work for daily users; it pays off only when database managers can effectively exploit its benefits. Enterprise organizations that have already invested in the cloud often run several Salesforce orgs to serve the specific requirements of different business units.</p>
<p>When enterprises want to examine potential cross-selling interactions, they are left analyzing mammoth amounts of interaction data and other customer transactions within Hadoop clusters. Thanks to Informatica Cloud’s support for Salesforce and numerous variants of Hadoop, deployment time can now be cut significantly. Get in touch with <a href="https://www.flosum.com/salesforce-version-control-git/">Flosum.com</a> for smart solutions.</p>
<p><b>Getting Your Salesforce Data onto Hadoop​</b></p>
<p>There is a whole set of challenges involved in migrating Salesforce data to a Hadoop cluster, but it also opens up further database-integration opportunities, such as combining Salesforce data with domain-specific business data and log data. That said, it doesn&#8217;t really have to be a difficult task. There are a host of great tools and solutions, such as Salesforce2Hadoop, that can make these transfers a piece of cake. These are generally command-line tools that incrementally import data from Salesforce to your local file system, supporting standard data types such as Accounts and Opportunities as well as custom data types.</p>
<p>The process is a bit lengthy and involves a sizeable learning curve, but it is very interesting. The Avro schema, which is derived from the Enterprise WSDL, must be updated for every import. Data extraction is handled by WSC, a Java library that talks to Salesforce over SOAP and is much easier to use than raw SOAP itself.</p>
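<p>Once records are extracted, they are typically flattened into a line-oriented format before being loaded into HDFS. A minimal sketch of that step, assuming hypothetical field names (Id, Name, Amount) rather than any fixed Salesforce schema:</p>

```python
import json

# Records shaped roughly like the result of a Salesforce SOQL query;
# the field names here are illustrative, not a real schema.
records = [
    {"Id": "0061", "Name": "Acme renewal", "Amount": 12000},
    {"Id": "0062", "Name": "Globex upsell", "Amount": 4500},
]

def to_ndjson(rows):
    # Newline-delimited JSON: one self-contained record per line,
    # a common landing format for HDFS ingestion.
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)

ndjson = to_ndjson(records)
print(len(ndjson.splitlines()))
```

<p>One record per line keeps the file splittable, so downstream Hadoop jobs can process it in parallel.</p>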
<p><b>Conclusion</b></p>
<p>There are absolutely no second thoughts about Hadoop taking data management to a whole new level of success. Hadoop has become the fresh face of managing extremely bulky systems and large files that would otherwise be considered unwieldy. The integration of Salesforce and Hadoop has simplified the management of large data files and led to the emergence of newer applications that are effective at solving day-to-day data issues. Experts point out that only those who welcome and adopt this new technology, and exploit the benefits of the integration, walk away victorious.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/the-business-of-transferring-data-from-salesforce-to-hadoop/">The business of transferring data from Salesforce to Hadoop</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/the-business-of-transferring-data-from-salesforce-to-hadoop/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Reasons why hadoop as a service is recommended for your business</title>
		<link>http://bigdata-madesimple.com/reasons-why-hadoop-as-a-service-is-recommended-for-your-business/</link>
		<comments>http://bigdata-madesimple.com/reasons-why-hadoop-as-a-service-is-recommended-for-your-business/#comments</comments>
		<pubDate>Mon, 10 Apr 2017 06:54:09 +0000</pubDate>
		<dc:creator>Baiju NT</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=21209</guid>
		<description><![CDATA[<p>The importance that data is playing in business is hard to downplay. Data is growing exponentially in size...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/reasons-why-hadoop-as-a-service-is-recommended-for-your-business/">Reasons why hadoop as a service is recommended for your business</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>The <a href="http://bigdata-madesimple.com/4-reasons-startups-must-unlock-the-potential-of-big-data/">importance</a> that data is playing in business is hard to downplay.</p>
<p>Data is growing exponentially in size, and as it grows it not only affects our <a href="https://hbr.org/2016/08/how-the-big-data-explosion-has-changed-decision-making">business decisions</a> but also has the potential to become the bedrock of some of the highest-earning industries of the future. From data science to the Internet of Things, data is poised to become one of the most important tools for the businessperson of tomorrow.</p>
<p>Which brings us to the issue of how you actually manage to take advantage of that huge mound of data.</p>
<p>There are plenty of resources online to help you navigate this brave new world, but here we’ll be focusing on one: Hadoop.</p>
<p>Considering that the terabytes only continue to pile up on us, the world of tomorrow is going to need faster, more efficient, more insightful and just overall better methods of processing data so companies can derive insights from it.</p>
<p>Hadoop is a software framework that helps companies access data like never before. An open-source project, Hadoop is a way to store and analyze large amounts of data through a system that employs an interconnected network of nodes. Being open-source, it is constantly improving and cost-effective. Some of the biggest companies in the world, such as Yahoo! and Facebook, use Hadoop to help manage their endless waves of data; indeed, Hadoop’s design grew out of Google’s published MapReduce and GFS work. Think about how much data a company like Yahoo! has to sort through, and it’s pretty clear that it needs some pretty powerful software to help with the burden.</p>
<p>With a focus on helping companies both big and small (though it is particularly useful for larger companies, where additional nodes improve its effectiveness), Hadoop is having a popular moment that may see it become the industry standard for how data is stored, processed and analyzed.</p>
<p>Which means that folks who can work their way around Hadoop are in a fortunate position, and that brings us to Hadoop as a Service (HaaS) and its complementary applications.</p>
<p><b><i>How Hadoop as a Service Helps You Get the Most Out of Your Data </i></b></p>
<p><a href="http://searchcloudstorage.techtarget.com/definition/Hadoop-as-a-service-HaaS">HaaS</a> is a natural extension to the Hadoop software. Seeing as how <a href="http://www.opencirrus.org/why-are-the-biggest-tech-companies-favoring-hadoop/">Hadoop is already cherished</a> by some of tech’s biggest titans, it would only make sense that more companies would come to trust the software.</p>
<p>And as more businesses look to take advantage of all that Hadoop has to offer in making sense of data, HaaS is a complementary service that provides the framework needed to fully engage with Hadoop through a third-party vendor.</p>
<p>Essentially, HaaS provides the help you need to power your business via the use of Hadoop and data.</p>
<p>Some of the biggest providers of HaaS include giants like Amazon and IBM. It’s a big business bent on letting your company take full advantage of all that data has to offer.</p>
<p>A HaaS provider should offer a number of Hadoop supports, including:</p>
<ul>
<li>Hadoop framework deployment support</li>
<li>Hadoop cluster management</li>
<li>Alternative programming languages</li>
<li>Data transfer between clusters</li>
<li>Customizable and user-friendly dashboards and data manipulation</li>
<li>Security features</li>
</ul>
<p>And HaaS is only going to get bigger and better.</p>
<p><a href="http://www.orbisresearch.com/reports/index/global-hadoop-as-a-servicehdaas-market-2015-2019">TechNavio&#8217;s</a> analysts forecast that the global HaaS market is in the midst of a heavy growth period, with a compound annual growth rate of 84.81% over the period 2014-2019.</p>
<p>With growth, of course, comes bigger profits.</p>
<p>According to Zion Market <a href="https://globenewswire.com/news-release/2017/02/15/917410/0/en/Global-Hadoop-Market-will-reach-USD-87-14-billion-by-2022-Zion-Market-Research.html">Research</a>, the global Hadoop market was valued at approximately $7.69 billion in 2016 and is expected to reach approximately $87.14 billion by 2022.</p>
<p>All indications are that both Hadoop and the resultant scaffold industry of HaaS will likely be there to propel the data processing sector into new heights of not only profitability for themselves, but also data literacy for their users.</p>
<p>The fact is that data’s importance cannot be downplayed in terms of its role in tomorrow and today’s business environment. HaaS steps in to help businesses make sense of data by giving them tools and providing the framework necessary to work through the endless streams of bytes.</p>
<p>Huge quantities of <a href="http://www.cloudsecuretech.com/unorganized-huge-sets-data-can-affect-business/">unmanaged data</a> can also have dire consequences for a business, which makes having Hadoop all the more necessary.</p>
<p>With some of the most trusted and successful companies in the world using Hadoop or providing HaaS for smaller organizations looking to make the most out of their data, it’s never been easier or more affordable for your company to acquire valuable insights from the wealth of information that awaits.</p>
<p>Using data to inform business decisions is nothing new, but the breadth and depth of the new tools used to assess and process the seemingly infinite mass of data is. HaaS is simply one more way that companies both large and small can make use of data to push their business forward.</p>
<p>Consider what Philip Russom, research director at Transforming Data With Intelligence (TDWI), had to say about Hadoop during an October webinar.</p>
<p>“The primary path to getting business value from big data, and a lot of new data, like machine data, is through analytics. There are challenges around Hadoop, but I don’t see them stopping anybody,” he said.</p>
<p>“Hadoop is known for its linear scalability. Hadoop can become, essentially, a bigger and better data staging area for both warehousing and data integration.”</p>
<p>He went on to discuss Hadoop’s issues with data governance.</p>
<p>“Hadoop has desirable use cases, but it can be a challenge in terms of data governance. Don’t forget—Hadoop is still kind of new, and it’s still kind of spartan in a lot of ways. That’s part of the secret sauce.”</p>
<p><b><i>How Do I Know if HaaS Is Right for Me? </i></b></p>
<p>While Big Data is definitely something every business wants to engage with and get a handle on, so as to improve its information pool and decision-making capabilities, obviously not all companies will benefit equally from a program like Hadoop or a service like HaaS.</p>
<p>HaaS and Hadoop are prized for their cost-effectiveness, but their efficiency increases with the number of nodes in use, so they tend to benefit larger companies with larger networks.</p>
<p>But ultimately, Hadoop and HaaS offer understandable, user-friendly ways to engage with data that will help you acquire top tier data processing software and capabilities without having to shell out huge amounts of cash. In a world that is over-saturated with information, HaaS helps you sort through the chaos and come to better business decisions. With an offer like that, it’s hard to say no.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/reasons-why-hadoop-as-a-service-is-recommended-for-your-business/">Reasons why hadoop as a service is recommended for your business</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/reasons-why-hadoop-as-a-service-is-recommended-for-your-business/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why use Hadoop? Top pros and cons of Hadoop</title>
		<link>http://bigdata-madesimple.com/why-use-hadoop-top-pros-and-cons-of-hadoop/</link>
		<comments>http://bigdata-madesimple.com/why-use-hadoop-top-pros-and-cons-of-hadoop/#comments</comments>
		<pubDate>Wed, 15 Mar 2017 11:51:41 +0000</pubDate>
		<dc:creator>Ahamed Meeran</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://54.179.177.208/?p=20934</guid>
		<description><![CDATA[<p>Big Data is one of the major areas of focus in today’s digital world. There are tons of...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/why-use-hadoop-top-pros-and-cons-of-hadoop/">Why use Hadoop? Top pros and cons of Hadoop</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Big Data is one of the major areas of focus in today’s digital world. Tons of data are generated and collected from the various processes a company carries out. This data can contain patterns showing how the company can improve its processes, as well as feedback from customers. Needless to say, this data is vital to the company and should not be discarded. But not all of it is useful either; the useless portion should be separated from the useful part and discarded. Various platforms are used to carry out this major process, and the most popular among them is Hadoop. Hadoop can efficiently analyse the data and extract the useful information. It also comes with its own set of advantages and disadvantages:</p>
<p><strong>Pros</strong></p>
<p><strong>1) Range of data sources</strong></p>
<p>The data collected from various sources may be structured or unstructured; the sources can be social media, clickstream data or even email conversations. Converting all of the collected data into a single format would normally take a lot of time. Hadoop saves that time, as it can derive valuable information from data in any form. It also supports a variety of functions such as data warehousing, fraud detection and marketing campaign analysis.</p>
<p><strong>2) Cost effective</strong></p>
<p>In conventional methods, companies had to spend a considerable portion of their budget on storing large amounts of data. In certain cases they even had to delete large sets of raw data to make room for new data, risking the loss of valuable information. Hadoop solves this problem: it is a cost-effective solution for data storage that lets a company keep the entire raw data it generates. If the company changes the direction of its processes in the future, it can easily refer back to the raw data and take the necessary steps, something the traditional approach made impossible because raw data would have been deleted to keep expenses down.</p>
<p><strong>3) Speed</strong></p>
<p>Every organization wants its platforms to get work done quickly, and Hadoop enables just that for data storage and processing. It stores data on a distributed file system, and because the processing tools sit on the same servers as the data, processing runs faster as well. As a result, you can process terabytes of data within minutes using Hadoop.</p>
<p><strong>4) Multiple copies</strong></p>
<p>Hadoop automatically duplicates the data stored in it, creating multiple copies to ensure that data is not lost in the event of a failure. Hadoop treats the data a company stores as important, not to be lost unless the company itself discards it.</p>
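<p>For reference, the number of copies HDFS keeps is controlled by the <code>dfs.replication</code> property; a typical hdfs-site.xml entry, with three copies as the conventional default, looks roughly like this:</p>

```xml
<!-- hdfs-site.xml: each block is stored on this many DataNodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```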
<p><strong>Cons</strong></p>
<p><strong>1) Lack of preventive measures</strong></p>
<p>When handling sensitive data collected by a company, providing the necessary security measures is mandatory. In Hadoop, the security measures are disabled by default; whoever is responsible for data analytics should be aware of this fact and take the required steps to secure the data.</p>
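<p>As an example of the hardening involved: out of the box, Hadoop uses "simple" (trusted) authentication, and switching a cluster to Kerberos is done in core-site.xml, roughly as follows:</p>

```xml
<!-- core-site.xml: the default value of hadoop.security.authentication is "simple" -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```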
<p><strong>2) Small Data concerns</strong></p>
<p>There are a few big data platforms on the market that aren’t fit for small data functions, and Hadoop is one of them: only large businesses that generate big data can fully utilize its functions, and it cannot perform efficiently in small data environments.</p>
<p><strong>3) Risky functioning</strong></p>
<p>Java is one of the most widely used programming languages, and it has also been linked to various security controversies because cyber criminals can exploit frameworks built on it. Hadoop is one such framework, built entirely in Java; the platform is therefore more exposed and can suffer unforeseen damage if left unhardened.</p>
<p>Every platform used in the digital world comes with its own set of advantages and disadvantages. These platforms serve a purpose that is vital to the company, so it is necessary to check whether the pros outweigh the cons. If they do, utilize the pros and take preventive measures to guard against the cons. To learn more about Hadoop and pursue a career in it, enrol for a <a href="http://www.knowledgehut.com/big-data-and-hadoop/big-data-and-hadoop-training">big data Hadoop certification</a>; online big data Hadoop training courses can also help.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/why-use-hadoop-top-pros-and-cons-of-hadoop/">Why use Hadoop? Top pros and cons of Hadoop</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/why-use-hadoop-top-pros-and-cons-of-hadoop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to fetch HBase table data in Apache Phoenix?</title>
		<link>http://bigdata-madesimple.com/how-to-fetch-hbase-table-data-in-apache-phoenix/</link>
		<comments>http://bigdata-madesimple.com/how-to-fetch-hbase-table-data-in-apache-phoenix/#comments</comments>
		<pubDate>Mon, 17 Oct 2016 09:40:01 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=19887</guid>
		<description><![CDATA[<p>This exclusive post is shared by big data services providers to help developers in development. They tell the...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/how-to-fetch-hbase-table-data-in-apache-phoenix/">How to fetch HBase table data in Apache Phoenix?</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>This post is shared by big data services providers to help developers. It describes the best way to fetch HBase table data in Apache Phoenix. Read on to discover what they have to say about Big Data related services.</p>
<p>The term &#8216;Big Data&#8217; has been in the limelight for quite some time now. It is no exaggeration to say that it covers the data that can describe behaviour with respect to almost everything in the world, which is why it is growing exponentially and has become the centre of attention of many articles, meetings, and conferences.</p>
<p>The question now is whether big data will leave a big impact. Many applications in data analytics and data mining use big data at their core and constantly attempt to make better predictions. Decision makers want to study the patterns of what happened in the past and what is happening in the present in order to predict what will happen in the future.</p>
<p>Hadoop offers a complete framework, especially for customers who want to build a data lake from a data warehouse or data mart and then run analytics on it with the help of Apache Spark MLlib. With this approach we can see much improvement in the predictions compared to previous-generation methods.</p>
<p><strong>Technology:</strong></p>
<p>We all know that HBase is the NoSQL database of the Hadoop ecosystem and that it works very well in that role, but it has one limitation: it is not very friendly to SQL developers. To overcome this limitation, the community came up with a SQL skin on top of HBase known as Apache Phoenix.</p>
<p>While working with the HBase-Phoenix integration, we came across one strange behaviour of Phoenix and thought we would share it with you.</p>
<p><strong>Use-case:</strong></p>
<p>Below is the table named <strong>&#8216;test_1&#8217;</strong>, with two columns, that we created in Apache Phoenix.</p>
<p><img class="aligncenter size-full wp-image-19902" alt="Apache Phoenix" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/JDBC.png" width="669" height="69" /></p>
<p>We inserted a few rows into it manually.</p>
<p><img class="aligncenter size-full wp-image-19903" alt="Inserted rows in Apache Phoenix" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/JDBC-2.png" width="727" height="99" /></p>
<p>We verified the inserted rows.</p>
<p><img class="aligncenter size-full wp-image-19904" alt="Apache Phoenix verify the same number of rows" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/JDBC-3.png" width="654" height="149" /></p>
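<p>In Phoenix SQL, the steps shown in the screenshots above can be sketched roughly as follows. The column names and types are illustrative assumptions, since the actual definitions appear only in the images:</p>

```sql
-- Create a two-column table in Phoenix (columns are assumed for illustration)
CREATE TABLE test_1 (
    id   VARCHAR PRIMARY KEY,
    name VARCHAR
);

-- Phoenix uses UPSERT rather than INSERT
UPSERT INTO test_1 VALUES ('1', 'first row');
UPSERT INTO test_1 VALUES ('2', 'second row');

-- Verify the inserted rows
SELECT * FROM test_1;
```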
<p>As per the integration between HBase and Phoenix, the table we created above should be reflected in HBase with the same data.</p>
<p>We verified this: the table was created in HBase with the same name (test_1), and the data was placed correctly.</p>
<p><img class="aligncenter size-full wp-image-19905" alt="Hbase main" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/JDBC-4.png" width="789" height="153" /></p>
<p>Next we tested the same procedure the other way round: we created a table in HBase, loaded data into it, and checked for it in Phoenix.</p>
<p>We created the table in HBase with the name <b>‘TEST_2’</b>.</p>
<p><img class="aligncenter size-full wp-image-19906" alt="HBase with name Test_2" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/JDBC-5.png" width="526" height="68" /></p>
<p>We inserted three rows into it.</p>
<p><img class="aligncenter size-full wp-image-19907" alt="Apache Phoenix(Insert rows)" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/JDBC-1.png" width="468" height="142" /></p>
<p>We verified the rows with the scan command.</p>
<p><img class="aligncenter size-full wp-image-19908" alt="Scan command" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/Scan-command.png" width="793" height="108" /></p>
<p><strong> <span style="text-decoration: underline;">Issue:</span></strong></p>
<p>We searched in Phoenix for the table we had created in HBase, but the table was not present there.</p>
<p>This is actually a limitation of Phoenix: it does not automatically pick up table details from the HBase metadata.</p>
<p><img class="aligncenter size-full wp-image-19909" alt="Phoenix issue" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/Phoenix.png" width="717" height="123" /></p>
<p><strong><span style="text-decoration: underline;">Resolution: </span></strong></p>
<p>We have to create the table manually in Phoenix, with the same name as the table we created in HBase (TEST_2).</p>
<p><img class="aligncenter size-full wp-image-19910" alt="Phoenix resolution" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/Phoenix-resolution.png" width="666" height="71" /></p>
<p>Once the table is created with the same name, Phoenix automatically fetches the data from HBase and reflects it in the Phoenix table.</p>
<p><img class="aligncenter size-full wp-image-19911" alt="Apache phoenix 1" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/JDBC-phoenix-1.png" width="1048" height="240" /></p>
<p><img class="aligncenter size-full wp-image-19912" alt="JDBC phoenix 2" src="http://bigdata-madesimple.com/wp-content/uploads/2016/10/JDBC-phoenix-2.png" width="716" height="148" /></p>
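<p>As a sketch, this mapping step looks like the following Phoenix DDL. The row key column, the column family, and the qualifier names here are assumptions for illustration; they must match the layout of the existing HBase table:</p>

```sql
-- Phoenix folds unquoted identifiers to upper case, so quoting "TEST_2"
-- keeps the name identical to the case-sensitive HBase table name.
CREATE TABLE "TEST_2" (
    "pk"        VARCHAR PRIMARY KEY,  -- maps to the HBase row key
    "cf"."col1" VARCHAR               -- maps to column family cf, qualifier col1
);

-- Once mapped, the existing HBase rows become queryable
SELECT * FROM "TEST_2";
```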
<p>This post was written by <a href="http://www.technoligent.com/bigdata-solutions-consulting.html"><b>Big Data services</b></a> experts to help you learn the process of fetching HBase table data in Apache Phoenix. If there is anything you want to ask, write it in the comments.</p>
<p><b><span style="text-decoration: underline;">Conclusion:</span></b></p>
<p>We found that when we create a table in Phoenix and load data into it, the same data is reflected in HBase under a table with the same name. But when we attempt the reverse and create a table in HBase and load data into it, the data is NOT reflected in a Phoenix table; we have to manually map a Phoenix table to the HBase table.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/how-to-fetch-hbase-table-data-in-apache-phoenix/">How to fetch HBase table data in Apache Phoenix?</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/how-to-fetch-hbase-table-data-in-apache-phoenix/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why you should consider using Hadoop in big data management?</title>
		<link>http://bigdata-madesimple.com/remote-dba-experts-consider-using-hadoop-big-data-management/</link>
		<comments>http://bigdata-madesimple.com/remote-dba-experts-consider-using-hadoop-big-data-management/#comments</comments>
		<pubDate>Mon, 01 Aug 2016 05:05:13 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=19232</guid>
		<description><![CDATA[<p>If you are continuously involved in big data or any project that involves database management, you probably have...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/remote-dba-experts-consider-using-hadoop-big-data-management/">Why you should consider using Hadoop in big data management?</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>If you are continuously involved in big data or any project that involves database management, you have probably heard of Hadoop. This is an open source project created in 2005 by the computer scientist Doug Cutting. The project is now managed by the Apache Software Foundation. Why has this platform gained so much popularity?</p>
<p><img class="aligncenter size-full wp-image-19233" alt="Big data and hadoop" src="http://bigdata-madesimple.com/wp-content/uploads/2016/07/Big-data-and-hadoop.jpg" width="937" height="553" /></p>
<p>In addition to the many benefits that come with the use of Hadoop, one of the main reasons why companies consider it is the low cost of implementation. The platform gives developers great data management provision. The framework supports the processing of big data sets in a distributed computing environment. It is also scalable: you can start with a single server and grow to thousands of servers/machines, each providing storage and computation.</p>
<p>Another remarkable feature of Hadoop is that it offers a distributed file system that facilitates the rapid transfer of data among nodes and also enables systems to run uninterrupted in case any of the nodes fail. This significantly reduces the risk of a catastrophic system failure even when several nodes fail.</p>
<p>Simply put, <a href="http://www.smartdatacollective.com/michelenemschoff/191151/how-maximize-performance-and-scalability-within-your-hadoop-architecture">Hadoop is considered for its scalability.</a> It has proven to be valuable for large-scale organizations. Below are some of the remarkable benefits you will enjoy from using the platform.</p>
<p><img class="aligncenter size-full wp-image-19234" alt="Hadoop" src="http://bigdata-madesimple.com/wp-content/uploads/2016/07/Hadoop.jpg" width="1096" height="576" /></p>
<p><b>Scalable </b></p>
<p>Scalability is the main benefit you get from Hadoop. This is one of the most scalable data storage platforms that you can invest in. The platform stores your data and distributes it as large data sets across your servers. This means it gives you the ability to run applications on thousands of nodes handling thousands of terabytes of data. This feature makes Hadoop the ultimate choice for big data management.</p>
<p><b>Cost effective </b></p>
<p>When choosing a data management system, cost is always a major factor you have to consider. The great thing is that <a href="http://www.itproportal.com/2013/12/20/big-data-5-major-advantages-of-hadoop/">Hadoop offers a very cost effective storage solution</a> for your organization’s ever growing data sets. You no longer need to down-sample data and classify it based on priority. You also don’t need to delete the raw data if you don’t have to.</p>
<p>Hadoop is a cost-effective solution for handling big data sets. Its massive storage and data distribution features make it possible to use all your data without being forced to delete the least essential parts. Storing data is no longer expensive: instead of spending thousands of dollars per terabyte, with Hadoop the cost of storage is much lower, bringing your company real cost savings.</p>
<p><b>Great flexibility </b></p>
<p>With Hadoop, you will be able to derive business insights from a range of data sources like email conversations, social media, and clickstream data. This is possible because the platform enables you to access new data sources and easily use different data types. You will be able to work with structured and unstructured data.</p>
<p>What is more is that you can use Hadoop for a range of other purposes. The platform can be used for recommendation systems, log processing, market campaign analysis, fraud detection and data warehousing. The platform is definitely the ultimate tool for <a href="http://www.remotedba.com/">remote DBA experts</a>.</p>
<p><b>Improved speed</b></p>
<p>The last thing you want is for your system to be sluggish. This is not only annoying but it will also slow the overall running of your business. This is a common problem when using traditional methods in data management. Hadoop, however, resolves this problem.</p>
<p>It is important to understand why. First, the data storage method in Hadoop is based on the distributed file system, which clearly maps the exact location of data in the cluster. Second, the data processing tools run on the same servers where the data is located. Together, this leads to faster data processing.</p>
<p>When dealing with large data sets, Hadoop makes it possible for you to process terabytes of data efficiently and within minutes. You can process petabytes of data in just hours instead of days.</p>
<p><b>Resistant to failure </b></p>
<p>When compared to other platforms, Hadoop has a higher fault tolerance. This is because of how the platform works. In its working, when data is sent to a node, the same data is replicated to other nodes within the cluster. This means that in the event of failure in one or more nodes, there will always be a copy available.</p>
<p><b>Continuity </b></p>
<p>All the aforementioned benefits of using <a href="http://www.dataversity.net/revolutionizing-big-data-and-hadoop-operations-and-analytics/">Hadoop result in one main thing, continuity</a>. Many companies suffer because they lack the data they need to predict the future; that data was often lost when prioritizing which data to keep and which to delete. This is no longer the case once you embrace Hadoop. The platform enables you to collect and keep all your data in a well-organized manner, which has allowed many companies to store and analyze high volumes of data successfully.</p>
<p>Data is generated continuously, be it through mobile platforms, social media, or other services, and these activities keep adding to the volume that must be gathered. Storage and analysis solutions therefore need to scale fast, cost effectively, and securely. Hadoop offers that.</p>
<p><b>Advanced analytics </b></p>
<p>Last but not least, with Hadoop you will benefit from advanced data analytics. The platform offers facts and figures that are more accurate than those of other platforms, along with advanced features like predictive analytics and data visualization, which help you derive useful insights in a graphical manner. At the end of the day, the data will enable you to optimize performance while handling large volumes of information.</p>
<p>When working with large data sets, Hadoop offers more benefits than other relational data management systems. The platform is cost-effective, safe and fast. These are qualities that are essential in database management. What is more is that Hadoop will grow with you to accommodate your exploding data needs. Hadoop is affordable for both small businesses and enterprises.</p>
<p><strong>Conclusion</strong></p>
<p>From the above, you can see the importance of Hadoop and the benefits it brings to big data management. It is therefore essential for companies to understand the advantages of Hadoop so that they can implement it properly.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/remote-dba-experts-consider-using-hadoop-big-data-management/">Why you should consider using Hadoop in big data management?</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/remote-dba-experts-consider-using-hadoop-big-data-management/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>11 key tuning checklists for apache Hadoop!</title>
		<link>http://bigdata-madesimple.com/11-key-tuning-checklists-for-apache-hadoop/</link>
		<comments>http://bigdata-madesimple.com/11-key-tuning-checklists-for-apache-hadoop/#comments</comments>
		<pubDate>Mon, 09 May 2016 10:14:36 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=18455</guid>
		<description><![CDATA[<p>Apache Hadoop is a well-known, de facto framework for processing large data sets through distributed &#38;...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/11-key-tuning-checklists-for-apache-hadoop/">11 key tuning checklists for apache Hadoop!</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Apache Hadoop is a well-known, de facto framework for processing large data sets through distributed &amp; parallel computing. YARN (Yet Another Resource Negotiator) allowed Hadoop to evolve from a simple MapReduce engine into a big data ecosystem that can run heterogeneous (MapReduce and non-MapReduce) apps concurrently. This results in larger clusters with more workloads and users than ever before. Traditional recommendations encourage provisioning, isolation, and tuning to increase performance and avoid resource contention, but they result in highly underutilized clusters.</p>
<p>Tuning is an essential part of maintaining a Hadoop cluster. Cluster administrators must interpret system metrics and optimize for specific workloads (e.g., high CPU utilization versus high I/O). To know what to tune, Hadoop operators often rely on monitoring software for insight into cluster activity. Tools like Ganglia, Cloudera Manager, or Apache Ambari will give us near real-time statistics at the node level, and many provide after-the-fact reports for particular jobs.</p>
<p>Here we will quickly look into 11 tuning checklist items for administrators and developers:</p>
<ol>
<li><b>Number of mappers:</b> If mappers run for only a few seconds, try using fewer mappers that run longer (a minute or two). Increase mapred.min.split.size to reduce the number of mappers allocated.</li>
<li><b>Mapper output:</b> Mappers should output as little data as possible, so try filtering out records on the mapper side and use minimal data to form the map output key and map output value.</li>
<li><b>Number of reducers:</b> Reduce tasks should run for five minutes or so and produce at least a block's worth of data.</li>
<li><b>Combiners:</b> We can specify a combiner to cut the amount of data shuffled between the mappers and the reducers.</li>
<li><b>Compression:</b> We can enable map output compression to improve job execution time.</li>
<li><b>Custom serialization:</b> We can implement a RawComparator so that serialized records can be compared without being deserialized.</li>
<li><b>Disks per node:</b> We can adjust the number of disks per node (mapred.local.dir, dfs.name.dir, dfs.data.dir) and test how scaling affects execution time.</li>
<li><b>JVM reuse:</b> Consider enabling JVM reuse (mapred.job.reuse.jvm.num.tasks) for workloads with lots of short-running tasks.</li>
<li><b>Maximize memory for the shuffle:</b> We can generally maximize memory for the shuffle while leaving the map and reduce functions enough memory to operate, so make the mapred.child.java.opts property as large as the memory on the task nodes allows.</li>
<li><b>Minimize disk spilling:</b> One spill to disk is optimal. The MapReduce counter spilled_records is a useful metric, as it counts the total number of records spilled to disk during a job.</li>
<li><b>Adjust memory allocation:</b> Total Memory = Map Slots + Reduce Slots + TT + DN + Other Services + OS.</li>
</ol>
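<p>The memory equation in item 11 can be sanity-checked with a small budgeting sketch like the one below; all the numbers are illustrative assumptions, not recommendations:</p>

```python
# Sketch of checklist item 11: a worker node's total memory must cover the
# map slots, reduce slots, TaskTracker (TT), DataNode (DN), other services,
# and the OS itself.

def remaining_memory_gb(total_gb, map_slots, reduce_slots,
                        slot_gb=1.0, tt_gb=1.0, dn_gb=1.0,
                        services_gb=2.0, os_gb=2.0):
    """Memory left after the budget; a negative result means the node is
    over-committed and the slot counts should be reduced."""
    used = ((map_slots + reduce_slots) * slot_gb
            + tt_gb + dn_gb + services_gb + os_gb)
    return total_gb - used

# A hypothetical 48 GB node with 1 GB per task slot:
print(remaining_memory_gb(48, map_slots=24, reduce_slots=12))  # 6.0
```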
<p>Originally appeared on <a href="http://dataottam.com/2016/04/28/11-key-tuning-checklists-for-apache-hadoop/" target="_blank">DataOttam</a>.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/11-key-tuning-checklists-for-apache-hadoop/">11 key tuning checklists for apache Hadoop!</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/11-key-tuning-checklists-for-apache-hadoop/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>8 breaking changes in Apache Flink 1.0.0</title>
		<link>http://bigdata-madesimple.com/8-breaking-changes-apache-flink-1-0-0/</link>
		<comments>http://bigdata-madesimple.com/8-breaking-changes-apache-flink-1-0-0/#comments</comments>
		<pubDate>Mon, 02 May 2016 11:35:07 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=18405</guid>
		<description><![CDATA[<p>Apache Flink is an open source platform for distributed stream and batch data processing. Flink’s core is a...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/8-breaking-changes-apache-flink-1-0-0/">8 breaking changes in Apache Flink 1.0.0</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Apache Flink is an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink also builds batch processing on top of the streaming engine, overlaying native iteration support, managed memory, and program optimization.</p>
<p>The Apache Flink community is pleased to announce the availability of the 1.0.0 release. The community put significant effort into improving and extending Apache Flink since the last release, focusing on improving the experience of writing and executing data stream processing pipelines in production.</p>
<p>The 1.0.0 release introduced API breaking changes alongside many cool &amp; exciting features:</p>
<p><b>DataStream API</b></p>
<ul>
<li>partitionByHash was removed; use keyBy instead.</li>
<li>Scala API: the order of the fold() parameters was switched.</li>
<li>Hash partitioner now scrambles hashes with murmur hash. This might break programs relying on the output of the hashing.</li>
</ul>
<p><b>DataSet API</b></p>
<ul>
<li>The Combinable annotation was removed. Implement a combinable GroupReduceFunction&lt;IN, OUT&gt; by implementing the CombineFunction&lt;IN, IN&gt; or GroupCombineFunction&lt;IN, IN&gt; interface in the GroupReduceFunction.</li>
</ul>
<p><b>Gelly</b></p>
<ul>
<li>The LabelPropagation library method now supports any Comparable type of label. It used to expect a Long value, so now users have to specify one extra type parameter when calling the method.</li>
<li>Gelly’s vertex-centric model has been renamed to scatter-gather. Graph’s runVertexCentricIteration() methods have been renamed to runScatterGatherIteration(), and VertexCentricConfiguration has been renamed to ScatterGatherConfiguration.</li>
</ul>
<p><b>Start/Stop scripts</b></p>
<p>The ./bin/start-webclient.sh and ./bin/stop-webclient.sh scripts have been removed. The webclient is now included in Flink’s web dashboard and activated by default. It can be disabled by configuring jobmanager.web.submit.enable: false in ./conf/flink-conf.yaml.</p>
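<p>For reference, the change mentioned above is a single key in the configuration file; everything else can stay at its defaults:</p>

```yaml
# ./conf/flink-conf.yaml
# The web client is bundled into the dashboard and enabled by default in
# 1.0.0; set this to false to disable job submission via the web interface.
jobmanager.web.submit.enable: false
```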
<p><b>Backwards compatibility:</b></p>
<p>Flink 1.0 removes the hurdle of changing the application code when Flink releases new versions. This is huge for production users who want to maintain their business logic and applications while seamlessly benefiting from new patches in Flink.</p>
<p><b>Operational features:</b></p>
<p>Flink by now boasts very advanced monitoring capabilities (this release adds backpressure monitoring, checkpoint statistics, and the ability to submit jobs via the web interface). This release also adds <a href="http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/">savepoints</a>, an essential feature (and unique in the open source world) that allows users to pause and resume applications without compromising result correctness and continuity.</p>
<p><b>Battle-tested:</b></p>
<p>Flink is by now in production use at both large tech and Fortune Global 500 companies. A team at Twitter recently <a href="http://data-artisans.com/extending-the-yahoo-streaming-benchmark/">clocked Flink at 15 million events per second </a>in a moderate cluster.</p>
<p><b>Integrated:</b></p>
<p>Flink has always been integrated with the most popular open source tools, such as Hadoop (HDFS, YARN), Kafka (this release adds full support for Kafka 0.9), HBase, and others. Flink also features compatibility packages and runners, so that it can be used as an execution engine for programs written in MapReduce, Apache Storm, Cascading, and Apache Beam (incubating).</p>
<p>Originally appeared on <a href="http://dataottam.com/2016/03/09/8-breaking-changes-in-apache-flink-1-0-0/" target="_blank">DataOttam</a></p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/8-breaking-changes-apache-flink-1-0-0/">8 breaking changes in Apache Flink 1.0.0</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/8-breaking-changes-apache-flink-1-0-0/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Top ten pointers in the new Apache Spark release(version 1.6)</title>
		<link>http://bigdata-madesimple.com/top-ten-pointers-in-the-new-apache-spark-release-version-1-6/</link>
		<comments>http://bigdata-madesimple.com/top-ten-pointers-in-the-new-apache-spark-release-version-1-6/#comments</comments>
		<pubDate>Mon, 11 Jan 2016 09:23:08 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=16918</guid>
		<description><![CDATA[<p>In 2016, we should be excited that Apache Spark community launched Apache Spark 1.6. Committers – There are...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/top-ten-pointers-in-the-new-apache-spark-release-version-1-6/">Top ten pointers in the new Apache Spark release(version 1.6)</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>In 2016, we should be excited that the Apache Spark community has launched Apache Spark 1.6.</p>
<p><strong>Committers</strong> – There are now around 1,000 contributors to Apache Spark, double the earlier number.</p>
<p><strong>Patches</strong> – The Apache Spark 1.6 release includes about 1,000 patches.</p>
<p><strong>Run SQL query on files</strong> – This feature helps users and applications run SQL queries on files directly, without creating a table, and is similar to a feature available in Apache Drill. For example: select id from json.`path/to/json/files` as j.</p>
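<p>Written out as a standalone statement, the example above looks like this (the path is a placeholder):</p>

```sql
-- Spark 1.6: query JSON files directly, without registering a table first.
-- The data source name (json) is followed by a backtick-quoted path.
SELECT id FROM json.`path/to/json/files` AS j;
```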
<p><strong>Star (*) expansion for StructTypes</strong> – This feature makes it easier to nest and unnest arbitrary numbers of columns. It is pretty common for customers to do regular extractions of updated data from an external data source (e.g. MySQL or Postgres). With some small improvements to the analyzer, the new release lets users find the most recent record for each key in a group-by query and then use star expansion to unnest the resulting struct, both in SQL and in the DataFrame equivalents.</p>
<p><strong>Parquet Performance</strong> – Parquet has been one of the most commonly used data formats with Apache Spark, and Parquet scan performance has a significant impact on many large applications. Before this version, Spark depended on parquet-mr to read and decode Parquet files, and much of the time was often spent in record assembly, the process that reconstructs records from Parquet columns. Spark 1.6 introduces a new Parquet reader that bypasses parquet-mr’s record assembly and uses a more optimized code path for flat schemas. Benchmarks show roughly a 50% improvement.</p>
<p><strong>Automatic Memory Management</strong> – In versions of Apache Spark older than 1.6, the available memory was simply split into two regions: execution memory, used for sorting, hashing, and shuffling, and cache memory, used to cache recent data. Spark 1.6 introduces a new memory manager that automatically tunes the size of the different memory regions; the runtime grows and shrinks regions according to the needs of the executing application. Many applications will therefore benefit for operators like joins and aggregations, without any user optimization or tuning.</p>
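<p>The new unified memory manager is controlled by a couple of settings; as a sketch, the Spark 1.6 defaults look like this in spark-defaults.conf, and the old static split can be restored with spark.memory.useLegacyMode:</p>

```properties
# conf/spark-defaults.conf (Spark 1.6 defaults shown)
# Share of the heap used for both execution and storage:
spark.memory.fraction        0.75
# Portion of the above that is protected from eviction by execution:
spark.memory.storageFraction 0.5
# Setting this to true restores the pre-1.6 static regions:
spark.memory.useLegacyMode   false
```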
<p><strong>Streaming State Management</strong> – State management is a vital function in streaming applications in Spark, often used to maintain aggregations or session information. Apache Spark 1.6 introduces a new mapWithState API that scales linearly with the number of updates rather than with the total number of records. mapWithState has an efficient implementation based on deltas, rather than always requiring full scans over the data, which brings great performance improvements.</p>
<p><strong>Spark Datasets</strong> – A shortcoming of Apache Spark versions before 1.6 was the lack of support for compile-time type safety. To solve this problem, in Spark 1.6 the team introduced a typed extension of the DataFrame API called Datasets. The Dataset API extends the DataFrame API to support static typing and user functions that run directly on existing Scala or Java types. Compared with the traditional RDD API, Datasets provide better memory management as well as better long-term performance.</p>
<p><strong>Machine Learning Pipeline Persistence</strong> – In versions of Apache Spark before 1.6, many machine learning applications leveraged Spark’s ML pipeline feature to construct learning pipelines, but had to implement custom persistence code to store the pipeline externally. In Spark 1.6, the pipeline API offers functionality to save and reload pipelines from a previous state and to apply previously built models to new data later.</p>
<p><strong>Addition of New Algorithms</strong> – The Apache Spark 1.6 release increases algorithm coverage in machine learning, including univariate and bivariate statistics, survival analysis, the normal equation for least squares, bisecting K-means clustering, online hypothesis testing, latent Dirichlet allocation (LDA), R-like statistics, feature interactions in R formulas, instance weights, univariate and bivariate statistics in DataFrames, a LIBSVM data source, and non-standard JSON data.</p>
<p>Reference – databricks.com, issues.apache.org, Big Data Analytics Community.</p>
<p>Originally appeared on <a href="http://dataottam.com/2016/01/06/top-10-pointers-in-new-apache-spark-1-6-release/#prettyPhoto" target="_blank">DataOttam</a>.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/top-ten-pointers-in-the-new-apache-spark-release-version-1-6/">Top ten pointers in the new Apache Spark release(version 1.6)</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/top-ten-pointers-in-the-new-apache-spark-release-version-1-6/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What is the role of RDDs in apache Spark? – Part 1</title>
		<link>http://bigdata-madesimple.com/what-is-the-role-of-rdds-in-apache-spark-part-1/</link>
		<comments>http://bigdata-madesimple.com/what-is-the-role-of-rdds-in-apache-spark-part-1/#comments</comments>
		<pubDate>Tue, 05 Jan 2016 11:33:11 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=16835</guid>
		<description><![CDATA[<p>This blog introduces Spark’s core abstraction for working with data, the RDD (Resilient Distributed Dataset). An RDD is...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/what-is-the-role-of-rdds-in-apache-spark-part-1/">What is the role of RDDs in apache Spark? – Part 1</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>This blog introduces Spark’s core abstraction for working with data, the RDD (Resilient Distributed Dataset). An RDD is simply a distributed collection of elements or objects (Java, Scala, or Python objects, including user-defined classes) spread across the Spark cluster. In Spark, all work is expressed in one of three ways:</p>
<ul>
<li>Creating new RDDs</li>
<li>Transforming existing RDDs</li>
<li>Calling operations on RDDs to compute a result</li>
</ul>
<p><b>RDD Foundations:</b></p>
<p>An RDD in Spark is simply an immutable distributed collection of objects, split into multiple partitions. We create RDDs in two ways:</p>
<ul>
<li>By loading an external dataset</li>
<li>By distributing a collection of objects in the driver program</li>
</ul>
<p>Once created, an RDD supports two types of operations:</p>
<ul>
<li>Transformations</li>
<li>Actions</li>
</ul>
<p>Transformations construct a new RDD from a previous one (e.g. filter, map, groupBy), while actions compute a result based on an RDD and either return it to the driver program or save it to an external storage system such as HDFS, S3, Cassandra, or HBase (e.g. first, count, collect, save).</p>
<p>Transformations and actions differ because of the way Spark computes RDDs: although we can define new RDDs at any time, Spark computes them lazily, that is, only when they are first used in an action.</p>
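<p>This lazy behaviour can be illustrated in plain Python (no Spark required) using the built-in lazy <code>map</code>: like an RDD transformation, it builds a recipe without doing any work until a terminal operation, the analogue of an action, consumes it. The names in comments are analogies, not Spark APIs.</p>

```python
log = []

def double(x):
    log.append(x)               # record when work actually happens
    return x * 2

# "Transformation": lazy, analogous to rdd.map(double) -- nothing runs yet.
mapped = map(double, [1, 2, 3])
assert log == []                # no element has been processed so far

# "Action": analogous to collect(); this is what forces evaluation.
result = list(mapped)
assert result == [2, 4, 6]
assert log == [1, 2, 3]         # the work happened only at the "action"
```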
<p>Finally, RDDs are by default recomputed each time we run an action on them; if we want to reuse an RDD multiple times, we can ask Spark to keep it around using RDD.persist().</p>
<p><img class="alignnone size-full wp-image-16836" alt="RDD persist" src="http://bigdata-madesimple.com/wp-content/uploads/2016/01/RDD-presist.png" width="516" height="195" /></p>
<p>The image above lists the storage levels available for persisting an RDD in Spark; to replicate the data on two machines, add _2 to the end of the storage level. In production we often use persist() to load a frequently queried subset of the data into memory. cache() is the same as calling persist() with the default storage level.</p>
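<p>The effect of persisting can be mimicked in plain Python with a simple memoized pipeline (again an analogy, not the Spark API): the first "action" computes and caches the result, and later actions reuse the cache instead of recomputing.</p>

```python
compute_count = 0

def expensive_pipeline():
    # Stands in for a chain of transformations over a large dataset.
    global compute_count
    compute_count += 1
    return [x * 2 for x in range(5)]

cache = None

def persisted():
    # Like RDD.persist(): the first "action" computes and caches,
    # later actions are served from the cache instead of recomputing.
    global cache
    if cache is None:
        cache = expensive_pipeline()
    return cache

first = len(persisted())        # action 1: triggers the computation
total = sum(persisted())        # action 2: served from the cache
assert compute_count == 1       # the pipeline ran exactly once
```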
<p>To summarize, every Spark program works as follows:</p>
<ul>
<li>Create some input RDDs from external data</li>
<li>Transform RDDs to define new RDDs using transformations like filter()</li>
<li>Use persist() to persist intermediate RDDs that will be reused</li>
<li>Launch actions such as count(), first() to kick start the parallel computation</li>
</ul>
<p>To conclude, RDDs are immutable, partitioned collections of objects spread across a cluster, stored in RAM or on disk, built through lazy parallel transformations, and automatically rebuilt on failure. In Part 2 we will share the internal details of transformations and actions on RDDs, and the benefits of lazy evaluation.</p>
<p>Reference – Big Data Analytics Community, Learning Spark: Karau, Konwinski, Wendell, Zaharia.</p>
<p>Originally appeared on <a href="http://dataottam.com/2015/12/22/what-is-the-role-of-rdds-in-apache-spark-part-1/" target="_blank">dataottam</a>.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/what-is-the-role-of-rdds-in-apache-spark-part-1/">What is the role of RDDs in apache Spark? – Part 1</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/what-is-the-role-of-rdds-in-apache-spark-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Is Apache Hadoop the only option to implement big data?</title>
		<link>http://bigdata-madesimple.com/is-apache-hadoop-the-only-option-to-implement-big-data/</link>
		<comments>http://bigdata-madesimple.com/is-apache-hadoop-the-only-option-to-implement-big-data/#comments</comments>
		<pubDate>Thu, 24 Dec 2015 04:47:48 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=16756</guid>
		<description><![CDATA[<p>Yes, Hadoop is not only the options to big data problem. Hadoop is one of the solutions. The...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/is-apache-hadoop-the-only-option-to-implement-big-data/">Is Apache Hadoop the only option to implement big data?</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>No, Hadoop is not the only option for big data problems; it is just one of the solutions.</p>
<p>The HPCC (High Performance Computing Cluster) Systems technology is an open-source, data-intensive processing and delivery platform developed by LexisNexis Risk Solutions. HPCC Systems incorporates a big data software architecture implemented on commodity shared-nothing computing clusters to provide high-performance, data-parallel processing and delivery for applications utilizing big data.</p>
<p>The HPCC Systems platform includes system configurations to support both parallel batch data processing (Thor) and high-performance data delivery applications using indexed data files (Roxie). It includes Enterprise Control Language (ECL), a parallel, data-centric, declarative programming language.</p>
<p><img class="alignnone size-full wp-image-16757" alt="HPCC" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/HPCC.png" width="624" height="298" /></p>
<p>The HPCC Systems components in detail:</p>
<p>Thor – Data Refinery Cluster designed to execute big data workflows including extraction, loading, cleansing, transformations, linking and indexing.</p>
<p>Roxie – The Rapid Data Delivery Cluster provides separate high-performance online query delivery for big data. Roxie utilizes highly optimized distributed B-tree indexed data structures and is designed for highly concurrent use. A typical 10-node cluster can process thousands of concurrent requests and deliver results in fractions of a second.</p>
<p>ECL – Enterprise Control Language is a declarative, data-centric, distributed processing language for Big Data. It is a collaborative, extensible, high-level language that allows the programmer to describe the desired outcome instead of coding tedious, error-prone procedural steps.</p>
<p><img class="alignnone size-full wp-image-16758" alt="ECL" src="http://bigdata-madesimple.com/wp-content/uploads/2015/12/ECL.png" width="323" height="189" /></p>
<ul>
<li><b>Declarative:</b> describes the what, not the how.</li>
<li><b>Focused:</b> higher-level code means fewer programmers and shorter time to delivery.</li>
<li><b>Extensible:</b> as new attributes are defined, they become primitives that other programmers can use.</li>
<li><b>Implicitly parallel:</b> parallelism is built into the underlying platform; the programmer does not need to manage it.</li>
<li><b>Maintainable:</b> designed for long-term, large-scale, enterprise use.</li>
<li><b>Complete:</b> provides a complete programming paradigm.</li>
<li><b>Homogeneous:</b> one language to express data algorithms across the entire HPCC Systems platform, including data ETL and high-speed delivery.</li>
</ul>
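<p>As a loose analogy in Python (this is not ECL itself), the declarative style can be contrasted with imperative code: the comprehension states what result is wanted, while the loop spells out how to build it.</p>

```python
data = [3, 1, 4, 1, 5, 9]

# Imperative: spell out how to build the result, step by step.
evens_imperative = []
for x in data:
    if x % 2 == 0:
        evens_imperative.append(x)

# Declarative: describe the desired outcome; the "how" is implicit,
# which is what leaves room for a platform to parallelize the work.
evens_declarative = [x for x in data if x % 2 == 0]

assert evens_imperative == evens_declarative == [4]
```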
<p>IDE – The Integrated Development Environment called the ECL IDE turns code into graphs that facilitate the understanding and processing of large-scale, complex data analytics.</p>
<p>ESP – Enterprise Services Platform provides an easy to use interface to access ECL queries using XML, HTTP, SOAP (Simple Object Access Protocol) and REST (Representational State Transfer).</p>
<p>Data Graphs – Solving complex data challenges requires a series of advanced functions. With the HPCC Systems technology, complex data challenges can be represented naturally as a transformative data graph. The nodes of the data graph are processed in parallel as distinct data flows. Each section of the graph includes information such as function, records processed, or skew, and each node can be drilled into for specific details.</p>
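<p>A minimal sketch of the data-graph idea in plain Python (the function names here are illustrative, not HPCC APIs): nodes are functions, edges are the data flowing between them, and each node records how many records it processed, mirroring the per-node detail shown in HPCC's graph view.</p>

```python
# Toy data-flow graph: each node applies a function to the records
# flowing through it and tracks its own "records processed" count.
stats = {}

def node(name, fn, records):
    stats[name] = len(records)          # per-node record count
    return [fn(r) for r in records]

raw = [1, 2, 3, 4]
cleaned = node("clean", lambda r: r * 10, raw)          # first stage
indexed = node("index", lambda r: (r, r // 10), cleaned)  # second stage

assert stats == {"clean": 4, "index": 4}
assert indexed[0] == (10, 1)
```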
<p>Originally appeared on <a title="data ottam" href="http://dataottam.com/2015/12/14/is-apache-hadoop-the-only-option-to-implement-big-data/" target="_blank">dataottam</a>.</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/is-apache-hadoop-the-only-option-to-implement-big-data/">Is Apache Hadoop the only option to implement big data?</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/is-apache-hadoop-the-only-option-to-implement-big-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The top 12 Apache Hadoop challenges</title>
		<link>http://bigdata-madesimple.com/the-top-12-apache-hadoop-challenges/</link>
		<comments>http://bigdata-madesimple.com/the-top-12-apache-hadoop-challenges/#comments</comments>
		<pubDate>Mon, 23 Nov 2015 10:23:22 +0000</pubDate>
		<dc:creator>Manu Jeevan</dc:creator>
				<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdata-madesimple.com/?p=16328</guid>
		<description><![CDATA[<p>Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its...</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/the-top-12-apache-hadoop-challenges/">The top 12 Apache Hadoop challenges</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></description>
				<content:encoded><![CDATA[<p>Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.</p>
<p>Hadoop has proved that it addresses the big data problems of volume, variety, velocity, and value, but we are still left with the top 12 Hadoop challenges:</p>
<ol>
<li>Hadoop is a complex distributed system with low-level APIs</li>
<li>Specialized skills are required for using Hadoop, preventing most developers from effectively building solutions</li>
<li>Business logic and infrastructure APIs have no clear separation, burdening app developers</li>
<li>Automated testing of end-to-end solutions is impractical or impossible</li>
<li>Hadoop is a diverse collection of many open source projects</li>
<li>Understanding multiple technologies and hand-coding the integration between them is burdensome</li>
<li>Significant effort is wasted on simple tasks like data ingestion and ETL</li>
<li>Moving from proof-of-concept to production is difficult and can take months or quarters</li>
<li>Hadoop is more than just offline storage and batch analytics</li>
<li>Different processing paradigms require data to be stored in specific ways</li>
<li>Real-time and batch ingestion requires deeply integrating several components</li>
<li>Common data patterns often require data consistency and correctness, which Hadoop does not support out of the box</li>
</ol>
<p>To address the above challenges, enterprises need to adopt a few commercial Hadoop software tools like <a href="http://cask.co/product/#features" target="_blank">Cask</a>, <a href="http://www.zaloni.com/product/bedrock" target="_blank">Bedrock</a>, <a href="http://www.zaloni.com/product/mica" target="_blank">Mica</a>, <a href="http://www.pentaho.com/product/data-integration" target="_blank">Pentaho</a>, <a href="http://www.talend.com/products/big-data" target="_blank">Talend</a>, <a href="http://www.htrunk.com/product" target="_blank">hTrunk</a>, and <a href="https://www.informatica.com/products/big-data/big-data-edition.html#fbid=Nn6JOoUU85w" target="_blank">Informatica Big Data Management</a> to benefit from Hadoop’s real power.</p>
<p><b>Cask</b> &#8211; The Cask Data Application Platform (CDAP) is an open source, integrated platform for developers and organizations to build, deploy, and manage big data applications on Hadoop.</p>
<p><b>Bedrock</b> &#8211; To realize value from an enterprise data lake and the powerful but ever-changing ecosystem of Hadoop, you need enterprise-grade data management. Zaloni’s Bedrock is the industry’s only fully integrated Hadoop data management platform. By simplifying and automating common data management tasks, you can focus your time and resources on building the insights and analytics that drive your business. Bedrock makes it easy.</p>
<p><b>Mica </b>- Historically data transformation has been an IT function where business analysts provide their requirements and IT builds and executes the transformation. Today enterprises want to modernize their Big Data architecture and shorten data preparation time so that data scientists and business analysts can be more productive. Mica provides the on-ramp for self-service data discovery, curation, and governance. You can evolve your capability to empower practitioners &#8211; from line of business end-users to highly skilled data scientists.</p>
<p><b>hTrunk</b> &#8211; The product is built from the ground up to streamline Hadoop application development without having to write or maintain complicated Apache Hadoop code, meeting enterprise needs by tackling the challenges of big data application development. hTrunk provides a suite of components to deliver lower-cost, higher-capacity infrastructure.</p>
<p><b>Pentaho</b> &#8211; A Comprehensive Data Integration and Business Analytics Platform. Within a single platform, our solution provides big data analytics tools to extract prepare and blend your data, plus the visualizations and analytics that will change the way you run your business. Regardless of data source, analytic requirement or deployment environment, Pentaho allows you to turn big data into big insights.</p>
<p><b>Talend </b>- Talend simplifies the integration of big data so you can respond to business demands without having to write or maintain complicated Big Data code. Enable existing developers to start working with Apache Hadoop, Apache Spark, Spark Streaming and NoSQL databases today, in one platform. Use simple, graphical tools and wizards to generate native code that leverages the full power of big data and accelerates your path to informed decisions.</p>
<p><b>Informatica Big Data Management </b>- Ingest, process, clean, govern, and secure big data to repeatably deliver trusted information for big data and analytics. And get access to an extensive library of prebuilt transformation capabilities on Hadoop using a visual development environment.</p>
<p>As always, feel free to add to the list of software that can help enterprises realize the power of Hadoop.</p>
<p><b>Reference:</b></p>
<p>CASK</p>
<p>Analytics &amp; Big Data Open Source Community</p>
<p>The post <a rel="nofollow" href="http://bigdata-madesimple.com/the-top-12-apache-hadoop-challenges/">The top 12 Apache Hadoop challenges</a> appeared first on <a rel="nofollow" href="http://bigdata-madesimple.com">Big Data Made Simple - One source. Many perspectives.</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdata-madesimple.com/the-top-12-apache-hadoop-challenges/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
