Wednesday, 25 November 2015

Hive Basic Understanding

Hive is a petabyte-scale data warehouse system built on Hadoop.
It is a Hadoop-based system for querying and managing structured data.
It is used to query Big Data in a SQL-like fashion.

For execution, Hive uses - MapReduce
For storage, Hive uses   - HDFS
For metadata, Hive uses  - RDBMS



Origin of Hive -
Hive was designed by Facebook for querying petabytes of data. A sudden data explosion at Facebook made it impractical to store and query that data in a traditional DBMS.

Hive made it extremely easy for users to query data stored on HDFS.
Hive now behaves much like a parallel DBMS that uses Hadoop for its storage and execution architecture.


Why Hive -
Hive is another data warehouse system, designed because existing data warehouse systems do not meet all requirements in a scalable, agile and cost-efficient way.

The programming model used in Hadoop is MapReduce. It is very difficult to write a MapReduce program for every report, small or big. It also requires highly skilled people to write such complex code.
With Hive, one can simply issue a query much as we do in SQL, and Hive generates the MapReduce code for the user based on the query issued.
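
To give a feel for the work Hive takes off the user's hands, here is a minimal Python sketch of how a query like SELECT dept, SUM(salary) FROM employees GROUP BY dept could be broken into map, shuffle and reduce steps. This is not the code Hive actually generates; the table name, column names and sample rows are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept, salary).
ROWS = [
    ("asha", "sales", 100),
    ("ben", "sales", 150),
    ("chen", "ops", 200),
]


def map_phase(rows):
    """Emit one (key, value) pair per row: key = GROUP BY column, value = salary."""
    for _name, dept, salary in rows:
        yield dept, salary


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply the aggregate function (SUM) to each group of values."""
    return {key: sum(values) for key, values in groups.items()}


def sum_salary_by_dept(rows):
    """Run the three phases in order, like one simplified MapReduce job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


print(sum_salary_by_dept(ROWS))
```

Even this toy version needs three functions for a one-line query, which is the gap Hive closes.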


Advantages of Hive -
Hive can work with very large data (hundreds of terabytes).
Hive can work on large Hadoop clusters (hundreds of nodes).
Data stored in Hive has a defined schema.
Hive is also used for batch jobs (load and query).


Where not to use Hive -
If you need responses in seconds.
If you don't want to impose a schema.
If a traditional DBMS can already do the job.
If your data is measured in gigabytes or less.
If you don't have enough time and highly skilled resources.



Hive Entities -
Database, Table, Partitions, Bucketing Columns.
MORE....
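
As a rough illustration of how partitions and buckets organise a table's data: each partition is kept in its own directory named column=value, and a row goes to a bucket chosen by hashing the bucketing column modulo the number of buckets. The Python sketch below makes several assumptions: the warehouse path, the table and column names are examples only, and zlib.crc32 is a stand-in for Hive's own hash function.

```python
import zlib

# Assumed warehouse location; in a real setup this is configurable.
WAREHOUSE_DIR = "/user/hive/warehouse"


def partition_path(table, partition_column, partition_value):
    """Directory that holds the rows of one partition of a table."""
    return f"{WAREHOUSE_DIR}/{table}/{partition_column}={partition_value}"


def bucket_for(value, num_buckets):
    """Pick a bucket by hashing the bucketing column value.

    crc32 is only a stand-in here; Hive uses its own hash function.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical table "page_views", partitioned by date, bucketed by user id.
print(partition_path("page_views", "dt", "2015-11-25"))
print(bucket_for("user42", 4))
```

A query that filters on the partition column only needs to read the matching directory, which is why partitioning matters for large tables.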



Hive Data Types -

Primitive Data Types
TINYINT          1 Byte Signed Integer
SMALLINT         2 Byte Signed Integer
INT              4 Byte Signed Integer
BIGINT           8 Byte Signed Integer
BOOLEAN          True or False (Boolean)
FLOAT            Single precision floating point
DOUBLE           Double precision floating point
STRING           Sequence of characters (within single or double quotes)
TIMESTAMP        java.sql.Timestamp format
etc...
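
The byte sizes listed above determine the range of values each integer type can hold. A small Python sketch of that arithmetic, assuming the usual two's complement representation for signed integers:

```python
# Byte widths of the integer types as listed above.
INTEGER_TYPES = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def signed_range(num_bytes):
    """Smallest and largest value of a signed integer with the given width."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for type_name, width in INTEGER_TYPES.items():
    low, high = signed_range(width)
    print(f"{type_name:<10}{low} to {high}")
```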


Collection Data Types
STRUCT           Similar to a struct in C.
MAP              Collection of (key, value) pairs
ARRAY            Ordered sequence of elements of the same data type.
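
For readers who think in a programming language, the collection types map roughly onto familiar structures. The Python sketch below is only an analogy; the record layout and field names are invented for the example.

```python
from typing import Dict, List, NamedTuple


class Address(NamedTuple):
    """Plays the role of STRUCT<street:STRING, city:STRING>."""
    street: str
    city: str


class Employee(NamedTuple):
    """One row of a hypothetical table that uses all three collection types."""
    name: str
    address: Address        # STRUCT: named fields, accessed as address.city
    phones: Dict[str, str]  # MAP: looked up by key, e.g. phones['home']
    skills: List[str]       # ARRAY: accessed by position, e.g. skills[0]


EXAMPLE = Employee(
    name="asha",
    address=Address(street="1 Main Road", city="Chennai"),
    phones={"home": "555-0100", "work": "555-0199"},
    skills=["hive", "pig"],
)

print(EXAMPLE.address.city, EXAMPLE.phones["home"], EXAMPLE.skills[0])
```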



Hive operations -

DDL operations
[CREATE/ALTER/DROP] [TABLE/VIEW/PARTITION]
CREATE TABLE AS SELECT

DML operations
INSERT OVERWRITE

Queries...
Sub-Queries within "FROM" clause.
Joins  [Inner join & Outer (Left, Right & Full outer join)]
Multi-Table insert
Sampling

Interfaces
JDBC/ODBC/THRIFT

Wednesday, 21 October 2015

6 Reasons Why Java Developers Should Learn Hadoop

Imagine there are two girls standing in front of you - The first girl is cute, beautiful, interesting and has the smile that any guy would die for. And the other girl is average-looking, quiet, not-so-impressive... no different from the ones that you usually see in the restaurant cash counter. Which girl will you call out for a date? If you're like me, you will choose the attractive girl. You see, life is full of options and making the right choice is what matters the most.

If you're a Java developer, then you probably have more choices to make - like the switch from Java to Hadoop.
Big data and Hadoop are the two most popular buzzwords in the industry. Chances are that you have come across these two terms on the Java payscale forums or seen your senior colleagues making the switch to get bigger paychecks. I'll tell you what, the upgrade from Java to Hadoop is not just about staying updated with the latest technology or getting appraisals - it's about being competent and putting your career in fifth gear.

The good news for all aspiring Hadoop developers is that the Big Data industry has already crossed the $50 billion mark, and over 64% of the top 720 companies worldwide are interested in investing in this forward-thinking technology, as revealed by Gartner in 2013.

If that's not convincing, then take a look at these stats:
1. According to an IDC report, the Big Data industry is growing at the rate of 31.7% per year.
2. Java developers are seen as the best replacement option for Hadoop developers, says Forrester.
3. Hadoop developers enjoy a mighty 250% pay hike over Java developers, as stated in an Analytics Industry Report.

What's special about Hadoop?
Unlike traditional databases, which weren't capable of dealing with large volumes of data, Hadoop offers the quickest, cheapest, and smartest way to store and process giant volumes of data - and that's the reason why it is so popular among big corporations, government organizations, hospitals, universities, financial services, marketing agencies, etc. The best way to familiarize yourself with the framework is to check out a beginner's big data Hadoop course.

Okay, now let's look at some reasons why Java developers should switch to Hadoop.

1. Easy To Learn For Java Developers
A tennis player like Rafael Nadal loves clay courts because the surface suits him well and that's where he has been most successful. Similarly, any Java developer would love Hadoop because it's written almost entirely in Java - a language that you are already so familiar with. Switching from Java to Hadoop is a cake-walk for professionals like you because MapReduce programs for Hadoop are typically written in Java itself. Awesome, isn't it?

Your Java skills will come in handy when debugging Hadoop apps and writing Pig Latin scripts (Pig is a programming tool for Hadoop).

2. Helps You To Stay Ahead Of Your Competition
If you are a Java professional, you are just seen as a person in the crowd. But if you are a Hadoop developer, you are seen as a potential leader in the crowd. Big Data and Hadoop jobs are a hot deal in the market, and Java professionals with the required skill set are easily picked by big companies for high salary packages. All you have to do is attend a big data Hadoop training program and learn the concepts from an expert.

3. Scope To Move Into Bigger Domains
Fortunately for you, the road doesn't end with Hadoop and MapReduce. There is always the golden opportunity to use your Hadoop skills and expertise to move into higher levels such as Artificial Intelligence, Data Science, Sensor Web data, and Machine Learning. These are emerging markets, and you'll see them dominate the industry in the next 4-5 years. Good knowledge in Big Data and Hadoop could boost your chances of getting into some of the bigger Big Data-dependent companies such as Amazon, Yahoo, Facebook, Twitter, IBM, and eBay.

4. Lucrative Packages For Hadoop Professionals
By switching from Java to Hadoop, you can expect a higher salary and better career prospects - the kind of salary and designation that your wife would like to rave about. According to Indeed, the average salary for a Big Data Hadoop developer with 1-2 years of experience is around $140,000 per annum in the United States. However, as you gain experience and become a senior Hadoop developer, you will be able to make a good $400,000+ salary.

5. An Improved Quality Of Work
Learning Big Data Hadoop can be highly beneficial because it will help you to deal with bigger, more complex projects much more easily and deliver better output than your colleagues. In order to be considered for appraisals, you need to be someone who can make a difference in the team, and that's what Hadoop lets you be.

6. Grow With The Industry
With IDC predicting that the Big Data and Hadoop user base (big companies and government organizations) is likely to increase at 27% per year, you have a great opportunity to upgrade your knowledge and skills and grow with the industry.

Big Data and Hadoop are widely used in applications such as IT log analytics, fraud detection, social media analysis, and call centre analytics - and working through a big data Hadoop tutorial could be the way to kick-start your Hadoop career right away. Once you do that, you will find that staying updated with the latest technology will be a lot easier and getting into top organizations will never be 'just a dream' - it will be a reality.

That's about it, folks! These are some rock-solid reasons why learning Hadoop is important and how it can help take your career to the next level.

Thursday, 17 September 2015

Data Lake Showdown: Object Store or HDFS?

The explosion of data is causing people to rethink their long-term storage strategies. Most agree that distributed systems, one way or another, will be involved. But when it comes down to picking the distributed system–be it a file-based system like HDFS or an object-based file store such as Amazon S3–the agreement ends and the debate begins.
The Hadoop Distributed File System (HDFS) has emerged as a top contender for building a data lake. The scalability, reliability, and cost-effectiveness of Hadoop make it a good place to land data before you know exactly what value it holds. Combine that with the ecosystem growing around Hadoop and the rich tapestry of analytic tools that are available, and it’s not hard to see why many organizations are looking at Hadoop as a long-term answer for their big data storage and processing needs.
At the other end of the spectrum are today’s modern object storage systems, which can also scale out on commodity hardware and deliver storage costs measured in the cents-per-gigabyte range. Many large Web-scale companies, including Amazon, Google, and Facebook, use object stores to give them certain advantages when it comes to efficiently storing petabytes of unstructured data measuring in the trillions of objects.
But where do you use HDFS and where do you use object stores? In what situations will one approach be better than the other? We’ll try to break this down for you a little and show the benefits touted by both.
Why You Should Use Object-Based Storage
According to the folks at Storiant, a provider of object-based storage software, object stores are gaining ground among large companies in highly regulated industries that need greater assurances that no data will be lost.
“They’re looking at Hadoop to analyze the data, but they’re not looking at it as a way to store it long term,” says John Hogan, Storiant’s vice president of engineering and product management. “Hadoop is designed to pour through a large data set that you’ve spread out across a lot of compute. But it doesn’t have the reliability, compliance, and power attributes that make it appropriate to store it in the data lake for the long term.”
Object-based storage systems such as Storiant’s offer superior long-term data storage reliability compared to Hadoop for several reasons, Hogan says. For starters, they use a type of algorithm called erasure encoding that spreads the data out across any number of commodity disks. Object stores like Storiant’s also build spare drives into their architectures to handle unexpected drive failures, and rely on the erasure encoding to automatically rebuild the data volumes upon failure.
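
The rebuild-on-failure idea behind erasure encoding can be sketched with the simplest possible scheme, a single XOR parity chunk, which can recover any one lost chunk. This Python sketch is only an illustration of the principle: it is not Storiant's implementation, and production object stores typically rely on stronger codes that survive several simultaneous failures. The chunk contents are made up.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(chunks):
    """Parity chunk = XOR of all data chunks (all must have the same length)."""
    return reduce(xor_bytes, chunks)


def rebuild_missing(surviving_chunks, parity):
    """Recover the single lost chunk by XOR-ing the parity with the survivors."""
    return reduce(xor_bytes, surviving_chunks, parity)


# Three data chunks, each imagined to live on a different disk.
DATA_CHUNKS = [b"ABCD", b"EFGH", b"IJKL"]
PARITY = make_parity(DATA_CHUNKS)

# Simulate losing the disk that held the second chunk.
recovered = rebuild_missing([DATA_CHUNKS[0], DATA_CHUNKS[2]], PARITY)
print(recovered)
```

With three data chunks and one parity chunk the storage overhead is one third, compared with the two extra full copies that triple replication requires.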
If you use Hadoop’s default setting, everything is stored three times. That delivers five 9s of reliability, which used to be the gold standard for enterprise computing. Hortonworks architect Arun Murthy, who helped develop Hadoop while at Yahoo, pointed out at the recent Hadoop Summit that if you store everything only twice in HDFS, it takes one 9 off the reliability, giving you four 9s. That certainly sounds good.

Thursday, 6 August 2015

Step by Step learning guide for Hadoop

Since I was frustrated and disappointed with the training I attended at the Bigdata training academy in Chennai, I decided to publish the best sites and reference materials for Hadoop that I come across.

At least this way I can be of some help to the "to be" Hadoop aspirants and professionals, so that they won't waste their money in cheap institutes like the Bigdata training.
I aim for my blog to be a one-stop shop for learning Bigdata Apache Hadoop, Pig and HBase.
Also, as and when time permits, I will create tutorials for Hadoop, Pig and HBase and publish them,
first at the beginner level and then at the advanced level.

The best site for all beginners of Apache Bigdata Hadoop and its ecosystem is cloudera.com. Go and visit this URL http://university.cloudera.com/onlineresources.html

Thursday, 9 July 2015

Opportunities in Data Management With Hadoop

Every day, every minute, millions of pictures, videos and other forms of data are being dumped on to the internet via websites like Facebook, YouTube etc. Ever wondered where this data is being stored to be used effectively year after year? The growing number of data sources like social media is challenging the big data technologies. With Hadoop being the latest sensation, giants like Google, Facebook and Yahoo have decided to choose it for their data management predicaments.
Any enterprise wishing to leverage its data and analytics is advised to install the Hadoop framework, open source software that allows processing of large data over clusters of computers.
History of Hadoop
Hadoop was created back in 2005 by computer scientists Doug Cutting and Mike Cafarella. Hadoop was named by Doug after his son's stuffed toy elephant and is now managed by the Apache Software Foundation. In 2006 Doug joined Yahoo!, which dedicated a team to developing Hadoop. By 2008, Hadoop was being used by other companies besides Yahoo!, like Facebook, the New York Times and Last.fm.
The Hadoop architecture is made up of Hadoop Common, the Hadoop Distributed File System (HDFS) and a MapReduce engine. MapReduce and HDFS are designed to handle node failures. The architecture distributes data in chunks across many servers so that programmers can analyze and visualize it easily.
Demand for Hadoop
The market for Hadoop is projected to rise from $1.5 billion in 2012 to an estimated $16.1 billion by 2020, according to a report by Allied Market Research. The profits are predicted to be made by commercial Hadoop companies like Amazon Web Services, Cloudera, Hortonworks etc.
The reason for the success of this platform is its low-cost implementation, which helps companies adopt the technology more conveniently. It is also adept at automatically handling node failures and data replication, and does all the hard work.
It is clear that the data management industry has expanded from software and web into retail, hospitals, government etc. This creates a huge demand for scalable and cost-effective data storage platforms like Hadoop. Hence it comes as no surprise that a skill in Hadoop is most desired as of now. The future for data storage is endless, as it is highly unlikely that companies will stop storing their data or find an alternative way to do so anytime soon.
Training in Hadoop basics is sure to go a long way and will pay off in the long run, as companies are willing to offer competitive salaries to candidates with the desired skill sets. Banking on this demand will definitely prove beneficial.

Tuesday, 30 June 2015

Big Data-Hadoop and Its Impact on Business Intelligence Systems

Recently my work required me to look into the new features added in Informatica 9.1, but I never thought the journey would take me to explore further and write a blog. Let's see how I traversed through different new aspects that are becoming closely related to data management and Business Intelligence. First we will look at what Bigdata is and where it stands now.

People often wonder how organizations like Yahoo, Google and Facebook store such large amounts of user data. We should note that Facebook stores more photos than Google's Picasa. Any guesses??

What is Hadoop
The answer is BigData Hadoop, a way to store large amounts of data measured in petabytes and even zettabytes. Its storage system is called the Hadoop Distributed File System (HDFS). Hadoop was developed by Doug Cutting based on ideas described in Google's papers. Mostly we get large amounts of machine-generated data. For example, the Large Hadron Collider, built to study the origins of the universe, produces 15 petabytes of data every year for each experiment carried out.

MapReduce
The next thing that comes to mind is how quickly we can access these large amounts of data. Hadoop uses MapReduce, which first appeared in Google's research papers. It follows 'Divide and Conquer'. The data is organized as key-value pairs. The data, spread across a large number of systems, is processed in parallel chunks, with the job coordinated from a single node. The collected results are then sorted and processed further.
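
The divide, sort and combine flow described above can be sketched with the classic word-count example. This is a single-machine Python sketch, not Hadoop's API: the chunks here are processed one after another, whereas on a cluster each chunk would be handled by a different node. The function names and the way lines are split into chunks are choices made for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def split_into_chunks(lines, num_chunks):
    """Divide the input so that each chunk could go to a different node."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Map step: emit a (word, 1) key-value pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle_and_sort(mapped_chunks):
    """Gather the pairs from all chunks and sort them by key."""
    all_pairs = [pair for pairs in mapped_chunks for pair in pairs]
    return sorted(all_pairs, key=itemgetter(0))


def reduce_counts(sorted_pairs):
    """Reduce step: add up the values of each run of identical keys."""
    return {
        word: sum(count for _word, count in group)
        for word, group in groupby(sorted_pairs, key=itemgetter(0))
    }


def word_count(lines, num_chunks=2):
    chunks = split_into_chunks(lines, num_chunks)
    mapped = [map_chunk(chunk) for chunk in chunks]  # parallel on a real cluster
    return reduce_counts(shuffle_and_sort(mapped))


print(word_count(["big data", "big cluster"]))
```

Sorting by key before reducing is what lets the reduce step see all values for one key together.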

Running on standard PC servers, Hadoop connects all the servers and distributes the data files across these nodes. It uses all these nodes as one large file system to store and process the data, making it a fully distributed file system. Extra nodes can be added when the data reaches the installed capacity, making the setup highly scalable. It is very cheap as it is open source and doesn't require the special processors used in traditional servers. Hadoop is also often counted among the NoSQL implementations.
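
How a file ends up spread over the nodes can be sketched as follows: the file is cut into fixed-size blocks, and each block is copied to several nodes. The Python sketch below assumes a 128 MB block size and three replicas, and it places replicas in simple round-robin order. That placement rule is a simplification chosen for the example; real HDFS uses a rack-aware policy. The node names are invented.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size
REPLICATION = 3      # assumed number of copies per block


def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file (the last block may be partly full)."""
    return math.ceil(file_size_mb / block_size_mb)


def place_blocks(file_size_mb, nodes, replication=REPLICATION,
                 block_size_mb=BLOCK_SIZE_MB):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified placement for illustration only; not HDFS's real policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks(file_size_mb, block_size_mb)):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


# A hypothetical 300 MB file on a four-node cluster.
for block, holders in place_blocks(300, ["n1", "n2", "n3", "n4"]).items():
    print(block, holders)
```

Adding a node to the list raises capacity without changing the scheme, which is the scalability point made above.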

Hadoop in Real time
The Tennessee Valley Authority (TVA) uses smart-grid field devices to collect data on its power-transmission lines and facilities across the country. These sensors send in data at a rate of 30 times per second - at that rate, the TVA estimates it will have half a petabyte of data archived within a few years. TVA uses Hadoop to store and analyze the data. In India, Power Grid Corporation of India intends to install such smart devices in its grids to collect data and reduce transmission losses. It would do well to emulate TVA. Recently Facebook moved to a 30-petabyte Hadoop cluster, which sounds incredible; it is hard to digest the fact that we are using such a huge volume of data.

Data Warehouse and Business Intelligence [http://hexaware.com/business-intelligence-analytics.htm] Products supporting Hadoop and MapReduce
1) Greenplum
2) Informatica
3) Teradata
5) Pentaho
6) Talend
If Hadoop and other NoSQL implementations are widely used, limitations of traditional SQL systems, such as storing unstructured data, can be overcome. With the volume of data increasing exponentially, commercialization of Hadoop will happen on a large scale, and data integration tools will play a key role in mining data for business.

Readers, please share your experiences if any of you have worked with Hadoop alongside other ETL and BI tools available in the market.