Recently my work required me to look into the new features added in
Informatica 9.1, but I never thought the journey would lead me to explore
the topic further and write a blog post about it. Let's see how I worked
through the new aspects that are becoming closely related to data
management and Business Intelligence. First we will look at what Big Data
is and where it stands now.
People often wonder how organizations like Yahoo, Google and
Facebook store such large amounts of user data. Note
that Facebook stores more photos than Google's Picasa. Any guesses?
What is Hadoop
The answer is Hadoop, a Big Data platform that makes it possible to store
very large amounts of data, on the order of petabytes and beyond. Its
storage system is called the Hadoop Distributed File System (HDFS). Hadoop
was developed by Doug Cutting based on ideas described in Google's papers.
Much of this data is machine generated. For example, the Large Hadron
Collider, built to study the origins of the universe, produces roughly
15 petabytes of data every year across its experiments.
MapReduce
The next question that comes to mind is how quickly we can access
these large amounts of data. Hadoop uses MapReduce, which first appeared
in Google's research papers. It follows a 'divide and conquer' approach.
The data is organized as key-value pairs. A job submitted from a single
node is split into chunks that are processed in parallel (the map step) on
the many systems across which the data is spread. The intermediate results
are then sorted by key and combined (the reduce step) into the final output.
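The map, sort and reduce steps described above can be illustrated with a small word count. This is only a single-machine sketch in plain Python, not Hadoop's actual API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are made up for illustration. On a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) key-value pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle/sort: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into one result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently; this is the part that a
    # cluster would run in parallel on separate nodes.
    mapped = chain.from_iterable(map_phase(c) for c in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: three chunks of one larger data set.
chunks = ["big data big storage", "data in parallel", "big clusters"]
counts = word_count(chunks)
print(counts)
```

Because every chunk is mapped without looking at any other chunk, adding more machines lets more chunks be processed at the same time, which is the point of the 'divide and conquer' idea.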
Running on standard PC servers, Hadoop connects all of them and
distributes the data files across these nodes. It uses all these nodes
as one large file system to store and process the data, making it a fully
distributed file system. Extra nodes can be added when the data
reaches the installed capacity, making the setup highly
scalable. It is also very cheap, as it is open source and doesn't require
the specialized hardware used in traditional servers. Hadoop is also
often counted among the NoSQL implementations.
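To make the idea of spreading a file across nodes more concrete, here is a rough sketch in Python. It assumes a file is cut into fixed-size blocks and each block is copied to several nodes. The 64 MB block size, the replication factor of 3, the node names and the simple round-robin placement are all assumptions chosen for illustration; the real HDFS placement policy is more involved (it is rack-aware, for instance).

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the fixed-size blocks a file is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


MB = 1024 * 1024

# Hypothetical 200 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(file_size=200 * MB, block_size=64 * MB)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id] // MB, "MB ->", holders)
```

Adding capacity in this sketch is just a matter of appending another name to the nodes list, which mirrors how extra machines can be added to a cluster when it fills up.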
Hadoop in the Real World
The Tennessee Valley Authority (TVA) uses smart-grid field devices to
collect data on its power-transmission lines and facilities across the
country. These sensors send in data 30 times per second;
at that rate, the TVA estimates it will have half a petabyte of data
archived within a few years. TVA uses Hadoop to store and analyze this data.
In India, the Power Grid Corporation of India intends to install similar
smart devices in its grids to collect data and reduce transmission
losses. It would do well to follow TVA's example. Recently Facebook moved
to a 30-petabyte Hadoop cluster, which sounds incredible; it is hard to
digest the fact that we are using such a vast volume of data.
Data Warehouse and Business Intelligence
[http://hexaware.com/business-intelligence-analytics.htm] products
supporting Hadoop and MapReduce:
1) Greenplum
2) Informatica
3) Teradata
5) Pentaho
6) Talend
If Hadoop and other NoSQL implementations become widely used,
limitations of traditional SQL systems, such as the difficulty of storing
unstructured data, can be overcome. With the volume of data increasing
exponentially, Hadoop will be commercialized on a large scale, and data
integration tools will play a key role in mining data for business.
Readers, please share your experiences if you have worked with Hadoop alongside other ETL and BI tools available in the market.