Tuesday 16 December 2014

Overview of Hadoop Applications

Hadoop is an open source software framework that is generally used to process immense volumes of data in parallel across many servers. In recent years it has become one of the most viable options for enterprises that face a never-ending requirement to store and manage data. Web-based businesses such as Facebook, Amazon, eBay, and Yahoo have used high-end Hadoop applications to manage their large data sets, yet Hadoop training remains relevant to small organizations as well as big businesses.
Hadoop can process a huge chunk of data in far less time, which lets companies run analyses that previously were not possible within any reasonable timeframe. Another important advantage of Hadoop applications is cost effectiveness, which is hard to match with other technologies: one can avoid the high cost of software licenses and the periodic upgrade fees that come with most alternatives. Businesses that have to work with huge amounts of data are well advised to go for Hadoop applications, as the framework helps address the issues that arise at that scale.

Hadoop applications are made up of two main parts: one is HDFS, the Hadoop Distributed File System, while the other is Hadoop MapReduce, which processes data and schedules jobs by priority, a technique that originated with Google's search engine. Along with these two primary components there are nine other parts, which depend on the distribution one uses and the complementary tools chosen. The most common functions of Hadoop applications are the following. The first is the storage and analysis of data without loading it into a relational database management system. The second is the conversion of huge repositories of semi-structured and unstructured data, for example log files, into structured form. Such complicated data is hard to work with in SQL tools for tasks like graph analysis and data mining.

Hadoop applications are mostly used in web-related businesses, where one has to work with big log files and data from social networking sites. In the media and advertising world, enterprises use Hadoop to power ad offer analysis and to help understand online reviews. Before using any Hadoop tool, it is advisable to read through the Hadoop MapReduce tutorials available online.

Friday 28 November 2014

PMO using Big Data techniques on mygov.in to translate popular mood into government action

NEW DELHI: The Prime Minister's Office is using Big Data techniques to process ideas thrown up by citizens on its crowdsourcing platform mygov.in, place them in the context of the popular mood as reflected in trends on social media, and generate actionable reports for ministries and departments to consider and implement.
The Modi government has roped in global consulting firm PwC to assist in the data mining exercise, and now wants to elevate Mygov.in platform from a one-way flow of citizens' ideas to a dialogue where the government keeps them abreast of some of the actions that emerge from their brainstorming.

"There is a large professional data analytics team working behind the scenes to process and filter key points emerging from debates on mygov.in, gauge popular mood about particular issues from social media sites like Twitter and Facebook," said a senior official aware of the development, adding that these are collated into special reports about possible action points that are shared with the PMO and line ministries. Ministries are being asked to revert with an action taken report on these ideas and policy suggestions currently being generated on 19 different policy challenges such as expenditure reforms, job creation, energy conservation, skill development and government initiatives such as Clean India, Digital India and Clean Ganga.

With the PM inviting Indian communities in America and Australia to join the online platform, which he has termed a 'mass movement towards Surajya', the traffic handling capacity of mygov.in is being scaled up consistently, the official said.

PwC executive director Neel Ratan said that the firm is 'helping the government' process the citizen inputs coming through on Mygov.in.

"There is a science and art behind it. We have people constantly looking at all ideas coming up, filtering them and after a lot of analysis, correlating it to sentiments coming through on the rest of social media," he said, stressing this is throwing up interesting trends and action points, being relayed to ministries. "It's turning out to be fairly action-oriented. I think it is distinctly possible that 30-50 million people would be actively contributing to Mygov.in over the next year and a half, given its current pace of growth," Ratan said.

PwC's global leader in government and public services Jan Sturesson told ET the participative governance model being adopted through mygov.in could become a model for the developed world.

"The biggest issue for governments today is how to be relevant. If all citizens are treated with dignity and invited to collaborate, it can be easier for administrations to have a direct finger on the pulse of the nation rather than lose it in transmission through multiple layers of bureaucracy," he said, not ruling out the possibility of using the mygov.in for quick referendums on contemporary policy dilemmas in a couple of years.
"The problem in the West has been that the US, Australia and UK follow a public management philosophy that treats citizens as consumers. That's ridiculous, because a consumer pays the bill and complains, while a citizen engages differently and takes responsibility," said Sturesson.
Within the 19 broad citizen engagement themes on mygov.in, there are multiple discussion groups focused on specific subsectors and themes. When it was launched in July, the site enabled brainstorming among its registered users around seven policy challenges. Users are allowed to sign up for four discussion groups in areas of interest apart from a group dealing with issues in their immediate vicinity.

Article Source - http://articles.economictimes.indiatimes.com/2014-11-26/news/56490626_1_mygov-digital-india-modi-government

Saturday 22 November 2014

Learning Hadoop with Linux World India



The best Big Data Hadoop training in Jaipur is offered by LinuxWorld India. When it comes to technological training, we are the best in the business. The organization was established in 2005. Our training is delivered by expert professionals and teachers who impart tremendous knowledge. When it comes to Cisco certifications, we are the first preference. We bring together a blend of network training and solutions with the help of our authorized training partner. We believe in imparting knowledge to the fullest. With a thorough study of Hadoop, you can build your own supercomputer. We provide the best classroom training on Hadoop Version 2, and we are the first to bring this facility to India.

The training fee for this course is Rs. 25,500, and the course module will be delivered to you by us. After completing this course with us you will be a master of Hadoop and able to build a MapReduce framework. You will also be an expert in writing complex MapReduce programs. Hadoop is a tool built on Java, and it focuses on getting the most out of commodity hardware. By studying it, you can also create a data cluster that will help you program different processing models.

Thursday 13 November 2014

Big Data Training in Jaipur



Intelligence is a leading provider of advanced Hadoop training in Jaipur, especially known for Data Warehousing and Hadoop courses. Its core competencies include ABAP, DW, DI, HANA, SAP BO, AIX, Solaris, Red Hat certification, Linux, PL/SQL, Oracle SQL, Business Objects, DataStage, Informatica, Cognos, software testing, ISTQB certification, Amazon Web Services, and mobile application development for iOS and Android. They also provide the best online training for all of the above technologies. Their main objective is to provide production-based knowledge in every training module, and they have experienced trainers and a well-equipped lab for every technology.
Hadoop administrators are among the most in-demand and highly compensated technical roles in the world, and many of them now pursue the CCAH qualification. The course covers the core Hadoop administration topics:
1. Determining the infrastructure for your cluster and choosing the correct hardware.
2. The internals of HDFS, MapReduce and YARN.
3. Deployment, integration with the data center and proper cluster configuration.
4. Loading data into the cluster from an RDBMS with the help of Sqoop and from dynamically generated files with the help of Flume.
5. Configuring the Fair Scheduler to provide service-level agreements for multiple users of the cluster.
6. Solving Hadoop issues: tuning, diagnosing and troubleshooting problems.
7. Best practices for preparing and maintaining Apache Hadoop in production.
This training helps you land good jobs: you learn the complete technology stack and they conduct campus drives. The course is best for IT managers and system administrators who have basic Linux experience; prior knowledge of Apache Hadoop, however, is not required.

To learn more about the Big Data Hadoop Training in Jaipur, please click this.

Friday 31 October 2014

Remember FLURPS to design better big data analytic solutions

FLURPS is an acronym for six components of well-rounded big data designs: Functionality, Localizability, Usability, Reliability, Performance, Supportability. Here's the case for using this template.
I've been advocating for customer-centric design as long as I've been designing solutions for customers. I still do this, because I have to. It's remarkable to me that after decades of building high-tech solutions for customers, technologists still seem to build solutions in an IT vacuum and then get upset when customers don't find them very functional.
Gold plating is a term used to describe developers who infer customer requirements and subsequently build features that the end users never requested -- because the developers know better. Unfortunately, data scientists are carrying this tradition forward with analytic solutions. That said, I wouldn't categorically dismiss any requirement that doesn't come from a customer or end user. It may seem odd coming from such a strong advocate of customer-centric design, but there are aspects of a well-built solution that customers don't know or appreciate.
When designing a big data analytic solution, make sure it includes a well-rounded set of requirements, including ones that the end user won't directly know about or care about.

What's FLURPS?

FLURPS is a great acronym that I learned as a young computer professional, and it works great as a template for building well-rounded analytic solutions. FLURPS stands for: Functionality, Localizability, Usability, Reliability, Performance, Supportability. It seems like a lost acronym that I'd like to resurrect to help us design and build better solutions.
This funny-sounding acronym reminds me of a big, hairy puppet like Mr. Snuffleupagus -- that's why it has stuck in my mind for so many years. Let's go through the different elements of FLURPS and how they can enhance your design.

Functionality

Functionality remains the key component of design; it represents all the features the customer knows and wants. When building a requirements document, most analysts separate functional from non-functional requirements, which is a good practice. Furthermore, functional requirements should always take precedence over non-functional requirements. Never sacrifice functional requirements for non-functional requirements -- you should satisfy non-functional requirements in addition to functional requirements.

Localizability

Localizability handles geographical concerns such as language. Internationalization (or i18n, for those in the know) is closely related to localizability in that it architecturally provides the technical infrastructure to localize a solution. Knowing that your recommendation engine will be used globally, you may internationalize it by automatically sensing where your user is located, and then localize it by providing, for example, German-, Russian-, or Chinese-specific content. Bear in mind that good localizability extends beyond language translation and caters to cultural differences in functionality.
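To make the point concrete, here is a minimal Java sketch (my own illustration, not from the original article) of locale-aware formatting, which shows that localization covers more than translating strings; the locales and sample values are purely illustrative assumptions.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

public class LocalizedRecommendation {
    public static void main(String[] args) {
        double price = 1299.5;   // illustrative price for a recommended product
        Date now = new Date();
        // Locales we might detect for the user; purely an assumption for the demo.
        Locale[] locales = { Locale.GERMANY, Locale.US, Locale.CHINA };
        for (Locale locale : locales) {
            // Currency and date presentation differ per locale, not just the language.
            NumberFormat currency = NumberFormat.getCurrencyInstance(locale);
            DateFormat date = DateFormat.getDateInstance(DateFormat.LONG, locale);
            System.out.println(locale.getDisplayCountry(Locale.ENGLISH) + ": "
                    + currency.format(price) + " on " + date.format(now));
        }
    }
}
```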

Usability

Usability deals with the customer experience. This is a pet peeve of mine -- I'm tired of seeing analytic solutions that force the user into the mind of the developer. For instance, it's very common to see use cases where a batch operation seems obvious, but the solution only allows single transaction processing. If I could possibly have hundreds of input variables to my predictive analytics engine, why should I have to create them one by one?
Although usability seems like it should fall into the functional category, it does not. Most customers don't know how to design a usable solution; however, they know when it's not usable. Putting a user experience expert on your team is a fantastic idea.

Reliability

Reliability handles the stability of your application. Reliability is not something end users contemplate because they assume your solution will be stable; when it's not, frustration can quickly escalate to extreme dissatisfaction.
You must build reliable solutions. How many times have you lost work because your system crashed? And by the way, a cute little icon telling you that the system crashed doesn't help. Build requirements into your solution to recover from exceptional situations and gracefully exit only when all possible routes of recovery are lost. I once designed a web application that went through four or five levels of exception before it finally, gracefully quit -- after saving all of the user's work. The users never knew the application was going into its third and fourth level of exception, and that's the way it should be.
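As a rough sketch of that layered-recovery idea (my own illustration, not the application described above), the code below tries progressively simpler ways of saving the user's work and only gives up when every route is exhausted; the SaveTarget interface and the concrete targets are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class GracefulSave {
    /** Hypothetical abstraction over anywhere we could persist the user's work. */
    interface SaveTarget {
        void save(String work) throws Exception;
        String name();
    }

    /** Try each recovery level in order; only give up when every route is exhausted. */
    static boolean saveWithFallback(String work, List<SaveTarget> targets) {
        for (SaveTarget target : targets) {
            try {
                target.save(work);
                return true;                  // the user never notices which level succeeded
            } catch (Exception e) {
                System.err.println(target.name() + " failed: " + e.getMessage());
            }
        }
        return false;                         // all possible routes of recovery are lost
    }

    public static void main(String[] args) {
        SaveTarget primary = new SaveTarget() {
            public void save(String w) throws Exception { throw new Exception("database down"); }
            public String name() { return "primary database"; }
        };
        SaveTarget local = new SaveTarget() {
            public void save(String w) { System.out.println("saved locally: " + w); }
            public String name() { return "local disk"; }
        };
        boolean ok = saveWithFallback("draft report", Arrays.asList(primary, local));
        System.out.println(ok ? "work preserved" : "graceful exit after exhausting recovery");
    }
}
```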

Performance

Performance is a bigger deal than you might think. I recently trained a group of users on a new web-based system that would on occasion take several minutes for a submenu to appear -- the industry standard is between two and three seconds. The performance of the system broadcast the quality of the rest of it, and the message wasn't positive.
I know performance issues can be difficult to track down, but that's your problem, not the users'. Make sure reasonable response times are documented in your requirements and thoroughly tested when the system is built.

Supportability

Supportability is the last, but not the least important, component of a robust design. Whether you're designing a product that will be used by customers or an internal system that will be used by employees, it's vitally important that the operations group is in a good position to support the solution.
For analytic solutions, supportability often extends beyond requirements into organizational design. The instrumentation on an analytic solution is often sophisticated, so it's important to staff the operations function with very knowledgeable technicians -- maybe even other data scientists. When I'm putting together a development team, I often include at least one person from the support team; this way, they can influence the solution's design from the perspective of someone who's going to support it.

Monday 15 September 2014

The Google Cloud Platform: 10 things you need to know

The Google Cloud Platform comprises many of Google's top tools for developers. Here are 10 things you might not know about it.


The infrastructure-as-a-service (IaaS) market has exploded in recent years. Google stepped into the fold of IaaS providers, somewhat under the radar. The Google Cloud Platform is a group of cloud computing tools for developers to build and host web applications.

It started with services such as the Google App Engine and quickly evolved to include many other tools and services. While the Google Cloud Platform was initially met with criticism of its lack of support for some key programming languages, it has added new features and support that make it a contender in the space.

Here's what you need to know about the Google Cloud Platform.

1. Pricing

Google recently shifted its pricing model to include sustained-use discounts and per-minute billing. Billing starts with a 10-minute minimum and is charged per minute for the time that follows. Sustained-use discounts begin after a particular instance is used for more than 25% of a month. Users receive a discount for each incremental minute used after they reach the 25% mark. Developers can find more information here.
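As a back-of-the-envelope illustration of that billing model, the Java sketch below applies the 10-minute minimum and per-minute billing, and flags when usage passes the 25% sustained-use threshold; the per-minute rate is a placeholder, and the actual discount tiers are deliberately not modeled.

```java
public class BillingSketch {
    // Illustrative per-minute rate; real rates vary by machine type (placeholder value).
    static final double RATE_PER_MINUTE = 0.001;
    static final int MINUTES_PER_MONTH = 30 * 24 * 60;

    /** Apply the 10-minute minimum, then bill per minute. */
    static double baseCharge(int minutesUsed) {
        int billable = Math.max(minutesUsed, 10);
        return billable * RATE_PER_MINUTE;
    }

    public static void main(String[] args) {
        int[] samples = { 4, 45, MINUTES_PER_MONTH / 3 };   // short, medium, sustained use
        for (int minutes : samples) {
            boolean sustained = minutes > MINUTES_PER_MONTH * 0.25;
            System.out.printf("%d min -> $%.4f base charge%s%n",
                    minutes, baseCharge(minutes),
                    sustained ? " (sustained-use discount would kick in past 25% of the month)" : "");
        }
    }
}
```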

If you're wondering what it would cost for your organization, try Google's pricing calculator.

2. Cloud Debugger
The Cloud Debugger gives developers the option to assess and debug code in production. Developers can set a watchpoint on a line of code, and any time a server request hits that line of code, they will get all of the variables and parameters of that code. According to a Google blog post, there is no overhead to run it and "when a watchpoint is hit very little noticeable performance impact is seen by your users."

3. Cloud Trace
Cloud Trace lets you quickly figure out what is causing a performance bottleneck and fix it. The base value add is that it shows you how much time your product is spending processing certain requests. Users can also get a report that compares performances across releases.

4. Cloud Save

The Cloud Save API was announced at the 2014 Google I/O developers conference by Greg DeMichillie, the director of product management on the Google Cloud Platform. Cloud Save is a feature that lets you "save and retrieve per user information." It also allows cloud-stored data to be synchronized across devices.

5. Hosting
The Cloud Platform offers two hosting options: App Engine, its Platform-as-a-Service, and Compute Engine, its Infrastructure-as-a-Service. In the standard App Engine hosting environment, Google manages all of the components outside of your application code.

The Cloud Platform also offers managed VM environments that blend the auto-management of App Engine with the flexibility of Compute Engine VMs. The managed VM environment also gives users the ability to add third-party frameworks and libraries to their applications.

6. Andromeda
Google Cloud Platform networking tools and services are all based on Andromeda, Google's network virtualization stack. Having access to the full stack allows Google to create end-to-end solutions without compromising functionality based on available insertion points or existing software.

According to a Google blog post, "Andromeda is a Software Defined Networking (SDN)-based substrate for our network virtualization efforts. It is the orchestration point for provisioning, configuring, and managing virtual networks and in-network packet processing."

7. Containers
Containers are especially useful in a PaaS situation because they help speed up deployment and the scaling of apps. For those looking for container management and virtualization on the Cloud Platform, Google offers its open source container scheduler, Kubernetes. Think of it as a Container-as-a-Service solution, providing management for Docker containers.

8. Big Data
The Google Cloud Platform offers a full big data solution, but there are two unique tools for big data processing and analysis on Google Cloud Platform. First, BigQuery allows users to run SQL-like queries on terabytes of data. Plus, you can load your data in bulk directly from your Google Cloud Storage.

The second tool is Google Cloud Dataflow. Also announced at I/O, Google Cloud Dataflow allows you to create, monitor, and glean insights from a data processing pipeline. It evolved from Google's MapReduce.

9. Maintenance
Google does routine testing and regularly sends patches, but it also sets all virtual machines to live migrate away from maintenance as it is being performed.

"Compute Engine automatically migrates your running instance. The migration process will impact guest performance to some degree but your instance remains online throughout the migration process. The exact guest performance impact and duration depend on many factors, but it is expected most applications and workloads will not notice," the Google developer website said.

VMs can also be set to shut down cleanly and restart away from the maintenance event.

10. Load balancing
In June, Google announced the Cloud Platform HTTP Load Balancing to balance the traffic of multiple compute instances across different geographic regions.

To learn more about Big Data Hadoop Training in Jaipur, please visit --

http://www.bigdatahadoop.info/

To read more, visit - http://www.techrepublic.com/article/the-google-cloud-platform-10-things-you-need-to-know/

Thursday 11 September 2014

The Early Release Books Keep Coming: This Time, Hadoop Security

We are thrilled to announce the availability of the early release of Hadoop Security, a new book about security in the Apache Hadoop ecosystem published by O’Reilly Media. The early release contains two chapters on System Architecture and Securing Data Ingest and is available in O’Reilly’s catalog and in Safari Books.

Hadoop security

The goal of the book is to serve the experienced security architect who has been tasked with integrating Hadoop into a larger enterprise security context. System and application administrators will also benefit from a thorough treatment of the risks inherent in deploying Hadoop in production and the associated how and why of Hadoop security.

As Hadoop continues to mature and become ever more widely adopted, material must become specialized for the security architects tasked with ensuring new applications meet corporate and regulatory policies. While it is up to operations staff to deploy and maintain the system, they won’t be responsible for determining what policies their systems must adhere to. Hadoop is mature enough that dedicated security professionals need a reference to navigate the complexities of security on such a massive scale. Additionally, security professionals must be able to keep up with the array of activity in the Hadoop security landscape as exemplified by new projects like Apache Sentry (incubating) and cross-project initiatives such as Project Rhino.

Security architects aren’t interested in how to write a MapReduce job or how HDFS splits files into data blocks; they care about where data is going and who will be able to access it. Their focus is on putting into practice the policies and standards necessary to keep their data secure. As more corporations turn to Hadoop to store and process their most valuable data, the risks associated with a potential breach of those systems increase exponentially. Without a thorough treatment of the subject, organizations will delay deployments or resort to siloed systems that increase capital and operating costs.

The first chapter available is on the System Architecture where Hadoop is deployed. It goes into the different options for deployment: in-house, cloud, and managed. The chapter also covers how major components of the Hadoop stack get laid out physically from both a server perspective and a network perspective. It gives a security architect the necessary background to put the overall security architecture of a Hadoop deployment into context.

The second available chapter is on Securing Data Ingest; it covers the basics of Confidentiality, Integrity, and Availability (CIA) and applies them to feeding your cluster with data from external systems. In particular, the two most common data ingest tools, Apache Flume and Apache Sqoop, are evaluated for their support of CIA. The chapter details the motivation for securing your ingest pipeline and provides ample information and examples on how to configure these tools for your specific needs. It also puts the security of your Hadoop data ingest flow into the broader context of your enterprise architecture.

We encourage you to take a look and get involved early. Security is a complex topic and it never hurts to get a jump start on it. We’re also eagerly awaiting feedback. We would never have come this far without the help of some extremely kind reviewers. You can also expect more chapters to come in the coming months. We’ll continue to provide summaries on this blog as we release new content so you know what to expect.

If anyone wants to learn Big Data Hadoop, then visit - http://www.bigdatahadoop.info

Thursday 4 September 2014

Open Source Cloud Computing with Hadoop

Have you ever wondered how Google, Facebook and other Internet giants process their massive workloads? Billions of requests are served every day by the biggest players on the Internet, resulting in background processing involving datasets in the petabyte scale. Of course they rely on Linux and cloud computing for obtaining the necessary scalability and performance. The flexibility of Linux combined with the seamless scalability of cloud environments provide the perfect framework for processing huge datasets, while eliminating the need for expensive infrastructure and custom proprietary software. Nowadays, Hadoop is one of the best choices in open source cloud computing, offering a platform for large scale data crunching.

Introduction

In this article we introduce and analyze the Hadoop project, which has been embraced by many commercial and scientific initiatives that need to process huge datasets. It provides a full platform for large-scale dataset processing in cloud environments and is easily scalable, since it can be deployed on heterogeneous cluster infrastructure and regular hardware. As of April 2011, Amazon, AOL, Adobe, eBay, Google, IBM, Twitter, Yahoo and several universities are listed as users in the project's wiki. Maintained by the Apache Foundation, Hadoop comprises a full suite for seamless distributed scalable computing on huge datasets. It provides base components on top of which new distributed computing sub-projects can be implemented. Among its main components is an open source implementation of the MapReduce framework (for distributed data processing) together with a data storage solution composed of a distributed filesystem and a data warehouse.

The MapReduce Framework

The MapReduce framework was created and patented by Google in order to process their own page rank algorithm and other applications that support their search engine. The idea behind it was actually introduced many years ago by the first functional programming languages such as LISP, and basically consists of partitioning a large problem into several "smaller" problems that can be solved separately. The partitioning and finally the main problem's result are computed by two functions: Map and Reduce. In terms of data processing, the Map function takes a large dataset and partitions it into several smaller intermediate datasets that can be processed in parallel by different nodes in a cluster. The reduce function then takes the separate results of each computation and aggregates them to form the final output. The power of MapReduce can be leveraged by different applications to perform operations such as sorting and statistical analysis on large datasets, which may be mapped into smaller partitions and processed in parallel.
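To make the Map and Reduce roles concrete, here is a small plain-Java sketch (independent of Hadoop, written for this post) that counts words the MapReduce way: the map step turns each input line into intermediate (word, 1) pairs, and the reduce step aggregates all values for a key. In a real cluster the partitions would be processed in parallel on different nodes and the grouping would be done by the framework's shuffle phase.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.AbstractMap.SimpleEntry;

public class MapReduceIdea {
    // Map: one input record (a line) -> many intermediate (word, 1) pairs.
    static List<SimpleEntry<String, Integer>> map(String line) {
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce: all values for one key -> a single aggregated result.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        String[] partitionedInput = { "the quick brown fox", "the lazy dog", "the fox" };
        // Shuffle phase: group intermediate pairs by key (done by the framework in Hadoop).
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : partitionedInput) {
            for (SimpleEntry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        grouped.forEach((word, counts) -> System.out.println(word + " -> " + reduce(counts)));
    }
}
```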

Hadoop MapReduce


Hadoop includes a Java implementation of the MapReduce framework, its underlying components and the necessary large scale data storage solutions. Although application programming is mostly done in Java, it provides APIs in different languages such as Ruby and Python, allowing developers to integrate Hadoop to diverse existing applications. It was first inspired by Google's implementation of MapReduce and the GFS distributed filesystem, absorbing new features as the community proposed new specific sub projects and improvements. Currently, Yahoo is one of the main contributors to this project, making public the modifications carried out by their internal developers. The basis of Hadoop and its several sub projects is the Core, which provides components and interfaces for distributed I/O and filesystems. The Avro data serialization system is also an important building block, providing cross-language RPC and persistent data storage.

On top of the Core, there's the actual implementation of MapReduce and its APIs, including Hadoop Streaming, which allows flexible development of Map and Reduce functions in any desired language. A MapReduce cluster is composed of a master node and a cloud of several worker nodes. The nodes in this cluster may be any Java-enabled platform, but large Hadoop installations are mostly run on Linux due to its flexibility, reliability and lower TCO. The master node manages the worker nodes, receiving jobs and distributing the workload across the nodes. In Hadoop terminology, the master node runs the JobTracker, responsible for handling incoming jobs and allocating nodes for performing separate tasks. Worker nodes run TaskTrackers, which offer virtual task slots that are allocated to specific map or reduce tasks depending on their access to the necessary input data and overall availability. Hadoop offers a web management interface, which allows administrators to obtain information on the status of jobs and individual nodes in the cloud. It also allows fast and easy scalability through the addition of cheap worker nodes without disrupting regular operations.
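For readers who want to see what a Hadoop MapReduce job looks like in code, below is a minimal word-count job written against the org.apache.hadoop.mapreduce API. It is a sketch of the standard pattern rather than code from any particular installation; the input and output paths passed on the command line are assumptions. Packaged into a jar and submitted to the cluster, its map and reduce tasks are scheduled onto the TaskTracker slots described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCount {
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local aggregation on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```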

HDFS: A distributed filesystem

The main use of the MapReduce framework is in processing large volumes of data, and before any processing takes place it is necessary to first store this data in some volume accessible by the MapReduce cluster. However, it is impractical to store such large data sets on local filesystems, and much more impractical to synchronize the data across the worker nodes in the cluster. In order to address this issue, Hadoop also provides the Hadoop Distributed Filesystem (HDFS), which easily scales across the several nodes in a MapReduce cluster, leveraging the storage capacity of each node to provide storage volumes in the petabyte scale. It eliminates the need for expensive dedicated storage area network solutions while offering similar scalability and performance. HDFS runs on top of the Core and is perfectly integrated into the MapReduce APIs provided by Hadoop. It is also accessible via command line utilities and the Thrift API, which provides interfaces for various programming languages, such as Perl, C++, Python and Ruby. Furthermore, a FUSE (Filesystem in Userspace) driver can be used to mount HDFS as a standard filesystem.
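As a small illustration of programmatic access to HDFS, the following sketch uses Hadoop's Java FileSystem API to copy a local file into the cluster and list a directory; the NameNode URI and the paths are assumptions made for the example and would normally come from the cluster's configuration files (fs.defaultFS in core-site.xml).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from the cluster configuration.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);

        // Copy a local file into the distributed filesystem.
        fs.copyFromLocalFile(new Path("/tmp/access.log"), new Path("/user/demo/logs/access.log"));

        // List the directory; block placement and replication are handled by the NameNode/DataNodes.
        for (FileStatus status : fs.listStatus(new Path("/user/demo/logs"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```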

In a typical HDFS+MapReduce cluster, the master node runs a NameNode, while the rest of the (worker) nodes run DataNodes. The NameNode manages HDFS volumes, being queried by clients to carry out standard filesystem operations such as add, copy, move or delete files. The DataNodes do the actual data storage, receiving commands from the NameNode and performing operations on locally stored data. In order to increase performance and optimize network communications, HDFS implements rack awareness capabilities. This feature enables the distributed filesystem and the MapReduce environment to determine which worker nodes are connected to the same switch (i.e. in the same rack), distributing data and allocating tasks in such a way that communication takes place between nodes in the same rack without overloading the network core. HDFS and MapReduce automatically manage which pieces of a given file are stored on each node, allocating nodes for processing these data accordingly. When the JobTracker receives a new job, it first queries the DataNodes of worker nodes in the same rack, allocating a task slot if the node has the necessary data stored locally. If no available slots are found in the rack, the JobTracker then allocates the first free slot it finds.

Hive: A petabyte scale database

On top of the HDFS distributed filesystem, Hadoop implements Hive, a distributed data warehouse solution. Hive started as an internal project at Facebook and has now evolved into a full-blown project of its own, maintained by the Apache Foundation. It provides ETL (Extract, Transform and Load) features and QL, a query language similar to standard SQL. Hive queries are translated into MapReduce jobs run on table data stored on HDFS volumes. This allows Hive to process queries that involve huge datasets with performance comparable to MapReduce jobs while providing the same level of abstraction as a database. Its performance is most apparent when running queries over large datasets that do not change frequently. For example, Facebook relies on Hive to store user data, run statistical analysis, process logs and generate reports.
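To give a flavour of QL, here is a hedged sketch that runs a SQL-like query against Hive over its JDBC interface (HiveServer2); the host, credentials, table and columns are hypothetical, and Hive turns the query into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the host, port, and credentials below are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive.example.com:10000/default", "demo", "");
             Statement stmt = conn.createStatement()) {

            // A SQL-like QL query over a hypothetical page_views table stored on HDFS.
            String ql = "SELECT country, COUNT(*) AS views "
                      + "FROM page_views WHERE dt = '2014-09-01' "
                      + "GROUP BY country ORDER BY views DESC LIMIT 10";
            try (ResultSet rs = stmt.executeQuery(ql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("country") + "\t" + rs.getLong("views"));
                }
            }
        }
    }
}
```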

Conclusion

We have briefly overviewed the main features and components of Hadoop. Leveraging the power of cloud computing, many large companies rely on this project to perform their day-to-day data processing. This is yet another example of open source software being used to build large-scale applications while keeping costs low. However, we have only scratched the surface of the fascinating infrastructure behind Hadoop and its many possible uses. In future articles we will see how to set up a basic Hadoop cluster and how to use it for interesting applications such as log parsing and statistical analysis.

Further Reading
 
If anyone is interested in learning more about Big Data Hadoop Training in Jaipur, then click here.

If you are interested in learning more about Hadoop's architecture, administration and application development these are the best places to start:

- Hadoop: The Definitive Guide, Tom White, O'Reilly/Yahoo Press, 2nd edition, 2010
- Apache Hadoop Project homepage: http://hadoop.apache.org/

Monday 1 September 2014

Normalizing Corporate Small Data With Hadoop and Data Science

In part one of this series (Hadoop for Small Data), we introduced the idea that Small Data is the mission-critical data management challenge. To reiterate, Small Data is “corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence.”

We are excluding stochastic data use cases which can succeed even if there is error in the source data and uncertainty in the results, because the business objective there is more focused on getting trends or making general associations. Most Big Data examples are this type. In stark contrast are deterministic use cases, where the ramifications for wrong results are severely negative. This is the realm of executive decision making, accounting, risk management, regulatory compliance, security, to name a few.

We chose this so-called Small Data use case for our inaugural Tech Lab series for several reasons. First, such data is obviously critical to the business, and should therefore be germane to any serious discussion of an information-driven enterprise. Second, the multivariate nature of the data presents a serious challenge in and of itself. Third, the rules and other business logic that give the data meaning tend to be opaque, sometimes embedded deep in operational systems, which means that bringing transparency to this layer can yield tremendous opportunity for the business to fine-tune its operations and grow smoothly.

The Tech Lab was designed to bring the rigors of scientific process to the world of data management, a la Data Science. Our mission is to demonstrate the gritty, brass tacks processes by which organizations can identify opportunities with data big and small, then build real-world solutions to deliver value. Each project features a Data Scientist (yours truly), who takes a set of enterprise software tools into the lab, then tackles some real-world data to build the solution. The entire process is documented via a series of blogs and several Webcasts, which detail the significant issues and hurdles encountered, and insights about how they were addressed or overcome.

All too often in the world of enterprise data, serious problems are ignored, or worse, assumed to be unsolvable. This leads to a cycle of spending money, time, and organizational capital. Not only can this challenge be solved, but doing so will vastly improve your personal and organizational success by having accurate, meaningful data that is understood, managed, and common.

Now, all of this probably sounds exactly like the marketing for the various conference fads of the past decade. We do not need to name them since we all recall the multiple expensive tools and bygone years which in the end did not yield much improvement. So, how can we inject success into this world?

The answer is to adopt what has been working for a very long time and is now a hot topic in data management, namely, Data Science. This new and exciting field (to data management not to science in general) comes with a tremendous amount of thoroughly tried and tested methods, and is linked to a strong community with deep knowledge and an ingrained willingness to help. This is the “science” part of Data Science. Data Science uses the fundamental precepts of how science deals with data: maintain detailed auditable and visible information of important activities, assumptions, and logic; embrace uncertainty since there can never be a perfect result; welcome questions of how and why the data values are obtained, used, and managed; understand the differences between raw and normalized data.

It is the latter tenet that we will concentrate on for this discussion and for the next Tech Lab with Cloudera. In science, normalizing data is done every day as a necessary and critical activity of work, whether experimental (as I used to do in Nanotechnology) or computer modeling. Normalizing data is more sophisticated than what is commonly done in integration (i.e. ETL). It combines subject matter knowledge, governance, business rules, and raw data. In contrast, ETL moves different data parts from their sources to a new location with some simple logic infused along the way. ETL has failed to solve even medium-level problems with discordant, conflicting, real-world corporate data, albeit not for want of money and time. Indeed, many scientific and engineering fields solve challenges of much higher complexity than those in corporate Small Data, and do so with an order of magnitude less expense, time, and organizational friction.

One real-world example is a well governed part number used across major supply chain and accounting applications. Despite policy stating the specific syntax of the numbers, in actual data systems there are a variety of forms with some having suffixes, some having prefixes, and some having values taken out of circulation. Standard approaches like architecture and ETL cannot solve this (although several years have typically been spent trying) because the knowledge of why, who, when, and what is often not available. In the meantime, costs are driven up to support this situation, management is stifled in modernizing applications and infrastructure, legacy applications cannot be retired, and the lack of common data prevents meaningful Business Intelligence. Note that this lack of corporate knowledge also means that other top-down approaches like data virtualization and semantic mediation are doomed because they rely on mapping all source data values (not the models or metadata and this is a critical distinction to understand) to a common model.

This is much more typical of the state of corporate Small Data than simple variations in spelling or code lookups. This was for one element among many. Consider this for your company – whether related to clinical health, accounting, financial, and other core corporate data sets – and you can see the enormity of the challenge. This also explains why the techniques of the last years have not worked. If you do not have the complete picture then your architecture does not reflect your actual operations. Similarly, ETL tools work primarily on tables and use low level “transforms” like LTRIM. When the required transform crosses tables, and possibly even sources (compare element A in Table X in source 1 to elements B and C in source 2, and element D in source 3), then it becomes too difficult to develop and manage.

This was the status quo. I say “was” because we now have new tools and new methods that are well designed, engineered, tested, and understood to solve this challenge, with the additional benefits that they are cheaper, faster, and more accurate, and that they engender organizational cooperation. This is the combination of normalizing data with the computing power of Hadoop.

Data Normalization excels at correcting this challenge and does so with high levels of visibility, flexibility, collaboration, and alignment to business tempo. Raw data is the data that comes directly from sensors or other collectors and is typically known to be incorrect in some manner. This is not a problem as long as there is visible, collaborative, and evolving (remember the tenet that there is never a perfect result) knowledge of how to adjust it to make it better. This calibration is part of normalizing the raw data in a controlled, auditable manner to make it as meaningful as possible, while also having explicit information about how accurate it should be considered. Normalizing data needs adjustable and powerful computing tools. For very complicated data there are general purpose mathematical tools and specialized applications. However, corporate Small Data does not need this level of computing, but does need a way to code complicated business rules with clarity, openness to review, and ease and speed of updates.

This is what Hadoop provides. Hadoop is ready made for running programs on demand with the power of parallel computing and distributed storage. These are the very capabilities that enable Data Normalization to be part of mainstream business data management. One of the key needs of solving Small Data challenges that prior technologies could not provide, is low cycle time to make adjustments as more knowledge is gained and business requirements change (which will always occur, sometimes daily). Gone is the era when data could be managed in six to twelve month cycles of requirements, data modeling, ETL scripting, database engineering, and BI construction. All of this must respond and be in-step with business, not the other way around. With Hadoop, Data Normalization routines in Java programs can be run as often as desired and with multiple parallel jobs. This means a normalization routine that might have taken hours on an entire corporate warehouse can now be done in minutes. Results can then be used in any number of applications and tools.
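To make that concrete, here is a hedged sketch of what such a Java normalization routine could look like as a Hadoop mapper, using the part-number example from earlier; the prefixes, suffixes, retired values, and CSV layout are all hypothetical stand-ins for real, governed business rules, and the point is that the rules live in one visible, reviewable place that can be rerun in minutes as knowledge evolves.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical normalization rules for the part-number example; real rules would be governed. */
public class PartNumberNormalizerMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Set<String> RETIRED = new HashSet<>(Arrays.asList("PN-0000", "PN-9999"));

    static String normalize(String raw) {
        String pn = raw.trim().toUpperCase();
        // Assumed business rules: drop a source-system prefix and a revision suffix.
        if (pn.startsWith("LEGACY-")) pn = pn.substring("LEGACY-".length());
        int rev = pn.lastIndexOf("-REV");
        if (rev > 0) pn = pn.substring(0, rev);
        return pn;
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input assumed to be CSV lines of: sourceSystem,rawPartNumber,quantity
        String[] fields = value.toString().split(",");
        if (fields.length < 3) return;                 // skip malformed records, keep the job running
        String normalized = normalize(fields[1]);
        String status = RETIRED.contains(normalized) ? "RETIRED" : "ACTIVE";
        // Emit the normalized key so downstream reducers see one consistent part number.
        context.write(new Text(normalized), new Text(status + "," + fields[0] + "," + fields[2]));
    }
}
```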

A simple Hadoop cluster of just a handful of nodes will have enough power to normalize Small Data in concert with business tempo. Now, you can have accurate data that reflects the real business rules of your organization and that adapts and grows with you. Of course, getting this to work in your corporate production environment requires more than just the raw power of Hadoop technology. It requires mature and tested management of this technology and an assured integration of its parts that will not become a maintenance nightmare nor security risk.
This is exactly what the Cloudera distribution provides. All Hadoop components are tested, integrated, and bundled into a working environment with additional components specifically made to match the ease of management and maintenance of more traditional tools. Additionally, this distribution is being managed with clearly planned updates and version releases. While there are too many individual components to comment on now, one which deserves mention as a key aid to Data Normalization, and indeed the Hadoop environment itself, is Cloudera’s Hue web tool that allows browsing the file system, issuing queries to multiple data sources, planning and executing jobs, and reviewing metadata. If you have any questions, comments, or concerns about Small Data please join me live on Webcast II of our inaugural Tech Lab!
 
If anyone wants to learn Big Data Hadoop, then visit -- www.bigdatahadoop.info
 

Friday 8 August 2014

What is the difference between big data and Hadoop?

The difference between big data and the open source software program Hadoop is a distinct and fundamental one. The former is an asset, often a complex and ambiguous one, while the latter is a program that accomplishes a set of goals and objectives for dealing with that asset.

Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different kinds of formats. For example, businesses might put a lot of work into collecting thousands of pieces of data on purchases in currency formats, on customer identifiers like name or Social Security number, or on product information in the form of model numbers, sales numbers or inventory numbers. All of this, or any other large mass of information, can be called big data. As a rule, it’s raw and unsorted until it is put through various kinds of tools and handlers.

Hadoop is one of the tools designed to handle big data.
Hadoop and other software products work to interpret or parse the results of big data searches through specific proprietary algorithms and methods. Hadoop is an open-source program under the Apache license that is maintained by a global community of users. It includes several main components, among them a MapReduce set of functions and the Hadoop Distributed File System (HDFS).

The idea behind MapReduce is that Hadoop can first map a large data set, and then perform a reduction on that content for specific results. A reduce function can be thought of as a kind of filter for raw data. The HDFS system then acts to distribute data across a network or migrate it as necessary.
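To picture the "reduce as a filter" idea, here is a small hedged sketch of a Hadoop reducer that keeps only customers whose summed purchases cross a threshold; the key/value layout and the threshold are assumptions made for the example, not part of the article.

```java
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/** Reduce-as-filter: keep only customers whose purchases total more than a threshold. */
public class HighValueCustomerReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private static final double THRESHOLD = 1000.0;   // illustrative cutoff in the purchase currency

    @Override
    protected void reduce(Text customerId, Iterable<DoubleWritable> purchases, Context context)
            throws IOException, InterruptedException {
        double total = 0.0;
        for (DoubleWritable amount : purchases) total += amount.get();
        if (total > THRESHOLD) {
            // Records below the threshold are simply filtered out of the result set.
            context.write(customerId, new DoubleWritable(total));
        }
    }
}
```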

Database administrators, developers and others can use the various features of Hadoop to deal with big data in any number of ways. For example, Hadoop can be used to pursue data strategies like clustering and targeting with non-uniform data, or data that doesn't fit neatly into a traditional table or respond well to simple queries.

 
Article Source: http://www.techopedia.com/7/29680/technology-trends/what-is-the-difference-between-big-data-and-hadoop

Friday 1 August 2014

How Big Data Can Help Your Organization Outperform Your Peers

Big data has a lot of potential to benefit organizations in any industry, everywhere across the globe. Big data is much more than just a lot of data; in particular, combining different data sets will provide organizations with real insights that can be used in decision-making and to improve the financial position of the organization. Before we can understand how big data can help your organization, let's see what big data actually is:
It is generally accepted that big data can be explained according to three V's: Velocity, Variety and Volume. However, I would like to add a few more V's to better explain the impact and implications of a well thought through big data strategy.

Velocity
Velocity is the speed at which data is created, stored, analyzed and visualized. In the past, when batch processing was common practice, it was normal to receive an update to the database every night or even every week. Computers and servers required substantial time to process the data and update the databases. In the big data era, data is created in real-time or near real-time. With the availability of Internet-connected devices, wireless or wired, machines and devices can pass on their data the moment it is created.
The speed at which data is created currently is almost unimaginable: Every minute we upload 100 hours of video on YouTube. In addition, over 200 million emails are sent every minute, around 20 million photos are viewed and 30,000 uploaded on Flickr, almost 300,000 tweets are sent and almost 2.5 million queries on Google are performed.
The challenge organizations have is to cope with the enormous speed the data is created and use it in real-time.

Variety
In the past, all data that was created was structured data; it neatly fitted into columns and rows, but those days are over. Nowadays, 90% of the data generated by organizations is unstructured data. Data today comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data. The wide variety of data requires a different approach as well as different techniques to store all raw data.
There are many different types of data and each of those types of data require different types of analyses or different tools to use. Social media like Facebook posts or Tweets can give different insights, such as sentiment analysis on your brand, while sensory data will give you information about how a product is used and what the mistakes are.

Volume
90% of all data ever created was created in the past 2 years. From now on, the amount of data in the world will double every two years. By 2020, we will have 50 times the amount of data we had in 2011. The sheer volume of the data is enormous, and a very large contributor to the ever-expanding digital universe is the Internet of Things, with sensors all over the world in all devices creating data every second.
If we look at airplanes, they generate approximately 2.5 billion terabytes of data each year from the sensors installed in their engines. The agricultural industry also generates massive amounts of data with sensors installed in tractors. John Deere, for example, uses sensor data to monitor machine optimization, control its growing fleet of farming machines and help farmers make better decisions. Shell uses super-sensitive sensors to find additional oil in wells, and if they install these sensors at all 10,000 wells they will collect approximately 10 exabytes of data annually. That again is absolutely nothing if we compare it to the Square Kilometer Array Telescope, which will generate 1 exabyte of data per day.
In the past, the creation of so much data would have caused serious problems. Nowadays, with decreasing storage costs, better storage options like Hadoop and the algorithms to create meaning from all that data this is not a problem at all.

Veracity
Having a lot of data in different volumes coming in at high speed is worthless if that data is incorrect. Incorrect data can cause a lot of problems for organizations as well as for consumers. Therefore, organizations need to ensure that the data is correct as well as the analyses performed on the data are correct. Especially in automated decision-making, where no human is involved anymore, you need to be sure that both the data and the analyses are correct.
If you want your organization to become information-centric, you should be able to trust the data as well as the analyses. Shockingly, 1 in 3 business leaders do not trust the information they use in decision-making. Therefore, if you want to develop a big data strategy you should strongly focus on the correctness of the data as well as the correctness of the analyses.

Variability
Big data is extremely variable. Brian Hopkins, a Forrester principal analyst, defines variability as the "variance in meaning, in lexicon". He refers to the supercomputer Watson, which won Jeopardy. The supercomputer had to "dissect an answer into its meaning and [...] to figure out what the right question was". That is extremely difficult because words have different meanings and it all depends on the context. For the right answer, Watson had to understand the context.
Variability is often confused with variety. Say you have a bakery that sells 10 different breads. That is variety. Now imagine you go to that bakery three days in a row and every day you buy the same type of bread, but each day it tastes and smells different. That is variability.
Variability is thus very relevant in performing sentiment analyses. Variability means that the meaning is changing (rapidly). In (almost) the same tweets a word can have a totally different meaning. In order to perform a proper sentiment analyses, algorithms need to be able to understand the context and be able to decipher the exact meaning of a word in that context. This is still very difficult.
Visualization
This is the hard part of big data: making the vast amount of data comprehensible in a manner that is easy to understand and read. With the right visualizations, raw data can be put to use. Visualization of course does not mean ordinary graphs or pie charts; it means complex graphs that can include many variables while still remaining understandable and readable.
Visualizing might not be the most technologically difficult part, but it sure is the most challenging one. Telling a complex story in a graph is very difficult but also extremely crucial. Luckily, more and more big data startups are appearing that focus on this aspect, and in the end, visualizations will make the difference.
Value
All that available data will create a lot of value for organizations, societies and consumers. Big data means big business, and every industry will reap the benefits. McKinsey states that the potential annual value of big data to US health care is $300 billion, more than double the total annual health care spending of Spain. They also mention that big data has a potential annual value of €250 billion to Europe's public sector administration. Even more, in their well-regarded report from 2011, they state that the potential annual consumer surplus from using personal location data globally could be up to $600 billion in 2020. That is a lot of value.
Of course, data in itself is not valuable at all. The value is in the analyses done on that data and how the data is turned into information and eventually turning it into knowledge. The value is in how organizations will use that data and turn their organization into an information-centric company that bases their decision-making on insights derived from data analyses.
Use cases
Now that the definition of big data is clear, let's have a look at the different possible use cases. Of course, for each industry and each individual type of organization the possible use cases differ. There are, however, also a few generic big data use cases that show the possibilities of big data for your organization.

1. Truly get to know your customers, all of them in real-time.
In the past we used focus groups and questionnaires to find out who our customers were. This was always outdated the moment the results came in, and it was far too high-level. With big data this is not necessary anymore. Big data allows companies to completely map the DNA of their customers. Knowing the customer well is the key to being able to sell to them effectively. The benefit of really knowing your customers is that you can give recommendations or show advertising that is tailored to their individual needs.
2. Co-create, improve and innovate your products real-time.
Big data analytics can help organizations gain a better understanding of what customers think of their products or services. Listening to what people say about a product on social media and blogs can give more information about it than a traditional questionnaire. Especially if it is measured in real-time, companies can act upon possible issues immediately. Not only can the sentiment about products be measured, but also how it differs among different demographic groups or in different geographical locations at different times.
If anyone wants to know more about Hadoop Training in Jaipur
3. Determine how much risk your organization faces.
Determining the risk a company faces is an important aspect of today's business. In order to define the risk of a potential customer or supplier, a detailed profile of that customer can be made and placed in a certain category, each with its own risk level. Currently, this process is often too broad and vague, and quite often a customer or supplier is placed in the wrong category, thereby receiving a wrong risk profile. A risk profile that is too high is not that harmful, apart from lost income, but a risk profile that is too low could seriously damage an organization. With big data it is possible to determine a risk category for each individual customer or supplier based on all of their data from the past and present, in real-time.
4. Personalize your website and pricing in real-time toward individual customers.
Companies have used split-tests and A/B tests for some years now to define the best layout for their customers in real-time. With big data this process will change forever. Many different web metrics can be analyzed constantly and in real-time as well as combined. This will allow companies to have a fluid system where the look, feel and layout change to reflect multiple influencing factors. It will be possible to give each individual visitor a website specially tailored to his or her wishes and needs at that exact moment. A returning customer might see another webpage a week or month later depending on his or her personal needs for that moment.
5. Improve your service support for your customers.
With big data it is possible to monitor machines from a (great) distance and check how they are performing. Using telematics, each part of a machine can be monitored in real-time. Data is sent to the manufacturer and stored for real-time analysis. Each vibration, noise or error gets detected automatically, and when the algorithm detects a deviation from normal operation, service support can be warned. The machine can even automatically schedule maintenance for a time when it is not in use. When the engineer comes to fix the machine, he knows exactly what to do thanks to all the information available.
6. Find new markets and new business opportunities by combining own data with public data.
Companies can also discover unmet customer desires using big data. By doing pattern and/or regression analysis on your own data, you might find needs and wishes of customers that you did not know were present. Combining various data sets can give whole new meaning to existing data and allows organizations to find new markets, target groups or business opportunities they were previously not aware of.
7. Better understand your competitors and more importantly, stay ahead of them.
What you can do for your own organization can also be done, more or less, for your competition. It will help organizations better understand the competition and know where they stand, which can provide a valuable head start. Using big data analytics, algorithms can find out, for example, if a competitor changes its pricing and automatically adjust your prices as well to stay competitive.
8. Organize your company more effectively and save money.
By analyzing all the data in your organization you may find areas that can be improved and organized better. Especially the logistics industry can become more efficient using the new big data sources available in the supply chain. Electronic on-board recorders in trucks tell us where they are, how fast they drive, where they drive, etc. Sensors and RF tags in trailers and distribution centers help load and unload trucks more efficiently, and combining road conditions, traffic information and weather conditions with the locations of clients can substantially save time and money.
Of course, these generic use cases are just a small portion of the massive possibilities of big data, but they show that there are endless opportunities to take advantage of it. Each organization has different needs and requires a different big data approach. Making correct use of these possibilities will add business value and help you stand out from your competition.