Monday 15 September 2014

The Google Cloud Platform: 10 things you need to know

The Google Cloud Platform comprises many of Google's top tools for developers. Here are 10 things you might not know about it.


The infrastructure-as-a-service (IaaS) market has exploded in recent years. Google stepped into the fold of IaaS providers, somewhat under the radar. The Google Cloud Platform is a group of cloud computing tools for developers to build and host web applications.

It started with services such as the Google App Engine and quickly evolved to include many other tools and services. While the Google Cloud Platform was initially met with criticism of its lack of support for some key programming languages, it has added new features and support that make it a contender in the space.

Here's what you need to know about the Google Cloud Platform.

1. Pricing

Google recently shifted its pricing model to include sustained-use discounts and per-minute billing. Billing starts with a 10-minute minimum and is charged per minute thereafter. Sustained-use discounts begin after a particular instance is used for more than 25% of a month, and users receive a discount for each incremental minute used after they reach that mark (see the sketch below). Developers can find more details in Google's pricing documentation.

If you're wondering what it would cost for your organization, try Google's pricing calculator.
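
As a small illustration of the billing arithmetic described above, the sketch below applies the 10-minute minimum and per-minute charging to a run of a given length. The hourly rate is a hypothetical placeholder rather than a published Google price, and the sustained-use discount tiers are omitted.

```java
// Illustrative sketch of Compute Engine's per-minute billing model.
// The hourly rate below is a hypothetical placeholder, not a published Google price.
public class BillingSketch {
    static final double HOURLY_RATE = 0.07;   // hypothetical on-demand rate (USD/hour)
    static final int MINIMUM_MINUTES = 10;    // 10-minute billing minimum

    // Returns the charge for a single instance run of the given length.
    static double chargeFor(int minutesUsed) {
        int billedMinutes = Math.max(minutesUsed, MINIMUM_MINUTES);
        return billedMinutes * (HOURLY_RATE / 60.0);
    }

    public static void main(String[] args) {
        System.out.printf("3-minute run billed as: $%.4f%n", chargeFor(3));   // billed as 10 minutes
        System.out.printf("45-minute run billed as: $%.4f%n", chargeFor(45)); // billed per minute
    }
}
```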

2. Cloud Debugger
The Cloud Debugger gives developers the option to assess and debug code in production. Developers can set a watchpoint on a line of code, and any time a server request hits that line, they get all of the variables and parameters of that code. According to a Google blog post, there is no overhead to run it, and "when a watchpoint is hit very little noticeable performance impact is seen by your users."

3. Cloud Trace
Cloud Trace lets you quickly figure out what is causing a performance bottleneck and fix it. Its core value is showing how much time your product spends processing certain requests. Users can also get reports that compare performance across releases.

4. Cloud Save

The Cloud Save API was announced at the 2014 Google I/O developers conference by Greg DeMichillie, the director of product management on the Google Cloud Platform. Cloud Save is a feature that lets you "save and retrieve per user information." It also allows cloud-stored data to be synchronized across devices.

5. Hosting
The Cloud Platform offers two hosting options: App Engine, its Platform-as-a-Service, and Compute Engine, its Infrastructure-as-a-Service. In the standard App Engine hosting environment, Google manages all of the components outside of your application code.

The Cloud Platform also offers managed VM environments that blend the auto-management of App Engine with the flexibility of Compute Engine VMs. The managed VM environment also gives users the ability to add third-party frameworks and libraries to their applications.

6. Andromeda
Google Cloud Platform networking tools and services are all based on Andromeda, Google's network virtualization stack. Having access to the full stack allows Google to create end-to-end solutions without compromising functionality based on available insertion points or existing software.

According to a Google blog post, "Andromeda is a Software Defined Networking (SDN)-based substrate for our network virtualization efforts. It is the orchestration point for provisioning, configuring, and managing virtual networks and in-network packet processing."

7. Containers
Containers are especially useful in a PaaS situation because they speed up deployment and the scaling of apps. For container management and virtualization on the Cloud Platform, Google offers Kubernetes, its open source container scheduler. Think of it as a Container-as-a-Service solution that provides management for Docker containers.

8. Big Data
The Google Cloud Platform offers a full big data solution, but two tools for big data processing and analysis on the platform stand out. First, BigQuery allows users to run SQL-like queries on terabytes of data, and data can be loaded in bulk directly from Google Cloud Storage.

The second tool is Google Cloud Dataflow. Also announced at I/O, Google Cloud Dataflow allows you to create, monitor, and glean insights from a data processing pipeline. It evolved from Google's MapReduce.

9. Maintenance
Google does routine testing and regularly sends patches, but it also sets all virtual machines to live-migrate away from maintenance as it is being performed.

"Compute Engine automatically migrates your running instance. The migration process will impact guest performance to some degree but your instance remains online throughout the migration process. The exact guest performance impact and duration depend on many factors, but it is expected most applications and workloads will not notice," the Google developer website said.

VMs can also be set to shut down cleanly and restart away from the maintenance event.

10. Load balancing
In June, Google announced Cloud Platform HTTP Load Balancing, which balances the traffic of multiple compute instances across different geographic regions.

To learn more about Big Data Hadoop Training in Jaipur, visit http://www.bigdatahadoop.info/

For the full article, visit http://www.techrepublic.com/article/the-google-cloud-platform-10-things-you-need-to-know/

Thursday 11 September 2014

The Early Release Books Keep Coming: This Time, Hadoop Security

We are thrilled to announce the availability of the early release of Hadoop Security, a new book about security in the Apache Hadoop ecosystem published by O’Reilly Media. The early release contains two chapters on System Architecture and Securing Data Ingest and is available in O’Reilly’s catalog and in Safari Books.

Hadoop security

The goal of the book is to serve the experienced security architect who has been tasked with integrating Hadoop into a larger enterprise security context. System and application administrators will also benefit from a thorough treatment of the risks inherent in deploying Hadoop in production and the associated how and why of Hadoop security.

As Hadoop continues to mature and become ever more widely adopted, material must become specialized for the security architects tasked with ensuring new applications meet corporate and regulatory policies. While it is up to operations staff to deploy and maintain the system, they won’t be responsible for determining what policies their systems must adhere to. Hadoop is mature enough that dedicated security professionals need a reference to navigate the complexities of security on such a massive scale. Additionally, security professionals must be able to keep up with the array of activity in the Hadoop security landscape as exemplified by new projects like Apache Sentry (incubating) and cross-project initiatives such as Project Rhino.

Security architects aren’t interested in how to write a MapReduce job or how HDFS splits files into data blocks; they care about where data is going and who will be able to access it. Their focus is on putting into practice the policies and standards necessary to keep their data secure. As more corporations turn to Hadoop to store and process their most valuable data, the risk of a breach of those systems increases exponentially. Without a thorough treatment of the subject, organizations will delay deployments or resort to siloed systems that increase capital and operating costs.

The first chapter available is on the System Architecture where Hadoop is deployed. It goes into the different options for deployment: in-house, cloud, and managed. The chapter also covers how major components of the Hadoop stack get laid out physically from both a server perspective and a network perspective. It gives a security architect the necessary background to put the overall security architecture of a Hadoop deployment into context.

The second available chapter, on Securing Data Ingest, covers the basics of Confidentiality, Integrity, and Availability (CIA) and applies them to feeding your cluster with data from external systems. In particular, the two most common data ingest tools, Apache Flume and Apache Sqoop, are evaluated for their support of CIA. The chapter details the motivation for securing your ingest pipeline and provides ample information and examples on how to configure these tools for your specific needs. It also puts the security of your Hadoop data ingest flow into the broader context of your enterprise architecture.

We encourage you to take a look and get involved early. Security is a complex topic and it never hurts to get a jump start on it. We’re also eagerly awaiting feedback; we would never have come this far without the help of some extremely kind reviewers. You can also expect more chapters in the coming months. We’ll continue to provide summaries on this blog as we release new content so you know what to expect.

If you want to learn more about Big Data Hadoop Training, visit http://www.bigdatahadoop.info

Thursday 4 September 2014

Open Source Cloud Computing with Hadoop

Have you ever wondered how Google, Facebook and other Internet giants process their massive workloads? Billions of requests are served every day by the biggest players on the Internet, resulting in background processing involving datasets at the petabyte scale. Of course, they rely on Linux and cloud computing to obtain the necessary scalability and performance. The flexibility of Linux combined with the seamless scalability of cloud environments provides the perfect framework for processing huge datasets, while eliminating the need for expensive infrastructure and custom proprietary software. Nowadays, Hadoop is one of the best choices in open source cloud computing, offering a platform for large-scale data crunching.

Introduction

In this article we introduce and analyze the Hadoop project, which has been embraced by many commercial and scientific initiatives that need to process huge datasets. It provides a full platform for large-scale dataset processing in cloud environments and is easily scalable, since it can be deployed on heterogeneous cluster infrastructure and regular hardware. As of April 2011, Amazon, AOL, Adobe, Ebay, Google, IBM, Twitter, Yahoo and several universities are listed as users in the project's wiki. Maintained by the Apache Foundation, Hadoop comprises a full suite for seamless distributed scalable computing on huge datasets. It provides base components on top of which new distributed computing sub-projects can be implemented. Among its main components is an open source implementation of the MapReduce framework (for distributed data processing), together with a data storage solution composed of a distributed filesystem and a data warehouse.

The MapReduce Framework

The MapReduce framework was created and patented by Google in order to process its PageRank algorithm and other applications that support its search engine. The idea behind it was actually introduced many years ago by the first functional programming languages such as LISP, and basically consists of partitioning a large problem into several "smaller" problems that can be solved separately. The partitioning and the aggregation of the final result are handled by two functions: Map and Reduce. In terms of data processing, the Map function takes a large dataset and partitions it into several smaller intermediate datasets that can be processed in parallel by different nodes in a cluster. The Reduce function then takes the separate results of each computation and aggregates them to form the final output. The power of MapReduce can be leveraged by different applications to perform operations such as sorting and statistical analysis on large datasets, which may be mapped into smaller partitions and processed in parallel.
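
To make that division of labor concrete, here is a minimal, framework-free word-count sketch in Java (the sample strings are invented for illustration): each chunk of text is mapped to partial word counts that could run in parallel on different nodes, and the partial results are then reduced into a single final count.

```java
// Minimal, framework-free sketch of the Map/Reduce idea described above:
// each input chunk is mapped to partial word counts in parallel, and the
// partial results are then reduced (merged) into a single final count.
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    // Map: turn one chunk of text into partial (word -> count) results.
    static Map<String, Long> map(String chunk) {
        return Arrays.stream(chunk.toLowerCase().split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    // Reduce: merge the partial results into the final answer.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials)
            partial.forEach((word, count) -> total.merge(word, count, Long::sum));
        return total;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("to be or not to be", "to map and to reduce");
        // In a real cluster, the map calls run on different worker nodes.
        List<Map<String, Long>> partials = chunks.parallelStream()
                                                 .map(MapReduceSketch::map)
                                                 .collect(Collectors.toList());
        System.out.println(reduce(partials)); // e.g. {to=4, be=2, or=1, ...}
    }
}
```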

Hadoop MapReduce


Hadoop includes a Java implementation of the MapReduce framework, its underlying components and the necessary large-scale data storage solutions. Although application programming is mostly done in Java, it provides APIs in different languages such as Ruby and Python, allowing developers to integrate Hadoop with diverse existing applications. It was first inspired by Google's implementation of MapReduce and the GFS distributed filesystem, and has absorbed new features as the community proposed new sub-projects and improvements. Currently, Yahoo is one of the main contributors to this project, making public the modifications carried out by their internal developers. The basis of Hadoop and its several sub-projects is the Core, which provides components and interfaces for distributed I/O and filesystems. The Avro data serialization system is also an important building block, providing cross-language RPC and persistent data storage.

On top of the Core, there's the actual implementation of MapReduce and its APIs, including Hadoop Streaming, which allows flexible development of Map and Reduce functions in any desired language. A MapReduce cluster is composed of a master node and a cloud of several worker nodes. The nodes in this cluster may be any Java-enabled platform, but large Hadoop installations are mostly run on Linux due to its flexibility, reliability and lower TCO. The master node manages the worker nodes, receiving jobs and distributing the workload across the nodes. In Hadoop terminology, the master node runs the JobTracker, responsible for handling incoming jobs and allocating nodes for performing separate tasks. Worker nodes run TaskTrackers, which offer virtual task slots that are allocated to specific map or reduce tasks depending on their access to the necessary input data and overall availability. Hadoop offers a web management interface, which allows administrators to obtain information on the status of jobs and individual nodes in the cloud. It also allows fast and easy scalability through the addition of cheap worker nodes without disrupting regular operations.
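
As a compact sketch of what a Hadoop MapReduce job looks like in practice, here is the classic word-count example written against Hadoop's Java API; the input and output paths are supplied as command-line arguments and are placeholders for real HDFS directories. Packaging the class into a jar and submitting it with "hadoop jar" hands the job to the JobTracker, which distributes the map and reduce tasks across the TaskTrackers described above.

```java
// Classic word-count job using Hadoop's Java MapReduce API.
// Input and output paths are placeholders supplied on the command line.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);                    // emit (word, 1) for each token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));        // final (word, total) pair
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // job name shows up in the web UI
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```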

HDFS: A distributed filesystem

The main use of the MapReduce framework is in processing large volumes of data, and before any processing takes place it is necessary to first store this data in some volume accessible by the MapReduce cluster. However, it is impractical to store such large data sets on local filesystems, and much more impractical to synchronize the data across the worker nodes in the cluster. In order to address this issue, Hadoop also provides the Hadoop Distributed Filesystem (HDFS), which easily scales across the several nodes in a MapReduce cluster, leveraging the storage capacity of each node to provide storage volumes in the petabyte scale. It eliminates the need for expensive dedicated storage area network solutions while offering similar scalability and performance. HDFS runs on top of the Core and is perfectly integrated into the MapReduce APIs provided by Hadoop. It is also accessible via command line utilities and the Thrift API, which provides interfaces for various programming languages, such as Perl, C++, Python and Ruby. Furthermore, a FUSE (Filesystem in Userspace) driver can be used to mount HDFS as a standard filesystem.
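
As a rough illustration of programmatic access, the sketch below uses Hadoop's Java FileSystem API to copy a local file into HDFS and list a directory. The NameNode URI and the paths are placeholders, and in a real deployment the filesystem settings would normally come from the cluster's core-site.xml rather than being set in code.

```java
// A small sketch of working with HDFS through Hadoop's Java FileSystem API.
// The NameNode URI and paths below are placeholders for your own setup.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point at the NameNode; normally this is picked up from core-site.xml instead.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the distributed filesystem.
        fs.copyFromLocalFile(new Path("/tmp/input.log"),
                             new Path("/user/analytics/input.log"));

        // List what is stored under a directory, as 'hadoop fs -ls' would.
        for (FileStatus status : fs.listStatus(new Path("/user/analytics"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```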

In a typical HDFS+MapReduce cluster, the master node runs a NameNode, while the rest of the (worker) nodes run DataNodes. The NameNode manages HDFS volumes, being queried by clients to carry out standard filesystem operations such as adding, copying, moving or deleting files. The DataNodes do the actual data storage, receiving commands from the NameNode and performing operations on locally stored data. In order to increase performance and optimize network communications, HDFS implements rack awareness capabilities. This feature enables the distributed filesystem and the MapReduce environment to determine which worker nodes are connected to the same switch (i.e. in the same rack), distributing data and allocating tasks in such a way that communication takes place between nodes in the same rack without overloading the network core. HDFS and MapReduce automatically manage which pieces of a given file are stored on each node, allocating nodes for processing these data accordingly. When the JobTracker receives a new job, it first queries the DataNodes of worker nodes in the same rack, allocating a task slot if the node has the necessary data stored locally. If no available slots are found in the rack, the JobTracker then allocates the first free slot it finds.
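
The placement preference just described can be illustrated with a deliberately simplified sketch: prefer a worker that already holds the task's input block and has a free slot, and otherwise fall back to the first free slot found. This only illustrates the idea of data-aware scheduling; it is not Hadoop's actual scheduler code.

```java
// A deliberately simplified sketch of the placement preference described above
// (data-local slot first, then any free slot). Hadoop's real scheduler is far
// more sophisticated; this only illustrates the idea of data-aware scheduling.
import java.util.*;

public class PlacementSketch {
    static class Worker {
        final String name;
        final Set<String> localBlocks;   // blocks this worker's DataNode stores
        int freeSlots;                   // free TaskTracker slots
        Worker(String name, Set<String> blocks, int slots) {
            this.name = name; this.localBlocks = blocks; this.freeSlots = slots;
        }
    }

    static Worker assignTask(List<Worker> workers, String inputBlock) {
        for (Worker w : workers)         // 1) prefer a node that already holds the data
            if (w.freeSlots > 0 && w.localBlocks.contains(inputBlock)) { w.freeSlots--; return w; }
        for (Worker w : workers)         // 2) otherwise take the first free slot found
            if (w.freeSlots > 0) { w.freeSlots--; return w; }
        return null;                     // no capacity available right now
    }

    public static void main(String[] args) {
        List<Worker> cluster = Arrays.asList(
            new Worker("worker-1", new HashSet<>(Arrays.asList("blk_42")), 2),
            new Worker("worker-2", new HashSet<>(Arrays.asList("blk_7")), 2));
        System.out.println(assignTask(cluster, "blk_7").name); // worker-2 (data-local)
    }
}
```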

Hive: A petabyte scale database

On top of the HDFS distributed filesystem, Hadoop implements Hive, a distributed data warehouse solution. Hive actually started as an internal project at Facebook and has now evolved into a full-blown project of its own, maintained by the Apache Foundation. It provides ETL (Extract, Transform and Load) features and QL, a query language similar to standard SQL. Hive queries are translated into MapReduce jobs run on table data stored on HDFS volumes. This allows Hive to process queries that involve huge datasets with performance comparable to MapReduce jobs, while providing the same abstraction level as a database. Its strength is most apparent when running queries over large datasets that do not change frequently. For example, Facebook relies on Hive to store user data, run statistical analysis, process logs and generate reports.
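
As a rough sketch of what querying Hive looks like from an application, the example below uses the HiveServer2 JDBC driver from Java; the host, table and query are placeholders, and the driver jar must be on the classpath. The QL statement is compiled by Hive into MapReduce jobs that run over the table data stored on HDFS.

```java
// A rough sketch of querying Hive from Java over JDBC. The host, table, and
// query below are placeholders; the HiveServer2 JDBC driver must be on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             // Hive translates this QL into MapReduce jobs over data stored in HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT country, COUNT(*) AS hits FROM page_views " +
                 "WHERE view_date = '2014-09-01' GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString("country") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```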

Conclusion

We have briefly overviewed the main features and components of Hadoop. Leveraging the power of cloud computing, many large companies rely on this project to perform their day-to-day data processing. This is yet another example of open source software being used to build large-scale, scalable applications while keeping costs low. However, we have only scratched the surface of the fascinating infrastructure behind Hadoop and its many possible uses. In future articles we will see how to set up a basic Hadoop cluster and how to use it for interesting applications such as log parsing and statistical analysis.

Further Reading
 
If you are interested in learning more about Big Data Hadoop Training in Jaipur, click here.

If you are interested in learning more about Hadoop's architecture, administration and application development these are the best places to start:

- Hadoop: The Definitive Guide, Tom White, O'Reilly Media/Yahoo Press, 2nd edition, 2010
- Apache Hadoop Project homepage: http://hadoop.apache.org/

Monday 1 September 2014

Normalizing Corporate Small Data With Hadoop and Data Science

In part one of this series (Hadoop for Small Data), we introduced the idea that Small Data is the mission-critical data management challenge. To reiterate, Small Data is “corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence.”

We are excluding stochastic data use cases which can succeed even if there is error in the source data and uncertainty in the results, because the business objective there is more focused on getting trends or making general associations. Most Big Data examples are this type. In stark contrast are deterministic use cases, where the ramifications for wrong results are severely negative. This is the realm of executive decision making, accounting, risk management, regulatory compliance, security, to name a few.

We chose this so-called Small Data use case for our inaugural Tech Lab series for several reasons. First, such data is obviously critical to the business, and should therefore be germane to any serious discussion of an information-driven enterprise. Second, the multivariate nature of the data presents a serious challenge in and of itself. Third, the rules and other business logic that give the data meaning tend to be opaque, sometimes embedded deep in operational systems, which means that effecting transparency into this layer can yield tremendous opportunity for the business to fine-tune its operations and grow smoothly.

The Tech Lab was designed to bring the rigors of scientific process to the world of data management, a la Data Science. Our mission is to demonstrate the gritty, brass tacks processes by which organizations can identify opportunities with data big and small, then build real-world solutions to deliver value. Each project features a Data Scientist (yours truly), who takes a set of enterprise software tools into the lab, then tackles some real-world data to build the solution. The entire process is documented via a series of blogs and several Webcasts, which detail the significant issues and hurdles encountered, and insights about how they were addressed or overcome.

All too often in the world of enterprise data, serious problems are ignored, or worse, assumed to be unsolvable. This leads to a cycle of spending money, time, and organizational capital. Not only can this challenge be solved, but doing so will vastly improve your personal and organizational success by having accurate, meaningful data that is understood, managed, and common.

Now, all of this probably sounds exactly like the marketing for the various conference fads of the past decade. We do not need to name them since we all recall the multiple expensive tools and bygone years which in the end did not yield much improvement. So, how can we inject success into this world?

The answer is to adopt what has been working for a very long time and is now a hot topic in data management, namely, Data Science. This field (new and exciting to data management, though not to science in general) comes with a tremendous amount of thoroughly tried and tested methods, and is linked to a strong community with deep knowledge and an ingrained willingness to help. This is the “science” part of Data Science. Data Science uses the fundamental precepts of how science deals with data: maintain detailed, auditable, and visible information about important activities, assumptions, and logic; embrace uncertainty, since there can never be a perfect result; welcome questions of how and why the data values are obtained, used, and managed; and understand the differences between raw and normalized data.

It is the latter tenet that we will concentrate on for this discussion and for the next Tech Lab with Cloudera. In science, normalizing data is done every day as a necessary and critical activity of work, whether experimental (as I used to do in Nanotechnology) or computer modeling. Normalizing data is more sophisticated than what is commonly done in integration (i.e., ETL). It combines subject matter knowledge, governance, business rules, and raw data. In contrast, ETL moves different data parts from their sources to a new location with some simple logic infused along the way. ETL has failed to solve even medium-level problems with discordant, conflicting, real-world corporate data, albeit not for want of money and time. Indeed, many scientific and engineering fields solve challenges of much higher complexity than those of corporate Small Data, and do so with an order of magnitude less expense, time, and organizational friction.

One real-world example is a well-governed part number used across major supply chain and accounting applications. Despite policy stating the specific syntax of the numbers, in actual data systems there are a variety of forms, with some having suffixes, some having prefixes, and some having values taken out of circulation. Standard approaches like architecture and ETL cannot solve this (although several years have typically been spent trying) because the knowledge of why, who, when, and what is often not available. In the meantime, costs are driven up to support this situation, management is stifled in modernizing applications and infrastructure, legacy applications cannot be retired, and the lack of common data prevents meaningful Business Intelligence. Note that this lack of corporate knowledge also means that other top-down approaches like data virtualization and semantic mediation are doomed, because they rely on mapping all source data values (not just the models or metadata, and this is a critical distinction to understand) to a common model.

This is much more typical of the state of corporate Small Data than simple variations in spelling or code lookups. And that was for just one element among many. Consider this for your company – whether related to clinical health, accounting, financial, or other core corporate data sets – and you can see the enormity of the challenge. This also explains why the techniques of the last years have not worked: if you do not have the complete picture, then your architecture does not reflect your actual operations. Similarly, ETL tools work primarily on tables and use low-level “transforms” like LTRIM. When the required transform crosses tables, and possibly even sources (compare element A in Table X in source 1 to elements B and C in source 2, and element D in source 3), it becomes too difficult to develop and manage.

This was the status quo. I say “was” because we now have new tools and new methods that are well designed, engineered, tested, and understood to solve this challenge; and with the additional benefits that they are cheaper, faster, more accurate; and engender organizational cooperation. This is the combination of normalizing data with the computing power of Hadoop.

Data Normalization excels at correcting this challenge and does so with high levels of visibility, flexibility, collaboration, and alignment to business tempo. Raw data is the data that comes directly from sensors or other collectors and is typically known to be incorrect in some manner. This is not a problem as long as there is visible, collaborative, and evolving knowledge (remember the tenet that there is never a perfect result) of how to adjust it to make it better. This calibration is part of normalizing the raw data in a controlled, auditable manner to make it as meaningful as possible, while also having explicit information about how accurate it should be considered. Normalizing data needs adjustable and powerful computing tools. For very complicated data there are general purpose mathematical tools and specialized applications. However, corporate Small Data does not need this level of computing; it does need a way to code complicated business rules with clarity, openness to review, and ease and speed of updates.

This is what Hadoop provides. Hadoop is ready-made for running programs on demand with the power of parallel computing and distributed storage. These are the very capabilities that enable Data Normalization to be part of mainstream business data management. One of the key needs in solving Small Data challenges that prior technologies could not provide is low cycle time to make adjustments as more knowledge is gained and business requirements change (which will always occur, sometimes daily). Gone is the era when data could be managed in six- to twelve-month cycles of requirements, data modeling, ETL scripting, database engineering, and BI construction. All of this must respond to and be in step with the business, not the other way around. With Hadoop, Data Normalization routines written as Java programs can be run as often as desired and with multiple parallel jobs. This means a normalization routine that might have taken hours on an entire corporate warehouse can now be done in minutes. Results can then be used in any number of applications and tools.
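
As a minimal illustration (not code from this series) of what such a routine might look like, the map-only Hadoop job below applies a couple of hypothetical part-number rules, stripping an assumed plant prefix and revision suffix, record by record across the cluster. Real normalization logic would encode the governed business rules discussed earlier, but the shape of the program is the same.

```java
// A minimal, hypothetical sketch of a map-only normalization routine run on Hadoop.
// The part-number rules (strip a plant prefix, drop a revision suffix) stand in for
// whatever governed business rules your organization actually maintains.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PartNumberNormalizer {
    public static class NormalizeMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String raw = value.toString().trim();
            // Hypothetical rules: remove a plant prefix like "PLT1-" and a revision suffix like "/revB".
            String normalized = raw.replaceAll("^PLT\\d+-", "")
                                   .replaceAll("/rev[A-Z]$", "")
                                   .toUpperCase();
            ctx.write(NullWritable.get(), new Text(normalized));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "part number normalization");
        job.setJarByClass(PartNumberNormalizer.class);
        job.setMapperClass(NormalizeMapper.class);
        job.setNumReduceTasks(0);                        // map-only: no aggregation needed
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // raw records in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // normalized output in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```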

A simple Hadoop cluster of just a handful of nodes will have enough power to normalize Small Data in concert with business tempo. Now you can have accurate data that reflects the real business rules of your organization and that adapts and grows with you. Of course, getting this to work in your corporate production environment requires more than just the raw power of Hadoop technology. It requires mature and tested management of this technology and an assured integration of its parts that will not become a maintenance nightmare or a security risk.
This is exactly what the Cloudera distribution provides. All Hadoop components are tested, integrated, and bundled into a working environment, with additional components specifically made to match the ease of management and maintenance of more traditional tools. Additionally, this distribution is managed with clearly planned updates and version releases. While there are too many individual components to comment on now, one that deserves mention as a key aid to Data Normalization, and indeed to the Hadoop environment itself, is Cloudera’s Hue web tool, which allows browsing the file system, issuing queries to multiple data sources, planning and executing jobs, and reviewing metadata.

If you have any questions, comments, or concerns about Small Data, please join me live on Webcast II of our inaugural Tech Lab!
 
If you want to learn Big Data Hadoop, visit www.bigdatahadoop.info