Hadoop and Big Data are dramatically impacting business, yet the exact relationship between Hadoop and Big Data remains open to discussion.
Hadoop and Big Data are in many ways the perfect union – or at least they have the potential to be.
Hadoop is hailed as the open
source distributed computing platform that harnesses dozens – or
thousands – of server nodes to crunch vast stores of data. And Big Data earns massive buzz as the quantitative-qualitative science of harvesting insight from vast stores of data.
You might think of Hadoop as the horse and Big Data as the rider. Or perhaps more accurately: Hadoop as the tool and Big Data as the house being built. Whatever the analogy, these two technologies – both seeing rapid growth – are inextricably linked.
However, Hadoop and Big Data share the same “problem”: both are
relatively new, and both are challenged by the rapid churn that’s
characteristic of immature, rapidly developing technologies.
Hadoop was developed in 2006, yet it wasn’t until Cloudera’s launch in 2009 that it moved toward commercialization. Even years later, it still prompts sharp disagreement. In June 2015 The New York Times offered the gloomy assessment that “Companies Move On From Big Data Technology Hadoop.” Furthermore, leading Big Data experts (see below) claim that Hadoop faces major headwinds.
Similarly, while Big Data has been around for years – called
“business intelligence” long before its current buzz – it still creates
deep confusion. Businesses are unclear about how to harness its power.
The myriad software solutions and possible strategies leave some users
flummoxed. There’s backlash, too, against the level of Big Data
hype. There’s even confusion about the term itself: “Big Data” has as
many definitions as people you’ll ask about it. It’s generally defined
as “the process of mining actionable insight from large quantities of
data,” yet it also includes machine learning, geospatial analytics and
an array of other intelligence uses.
No matter how you define it, though, Big Data is increasingly the tool that sets businesses apart. Those that can reap competitive insights from a Big Data solution gain key advantage; companies unable to leverage this technology will fall behind.
Big bucks are at stake. Research firm IDC forecasts that Big Data technology and services will grow at a 26.4% compound annual growth rate
through 2018, to become a $41.4 billion global market. If
accurate, that forecast means it’s growing at a stunning six times the rate
of the overall tech market.
Research by Wikibon
predicts a similar growth rate; the chart below reflects Big Data’s
exponential growth from just a few years ago. Given Big Data’s explosive
trajectory, it’s no wonder that Hadoop – widely seen as a key Big Data
tool – is enjoying enormous interest from enterprises of all sizes.
Hadoop and Big Data: The Perfect Union?
Whether Hadoop and Big Data are the ideal match “depends on what you’re doing,” says Nick Heudecker, a Gartner analyst who specializes in Data Management and Integration.
“Hadoop certainly allows you to onboard a tremendous amount of data very quickly, without making any compromises about what you’re storing and what you’re keeping. And that certainly facilitates a lot of the Big Data discovery,” he says.
However, businesses continue to use other Big Data technologies,
Heudecker says. A Gartner survey indicates that Hadoop is the third
choice for Big Data technology, behind Enterprise Data Warehouse and
Cloud Computing.
While Hadoop is a leading Big Data tool, it is not the top option for enterprise users.
It’s no surprise that the Enterprise Data Warehouse tops Hadoop as
the leading Big Data technology. A company’s complete history and
structure can be represented by the data stored in the data warehouse.
Moreover, Heudecker says, based on the Gartner user survey, “we see the
Enterprise Data Warehouse being combined with a variety of different
databases: SQL, graph databases, memory technologies, complex
processing, as well as stream processing.”
So while Hadoop is a key Big Data tool, it remains one contender
among many at this point. “I think there’s a lot of value in being able
to tell a cohesive federated story across multiple data stores,”
Heudecker says. That is, “Hadoop being used for some things; your data
warehouse being used for others. I don’t think anybody realistically
wants to put the whole of their data into a single platform. You need to
optimize to handle the potential workloads that you’re doing.”
Hadoop offers a full ecosystem along with a single Big Data platform. It is sometimes called a “data operating system.” Source: Gartner
Mike Gualtieri, a Forrester analyst whose key coverage areas include Big Data strategy and Hadoop, notes that Hadoop is part of a larger ecosystem – but it’s a foundational element in that data ecosystem.
“I would say ‘Hadoop and friends’ is a perfect match for
Big Data,” Gualtieri says. A variety of tools can be combined for best
results. “For example, you need streaming technology to process
real-time data. There’s software such as DataTorrent that runs on Hadoop, that can induce streaming. There’s Spark
[more on Spark later]. You might want to do batch jobs that are in
memory, and it’s very convenient, although not required, to run that
Spark cluster on a Hadoop cluster.”
Still, Hadoop’s position in the Big Data universe is
truly primary. “I would say Hadoop is a data operating system,”
Gualtieri says. “It’s a fundamental, general purpose platform. The
capabilities that it has are those of an operating system: It has a file
system, it has a way to run a job.” And the community of vendors and
open source projects all feed into a healthy stream for Hadoop. “They’re
making it the Big Data platform.”
In fact, Hadoop’s value for Big Data applications goes
beyond its primacy as a data operating system. As Gualtieri sees it,
Hadoop is also an application platform. This capability is enabled by YARN, the cluster management technology that’s part of Hadoop (YARN stands for Yet Another Resource Negotiator).
“YARN is really an important piece of glue here because
it allows innovation to occur in the Big Data community,” he says,
“because when a vendor or an open source project contributes something
new, some sort of new application, whether it’s machine learning,
streaming, a SQL engine, an ETL tool, ultimately, Hadoop becomes an
application platform as well as a data platform. And it has the
fundamental capability to handle all of these applications, and to
control the resources they use.”
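Gualtieri’s description of YARN as the glue that lets many applications share one cluster can be reduced to a toy scheduler. The sketch below is an illustrative simplification in Python, not YARN’s real API – the class and method names are invented for this example: a central resource manager tracks per-node free memory and grants “containers” to competing applications until the cluster is full.

```python
# Toy sketch of YARN-style resource management (illustrative only,
# not the real YARN API): a central ResourceManager grants memory
# "containers" on worker nodes to competing applications.

class ResourceManager:
    def __init__(self, nodes):
        # nodes: dict of node name -> free memory in MB
        self.free = dict(nodes)

    def request_container(self, app, mem_mb):
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free.items():
            if free >= mem_mb:
                self.free[node] = free - mem_mb
                return (app, node, mem_mb)
        return None  # cluster is full; the application must wait

rm = ResourceManager({"node1": 4096, "node2": 4096})

grants = [
    rm.request_container("sql-engine", 3000),   # fits on node1
    rm.request_container("ml-job", 3000),       # node1 is full -> node2
    rm.request_container("stream-app", 3000),   # no node has 3000 MB free
]
print(grants)
```

The point of the sketch is Gualtieri’s: whether the workload is a SQL engine, machine learning, or streaming, the same arbiter hands out cluster resources, which is what makes Hadoop an application platform and not just a data platform.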
YARN and HDFS provide Hadoop with a diverse array of capabilities.
Regardless of how technology evolves in the years ahead,
Hadoop will always have a place in the pioneering days of Big Data
infancy. There was a time when many businesses looked at their vast
reservoir of data – perhaps a sprawling 20 terabytes – and in essence
gave up. They assumed it was too big to be mined for insight.
But Hadoop changed that, notes Mike Matchett,
analyst with the Taneja Group who specializes in Big Data. The
development of Hadoop meant “Hey, if you get fifty white node cluster
servers – they don’t cost you that much – you get commodity servers,
there’s no SAN you have to have because you can use HDFS and local disk,
you can do something with it. You can find [Big Data insights]. And
that was when [Hadoop] kind of took off.”
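Matchett’s point – commodity nodes with local disks, no SAN required – rests on HDFS’s design: files are split into fixed-size blocks, and each block is replicated on several nodes, so any single machine can die without losing data. The following is a minimal sketch of that placement idea in Python, not real HDFS code; real HDFS placement is rack-aware rather than simple round-robin.

```python
import itertools

# Toy sketch of HDFS-style block placement (not the real HDFS code):
# a file is split into fixed-size blocks, and each block is copied to
# `replication` different commodity nodes' local disks. Real HDFS is
# rack-aware; round-robin here just keeps the idea visible.

def place_blocks(file_bytes, block_size, nodes, replication=3):
    """Return {block_index: [nodes holding a replica]}."""
    num_blocks = -(-file_bytes // block_size)  # ceiling division
    ring = itertools.cycle(nodes)              # round-robin over nodes
    placement = {}
    for b in range(num_blocks):
        placement[b] = [next(ring) for _ in range(replication)]
    return placement

# A 300 MB file with 128 MB blocks needs 3 blocks; each block lives on
# 3 of the 5 nodes, so losing any one node's disk loses no data.
layout = place_blocks(300 * 2**20, 128 * 2**20, ["n1", "n2", "n3", "n4", "n5"])
print(layout)
```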
Who’s Choosing Hadoop as a Big Data Tool
Based on Gartner research, the industries most strongly drawn to Hadoop are banking and financial services. Additional Hadoop early adopters include “more generally, services, which we define as anyone selling software or IT services,” Heudecker says. Insurance, manufacturing and natural resources also count Hadoop users.
Those are the kinds of industries that encounter more – and more diverse – kinds of data. “I think Hadoop certainly lends itself well to that, because now you don’t have to make compromises about what you’re going to keep and what you’re going to store,” Heudecker says. “You just store everything and figure it out later.”
On the other hand, there are laggards, says Taneja
Group’s Matchett. “You see people who say, ‘We’re doing fine with our
structured data warehouse. There’s not a lot of real-time menus for the
marketing we’re doing yet or that we see the need for.’”
But these slow adopters will get on board, he says.
“They’ll come around and say, ‘If we have a website and we have any
user-tracking and it’s creating a click stream of Big Data, we’re going
to have to use that for market data.’” And really, he asks, “Who
doesn’t have a website and a user base of some kind?”
Forrester’s Gualtieri notes that interest in Hadoop is
very high. “We did a Hadoop Wave,” he says, referring to the Forrester
report Big Data Hadoop Solutions.
“We evaluated and published that last year, and of all of the thousands
of Forrester documents published that year on all kinds of topics, it
was like the second most read document.” Driving this popularity is
Hadoop’s fundamental place as a data operating system, he says.
Furthermore, “The amounts of investments by – and I’m
not even talking about the startup guys – the investments by companies
like SAS, IBM, Microsoft, all of the commercial guys – their goal is to
make it easy and do more sophisticated things,” Gualtieri says. “So
there’s a lot of value being added.”
He foresees a potential scenario in which Hadoop is part
of every operating system. And while adoption is still growing, “I
estimate that in the next few years, the next 2-3 years, it will be 100
percent” – that is, all enterprises will deploy Hadoop. Gualtieri refers to a
phenomenon he calls “Hadoopenomics,” that is, Hadoop’s ability to unlock
a full ecosystem of profitable Big Data scenarios, chiefly because
Hadoop offers lower cost storing and accessing of data, relative to a
sophisticated data warehouse. “It’s not as capable as a data warehouse,
but it’s good for many things,” he says.
Hadoop Headwinds
Yet not all is rosy in the world of Hadoop. Recent Gartner research about Hadoop adoption notes that “investment remains tentative in the face of sizable challenges around business value and skills.”
The May 2015 report, co-authored by Heudecker and Gartner analyst Merv Adrian, states:
"Despite considerable hype and reported successes for early adopters, 54 percent of survey respondents report no plans to invest at this time, while only 18 percent have plans to invest in Hadoop over the next two years. Furthermore, the early adopters don't appear to be championing for substantial Hadoop adoption over the next 24 months; in fact, there are fewer who plan to begin in the next two years than already have."
“Only 26 percent of respondents claim to be either deploying,
piloting or experimenting with Hadoop, while 11 percent plan to invest
within 12 months and seven percent are planning investment in 24 months.
Responses pointed to two interesting reasons for the lack of intent.
First, several responded that Hadoop was simply not a priority. The
second was that Hadoop was overkill for the problems the business faced,
implying the opportunity costs of implementing Hadoop were too high
relative to the expected benefit.”
The Gartner report’s gloomiest news for Hadoop:
“With such large incidence of organizations with no plans or
already on their Hadoop journey, future demand for Hadoop looks fairly
anemic over at least the next 24 months. Moreover, the lack of near-term
plans for Hadoop adoption suggest that, despite continuing enthusiasm
for the big data phenomenon, demand for Hadoop specifically is not
accelerating. The best hope for revenue growth for providers would
appear to be in moving to larger deployments within their existing
customer base."
I asked Heudecker about these Hadoop impediments and he noted the lack of IT pros with top Hadoop skills:
“We talked with a large financial services organization, they were just starting their Hadoop journey,” he says, “and we asked, ‘Who helped you?’ And they said ‘Nobody, because the companies we called had just as much experience with Hadoop as we did.’ So when the largest financial service companies on the planet can’t find help for their Hadoop project, what does that mean for the global 30,000 companies out there?”
This lack of skilled tech pros for Hadoop is a true concern, Heudecker says. “That’s certainly being borne out in the data that we have, and in the conversations that we have with clients,” he says. “And I think it’s going to be a while before Hadoop skills are plentiful in the market.”
Gualtieri, however, voices quite a different view. The
idea that Hadoop faces a lack of skilled workers is “a myth,” he says.
Hadoop is based on Java, he notes. “A large enterprise has lots of Java
developers, and Java developers over the years always have to learn new
frameworks. And guess what? Just take a couple of your good Java guys
and say, ‘Do this on Hadoop,’ and they will figure it out. It’s not that
hard.” Java developers will be able to get a sample app running that
can do simple tasks before long, he says.
These in-house, homegrown Hadoop experts enable cost
savings, he says. “So instead of looking for the high-priced Hadoop
experts who say, ‘I know Hadoop,’ what I see when I talk to a lot of
enterprises, I’m talking to people who have been there for ten years –
they just became the Hadoop expert.”
An additional factor makes Hadoop dead simple to adopt,
Gualtieri says: “That is SQL for Hadoop. SQL is known by developers.
It’s known by many business intelligence professionals, and even
business people and data analysis [professionals], right? It’s very
popular.
“And there are at least thirteen different SQL for Hadoop
query engines on Hadoop. So you don’t need to know a thing about
MapReduce. You don’t need to know anything about distributed data or
distributed jobs” to accomplish an effective query.
Gualtieri points to a diverse handful of Hadoop SQL solutions: “Apache Drill, Cloudera Impala, Apache Hive … Presto, HP Vertica has a solution, Pivotal HAWQ, Microsoft PolyBase …
Naturally all the database companies and data warehouse companies have a
solution. They’ve repurposed their engines. And then there are the open
source firms.” All (or most) of these solutions tout their usability.
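Gualtieri’s argument – that a SQL engine hides MapReduce entirely – can be made concrete. Below, the same per-product revenue total is written once as plain SQL and once as hand-coded map, shuffle and reduce phases. Python’s built-in SQLite stands in here for a SQL-on-Hadoop engine such as Hive (an assumption for the sake of a runnable sketch; the SQL itself is what a BI professional would actually type either way).

```python
import sqlite3
from collections import defaultdict

sales = [("widget", 30), ("gadget", 70), ("widget", 12)]

# 1) The SQL way: one declarative query. SQLite stands in for a
#    SQL-on-Hadoop engine like Hive; the query text is the same idea.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)", sales)
sql_totals = dict(db.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product"))

# 2) The MapReduce way: the same aggregation written as explicit
#    map, shuffle, and reduce phases -- roughly what a query engine
#    generates for you behind the scenes.
mapped = [(product, amount) for product, amount in sales]        # map
groups = defaultdict(list)
for key, value in mapped:                                        # shuffle
    groups[key].append(value)
mr_totals = {key: sum(values) for key, values in groups.items()} # reduce

print(sql_totals, mr_totals)
```

Both paths produce identical totals; the difference is that the first asks for a result while the second spells out the distributed execution model – which is exactly the knowledge the thirteen-plus SQL-for-Hadoop engines spare their users.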
Matchett takes a middle ground between Heudecker’s view
that Hadoop faces a shortage of skilled workers and Gualtieri’s belief
that in-house Java developers and vendor solutions can fill the gap:
“There are plenty of places where people can get lots of
mileage out of it,” he says, referring to easy-to-use Hadoop
deployments – particularly AWS’s offering. “You and I can both go to
Amazon with a credit card and check out an EMR cluster,
which is a Hadoop cluster, and get it up and running without knowing
anything. You could do that in ten minutes with your Amazon account and
have a Big Data cluster.”
However, “at some level of professionalism or scale of
productivity, you’re going to need experts, still,” Matchett says. “Just
like you would with an RDBMS. It’s going to be pretty much analogous to
that.” Naturally these experts are more expensive and harder to find.
To be sure, there are easier solutions: “There are lots
of startup businesses that are committed to being cloud-based and
Web-based, and there’s no way they’re going to go run their Hadoop
clusters internally,” Matchett says. “They’re going to check them out of
the cloud.”
Again, though, at some point they may need top talent:
“They may still want a data scientist to solve their unique competitive
problem. They need the scientist to figure out what they can do
differently than their competitors or anybody else.”
The Hadoop/Big Data Vendor Connection
A growing community of Hadoop vendors offers a byzantine array of solutions. Flavors and configurations abound. These vendors are leveraging the fact that Hadoop has a certain innate complexity – meaning buyers need some help. Hadoop is composed of various software components, all of which need to work in concert. Adding potential confusion, different parts of the ecosystem progress at varying speeds.
Handling these challenges is “one of the advantages of working with a vendor,” Heudecker says. “They do that work for you.” As mentioned, a key element of these solutions is SQL – Heudecker refers to SQL as “the Lingua Franca of data management.”
Is there a particular SQL solution that will be the perfect match for Hadoop?
“I think over the next 3-5 years, you’ll actually see not one SQL solution emerge as a winner, but you’ll likely see several, depending on what you want to do,” Heudecker says. “In some cases Hive may be your choice depending on certain use cases. In other cases you may want to use Drill or something like Presto, depending on what your tools will support and what you want to accomplish.”
As for winner or losers in the race for market share? “I think it’s too soon. We’ll be talking about survivors, not winners.”
The emerging community of vendors tends to tout one key attribute: ease of use. Matchett notes that, “If you go to industry events, it’s just chock full of startups saying, ‘Hey, we’ve got this new interface that allows the business [user] just to drag and drop and leverage Big Data without having to know anything.’”
He compares the rapid evolution in Hadoop tools to the
evolution of virtualization several years ago. If a vendor wants to make
a sale, a simpler user interface is a selling point. Hadoop vendors are
hawking their wares by claiming, “‘We’ve got it functional. And now
we’re making it manageable,’” Matchett says. “‘We’re making it mature
and we’re adding security and remote-based access, and we’re adding
availability, and we’re adding ways for DevOps people to control it
without having to know a whole lot.’”
Hadoop Appliances: Big Data in a Box
Even as Hadoop matures, there continue to be Big Data solutions that far outmatch it – at a higher price, for those who need greater capability.
“There’s still definitely a gap between what a Teradata warehouse can do, or an IBM Netezza, Oracle Exadata, and Hadoop,” says Forrester’s Gualtieri. “I mean, if you need high concurrency, if you need tons of users and you’re doing really complicated queries that you need to have perform super fast, that’s just like if you’re in a race and you need a race car.” In that case you simply need the best. “So, there’s still a performance gap, and there’s a lot of engineering work that has to be done.”
One development that he finds encouraging for Hadoop’s
growth is the rise of the Hadoop appliance. “Oracle has the appliance,
Teradata has the appliance, HP is coming out with an appliance based
upon their Moonshot, [there’s] Cray Computer, and others,” he notes,
adding Cisco to his list.
What’s happening now is far beyond what might be called
“appliance 1.0 for Hadoop,” Gualtieri says. That first iteration was
simply a matter of getting a cabinet, putting some nodes in it,
installing Hadoop and offering it to clients. “But what they’re doing
now is they’re saying, okay, ‘Hadoop looks like it’s here to stay. How
can we create an engineered solution that helps overcome some of the
natural bottlenecks of Hadoop? That helps IO throughput, uses more
caching, puts compute resources where they’re needed virtually?’ So,
now they’re creating a more engineered system.”
Matchett, too, notes that there’s renewed interest in Hadoop appliances after the first wave. “DDN, a couple years ago, had an HScaler appliance
where they packaged up their super-duper storage and compute nodes and
sold it as a rack, and you could buy this Hadoop appliance.”
Appliances appeal to businesses. Customers like being
able to download Hadoop for free, but when it comes to turning it into a
workhorse, that task (as noted above) calls for expertise. It’s often
easier to simply buy a pre-built appliance. Companies “don't want to go
hire an expert and waste six months converging it themselves,” Matchett
says. “So, they can just readily buy an appliance where it’s all
pre-baked, like a VCE appliance such as Vblock, or a hyper-converged
version that some other folks are considering selling. So, you buy a
rack of stuff…and it’s already running Hadoop, Spark, and so on.” In
short, fewer headaches, more productivity.
Big Data Debate: Hadoop vs. Spark, or Hadoop and Spark?
A discussion – or debate – is now raging within the Big Data community: sure, Hadoop is hot, but now Spark is emerging. Maybe Spark is better – some tech observers trumpet its advantages – and so Hadoop (some observers suggest) will soon fade from its high position.
Like Hadoop, Spark is a cluster computing platform (both Hadoop and Spark are Apache projects). Spark is earning a reputation as a good choice for complicated data processing jobs that need to be performed quickly. Its in-memory architecture and directed acyclic graph (DAG) processing are far faster than Hadoop’s MapReduce – at least at the moment. Yet Spark has its downsides. For instance, it does not have its own file system. In general, IT pros think of Hadoop as best for volume and Spark as best for speed, but in reality the picture isn’t that clear.
Spark’s proponents point out that processing is far faster when the data set fits in memory.
“I think there’s an awful lot of hype out there,”
Gualtieri says. To be sure, he thinks highly of Spark and its
capabilities. And yet: “There are some things it doesn’t do very well.
Spark, for example, doesn’t have its own file system. So it’s like a car
without wheels.”
The debate doesn’t take into consideration how either
Spark or Hadoop might evolve – quickly. For instance, Gualtieri says,
“Some people will say Hadoop’s much slower than Spark because it’s
disc-based. Does it have to be disc-based six months from now? In fact,
part of the Hadoop community is working on supporting SSD cards, and
then later, files and memory. So people need to understand that,
especially now with this Spark versus Hadoop fight.”
The two processing engines are often compared. “Now,
most Hadoop people will say MapReduce is lame compared to the [Spark]
DAG engine,” Gualtieri says. “The DAG engine is superior to MapReduce
because it helps the programmer parallelize jobs much better. But who’s
to say that someone couldn’t write a DAG engine for Hadoop? They could.
So, that’s what I’m saying: This is not a static world where code bases
are frozen. And this is what annoys me about the conversations, is that
it’s as if these technologies are frozen in time and they’re not going
to evolve and get better.” But of course they are – and likely sooner
rather than later.
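The MapReduce-versus-DAG contrast Gualtieri describes can be illustrated structurally. The sketch below is toy Python, not real Hadoop or Spark code: in the “MapReduce” style, a multi-step job runs as separate jobs that materialize every intermediate result (the lists here stand in for writes to HDFS between jobs), while a DAG engine plans the whole chain up front and streams data through it, which is where much of the in-memory speed advantage comes from.

```python
# Toy contrast of the two execution styles (not real Hadoop/Spark code):
# a three-step analysis -- split lines into words, map words to their
# lengths, sum the lengths -- run both ways.

def tokenize(lines):    # step 1: split lines into words
    return [w for line in lines for w in line.split()]

def lengths(words):     # step 2: map each word to its length
    return [len(w) for w in words]

def total(nums):        # step 3: reduce to a single sum
    return sum(nums)

lines = ["hadoop and spark", "big data"]

# MapReduce style: each step is a separate job whose complete output
# is materialized (these lists stand in for HDFS files between jobs).
intermediate1 = tokenize(lines)          # job 1 output "on disk"
intermediate2 = lengths(intermediate1)   # job 2 output "on disk"
mr_result = total(intermediate2)         # job 3

# DAG style: the engine sees all three steps as one planned graph and
# pushes data through the whole chain without per-job materialization.
dag = [tokenize, lengths, total]
data = lines
for stage in dag:
    data = stage(data)
dag_result = data

print(mr_result, dag_result)  # same answer either way
```

The answers are identical – which is Gualtieri’s deeper point: the difference is engine plumbing, not capability, and nothing stops a DAG engine from being written for Hadoop itself.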
Like Hadoop, Spark includes an ever growing array of tools and features to augment the core platform. Source: Forrester Research
Ultimately the Hadoop-Spark debate may not matter,
Matchett says, because the two technologies may essentially merge, in
some form. In any case, “What you’re still going to have is a commodity
Big Data ecosystem, and whether the Spark project wins, or the
MapReduce project wins, Spark is part of Apache now. It’s all part of that
system.”
As Hadoop and Spark evolve, “They could merge. They
could marry. They could veer off in different directions. I think what’s
important, though, is that you can run Hadoop and Spark jobs in the
same cluster.”
Plenty of options confront a company seeking to assemble
a Big Data toolset, Matchett points out. “If you were to white board it
and say ‘I’ve got this problem I want to solve, do I use MapReduce? Do
I use Spark? Do I use one of the other dozen things that are out there?
Or a SQL database or a graph database?’ That’s a wide open discussion
about architecture.” Ultimately there are few completely right and wrong
answers, only a question of which solution(s) work best for a specific
scenario.
Instead of choosing one or the other, many Big
Data practitioners point to a scenario in which Hadoop and Spark work in
tandem to enable the best of both.
Hadoop and Big Data Future Speak: Data Gravity, Containers, IoT
Clearly, there’s been a lot of hype about Big Data, about how it’s the new Holy Grail of business decision making.
That hype may have run its course. “Big data is essentially turning into data,” opines Heudecker. “It’s time to get past the hype and start thinking about where the value is for your business.” The point: “Don’t treat Big Data as an end unto itself. It has to derive from a business need.”
As for Hadoop’s role in this, its very success may contain a paradox. With time, Hadoop may grow less visible. It may become so omnipresent that it’s no longer seen as a standalone tool.
“Over time, Hadoop will eventually bake into your information infrastructure,” Heudecker says. “It should never have been an either/or choice. And it won’t be in the future. It will be that I have multiple data stores. I will use them depending on the SLAs I have to comply with for the business. And so you’ll have a variety of different data stores.”
In Gualtieri’s view, the near-term future of Hadoop is
based on SQL. “What I would say this year is that SQL on Hadoop is the
killer app for Hadoop,” he says. “It’s going to be the application on
Hadoop that allows companies to adopt Hadoop very easily.” He predicts:
“In two years from now you’re going to see companies building
applications specifically that run on Hadoop.”
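The appeal Gualtieri describes is that analysts can reach data stored in Hadoop with the SQL they already know, through engines such as Hive or Impala. As a rough illustration only, substituting Python’s built-in sqlite3 for a real SQL-on-Hadoop engine and using a made-up events table, the queries involved are plain SQL:

```python
import sqlite3

# Stand-in for a table that a SQL-on-Hadoop engine (e.g. Hive or
# Impala) would expose over files in HDFS; sqlite3 is used here
# purely for illustration, and the schema is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0)],
)

# The same kind of aggregate query an analyst would run unchanged
# against a Hadoop-backed table.
rows = conn.execute(
    "SELECT region, SUM(revenue) FROM events "
    "GROUP BY region ORDER BY region"
).fetchall()
# rows == [("east", 150.0), ("west", 80.0)]
```

The query itself is ordinary SQL; what a SQL-on-Hadoop engine changes is only where the table lives and how the work is distributed.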
Looking ahead, Gualtieri sees the massive Big Data potential of the Internet of Things
as a boost for Hadoop. For instance, he points to the ocean of data
created by cable TV boxes. All that data needs to be stored somewhere.
“You’re probably going to want to dump that in the most
economical place possible, which is HDFS [in Hadoop],” he says, “and
then you’re probably going to want to analyze it to see if you can
predict who’s watching the television at that time, and predict the
volumes [of user trends], and you’ll probably do that in the Hadoop
cluster. You might do it in Spark, too. You might take a subset to
Spark.”
He adds, “A lot of the data that’s landed in Hadoop has
been very much about moving data from data warehouses and transactional
systems into Hadoop. It’s a more central location. But I think for
companies where IoT is important, that’s going to create even more of a
need for a Big Data platform.”
As an aside, Gualtieri made a key point about Hadoop and the cloud, pointing to what he calls “the myth of data gravity.”
Businesses often ask him where to store their data: in the cloud?
on-premises? The conventional wisdom is that you should store your data
where you handle most of your analytics and processing. However,
Gualtieri disagrees – this attitude is too limiting, he says.
Here’s why data gravity is a myth: “It probably takes
only about 50 minutes to move a terabyte to the cloud, and a lot of
enterprises only have a hundred terabytes.” So if your Hadoop cluster
resides in the cloud, moving your existing data there is a one-time job
of a few days at most, after which it’s just incremental updates. “I’m
hoping companies will understand this, so that some of them can actually
use the cloud, as well,” he says.
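Gualtieri’s arithmetic can be sanity-checked. His figure of about 50 minutes per terabyte implies an effective throughput of roughly 2.7 Gbit/s (a number back-derived here from his estimate, not one he states), and at that rate even a 100 TB migration is a bounded one-time job:

```python
def transfer_hours(terabytes, gbit_per_sec):
    # Decimal units: 1 TB = 8e12 bits.
    bits = terabytes * 8e12
    return bits / (gbit_per_sec * 1e9) / 3600.0

# Assumed effective rate; ~50 min/TB matches Gualtieri's figure.
rate = 2.7  # Gbit/s

one_tb_minutes = transfer_hours(1, rate) * 60    # ~49 minutes
hundred_tb_hours = transfer_hours(100, rate)     # ~82 hours
```

That works out to roughly 82 hours, a few days, for 100 TB moved sequentially; parallel transfer streams would shorten it further, and after the initial move only incremental updates remain.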
When Matchett looks to the future of Hadoop and Big
Data, he sees the effect of convergence: any number of vendors and
solutions combining to handle an ever more flexible array of
challenges. “We’re just starting to see a little bit [of convergence]
where you have platforms, scale-out commodity platforms with data
processing, that have increasing capabilities,” he says. He points to
the combination of MapReduce and Spark. “We also have those SQL
databases that can run on these. And we see databases like Vertica
coming in to run on the same platforms…Greenplum from
EMC, and some from Teradata.”
He adds: “If you think about that kind of push, the data
lake makes more sense not as a lake of data, but as a data processing
platform where I can do anything I want with the data I put there.”
The future of Hadoop and Big Data will contain a
multitude of technologies all mixed and matched together – including
today’s emerging container technology.
“You start to look at what’s happening, with workload
scheduling and container scheduling and container cluster management,
and there’s Big Data from this side coming in and you realize: well,
what MapReduce is, it’s really a Java job that gets mapped out. And what
a container really is, it’s a container that holds a Java application…
You start to say, we’re really going to see a new kind of data center
computing architecture take hold, and it started with Hadoop.”
How will it all evolve? As Matchett notes, “the story is still being written.”