Conversation with Cloudera on Hadoop and Big Data Trends

Editor’s note: Cloudera recently launched Sentry to bolster the security aspects of its Hadoop platform and also introduced advancements in the platform’s search functionality for structured and unstructured data. We spoke with Charles Zedlewski, Cloudera’s vice president, products, about trends in Big Data solutions and how the Cloudera platform is evolving.
Are Hadoop and the Cloudera platform one and the same thing?
Charles Zedlewski: I think of them as one and the same. Hadoop literally defined is Apache MapReduce and Apache HDFS, and our platform includes both of these. Not only that, but we employ many developers who build and improve upon HDFS and MapReduce; so we are a Hadoop distributor and a Hadoop developer. But what people consume today is a larger platform that has more than a dozen different open source components. The original Hadoop is a smaller and smaller fraction of the overall platform, but it’s still the heart of it. So choosing whether to use Hadoop and choosing whether to use Cloudera amount to the same decision.
Is cost the primary driver for adoption of Cloudera over the past few months?
Charles Zedlewski: Cost is definitely an aspect of our software that gets people’s attention and it often starts the conversation. But what really drives people to use the technology has more to do with how flexible the system is. Because of its flexibility, our platform supports certain families of applications and enables a lot of workloads much better than traditional databases can do.
Most people think the cost of a system that stores and analyzes data is tens of thousands of dollars per terabyte. It gets their attention when they hear that Hadoop is more like $1,000 per terabyte.
When companies adopt Hadoop they typically have a strategic objective they want to accomplish that takes advantage of Hadoop’s flexibility. That objective may take longer to achieve, but in the meantime they can gain tactical savings on infrastructure by moving applications or workloads off expensive data-management platforms to Cloudera.
You recently developed the Sentry feature in the area of security. Tell me about that.
Charles Zedlewski: People have the expectation that they need database-style security with Hadoop, meaning they could secure data per table, column or range of rows. That was a gap that we had to close, so we added a security module that does that. We had a lot of other security features before that such as audit and authentication features. But the need to be able to selectively hide portions of datasets from certain sets of users and groups was missing.
Now Sentry brings database-style security to Hadoop data. Over time we’ll also expand that ability to select the tables, columns and rows to work with other frameworks that we have in our platform.
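To make the idea of hiding portions of a dataset from certain groups concrete, here is a minimal sketch of column-level filtering in Python. It is not Sentry's API or configuration format; the names GRANTS, visible_columns and select_for_group, and the sample data, are hypothetical and chosen only for illustration. Row-range restrictions are left out to keep the sketch short.

```python
# Hypothetical grants: group -> table -> columns that group may read.
GRANTS = {
    "analysts": {"orders": {"order_id", "amount"}},
    "finance": {"orders": {"order_id", "amount", "card_number"}},
}

# Hypothetical sample table.
ORDERS = [
    {"order_id": 1, "amount": 9.99, "card_number": "4111-0000"},
    {"order_id": 2, "amount": 25.00, "card_number": "4222-0000"},
]


def visible_columns(group, table):
    """Return the set of columns the group may read (empty if none)."""
    return GRANTS.get(group, {}).get(table, set())


def select_for_group(group, table, rows):
    """Return the rows with every column the group cannot read removed."""
    allowed = visible_columns(group, table)
    if not allowed:
        raise PermissionError(f"{group!r} has no access to {table!r}")
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]


analyst_view = select_for_group("analysts", "orders", ORDERS)
finance_view = select_for_group("finance", "orders", ORDERS)
```

In this sketch the analysts group sees order IDs and amounts but never the card number, while the finance group sees all three columns. A group with no grant on the table is refused outright.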
Is the Sentry feature unique among your competitors in the market?
Charles Zedlewski: Today we are the only one with this capability, and we developed it. But by design we released it in open source so our competitors can adopt it. We know this is something that customers need to feel comfortable in adopting the software. But we also know that if we were the only one that had it, pretty soon our competitors would be compelled to build something similar. Then there would be three or four different competing security frameworks and systems, which would result in a lot of unnecessary confusion for users. It would create a lot of fragmentation without creating a lot of customer benefit.
So we intentionally decided to ship this as a net new open source project so that we could drive some standardization in this part of our industry. Now our competitors can choose to either adopt the standard or they can explain to customers why they don’t have fine-grained security.
You also recently made some advancements in the area of search. Please explain this differentiation.
Charles Zedlewski: It allows customers to take any data that they might acquire — whether it’s structured or unstructured — and we can make it searchable. We build search indices in near real time, which we can apply to very large datasets (multi-hundred terabyte or petabyte scale) in Big Data platforms. Essentially we give users a way to interactively search all the Big Data in their clusters. In doing so, we create a new way for people to access data, and it’s one that non-technical people can use.
How does the Cloudera platform accelerate time to value?
Charles Zedlewski: Because of the way we designed the software, people can get a pretty substantial system up and running and doing a workload much faster than with traditional distributed systems. First the software is open source, which means you can get a copy of it and start working with it without having to talk to a Cloudera sales rep or write a check. You can try it, do a proof of concept, learn a lot of the technology and do all of that without having to go through a multi-month vendor-evaluation process. That’s pretty appealing.
Next, we’ve made it relatively easy to deploy. You can go from a software download to a 10-node distributed data-management system in under an hour. By contrast, if you tried to do that with a traditional database, you would need a lot of expertise to do something comparable.
In addition, a lot of analytic databases have a lot of up-front cost invested in schema design and schema definition. In Hadoop you could do that, but you don’t have to do that. You can defer a lot of that work if it’s not necessary. So there’s a lower up-front cost in terms of time and effort to develop a schema and data model. You can acquire data and apply the schema later. That’s another great example of shortening time to value.
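The "acquire data and apply the schema later" idea can be sketched in a few lines of Python. This is an illustration of the general pattern, not of any Cloudera or Hadoop API; RAW_LINES, SCHEMA and read_with_schema are hypothetical names, and the records are made up.

```python
import json

# Raw records are stored exactly as they arrived; nothing is validated on write.
RAW_LINES = [
    '{"user": "alice", "amount": "12.50", "country": "US"}',
    '{"user": "bob", "amount": "7", "referrer": "ad-42"}',
]

# A schema chosen later, at read time: field -> (converter, default value).
SCHEMA = {
    "user": (str, ""),
    "amount": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto the schema given at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            # Fields missing from a record fall back to the schema default;
            # fields not named in the schema are simply ignored.
            row[field] = convert(record[field]) if field in record else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, SCHEMA)
total_amount = sum(row["amount"] for row in rows)
```

The two records have different fields, yet both were stored without any up-front modeling. A different question later on could reuse the same raw lines with a different schema, which is where the deferred design effort pays off.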
Is time to value a primary concern of potential customers these days?
Charles Zedlewski: Yes, it’s a concern; but the main concerns have less to do with time to value and more to do with all the traditional issues around adopting new technologies, especially concerns around skills and know-how.
For example, people think: “On paper Hadoop is clearly better in many ways than some of the other choices we had. But we know our old choices really well and built up more than a decade of experience working with a particular technology. How are we going to realize all these great benefits if we don’t have the skills?”
That’s probably the number-one issue that people wrestle with — the change management and skills. At Cloudera we try to enable that. We do a brisk business in training. We train about 7,000 people per year. And we’ve built training tracks for all of the roles that might work with our software — if you’re formerly a DBA, we have a track to Hadoop for you; if you’re a business analyst, we have a track for what you need to learn incrementally to be productive with Hadoop.
But people can always download software faster than they can learn it and become productive with it. As long as that’s the case, there is going to be a temporary period where there’s a gap between the skill base of the population and the technology. Eventually it will come back into balance. In the meantime, we’re growing this population of trained, skilled users all the time.
How is the proliferation of cloud-based apps and workloads impacting Big Data solutions?
Charles Zedlewski: There is definitely more demand for us to offer our software in cloud deployment models. So we’ve been improving the cloud experience of our software, and that will continue to get better over time.
Today a lot of the data that our customers generate is data they generate inside their own data centers; therefore, that’s where you find a lot of our Hadoop clusters. But we know that more apps are moving to the cloud and, as a result, more data is moving to the cloud.
From a global perspective, are organizations in other countries using Cloudera or Hadoop as much as those in the United States?
Charles Zedlewski: Absolutely. That’s the thing about open source; it can spread to far-flung places pretty quickly. That’s been one of the many things we’ve been working on this year: expanding internationally just to keep up with the demand because there is a lot of organic adoption of our open source software in geographies outside the United States. That includes Europe, Asia, the Middle East, India and China.
Someone did a study recently looking at the prevalence of Hadoop distributions in Europe and found that Cloudera was far and away the most popular one. We have about 20 employees now in Europe, and that’s growing quickly as we try to keep up with demand. We’ve had an office in Tokyo for nearly three years, and that’s expanding. We also have customers in Australia and plan to open an office in Sydney later this year. Pretty much all of the large Indian systems integrators are our partners today. And there is a lot of Hadoop adoption on the China front.
What is on the horizon for the next two years? How do you think Hadoop will change during that time, and why?
Charles Zedlewski: Three things will drive our investments in Hadoop for the next two years. We want to make it a better platform to host more data and more applications. In order to do that, we need to make it functionally richer. So we’re going to broaden the SQL support, for example, so we can handle more kinds of SQL applications. We’re going to enhance the search functionality to handle more kinds of search applications. In addition, we will continue to add more and more critical enterprise functionality and features.
And then the last thing is that we collaborate very closely with our commercial partners to make sure there is a rich ecosystem of applications that run on this platform. So we’ll make the software more extensible and more hospitable to more third-party applications. There are more than 200 third-party tools and applications built on the Cloudera platform today, and we want to get to 2,000. We need to do work on our end as the platform company to make that work.
We actually have more than 700 companies in our partner program today, so we’re a platform in the truest sense of the word. Some of those are systems companies and a lot of those are services companies that build custom applications for their customers. But at this point, in excess of 200 of them are software companies, which means there are all kinds of tools and applications that extend the usefulness of this platform in dozens of different directions.
Growing and cultivating this aspect of our platform is a major differentiator for Cloudera today and it is a big part of our long-term future. We have more to do, but we’re off to a good start. We already have a larger commercial ecosystem than a lot of the leading analytic databases.
If you grow that large, I assume the vendor or provider landscape in the area of Big Data and search capabilities will change drastically in a couple of years.
Charles Zedlewski: I think you’re right. There used to be a lot of freestanding search companies, which the incumbent vendors like IBM, Oracle and Microsoft acquired over time. But that resulted in point products that are part of the portfolios of the large vendors. Our vision for search is a bit different.
We reimagined search as not a separate point product but rather a feature of a larger platform. We made search so that you get it all in a single system. So the same system that you have for your predictive analytics, data processing and BI is the same system you can use for your interactive search. And you can do all that in a single environment, whereas the old model said you need to buy one of everything and install them all independently. We made something that is drastically simpler to use.
If we do our job right, it will motivate more of the search solutions to figure out how to fit into a unified platform as opposed to being a point solution that is distinct from the databases and analytics systems that customers already own.
Charles Zedlewski is vice president, products at Cloudera, where he is responsible for the future direction of Cloudera’s product portfolio. Prior to Cloudera, Charles held management roles at SAP, BEA Systems and several venture-backed startups. You can follow Charles on Twitter (@zedlewski).
Kathleen Goolsby is managing editor of SandHill.com.
