Demystifying Hadoop Security

By Tripp Smith | May 4, 2015

Hadoop has established a firm foothold as a high-performance, low-cost hub for data management and processing. It has allowed organizations to expand analytic and data management capabilities and capacity to a scale that was at best impractical and, in many cases, impossible with conventional technology. At the same time, this rapid increase in scale and performance was not initially matched by features addressing security, governance and compliance – all table stakes for enterprise data management. As Hadoop has expanded from processing web-scale “exhaust” data to enterprise data management applications, it is no longer viable to grant implicit access to all data within the cluster to any user who can reach the cluster and run the Hadoop client binaries.

As Hadoop became more widespread as a centralized data lake or data hub within the enterprise, a variety of open source, vendor-supported and proprietary solutions emerged to meet security and compliance needs. Hadoop security components are evolving rapidly, both to simplify deployment and management and to introduce new features that meet strict compliance requirements.

Establishing a secure platform is an essential first step for proactive stewardship of managed enterprise data assets. Effectively implementing secure Hadoop requires:

  • An understanding of the core components of Hadoop security
  • A plan to use the core components effectively and efficiently
  • Data strategy alignment to evolve platform security capabilities and roadmaps

Hadoop security evolution 

Hadoop had proven successful at moving petabytes of data on thousands of nodes before the question of security was even addressed in earnest in the October 2009 document, “Hadoop Security Design” (O’Malley, Zhang, Radia, Marti, Harrell; HADOOP-4487). The document discussed in detail the considerations for security risks and requirements and laid out a roadmap for implementing secure Hadoop. This initiative focused on the introduction of centralized authentication through Kerberos, which is a type of secure authentication protocol already embedded within tools like Microsoft Active Directory. 

This represented a critical next step in the evolution of Hadoop to an enterprise-strength data management platform. But at the same time, it established a principal assumption that “for backwards compatibility and single-user clusters, it will be possible to configure the cluster with the current style of security [simple authentication].” 

By 2013, Hadoop security had evolved to incorporate strong authentication with Kerberos, but the platform still lacked many of the security features expected of enterprise software even as it grew into a massive ecosystem of components and frameworks. Implementing strong security remained a difficult, manual and error-prone process.

The increasing need for security and compliance within the context of big data led to an increased focus on making Hadoop enterprise-strength. In response, Intel, with participation from other key vendors, launched “Project Rhino,” a collective initiative to simplify implementation, address gaps in the existing security model and move toward a comprehensive security framework for Hadoop.

Since 2013, Project Rhino has addressed a number of key security requirements, such as cell-based access control for HBase (HBASE-6222), transparent encryption for HBase (HBASE-7544) and HDFS (HADOOP-10150), and flexible centralized user group mapping (HADOOP-8943). The initiative also continues to make progress on additional features to simplify security through a unified authorization framework. Further enhancements have come through vendor-supported projects such as Apache Ranger and Cloudera Manager and Navigator, which automate deployment, simplify configuration and management of security settings, and provide finer-grained access controls for data within the cluster.

Hadoop authentication in brief 

Initially, Hadoop’s sole authentication mode, simple authentication, was neither secure nor strongly enforced. It assumed trust of all users with access to the cluster network and provided authorization through file system permissions, largely to prevent accidental deletion or modification of data. Under simple authentication, a user with access to the cluster can easily impersonate other users, making authorization controls largely irrelevant in the event of malicious use.

Hadoop continues to provide backwards compatibility for simple authentication, which remains the default setting of the hadoop.security.authentication configuration property.

With simple authentication, Hadoop first attempts to read the user name from an environment variable (HADOOP_USER_NAME); otherwise, it falls back to the client OS user, determined by an OS- and Java-version-specific implementation. This approach to authentication is inherently overly trusting and insecure. While simple authentication is useful for training and prototype sandbox environments, it is a non-starter for any type of confidential or protected data that might be managed in an enterprise data lake or data hub.
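As a rough illustration of how thin this protection is, the short Java sketch below connects to a cluster running simple authentication; the fs.defaultFS host is hypothetical, and the identity the cluster sees is simply whatever the client asserts (for example, via the HADOOP_USER_NAME environment variable).

    // Minimal sketch: what a simple-authentication cluster believes about the caller.
    // The fs.defaultFS host below is a hypothetical placeholder.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SimpleAuthDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical
        conf.set("hadoop.security.authentication", "simple");         // the default

        // The cluster accepts whatever identity the client asserts; launching the
        // JVM with HADOOP_USER_NAME=hdfs is enough to "become" the superuser.
        System.out.println("Cluster sees me as: "
            + UserGroupInformation.getCurrentUser().getUserName());

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Home directory: " + fs.getHomeDirectory());
      }
    }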

Secure Hadoop manages users centrally through Kerberos, ensuring that users are who they say they are by validating their identity against a centrally managed store. Kerberos authentication and LDAP group assignment are built into common enterprise identity management solutions like Active Directory, simplifying centralized user management. Improved identity management features and easier implementation of authentication components established the building blocks of secure enterprise Hadoop implementations. 
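The sketch below shows the client-side pattern on a Kerberized cluster using Hadoop’s UserGroupInformation API; the principal and keytab path are hypothetical, and in practice these settings normally arrive through core-site.xml rather than being set in code.

    // Minimal sketch: authenticating to a secured cluster with a Kerberos keytab.
    // The principal and keytab path are hypothetical placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLogin {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos"); // usually set in core-site.xml
        conf.set("hadoop.security.authorization", "true");
        UserGroupInformation.setConfiguration(conf);

        // Identity is proven against the KDC, not asserted by the client.
        UserGroupInformation.loginUserFromKeytab(
            "etl-svc@EXAMPLE.COM",                     // hypothetical service principal
            "/etc/security/keytabs/etl-svc.keytab");   // hypothetical keytab location

        System.out.println("Authenticated as: "
            + UserGroupInformation.getLoginUser().getUserName());
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
      }
    }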

Platform architecture assessment and planning should take into account enterprise standards, existing security and identity management components and long-term platform goals to effectively take advantage of these improved authentication security features. 

Cluster multi-tenancy: a case study in authorization 

As the Hadoop ecosystem matured, the variety of data, users and workloads increased – for example, a broader range of ad hoc business analysis by users not involved in low-level data preparation. Hadoop file system permissions and access control lists work well for separating authorization between groups of users developing separate low-level data applications, but they often fail to provide the tools necessary to efficiently enable the end-user consumers of those analytics.

Security features that integrate finer-grained authorization are enabling enterprises to maximize the value of the Hadoop environment by providing secure column- and row-level access to shared datasets without exposing the entire set of information to all users. In a common use case, data ingested into HDFS is processed and integrated using one or more lower-level Hadoop platform components and is finally structured into tables within Hive, which provides a SQL access layer for end users. Rather than creating separate outputs for each consumer group or user – impractical for ad hoc analysis – fine-grained authorization plugins such as Apache Ranger and Apache Sentry provide the ability to establish role-based security for data exposed through HiveServer2.
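As a sketch of the pattern (the host, table, view, role and group names are hypothetical, and exact GRANT syntax varies by authorization plugin and version), role-based access can be administered directly through HiveServer2 over JDBC, with the plugin enforcing the resulting policies for every end-user query:

    // Minimal sketch: granting role-based, column-limited access through HiveServer2.
    // Assumes a Kerberized HiveServer2 and a valid Kerberos ticket for an admin user;
    // all object, role and group names are hypothetical.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class GrantAnalystAccess {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hs2.example.com:10000/default;"
            + "principal=hive/_HOST@EXAMPLE.COM";      // hypothetical host and principal
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
          // Expose only curated columns to ad hoc analysts via a view, then grant
          // read access to a role mapped to their LDAP/AD group.
          stmt.execute("CREATE VIEW IF NOT EXISTS sales_summary AS "
              + "SELECT region, product, revenue FROM sales");
          stmt.execute("CREATE ROLE analyst");
          stmt.execute("GRANT SELECT ON TABLE sales_summary TO ROLE analyst");
          stmt.execute("GRANT ROLE analyst TO GROUP analysts");
        }
      }
    }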

To take advantage of these features, planning assessments should consider roles for each separate producer and consumer group using the cluster, deciding which groups – data engineers or data scientists, for example – will rely on coarser file system permissions and which users should be limited to finer-grained access.

Auditable data access: there’s an app for that 

A recent, highly publicized data breach was uncovered when a database administrator noticed, in the database audit logs, an unrecognized query running under elevated credentials. In an era of advanced persistent threats, auditing data access enables administrators to monitor and take action against unauthorized access. Hadoop ecosystem components such as Apache Ranger and Cloudera Navigator have emerged to provide fine-grained auditing that can be used to monitor, surface and alert on atypical activity within the cluster.

Increasingly, administrators and security professionals are leveraging the scalability and performance of Hadoop to facilitate analysis of audit logs, both within the Hadoop ecosystem and in the wider enterprise. Planning ahead for access auditing creates another powerful tool to secure enterprise data. 
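As a minimal illustration (the log path and field layout are assumptions that vary by distribution and version, and production deployments would typically stream audit events into Ranger, Navigator or a SIEM rather than scanning files), even a simple pass over the NameNode audit log can surface denied access attempts against a sensitive dataset:

    // Minimal sketch: flag denied HDFS operations against a sensitive path.
    // The log location and the "allowed="/"src=" fields are assumptions that vary by version.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class AuditScan {
      public static void main(String[] args) throws IOException {
        try (Stream<String> lines =
                 Files.lines(Paths.get("/var/log/hadoop-hdfs/hdfs-audit.log"))) {
          lines.filter(l -> l.contains("allowed=false"))     // denied operations only
               .filter(l -> l.contains("src=/data/finance")) // hypothetical sensitive dataset
               .forEach(System.out::println);                // candidates for alerting
        }
      }
    }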

Taking action

While this article addresses only a subset of the considerations for implementing a secure Hadoop ecosystem, a few simple tips can dramatically improve a security plan, streamline data onboarding and reduce security risks and administration headaches:

  1. Align Hadoop components to leverage and maximize the impact of enterprise security investments for centralized authentication, user management and group assignment.
  2. Plan user and role requirements for both producers and consumers of data.
  3. Integrate security and data governance planning processes into Hadoop development life cycles for ongoing stewardship.
  4. Understand and define security requirements prior to onboarding new data or ecosystem components, mapping them to the overall platform security architecture.
  5. Proactively audit data access through monitoring and alerts. 

Developing a plan and processes for Hadoop security, either at the outset of a Hadoop deployment or when initiating new data management or analytics initiatives, is an opportunity to take security questions off the table. An effective Hadoop security and governance plan simplifies processes to make doing the right thing easy, enabling enterprises to take full advantage of Hadoop to achieve analytic agility and scale. 

Tripp Smith is CTO at Clarity Solution Group, a recognized data and analytics consulting firm. Contact him at tsmith@clarity-us.com.