Every CIO wants to know whether the company's infrastructure can handle data growth, with the world's data projected to reach 40 zettabytes by 2020. In a recent Gartner survey, 47 percent of respondents ranked data growth as the number-one infrastructure challenge for data centers.
When data sets become too large, application performance slows and infrastructure struggles to keep up. Data growth drives cost and complexity across the board, including power consumption, data center space, performance and availability.
System availability is impacted as batch processes are no longer able to meet scheduled completion times. The “outage windows” necessary to convert data during ERP upgrade cycles may extend from hours to days.
Other critical processes such as replication and disaster recovery are affected as well, because ever-larger data sets are harder to move and copy.
Left unchecked, data growth may also create governance, risk and compliance challenges. HIPAA, PCI DSS, FISMA and SAS 70 all require that organizations establish frameworks for data security and compliance.
How can data be managed such that the inactive data doesn’t clog the infrastructure and impact critical processing?
One useful correlation: the value of data declines with age, because older data becomes less active.
ILM: best practice data management
Information Lifecycle Management (ILM) is a data management best practice to manage the life cycle of data from creation to deletion and disposal.
The goals of ILM are:
- Optimize application performance
- Manage data security, risk and compliance
- Reduce infrastructure costs
ILM achieves these goals by moving data to the most appropriate infrastructure tier based on retention policies such as the age of the data. Because older data is accessed less frequently, it is less valuable and less deserving of limited tier-one performance and capacity.
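An age-based retention policy like the one described above can be sketched in a few lines of Python. The tier names follow the article; the specific age cutoffs below are hypothetical, chosen only to illustrate the idea.

```python
from datetime import date

# Hypothetical retention policy: maximum age (in days) for each tier.
# The cutoffs are illustrative, not prescriptive.
TIER_POLICY = [
    (365 * 3, "tier-1"),   # up to ~3 years: high-performance infrastructure
    (365 * 5, "tier-2"),   # up to ~5 years: in-line ILM partitions
    (365 * 7, "tier-3"),   # up to ~7 years: tightly coupled archive
]

def assign_tier(created: date, today: date) -> str:
    """Map a record's creation date to an infrastructure tier."""
    age_days = (today - created).days
    for max_age, tier in TIER_POLICY:
        if age_days <= max_age:
            return tier
    return "tier-4"  # anything older lands in the bulk (Hadoop) tier

print(assign_tier(date(2024, 1, 1), today=date(2025, 1, 1)))  # tier-1
```

In practice such a policy would be enforced by archiving software rather than hand-written code, but the decision logic is the same: data is placed by age, and only the youngest, most active data earns tier-one capacity.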
Tier one infrastructure is high cost and may include multi-processor servers with large flash memory arrays and high-speed storage area networks. Data positioned on tier-one infrastructure is generally three years old or less.
Older, less active data is assigned to lesser infrastructure tiers to reduce overall costs while still providing proper access to the data, albeit not at tier-one performance levels.
Hadoop reinvents ILM and delivers dramatic ROI
Apache Hadoop is a free, open source computing framework designed to run on low-cost commodity infrastructure at a lower tier while still delivering massive scalability and performance.
Using the MapReduce programming model to process large data sets across distributed compute nodes in parallel, Hadoop delivers highly scalable workload performance and very low cost, bulk data storage. All this means that Hadoop offers dramatic cost savings over traditional tier-one infrastructure.
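The MapReduce model mentioned above is simple to illustrate: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The toy word count below is a minimal sketch of that model in plain Python, not Hadoop's actual Java API.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as Hadoop's shuffle/sort step does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop scales out", "Hadoop stores bulk data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"])  # 2
```

In a real cluster the map and reduce phases run in parallel across many nodes, with the framework handling the shuffle, scheduling and fault tolerance; that distribution is what gives Hadoop its scalability over large data sets.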
Consider the following comparison:
According to Monash Research, the cost of tier-one database infrastructure is over $60,000 per TB. By contrast, 1TB of S3 bucket storage at Amazon Web Services costs $30 per month according to its recent price list. Amortized over three years ($30 × 36 months, or roughly $1,080 per TB), and taking S3 as a proxy for low-cost bulk storage, the Hadoop tier works out to roughly 55.5X cheaper than tier-one infrastructure.
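The arithmetic behind that ratio is easy to verify. The calculation below assumes a three-year amortization window for the monthly S3 price, since that is the period under which the figures in the comparison line up.

```python
# Figures from the comparison above.
tier_one_per_tb = 60_000      # one-time cost per TB (Monash Research)
s3_per_tb_month = 30          # AWS S3 price per TB per month
months = 36                   # assumed three-year amortization window

s3_three_year = s3_per_tb_month * months   # $1,080 per TB over three years
ratio = tier_one_per_tb / s3_three_year
print(f"{ratio:.1f}x")  # 55.6x
```

A different amortization window would of course change the ratio, but even over much longer periods the gap remains well over an order of magnitude.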
Benefits of improved enterprise application tiering
Enterprise applications such as ERP, CRM and HCM represent an excellent opportunity for improving performance and reducing costs through application tiering with Apache Hadoop.
Enterprise archiving follows an ILM approach to improve performance and reduce costs by supporting four processing tiers integrated with Hadoop:
- Tier one: Highest performance infrastructure reserved for high value, active data. Large flash arrays manage OLTP processing loads in-memory for maximum performance.
- Tier two: In-line ILM partitions (still running on tier-one infrastructure) allow a table or index to be subdivided into ranges based on parameters such as the age of the data. Older, less valuable data may be placed in its own partitions so that it does not add processing overhead, and each partition may be assigned its own storage characteristics.
- Tier three: Data that is moved and purged from the source database is called an archive. A tightly coupled archive retains native access from the application as well as the ability to de-archive data back into the source production database if necessary.
- Tier four: Hadoop provides a point-in-time snapshot of a business record. Because the data represents a complete business object, decoupled from the application, it no longer needs to be upgraded in sync with the application. Next-generation analytics tools, text search and traditional structured query tools all provide enhanced access to the data.
The benefits of enterprise application tiering are significant in terms of improved infrastructure performance, reduced costs and higher availability. By positioning data based on business value, infrastructure utilization becomes more efficient.
As a next-generation enterprise data management platform, Apache Hadoop has reinvented ILM and reintroduced extreme ROI back into data archiving projects.
Sai Gundavelli is the founder and CEO of Solix Technologies, Inc., responsible for the company's overall vision and strategic direction. Prior to founding Solix, he spearheaded several strategic initiatives in enterprise application areas at companies including Cisco Systems and Arix Corp. He is a business and technology thought leader and a distinguished speaker in many forums.