With storage infrastructure costs spiraling out of control, enterprises are looking to the cloud for relief. Cloud storage offers virtually limitless capacity on a pay-as-you-go basis, but because the cloud stores data as objects rather than files, enterprises need a cloud storage system to perform the translation. One of the key concerns in adopting a cloud storage system is performance: will users be able to access files in the cloud as quickly as they can from a local filer?
The problem with most cloud storage systems, such as cloud gateways, is that they are packaged as monolithic appliances: a single local cache and a single interface to the cloud through which all data and metadata flow. Like any such device, these appliances become a bottleneck once the network connection to the cloud is saturated or the local cache fills up.
There are two keys to solving this problem: (1) implementing the cloud storage system as software on virtual machines (VMs) and (2) deploying multiple tiers of caching to optimize access speeds throughout the data path from the cloud storage itself to the user (or “endpoint”) device.
By deploying the cloud storage system as one or more virtual machines, companies can scale performance linearly by adding VMs as needed, and they aren't saddled with the capital and operational expenses of buying and maintaining an appliance in each branch office or other remote location.
Caching also plays a key role in optimizing end-user performance. Multi-tier caching accelerates access times at several points along the pathway from user to storage platform: at the endpoint, at the branch office site, at the CDN and at the cloud storage system itself. Critically, caching at every stage must not compromise data security in-flight or at rest.
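At a high level, a multi-tier read path behaves like a chain of caches consulted from closest to farthest, with the object store as the authoritative backstop. The sketch below is purely illustrative (the `CacheTier` class and `read_block` helper are invented for this example, not part of any particular product), but it shows the essential pattern: check each tier in order, and on a hit, back-fill the faster tiers that missed so the next read stays local.

```python
class CacheTier:
    """Minimal in-memory stand-in for one tier (endpoint, site, or CDN)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, value):
        self._store[key] = value


def read_block(block_id, tiers, object_store):
    """Consult each cache tier in order (closest to the user first);
    on a hit, back-fill the faster tiers that missed."""
    missed = []
    for tier in tiers:
        data = tier.get(block_id)
        if data is not None:
            for m in missed:          # populate the faster tiers
                m.put(block_id, data)
            return data
        missed.append(tier)
    data = object_store[block_id]     # authoritative copy in the cloud
    for m in missed:
        m.put(block_id, data)
    return data
```

Note that a cold read populates every tier it passed through, so subsequent reads of the same block never leave the endpoint.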
The endpoint cache — This is an encrypted, adaptive cache that uses the endpoint device's own local storage and memory to store frequently used data and metadata. Combined with granular global per-share deduplication, endpoint caching minimizes the number of times a client is required to communicate with the object store to retrieve or propagate data. Depending on the workload, deduplication ratios can exceed 90%.
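Content-addressed chunking is one common way such global deduplication is implemented: each chunk is named by a cryptographic hash of its contents, so a client can skip uploading any chunk the object store already holds. The following is a simplified sketch under stated assumptions (fixed-size chunks and an in-memory dict standing in for the store; production systems typically use variable-size chunking and a remote existence check):

```python
import hashlib


def chunk_ids(data, chunk_size=4096):
    """Split data into fixed-size chunks and name each by its SHA-256
    digest (content addressing); identical chunks get identical IDs."""
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk


def upload(data, remote_chunks):
    """Send only the chunks the store doesn't already hold; return the
    number of chunks actually transferred."""
    sent = 0
    for cid, chunk in chunk_ids(data):
        if cid not in remote_chunks:
            remote_chunks[cid] = chunk
            sent += 1
    return sent
```

Because chunk IDs are derived from content, duplicates are detected globally: a chunk uploaded by one user on one share never needs to be sent again by anyone.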
Site cache — While endpoint caching provides the performance and latency characteristics required for most users and workloads in most enterprises, there are scenarios in which a large, dedicated cache onsite at a branch office can add significant value.
For example, there may be a very large branch with a relatively "thin" network connection to the object store and a large number of users who frequently share and collaborate around common data sets. A site cache — deployed as a VM at the branch office — adds another level of caching in the path between the cloud and the endpoint.
CDN caching — Increasingly, cloud service providers and geo-dispersed private clouds are adding CDN (content distribution network) capabilities to their infrastructure. The concept behind CDNs is straightforward: by replicating data across one or more CDN edge nodes, geographically distributed workloads can leverage greatly enhanced read performance by accessing the required data from the nearby CDN node rather than the central object store.
Moreover, with some CDNs such as Amazon CloudFront, the service provider offers a high-speed backbone connecting the CDN edge nodes to the central object store, resulting in improved performance even on the first read (since the time it takes for data to traverse the network from the object store to the edge node is minimal, leaving only the “last mile” to the user).
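The first-read benefit can be seen with some back-of-the-envelope arithmetic. The latency figures below are purely illustrative assumptions, not measurements of any provider's network; the point is only that a fast backbone hop plus a short last mile can beat a long haul over the public internet.

```python
# Hypothetical one-way latencies in milliseconds (illustrative only).
WAN_CLIENT_TO_ORIGIN = 120    # long haul over the congested public internet
BACKBONE_ORIGIN_TO_EDGE = 25  # provider's high-speed backbone
LAST_MILE_CLIENT_TO_EDGE = 10 # nearby CDN edge node


def cold_read_direct():
    """First (uncached) read straight from the central object store."""
    return 2 * WAN_CLIENT_TO_ORIGIN  # request + response


def cold_read_via_edge():
    """First read through a nearby edge node connected by the backbone."""
    return 2 * (LAST_MILE_CLIENT_TO_EDGE + BACKBONE_ORIGIN_TO_EDGE)
```

Under these assumed numbers the edge path wins even when the edge node has nothing cached, and every subsequent read pays only the last-mile cost.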
Metadata server caching — Leading software-defined storage solutions separate data from metadata and provide a metadata server, which hosts all metadata, communicates with endpoint agents and arbitrates data operations. The speed and latency of metadata transfers are just as important as those of data transfers, so such solutions should have a purpose-built caching mechanism to optimize metadata performance (and therefore overall performance) across deployments.
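One simple way to picture a metadata-side cache is a client agent that keeps recent lookup results locally for a short time-to-live, so repeated stats of the same path don't each cost a round-trip to the metadata server. The `MetadataCache` class and TTL policy below are hypothetical illustrations (a real system would also invalidate or revalidate entries on writes):

```python
import time


class MetadataCache:
    """Hypothetical client-side cache of metadata-server responses.
    Entries are served locally until a short TTL expires."""

    def __init__(self, server, ttl=5.0):
        self._server = server   # anything with a stat(path) method
        self._ttl = ttl
        self._entries = {}      # path -> (expires_at, metadata)

    def stat(self, path):
        now = time.monotonic()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]             # served locally, no round-trip
        meta = self._server.stat(path)  # cache miss: ask the server
        self._entries[path] = (now + self._ttl, meta)
        return meta
```

Even a modest TTL helps here, because metadata operations (directory listings, stats, permission checks) are typically far more frequent than data reads.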
Enterprises are right to be concerned about performance when considering cloud storage solutions because monolithic storage appliances can become bottlenecks in the data path. But software-based cloud storage systems that implement multi-tier caching optimize access performance at every step of the data path, eliminating performance as a concern. With a software-based, caching-enabled solution in place, enterprises can confidently use cloud storage for active archiving, Tier 2-3 workloads and other applications.
Dr. Jay Kistler is co-founder and CTO of Maginatics, which delivers a cloud storage platform. He has a longstanding interest in cloud-scale storage systems, beginning with his seminal work with distributed file systems at Carnegie Mellon in the 1980s and 1990s. Prior to Maginatics, Jay was VP of engineering at Yahoo!, where he led the search and advertising infrastructure programs. Earlier, Jay served as chief architect for platform technologies at Akamai.