Evolution of Hybrid NoSQL to Tackle Big Data

By Robert Fox | January 14, 2014 | Article

The pendulum is swinging for NoSQL. Though the term NoSQL (which stands for “Not Only SQL”) was first coined back in 1998 by Carlo Strozzi, the movement really got underway with Google’s BigTable paper (2006) and Amazon’s Dynamo paper (2007). Since then, there has been a flood of technology to address the challenges of working with Big Data. The march away from traditional RDBMS technology to manage ever-growing volumes of data has recently taken an interesting twist.

Before we look at the pendulum effect, let’s examine why NoSQL technology propagated in the first place, understand the new challenges it introduced (because, as they say, there is no free lunch), and finally look at a recent trend that may put Google back in front, leading the charge toward a hybrid approach to database technology that promises the benefits of a traditional RDBMS at distributed scale.

The genesis of NoSQL

First off, why weren’t traditional RDBMS approaches to working with data enough? The answer has to do with the sheer volume of data that now needs to be processed. As that volume increases, the architecture must scale with it. The database quickly became the bottleneck, and queued requests piled up.

The answer was to scale horizontally, meaning across many machines. Scaling a database this way brings its own challenges as data is sharded across nodes: the data must be replicated in case a node fails, and the cluster becomes harder and more complex to manage. That complexity leaks into the applications that need the data, and the cost of failure can become unmanageable.

At the root of this challenge is estimating the cost of what has become an extremely complex system to maintain and manage. Add to this the low-latency requirements of the real-time services that typically access the data, and it was clear that a different approach to scaling was required.

At the heart of the problem is a pretty straightforward issue: the desire for guaranteed database transactions, meaning transactions that satisfy the ACID properties of Atomicity, Consistency, Isolation, and Durability.
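To make the ACID guarantee concrete, here is a minimal sketch (my illustration, using Python’s built-in sqlite3 module rather than any system discussed in this article): a transfer between two accounts either commits as a single atomic unit or not at all.

```python
import sqlite3

# Minimal sketch of an ACID transaction using SQLite (illustrative only).
# Atomicity: both updates commit together, or neither does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
        # If anything raises here, neither update becomes visible.
except sqlite3.Error:
    pass  # the whole transaction was rolled back as a unit

print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

This all-or-nothing behavior is easy to provide on a single node; guaranteeing it across many nodes is where the trouble starts.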

Guaranteeing that every database transaction results in consistent, accessible data on every node is the heart of the issue (think CRUD, Create/Read/Update/Delete, as the typical database operations). So what if we relaxed the requirements of ACID? If we drop one of the properties as a requirement, would that allow us to scale more easily? This is the philosophy behind NoSQL technologies.

To understand this, it is helpful to become aware of the CAP theorem. The CAP theorem (also known as Brewer’s theorem) says that it is impossible for a distributed computer system to simultaneously guarantee all three of Consistency, Availability, and Partition tolerance. NoSQL technologies, in essence, each choose two of the three guarantees.

When designing an application that uses a distributed database, the technology choice has traditionally been based on which property one is willing to forgo. Many of the most popular technologies, such as Cassandra and CouchDB, give up strong consistency in favor of eventual consistency: eventually, all nodes will have the same view of the data. Other NoSQL providers choose to relax a different property. Either way, doing this breaks the ACID requirements of a traditional RDBMS.
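Here is a toy sketch of what eventual consistency can look like (my illustration, not any vendor’s actual protocol): each replica tags writes with a version, accepts writes independently, and replicas gossip state until they converge on the latest write.

```python
from itertools import count

# Toy eventual consistency with last-write-wins conflict resolution.
# Each replica keeps (value, version) per key; replicas gossip to converge.
_clock = count(1)  # global logical clock so writes are strictly ordered

class Replica:
    def __init__(self):
        self.store = {}  # key -> (value, version)

    def write(self, key, value):
        self.store[key] = (value, next(_clock))

    def merge(self, other):
        # Anti-entropy: adopt whichever entry carries the newer version.
        for key, (value, ver) in other.store.items():
            if key not in self.store or ver > self.store[key][1]:
                self.store[key] = (value, ver)

a, b = Replica(), Replica()
a.write("color", "red")    # one write lands on node A
b.write("color", "blue")   # a later write lands on node B
print(a.store["color"][0], b.store["color"][0])  # diverged: red blue
a.merge(b); b.merge(a)     # gossip rounds
print(a.store["color"][0], b.store["color"][0])  # converged: blue blue
```

Between the write and the gossip rounds, two clients can read two different answers; that window is the consistency being given up.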

There’s a popular picture of the CAP theorem for NoSQL systems, described in a blog post by Nathan Hurst; it’s a little dated, but you get the idea (http://blog.nahurst.com/visual-guide-to-nosql-systems).

As a result of the papers by Google and Amazon, a lot of development has happened to address handling Big Data reliably across a distributed environment. Each of these technologies typically falls into one of several NoSQL families: key-value stores, column stores, graph databases, document stores, and variations of these. Each gives up something in the way of ACID to make it much easier to scale out and handle large data volumes.

The fine print with NoSQL

I did mention there’s no free lunch, didn’t I? For all of the complexities that NoSQL addresses or abstracts away from the user implementing and managing a distributed data store, there are other interesting problems and challenges to face. The need for constant data compaction is one such example, as is the fact that deleting data in an eventually consistent store leaves behind “tombstones” (markers for deleted items that stick around and, if mishandled, can let deleted data reappear in what are sometimes called phantom reads).
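To see why tombstones exist at all, consider this sketch (again my illustration, not any particular product’s implementation): a delete cannot simply remove the key locally, because a peer replica still holding the old value would re-introduce it on the next merge. So the delete is written as a tombstone, and purging tombstones too early brings deleted data back.

```python
from itertools import count

_clock = count(1)
TOMBSTONE = object()  # sentinel marking "this key was deleted"

class Replica:
    def __init__(self):
        self.store = {}  # key -> (value_or_TOMBSTONE, version)

    def write(self, key, value):
        self.store[key] = (value, next(_clock))

    def delete(self, key):
        # A delete is recorded as data (a tombstone), not removed outright,
        # so the deletion itself can propagate to peer replicas.
        self.store[key] = (TOMBSTONE, next(_clock))

    def merge(self, other):
        for key, entry in other.store.items():
            if key not in self.store or entry[1] > self.store[key][1]:
                self.store[key] = entry

    def compact(self):
        # Purge tombstones to reclaim space. Safe only after every peer
        # has seen them; done too early, deleted items are resurrected.
        self.store = {k: v for k, v in self.store.items()
                      if v[0] is not TOMBSTONE}

a, b = Replica(), Replica()
a.write("user:1", "alice")
b.merge(a)            # both replicas now hold the value
a.delete("user:1")    # tombstone written on A
a.compact()           # compacted before B ever saw the tombstone...
a.merge(b)            # ...so B's stale copy wins the next merge
print(a.store["user:1"][0])  # prints "alice": the deleted item is back
```

Real systems mitigate this with grace periods before compaction, but the underlying tension between reclaiming space and propagating deletes remains.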

It gets weirder than that as you try to understand the state of a distributed store. In interviewing organizations that have managed different implementations of NoSQL, I found a clear set of challenges that is common regardless of provider or technology:

  • Tools are immature — Administering these environments can be challenging due to a lack of mature tooling. This works against the promise of NoSQL, as managing these environments can become increasingly difficult.
  • Immature solution — The technology itself is often at best a couple of years old and still maturing into a truly stable and reliable product.
  • Lack of domain expertise — I mentioned tombstones above; artifacts like these are very different from what a typical database administrator is used to managing, and new challenges require new expertise.
  • Lack of support — Many companies offering solutions simply haven’t been around long enough, with enough exposure, to establish effective support for the technology.
  • Lack of SQL grammar — Though many technologies provide a SQL-like language, many do not support what is generally known in the industry as the lingua franca for working with relational databases.

This isn’t to say that NoSQL is bad or insufficient for the task at hand. It’s merely that there are costs to relaxing ACID in order to abide by the CAP theorem and support a distributed data store. However, the trade-offs implied by the CAP theorem are being challenged, and as noted at the outset, the pendulum is now swinging back.

The pendulum swings

Recently, Google published yet another paper, “F1: A Distributed SQL Database That Scales.” This new database was developed to replace MySQL for Google’s AdWords. It achieves ACID guarantees while being truly distributed. In addition, it supports a SQL dialect familiar to most database developers and administrators.

Quite frankly, this database is challenging Brewer’s theorem and causing people to pause. And Google isn’t alone: other providers claim true ACID support across distributed database deployments, including FoundationDB, MarkLogic, and VoltDB, with more forthcoming. I refer to these as Hybrid NoSQL providers, and they are challenging what we think of as the trade-offs between NoSQL and traditional RDBMS.

With so many NoSQL providers entering the market, what impact will this shift have? It will be interesting to see how the landscape shapes up over the next two years. How far will the pendulum swing, and where will it swing next? 

Robert Fox is the vice president of application development for Liaison Technologies and the architect for several of Liaison’s data integration solutions. Liaison Technologies is a global provider of cloud-based integration and data management services and solutions. Rob was an original contributor to the ebXML 1.0 specification, is the former chair of marketing and business development for ASC ANSI X12 and a co-founder and co-chair of the Connectivity Caucus.