Big Data — everyone seems to have it and everyone wants to do something with it — and why not? The power of information lies in the ability to analyze and use it to your advantage. While Big Data analytics is one of the hottest trends at the moment, it’s also one of the most difficult to deploy. It’s not as simple as just putting together a business intelligence strategy.
First, you must grapple with organizing the mountain of data you have so that you can actually apply analytics to it. Consider the following questions: How can you pre-qualify the data you are collecting, especially when you may not even know what questions you are going to ask of it? How do you develop a strategy to continually qualify and refine your data even after it's been acquired to ensure the best possible outcomes? And how do you know you are receiving all the data you need to get accurate answers to your questions?
This is the catch-22 of Big Data: IT needs requirements from the business to build a system to analyze data, but the business cannot provide those requirements until it understands the data … which it may not be able to do until IT builds something for the business to use.
From the IT side, the challenge of identifying, collecting, retaining and providing access to all relevant data for the business at an acceptable cost, and within the management and maintenance capabilities of the IT organization, is huge. From the business perspective, Big Data opens a whole other can of worms that is difficult to rationalize, including the “how do I get started?” question.
So where do we begin? Let me first take you to the dark side ….
Despite its sinister-sounding name, dark data is more benign than it suggests. Dark data is data or content that exists and is stored by an enterprise, but typically is not leveraged or analyzed for intelligence. It includes data in physical locations or formats that are either not connected (i.e., siloed) or considered too costly or complex to contribute to analysis. It also includes data that is currently stored and could be connected to other data sources for analysis, but that the enterprise has not dedicated sufficient resources to analyze and leverage. Although dark data rarely sees the light of day, no one feels comfortable destroying it because it might prove useful someday.
By itself, dark data may not have much value; but combine it with data you already collect or purchase, and you may have a digital gold mine. For example, sales reps out in the field may be taking copious notes pertaining to their prospects, but they don’t capture these details electronically. If there were a way to tap into this wealth of knowledge and analyze it alongside existing data, your business could potentially uncover new selling strategies as well as more prospective customers.
Knowing is half the battle. An initiative to identify and catalogue sources of dark data is a great first step. Once these sources are identified, assess, where possible, whether the data is worth analyzing. The idea is to pre-evaluate the dark data sources with the greatest potential to contribute to business intelligence. Once this has been completed, it really comes down to analyzing the cost of bringing those sources into the analytics space. This exercise sheds light on dark data — even if it ultimately goes unused.
Half-life of data
Big Data is most commonly measured by three Vs: volume, variety and velocity. In a nutshell, these measure how much data you have, the different types of data you have and how fast that data is coming at you. But an often overlooked, yet equally important, V is value. This fourth V focuses on qualifying the data for business value, and being able to qualify data is essential to producing high-quality analytics.
In a recent conversation with Nathaniel Rowe of the Aberdeen Group, I had the chance to speak with him at length about how to define the quality, or value, of data and the importance of determining the half-life of data. Being able to identify and qualify the value of data as it is acquired will greatly assist with downstream data maintenance, as discussed in the following section.
Data, like radioactive material, has a half-life. Over time, data that is no longer relevant or no longer contributes to answering key questions needs to be trimmed out. Left in your data store, it will increasingly inhibit, and potentially cripple, your ability to perform quality analytics. Data that was useful at one point in its life may have decreasing or even negative value later on. So a crucial part of a successful business intelligence strategy is being able to score the data used to answer questions in order to determine its half-life. Products and prescriptions often come with a pre-determined shelf life conveniently marked right on the package. Data is more elusive in that regard — it doesn't generally come with a freshness indicator.
It is important to know that different types of data have a different half-life. Let’s take, for example, how a retail business identifies buying trends. Customer purchasing information obtained from CRM or order management systems can predict buying patterns that could be relevant to sales projections for years to come. Obviously, this information is worth holding on to for a longer period of time and, therefore, it will have a fairly lengthy half-life. On the other hand, spotting a real-time buying trend from social feeds such as Facebook or Twitter will have a significant short-term impact, but likely a much shorter viability period, or half-life, for forecasting future buying trends and, as such, should be thinned out much sooner.
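The half-life idea above maps naturally onto exponential decay. As a minimal sketch (the function name and the specific half-life values for CRM versus social data are illustrative assumptions, not figures from the article), a relevance score that halves every `half_life_days` could look like:

```python
def relevance_score(initial_score: float, age_days: float, half_life_days: float) -> float:
    """Exponential decay: the score halves every `half_life_days` days."""
    return initial_score * 0.5 ** (age_days / half_life_days)

# CRM purchase history: long half-life (assume ~2 years)
crm_score = relevance_score(1.0, age_days=365, half_life_days=730)    # ~0.71

# Social-feed trend data: short half-life (assume ~2 weeks)
social_score = relevance_score(1.0, age_days=365, half_life_days=14)  # effectively 0
```

After a year, the CRM data retains most of its scored value while the social-feed data has decayed to almost nothing, which is exactly the intuition behind pruning the latter much sooner.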
This is where your data-retention policy has a great impact on Big Data and your ability to leverage it. A data-retention policy is one of the most important policies for a business to have, but it’s often one that isn’t regularly monitored or enforced, making it ineffectual. To be effective, your retention policy should cover all data no matter what form it is in and where it resides. Some things to keep in mind when creating your policy include:
- Identify industry, state and federal regulations or laws that affect your data
- Create a data classification policy to categorize the data and assign the appropriate retention period
- Locate where your data resides
- Create a set of procedures to manage, track and dispose of the data according to policy
- Monitor your data resources to ensure compliance with the data-retention policy
- Update your data-retention policy as industry requirements, state and federal regulations, and technology change
- Maintain appropriate levels of security at all stages of the data retention policy
- Mark data that contributes to positive analytics results so that, over time, unused (stale) data can be identified and pruned
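The classification-and-retention steps above can be sketched in a few lines. This is a minimal illustration, assuming hypothetical data classes and retention periods (the class names and day counts are not from the article):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data class, in days
RETENTION_DAYS = {
    "crm_purchase_history": 730,  # long-lived predictive value
    "social_media_feed": 14,      # short-lived trend signal
}

def expired(record_class: str, created_at: datetime, now: datetime) -> bool:
    """A record is past retention once its age exceeds its class's period."""
    return now - created_at > timedelta(days=RETENTION_DAYS[record_class])

def prune(records: list, now: datetime) -> list:
    """Keep only records still inside their retention window."""
    return [r for r in records if not expired(r["class"], r["created_at"], now)]
```

In practice, the disposal step would also log what was pruned (for the compliance-monitoring bullet above) rather than silently discarding records.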
By having a well-defined data retention policy and tagging or scoring your data — in essence identifying its half-life — your business will be better able to keep its data fresh and relevant, which will lead to higher-quality analytics.
In a good a cappella group, it takes skill, practice and teamwork to hit all the right notes and produce beautiful harmony. Similarly, the key to a successful business intelligence strategy, particularly when dealing with Big Data, is ensuring data quality and harmonization. Answers to questions become more accurate when disparate data types representing common ideas can be normalized, de-duplicated and error-remediated.
Picture a company that consolidates travel information from different providers. The same kinds of data arrive in different formats: spreadsheets, CSVs, XML and so on. Some providers send duplicate records; others send information that is inaccurate or refers to something that doesn't exist (like a typo in an airline carrier's name). How good your answers to business intelligence questions are depends directly on how good the data is, and that requires a sound strategy for qualifying the data being captured. Part of the catch-22 is knowing how to assess data quality and implementing the right tools and rules to improve the data's relevancy score.
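The travel-consolidator scenario can be sketched as a normalize-then-de-duplicate pass. This is a minimal illustration under stated assumptions: the field names, the alias table that corrects a misspelled carrier, and the choice of (carrier, flight, date) as the duplicate key are all hypothetical.

```python
# Hypothetical alias table mapping known misspellings to a canonical name
CARRIER_ALIASES = {"detla": "Delta", "DELTA AIR LINES": "Delta"}

def normalize(record: dict) -> dict:
    """Map one provider record onto a common shape and canonical values."""
    carrier = record["carrier"].strip()
    carrier = CARRIER_ALIASES.get(carrier, carrier)
    return {
        "carrier": carrier,
        "flight": record["flight"].strip().upper(),
        "date": record["date"],  # assume dates already arrive as ISO 8601
    }

def harmonize(records: list) -> list:
    """Normalize every record, then drop exact duplicates."""
    seen, out = set(), []
    for r in map(normalize, records):
        key = (r["carrier"], r["flight"], r["date"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```

A real pipeline would first parse each provider format (spreadsheet, CSV, XML) into dicts before this step; the point is that duplicates are only detectable after normalization, since "detla / dl123" and "Delta / DL123" are the same record.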
True harmonization means more than just connecting data — it means making sense of it all so it can be used to speed time to market and enhance secondary use. Finding the right partner to help you capture, manage, normalize and harmonize your data is critical to creating a quality data set that can then be analyzed. A solid data-management strategy can make all the difference in the world in getting the right answers to the right questions. While data harmonization can be one of the most difficult things to do, it’s a critical piece that will allow you to get the most out of your data.
In the end, Big Data can offer many opportunities for businesses when leveraged the right way. Remember, though, that analyzing some of your Big Data won't necessarily solve your business problems if you aren't starting out with the right questions. And to ask the right questions, you must first have the policies and processes in place that give you a quality data set on which to base them.
Rob Fox is senior director of EAI/B2B Software Development for Liaison Technologies and the architect for several of Liaison’s data integration solutions. Liaison Technologies is a global provider of cloud-based integration and data management services and solutions. Rob was an original contributor to the ebXML 1.0 specification, is the former chair of marketing and business development for ASC ANSI X12, and a co-founder and co-chair of the Connectivity Caucus.