A small financial forensics team was called upon to examine transaction records of a large bank holding company to ensure the client was complying with federal and international regulations. During its investigative process, which had to be completed in 30 days, the team realized the data was, in one manager’s words, “a mess.”
The potential risks of rushing through the data prep process were high. If analysts began their work with inaccurate or incomplete data, they might overlook a merchant that was laundering money for criminal or terrorist networks. In that case, the bank itself would be at risk of heavy fines or sanctions for allowing such activity to occur.
The alternative was to spend a great deal of time getting the data ready, leaving no time for the actual analytic exercise.
As many organizations discover, the time needed to get the data ready kept expanding, and the team risked missing its deadline because a good portion of the project went to cleaning up the data and preparing it for analysis. Further exacerbating the situation, the team then found out it had more data, and even a different data set, to work with.
Pitfall #1: Value leak
The above example of a Paxata client points to a pitfall most enterprises face: they have more data than they know what to do with and are often expected to make important decisions on gut and intuition instead of analyzing the data. Why? Done poorly, data preparation costs every organization a great deal of money, time and resources.
Regardless of whether your IT organization uses a traditional ETL (extract, transform and load) process to get data pulled and formatted for use, or dozens of analysts who spend countless hours manually “crafting” data in Excel, or data scientists who perform MapReduce jobs, the reality is that most organizations don’t put a price tag on the resources, time or effort it takes to prepare data.
Think about one analytic exercise your team performs each week or month:
- Where does the data come from? One source? Five sources? Too many to think about?
- How varied is the data? All from corporate systems or a mix of third-party, corporate and personal data sets?
- Who is involved in getting the data ready? One person? Five people?
- How long does that process take? Is anyone waiting for it?
Without knowing these answers, you are creating a value leak in every analytic exercise. Take the time to size and scope your various data preparation investments so you know where your biggest pain is and how best to solve it.
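To make the sizing exercise concrete, a back-of-envelope calculation is often enough to expose the leak. The figures below are purely illustrative assumptions, not benchmarks; plug in your own headcount, hours and rates:

```python
# Hypothetical back-of-envelope sizing of one recurring data prep exercise.
HOURLY_RATE = 75          # assumed loaded cost per analyst hour (illustrative)
analysts = 3              # people involved in getting the data ready
hours_each_per_week = 6   # hours each spends wrangling sources per run
runs_per_year = 52        # a weekly exercise

weekly_cost = analysts * hours_each_per_week * HOURLY_RATE
annual_cost = weekly_cost * runs_per_year
print(f"Weekly prep cost: ${weekly_cost:,}")   # $1,350
print(f"Annual prep cost: ${annual_cost:,}")   # $70,200
```

Even with modest assumptions, a single weekly exercise adds up fast; multiply by the number of recurring exercises across teams and the value leak becomes visible.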
So back to the case I mentioned earlier – the small data/analytics investigation firm tasked with examining transaction records of the large bank holding company. What was the outcome?
By using Paxata, the analyst team rapidly incorporated the new data without starting over or sacrificing work that had already been done. That allowed the team to spend less time preparing the data for analysis and more time actually doing the high-value analytic work the client was paying them to perform. Paxata’s ability to swap in new data on the fly allowed the forensics team to save several days of effort, ensure the data was of the required quality and meet its 30-day deadline.
In fact, that first use case actually changed how the financial forensics analysts do everything within an investigation, from initial scoping of a project to the actual analytics. Today, they start all new projects with data forensics. With Paxata, they put all the data into the system and quickly get a sense of patterns, gaps and recommended opportunities to join the data, etc. They then understand the dynamics of the data in a way that actually shapes the investigation path. This allows them to get the investigation up and running faster and reduce the risk of missing something in the process.
It also means they can take on new clients at a faster pace, avoid underscoping or underbidding a project, avoid long delays in the bidding process and expedite the time to analytics by making sure they have all the data they need up front.
Pitfall #2: Compromising trackability of data
Now that you know what’s involved in getting data ready, here’s another pitfall to avoid: the nearly impossible task of retracing your steps. Let’s not talk about spreadmarts since the focus of the spreadmart discussion tends to be on where the data is sitting, and we’ve all hoarded spreadsheets on our laptops. Instead, let’s focus on data lineage in the data-preparation process, regardless of where the data sits.
For example, imagine a meeting you attend where everyone is staring at two dashboards from two different departments (e.g., finance and marketing). A raging debate ensues as to why the results shown on the dashboards are so wildly different. Where did the data come from? How were the industries aggregated? Did someone round dollars up or down? Did they segment based on five- or nine-digit zip codes? Did they include Japan in the APAC rollup? Suddenly, how the data was prepared (not so much where it sits) – and what decisions were made to get to the underlying data found in these dashboards – becomes really important.
As the example illustrates, data governance is a critical aspect of any data preparation exercise; so don’t compromise trackability for flexibility. Look for tools that can track and replay every step taken during the data preparation process so that teams using the data can clearly demonstrate the syntactic and semantic choices they made along the way.
The Paxata solution allows analysts to decide with whom they want to share their project or AnswerSets™ across the organization. In addition to capturing the sequence of end user steps, time stamps, and the end user who made the changes, Paxata gives analysts the ability to add textual annotations explaining why they made data preparation changes in a given step. This provides additional context to the data preparation process by providing a mechanism to capture knowledge that would otherwise be trapped in an analyst’s head.
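Paxata’s internals aren’t public, but the underlying idea – an append-only, replayable log of prep steps that captures who, when and why – can be sketched in a few lines of Python. All names here are illustrative, not any product’s actual API:

```python
from datetime import datetime, timezone

class PrepLog:
    """Append-only log of data preparation steps that can be replayed."""
    def __init__(self):
        self.steps = []

    def record(self, user, func, note):
        # Capture who did what, when, and why (the annotation).
        self.steps.append({
            "user": user,
            "func": func,
            "note": note,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def replay(self, data):
        # Re-apply every recorded step, in order, to a fresh dataset.
        for step in self.steps:
            data = step["func"](data)
        return data

log = PrepLog()
log.record("alice", lambda rows: [r.strip().lower() for r in rows],
           "normalize merchant names before matching")
log.record("bob", lambda rows: [r for r in rows if r],
           "drop blank records from the export")
print(log.replay(["  ACME Corp ", "", "Globex  "]))  # ['acme corp', 'globex']
```

The design point is that the log, not the output spreadsheet, becomes the artifact of record: anyone can rerun the same steps against new data and read the annotations explaining each choice.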
[Editor’s note: For more information on end-user data prep, read “Q&A with Paxata CEO on Self-Service BI and Data Governance Issues.”]
Pitfall #3: Forgetting the data prep end users
It seems every vendor in the data or business intelligence (BI) market has adopted language about self-service data preparation. But the fact is many enterprises forget who their users are and what they really need from their data preparation tools. To avoid the pitfall, here are some basic things to assess about users:
- How technical are your current users? Think about the people who are actually doing data preparation. Are they able to keep up with the demand for “ready” data? Are there bottlenecks in their process?
- Is your organization ready for self-service? Are there other users in the organization who could/should be doing their own data preparation but can’t because of the toolset available?
- How often do data prep projects get repeated or reused? Once a month? Weekly? If there were a way to automate a set of complex data preparation steps and then apply them to new datasets on a frequent basis, would that be important?
- What is the mix between big and small data preparation projects? Do you need the ability to prep massive data sets without slowing down the more ad-hoc data requirements that are smaller in size?
- Do data volumes restrict the types of tools that can be used? For example, it is tricky to apply Excel to the project when analysts are working with 200 million rows. Does the dataset size drive the need to sample because it’s impossible to prepare the massive files in their entirety?
- What are the typical use cases for which you are getting data ready? Is it a packaged application, where data quality and integration are the most needed capabilities? Is it migrating legacy datasets to a new system, in which case flexible data “engineering” is essential? Is it for agile BI, where business teams ask questions of the data as they visualize it, then require more data, then visualize it again, dynamically iterating on the dataset as they go?
This is not a comprehensive list, but it’s a good start at determining what your data preparation needs really are. Once complete, the landscape won’t look as cluttered.
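On the sampling question above: when a file is too large to prepare in its entirety, a single-pass reservoir sample yields a uniform subset without loading everything into memory. A minimal sketch – the file name in the commented usage is hypothetical:

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # replace with decreasing probability
            if j < k:
                sample[j] = row
    return sample

# E.g., sample 1,000 rows from a 200-million-row file without reading it whole:
# with open("transactions.csv") as f:   # hypothetical file
#     subset = reservoir_sample(f, 1000)
print(reservoir_sample(range(1_000_000), 5, seed=42))
```

A seeded sample is also reproducible, which matters for the trackability concerns in Pitfall #2.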
Data governance solutions
When selecting a data governance solution that will enable your organization to extract the most value for decision making, be sure the solution makes it very easy for both business and IT teams to work together on aspects such as data access and data quality.
And stay tuned … in the next 12-18 months, solutions in the data governance and end-user data prep space will change and … dare I say … make it fun to prepare data.
Prakash Nanduri, co-founder and CEO of Paxata, has 20+ years’ experience in startups and large companies. He was co-founder/VP of Velosel Corporation (acquired by TIBCO). He led the post-merger integration effort at TIBCO, then spent three years at SAP as head of product and technology strategy within the office of the CEO and was responsible for strategic initiatives including the SAP Big Data (Hana) business strategy.