For years, the promise of “the single view of the customer” drove the need to centralize all related data into a common data warehouse – many sources, one target. An entire data management ecosystem grew up to integrate, connect, extract, transform, cleanse, master and secure data in the warehouse. The data pipeline was known and straight. Now, customer data is far from straight: the number, location and variety of sources grow daily, and processes need to adapt to handle that ambiguity.
Data decentralization on the rise
The more the data management industry consolidates, the more opposing forces decentralize the market. Rather than managing data centrally, customers increasingly look for capabilities that manage data where it lives.
- Cloud warehouses – Even though teams have a data warehouse in their own data center, many spin up analytics and warehouse environments on Amazon or Azure, creating new sets of data in the cloud.
- Software-as-a-Service (SaaS) – While departments have capabilities in their existing enterprise resource planning (ERP) systems, many still seek agility by moving customer relationship management and marketing automation applications to a SaaS offering – many of which also include an analytics capability, such as Salesforce Wave.
- Visualization – Although companies standardize on data visualization and reporting tools, many users still reach for their personal tool of choice. In many cases that tool is Excel, so it becomes yet another data repository.
- Data-as-a-Service (DaaS) – Many data sources that enhance that view of the customer are available “out there” as a service and not “in here” behind the firewall.
Depending on your industry, feel free to replace the word “customer” with “supplier,” “patient,” “product” or any other data domain that matters to your business, and the story still holds true. These decentralized data sources still need to be accessed, cleansed, integrated and analyzed. And the tools that access, cleanse, integrate and analyze them need to support decentralized data sources.
Is data curation the answer?
At a recent chief data officer event at MIT, the CDOIQ Symposium, this year’s Turing Award winner, Dr. Michael Stonebraker, shook the house in his keynote by proposing that data integration solutions such as ETL (extract, transform and load) do not scale once you have more than 20 data sources. His point was that source data models change constantly, so the task of data curation (“turning independently created data sources into unified data sets ready for analytics, with data domain experts guiding the process”) requires human-guided machine learning to be feasible. He also suggested that the data domain experts reside where the data sources are created rather than at the point of integration (e.g., an ETL programmer).
When an organization needs to curate data from hundreds or thousands of data sources, including spreadsheets on CFOs’ workstations, data and metadata intelligence combined with automation makes sense. On the other hand, an organization that lets mission-critical and highly sensitive data go unmanaged makes you wonder whether its culture resembles the Wild, Wild West, with bigger issues than curating financial budgets from spreadsheets on laptops.
The process of data curation, which some startups tackling a similar challenge in data lakes call “data wrangling” (aligning with the Wild West theme), is certainly a necessary component of data preparation. As organizations begin to understand how they can harness a variety of existing and new data in predictive and operational analytics, tools that automate as much of this work as possible significantly improve productivity. Data curation or wrangling can certainly help speed time to insight, but it will need to be flexible enough to support the process no matter where the data reside.
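To make the idea concrete, here is a minimal wrangling sketch in Python using pandas. The file, the column names (customer_id, email, signup_date) and the cleansing rules are hypothetical illustrations, not a prescription for any particular curation tool:

```python
# A minimal data-wrangling sketch; file and column names are hypothetical.
import pandas as pd

def wrangle_customers(path: str) -> pd.DataFrame:
    """Cleanse and standardize one departmental customer extract."""
    df = pd.read_excel(path)                                  # e.g., an Excel silo on a laptop
    df.columns = [c.strip().lower() for c in df.columns]      # normalize headers
    df["email"] = df["email"].str.strip().str.lower()         # standardize the matching key
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    df = df.drop_duplicates(subset="email")                   # de-duplicate
    return df.dropna(subset=["customer_id"])                  # drop unusable rows
```

The same handful of steps – normalize, standardize, de-duplicate, drop the unusable – repeats for every new source, which is exactly why automating them pays off as the source count grows.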
Comparing data centralization approaches
Let’s consider the different approaches to centralizing data for reporting and analytics.
Data centralization assumes your organization subscribes to having a central repository for data, such as the classic data warehouse or, more recently, a Hadoop-based data lake. In either architecture, data is ideally collected, cleansed, de-duplicated, mastered, integrated and presented to the user or analytics tool. The pros of centralizing data: if data is in one location, access and use should be simplified; data management processes should be streamlined and shared, eliminating confusion and duplicate data sets; and the definition and quality of data should be consistent and easier to maintain. The cons: centralized approaches may not serve global organizations well, especially when it comes to localization, performance and accessibility.
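As a point of reference, the centralized “many sources, one target” pattern often boils down to something like the sketch below: extract from several sources, apply shared cleansing rules, and load one curated table into the warehouse. The connection strings, table names and rules are assumptions for illustration only:

```python
# Centralized ETL sketch: many sources, one target warehouse (all names hypothetical).
import pandas as pd
from sqlalchemy import create_engine

sources = {
    "crm": "postgresql://crm-db/customers",
    "billing": "postgresql://billing-db/accounts",
}
warehouse = create_engine("postgresql://warehouse/analytics")

frames = []
for name, uri in sources.items():
    df = pd.read_sql("SELECT * FROM customers", create_engine(uri))  # extract
    df["source_system"] = name                                       # keep lineage
    frames.append(df)

unified = pd.concat(frames).drop_duplicates(subset="email")          # shared cleansing rule
unified.to_sql("dim_customer", warehouse, if_exists="replace", index=False)  # load
```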
Data decentralization, on the other hand, assumes that data stays where it is created. The wrangling tasks are conducted in place or virtually, and only the cleansed, transformed and desirable data is presented to the user or analytics tool. In a larger enterprise with multiple business units, imagine each business unit creating and managing its own data ecosystem independently of the mother ship. Each unit is able to localize, process and analyze data in an agile manner. The cons? Lack of standardized reporting across business units, no common enforcement of data governance policies and redundant infrastructure.
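By contrast, a decentralized (or virtualized) read might look more like the following sketch: each business unit exposes only a curated view, and the consumer pulls back just the cleansed slice it needs at query time rather than copying raw data into a central store. The endpoints, view name and query are hypothetical:

```python
# Decentralized/virtual read sketch: query curated views in place (all names hypothetical).
import pandas as pd
from sqlalchemy import create_engine

unit_views = {
    "emea": "postgresql://emea-db/sales",
    "apac": "postgresql://apac-db/sales",
}

query = """
    SELECT region, month, SUM(revenue) AS revenue
    FROM curated_sales
    GROUP BY region, month
"""

results = [
    pd.read_sql(query, create_engine(uri)).assign(business_unit=unit)
    for unit, uri in unit_views.items()
]
report = pd.concat(results)  # only the desirable, already-cleansed data ever moves
```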
While the data decentralization approach seems to be on the rise, data centralization still drives the majority of IT and data management technology spend, at least for now.
The hybrid compromise
Considering the stark contrast between the pros and cons of data centralization and decentralization, a hybrid approach makes the most sense and, in most cases, will yield the most benefit. In this approach, individual users continue to create their own siloed data stores to help them manage their day-to-day tasks, while the organization continues to collect and curate data centrally where doing so enables confident decision making across business units with acceptable accuracy and risk.
For business initiatives or challenges that are significant enough, successful leaders need to strike the right balance within a hybrid approach, ensuring that the right data management strategies are implemented, supported and effectively maintained throughout the data life cycle.
Julie Lockner is VP of product marketing, data security and archive, and VP of market development at Informatica. Earlier she served as ESG’s VP/senior analyst covering data management solutions and managing end-user consulting. Prior to ESG, Julie was president and founder of CentricInfo, which specialized in implementing data governance and data center optimization and was acquired by ESG. She was also VP at Solix Technologies, a senior portfolio architect at EMC, and held engineering and product management roles at Oracle, Cognex and Raytheon. Follow her on Twitter.