The Anatomy of Big Data

Welcome to the twilight zone of information management, where every second of every day terabytes and petabytes of structured and unstructured information are relentlessly created by machines, humans and now their devices. The digital exhaust from social media, BYOD and telecommunications represents an almost unmanageable tsunami of information that is challenging nearly every organization, business and individual globally. Many organizations have hundreds of databases and information archives, often in silos, making it nearly impossible for anyone to find the right and related information for accurate, business-critical decision making.

The real challenge for organizations is that nearly 80 percent of all information is in an unstructured format, making it difficult to search, navigate and organize. Knowledge workers spend 25 percent of their time searching for information, contending with structured and unstructured data that don’t work well together and with solutions that work well on the Web but not in the enterprise. As inherently analytical scientist types, we like to build our house on a rock of solid primary and secondary research.

In this article we explore the anatomy of Big Data from the perspectives of technology, emerging technologies, its challenges to organizations and individuals, the overall market, its business value and social media. Also included are some seminal insights from this year’s first Big Data Boot Camp held in May in NYC and a Big Data survey.

Big Data technology

The archaic databases of the 1980s weren’t designed for all Big Data types and objects, and they aren’t agile or fast enough to manage them. They are also prohibitively expensive for most organizations, and the management of their backup and recovery processes is extremely complex.

In our view, the most significant challenges that Big Data presents to the enterprise revolve around what we call the “four pillars of Big Data” and the characteristics and data types that make them easy or hard to manage.

The four pillars of Big Data

Big Tables	Big Text	Big Metadata	Big Graphs
Structured	Unstructured	Data about data	Object connections
Relational	Natural language	Taxonomies	Subject predicate object
Tabular	Full text	Ontologies	Triple store
Rows and columns	Grammatical	Glossaries	Semantic discovery
Traditional	Semantic	Facets	Degrees of separation
		Concepts	Linguistic analysis
		Entities	Schema-free

RDBMS, open source and Hadoop

There are many emerging open source and proprietary technologies and platforms, such as NoSQL databases, that are designed to handle unstructured data, and a healthy ecosystem of solution and tool vendors has emerged around them. Innovative search enhancement tools such as Applied Relevance’s Epinomy automate taxonomy building and enable organizations to manage and organize unstructured data for high-performance search and nearly real-time decision making. One of the best recent blogs on this area is Doculabs’ blog.

Hadoop is clearly the most important Big Data-enabling technology and is rapidly becoming the heart of any modern data platform. Hadoop is a cost-effective and scalable platform for staging Big Data; however, some NoSQL databases, such as MarkLogic, don’t really require Hadoop.

One of the most insightful presentations at this year’s Big Data Boot Camp came from a Big Data consultant, Alex Gorbochev. Below is a brief overview of his view of RDBMS, NoSQL, Big Data and Hadoop.

When RDBMSs make no sense

Storing images and video
Processing images and video
Storing and processing other large files (PDFs, Excel files)
Processing large blocks of natural language text (blog posts, job ads, product descriptions)
Semi-structured data (CSV, JSON, XML, log files)
Ad-hoc, exploratory analytics
Integrating data from many volatile external sources
Data clean-up tasks (data wrangling)
Very advanced analytics (machine learning)
Business domain knowledge is not well defined

Key benefits of Hadoop

Reliable solution based on unreliable hardware
Designed for large files
Load data first, structure later
Designed to maximize throughput of large scans
Designed to leverage parallelism
Designed to scale
Flexible development platform
Solution ecosystem
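
The "load data first, structure later" benefit listed above is often described as schema-on-read. The following is a minimal Python sketch of the idea, not Hadoop code: an in-memory list stands in for a distributed file system, and the record contents, the field names and the `read_with_schema` helper are all hypothetical.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no schema enforced on write).
raw_store = [
    '{"user": "ann", "action": "view", "ms": "120"}',
    '{"user": "bob", "action": "buy", "ms": "340", "sku": "A-1"}',
    'not json at all',
    '{"user": "ann", "action": "buy", "ms": "95"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only the requested fields, cast them,
    and skip records that cannot be parsed or that lack a field."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {name: cast(record[name]) for name, cast in schema.items()}
        except (ValueError, KeyError, TypeError):
            continue  # malformed or incomplete record; this reader ignores it


# The structure is decided later, by the question being asked.
purchase_schema = {"user": str, "action": str, "ms": int}
rows = list(read_with_schema(raw_store, purchase_schema))
buyers = sorted({row["user"] for row in rows if row["action"] == "buy"})
print(buyers)
```

A different analysis could read the same raw records with a different schema (for example, one that includes the optional `sku` field) without reloading or restructuring the stored data.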

Some key use cases for Hadoop include analysis of customer behavior, optimization of ad placements, customized promotions and recommendation systems such as Netflix, Pandora and Amazon.

Hadoop also provides inexpensive archive storage and can serve as an ETL layer and a transformation and data-cleansing engine. One of the key differentiators for Hadoop is reported to be its ability to scale to clusters of 100 to 1,000 nodes (unlike traditional RDBMSs, which support maybe dozens of nodes).

Big Data and In-Memory

In-memory databases and appliances are another welcome addition to the Big Data and business intelligence arsenal, and they are significantly changing and optimizing many business processes. They have the potential to allow organizations, line-of-business managers and the C-suite to spend more time on creating simulations, scenario planning and analysis of Big Data, and less time on building and waiting for queries. Check out this In-Memory article by Asterias Research on SandHill.

Big Data and graph databases

Graph databases are among the most important technologies in the Big Data space. They are sometimes faster than SQL databases, and they greatly enhance and extend the capabilities of predictive analytics by incorporating multiple data points and interconnections across multiple sources in real time. Predictive analytics and graph databases are a perfect fit for the social media landscape, where various Big Data points are interconnected.
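
The "subject predicate object" and "degrees of separation" ideas from the Big Graphs pillar can be sketched in a few lines of Python. This is an illustrative in-memory example, not the API of any graph database: the triples are made up, the `degrees_of_separation` helper is hypothetical, and the search is a plain breadth-first traversal.

```python
from collections import defaultdict, deque

# Hypothetical subject-predicate-object triples.
triples = [
    ("ann", "follows", "bob"),
    ("bob", "follows", "cara"),
    ("cara", "follows", "dev"),
    ("ann", "likes", "BrandX"),
    ("dev", "likes", "BrandX"),
]

# Build an undirected adjacency map; the predicate is ignored when measuring distance.
neighbors = defaultdict(set)
for subject, _predicate, obj in triples:
    neighbors[subject].add(obj)
    neighbors[obj].add(subject)


def degrees_of_separation(start, goal):
    """Return the number of hops on a shortest path, or None if unconnected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


print(degrees_of_separation("ann", "dev"))
```

In this toy data the shared interest in a brand connects ann and dev in fewer hops than the chain of followers does, which is the kind of cross-source interconnection the paragraph above describes.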

The Big Data market

According to Gartner, Big Data initiatives will drive more than $200 billion in IT spending over the next four years along with tremendous change in how Big Data is managed. McKinsey has predicted that businesses will get $3 trillion in business value from Big Data in the next several years.

But what is really going on in the Big Data space is the IT departments’ lemming-like move to inexpensive open source solutions that are facilitating the modernization of data centers and data warehouses. And at the center of this universe is Hadoop.

From our perspective, only a few of the companies that call or position themselves as “Big Data” companies are actually making money at this time. Organizations are moving away from the traditional IT vendors and their business models, which often come with pricey software and hardware systems along with significant maintenance fees exceeding 22 percent per year.

In the evolution of the Big Data market, open source is playing a seminal role as the disruptive technology challenging the status quo.

In a recent survey of 300 data managers conducted by Information Today for the Big Data Boot Camp conference, we identified some interesting trends in the Big Data market, along with the top industries with Big Data projects, business initiatives and the types of data being used.

Top Big Data industries

Retail
Financial services
Technology
Manufacturing
Government
Education
Telecommunications

Current Big Data business initiatives

Customer analysis
Historical data analysis
Machine data, production system monitoring
Website monitoring and analysis
IT systems log monitoring and analysis
Competitive market analysis
Content management
Social media analysis

Big Data signal types

Production or transactional data
Real-time data feeds
Textual data
ERP data
CRM data
Historical data
Web logs
Social media data
Multimedia data
Machine-to-machine data
Sensor data
Spatial data

Is Big Data Big Brother?

Big Data is a new frontier for data privacy and security. Many of us are now realizing that we are indeed “selling” our data on Facebook, Google and Twitter. In many ways we are not their customers; we are their product. All social media networks are collecting information about us and selling it to advertisers and other organizations that are targeting affinity groups and influencers related to their business.

“If My Data Is an Open Book, Why Can’t I Read It?” is an excellent May 26, 2013 article in The New York Times that explores how much data is collected about us and, more importantly, how much of that data you can access yourself. In the case of wireless providers, you don’t get access to your location logs without a subpoena, and few companies, including the social media kings, provide transparency as a feature to their customers and members the way Amazon does.

Consequently, you don’t own your own data, and at Google it’s really about data events!

Think about your data sphere. Below is a partial list of what it could be.

Your Big Data sphere

FICO scores
Time-of-purchase data
HIPAA
Background checks
Videos in your neighborhood
City police surveillance
Business surveillance
Intersection and red-light monitoring
Personal home surveillance
ATMs
YouTube
Smart phone tracking
Smart phone application use
Social media interactions: Yelp, Facebook, Twitter, etc.
Flickr
Your blog
Your search patterns
Your blood pressure monitors, calorie counter and movement sensors

Social media and Big Data

Social media networks are creating large data sets that are now enabling companies and organizations to gain competitive advantage and improve performance by understanding customer needs and brand experience in nearly real time. These data sets provide important insights into real-time customer behavior, brand reputation and the overall customer experience.

Intelligent or data analysis-driven organizations are now monitoring, and some are collecting, these data from proprietary social media networks such as Salesforce Chatter and Microsoft Yammer, and from open social media networks like LinkedIn, Twitter, Facebook and others.

Figure 1

The majority of organizations today are not harvesting and staging Big Data from these networks but are leveraging a new breed of social media listening tools and social analytics platforms. Many are employing their public relations agencies to execute this new business process. Smarter data-driven organizations are extrapolating social media data sets and performing predictive analytics in real time and in house.

There are, however, significant regulatory issues associated with harvesting, staging and hosting social media data. These regulatory issues apply to nearly all data types in regulated industries, healthcare and financial services in particular. The SEC and FINRA, along with Sarbanes-Oxley, require different types of electronic communications to be organized and indexed in a taxonomy schema, and to be archived and easily discoverable over defined time periods.

Data protection, security, governance and compliance have entered an entirely new frontier with the introduction and management of social data.

Figure 2

Social media analytical tools identify and analyze text strings that contain targeted search terms, which are then loaded into databases or data-staging platforms such as Hadoop. This can enable database queries by, for example, date, region, keyword or sentiment. These queries can enable insights and analysis into customer attitudes toward brand, product, services, employees and partners.
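
The flow described above (match text against targeted search terms, load the matches into a database, then query by region, keyword or sentiment) can be sketched as follows. This is a minimal illustration under stated assumptions: the posts, the brand name, the word lists and the `naive_sentiment` scorer are all hypothetical, and an in-memory SQLite database stands in for a real data-staging platform.

```python
import sqlite3

# Hypothetical posts; a real listening tool would collect these from a social network.
posts = [
    ("2013-05-01", "US", "Love the new Acme phone, great camera"),
    ("2013-05-02", "UK", "Acme support was terrible today"),
    ("2013-05-02", "US", "Nothing to do with the brand"),
    ("2013-05-03", "US", "Acme battery is bad but the screen is great"),
]
search_terms = ["acme"]
positive = {"love", "great"}
negative = {"terrible", "bad"}


def naive_sentiment(text):
    """Toy score: count of positive words minus count of negative words."""
    words = text.lower().replace(",", " ").split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mentions "
    "(day TEXT, region TEXT, keyword TEXT, sentiment INTEGER, body TEXT)"
)
for day, region, body in posts:
    for term in search_terms:
        if term in body.lower():  # keep only text that contains a targeted term
            conn.execute(
                "INSERT INTO mentions VALUES (?, ?, ?, ?, ?)",
                (day, region, term, naive_sentiment(body), body),
            )

# Query by keyword, grouped by region, with the average sentiment per region.
by_region = conn.execute(
    "SELECT region, COUNT(*), AVG(sentiment) FROM mentions "
    "WHERE keyword = ? GROUP BY region ORDER BY region",
    ("acme",),
).fetchall()
print(by_region)
```

Commercial tools use far more sophisticated linguistic analysis than a word list; the sketch only shows how matched text becomes rows that can be filtered and aggregated.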

The majority of products work at multiple levels and drill down into conversations. Results are depicted in customizable charts and dashboards as shown in the image above.

Social media-Big Data analytics

On the bleeding edge of social media analytics, a new wave of tools and highly integrated platforms has emerged to provide social media listening and to enable organizations to understand content preferences (or content intelligence) by affinity group, including which brands those groups follow and which are trending.

There were several early leaders in this space; however, they were acquired (sucked into the vortex of their acquirers) and embedded into larger platforms, resulting in a loss of the original innovators and intellectual property. The good news is that the innovators are here.

Below is a short list of new vendors that are taking social media data to a new level.

Some new social analytic tools

Attensity http://attensity.com
Infinigraph http://www.infinigraph.com
Brandwatch http://www.brandwatch.com
BambooEngine http://www.manumatix.com
Kapow http://kapow.com
Crimson Hexagon http://www.crimsonhexagon.com
Sysomos http://www.sysomos.com
Simplymeasured http://simplymeasured.com
Netbase http://www.netbase.com
Gnip http://gnip.com

Varieties of Business-Critical Big Data

Legal and regulatory
Digital metrics
Energy cost
Senior monitoring
Predictive
Electronic trading
Sustainability and environmental
Supply chain
Genetic
Vehicle telematics
Real time
Oil prices
Consumer behavior
Social
Emerging markets
Location
Disaster
Product

Net/Net

The good news is Big Data provides many more new types of data for analysis that are seminal in this millennium, which is all about the new data-driven culture of real-time decision making.

As research by MIT’s Brynjolfsson and McAfee shows, companies in the top third of their industries in the use of data-driven decision making are, on average, five percent more productive and six percent more profitable than their competitors (HBR, October 2012).

Multiple new and diverse Big Data sets add additional parameters to some business models that weren’t available before. For many CEOs Big Data is all about potentially disruptive business-critical Big Data.

In summary, data is information, and Big Data is all about high-volume, high-velocity and high-variety information assets that facilitate new forms of decision making and that organizations can leverage for competitive advantage.

Peter J. Auditore is principal researcher at Asterias Research, a consultancy focused on information management, traditional and social analytics and Big Data. He was a member of SAP’s Global Communications team for seven years and recently head of SAP Business Influencer Group. He is a veteran of four technology startups including Zona Research (cofounder), Hummingbird (VP marketing Americas), Survey.com (president) and Exigen Group (VP corporate communications). He has over 20 years’ experience selling and marketing software worldwide.

George Everitt, CEO, founded Applied Relevance in 2006. Prior to AR, George was a senior consultant at Verity (10 years) and Autonomy (one year). While at Verity and Autonomy, George participated in dozens of professional services engagements worldwide implementing high-performance, super-scalable enterprise search applications. The breadth of George’s experience covers most industries, including pharmaceutical, telecommunications, retail, manufacturing, public sector and financial services.
