Data discovery is one of the fastest-growing and rapidly changing segments of the BI market. These tools differ dramatically from the traditional systems of record that enable IT to push reports and dashboards out to the rest of the organization.
In many cases, data discovery tools are purchased by organizations that have already deployed traditional BI systems, in order to solve issues with data access, data preparation and data exploration. Data discovery solutions have also been a godsend for small businesses that can’t afford complex data warehouses and lack the expertise to build them.
The market for data discovery software is complex and highly fragmented. There are a number of different “flavors” of data discovery, and a variety of use cases in which one flavor works better than another.
In this Buyer’s Guide, we’ll explain how data discovery software differs from traditional BI and describe the categories into which these tools break down.
An easy way to understand this difference is to look at the history of BI solutions.
Traditional BI systems were an attempt to solve the difficulty of writing SQL queries to retrieve data, such as sales figures, customer information and shipping records, stored across multiple relational databases. Before BI, users needed a strong command of SQL to get the data they needed out of such databases.
Thus, traditional BI systems mapped a layer of familiar business terms (known as a semantic layer) onto the relational databases’ storage schemas, thereby allowing users to retrieve and combine data without knowing SQL at all.
The semantic layer is a way of expressing a data model, or a schematic representation of the relationships between data in one or multiple datasets. In particular, the semantic layer schematizes the relationships between data residing in different data sources/databases. For instance, the dimension “customer” in the semantic layer may be defined as grouping together information from both the “sales orders” database as well as the “customer records” database.
BusinessObjects—later acquired by SAP—was the first BI vendor to use the semantic layer model, and remains one of the most popular semantic layer-based solutions. The semantic layer model is still suitable for large enterprises that need unified access to data stored in numerous operational databases.
The problem with this model is that the semantic layer needs to be standardized across the organization. In other words, various business units must agree on which databases and tables in these databases the dimension “customer” will pull from. Moreover, once the semantic layer has been standardized, it remains under IT control.
As you can see in the above diagram, traditional tools for ad hoc queries pass analysts’ queries through the semantic layer, which automatically translates them into SQL queries to retrieve data from SQL databases and other data sources that support SQL querying. Thus, traditional querying tools can only work with data sources that have already been integrated into the semantic layer.
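To make the translation step concrete, here is a minimal sketch of how a semantic layer might map business terms to physical tables and generate SQL behind the scenes. All table, column and key names here are hypothetical, and real BI engines are far more sophisticated:

```python
# A toy "semantic layer": business terms mapped to physical tables/columns.
# These table and column names are invented for illustration.
SEMANTIC_LAYER = {
    "customer": {"table": "crm.customer_records", "column": "cust_name"},
    "revenue": {"table": "erp.sales_orders", "column": "order_total"},
}

def to_sql(dimension: str, measure: str, join_key: str = "customer_id") -> str:
    """Translate a business-term query into the kind of SQL a BI tool might generate."""
    d, m = SEMANTIC_LAYER[dimension], SEMANTIC_LAYER[measure]
    return (
        f"SELECT d.{d['column']} AS {dimension}, SUM(m.{m['column']}) AS {measure} "
        f"FROM {d['table']} d JOIN {m['table']} m ON d.{join_key} = m.{join_key} "
        f"GROUP BY d.{d['column']}"
    )

# The analyst asks for "revenue by customer"; the layer writes the SQL.
print(to_sql("customer", "revenue"))
```

The key point is that the analyst only ever sees the terms "customer" and "revenue"; the mapping to tables and join keys is fixed in advance, which is exactly why sources outside the mapping are hard to reach.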
Data sources outside the semantic layer (a spreadsheet sent in an email, a public data source on the web, 500,000 Tweets about a product recall etc.) can’t be easily integrated with the semantic layer unless IT develops new processes. And, of course, IT can’t develop a process for every new data source.
When the semantic layer is standardized across the organization, the paths that analysts follow to retrieve and combine data get frozen into place. For instance, if the organization defines “store” as a subcategory of “branch,” and “branch” as a subcategory of “sales region,” while neglecting to slot “customer” somewhere into this hierarchy, blended analysis of sales and customer data can become overly complex.
Business terms mapped to operational data in SAP BusinessObjects
Data discovery tools remedy this situation by providing direct access to the operational databases shown in our chart, instead of going through a semantic layer. This allows users to combine spreadsheets and other data sources outside the semantic layer with operational data.
Any data preparation work that needs to be done to combine data sources (e.g., converting “customer_ID” to “customer”) is done on the fly, instead of forcing IT to standardize terminology across the organization.
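The "customer_ID" example can be sketched in a few lines: two sources name the key differently, and the analyst harmonizes it at analysis time rather than waiting for IT to standardize terminology. The data and field names below are invented for illustration:

```python
# Two data sources that name the customer key differently.
sales = [{"customer_ID": "C1", "amount": 120.0},
         {"customer_ID": "C2", "amount": 80.0}]
customers = [{"customer": "C1", "region": "East"},
             {"customer": "C2", "region": "West"}]

# Rename "customer_ID" to "customer" on the fly...
for row in sales:
    row["customer"] = row.pop("customer_ID")

# ...then blend the two sources on the now-shared key.
regions = {c["customer"]: c["region"] for c in customers}
blended = [{**row, "region": regions.get(row["customer"])} for row in sales]
print(blended)
```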
Additionally, users can develop their own data models during analysis, instead of being bound to the data model encoded in the semantic layer. This allows greater flexibility for sophisticated queries that depend on blending data from multiple sources.
Data discovery platforms vary widely, so listing specific features would be of limited use. Instead, let's take a quick look at the broad capabilities that define these solutions:
| Capability | Description |
| --- | --- |
| (Graphical) front end for data manipulation | Allows for data access and manipulation via visualizations of data sources and patterns in data. Instead of writing a query, you can simply click on a wedge of a pie chart to drill down, or choose a heat-map visualization for your data. |
| In-memory processing | Processes data by storing it in RAM (random access memory) instead of writing it to disk. This gives these tools the processing power to blend massive datasets on a user's laptop, instead of performing blends in the database as traditional BI tools do. See our data blending report for more details. |
| Big data connections | Supports direct connections to data sources, instead of confining access to sources within the semantic layer. Support for flat files (.xlsx, .csv etc.) is nearly universal, as is support for SQL databases. Beyond that, the range of data sources a tool can connect to is generally a point of competitive differentiation. |
| Data cleaning/preparation | Offers features for cleaning and preparing data on the fly, since analysts can't rely on pre-integration of data sources via a semantic layer: normalizing dimensions, removing trailing spaces, testing the accuracy of joins etc. |
Note: Several of these definitions of data discovery capabilities were adapted from Gartner research reports, specifically What Data Discovery Means for You by Joao Tapadinhas and Dan Sommer (available to Gartner clients).
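The cleaning and join-testing work described above can be sketched simply: trim stray whitespace in a join key, unify its case, then count how many rows actually match before trusting a blend. The sample data is invented for illustration:

```python
# Hypothetical messy data: a trailing space and inconsistent case in the key.
orders = [{"customer": "C1 ", "total": 10}, {"customer": "c2", "total": 5}]
master = [{"customer": "C1"}, {"customer": "C2"}]

def normalize(key: str) -> str:
    """Strip stray whitespace and unify case so keys compare reliably."""
    return key.strip().upper()

# Test the accuracy of the join before relying on the blended result.
known = {normalize(m["customer"]) for m in master}
matched = [o for o in orders if normalize(o["customer"]) in known]
unmatched = len(orders) - len(matched)
print(f"{len(matched)} rows join cleanly, {unmatched} would be dropped")
```

Without the `normalize` step, both rows would silently fail to match; checks like this are what data discovery tools automate.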
Data discovery has been an emerging market for at least a decade, but instead of solidifying around a core set of concepts and features, the market has continued to evolve.
Data discovery functionality has also been added to traditional systems that use semantic layers, though such systems will still be overkill for many small businesses.
There are essentially three categories of data discovery solutions currently on the market:
Visual data interaction tools are analytics tools that directly access data sources instead of going through a semantic layer. They allow users to process massive datasets on their laptops (via in-memory caching engines) and spot patterns using a visual interface.
Data visualizations in Tableau
The point of a visual data discovery tool isn't simply to crunch numbers and then output pretty charts and graphs, which can easily be done with Excel and PowerPoint. Instead, these tools are for interactive manipulation of data via visualizations.
For example, you can click on a particular city in a heat-map to begin analyzing sales just within that city’s stores. You can then add another dimension to your map—say, aggregate payroll expenses per store—to blend sales and payroll data and spot new patterns.
As you click on visualization elements and drag and drop dimensions and measures into your visualizations, an engine within the data discovery tool translates your gestures into SQL queries. Changing the visualization automatically refreshes it with newly processed data from your databases.
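This gesture-to-SQL translation can be sketched as a filter that accumulates with each click and regenerates the backing query. The table and column names below are hypothetical:

```python
# Each click on a chart element adds a filter; the engine rebuilds the query.
filters = {}

def click(dimension, value):
    """Record a click on a visualization element and regenerate the SQL behind it."""
    filters[dimension] = value
    where = " AND ".join(f"{k} = '{v}'" for k, v in filters.items())
    return f"SELECT store, SUM(sales) FROM sales_facts WHERE {where} GROUP BY store"

print(click("city", "Louisville"))   # click a city on the heat-map
print(click("quarter", "Q4"))        # drill down further; both filters now apply
```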
These tools thus allow for highly interactive and sophisticated database querying without forcing users to learn SQL. Moreover, they allow users to access and blend data from multiple data sources that haven’t been integrated via a semantic layer.
For this reason, visual data interaction tools are known as "self-service" BI tools: business analysts can get the data they need and analyze it in the ways they want without involving IT in the workflow.
Originally, visual data interaction tools were designed to supplement the capabilities of an existing BI system. As they've evolved, however, they've incorporated more and more of the capabilities that used to be found only in traditional systems. Many organizations—especially smaller ones—now rely on this form of data discovery as their primary analytics platform.
Visual data interaction tools make up the bulk of the data discovery market, and "data discovery" is frequently used as a synonym for business analytics via interactive visualizations.
“Search engine-like” tools are a niche category in data discovery. They’re specifically for performing keyword searches of large collections of files, and they feature an interface similar to that of web search engines such as Google and Bing. Search-based tools harness text mining technology to allow users to search keywords within files and documents:
Data discovery using keyword searches and word clouds in WebFOCUS
Search-based tools are clearly not the best choice for dealing with numerical values, which are, of course, absolutely central to business analysis. Instead, this form of data discovery is used by organizations with massive collections of unstructured textual data (surveys, documents, presentations, product literature etc.) sitting in numerous data silos.
Without search-based data discovery, employees may never be able to track down the documents they need on their own. These tools thus enable better information-sharing, at the same time cutting down on the time that information “gatekeepers” have to spend tracking down documents for co-workers. Most small businesses won’t need them.
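The text-mining idea behind these tools can be sketched as a simple inverted index: map each word to the documents containing it, then answer keyword queries by lookup. The document names and contents below are invented for illustration, and real search-based tools add ranking, stemming and much more:

```python
from collections import defaultdict

# Hypothetical document collection sitting in different silos.
docs = {
    "survey_2016.txt": "customers reported slow shipping in the midwest",
    "recall_memo.txt": "product recall announced after shipping defects",
}

# Build the inverted index: word -> set of documents containing it.
index = defaultdict(set)
for name, text in docs.items():
    for word in text.lower().split():
        index[word].add(name)

def search(keyword):
    """Return the set of documents containing the keyword."""
    return index[keyword.lower()]

print(search("shipping"))
```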
“AI”-based tools. Visual data interaction tools can be used to support pattern detection via machine learning (or “AI” in layman’s terms). Generally, this requires integration with a variety of other tools and technologies, ranging from the statistical programming language R to Apache Spark (a framework for running machine-learning algorithms in cluster computing environments).
“AI-based” data discovery tools directly leverage machine learning to spot patterns for users, instead of enabling users to spot patterns themselves through visual analysis. These tools then output visualizations and can even express the patterns they find in narrative form for users (for example, they can output a sentence stating “Q4 revenue down 2.1 percent in Kentucky branch stores served by X, Y and Z distributors.”).
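In its simplest form, this kind of automated narration amounts to scanning a metric for outliers and describing each finding in prose. The figures and threshold below are invented, and real tools use far richer statistical models:

```python
# Hypothetical quarter-over-quarter revenue change, in percent, by region.
revenue_change = {"Ohio": 0.4, "Kentucky": -2.1, "Indiana": 0.1}

def narrate(changes, threshold=1.0):
    """Flag changes larger than the threshold and describe each one in a sentence."""
    findings = []
    for region, pct in changes.items():
        if abs(pct) >= threshold:
            direction = "down" if pct < 0 else "up"
            findings.append(f"Q4 revenue {direction} {abs(pct)}% in {region} stores")
    return findings

print(narrate(revenue_change))
```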
Don’t assume that a HAL 9000 will replace your analysts anytime soon, however. Human beings still need to vet the patterns to make sure that they’re truly significant, and once a pattern has been spotted, users can continue to refine the analysis by asking new questions of the tool, similar to the workflow in a visual data interaction tool.
Examples of “AI”-based data discovery tools include IBM Watson and Salesforce BeyondCore. This is still an emerging market, and while promising, these solutions are too expensive and technologically immature for SMB users at present. Most SMBs will be better served exploring the wide range of visual data interaction tools on the market.
Note: Several of these definitions of categories in the data discovery market were adapted from Gartner research reports, specifically What Data Discovery Means for You by Joao Tapadinhas and Dan Sommer (available to Gartner clients).