From Data Lakes to HTAP: 3 Alternatives to OLAP Data Warehouses

July 22, 2016

Whether you have a data warehouse already in place, or are just researching solutions, you’re probably aware of the splitting migraines that data warehouses can create both for business analysts and the IT department.

(If not, take a look at part one of this two-part series, which explains some of the limits of traditional data warehouses designed to support online analytical processing (OLAP) tools.)

With the evolution of in-memory computing, tools for interactive data visualization and new types of database management systems (DBMSs), the business intelligence (BI) market is now saturated with alternatives to the OLAP data warehouse.

Here, we take a look at some of the simplest workarounds for performing historical analysis without a fully fledged data warehouse. These workarounds can be particularly useful for small businesses that lack the IT or budgetary resources to implement a data warehouse in the first place.

We’ll also discuss advanced solutions like data lakes and HTAP that promise to eliminate the need for data warehousing altogether.

Here’s what we’ll cover:

Alternative #1: Data Lakes
Alternative #2: HTAP/In-Memory DBMSs
Alternative #3: OLAP-Style Analysis With Self-Service Tools
Moving Forward

Alternative #1: Data Lakes

Data lakes are essentially much less structured data stores than data warehouses.

Traditional data warehouses are based on relational database management system (RDBMS) technologies that store data in tables, the most familiar example being SQL databases.

In contrast to the tables used in relational databases, the multidimensional data cubes at the heart of OLAP systems are nonrelational data stores: data is loaded into a cubic array rather than into tables.

However, the data warehouse generally contains more data than the cube itself, allowing analysts to drill down below the most granular level of detail stored in the cube.

Typically, data that is too granular or unstructured to load into an OLAP cube is stored in relational tables that supplement the multidimensional analytics performed in the cube. In other words, traditional OLAP deployments rely on both relational and nonrelational data storage.

Generally, relations between tables in a data warehouse are defined using a star schema, which organizes data into fact tables and dimension tables to facilitate joining tables via queries:

Typical Star Schema for Sales Records

The star schema consists of two types of tables: fact tables and dimension tables.

  • Fact tables contain measures, i.e., numerical values for sales amounts, units sold, etc., as well as foreign keys that match the primary keys found in dimension tables.
  • Dimension tables contain primary keys along with attributes of the dimension (for example, “DayOfWeek” here is an attribute of the “Time” dimension). The dimensions and attributes found in dimension tables provide the drill paths that guide OLAP analysis.

By linking foreign keys to primary keys, star schemas associate the same facts with multiple dimensions, thereby enabling multidimensional analysis in a data cube.
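
To make this concrete, here is a minimal sketch of a star schema and a dimensional query, using Python’s built-in sqlite3 module. The table and column names are hypothetical, chosen to mirror the sales example above.

```python
import sqlite3

# In-memory database for illustration; all names here are hypothetical.
conn = sqlite3.connect(":memory:")

conn.executescript("""
CREATE TABLE dim_time (
    time_id     INTEGER PRIMARY KEY,
    day_of_week TEXT,     -- attribute of the Time dimension
    month       TEXT,
    year        INTEGER
);
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE fact_sales (
    time_id      INTEGER REFERENCES dim_time(time_id),       -- foreign key
    product_id   INTEGER REFERENCES dim_product(product_id), -- foreign key
    sales_amount REAL,    -- measure
    units_sold   INTEGER  -- measure
);
""")

# Minimal sample data so the query below returns a row.
conn.execute("INSERT INTO dim_time VALUES (1, 'Friday', 'July', 2016)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 19.99, 2)")

# A typical OLAP-style query: join facts to dimensions, then
# aggregate a measure along the attributes of two dimensions.
rows = conn.execute("""
    SELECT t.year, p.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_time t ON f.time_id = t.time_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY t.year, p.category
""").fetchall()
print(rows)  # [(2016, 'Hardware', 19.99)]
```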

By contrast, data in a data lake is far less structured and schematized. In many cases, data is still preprocessed before storage, but not using rigid structures (such as the star schema) typically used for data warehousing.

With a data warehouse, information is structured using schemas when it is loaded into the warehouse, an approach known as schema on write (since the data is schematized when it’s written to storage).

With a data lake, the schema is defined when the data is queried via an API or SQL, an approach known as schema on read, which requires much less rigid upfront data modeling.

Instead, the complexity of defining schemas is shifted to the query itself. This is why data lakes are ideal for analysts and data scientists who have advanced querying skills and are supported by a nimble IT department.
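
The difference is easiest to see side by side. Below is a minimal sketch, assuming a hypothetical feed of raw JSON events: under schema on write, structure is enforced as records are loaded; under schema on read, records land as-is and the structure lives in the query.

```python
import json
import sqlite3

# --- Schema on write: structure is enforced when data is loaded. ---
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE events (user_id INTEGER, action TEXT, amount REAL)")
record = {"user_id": 42, "action": "purchase", "amount": 19.99}
dw.execute("INSERT INTO events VALUES (?, ?, ?)",
           (record["user_id"], record["action"], record["amount"]))

# --- Schema on read: raw records are stored as-is, whatever their shape. ---
data_lake = [
    '{"user_id": 42, "action": "purchase", "amount": 19.99}',
    '{"device": "phone", "sentiment": "positive"}',  # different shape, still stored
]

# The "schema" lives in the query logic, applied only at read time.
purchases = [r for r in map(json.loads, data_lake)
             if r.get("action") == "purchase"]
```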

Schema on Write vs. Schema on Read
The schema-on-read approach allows data lakes to store numerous, highly heterogeneous data sets. Sales records and transactional feeds from mobile devices may be stored along with huge data sets scraped from social media for the purpose of sentiment analysis.

Data lakes overcome the technical challenges involved in using data warehouses to store big data sets. Moreover, they eliminate the need for extensive upfront modeling, since with a typical data warehouse the schema needs to be defined when the data is loaded.

This means that data lakes are ideal for use cases in which you aren’t sure at the outset how exactly you want to analyze your data sets.

Typically, you’ll be looking for transformative business insights in the data lake (new market segments, new ways to engage customers etc.), as opposed to running diagnostics on performance issues (e.g., determining why revenue dropped in Q3).

However, data lakes are notoriously difficult to manage, and business analysts may not have the necessary skills and tools to extract insights from them.

The bottom line: Data lakes are great for businesses that need to deal with big data, such as the data sets generated by web scraping, social media and customer interactions with web and mobile applications.
 
Additionally, data lakes are deep enough to handle the terabyte- and petabyte-scale data sets generated by massive sensor deployments in either industrial or consumer-oriented internet of things projects. However, you will probably need a data scientist or at least a highly trained analyst on your team to derive useful insights from them.

A Word of Warning

Currently, data lakes as well as analytics on data lakes are primarily supported by open-source frameworks such as Apache Hadoop and Hive.

Hadoop is “free,” but Hadoop initiatives are infamously complex and expensive to get off the ground. Architecting a solution from the ground up will require considerable technical and budgetary resources, and may wind up being an expensive failure.

Instead, consider looking into packaged Hadoop distributions from vendors such as Cloudera and Hortonworks, as these vendors offer packaged applications and support.

Dennis Duckworth, director of product marketing at VoltDB (an in-memory, operational database management system that we’ll discuss in the next section of this report), observes:

“We have a lot of customers on Cloudera, Hortonworks and MapR. Certainly, Hadoop isn’t replacing all the data warehouses out there, but a lot of companies are implementing data lakes as long-term repositories for data and sandboxes for data scientists to play in.”

Some DBMS vendors that have traditionally specialized in data warehousing, such as Teradata, also offer products for creating and managing data lakes.

Alternative #2: HTAP/In-Memory DBMSs

In-memory computing offers another alternative to the data warehouse by allowing for the consolidation of analytical and transactional databases into a single database that can support both online transactional processing (OLTP) and OLAP.

This is known as hybrid transactional/analytical processing, or HTAP. It differs from data warehousing in that, with a warehouse, OLTP databases feed a separate OLAP database, whereas HTAP eliminates the step of loading data into a warehouse for historical analytics:

Relations Between OLAP, HTAP and OLTP

One of the primary challenges in moving to a true HTAP solution is getting transactional databases to run reliably in memory (i.e., storing data in RAM rather than writing it to disk). Partial solutions are already hitting the market, but for now your best options are in-memory analytical or transactional databases, as solutions that fully support HTAP still haven’t emerged.
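
To get a feel for the pattern, here is a toy sketch using SQLite’s in-memory mode (not how VoltDB or any HTAP product actually works internally): a single in-RAM store serves both a transactional write and an analytical aggregate, with no load step in between.

```python
import sqlite3

# Toy illustration of the HTAP idea; table names and data are invented.
db = sqlite3.connect(":memory:")  # data lives in RAM, not on disk
db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 250.0)])

# OLTP-style work: a short, atomic transaction against live data.
with db:
    db.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
    db.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")

# OLAP-style work against the very same store, no warehouse load required.
total, = db.execute("SELECT SUM(balance) FROM accounts").fetchone()
print(total)  # 350.0
```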

A good example of a DBMS that’s moving toward HTAP support is VoltDB. Peter Vescuso, chief marketing officer at VoltDB, describes the solution as an “in-memory operational database or OLTP database.”

VoltDB can also be described as a NewSQL database, meaning it significantly reworks traditional relational database architecture in order to scale while maintaining high performance.

While NewSQL solutions are relational, Vescuso notes that they distribute data across multiple nodes in a server cluster, allowing organizations to use cheap commodity hardware instead of the larger (and more expensive) single machines that traditional RDBMSs require to scale up.

VoltDB’s in-memory architecture allows for dramatic reduction in processing overhead, Vescuso explains: “90 to 95 percent of the work with traditional databases had nothing to do with data processing and everything to do with the disk-based structure.”

Monitoring server performance in VoltDB Management Center

In-memory OLTP solutions such as VoltDB already support sophisticated analytics, but these tend to be real-time analytics embedded in automated processes rather than the complex historical analyses performed by business analysts.

In other words, the application generally makes a decision or recommendation on the basis of the analytics, rather than a human being. The in-memory architecture allows this decision to be made as rapidly as possible for time-sensitive applications.

As Vescuso explains:

“There are two sides to data: big data, or historical data at rest, and fast data, or live operational data before it becomes big data. Analytics on live streaming data is often for consumer-facing, front-end interactions, but can also be used to interact with a network-facing application for automating operations such as policy enforcement, credit card fraud detection etc.”

Mobile apps provide rich use cases for this type of analytics. Vescuso gives one example of an automated interaction with a mobile app via real-time analytics:

“We could see that a customer is streaming media on his or her smartphone, and the account balance is dropping rapidly, put an offer in front of the customer, see that the customer has accepted the offer and let the streaming continue.”
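
As a rough sketch of the kind of automated rule this quote describes (the event fields, threshold and offer name below are hypothetical, not VoltDB code):

```python
# Toy sketch of a real-time decision embedded in an automated process.
def on_usage_event(event, send_offer):
    """React to a live usage event instead of waiting for batch reports."""
    if event["activity"] == "streaming" and event["balance"] < 5.00:
        send_offer(event["customer_id"], offer="data_top_up")

# Example: a streaming customer whose balance is dropping triggers an offer.
on_usage_event(
    {"customer_id": 7, "activity": "streaming", "balance": 3.25},
    send_offer=lambda cid, offer: print(f"offer {offer} -> customer {cid}"),
)
```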

Ultimately, HTAP solutions will support both real-time and historical analytics, allowing for transactional and historical data to reside within the same DBMS.

However, VoltDB’s Duckworth observes that true HTAP isn’t yet technically feasible: “If you look at HTAP solutions like SAP HANA, we still hear that they’re much better at the OLAP side than the transactional side. They’re still having issues, and the solutions are expensive.”

“There are other NewSQL solutions that claim to be HTAP,” he says, “but what we’ve found in talking to customers is that they tend to be stronger on one side or the other. We do some HTAP, but we’re stronger on the transactional side, because we think that’s harder to do and more critical to get right.”

HTAP looks to be the most logical way to support both data-intensive operations and analytics moving forward, so if you’re aiming to get ahead of the game, you may want to consider experimenting with in-memory databases, particularly for OLTP.

Indeed, Duckworth notes that the real-time analytics provided by solutions such as VoltDB can complement historical analytics: “We are fully SQL compliant and we support JDBC connectivity, so people can write the reports and tap the data in a VoltDB database, and often do to take a peek and see what’s going on in there.”

The bottom line: HTAP isn’t a mature technology, and you should treat vendors’ claims to support it with some skepticism. However, exploring in-memory, NewSQL solutions for OLTP can help your organization cultivate advanced skills with new DBMS architectures and allow for limited HTAP functionality in the short term.

On the other hand, even moving to “HTAP-lite” will require a dramatic re-architecting of your mission-critical, operational databases, and will require a considerable amount of buy-in from stakeholders across the organization. Your company might not be ready to take the plunge at this point.

Generally, you’ll also need an operational use case for the real-time analytics that are the true focus of NewSQL OLTP solutions, e.g., network-facing, real-time analytics in support of automated processes such as fraud detection, or consumer-facing mobile and web applications that depend on real-time interactions.

Alternative #3: OLAP-Style Analysis With Self-Service Tools

Self-service BI tools use a different technology from the traditional OLAP tools that data warehouses support. In particular, self-service tools use column-store data caches rather than OLAP data cubes.

These data caches can be accessed in memory instead of reading from or writing to disk. This allows for much more rapid querying than OLAP data warehouses can support.
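
A quick sketch of why the columnar layout matters for analytics: an aggregate over a column store scans one contiguous array, while the same aggregate over a row store has to touch every field of every record. The data below is invented for illustration.

```python
# Row store: each record keeps all of its fields together.
rows = [
    {"region": "East", "units": 120, "revenue": 2400.0},
    {"region": "West", "units": 95,  "revenue": 1900.0},
]

# Column store: each field is kept as its own contiguous array.
columns = {
    "region":  ["East", "West"],
    "units":   [120, 95],
    "revenue": [2400.0, 1900.0],
}

# Summing revenue from the row store touches every field of every row...
total_row_store = sum(r["revenue"] for r in rows)

# ...while the column store scans a single array, which is what makes
# in-memory, columnar caches fast for aggregation-heavy BI queries.
total_column_store = sum(columns["revenue"])
```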

Self-service tools are also designed for analytics via a visual interface, reducing both the technical knowledge required of business users and the need for IT assistance. Moreover, they don’t require upfront data modeling, and they can be used with a wealth of different data sources, including big data sources.

Self-service analysis offers the same kind of multidimensional interaction with data sets that originally required OLAP cubes, though users define their own drill paths rather than following those structured into a data cube during data modeling.

Using measures and dimensions to generate visualizations in SAP Lumira

The bottom line: Self-service tools are a great way to extend analytics capabilities to business users, get around the inflexibility of data cubes and allow business users to explore a wide range of data sets. However, they can’t serve as a highly governed “single source of truth” like a data warehouse.

Moving Forward

Two of the strategies we’ve considered here, data lakes and HTAP solutions, will require a radically new architectural approach to storing historical data. HTAP also requires a total reworking of the ways in which you store transactional data.

If you don’t yet have a data warehouse in place, you need to consider:

  • The volume of data you need to store
  • The degree to which your data is structured and shares common schemas
  • The use cases for real-time analytics at your organization
  • Data sources beyond transactional data that you want to analyze

For organizations that are primarily storing transactional data for historical analytics, a data warehouse still makes sense. Our site offers reviews, pricing and demos for vendors that offer end-to-end BI platforms that support data warehousing.

If you plan to juxtapose transactional data with web-scale or big data sets, a data lake will make more sense.

If you’re dealing with vast quantities of streaming operational data that need to be processed via real-time analytics, then beginning the journey to HTAP via an in-memory OLTP DBMS can offer significant competitive advantages both with respect to performance and enabling new forms of analytics.

If you’re still not sure that you even need a data warehouse in the first place, then consider self-service solutions focusing on data visualization for the time being. These tools provide nearly all of the functionality of traditional, data warehouse-supported OLAP tools.
