Over the past decade, there has been an explosion of new data types: social media data, sensor data, endless streams of behavioral data about website and mobile app users, and more. This “big bang” in the volume and variety of data businesses must process has put a massive strain on data warehouse architecture.
Using a traditional extract, transform and load (ETL) process to wrangle millions of records in these new formats into an OLAP cube is nobody’s idea of fun.
Thankfully, a new data integration approach for data warehousing has emerged that promises to help with some of the nastiness of warehousing big data: extract, load and transform (ELT).
The differences between these two processes aren’t confined to the order in which you perform the steps. To decide whether it’s time to move to ELT, you’ll need to consider the following:
- Underlying storage technologies
- Your design approach to data warehouse architecture
- The business use cases for the data warehouse itself
Everything Hinges on the “T” in ETL/ELT
As you’re aware, the transformation step is easily the most complex step in the ETL process. ETL and ELT thus differ in two major respects:
- When the transformation step is performed
- Where the transformation step is performed
ETL tools arose as a way to integrate data to meet the requirements of traditional data warehouses powered by OLAP data cubes and/or relational database management system (DBMS) technologies, depending on the architecture of the warehouse.
In a traditional data warehousing setup, data is first extracted from “source systems” (e.g., ERP systems, CRM systems).
OLAP tools and SQL queries depend on the standardization of dimensions across data sets to deliver aggregate results. This means that the data must go through a series of transformations, such as:
- Converting numerical values
- Editing text strings
- Matching rows and columns
- Finding and replacing values
- Changing column names
- Recombining columns from different tables and databases
- Precalculating intermediate aggregates
Traditionally, these transformations have been performed before loading data into the “target system”—typically, a relational data warehouse.
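To make the transform step concrete, here is a minimal sketch of the kinds of transformations listed above being applied to extracted rows before loading. The field names and formats are hypothetical, and a real ETL tool would do far more, but the shape of the work is the same:

```python
# Hypothetical pre-load transform step in a traditional ETL flow:
# clean extracted rows before writing them to the target warehouse.
from datetime import datetime

def transform(rows):
    out = []
    for row in rows:
        out.append({
            # Editing text strings / changing column names
            "customer_name": row["cust_nm"].strip().title(),
            # Converting numerical values
            "order_total": round(float(row["total"]), 2),
            # Standardizing a dimension (dates) so aggregates line up
            "order_date": datetime.strptime(row["dt"], "%m/%d/%Y").date().isoformat(),
        })
    return out

raw = [{"cust_nm": "  acme corp ", "total": "199.9900", "dt": "03/15/2016"}]
print(transform(raw))
```

In ETL, code like this runs in the integration layer, outside the warehouse; the warehouse only ever sees the cleaned output.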
As the underlying data storage and processing technologies behind data warehousing have evolved, however, it’s become possible to accomplish transformations within the target system.
Both ETL and ELT processes involve staging areas. In ETL, these staging areas are found within the ETL tool, be it proprietary or custom-built. They sit between the source system (e.g., a CRM system) and the target system (the data warehouse).
By contrast, with ELT, the staging area is within the data warehouse, and the database engine powering the DBMS performs the transformations, as opposed to an ETL tool.
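The ELT staging pattern can be sketched in miniature. Here SQLite stands in for the warehouse DBMS, and all table and column names are hypothetical; the point is that raw data lands first and the database engine itself performs the transformation:

```python
# A rough illustration of ELT: load raw rows into a staging table first,
# then let the database engine transform them in place via SQL.
# SQLite is used here only as a stand-in for the warehouse DBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (cust_nm TEXT, total TEXT)")
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?)",
    [("  acme corp ", "199.9900"), ("beta llc", "50.5")],
)

# Transform *inside* the target system: the engine does the work,
# not a separate ETL server.
conn.execute("""
    CREATE TABLE orders AS
    SELECT TRIM(cust_nm) AS customer_name,
           ROUND(CAST(total AS REAL), 2) AS order_total
    FROM staging_orders
""")
print(conn.execute("SELECT * FROM orders ORDER BY customer_name").fetchall())
```

The trade-off is visible even in this toy: the transformation logic is now plain SQL (or MapReduce/Java in a Hadoop stack), with none of the visual tooling an ETL product would provide.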
One immediate consequence of this shift is that you lose the neat visual interface and data preparation/cleaning features that ETL tools provide to ease the data transformation process.
As Merv Adrian and Ted Friedman note in “Hadoop Is Not a Data Integration Solution” (content available to Gartner clients), “transformations in Hadoop stacks are written by Java programmers, and must be maintained by them.”
This means that if your IT department is short on programmers with sufficient Java knowledge to perform custom transformations, ELT may not be right for you. (That is, ELT leveraging Hadoop—we’ll see below that you have some other options too.)
So how do you move to the ELT process, and what are the advantages of doing so? To answer these questions, we’ll need to take a closer look at the characteristics of the target systems used in the ELT process.
Hadoop and Advanced Data Integration Vendors Enable ELT
While some relational DBMSs, such as Microsoft SQL Server, support the ELT process, use of ELT versus ETL is going to be highly dependent on the characteristics of the specific DBMS you’re using.
At a higher level, ELT is a good option if you’re moving to a data warehousing structure to support big data initiatives using Hadoop or a NoSQL/NewSQL analytical DBMS. The diagram below shows how ELT versus ETL works in the context of Hadoop:
(Content available to Gartner clients)
This diagram shows two essential differences between the ETL and ELT processes:
- That transformations are accomplished in Hadoop in the ELT process.
- That the ETL process feeds the traditional warehouse directly, whereas the ELT process feeds the warehouse from Hadoop (after the transformations have been performed).
Gartner analyst Mei Yang Selvage notes in “Use Data Integration Patterns to Build Optimal Architecture” (content available to Gartner clients) that ELT thus simplifies your overall data storage and processing environment by “avoiding a separate ETL server” and leveraging “existing DBMS engines.”
However, she also cautions that these precise characteristics make ELT primarily appropriate for “workloads with limited transformation and cleansing.”
Poor-quality data, or data that requires substantial integration, shouldn’t simply be loaded wholesale into Hadoop unless you have a large team of highly skilled programmers to write custom code for complex transformations.
Data sets loaded using the ELT process may be relatively simple yet massive, such as log files and sensor data.
In other cases, you could be loading highly unstructured data, such as tweets for sentiment analysis, that don’t require extensive upfront transformations because data mining and machine learning technologies are used for analysis rather than standard queries.
Selvage notes in “Adopt Logical Data Warehouse Architectural Patterns to Mature Your Data Warehouse” (content available to Gartner clients) that in the latter case, analytical developers and end users define their own schemas at a later time, known as “eventual schema” or “schema on read.”
“Schema on read” is opposed to traditional “schema on write” processes, in which the schema is defined via the ETL process before the data is written to the warehouse.
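The “schema on read” idea can be illustrated with a small sketch: raw events are stored exactly as they arrive, and a schema is imposed only when an analyst reads them. The event fields below are hypothetical:

```python
# "Schema on read" sketch: raw events (here, JSON strings) are stored as-is,
# and the reader applies a schema at query time. Field names are hypothetical.
import json

raw_store = [
    '{"user": "a1", "text": "love this product", "lang": "en"}',
    '{"user": "b2", "text": "meh"}',  # missing fields are fine at write time
]

def read_with_schema(store):
    # The schema lives in the reader, not in the storage layer,
    # so it can change without reloading the data.
    for line in store:
        event = json.loads(line)
        yield {
            "user": event.get("user"),
            "text": event.get("text", ""),
            "lang": event.get("lang", "unknown"),
        }

print(list(read_with_schema(raw_store)))
```

Contrast this with schema on write, where the second event would have been rejected or patched up by the ETL process before it ever reached the warehouse.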
- If your data warehouse/Hadoop environment is primarily supplying data to end users who are advanced data scientists and business analysts capable of using “schema on read,” the need for complex transformations is low, and an ELT process is appropriate.
- For deployments where you’re planning to blend big data from Hadoop with data from other source systems into highly structured and regularly scheduled reports, transformations in Hadoop can get hairy pretty quickly, and ETL as depicted in the above chart will remain your best option for the time being.
Hence, Selvage advises that you can think of Hadoop as “a sandbox for a big data environment” in which your most advanced analysts can play, rather than as a straight up replacement for the warehouse itself.
Key Vendors and Technologies Supporting the ELT Process
The following vendors and open source frameworks enable an ELT approach to loading your warehouse:
(Parts of this list excerpted from “Comparing Four Hadoop Integration Architectures,” full content available to Gartner clients)
We’ve seen that ELT offers the following benefits:
- Ability to ingest big data
- Efficiency for data sets used in “schema on read” analytics
- Ability to leverage DBMS for processing as well as storage
- Elimination of ETL server
- Creation of a “sandbox environment” adjacent to data warehouse
These are just the benefits that ELT is offering now. Down the road, the scope of the technology will potentially expand as native data integration tools for Hadoop and NoSQL solutions continue to evolve.
We’re closely following developments in this space at Gartner. You can sign up for our newsletter to keep current with trends in big data analytics, data preparation, hybrid transactional/analytical processing, real-time analytics, marketing analytics and more.