ETL to ELT journey: Set your Transformations free and discover Happiness and Data Zen

Every data integration pipeline consists of three stages: Data Extraction (E), Data Transformation (T) and Data Loading (L).

During the Data Extraction stage, the source data is read from its origins: transactional databases, CRM or ERP systems, or web pages via data scraping.

During the Data Transformation stage, the necessary modifications are applied to the source data. This includes data filtering, enrichment or merging with existing or other source datasets, data obfuscation, dataset structure alignment or validation, field renaming, and data restructuring according to the canonical data warehouse model.

During the Data Loading stage, the data is stored in the pipeline's destination, which could be a staging area, a data lake or a data warehouse.
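
To make the three stages concrete, here is a minimal Python sketch of a single pipeline run. The orders source table, the dw_orders destination table and the field names are illustrative assumptions, with SQLite standing in for both the source system and the destination.

    import sqlite3

    def extract(conn: sqlite3.Connection) -> list:
        # E: read raw records from the source system (a SQLite table
        # stands in for a transactional database; names are illustrative)
        return conn.execute("SELECT id, customer, amount FROM orders").fetchall()

    def transform(rows: list) -> list:
        # T: filter out invalid records and align fields with the
        # warehouse model (here: normalise the customer name)
        return [(order_id, customer.strip().upper(), amount)
                for order_id, customer, amount in rows
                if amount > 0]

    def load(conn: sqlite3.Connection, rows: list) -> None:
        # L: store the result in the destination (staging area,
        # data lake or data warehouse)
        conn.executemany("INSERT INTO dw_orders VALUES (?, ?, ?)", rows)
        conn.commit()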

There are two principal methods for the data integration process of transferring data from where it originated to the destination where it will be used for analysis: ETL and ELT.

The difference between ETL and ELT pipelines lies in whether the Data Transformation stage occurs before or after the Data Loading stage. 

Let's take a closer look at why this difference is significant.

ETL pipeline: data is transformed within a data integration tool and then loaded into the data storage or management solution.

Transformations coupling & data latency: In an ETL pipeline, multiple data transformations are usually tightly linked and executed sequentially. A failure in one of them disrupts the others, either failing the whole data flow or making it harder to monitor each stage's success. Downstream data transformation availability depends on upstream transformation execution times and delays.
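
As a sketch of this coupling (the record shape, reference data and function names below are hypothetical):

    REGIONS = {"DE": "EMEA", "US": "AMER"}  # illustrative reference data

    def filter_invalid(rows):
        return [r for r in rows if r["amount"] > 0]

    def enrich(rows):
        # reference-data lookup; raises KeyError for an unknown country
        return [{**r, "region": REGIONS[r["country"]]} for r in rows]

    def aggregate(rows):
        totals = {}
        for r in rows:
            totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
        return totals

    def run_etl(raw_rows, load):
        # Tight coupling: each step consumes the previous step's output,
        # so a KeyError in enrich() means aggregate() and load() never
        # run -- one bad record fails the whole flow, and every
        # downstream step waits for the upstream ones to finish.
        load(aggregate(enrich(filter_invalid(raw_rows))))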

Maintainability: ETL pipelines usually contain complex logic spread across multiple transformation stages, making them difficult to maintain, understand or change.

Ability to scale: An ETL pipeline can only be scaled as a whole, regardless of the different needs of each transformation operation.

Data processing efficiency: The data transformations are performed with the data integration tool's own processing capability, which is usually iterative (row by row) rather than set-based, using the tool's limited compute power.
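
As a rough contrast between the two processing styles (the orders table and field names are again illustrative):

    # Row by row inside the integration tool: one iteration per record,
    # bounded by the tool's own compute power.
    def tool_side_total(rows):
        total = 0.0
        for row in rows:
            if row["amount"] > 0:
                total += row["amount"]
        return total

    # The equivalent set operation, pushed down to the database engine,
    # which can apply indexes and its query optimizer to the whole set.
    SET_BASED_TOTAL = "SELECT SUM(amount) FROM orders WHERE amount > 0"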

Raw data availability: Because the data transformation stage runs before the data is loaded from the source system, and transformation operations take time, the raw data is not available for analysis until they complete.


ELT pipeline: data is first loaded into the data storage solution and then transformed within the data store.

Transformations decoupling & data latency: When the data is loaded into the data warehouse independently, as a first step, the load can run on a schedule of its own, independent of the downstream transformation operations, making the raw data available for analysis with minimal latency. Subsequent data transformation stages can operate as a separate pipeline, reducing data pipeline complexity and adding flexibility. Each transformation can run individually on its own schedule and be scaled independently. Some will run very frequently, giving data analysts and data scientists fast access to aggregated data without depending on heavier transformations that need more time and resources to complete.
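
In practice, a decoupled ELT pipeline often amounts to a set of independently scheduled jobs. A minimal sketch of such a schedule, with hypothetical job names and cron expressions:

    # Each job runs on its own schedule: the raw load is never blocked by
    # a transformation, and the heavy nightly job does not delay the
    # light hourly one.
    SCHEDULES = {
        "load_raw_orders":        "*/5 * * * *",  # raw data lands every 5 minutes
        "transform_hourly_sales": "0 * * * *",    # light, frequent aggregate
        "transform_ml_features":  "0 3 * * *",    # heavy, once a night
    }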

Maintainability: ELT pipelines are typically less complex than ETL pipelines, and easier to maintain and adapt to frequently changing requirements.

Ability to scale: Each transformation stage can use as many resources as it needs, balancing pipeline cost, execution time and performance. Each pipeline can be aligned precisely with its unique requirements and goals.

Data processing efficiency: Transformation processing is performed inside the data warehouse, using the warehouse's resources as well as data warehouse features that enable fast data processing, such as data clustering, indexes or intelligent query optimizers.
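
A sketch of what pushing a transformation into the warehouse can look like; SQLite stands in for the warehouse connection here, and the table and column names are illustrative:

    import sqlite3  # stand-in for a warehouse connection (Snowflake, BigQuery, ...)

    # The transformation is plain SQL executed by the warehouse itself, so
    # it runs as a set operation on warehouse compute, where clustering,
    # indexes and the query optimizer can all be applied.
    DAILY_SALES_SQL = """
    CREATE TABLE IF NOT EXISTS daily_sales AS
    SELECT order_date, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY order_date
    """

    def transform_in_warehouse(conn: sqlite3.Connection) -> None:
        conn.execute(DAILY_SALES_SQL)
        conn.commit()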

Raw data availability: ELT pipelines load data in its raw form and make it available for analysis with very low latency, removing any impact from heavy data transformations.

Regardless of whether we use cloud data integration platforms, like Azure Data Factory, or on-premises data integration tools that require dedicated hardware or virtual machines, when optimizing for lower latency and faster data processing it makes sense to separate the Extract and Load stages from the Data Transformation stage, allowing for different data transformation scenarios.
