Posts

Showing posts from 2024

ETL to ELT journey: Break free your Transformations and discover Happiness and Data Zen

Every data integration pipeline consists of three stages: Data Extraction (E), Data Transformation (T) and Data Loading (L). During the Data Extraction stage, the source data is read from its origins: transactional databases, CRM or ERP systems, or data scraped from web pages. During the Data Transformation stage, the necessary modifications are applied to the source data. This includes data filtering, enrichment or merging with existing or other source datasets, data obfuscation, dataset structure alignment or validation, field renaming and data structuring according to the canonical data warehouse model. During the Data Loading stage, the data is stored in the pipeline destination, which could be a staging area, a data lake or a data warehouse. There are two principal methods for the data integration process, moving data from where it originated to the destination where it will be used for analysis: ETL and ELT. The difference between ETL and ELT pipelines lies in wheth…
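
To make the ordering concrete, here is a minimal sketch (not from the post itself) of the same three stages arranged as ETL versus ELT; the function names, tables and SQL are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
# Illustrative sketch only: the same E, T and L stages, ordered as ETL vs. ELT.
import sqlite3


def extract():
    """Read source rows, e.g. from an operational system (hardcoded here)."""
    return [("alice", 120), ("bob", 80), ("carol", 200)]


def transform(rows):
    """Apply transformations in the pipeline itself (the "T" before the "L")."""
    return [(name.upper(), amount) for name, amount in rows if amount >= 100]


def load(conn, table, rows):
    """Persist rows into the destination (SQLite stands in for a warehouse)."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (name TEXT, amount INTEGER)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")

# ETL: transform first, load only the curated result.
load(conn, "orders_curated", transform(extract()))

# ELT: load the raw data as-is, then transform inside the destination (SQL pushdown).
load(conn, "orders_raw", extract())
conn.execute(
    "CREATE TABLE orders_elt AS "
    "SELECT upper(name) AS name, amount FROM orders_raw WHERE amount >= 100"
)
```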

Beyond Schedules and Speakers: Data TLV in a nutshell

Many of us have hobbies. Some hobbies are quite common, such as travelling, painting, or playing computer games. I, however, have a rather unique hobby: organizing conferences. The more complicated the logistics and the more people who sign up, the more enjoyable it becomes. Yet this hobby is quite time-consuming, taking time that could be dedicated to family or sleep. As the event day approaches, the tension mounts. There are too many details to manage, too many things to take care of. The feeling of being overwhelmed and terrified at the same time creeps in, as mishaps can occur at any moment. This is especially true in our small, brave country, where the sound of rocket alarms can disrupt seemingly peaceful moments, with potentially dire consequences. But eventually, the day arrives, and the energy is overwhelming. Rooms are filled with eager delegates ready to learn. There are excited speakers, delighted sponsors, and an abundance of delicious food, beer, and networking opportunities. I…

Having fun isn't hard when you have a modern data catalog

Data Catalog and Data Fabric are enablers of any data architecture. Whether your architecture is centralized or decentralized, a Data Catalog will enable effective management and help you interact with the data. Taking a closer look, we see that the Data Catalog is one of the main technology pillars of Data Fabric, which has a much wider scope, including data semantic enrichment, data preparation, data recommendation engines and various data orchestrators. Data Fabric, empowered by a Data Catalog, is an abstraction layer that helps applications connect to data, regardless of database technology and data server location, using built-in APIs. However, a traditional, manually managed data catalog does not qualify as a Data Fabric unit. A modern Data Catalog is actively driven by metadata and scans data sources regularly, with no need for manual maintenance. Modern Data Catalogs usually have built-in, fully automated end-to-end data lineage and enforc…
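
As a rough illustration (not from the post), the sketch below models a metadata-driven catalog entry whose lineage is captured as upstream links and populated by an automated scan rather than by hand; every class, function and location name here is hypothetical.

```python
# Illustrative sketch only: catalog entries driven by harvested metadata, with lineage links.
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str                    # logical dataset name
    location: str                # physical location, abstracted away from consumers
    columns: dict[str, str]      # column name -> type, harvested from the source
    upstream: list[str] = field(default_factory=list)  # lineage: datasets this one is built from


def scan_source(name, location, schema, upstream=None):
    """Stand-in for an automated scanner that harvests metadata on a schedule."""
    return CatalogEntry(name=name, location=location, columns=schema, upstream=list(upstream or []))


catalog = {
    entry.name: entry
    for entry in [
        scan_source("orders_raw", "s3://lake/orders/", {"id": "int", "amount": "int"}),
        scan_source("orders_curated", "warehouse.orders", {"id": "int", "amount": "int"},
                    upstream=["orders_raw"]),
    ]
}

# End-to-end lineage falls out of the metadata: just walk the "upstream" links.
print(catalog["orders_curated"].upstream)  # ['orders_raw']
```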

Coding is a rollercoaster of efficiency and eyebrow-raising discoveries.

Data Engineers or Developers - many of us love to be gourmet chefs in the kitchen. Yet when it comes to planning and design, we would rather throw all the ingredients into the pot and see what comes out. Coding without a plan is like assembling a puzzle in a dark room: the result will most probably be unexpected and off the canvas. Whether you follow a Waterfall or an Agile development strategy, the planning and design phases are non-negotiable and essential for reducing development cycles and rework. Once upon a time, a data engineer created an amazing piece-of-cake automation pipeline. This masterpiece had very complex logic, pulling data from multiple sources, and merging and persisting the data in a complex, incremental way. When the pipeline started to run successfully and the automation flows worked, the data engineer got very excited and considered the development done. A few days later, a QA engineer found out that the resulting dataset was never created in the destination. Why did that happen?