Posts

MySQL default configurations in GCP cloud and in Azure cloud

We recently had a late-night memory troubleshooting session on MySQL in the GCP cloud. I am sharing my MySQL learning outcomes and a review of the GCP default Cloud MySQL configuration related to performance and memory management. I will go over the main MySQL performance configuration parameters and what the GCP defaults are. I have also looked up the Azure defaults to see if there are any differences. GCP Cloud MySQL configurations seem to favour write workloads. On Azure Database for MySQL Flexible Server, some parameters are not present, for instance, disabling unique_checks and foreign_key_checks.
innodb_buffer_pool_size
What it is & best practices: the memory area where InnoDB caches table and index data. Best practice: ~80% of instance memory (it can be smaller if you only use a small fraction of your data).
GCP default: 70% of total instance memory
Azure default: 25% of total instance memory
innodb_log_file_size
What it is & best practices: Redo log,...
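A quick way to check what a given instance is actually running with is to query the server variables directly. A minimal sketch, using standard MySQL system variables (the byte-to-GB arithmetic is only there for readability):

-- Current InnoDB buffer pool and redo log file sizes, converted from bytes to GB
SELECT
    @@innodb_buffer_pool_size / POW(1024, 3) AS buffer_pool_gb,
    @@innodb_log_file_size    / POW(1024, 3) AS log_file_gb;

-- Or list all related settings at once
SHOW VARIABLES LIKE 'innodb_buffer_pool%';
SHOW VARIABLES LIKE 'innodb_log_file%';

Comparing these numbers against the instance's total memory is how the 70% vs. 25% defaults above show up in practice.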

Snowflake integration with Microsoft Azure OpenAI Service

Did you know that Snowflake is working on integrating the Microsoft Azure OpenAI Service? Companies that use Snowflake as their Data Warehousing solution can now use Azure OpenAI through Snowflake Cortex AI, Snowflake's managed AI services. This is available not only to Snowflake clients using Snowflake on Azure: clients from any cloud and any region can now build AI-powered apps or data agents. OpenAI models will run within the security boundaries of the Snowflake data cloud, providing unified governance, access controls and monitoring. Even more interesting: there will be a collaboration in the opposite direction as well. Cortex AI agents will be available from within Microsoft Teams Copilot and Microsoft 365 Copilot, so users can interact with their data stored in Snowflake using natural language. This integration will become generally available in June 2025.
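If the integration surfaces OpenAI models through Snowflake's existing Cortex SQL functions, calling one could look roughly like the sketch below. SNOWFLAKE.CORTEX.COMPLETE is an existing Cortex function; the model identifier is a hypothetical placeholder until the integration is generally available.

-- Hedged sketch: an LLM completion invoked from Snowflake SQL via Cortex AI.
-- 'openai-model' is a placeholder, not a confirmed model identifier.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'openai-model',
    'Summarize last quarter''s sales performance in two sentences'
) AS answer;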

Understanding the Pillars of Data Quality

Imagine you are baking a cake. You have all the ingredients except eggs. Of course, you could improvise, but most probably, instead of a moist chocolate cake, you will end up with a dry science experiment. Incomplete data leaves everyone unsatisfied. During the decision-making party, every data quality dimension is an important guest with a unique vibe: Good Data should be Complete, when all data attributes that describe the data in its fullness are present as a part of your data. This guest keeps the party snacks stocked and makes sure no one gets hungry. Incomplete data leads to half-baked insights. Good Data should be Accurate. This means that the data correctly describes its objects and accurately reflects reality. This party guest is a perfectionist, checking that the playlist is perfectly chosen. Good Data should be Timely. This means that the data is fresh. No one wants to eat last week's sushi. It's not only unappetizing; it is downright risky. This guest makes sur...
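To make the Completeness guest a little more concrete: a simple profiling query shows how "full" a dataset actually is. A minimal sketch against a hypothetical customers table with email and birth_date attributes:

-- Completeness check: how many rows are missing critical attributes?
SELECT
    COUNT(*)                                            AS total_rows,
    SUM(CASE WHEN email      IS NULL THEN 1 ELSE 0 END) AS missing_email,
    SUM(CASE WHEN birth_date IS NULL THEN 1 ELSE 0 END) AS missing_birth_date,
    ROUND(100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*), 2) AS pct_missing_email
FROM customers;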

The backbone your data pipelines have been waiting for.

Kafka isn’t just a buzzword—it’s the backbone your data pipelines have been waiting for. Who knew that a messaging system like Apache Kafka would hold a central place in a Data Engineer's toolbelt. Apache Kafka is a low-latency distributed data streaming platform for real-time data processing. Kafka can handle large volumes of data and is very helpful for distributed data integration projects. Top 2 reasons why you might need Kafka in your Data Integration architecture:
1. Support multiple destinations by decoupling data producers and data consumers. Data in the source will be processed only once, which lowers the overall cost in consumption-based data producer databases, and we can add new or change existing destinations without changing the extraction components.
2. Ability to deal with massive amounts of data; Kafka supports high throughput and scalability. Decoupling pipeline extract and load stages is an important Data Integration principle and can improve pipeline flexibility, extract and loa...

Beyond clicks and sales

My son told me yesterday that data analysis is boring. "Total money wired from one account type to another. The number of umbrellas sold over time to various age groups. Website clicks from different countries, broken down by time of day. Nothing exciting," he said. So I asked him: "What about analyzing the contents of garbage cans? Wouldn't that be fascinating? Or fridge contents—imagine predicting a family’s potential health issues based on what they eat (or forget to throw out). Dream logs could be just as intriguing. Imagine categorizing dreams by themes and emotions, analyzing trends based on sleep habits and cultural influences. How about haunted places and paranormal activity reports? Tracking ghost sightings by location, identifying patterns, and finding unexpected geographical correlations—now that’s a dataset with mystery! And then there’s one of my favourites: analyzing excuses for being late to work. Categorizing them by season, industry, or geography cou...

Unlocking Microsoft Fabric: A Simple Guide when you only have a personal account.

" You cannot create a Microsoft capacity using a personal account. Use your organizational account instead." What a disappointment and frustration! I've been trying to set up Microsoft Fabric for a while now and figure out how to work around this error. I use my personal Gmail for an MVP subscription to learn and experiment with Azure services. When I visit app.fabric.microsoft.com and try to use my private Gmail address, I keep getting a similar message: It's frustrating because I'm eager to dive into Fabric and explore its capabilities. I would get the same error trying to set up Azure Data Catalog, PowerBI embedded, and more. What are the different account types and how do they differ? A work or school account is created through Active Directory or other cloud directory services, such as Microsoft 365. On the other hand, a personal account is one that's manually created for individual use, consisting simply of a username and password. After digging into ...

ETL to ELT journey: Break free your Transformations and discover Happiness and Data Zen

Every data integration pipeline consists of 3 stages: Data Extraction (E), Data Transformation (T) and Data Load (L).
During the Data Extraction stage, the source data is read from its origins: transactional databases, CRM or ERP systems, or through data scraping from web pages.
During the Data Transformation stage, the necessary modifications are applied to the source data. This includes data filtering, enrichment or merging with existing or other source datasets, data obfuscation, dataset structure alignment or validation, field renaming and data structuring according to the canonical data warehouse model.
During the Data Loading stage, the data is stored in the pipeline destination, which could be a staging area, a data lake or a data warehouse.
There are two principal methods for moving data from where it originated to the destination where it will be used for analysis: ETL and ELT. The difference between ETL and ELT pipelines...
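To make the ELT flavour concrete: when the transformation step runs inside the destination warehouse, it is often just SQL over an already-loaded staging table. A minimal sketch, with hypothetical table and column names:

-- ELT-style transformation executed inside the warehouse,
-- after raw data has already been loaded into staging_orders
CREATE TABLE dw_orders AS
SELECT
    order_id,
    customer_id,
    CAST(order_ts AS DATE)    AS order_date,     -- structure alignment
    UPPER(TRIM(country_code)) AS country,        -- light cleansing and renaming
    total_amount
FROM staging_orders
WHERE total_amount IS NOT NULL;                  -- data filtering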

Beyond Schedules and Speakers: Data TLV in a nutshell

Many of us have hobbies. Some hobbies are quite common, such as travelling, painting, or playing computer games. I, however, have a rather unusual hobby: organizing conferences. The more complicated the logistics and the more people who sign up, the more enjoyable it becomes. Yet this hobby consumes a lot of time, time that could be dedicated to family or sleep. As the event day approaches, the tension mounts. There are too many details to manage, too many things to take care of. The feeling of being overwhelmed and terrified at the same time creeps in, as mishaps can occur at any moment. This is especially true in our small, brave country, where the sound of rocket alarms can disrupt seemingly peaceful moments, with potentially dire consequences. But eventually, the day arrives, and the energy is overwhelming. Rooms are filled with eager delegates ready to learn. There are excited speakers, delighted sponsors, and an abundance of delicious food, beer, and networking opportunities. I ...

Having fun isn't hard when you have a modern data catalog

Data Catalogs and Data Fabric are enablers of any data architecture. Whether you use a centralized or a decentralized architecture, a Data Catalog will enable effective management and help you interact with the data. Taking a closer look, we see that the Data Catalog is one of the main technology pillars of Data Fabric, which has a much wider scope, including data semantic enrichment, data preparation, as well as data recommendation engines and various data orchestrators. Data Fabric, empowered by a Data Catalog, is an abstraction layer that helps applications connect to data through built-in APIs, regardless of database technology and data server location. However, a traditional, manually managed data catalog does not qualify as a Data Fabric unit. A modern Data Catalog is actively driven by metadata and scans data sources regularly, with no need for manual maintenance. Modern Data Catalogs usually have built-in, fully automated end-to-end data lineage and enforc...

Coding is a rollercoaster of efficiency and eyebrow-raising discoveries.

Data Engineers or Developers - many of us love to be gourmet chefs in the kitchen, yet when it comes to planning and design, we would rather throw all the ingredients in the pot and see how it comes out. Coding without a plan is like assembling a puzzle in a dark room. The result will most probably be unexpected and off the canvas. Whether you follow a Waterfall or Agile development strategy, planning and design phases are non-negotiable and are essential to reduce development cycles and rework. Once upon a time, a data engineer created an amazing piece-of-cake automation pipeline. This masterpiece had very complex logic, pulling data from multiple sources, and merging and persisting the data in a complex, incremental way. When the pipeline started to run successfully and the automation flows worked, the data engineer got very excited and considered the development done. A few days later, a QA engineer found out that the resulting dataset was never created in the destination. Why has that ...

The Greatest Reasons to use or not to use a Decentralized Data Management Architecture

Imagine having a dancing party for your data. Everyone waltzing in harmony, stepping on their partner's toes from time to time. Distributed data management is no less amusing. It's chaotic, occasionally hair-raising, but with the right approach it can even be perfect. A huge central data warehouse is a struggle to scale efficiently and hard to innovate on. There is no clear ownership of the data domains, and it is a single point of failure. During peak usage times, data access and processing can be slow. Even implementing updates or upgrades can be quite complex and time-consuming. Centralized databases are attractive targets for cyberattacks, and successful breaches can compromise a large amount of sensitive data. As an alternative to a centralized Data Warehouse, data can be owned and managed by the domains producing it. When considering a decentralized approach, we need to make sure there is a self-serve data infrastructure platform that allows different domains or teams to...

SQL Awesomeness: Finding a QUALIFY query clause in the depths of the database ocean

After so many years of writing SQL queries, I cannot believe I never came across the QUALIFY query clause. It feels like discovering a treasure in the depths of the database ocean. Even ChatGPT was not aware of it! The QUALIFY query clause was invented by Teradata and is not part of the SQL Standard, but apparently multiple database vendors support it, for instance, Snowflake and Databricks. The QUALIFY clause solves the challenge of applying a window function as a query filter. The SQL Standard and most database vendors will not allow running a window function as part of a WHERE clause, because window functions are evaluated after the HAVING clause. We typically create a CTE or subquery that adds a window function, so we can filter on it at a later query stage. Here is an example of how you can use QUALIFY to piece-of-cake filter on a window function. I will use the Snowflake database as an example. Let us consider that we have a list of queries and we need to find the last quer...
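A minimal sketch of that pattern, assuming a hypothetical query_history table with user_name, query_text and start_time columns: find the most recent query per user, with and without QUALIFY.

-- With QUALIFY: filter on the window function directly
SELECT user_name,
       query_text,
       start_time
FROM   query_history
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_name ORDER BY start_time DESC) = 1;

-- Without QUALIFY: the same result needs a subquery (or CTE) first
SELECT user_name, query_text, start_time
FROM (
    SELECT user_name,
           query_text,
           start_time,
           ROW_NUMBER() OVER (PARTITION BY user_name ORDER BY start_time DESC) AS rn
    FROM   query_history
) AS ranked
WHERE rn = 1;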

Data Operations Demystified: mastering 7 core principles

Last week I attended the Enterprise Data World summit. This is one of my favourite summits, as they always have a broad choice of sessions at every level for data professionals. Unfortunately, due to work arrangements, I was only present on the second day of the summit. The first session's original title was “Accelerating Data Management with DataOps”, but the speaker didn’t show up. The summit organizers asked another speaker, Doug Needham from DataOps.live, to fill in, and he put together a replacement session just a few minutes before it was supposed to start. That takes a lot of knowledge of the topic and the courage to stand up in front of a disappointed audience that was expecting someone else. Doug is a senior data solutions architect and was doing DataOps long before the term was invented. This amazing hour was full of exciting aspects of data operations, examples from Doug’s own rich experience and an overview of the DataOps product his company is working on. I have very m...

The Greatest Reasons to use or not to use a Centralized Data Access Architecture

When developing a modern Data Platform Layer, one of the main decisions is whether to opt for a centralized or decentralized data access architecture. There is no “one-size-fits-all” solution; both have advantages and disadvantages. A Centralized Data Access Architecture would usually mean duplicating data from the operational layer into the analytical layer and applying various transformations to the data to support and speed up data analytics. Such an architecture typically spans three parts:
Operational online transaction processing Layer, where all microservices and their operational databases are located.
Analytical Data Layer, where we would have data lakes that support data scientists' work and a data warehouse that supports Business Intelligence.
Transformations, ETL or ELT data pipelines, which move data from the operational layer into the analytical layer.
If we opt for a Centralized Data Access architecture, what would be the benefits and the drawbacks? CONSISTENCY: consolidating data into a central location can...