Posts

Showing posts from 2025

MySQL default configurations in the GCP cloud and in the Azure cloud

We recently had a late-night memory troubleshooting session on MySQL in the GCP cloud. I am sharing my MySQL learning outcomes and a review of the default GCP Cloud SQL for MySQL configuration related to performance and memory management. I will go over the main MySQL performance configuration parameters and what the GCP defaults are. I have also looked up the Azure defaults to see if there are any differences. GCP Cloud SQL for MySQL configurations seem to favour write workloads. On Azure Database for MySQL Flexible Server, some parameters are not present, for instance, disabling unique_checks and foreign_key_checks.

innodb_buffer_pool_size
What it is & best practices: the memory area where InnoDB caches table and index data. Best practice: ~80% of instance memory (can be smaller if you only use a small fraction of your data).
GCP default: 70% of total instance memory
Azure default: 25% of total instance memory

innodb_log_file_size
What it is & best practices: the redo log,...
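If you want to check what your own instance is actually running with, here is a minimal sketch assuming mysql-connector-python; the host, user and password are placeholders, not values from the post:

```python
# Minimal sketch: reading the effective InnoDB memory settings from a running
# MySQL instance with mysql-connector-python. Connection details are placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="10.0.0.5",      # hypothetical Cloud SQL / Flexible Server private IP
    user="app_user",      # hypothetical user
    password="***",
)

cursor = conn.cursor()
cursor.execute(
    "SHOW GLOBAL VARIABLES WHERE Variable_name IN "
    "('innodb_buffer_pool_size', 'innodb_log_file_size')"
)
for name, value in cursor:
    # Values are reported in bytes; convert to GiB for readability.
    print(f"{name}: {int(value) / 1024**3:.2f} GiB")

cursor.close()
conn.close()
```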

Snowflake integration with Microsoft Azure Open AI service

Did you know that Snowflake is working on integrating the Microsoft Azure OpenAI Service? Companies that use Snowflake as their Data Warehousing solution can now use Azure OpenAI through Snowflake Cortex AI, Snowflake's managed AI service. This is not limited to Snowflake clients running Snowflake on Azure: clients in any cloud and any region can now build AI-powered apps or data agents. OpenAI models will run within the security boundaries of the Snowflake data cloud, providing unified governance, access controls and monitoring. Even more interesting: there will be the opposite collaboration as well. Cortex AI agents will be available from within Microsoft Teams Copilot and Microsoft 365 Copilot, so users can interact with their data stored in Snowflake using natural language. This integration will become generally available in June 2025.
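To give a feel for what this looks like from the Snowflake side, here is a minimal sketch using snowflake-connector-python and the existing SNOWFLAKE.CORTEX.COMPLETE function; the account details are placeholders, and the model name is only an assumption about which Azure OpenAI models Cortex will expose once the integration is generally available:

```python
# Minimal sketch: calling a Cortex AI LLM function from Python via the
# Snowflake connector. Account, credentials, warehouse and the model name
# are placeholders / assumptions, not details from the announcement.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="analyst",            # hypothetical user
    password="***",
    warehouse="ANALYTICS_WH",  # hypothetical warehouse
)

cursor = conn.cursor()
cursor.execute(
    "SELECT SNOWFLAKE.CORTEX.COMPLETE(%s, %s)",
    (
        "gpt-4o",  # hypothetical Azure OpenAI model name exposed via Cortex
        "Summarise last quarter's sales trends in two sentences.",
    ),
)
print(cursor.fetchone()[0])

cursor.close()
conn.close()
```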

Understanding the Pillars of Data Quality

Imagine you are baking a cake. You have all the ingredients except eggs. Of course, you could improvise, but most probably, instead of a moist chocolate cake, you will end up with a dry science experiment. Incomplete data leaves everyone unsatisfied. During the decision-making party, every data quality dimension is an important guest with a unique vibe:

Good Data should be Complete, when all data attributes that describe the data in its fullness are present as part of your data. This guest keeps the party snack stock full and makes sure no one gets hungry. Incomplete data leads to half-baked insights.

Good Data should be Accurate. This means that the data correctly describes its objects and accurately reflects reality. This party guest is a perfectionist, checking that the playlist is perfectly chosen.

Good Data should be Timely. This means that the data is fresh. No one wants to eat last week's sushi. It's not only unappetizing; it is downright risky. This guest makes sur...
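Purely as an illustration of the pillars above (this is not from the post), here is a small pandas sketch with made-up column names that expresses completeness, accuracy and timeliness as simple checks:

```python
# Illustrative sketch only: completeness, accuracy and timeliness as simple
# checks on a pandas DataFrame. Column names and thresholds are hypothetical.
import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "amount": [19.99, -5.00, 42.50],  # a negative amount is suspicious
        "customer_email": ["a@x.com", None, "c@x.com"],
        "loaded_at": pd.to_datetime(["2025-06-01", "2025-06-01", "2025-05-01"]),
    }
)

# Completeness: every attribute that should be present actually is.
completeness = 1 - orders["customer_email"].isna().mean()

# Accuracy: values correctly describe reality (here: amounts must be positive).
accuracy = (orders["amount"] > 0).mean()

# Timeliness: data is fresh enough for the decision at hand (here: < 7 days old).
freshness_days = (pd.Timestamp("2025-06-02") - orders["loaded_at"]).dt.days
timeliness = (freshness_days < 7).mean()

print(f"completeness={completeness:.0%}, accuracy={accuracy:.0%}, timeliness={timeliness:.0%}")
```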

The backbone your data pipelines have been waiting for.

Kafka isn’t just a buzzword; it’s the backbone your data pipelines have been waiting for. Who knew that a messaging system like Apache Kafka should hold a central place in a Data Engineer's toolbelt. Apache Kafka is a low-latency distributed data streaming platform for real-time data processing. Kafka can handle large volumes of data and is very helpful for distributed data integration projects. Top 2 reasons why you might need Kafka in your Data Integration architecture:

1. Support multiple destinations by decoupling data producers and data consumers. Data in the source will be processed only once, which lowers the overall cost in consumption-based data producer databases, and we can add new or change existing destinations without changing the extraction components.

2. Ability to deal with massive amounts of data; it supports high throughput and scalability.

Decoupling pipeline extract and load stages is an important Data Integration principle and can improve pipeline flexibility, extract and loa...
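To make the decoupling idea concrete, here is a rough sketch using the kafka-python package; the broker address, topic name and consumer group are assumptions, not details from the post:

```python
# Minimal sketch of producer/consumer decoupling, assuming a broker at
# localhost:9092 and a hypothetical topic name; uses the kafka-python package.
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "orders_raw"  # hypothetical topic

# Producer side: the extraction component writes each source record once.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"order_id": 1, "amount": 19.99})
producer.flush()

# Consumer side: each destination reads independently through its own consumer
# group, so new destinations can be added without touching the producer.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    group_id="warehouse_loader",  # a second destination would use another group_id
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,     # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)
```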