Posts

Showing posts from 2022

What is Data Governance?

"What is Data Governance?", a curious kid asks, peeking above my shoulder into the laptop screen. He is 14 and frequently asks questions with no interest in knowing the answer. Just like many other people around me. "That's a great question", - the first thing you would say when you have no good strategy for how to approach the question. The second part is to think aloud. After a few minutes of gathering my thoughts: "consider the term "data" as a  synonym of "useful information". We use the information to support decision-making and choosing strategy.   Regardless of whether we are talking about a household or business, having a proper strategy ensures efficient business management and somewhat helps to forecast the future. Data Governance is a system that controls every aspect of the data lifecycle - the series of stages the data goes through, from being captured, stored and used, to data asset destruction. This system helps to ensure

Cloud Scalability vs Cloud Elasticity


Modern Modular Data Pipelines Example

I would like to share my favourite example of a modern data pipeline. It's amazing. The first cool thing we see is that this pipeline utilizes a full range of cloud services built for diverse use cases. Choosing the right tool for each use case can be one of the key success factors for your idea, letting you get things running as fast as possible without reinventing the wheel. Another cool thing: if we want to pull data from a non-trivial data source, like Twitter, Jira, or GitHub, Azure Databricks is our first friend. However, the most noticeable advantage of this pipeline is that, instead of a monolithic data flow, it is actually multiple pipelines running in parallel: short, simple, and independent. Independent pipelines can run in parallel and on different frequencies, and one pipeline failure does not impact the others. This is an easy way to scale each pipeline separately and speed up only…
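To make the idea of independent parallel pipelines concrete, here is a minimal Python sketch (the source names and ingest functions are hypothetical, not the pipeline from the diagram) in which three short pipelines run in parallel and a failure in one is contained without affecting the others:

```python
import concurrent.futures
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

# Hypothetical pipeline bodies; each one is short, simple, and self-contained.
def ingest_twitter():
    raise RuntimeError("simulated source failure")

def ingest_jira():
    logging.info("Jira issues ingested")

def ingest_github():
    logging.info("GitHub events ingested")

pipelines = {"twitter": ingest_twitter, "jira": ingest_jira, "github": ingest_github}

# Run every pipeline in parallel; each failure is caught and logged per
# pipeline, so the remaining pipelines still finish independently.
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {pool.submit(fn): name for name, fn in pipelines.items()}
    for future in concurrent.futures.as_completed(futures):
        name = futures[future]
        try:
            future.result()
            logging.info("pipeline %s succeeded", name)
        except Exception as exc:
            logging.error("pipeline %s failed: %s", name, exc)
```

In a real setup each of these would be its own scheduled pipeline with its own trigger frequency, but the isolation principle is the same.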

When should you use Azure Databricks?

Once upon a time, SQL Server was our central tool for data management, for both OLTP (online transaction processing) and OLAP (online analytical processing) database systems. We used SQL Server Agent jobs to pull data from FTP or any other source. We used SQL Server stored procedures to load the data into the staging database. We used stored procedures to enrich and aggregate the data. And we used SQL Server as the data serving layer. These days we need to consider utilizing various cloud services. Attempts to lift and shift existing systems into the cloud often end up quite expensive if we keep SQL Server in charge of all the data pipeline stages. There are multiple great services in the Azure cloud, and Microsoft tends to build each product with features that let it cover multiple pipeline stages. This does not mean we should go back to a monolithic architecture; let's find out where each service fits. Data…

Over time you learn more things and do better (DATA TLV puzzles and excitement)

I took the first sip of my morning coffee as the last summit session started, somewhere around 15:00. I sank heavily into a sofa and realized this was the first time I had had a chance to sit down all day. Another organizer was sitting nearby with the exact "stick a fork in me, I am done" look, staring at nothing. "Oh, I am so tired," said my 18-year-old niece Sonya, anchored on the other side of the sofa. She had helped us with registrations and timekeeping in the classes and had absorbed all the speakers' disappointment with the outdated presentation equipment. "Most speakers finish before their time," she said with an exhausted look. "Very smart and intelligent people. I didn't understand anything they talked about." I wish we knew how to make sure the summit meets everyone's expectations. Every sponsor wants to deliver a session, and more sponsored sessions mean fewer slots for community speakers. The toughest part of…

Inverted Index for full-text searches or common words detection

Sometimes a document has properties containing unstructured text, like newspaper articles, blog posts, or book abstracts. An inverted index is easy to build and is similar to the data structures search engines use. It can support various complex search patterns, like common-word detection, full-text searches, or document similarity searches using Hamming distance or L2 distance algorithms. Inverted indexes are useful when the number of keywords is not too large and when the existing data is either totally immutable or rarely changed, but frequently searched. Usually, the documents are "parents," and the words inside the document are "children." To build an inverted index, we invert this relation to make the words "parents" and the documents "children": take all or a subset of keywords from each document and pair them with the document ID:

DocId1: keyword1
DocId1: keyword2
DocId1: keyword3
DocId2: keyword
DocId2: keyword1
Re…
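As a quick illustration of inverting that relation, here is a minimal Python sketch (the document IDs and contents are made up) that builds a keyword-to-documents map and uses it for a simple full-text lookup:

```python
from collections import defaultdict

# Hypothetical documents: DocId -> unstructured text.
documents = {
    "DocId1": "cloud data pipelines in azure",
    "DocId2": "data governance in the cloud",
}

# Build the inverted index: each word ("parent") maps to the set of
# document IDs ("children") that contain it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in set(text.split()):
        inverted_index[word].add(doc_id)

# Full-text lookup: documents containing all of the query words.
def search(*words):
    hits = [inverted_index.get(w, set()) for w in words]
    return set.intersection(*hits) if hits else set()

print(search("cloud", "data"))  # {'DocId1', 'DocId2'}
print(search("governance"))     # {'DocId2'}
```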

Data Orchestration, Ingestion, and Data preparation in Azure cloud – which tool should you choose?

There is a lot of tooling for data enrichment and data orchestration in the Azure cloud, and many services have similar features. Azure Data Factory, Azure Databricks, Azure Synapse Pipelines, and SSIS can all move data from one data store to another, clean the data, enrich it by merging entities, and perform data aggregations. They can transform data from one format into another and partition it according to the business logic. Why would we choose one service over another? Read my new blog post to find out: https://www.mssqltips.com/sqlservertip/7380/azure-cloud-data-processing-strategy-and-tools-selection/
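To picture the kind of work all four services can express, here is a minimal PySpark sketch (the paths and column names are hypothetical, not from the tip) that cleans a dataset, enriches it by merging two entities, aggregates it, and writes the result partitioned by a business key:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enrichment-sketch").getOrCreate()

# Hypothetical inputs; any of the services above could read these stores.
orders = spark.read.parquet("/lake/raw/orders")        # order_id, customer_id, amount, order_date
customers = spark.read.parquet("/lake/raw/customers")  # customer_id, country

clean = orders.dropDuplicates(["order_id"]).filter(F.col("amount") > 0)  # clean
enriched = clean.join(customers, on="customer_id", how="left")           # enrich by merging entities
daily = (enriched
         .groupBy("country", "order_date")
         .agg(F.sum("amount").alias("daily_revenue")))                   # aggregate

# Transform the format (Parquet) and partition per the business logic.
daily.write.mode("overwrite").partitionBy("country").parquet("/lake/curated/daily_revenue")
```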

If you want to torture data - store it in CSV format

Are you using CSV files as the primary file format for your data? CSV is a very useful format if you want to open a file in Excel and analyze it right away. The CSV format stores tabular data in plain text; it is old and was widely used in the early days of business computing. However, if you plan to keep raw data in a data lake, you should reconsider using CSV. There are many modern file formats designed for data analysis, and in the cloud world of data lakes and schema-on-read querying systems, like AWS Glue or Databricks, CSV files will slow you down. Today I want to talk about Parquet, a modern file format invented for fast analytical querying. Parquet files organize data in columns, while CSV files organize data in rows. Columnar storage allows much better compression, so Parquet files need less storage: 1 TB of CSV files can be converted into roughly 100 GB of Parquet files, which can be a huge money saver when cloud storage is used. This also means that scanning Parquet f…
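If you want to try the comparison yourself, here is a minimal Python sketch (the file names are made up; it assumes pandas and pyarrow are installed) that converts a CSV file to compressed Parquet and prints the resulting sizes:

```python
import os

import pandas as pd

# Hypothetical input file; any tabular CSV will do.
df = pd.read_csv("events.csv")

# Write the same data as columnar, compressed Parquet (pyarrow engine).
df.to_parquet("events.parquet", engine="pyarrow", compression="snappy")

csv_size = os.path.getsize("events.csv")
parquet_size = os.path.getsize("events.parquet")
print(f"CSV: {csv_size:,} bytes, Parquet: {parquet_size:,} bytes "
      f"({parquet_size / csv_size:.0%} of the original)")
```

The exact savings depend on the data: repetitive, low-cardinality columns compress far better than random values.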