Posts

Showing posts from September, 2022

Over time you learn more things and do better (DATA TLV puzzles and excitement)

I took the first sip of my morning coffee when the last summit session started, somewhere around 15:00. I sank heavily onto a sofa and realized that this was the first time I had had a chance to sit down all day. Another organizer was sitting nearby with the exact "stick a fork in me, I am done" look, staring into nowhere. "Oh, I am so tired" - that was my 18-year-old niece Sonya, anchored on the other side of the sofa. She had helped us with registrations and timekeeping in classes and had absorbed all the speakers' disappointment with the outdated presentation equipment. "Most speakers finish before their time", she said with an exhausted look. "Very smart and intelligent people. I didn't understand anything they talked about." I wish we knew how to make sure the summit meets everyone's expectations. Every sponsor wants to deliver a session. More sponsored sessions mean fewer slots for community speakers' sessions. The toughest part of ...

Inverted Index for full-text searches or common words detection

Sometimes a document has properties that hold unstructured text, like newspaper articles, blog posts, or book abstracts. The inverted index is easy to build and is similar to the data structures search engines use. Such structures help with various complex search patterns, like common word detection, full-text searches, or document similarity searches using Hamming distance or L2 distance algorithms. Inverted indexes are useful when the number of keywords is not too large and when the existing data is either totally immutable or rarely changed but frequently searched. Usually, the documents are "parents," and the words inside the document are "children." To build an inverted index, we invert this relation to make the words "parents" and the documents "children": Take all or a subset of keywords from the document and pair each with the document ID:
DocId1: keyword1
DocId1: keyword2
DocId1: keyword3
DocId2: keyword
DocId2: keyword1
Re...
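The inversion described above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not the approach the full post goes on to describe: the function names (`build_inverted_index`, `search_all`, `common_words`), the sample documents, and the simple lowercase word tokenizer are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_inverted_index(documents):
    # documents: dict mapping a document ID to its unstructured text.
    # Result: each word ("parent") maps to the set of document IDs ("children").
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_all(index, query):
    # Full-text search: IDs of documents that contain every word of the query.
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def common_words(index, min_docs):
    # Common word detection: words that appear in at least min_docs documents.
    return sorted(w for w, ids in index.items() if len(ids) >= min_docs)


# Hypothetical sample data, for illustration only.
docs = {
    "DocId1": "Azure data pipelines move data",
    "DocId2": "Inverted index for data search",
}
index = build_inverted_index(docs)
print(search_all(index, "data search"))
print(common_words(index, 2))
```

Storing document IDs in sets keeps lookups and intersections cheap, which suits the case mentioned above where the data is rarely changed but frequently searched.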

Data Orchestration, Ingestion, and Data preparation in Azure cloud – which tool should you choose?

There is a lot of tooling around data enrichment and data orchestration in the Azure cloud and many services with similar features. Azure Data Factory, Azure Databricks, Azure Synapse Pipelines, and SSIS services can move data from one data store to another, clean the data, enrich the data by merging entities, and perform data aggregations. They can transform the data from one format into another and partition it according to the business logic. Why would we choose one service over another? Read my new blog post to find out: https://www.mssqltips.com/sqlservertip/7380/azure-cloud-data-processing-strategy-and-tools-selection/