Posts

How Data Mesh architecture and Data Catalogs help decentralized data teams.

Not too long ago, Data Administrators had to give up their long-standing habit of running a monolithic database. They were forced to accept polyglot persistence: development teams started choosing different data stores and technologies to support each application team's data model requirements. The time has now come to break down the monolithic Data Lake paradigm as well. Refactoring a monolithic Data Lake makes a lot of sense. The central data lake, like the central data team, is often a huge bottleneck. The central data team is usually busy fixing broken data pipelines and keeping up with the constant data changes made by the domain owners and development teams. Data Mesh architecture comes to the rescue here. Instead of a centralised data team, there are multiple decentralised domain data teams, producing data sets or consuming other teams' data sets. A domain data team usually knows its domain data very well and is aware ...

Are you familiar with DATAIKU?

If you want to make DATA a part of EVERYDAY decision-making, then you must try this amazing Data Analysis Platform. Dataiku is a tool for everyone: it has Notebooks and Python for Coders, visual data flows for Clickers, and relationships, statistics and visual data forecasting for Decision Makers. It's technology agnostic: you can install it on a public cloud, use it as a SaaS service or install it on-premises. You can also choose ANY DATA PROCESSING ENGINE to process your workload, such as Azure Synapse, Spark or SQL Server, and analyze the data WITHOUT ANY DATA MOVEMENT, in a "spreadsheet-like" manner. Dataiku has many enterprise-scale features, like built-in flow audit, Data Quality features, easy deployments between Dataiku environments and much more. https://www.dataiku.com/

Everything you need to consider when choosing COSMOSDB API

Azure CosmosDB is a modern distributed data store that can handle any data volume, any data velocity (data arrival speed) and any data variety (different types of data). CosmosDB requires minimal setup and management effort. It is very easy to integrate CosmosDB into your existing data infrastructure using its various APIs, which can mimic your existing data management systems, like MongoDB, PostgreSQL or Cassandra, and provide you with under 10 ms latency from anywhere, 99.999% availability and instant scalability. From the cost perspective, storage costs and utilization costs are almost the same regardless of which API you are planning to use. There is neither an autoscale nor a serverless option for the PostgreSQL API. The serverless NoSQL API, Gremlin API, MongoDB API, Cassandra API and Table API are available only in a single-region write architecture. If you are interested in Multi-region write clu...

SQLBITS session summary: How to start BLOGGING

Attending conferences is a lot of fun. Networking with smart people during the conferences is a lot of fun and learning. Attending conference sessions is a lot of learning and fun for the first 10 minutes, until the mischievous phone flashes and pulls your attention away. Twitter notifications, work emails and kids on WhatsApp behave as if they have agreed to pull your attention from the session. The moment I open the sneaky device, I completely lose focus and attention. I have found a way to stay actively engaged during the session: I write a colourful summary of the things being said. It's a very intensive and exhausting process, but I like the result, which I can share with you, and I succeed in keeping my focus sharp. Here is my summary of Steve Jones's session at SQLBits on how to start blogging.

Doing things right or doing the right things? How to find the row count of every table in a database efficiently.

One Data Engineer had to replicate data from one well-known database vendor to a less-known database vendor. He used select count(*) to validate that the table row counts were equal on the source and target. It worked so slowly that he got fired without ever knowing whether the tables' content was equal or not. Often laziness is a first step towards efficiency. Rather than doing count(*) on each table, the unfortunate DBA could have used the internal statistics that every decent database vendor maintains and stores in system views. Keep in mind that this will never be 100% accurate and will depend on a few things: how often the database objects change; the internal or manual schedule for statistics refresh (for a lot of database vendors, statistics get refreshed automatically only when the changed data is more than 10% of the total table rows, but this is usually configurable per table); and the percentage of rows used to calculate the statistics. The most accu...
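The idea of reading row counts from system views instead of running count(*) could be sketched roughly as follows. The post does not name the vendors involved, so SQL Server and PostgreSQL are assumed here purely for illustration. The two query strings read approximate counts from `sys.dm_db_partition_stats` and `pg_class.reltuples`. The helper `find_mismatches` and its 10% `tolerance` default are hypothetical names and values chosen for this sketch, since approximate counts cannot be expected to match exactly.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: SQL Server as one engine. Reads approximate row counts from metadata.
SQLSERVER_ROW_COUNTS = """
SELECT s.name AS schema_name, t.name AS table_name, SUM(p.row_count) AS approx_rows
FROM sys.dm_db_partition_stats AS p
JOIN sys.tables AS t ON t.object_id = p.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE p.index_id IN (0, 1)  -- heap or clustered index only, avoids double counting
GROUP BY s.name, t.name
"""

# Assumption: PostgreSQL as the other engine. reltuples is the planner's estimate.
POSTGRES_ROW_COUNTS = """
SELECT n.nspname AS schema_name, c.relname AS table_name,
       c.reltuples::bigint AS approx_rows
FROM pg_class AS c
JOIN pg_namespace AS n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'  -- ordinary tables
"""

Mismatch = Tuple[str, Optional[int], Optional[int]]


def find_mismatches(
    source: Dict[str, int],
    target: Dict[str, int],
    tolerance: float = 0.10,  # hypothetical threshold, tune per environment
) -> List[Mismatch]:
    """Return (table, source_rows, target_rows) for tables that are missing on
    one side or whose approximate counts differ by more than `tolerance`."""
    mismatches: List[Mismatch] = []
    for table in sorted(set(source) | set(target)):
        src = source.get(table)
        tgt = target.get(table)
        if src is None or tgt is None:
            # Table exists on one side only.
            mismatches.append((table, src, tgt))
            continue
        baseline = max(src, tgt, 1)  # guard against division by zero
        if abs(src - tgt) / baseline > tolerance:
            mismatches.append((table, src, tgt))
    return mismatches
```

Only the tables flagged by such a comparison would then need an exact count(*) or a content check, which keeps the expensive scans to a minimum.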

What is Data Governance?

"What is Data Governance?", a curious kid asks, peeking over my shoulder at the laptop screen. He is 14 and frequently asks questions with no interest in knowing the answer. Just like many other people around me. "That's a great question" is the first thing you say when you have no good strategy for how to approach the question. The second part is to think aloud. After a few minutes of gathering my thoughts: "Consider the term "data" as a synonym of "useful information". We use information to support decision-making and to choose a strategy. Regardless of whether we are talking about a household or a business, having a proper strategy ensures efficient management and somewhat helps to forecast the future. Data Governance is a system that controls every aspect of the data lifecycle: the series of stages the data goes through, from being captured, stored and used, to data asset destruction. This system helps to ensure ...

Cloud Scalability vs Cloud Elasticity
