Posts

If you want to torture data - store it in CSV format

Are you using CSV files as the primary file format for your data? CSV is a very useful format if you want to open a file in Excel and analyze it right away. CSV stores tabular data in plain text; it is old and was widely used in the early days of business computing. However, if you plan to keep raw data in a data lake, you should reconsider using CSV. There are many modern file formats that were designed for data analysis. In the cloud world of data lakes and schema-on-read querying systems, like AWS Glue or Databricks, CSV files will slow you down. Today I want to talk about Parquet, a modern file format designed for fast analytical querying. Parquet files organize data in columns, while CSV files organize data in rows. Columnar storage allows much better compression, so Parquet data files need less storage: 1 TB of CSV files can be converted into 100 GB of Parquet files, which can be a huge money saver when cloud storage is used. This also means that scanning Parquet f...
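To make the row-versus-column distinction concrete, here is a minimal toy sketch in plain Python. It is not how Parquet is actually implemented (real Parquet files add row groups, encodings and compression on top); it only shows why a query that needs one column touches far fewer bytes when values are stored column by column. The sample data and the helper names (`rows_to_columns`, `bytes_scanned_row_layout`, `bytes_scanned_column_layout`) are invented for illustration.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
CSV_TEXT = """order_id,country,amount
1,US,10.50
2,US,20.00
3,DE,7.25
4,DE,3.10
"""


def rows_to_columns(csv_text):
    """Pivot row-oriented CSV text into a dict of column name -> list of values."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


def bytes_scanned_row_layout(csv_text):
    """In a row layout, extracting one column means reading every byte of every row."""
    return len(csv_text.encode("utf-8"))


def bytes_scanned_column_layout(columns, wanted):
    """In a column layout, only the values of the requested column are read."""
    return sum(len(value.encode("utf-8")) for value in columns[wanted])


columns = rows_to_columns(CSV_TEXT)

# An analytical query such as SUM(amount) needs a single column.
total_amount = sum(float(value) for value in columns["amount"])

print("countries:", columns["country"])
print("total amount:", total_amount)
print("bytes scanned, row layout:", bytes_scanned_row_layout(CSV_TEXT))
print("bytes scanned, column layout:", bytes_scanned_column_layout(columns, "amount"))
```

Notice also that the `country` column holds long runs of identical values once it is stored contiguously. Repetition like that is what makes columnar data compress so well.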

Embrace Delta Lakes and reduce the SQL Server compute resources contention

Data management tools are evolving at great speed, and this can be overwhelming. Data volumes and variety grow as well. Data engineers are required to transform those waterfalls of data into business insights. The data arrives from a vast range of sources, like social media networks, third-party partners or internal microservices. If you are an experienced SQL Server DBA, you know how versatile the product is. It is very tempting, and feels correct, to use the tool that you know best. We can use SQL Server for almost any data management task. We can use SQL Server to watch the storage for new unprocessed files. We can load the raw data into a staging area in a SQL Server database. We can efficiently clean, enrich and aggregate the data using highly expensive relational database resources (even if you are not using SQL Database in the cloud, every Enterprise edition core still costs about $7K). The high cost of the SQL Server relational engine echoes the product complex...