If you want to torture data - store it in CSV format


Are you using CSV files as a primary file format for your data?

CSV is a very useful file format if you want to open the file in Excel and analyze it right away.

CSV format stores tabular data in plain text, it is old and was wildly used in the early days of business computing.

However, if you plan to keep raw data in the data lake, you should reconsider using CSV. There are many modern file formats that were designed for data analysis.

In the cloud world of data lakes and schema-on-read querying systems, like AWS Glue or Databricks, CSV files will slow you down.


Today I want to talk about Parquet, a modern file format, invented for fast analytical querying.


Parquet files organize data in columns, while CSV files organize data in rows.

Columnar storage allows much better compression so Parquet data files need less storage, 1 TB of CSV files can be converted into 100GB of parquet files – which can be a huge money saver when cloud storage is used. This also means that scanning parquet file is much faster than scanning CSV files – fewer data would be scanned and there is no need to load unneeded columns into memory and aggregations will run faster. Parquet files contain both data and metadata, information about data schema and structure. When you load the file, having metadata helps the querying tool define proper data types.


To test both formats' performance, I will use January 2022 New York Yellow taxi dataset available in both CSV and Parquet online:

https://nyc-tlc.s3.amazonaws.com/trip+data/yellow_tripdata_2022-01.parquet

https://s3.amazonaws.com/nyc-tlc/csv_backup/yellow_tripdata_2022-01.csv

Both files have 2463931 rows. However, the CSV file is 239MB, and the Parquet file is only 36MB.


I will be using the Azure Databricks Spark cluster to analyze the dataset. I want to see the total number of passengers that took a yellow taxi between the 20th and 26th of January. I will run the script twice, first using the parquet file and then using the CSV file.



Querying parquet took 5.19 sec

Querying CSV took 2.28 min

As you see, querying CSV files is slower, requires more IO, and will take much more space on your cloud storage.


Yours,

Maria


"If you torture the data long enough, it will confess."

Comments

Popular posts from this blog

Unlocking Microsoft Fabric: A Simple Guide when you only have a personal account.

ETL to ELT journey: Break free your Transformations and discover Happiness and Data Zen

The backbone your data pipelines have been waiting for.