Posts

Data Exploration with Python and SQL Server using Jupyter Notebooks

When it comes to data-driven research, presenting the results is a complicated task. It is convenient to have the initial dataset handy in case anyone asks to re-run the computations or wants to interact with the data. Reproducibility is a tough task across many fields, and there aren't many tools that can help. It's easy to show numbers in Excel or in PowerPoint, but in many use cases the context and the pathway to the results are lost. What are Jupyter Notebooks? Jupyter Notebooks is a great tool that is becoming more and more popular these days. A Jupyter Notebook combines live code execution with textual comments, equations and graphical visualizations. It helps you follow and understand how the researcher reached their conclusions. The audience can play with the dataset either during the presentation or later on. Some people say that Project Jupyter is a revolution in the data exploration world just like the discovery of Jupiter's moons was a revolu...
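The kind of exploratory cell the post describes can be sketched in a few lines. This is a minimal, hypothetical example using Python's built-in sqlite3 module as a stand-in for a SQL Server connection (in a real notebook you would connect with a driver such as pyodbc; the table and data are made up):

```python
import sqlite3

# In a real notebook this would be a pyodbc connection to SQL Server;
# sqlite3 is used here only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 120.0), ("US", 340.0), ("EU", 80.0)])

# An exploratory query whose result a notebook cell would render inline,
# so the audience can re-run it against the same dataset.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)
```

The point of the notebook format is that this code, its output, and the surrounding commentary live in one shareable document.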

Data Quake

Data Quake. That's what it is. Dave Wells has just given this great definition, which clearly describes what has been happening in the data management world in recent years. I greatly enjoyed Dave's session today at the Enterprise Data World summit and couldn't resist writing down a summary. Everything that we did in the last decade is now considered wrong. We used to believe that application logic can run faster and do better if it sits inside the database layer. Now this architecture is considered a wrong choice. The same goes for data normalization or a strong schema. Some people even say that data warehouses are dead. We need to rethink everything. The data schema used to be defined during the design phase. Now we define schema-on-read, after the data has been persisted. Good news - I have always believed this, and Dave has just mentioned it - there is no schema-less data. Despite the fact that we do not get to design the schema anymore, for Big Data we need to un...
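A tiny sketch of what "schema-on-read" means in practice: the raw record is persisted as-is, and a schema is applied only at read time. The record and field names here are hypothetical:

```python
import json

# Persisted with no schema enforced at write time - just a raw JSON string.
raw = '{"user": "maria", "clicks": "42", "extra_field": true}'

def read_with_schema(raw_record):
    """Apply the schema at read time: select the fields we care about
    and coerce their types. Unknown fields are simply ignored."""
    doc = json.loads(raw_record)
    return {"user": str(doc["user"]), "clicks": int(doc["clicks"])}

record = read_with_schema(raw)
print(record)  # {'user': 'maria', 'clicks': 42}
```

Note that the schema has not disappeared; it has just moved from the write path to the read path, which is exactly the "there is no schema-less data" point.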

Serverless ETL: Read, Enrich and Transform Data with AWS Glue Service

More and more companies are aiming to move away from managing their own servers and towards a cloud platform. Going serverless offers a lot of benefits, like lower administrative overhead and server costs. In a serverless architecture, developers work with event-driven functions that are managed by cloud services. Such an architecture is highly scalable and boosts developer productivity. AWS Glue is an ETL service that utilizes a fully managed Apache Spark environment. Glue ETL jobs can clean and enrich your data and load it into common database engines inside the AWS cloud (EC2 instances or the Relational Database Service), or put files into S3 storage in a great variety of formats, including PARQUET. I have recently published 3 blog posts on how to use the AWS Glue service when you want to load data into SQL Server hosted on the AWS cloud platform. 1. Serverless ETL using AWS Glue for RDS databases 2. Join and Import JSON files from s3 to SQL Server RDS instance...

How are MySQL and PostgreSQL different?

Have you ever wondered how MySQL and PostgreSQL differ? Take a look at the mapping that I have made for myself: https://www.mssqltips.com/sqlservertip/5745/compare-sql-server-mysql-and-postgresql-features/ I would be happy to see comments posted on this blog and to add more things to the mapping. Yours, Maria. P.S. The ranking screenshot was taken from the db-engines site: https://db-engines.com/en/ranking/relational+dbms

Discover how SQL Server can use Python to access any NoSQL engine

Many companies these days keep their data assets in multiple data stores. Many companies that I have worked at have used other database systems alongside SQL Server, such as PostgreSQL instances, Redis, Elasticsearch or Couchbase. There are situations when an application that uses SQL Server as its main database needs to access data from another database system. Some datastores have ODBC/JDBC drivers, so you can easily add a linked server. Others do not. Want to learn how to access NoSQL platforms with SQL Server and Python? Read my article here: https://www.mssqltips.com/sqlservertip/5738/discover-how-sql-server-can-use-python-to-access-any-nosql-engine/ Happy NoSQLing, Maria
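The general pattern is to run a Python script inside SQL Server (via `sp_execute_external_script`) that calls the NoSQL engine's API and hands back tabular rows. Here is a hedged, self-contained sketch of the Python side only, with the engine's JSON response inlined as a string so it runs anywhere (the payload shape and field names are hypothetical; a real script would fetch this from the engine's client library or REST endpoint):

```python
import json

# In the real setup this JSON would come from the NoSQL engine;
# it is inlined here so the sketch is runnable on its own.
response = '{"hits": [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]}'

def nosql_to_rows(payload):
    """Flatten a document-style JSON result into (id, name) tuples -
    the tabular shape a SQL Server caller could consume as a result set."""
    return [(hit["id"], hit["name"]) for hit in json.loads(payload)["hits"]]

rows = nosql_to_rows(response)
print(rows)  # [(1, 'alpha'), (2, 'beta')]
```

The article linked above covers the SQL Server side of the wiring; the sketch only illustrates the JSON-to-rows step.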

My #AWSsummit keynote liveblog

The company I work for, Perion, chose Amazon Web Services as the main cloud provider for managing and operating our applications. These days I am learning how to manage databases and data-related flows in the AWS cloud, which is why I attended the AWS summit in Tel Aviv today. Today's AWS summit was the first in the series of AWS summits held in major cities around the world. There were ~5000 attendees today and 25 technical sessions, and it was sponsored by 33 AWS partners. Most sessions that I attended today had a lot of sponsor content mixed with not-too-deep technical dives into various AWS services. Sessions were divided into 5 tracks, led by AWS Solution Architects and sponsor representatives:
- Big Data & Analytics – sessions on data ingestion, data transformation and data analysis using AWS services
- AI and Deep Learning – sessions on AWS offerings for data enrichment and learning models
- Innovate – Rapid Application Development services C...

How would you do dynamic pivot SQL in Amazon Redshift? (This elegant script might be used in many other databases as well)

Several days ago I spent a few hours of my life figuring out how to do a dynamic pivot in Amazon Redshift. To tell you the truth, I had expected much more from this DBMS's SQL dialect. Redshift is based on PostgreSQL 8.0.2 (which was released in 2005!). Anything you would want for this not-too-difficult task does not exist. No stored procedures. No JSON datatype. No variables outside of UDFs, no queries inside UDFs. "UDF can be used to calculate values but cannot be used to call SQL functions". Python UDFs also cannot query the data, only perform calculations. Finally I found one useful function, LISTAGG, that helped me get the distinct values of all pivoted columns. For each group in a query, the LISTAGG aggregate function orders the rows for that group according to the ORDER BY expression, then concatenates the values into a single string. LISTAGG( [DISTINCT] aggregate_expression [, 'delimiter' ] ) [ WITHIN GROUP (ORDER BY order_list) ] ...
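The same two-step idea - aggregate the distinct pivot values into one string, then build the pivot query from that string - can be sketched in Python against SQLite, whose GROUP_CONCAT plays the role of Redshift's LISTAGG here (the table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("EU", "Jan", 10.0), ("EU", "Feb", 20.0),
    ("US", "Jan", 30.0), ("US", "Feb", 40.0),
])

# Step 1: collect the distinct pivot values into one string, LISTAGG-style.
(months_csv,) = conn.execute(
    "SELECT GROUP_CONCAT(DISTINCT month) FROM sales"
).fetchone()

# Step 2: build the pivot query dynamically - one CASE column per value.
cases = ", ".join(
    'SUM(CASE WHEN month = \'{0}\' THEN amount END) AS "{0}"'.format(m)
    for m in months_csv.split(",")
)
pivot_sql = f"SELECT region, {cases} FROM sales GROUP BY region ORDER BY region"
rows = conn.execute(pivot_sql).fetchall()
print(rows)  # one row per region, one column per distinct month
```

In Redshift the first step would be a LISTAGG(DISTINCT ...) query and the second step would splice its result into the generated SQL the same way.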