Memory is a new disk

In the database world, disk based data stores are slowly being replaced by memory-based data stores. Memory prices are becoming more affordable and operational databases can usually fit totally into memory.

According to the Gartner (the research company that provides independent technology reports) by the end of 2015 all enterprise DBMS will use memory optimizations and most of this transformation will happen this year.

I am playing with the Hekaton tables these days and thinking to whom of my customers it might be relevant. Most of them prefer the new stuff and are quite eager to put new terms on their CV. They like to say “Why SQL Server? Everyone is using Redis as an in-memory database, it’s free and working blazingly fast. And the other department is using the Couchbase cluster. We don’t want to stay behind…” In such situations I need to step outside of the wardrobe where I’m hiding and peek around. The DBMS market keeps growing and many great new technologies are being introduced. I believe that by learning other technologies we understand better the advantages and disadvantages of our favorite ones. It is good to know what our competitors are doing. I also want to be capable to identify those situations where the SQL Server is the better solution and have enough expertize to explain why.

Here is the Gartner Magic Quadrant from October 2013 for operational (OLTP) DBMS providers, some already have in-memory optimizations:

The Big Vendors, Microsoft, Oracle, SAP and IBM are still Leaders on the field, all others are way behind. I believe that the true Leaders are always aware of all the trends in the database world, they adjust to them and adopt the new ideas, which sometimes means buying emerging breakthrough solutions.

Oracle have understood finally that having several database products is not too easy for their customers. They have announced an In-Memory Database option for Oracle 12c, it should be available this July. As opposed to TimesTen, which is a separate caching product, an in-Memory option will be totally integrated into the 12c version. It will use the in-memory column-store format which is usually being used for data analysis but they say that this feature will speed up also OLTP workloads. They will keep the data simultaneously in two formats – in the old row-based and in the in-memory column-based. Both formats will be transactionally consistent.

SAP has a High performance Analytic Appliance (HANA) DBMS since 2011. It supports 2 formats and stores in memory both row-based tabular data and column-based tables. Naturally, tabular orientation is recommended more for OLTP environments and column-store for OLAP-like queries. SAP HANA can be scaled out by adding nodes, sharing the same storage. In such case the master node will store row-based data and statistics on the rest of the data.

IBM DB2 10.5 has a BLU acceleration. This is a memory-resident index which helps to find data, stored in columnar form on disk. There is no scale out solution yet but they claim that it will come soon. IBM BLU is not really for OLTP environments, as opposite to SQL Server: for writes IBM recommends “to commit 10,000 rows at a time”. Transactions are using locks to make changes to the data and latches to access shared structures.

Microsoft SQL 2014 has a new query engine “in-memory OLTP” (“Hekaton”) where the row-based data resides completely in memory keeping a copy of the data on the disk for durability purposes. Indexes on memory-resident tables live only in memory. Queries can span all 3 existing query engines: relational traditional query engine where the data sits on disk, column-store query engine, where data is using both memory and disk and this 3rd new query engine that manages memory resident data. Hekaton tables are not using locks and latches for concurrency and locking management, it’s using a combination of timestamps and row versioning.

Here are some of the proven alternative technologies on the DBMS market that are worth noticing:

Aerospike: a key-value row-based NoSQL DBMS with in-memory indexes and data on SSD. It has many customers in gaming and advertising-driven sites. Key value stores are very simple and powerful structures if you always use key-matching searches. They are extremely fast and can scale easily by adding more servers to the cluster. The main downside of the key-value structure is a limited ability to search inside the value. Aerospike, unlike many of it’s competitors, allows creation of secondary indexes on the value and stores them in memory. Redis, for instance, is a quite popular really fast in-memory key-value store but it is not scalable beyond one node and does not support secondary indexes on value. Probably , due to those two limitations, it’s not yet in the Gartner quadrant. Aerospike is using latches and short-term locks to ensure isolation between transactions.

VoltDB: an all-in-memory NewSQL relational database, designed to run as a cluster of servers with tables partitioned across the cluster. Transactions in the VoltDB are running in a serial timestamp order, do not use locks or latches, and translate SQL into the Java code. They have many customers in telecom, utilities and financial services.

MemSQL: a distributed in-memory row-based NewSQL relational DBMS. It stores all data in memory in a row format and maintains the write-ahead log for durability purposes. SQL is being compiled into the C++ code. Data distribution is implemented using an aggregator node, which is splitting the query and sends across all leaf nodes for processing. The main node will aggregate it back before sending the results to the client. Recently they have added a new column-store table type which must reside on SSD. It is sort of in contradiction to the current trend where everyone who adds column-based technology, makes it memory-resident. This IMDBMS is used by customers such as Comcast and Zynga.

MongoDB and CouchBase are the most popular document based data stores. Their greatest advantage is a well known open schema in which work with Json documents becomes very easy. Both of them are not really in-memory products, they are working with data on disk. MongoDB however has some mode of work where they use the memory-mapped files to access the data on disk. This is in fact a Linux feature. Linux makes the files appear as regular files but they are sitting entirely in memory. Couchbase is “almost” an in-memory database, it caches as much as possible in RAM but data is persisted.

There are many developers eager to play with the new technologies. However, the biggest challenge of the new DBMS is to convince organizations to put aside well known products and start using their new solutions. I don’t think this will be an easy task despite the fact that many of them are great products. They can be extremely fast but it takes years to understand all users requirements and provide the complete flexible solution that will suit more than just a few cases.

Enterprise solutions are like LEGO. You can build out of it anything that you want. However there are situations when all your kid wants is just a toy car. He doesn’t care if it can be transformed into a toy plane or into a castle, and doesn’t intend investing the time needed to build it.

Yours,

Maria

Search This Blog

Maria's journey in the data fields

Memory is a new disk

Comments

Post a Comment

Popular posts from this blog

Unlocking Microsoft Fabric: A Simple Guide when you only have a personal account.

The backbone your data pipelines have been waiting for.

Understanding the Pillars of Data Quality