Columnar DBs: Future of Structured Data


Contrary to popular perception, structured data analytics still forms a big portion of overall analytics/big data activity. Although unstructured data (text, data with variety etc.) is gaining more and more importance, structured data (transactions, numeric data etc.) still gives a lot of actionable insights for businesses. The complication is that even structured data is growing dramatically in volume and velocity.

Today’s transactional data cannot be easily analysed using a traditional RDBMS. At the same time, in-memory databases cannot handle the volume of data at a low cost, and NoSQL databases are designed more for unstructured data than for structured data analysis. That leaves two practical options: adopt a cluster computing solution like Spark, or use a denormalized columnar database.

Denormalized columnar databases provide fantastic performance for analytical functions like filtering, grouping, aggregation etc. Because each column is stored separately, a query reads only the columns it needs, which reduces the time for these kinds of queries significantly. The downside is that insertions and deletions take much more time than in a traditional RDBMS. The other advantage of columnar databases is the significant amount of data compression they achieve. Both of these features, high compression and a massive increase in performance, have led to the use of columnar databases for the back-end of analytics. If one uses columnar DBs, there is no need to create cubes and pre-stored reports for online analytical processing (OLAP). Analytics can be performed in real-time directly on the transactions.
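To make the idea concrete, here is a minimal sketch in plain Python of why a columnar layout helps with aggregation and compression. It is only an illustration and does not show how any particular database is implemented. The table, the column names and the helper functions (to_columns, sum_by, run_length_encode) are made up for this example, and run-length encoding is just one simple compression scheme chosen because it is easy to show.

```python
from collections import defaultdict
from itertools import groupby

# Row-oriented layout: all fields of a record are stored together.
# The data below is invented purely for illustration.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "east", "amount": 80.0},
    {"order_id": 3, "region": "west", "amount": 200.0},
    {"order_id": 4, "region": "west", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = defaultdict(list)
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return dict(columns)


def sum_by(columns, key, value):
    """Group-by sum that reads only the two columns it needs."""
    totals = defaultdict(float)
    for k, v in zip(columns[key], columns[value]):
        totals[k] += v
    return dict(totals)


def run_length_encode(values):
    """Compress runs of repeated values into (value, count) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


columns = to_columns(rows)

# The aggregation never touches the order_id column.
totals = sum_by(columns, "region", "amount")

# Values within one column tend to repeat, so they compress well.
encoded_region = run_length_encode(columns["region"])

print(totals)
print(encoded_region)
```

The same layout also hints at why inserts are slower: adding one record means appending to every column, and possibly re-encoding compressed runs, instead of writing a single row.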

One relatively lesser-known open-source columnar database that we found to give superb performance is MonetDB. It does not have great community support, but we found its performance to be unmatched. Developers have to struggle a little initially to navigate the quirks of the DB, but once one gets a grasp of the database it is ideal for analytics requirements. For example, we found MonetDB to be 100 times faster than MySQL for group-by aggregations and about 20 times faster than PostgreSQL. In fact, we were surprised to find that it is faster than even in-memory solutions like PySpark and Pandas (in Python).

To make development easier, we have created an abstraction on top of MonetDB. The abstraction layer works similarly to Blaze or Pandas in the Python ecosystem, but with MonetDB as the back-end. We are also in the process of building a machine learning library that can work on large amounts of data with MonetDB as the back-end. Together, these cover both simple business intelligence and advanced analytics for massive amounts of structured data.
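As a rough idea of what a Pandas-like layer over a SQL back-end can look like, here is a small hypothetical sketch. It is not our actual abstraction layer: the Frame class, its where and group_sum methods, and the sales table are all invented for this example. The sketch only builds a SQL string and does not connect to MonetDB or any other database.

```python
class Frame:
    """Hypothetical lazy table wrapper that builds SQL instead of loading data."""

    def __init__(self, table, filters=()):
        self.table = table
        self.filters = tuple(filters)

    def where(self, condition):
        """Return a new Frame with one more filter; the original is unchanged."""
        return Frame(self.table, self.filters + (condition,))

    def group_sum(self, key, value):
        """Compile a group-by sum into a SQL string for the back-end to run."""
        # Plain string formatting keeps the sketch short. Real code should
        # validate identifiers and use bound parameters for user input.
        sql = f"SELECT {key}, SUM({value}) AS total FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql + f" GROUP BY {key}"


# Chained calls read like a dataframe API but produce one SQL statement.
query = Frame("sales").where("amount > 100").group_sum("region", "amount")
print(query)
```

The point of this design is that the filtering and aggregation are pushed down to the database as a single query, so the data never has to be pulled into Python memory first.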