Column stores: The future of analytics

The most common myth about Big Data is that Big Data and Hadoop are synonymous. In most business cases the data is structured and small enough to fit on a hard disk. Only in very extreme cases (think Facebook/Google scale) is the data too large to fit on a hard disk. Most use cases in data science don't need Hadoop. In fact, a large part of structured data analysis (think dataframes, tables, matrices, factors, features and so on) can be done using well-structured solutions. Admittedly, a typical RDBMS (MySQL etc.) will not handle all analytics easily. An alternative is to use memory-based solutions (Pandas/R/Spark). But eventually one is going to hit the limits of RAM, or the RAM will become too expensive. This is where column stores help.

Column stores save data in a columnar fashion (i.e. each column in a separate memory-mapped file). Some also apply compression to reduce disk usage. What is surprising is that a well-written column store can outperform even memory-based solutions for querying. This makes them ideal for online analytical processing (OLAP). What is lacking right now is a good machine learning/statistical modeling solution on top of column stores. Once such solutions are available, column stores will become ideal for analytics.
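
To make the "one file per column" idea concrete, here is a minimal sketch in Python using only the standard library. It is not how BColz or MonetDB are implemented; the helper names (write_column, sum_column), the ".col" file extension and the sample data are all made up for illustration. It only shows that an aggregate over one column needs to map and read that column's file and nothing else.

```python
import mmap
import os
import struct
import tempfile


# Hypothetical helpers for illustration; not the API of any real column store.
def write_column(directory, name, values, fmt):
    """Store one column as a flat binary file of fixed-width values."""
    path = os.path.join(directory, name + ".col")
    with open(path, "wb") as f:
        # "<" = little-endian, standard sizes; fmt is a struct code like "d" or "q".
        f.write(struct.pack("<%d%s" % (len(values), fmt), *values))
    return path


def sum_column(path, fmt):
    """Aggregate a single column by memory-mapping only that column's file."""
    width = struct.calcsize("<" + fmt)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = len(mm) // width
            return sum(struct.unpack_from("<%d%s" % (count, fmt), mm))


# A tiny two-column "table": each column lives in its own file.
workdir = tempfile.mkdtemp()
price_path = write_column(workdir, "price", [10.5, 20.0, 4.5], "d")  # float64
qty_path = write_column(workdir, "qty", [1, 2, 3], "q")              # int64

# Summing prices never touches the qty file.
total_price = sum_column(price_path, "d")
print(total_price)
```

The same layout also hints at why inserts tend to be slow in this design: adding one row means appending to every column file, whereas a row store writes the row in one place.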

There are several popular column stores, but the ones we like are BColz (a Python-based solution) and MonetDB (written in C and available across platforms). MonetDB is the fastest querying platform I’ve seen to date. For aggregation, it is 100 times faster than MySQL, 20 times faster than Postgres and twice as fast as Pandas. Of course, it comes with the limitation of poor documentation. Also, most column stores, because of their design, perform poorly when it comes to inserting data. Having said that, MonetDB is an ideal replacement for data analysis in R or Python, especially when the data size exceeds affordable RAM.