
More often than not, the data one needs to analyse will fit on a hard disc. It may not fit in memory, though. That is where databases and out-of-core analytics solutions (like SAS) have the edge over in-memory solutions like R or Pandas (Python). For some time we have been arguing that column stores (columnar databases) are ideal for analytics (https://www.linkedin.com/pulse/column-stores-future-analytics-gopi-suvanam). The most powerful of these columnar solutions is a nifty database called MonetDB (https://www.monetdb.org/).

The problem with MonetDB is that it has its quirks in terms of query language and relational constraints. Also, most data scientists and analysts would prefer a dataframe-style interface (as available in Python or R) to an SQL one. We have therefore developed a wrapper around MonetDB that gives a consistent dataframe interface for analytical queries on MonetDB tables. This is similar to the Blaze ecosystem (Python), except that, because it works exclusively on MonetDB, the overhead costs are low.

A typical analytics query, say city-wise sales, would look something like this on various platforms:

  1. Python/Pandas: df.groupby('city')['sales'].sum()
  2. SQL query: SELECT city, SUM(sales) FROM data GROUP BY city
  3. MonetDB wrapper: df.groupby('city').sum('sales').compute()
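
To make the comparison concrete, here is a minimal sketch of the SQL form run against a tiny made-up table. It uses Python's built-in sqlite3 module purely as a stand-in engine (an assumption for illustration; the setup described in this article uses MonetDB), and the table contents are invented.

```python
import sqlite3

# Invented sample rows; sqlite3 stands in for MonetDB in this sketch.
rows = [
    ("Hyderabad", 120.0),
    ("Mumbai", 80.0),
    ("Hyderabad", 30.0),
    ("Mumbai", 20.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (city TEXT, sales REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

# City-wise sales: group the rows by city and sum the sales column.
city_sales = dict(
    conn.execute("SELECT city, SUM(sales) FROM data GROUP BY city").fetchall()
)
conn.close()

print(city_sales)
```

Note that the column names are written as bare identifiers. Wrapping them in single quotes would make SQL treat them as string literals rather than columns.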

The last expression has an additional step, “.compute()”, because of lazy evaluation: the dataframe queries are stored in the object and evaluated only when explicitly asked to be computed. It's surprising to find that this solution is not only faster than traditional databases by orders of magnitude (100 times faster than MySQL) but also twice as fast as an in-memory Pandas-based solution. We call this approach NSBD: Not So Big Data. NSBD combines the power of column stores while hiding the fact that the data is stored on the hard disc or an external system. The analyst or scientist can continue to work as if the data were stored in local variables. In a subsequent article we will discuss out-of-core machine learning using MonetDB and Python.
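
The lazy evaluation described above can be sketched roughly as follows. This is not the actual wrapper: the class name LazyFrame and its to_sql method are hypothetical, and sqlite3 is again used only as a stand-in for MonetDB. The point is that groupby and sum merely record what was asked for, and SQL is generated and executed only when compute() is called.

```python
import sqlite3


class LazyFrame:
    """Hypothetical lazy dataframe: records operations, runs SQL only on compute()."""

    def __init__(self, conn, table, group_col=None, agg=None):
        self.conn = conn
        self.table = table
        self.group_col = group_col
        self.agg = agg  # (SQL function name, column) or None

    def groupby(self, col):
        # Nothing is executed here; we only remember the grouping column.
        return LazyFrame(self.conn, self.table, group_col=col, agg=self.agg)

    def sum(self, col):
        # Likewise, only record that a SUM over `col` was requested.
        return LazyFrame(self.conn, self.table, self.group_col, agg=("SUM", col))

    def to_sql(self):
        # Identifiers are interpolated directly for brevity; a real wrapper
        # would need to validate or quote them.
        if self.agg is None:
            return f"SELECT * FROM {self.table}"
        func, col = self.agg
        if self.group_col is None:
            return f"SELECT {func}({col}) FROM {self.table}"
        return (
            f"SELECT {self.group_col}, {func}({col}) FROM {self.table} "
            f"GROUP BY {self.group_col} ORDER BY {self.group_col}"
        )

    def compute(self):
        # The only place where the database is actually queried.
        return self.conn.execute(self.to_sql()).fetchall()


# Invented sample data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (city TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("A", 1.0), ("B", 2.0), ("A", 3.0)],
)

df = LazyFrame(conn, "data")
query = df.groupby("city").sum("sales")  # no SQL has run yet
result = query.compute()                 # SQL is built and executed here
print(query.to_sql())
print(result)
```

Deferring execution this way lets the wrapper translate a whole chain of dataframe calls into a single query, so the heavy lifting stays inside the database rather than in Python.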
