Distributed file systems have largely solved the problem of elastically storing data across many machines and searching and retrieving it in parallel. But problems around the speed of analytics on structured data remain. Most analytics algorithms have parallel implementations, yet I haven't come across a good solution that can run scalably on a cluster of commodity servers and do complex analytics without user-side coding. At a minimum, such a solution should be able to do the following operations in parallel:
- Join
- Matrix operations: multiplication, inversion, dot product
- Optimisation
- Maximum likelihood estimates of some statistical models like regression and logistic regression
- Decision tree algorithms
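To make one of these primitives concrete, here is a minimal sketch of how matrix multiplication can be parallelised across worker processes, using Python's standard `multiprocessing` module. The function names and the row-block partitioning scheme are illustrative assumptions, not part of any existing framework; a real system would do the same thing across machines rather than processes.

```python
# Toy sketch: block matrix multiplication parallelised over processes.
# Each worker computes one horizontal slice of the product A @ B.
from multiprocessing import Pool

def matmul_rows(args):
    """Multiply a horizontal slice of A by the full matrix B."""
    a_rows, b = args
    n_cols = len(b[0])
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(n_cols)]
            for a_row in a_rows]

def parallel_matmul(a, b, n_workers=4):
    # Split A into contiguous row chunks, one per worker.
    chunk = max(1, len(a) // n_workers)
    slices = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with Pool(n_workers) as pool:
        parts = pool.map(matmul_rows, [(s, b) for s in slices])
    # Concatenate the row blocks back into the full result.
    return [row for part in parts for row in part]

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(parallel_matmul(A, B))  # [[19, 22], [43, 50]]
```

The same split-compute-concatenate shape covers dot products and, with more care, inversion via blocked decompositions, which is why these operations belong in the primitive set.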
In fact, these are some of the analytics “primitives” used across industries for building models.
The solution should also scale linearly, just as Hadoop or any MapReduce system does as more commodity servers are added.
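Why joins can scale this way is worth spelling out: hash-partition both sides on the join key, and each partition pair can be joined independently on its own worker, so throughput grows roughly with the number of workers. The sketch below runs the partitions sequentially for clarity; the function names and tuple layout are my own illustrative assumptions.

```python
# Toy sketch of a partitioned hash join over (key, value) records.
# Records with the same key always land in the same partition, so each
# partition pair can be joined independently (and thus in parallel).
from collections import defaultdict

def partition(records, n_parts):
    parts = [[] for _ in range(n_parts)]
    for key, value in records:
        parts[hash(key) % n_parts].append((key, value))
    return parts

def join_partition(left, right):
    # Build a hash index on the left side, probe with the right side.
    index = defaultdict(list)
    for key, value in left:
        index[key].append(value)
    return [(key, lv, rv) for key, rv in right for lv in index[key]]

def partitioned_join(left, right, n_parts=4):
    lp, rp = partition(left, n_parts), partition(right, n_parts)
    out = []
    for l, r in zip(lp, rp):  # each pair could run on a separate worker
        out.extend(join_partition(l, r))
    return out
```

Doubling `n_parts` (and the workers behind them) roughly halves the work per worker, which is the linear-scaling behaviour the post asks for.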