There has been much talk about Hadoop and MapReduce. Map and reduce are not new concepts and have been in use for a long time (more than 50 years!) in functional programming. In fact, most embarrassingly parallel problems can be generalised using a parallel map function. In Python this can be achieved with a few lines of code (the package pp is used here):

import pp

def map(func, args, nodes=None, num_local_procs='autodetect'):
	# 'autodetect' is pp's default and uses all local CPU cores;
	# nodes is an optional tuple of "host:port" strings for remote pp servers
	if nodes is not None:
		job_server = pp.Server(ncpus=num_local_procs, ppservers=nodes)
	else:
		job_server = pp.Server(ncpus=num_local_procs)
	# submit one job per argument tuple, then collect the results in order
	jobs = [job_server.submit(func, input) for input in args]
	return [job() for job in jobs]
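
Just to show how this parallel map is called (the square function and its arguments here are only a toy illustration, not part of the original code), each element of args is a tuple of positional arguments for one submitted job, and the results come back in the same order:

def square(x):
	return x * x

print(map(square, [(1,), (2,), (3,), (4,)]))	# prints [1, 4, 9, 16]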

Once the higher-order function map is defined, it is easy to parallelise any function. For example, the code below parallelises finding all prime numbers less than 20,000,000:

def numprimes(x, y):
	prime_nos = []
	for num in range(x, y + 1):
		# prime numbers are greater than 1
		if num > 1:
			for i in range(2, int(num**0.5) + 1):
				if (num % i) == 0:
					break
			else:
				prime_nos.append(num)
	return prime_nos
from functools import reduce	# needed in Python 3; the built-in works in Python 2

upper = 20000000
num_steps = 20
# split the range into num_steps chunks; // keeps the bounds integers
args = [(upper*i//num_steps, upper*(i+1)//num_steps) for i in range(0, num_steps)]
allprimes = reduce(lambda x, y: x + y, map(numprimes, args))

This code can run in parallel on the local machine using all its CPUs, or across a network, provided the pp server is running on the network nodes. With some modifications the above code can also be made fault tolerant (a rough sketch of both ideas is shown after the repository link below). We have shared the code as a public repository on GitHub under the GPL license here:

https://github.com/gopiks/mappy
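
As a minimal sketch of those two remarks (the fault_tolerant_map function, the max_retries parameter and the node addresses below are assumptions for illustration, not part of the shared repository), remote nodes can be listed as "host:port" strings once ppserver.py is running on them, and failed jobs can simply be resubmitted; pp typically hands back None for a job that raised an exception or lost its node:

import pp

def fault_tolerant_map(func, args, nodes=None, max_retries=3):
	# same idea as map above, but jobs whose result comes back as None
	# are resubmitted up to max_retries times; this assumes func never
	# legitimately returns None (true for numprimes, which returns a list)
	job_server = pp.Server(ppservers=nodes) if nodes is not None else pp.Server()
	results = [None] * len(args)
	pending = list(range(len(args)))
	for attempt in range(max_retries):
		jobs = [(i, job_server.submit(func, args[i])) for i in pending]
		pending = []
		for i, job in jobs:
			result = job()
			if result is None:
				pending.append(i)	# failed, retry in the next round
			else:
				results[i] = result
		if not pending:
			break
	return results

# hypothetical remote nodes, each assumed to be running "ppserver.py -p 35000"
remote_nodes = ("192.168.0.2:35000", "192.168.0.3:35000")
allprimes = reduce(lambda x, y: x + y,
                   fault_tolerant_map(numprimes, args, nodes=remote_nodes))

Retrying in rounds keeps each pass parallel; a more careful version would distinguish a genuine None result from a failure and put a timeout on each round.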

Happy parallel processing 🙂

PS: G-Square uses parallel processing extensively in delivering its analytics solutions. (Check out g-square.in/products for some of the products built using parallel processing)
