Sunday, 20 December 2015

Python: map Vs imap

First, let us see how Pool.map works internally.

map(func, iterable[, chunksize])
It is the parallel equivalent of the built-in map function. It applies the function to every item of the iterable and returns the results as a list, blocking until all of them are ready. The chunksize parameter causes the iterable to be split into pieces of approximately that size, and each piece is submitted to the pool as a separate task.


If you provide the chunk size, map converts the iterable to a list, divides it into chunks of that size, and submits these chunks to the worker processes.
from multiprocessing import Pool
from time import sleep
import os


def process(task):
    # Each task sleeps for `task` seconds to simulate work of varying length.
    print("Started task ", task, " PID :", os.getpid())
    sleep(task)
    return str(task) + " Finished"


if __name__ == "__main__":
    tasks = list(range(20))

    # The with block cleans up the 5 worker processes on exit.
    with Pool(5) as myPool:
        print("Submitted tasks to pool")
        # chunksize=4: the 20 tasks are split into 5 chunks of 4 tasks each.
        results = myPool.map(process, tasks, 4)
        print("Got the results")

    for result in results:
        print(result)


Output
Submitted tasks to pool
Started task  0  PID : 91427
Started task  1  PID : 91427
Started task  4  PID : 91428
Started task  8  PID : 91429
Started task  12  PID : 91430
Started task  16  PID : 91431
Started task  2  PID : 91427
Started task  3  PID : 91427
Started task  5  PID : 91428
Started task  9  PID : 91429
Started task  6  PID : 91428
Started task  13  PID : 91430
Started task  7  PID : 91428
Started task  17  PID : 91431
Started task  10  PID : 91429
Started task  14  PID : 91430
Started task  11  PID : 91429
Started task  18  PID : 91431
Started task  15  PID : 91430
Started task  19  PID : 91431
Got the results
0 Finished
1 Finished
2 Finished
3 Finished
4 Finished
5 Finished
6 Finished
7 Finished
8 Finished
9 Finished
10 Finished
11 Finished
12 Finished
13 Finished
14 Finished
15 Finished
16 Finished
17 Finished
18 Finished
19 Finished

Main problems with map
a.   map has to split the iterable into chunks, so it first loads the entire iterable into memory and converts it to a list. If the iterable is large, this can have a very high memory cost, since the whole list must be kept in memory.
b.   You get the results only after all the tasks have finished executing. There are no partial results.
c.   Processes that finish their chunk early sit idle, which hurts performance. In our case, Process 1 finishes its 4 tasks (0, 1, 2, 3) much earlier than the other processes finish theirs, and then sits idle. In this kind of scenario, we are not using the processors effectively.
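
Point b is where imap, the other name in this post's title, behaves differently: Pool.imap returns an iterator, so results can be consumed one at a time instead of as one finished list. The sketch below only illustrates that difference. It is not the example from this post: square, first_results and all_results are hypothetical names, and it uses ThreadPool (the thread-backed class with the same map/imap methods) as a stand-in so that it runs the same way on every platform.

```python
from multiprocessing.pool import ThreadPool  # stand-in for Pool (assumption, for portability)


def square(n):
    # Hypothetical cheap task, used only for illustration.
    return n * n


def first_results(count, total):
    """Pull only the first `count` results from a lazy imap over range(total)."""
    with ThreadPool(2) as pool:
        lazy = pool.imap(square, range(total))  # an iterator, not a list
        return [next(lazy) for _ in range(count)]


def all_results(total):
    """map blocks until every task is done, then returns one complete list."""
    with ThreadPool(2) as pool:
        return pool.map(square, range(total))


print(first_results(3, 10))  # first three results, in input order
print(all_results(5))        # the complete list, only once everything has finished
```

With real worker processes the same two calls exist on multiprocessing.Pool; the function passed in then has to be defined at module level so it can be sent to the workers.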

Look at the code once more: I defined a pool of 5 processes and a chunk size of 4. When I submit 20 tasks, map divides them into 5 chunks of 4 tasks each.

Process 1 gets the chunk with tasks 0, 1, 2, 3
Process 2 gets the chunk with tasks 4, 5, 6, 7
Process 3 gets the chunk with tasks 8, 9, 10, 11
Process 4 gets the chunk with tasks 12, 13, 14, 15
Process 5 gets the chunk with tasks 16, 17, 18, 19
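
To make the split concrete, here is a small sketch of the chunking and of how uneven the resulting work is. split_into_chunks is my own hypothetical helper, not the library's internal code, and the sketch assumes each chunk lands on a different worker, as it did in the output above.

```python
from itertools import islice


def split_into_chunks(iterable, chunksize):
    """Yield tuples of at most `chunksize` consecutive items (hypothetical helper)."""
    iterator = iter(iterable)
    while True:
        chunk = tuple(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# Same setup as the example: 20 tasks, chunk size 4.
chunks = list(split_into_chunks(range(20), 4))

# Each task sleeps for its own value, so a chunk's run time is the sum of its tasks.
seconds_per_chunk = [sum(chunk) for chunk in chunks]

for number, (chunk, seconds) in enumerate(zip(chunks, seconds_per_chunk), start=1):
    print("Process", number, "gets", chunk, "and sleeps", seconds, "seconds in total")
```

With these numbers the first chunk needs about 6 seconds of sleeping and the last one about 70, so the worker holding the first chunk is idle for most of the run.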


