Using MPI, why would my parallel code be slower than my serial code?
Among others, three key factors determine the performance of a parallel code:
- Parallel task granularity;
- Communication overhead;
- Load balancing among processes.
Parallel task granularity
The granularity of the parallel tasks must be big enough to outweigh the overheads of parallelism (e.g., creating the parallel tasks and communicating between them). Because communication between processes in a distributed-memory (DM) model is normally more expensive than synchronization between threads, processes should be given a coarser task granularity. This granularity, however, should not jeopardize load balancing.
tl;dr: Your parallel tasks must be "big" enough to justify the overheads of parallelization.
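For illustration, here is a minimal sketch of a coarse-grained decomposition (the value of N and the dummy per-iteration work are just placeholders): each process works on one large contiguous block of iterations and communicates only once at the end, so the fixed message cost is amortized over many iterations.

```c
/* Sketch: coarse-grained decomposition of N independent iterations.
 * Each rank gets one large contiguous block and communicates only once. */
#include <mpi.h>
#include <stdio.h>

#define N 100000000LL

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One large contiguous block per process: coarse granularity. */
    long long begin = N * rank / size;
    long long end   = N * (rank + 1) / size;

    double local = 0.0;
    for (long long i = begin; i < end; i++)
        local += 1.0 / (double)(i + 1);   /* stand-in for real work */

    /* One message per process instead of one per iteration. */
    double total;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}
```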
Communication overhead
Whenever one process intends to communicate with others, it pays the cost of creating and sending the message, and, in the case of synchronous communication routines, also the cost of waiting for the other processes to receive the message. So, to increase the performance of your application with MPI, it is necessary to reduce the number of messages exchanged between processes.
One option is to use computational redundancy between processes: instead of waiting for the result from one particular process, each process computes that result itself. Of course, this is normally justified only when the overhead of exchanging the result exceeds the time taken by the computation itself. Another solution is to replace synchronous communication with asynchronous communication. While in synchronous communication the process that sends the message waits until the other process receives it, in asynchronous communication the sender resumes its execution immediately after returning from the send call, thereby overlapping communication with computation. However, to take advantage of asynchronous communication it may be necessary to restructure the code, and it can still be hard to achieve a good overlap ratio.
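As a rough sketch of the non-blocking approach (the buffer name `halo` and the message size are only illustrative): the transfer is started with MPI_Isend/MPI_Irecv, independent computation runs while the message is potentially in flight, and MPI_Wait is called only when the buffer is actually needed.

```c
/* Sketch: overlapping communication with computation using non-blocking
 * calls.  Rank 0 sends a buffer to rank 1; both can do unrelated work
 * while the transfer is (potentially) in progress.  Run with >= 2 ranks. */
#include <mpi.h>

#define COUNT 1000000

static double halo[COUNT];

int main(int argc, char **argv) {
    int rank, size;
    MPI_Request req = MPI_REQUEST_NULL;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0)
            MPI_Isend(halo, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        else if (rank == 1)
            MPI_Irecv(halo, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    /* ... computation that does not touch `halo` goes here,
     * overlapping with the message transfer ... */

    /* Only wait when the buffer is actually needed again. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```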
It is possible to reduce communication overhead by using higher-performance communication hardware, but that may turn out to be expensive. Collective communications can also improve communication performance, since they optimize the communication based on the hardware, network, and topology.
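For example, a hand-written loop of MPI_Send calls from the root can usually be replaced by a single collective such as MPI_Bcast, which lets the library choose a topology-aware pattern (typically O(log P) steps instead of O(P)). A minimal sketch (the `params` array is just a placeholder):

```c
/* Sketch: one collective call instead of a loop of point-to-point sends. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double params[64] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* ... rank 0 fills `params` ... */
    }

    /* Every rank receives the data in a single collective call. */
    MPI_Bcast(params, 64, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```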
tl;dr: Reduce the amount of communication and synchronization between parallel tasks, using redundant computation, asynchronous communication, collective communications, and faster communication hardware.
Load balancing among processes
Good load balancing is essential since it maximizes the work done in parallel. Load balancing is affected both by the distribution of tasks among processes and by the set of resources on which the application runs.
In applications running on a fixed set of resources, you should focus on the task distribution. If the tasks have roughly the same amount of computation (e.g., the iterations of a for loop), then it is only necessary to distribute the tasks as evenly as possible among processes.
But some applications may run on systems with processors of different speeds, or may have sub-tasks with different amounts of computation. For this type of situation, a task-farming model can promote better load balancing, since it can be implemented with a dynamic task distribution. The drawback is that, in this model, the amount of communication involved can jeopardize efficiency. A minimal sketch of the pattern is shown below.
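Here is a rough master/worker sketch of task farming (the tags and the `do_task` placeholder are purely illustrative): rank 0 hands out one task at a time to whichever worker reports back first, so faster processes naturally receive more tasks.

```c
/* Sketch: task farming with dynamic task distribution. */
#include <mpi.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

static double do_task(int t) { return (double)t * t; }   /* stand-in work */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* master */
        int next = 0, active = 0;
        double result;
        MPI_Status st;
        /* Seed each worker with an initial task (or stop it if none left). */
        for (int w = 1; w < size; w++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* Hand out the remaining tasks as results come back. */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                                  /* workers */
        int task;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double r = do_task(task);
            MPI_Send(&r, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```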
Another solution is to manually tune the task distribution. This may turn out to be complex and hard. Moreover, if the set of resources is not homogeneous in speed and keeps changing between application executions, the performance portability of the task-distribution tuning may be jeopardized.
tl;dr: Each process should take approximately the same time to finish its work.