
Load Balancing for Heterogeneous CFD Simulations

Efficiency can be measured in many ways; it is most commonly measured by Central Processing Unit (CPU) time or wall time. Another useful indicator of efficiency can be defined as follows: the amount of computational time spent advancing the solution compared to the total time spent on a given computational step. This will be referred to hereafter as the CPU Active Ratio (CAR), computed simply as

\text{CAR} = \frac{t_{\text{Active}}}{t_{\text{Total}}}
The CAR represents how much of the available computational power is being used to advance the problem, and thus how efficiently the available resources are being utilized. Because not every part of a computation can be parallelized, there is an upper limit to the CAR. See also: Amdahl's Law.
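For a parallel fraction p of the work running on N processors, Amdahl's Law bounds the achievable speedup as

S(N) = \frac{1}{(1 - p) + p/N}

so the serial fraction (1 - p) sets a hard ceiling that no amount of extra hardware can move; this is also why the CAR cannot be driven arbitrarily close to 1.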

For standard CFD codes based on the Euler and Navier-Stokes equations, the time required to advance the solution by one time step is relatively consistent across all computational cells. Apart from MPI (or other parallelization) overhead, the CAR is expected to be quite high. For more complicated methods, where the work per computational cell is variable, the time required to advance one time step may differ significantly between cells. If some regions can be solved quickly while others are slow to produce a result, the total wall time can become dominated by a few expensive computational cells hogging the CPU time. I have observed such a case in my own studies, where the CAR dropped as low as 0.5 (indicating that half of the computational resource was being wasted).

Many multidimensional codes break the solution domain into blocks which are assigned to different CPUs. The easy way of assigning these blocks is simply to give each CPU an equal number of them. While this keeps the CAR high in many cases, a few cells that take a long time to compute can cause the simulation to stall.

The solution proposed here is a simple load-balancing scheme based on the classical Round Robin method. Load balancing has been a topic of research for many years; previously explored methods include hybrid parallelization using the Message Passing Interface (MPI) together with the OpenMP multiprocessing API [1], and an optimization algorithm that runs after each time step to move blocks of computational cells between MPI processes [2]. A recent publication by Micale, Bracconi, and Maestri [3] employs a method similar to the one presented here; however, the method presented here was developed independently prior to the authors' publication (as confirmed by my GitHub commits to a private repository).

The Round Robin

The method proposed herein requires that the program keep track of the computational time expended per process; the time from the start of a simulation step to the end must therefore be measured without the interference of blocking MPI commands. The computational time expended on the calculations alone is related to the number of calculations the CPU must perform on the block. For a homogeneous computing environment, no scaling is needed. For a heterogeneous system, where one CPU does not match the performance of another, some scaling criterion should be established (for example, the time taken to compute pi to 50 decimal places could serve as a benchmark scaling metric for floating-point operations). A more effective strategy is to define a standard computational block and run it on each compute device, giving a performance measure for a problem that is more true-to-life.
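As a minimal sketch (not my actual instrumentation), the compute-only time can be captured with MPI_Wtime around the kernel alone, which also yields the per-rank CAR directly; the kernel and cell count below are placeholders:

```cpp
// A minimal sketch: measure compute-only time with MPI_Wtime, excluding
// blocking communication so that waiting on neighbours is not counted
// as useful work.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> cells(1000, 1.0);        // placeholder per-rank workload

    const double t0 = MPI_Wtime();
    for (double& c : cells) c = 0.5 * (c + 1.0); // stand-in for the real kernel
    const double t_active = MPI_Wtime() - t0;    // compute-only time

    MPI_Barrier(MPI_COMM_WORLD);                 // stand-in for the halo exchange
    const double t_total = MPI_Wtime() - t0;     // compute time plus waiting

    std::printf("rank %d: CAR = %.3f\n", rank, t_active / t_total);
    MPI_Finalize();
    return 0;
}
```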

The mean compute time per process, referred to as t_b , is the total compute time of all computational blocks divided by the number of processes. Note that this mean should be corrected by the scaling factor in heterogeneous compute environments. The final step of this process is to assign computational blocks to each process until the accumulated compute time on that process reaches at least t_b ; this is repeated until all blocks are assigned. A critical assumption made here is that the computational time required for timestep T_i is similar to the computational time required for timestep T_{i+1} .
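A minimal sketch of the assignment step, assuming the (scaled) per-block times have already been gathered onto one rank; all names and the example times are illustrative:

```cpp
// Sketch: greedily fill each process up to the mean compute time per
// process. Blocks keep their original ordering, so this behaves like a
// round robin whose turns are weighted by measured cost.
#include <cstdio>
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<int> assign_blocks(const std::vector<double>& block_time, int n_procs) {
    const double total  = std::accumulate(block_time.begin(), block_time.end(), 0.0);
    const double target = total / n_procs;            // mean compute time per process

    std::vector<int> owner(block_time.size());
    int proc = 0;
    double filled = 0.0;
    for (std::size_t b = 0; b < block_time.size(); ++b) {
        owner[b] = proc;
        filled += block_time[b];
        if (filled >= target && proc + 1 < n_procs) { // this process is "full"
            ++proc;
            filled = 0.0;
        }
    }
    return owner;                                     // owner[b] = rank for block b
}

int main() {
    const std::vector<double> t = {5, 1, 1, 1, 4, 1, 1, 2}; // invented block times
    for (int o : assign_blocks(t, 3)) std::printf("%d ", o);
    std::printf("\n"); // prints: 0 0 1 1 1 2 2 2 (per-rank loads 6, 6, 4)
}
```

Note that the last rank ends up with the least work (4 versus 6); this is the under-subscription issue discussed further below.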

The reason a more complicated time approximation for T_{i+1} is not implemented is that the load-balancing operation must be extremely lightweight so as not to negate its own existence; any additional calculation adds overhead which could reduce the effectiveness of load balancing. It would theoretically be possible to store the time T_{i-1} and compute a linear approximation of T_{i+1} relatively cheaply, as shown below. For now, I have not explored this technique, as I do not believe it would yield enough of a performance gain to justify the added complexity.
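The estimate itself would be just

T_{i+1} \approx T_i + (T_i - T_{i-1}) = 2T_i - T_{i-1}

i.e. one subtraction and one addition per block, plus the storage for T_{i-1} .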

The described approach was tested using the known computational times from a complex CFD code from my Master's thesis. In this case, load balancing was done on a cell-by-cell basis rather than block-by-block, since the one-dimensional code does not split the domain into blocks. Figure 1 shows a plot of compute time with the load-balanced assignment overlaid for a specific time step.

Figure 1: An example of a load-balanced simulation result. This is a quasi-2D simulation (a one-dimensional simulation with time slices extracted at regular intervals); load-balancing results are shown for the slice at time t = 0.5 (presented here as x = t).

If we compute the time for each cell grouping assuming an even distribution across all CPUs, the slowest grouping takes 132 ms to finish. Compare this to the load-balanced method, which achieves the same result in just 53 ms; this represents a more than 50% reduction in compute time.

In practice, performance will not match this ideal result: there is overhead in moving cells from one CPU to another, there is overhead in computing the average cell time, and the computational time may change suddenly from one time step to the next. The first two problems can be addressed directly by reducing the frequency of re-balancing. It is also important to be mindful of Amdahl's Law; the theoretical 50% reduction in compute time is an ideal which may not be attainable without considerable tuning. The proposed load-balancing algorithm is just an extra data shift tacked on to the end of the normal finite-volume solution step.

Results from Cell-Based Balancing

If applied at every computational step, this procedure actually increases computational time. If instead it is applied only every 100 time steps, it produces a computational speed-up of 30% for this case. Clearly, triggering too frequently can make the load-balancing scheme a net loss; triggering too infrequently may likewise negate any positive effects, or even hinder the simulation by not updating regions where a computationally intensive feature is no longer present. There is also the matter of timing noise: a computational cell may not take exactly 800 ns every single time, so cells on the boundary between processes may end up being swapped back and forth frequently.
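A sketch of the trigger logic: the 100-step interval comes from the result above, while the 10% noise tolerance is an invented starting point that would need tuning per problem:

```cpp
// Sketch: re-balance only every `interval` steps, and only move a cell
// whose measured time has drifted by more than `tolerance` (relative)
// since the last re-balance, so timing noise alone does not cause swaps.
#include <cmath>

bool should_move(double t_now, double t_prev, long step,
                 long interval = 100, double tolerance = 0.10) {
    if (step % interval != 0) return false;                 // not a re-balance step
    return std::fabs(t_now - t_prev) > tolerance * t_prev;  // drift beyond noise
}
```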

For block-based solutions, it may be possible to track the L2-norm of the solution and trigger a re-balance only when the norm is changing beyond some baseline on multiple CPUs. Further research may reveal better criteria for triggering a re-balance. A further issue remains: with this approach, we risk under-subscribing the last CPU, because each CPU node is filled to at least the mean compute time. This could be alleviated by over- and under-subscribing CPU nodes alternately; however, as of this time I have not looked into it. Another alternative would be to keep track of the "total time" metric and adjust the balancing on the fly if it appears that the final node will be under-subscribed. Again, I have not done this, as I want to keep the load-balancing algorithm as simple as possible.
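One possible shape for that L2-norm trigger (a sketch only; the baseline value is a placeholder that would need calibrating against real runs):

```cpp
// Sketch: flag a block for re-balancing when the L2-norm of the change
// in its solution exceeds a baseline, i.e. when the physics in the block
// is evolving fast enough that its cost may be shifting.
#include <cmath>
#include <cstddef>
#include <vector>

bool norm_trigger(const std::vector<double>& u_new,
                  const std::vector<double>& u_old,
                  double baseline = 1e-3) {
    double sum = 0.0;
    for (std::size_t i = 0; i < u_new.size(); ++i) {
        const double d = u_new[i] - u_old[i];
        sum += d * d;
    }
    return std::sqrt(sum) > baseline;
}
```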

Block-Based Approaches

The cell-based procedure was demonstrated on a quasi-2D simulation with 500 computational cells. The solution was first order, meaning only one piece of information was shared between any pair of neighbouring CPUs. If a multidimensional block-based approach were used, an additional latency penalty would have to be accounted for, owing to the requirement of sending ghost-cell information between adjacent CPUs.

A major source of performance loss is data transfer. Within a single CPU, data transfer can be relatively efficient, but transfer over a network (regardless of the technology) is inherently slow, and the slowdown is worse still when offloading data to a GPU. It is therefore expected that multidimensional codes will require even less frequent load-balancing calls to avoid the time wasted in data transfers.

I have not yet applied this to a multidimensional block-based approach to determine how bad the additional latency will be. This work is planned, but I will first need access to a well-written code to which I can add this algorithm.

Load Balancing for Heterogeneous Systems

One final note concerns load balancing a non-uniform workload across a non-uniform compute architecture. The use of GPU acceleration for CFD is not exactly new; many codes use CUDA, SYCL, or similar technologies to offload highly parallel computations to streaming multiprocessors. I have considered the possibility of balancing a load between these devices. Of particular interest: if we have a high-performance CPU node paired with multiple high-performance GPUs, how do we split the load? More importantly, what if those GPUs are not matched in performance; can we split the load effectively between them?
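The natural starting point is to split the blocks in proportion to a benchmarked per-device throughput (for example, the standard computational block described earlier). A minimal sketch, with all device scores invented for illustration:

```cpp
// Sketch: split n_blocks across devices in proportion to a benchmarked
// throughput score. The scores here are invented for illustration only.
#include <cstdio>
#include <cstddef>
#include <vector>

std::vector<int> split_by_throughput(int n_blocks, const std::vector<double>& score) {
    double total = 0.0;
    for (double s : score) total += s;

    std::vector<int> count(score.size());
    int assigned = 0;
    for (std::size_t d = 0; d < score.size(); ++d) {
        count[d] = static_cast<int>(n_blocks * score[d] / total);
        assigned += count[d];
    }
    count.back() += n_blocks - assigned;  // give any rounding remainder to the last device
    return count;
}

int main() {
    // one CPU node and two mismatched GPUs (scores are made up)
    const std::vector<double> score = {1.0, 6.0, 3.0};
    for (int c : split_by_throughput(500, score)) std::printf("%d ", c);
    std::printf("\n"); // prints: 50 300 150
}
```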

I have written a sample load-balancing program which attempts to do exactly this. Once again, I do not have a test code to play around with; instead, I have written a simple C++ simulation which attempts to "guesstimate" the approximate performance savings and what the block distribution will look like. Be forewarned: I wrote this in an afternoon while attending a child's birthday party. The code can be found here.

Final Comments

I have a system which appears to perform quite well. Some researchers have subsequently discovered similar systems, which I feel validates the work I put into this. It will be interesting to see what developments are made going forward.

I prepared this result for presentation at my latest meeting with my Ph.D. committee; regrettably, nobody seemed curious enough to ask about it. I rather enjoy talking about the load-balancing problem (as my unfortunate colleagues can attest when I get on a roll). I will update this page if I present these findings at any conference; otherwise, feel free to reference this website and my Master's thesis.