After all compiler transformations were manually incorporated in our code, the performance of the parallelized version was compared on the SGI Challenge and the Alliant machines. The impact of architectural bottlenecks was more profound on the SGI multiprocessor (as shown in Section 6) than on the Alliants, mostly due to the cost of enforcing cache coherence (a problem that does not exist for the Alliants which use shared cache), and the limited bus bandwidth. Therefore, our experiments and solutions to the architectural bottlenecks focused on the SGI Challenge.
Table 2: Cost of Scheduling and Synchronization Operations on SGI (in sec)
In order to put the following experimental results in perspective, we first measured the overhead associated with primitive operations and bookkeeping functions on the SGI Challenge, such as the cost of issuing a parallel loop iteration, barrier synchronization, and setting up the threads used in parallel loop execution. We used SGI-supplied timers to make detailed performance measurements.
Table 2
shows the cost of creating a thread and the cost of re-using existing
threads. Each loop iteration is assigned to an empty thread at a cost of about 20 secs.
The cost of creating threads is extremely high as indicated in
Table 2, and
it approaches 0.5sec; however
this cost is paid once per thread and threads are re-cycled on subsequent parallel loops.
Nevertheless, the accumulated run-time overhead of thread creation and thread assignment to
loop iterations is high and can be detrimental to performance for applications with a small
number of loops and/or loops with small loop bodies.
Figure 5: Diagram of 8 processors
Barrier synchronization overhead was measured as a pair of numbers: the barrier entry and the exit. The former was computed as the time difference between the completion of the last and the first processor, and it includes skew due to load balancing; the latter was measured as the difference between the time the last iteration completed execution and the time the loop exited. Figure 5 shows the space-time diagram of the execution of a parallel loop with 8 iterations and a large loop body. The figure shows the start-up phase (with existing threads) and the completion phase; each processor was assigned one iteration of the parallel loop. The top horizontal bar indicates the time just before execution entered the loop to the time the loop completed execution. Although the right part of Figure 5 shows a noticeable unbalance, all processors completed within an interval of 5-7% of the total execution time of the loop.
Although SGI supports gang scheduling for parallel loops (all processors start on a loop
simultaneously), we observe a noticeable difference among processors in starting up execution.
This difference is due to the synchronization overhead associated with exclusive access
of the loop index. In the case of 8 processors, a total of approximately 10secs was
spent on locking and updating the loop index - or about 1
sec for each lock operation.
In approximate terms, the total overhead associated with each processor participating in the
execution of a parallel loop has a lower bound of 60secs, or equivalently, approximately
10,000 clock cycles.
Figure 6: Program for false sharing
Figure 7: Performance of false sharing
False sharing occurs when multiple processors access (and cache) a small section of memory; although memory accesses may be truly independent, they may be treated as accesses to shared data. Figure 6(a) shows a loop that gives rise to false sharing, like many more complicated loops in our CFD code. Figure 7(a) shows its time-space diagram. False sharing is enforced at the cache line level, not at the byte or word level. Thus, when a processor writes a byte of a cache line, the entire line is invalidated, not just the modified byte. We discuss two techniques to avoid false sharing and measure their effect on performance on sample codes.
The first solution to false sharing is the use of local variables wherever possible, as shown in Figure 6(b) - the loop has been manually rewritten in order to make explicit use of local variables. The execution time of the loop of Figure 6(b) for varying number of processors is shown in the second column of Figure 7(b), under the header ``local''.
The second approach is spreading the allocation of shared data in memory, so that they are separated by addresses which differ more than the cache line size; this improvement is illustrated in the code example of 6(c). The last three columns of Figure 7(b) show the execution time of the loop of Figure 6(c) for three different allocations of the shared array: ``D'' is the distance, or the number of array elements between variables accessed by successive loop iterations.
Notice that the worst performance is observed for D=1 which corresponds to the maximum amount of false sharing. Since the cache line size of the SGI Challenge is 128 bytes, we can eliminate false sharing by keeping 32 elements between the threads. The numbers shown in italics display poor performance because the shared array is not spread wide enough to avoid false sharing. It is worth noting that the execution times of the local variable version is better than those of the shared array version. This is due to the effect of local variables, whose values do not have to be written back to memory; we suspect registers are more effectively used for them. When local variables are not allowed or supported, we have to write back to memory all values thus reducing performance. This is implicitly suggested in [10]. Although we can avoid false sharing by either method, using local variables appears to be more effective, possibly due to the opportunity of extensive register reuse. None of the compilers tested was able to handle effectively the problem of false sharing.
Figure 8: Program to emulate bus contention
Clearly, locality of data is important not only for cache performance, but also for minimizing memory accesses and network traffic. On bus-based multiprocessors such as the SGI Challenge, bus contention can be a serious performance bottleneck. A few modules in our CFD code performed poorly due to a combination of false sharing and bus contention. Bus contention was measured by comparing the performance of two almost identical loop kernels which are shown in Figure 8. The first version of the loop needs only data in local cache (iadd=0), while for (iadd=1) the loop accesses data not present in the local caches.
Figure 9: The influence of bus contention on performance
When processors request a significant amount of data from shared memory or from remote caches, we can anticipate cache miss and bus contention. We measured the performance effects of failing data locality by changing the size of the array and the number of processors used to execute the kernel of Figure 8. The size of array (nmax) was taken to be 262,144 or, equivalently, 1MB. The execution time of the parallel loop is shown in Figure 9. The best case senario was a 1.7msec (4%) increase in execution time for 2 processors and the worst a 6.0msec (13%) for 8 processors. 1.7msec should be dominated mainly by the cache miss penalty and the additional 4.3msec (9%) mainly by the bus contention. False sharing did not play a role in this case since the data in shared memory were skewed appropriately in order to avoid false sharing. Although we attempted to separate the performance loss due to cache misses from that due to bus contention, it was not possible to measure events at the level of clock period with software timers.
At least when we use up to 8 processors, we do not have to be afraid of the effect of bus contention. Although the penalty of bus contention is not so large in this case, it can be safely expected to grow significantly for larger configuration (e.g. 30 procs on a maximum configuration of a Challenge).
The Alliants avoid network contention due to false sharing and coherence in general since neither is applicable. The performance on both of the Alliants is limited by raw bus bandwidth which can be saturated when large arrays are accessed and cached simultaneously by different processors.