ANU - -

Untitled Document

Efficient parallel search using the Neighbourhood Algorithm

P. Rickwood &amp M. Sambridge

Over the past year, we have undertaken development of the Neighbourhood Algorithm, to improve its performance as a distributed/parallel algorithm for geophysical inversion. This has involved re-formulating the Neighbourhood algorithm so that it is equivalent (in a probabilistic sense) to the existing formulation (described on the NA home page), but is capable of being more efficiently parallelised. A parallel implementation has been implemented which can be an order of magnitude faster than the currently available parallel implementation.

The existing (old) parallel implementation of the NA used simple MPI capabilities to achieve a limited form of parallelism. However, this parallelism has two important limitations:

Barrier synchronisation overhead

In a naive parallel implementation of the NA, the objective function (i.e. the forward problem) is solved (with different parameter values) on each computer and the results are collected. If we wait for all computers to complete their computation, we can waste valuable clock cycles, as all computers must wait for the slowest computer to finish. If we take a staged running race as our analogy, the naive implementation requires that every runner wait, at every stage for all other runners to catch up. We can eliminate this waste entirely by removing the need to synchronise at barriers.

Figure 1: Barrier synchronisation overhead.
If barrier synchronisation is used, the performance of the algorithm as a whole is determined by the slowest participant. Everyone else has to wait (doing nothing) until the slowest participant finishes.
Book-keeping overhead
The existing (old) parallel implementation of the Neighbourhood Algorithm only distributed the task of evaluating the objective function. The other elements of the algorithm (calculating the Voronoi cell structure and generating new models) were performed on every computer -- a large waste of computational resources. The new implementation distributes these other elements as well as the objective function evaluation.

The re-worked implementation of the NA takes advantage of more sophisticated MPI capabilities (asynchronous non-blocking send-receive), that, together with minor modifications to the algorithm itself (that preserve the essential nature of the algorithm), allow for a vastly more efficient implementation. Barrier synchronisation overhead is eliminated entirely, and book-keeping is distributed across all computers involved in the computation. Figure 2 demonstrates the speed-up achieved by the new implementation of the NA algorithm.

Figure 2: Performance -- old versus new
Timing results on a 128-node beowulf cluster for the receiver function inversion problem described in Shibutani et al., illustrating the sort of dramatic performance increase the new algorithm can produce.

The improvement in efficiency is most appreciable for problems with objective functions that are relatively quick (&lt 1 second) to evaluate, and for problems where there is a large degree of variability in the time taken to evaluate the objective function ¹.

Other work and improvements

In addition to being more efficient, the new implementation of the Neighbourhood Algorithm is also more robust, as the re-formulation of the algorithm allows an implementation that is resistant to the failure of any individual node. In other words, the parallel implementation will continue to work even in the event that nodes fail mid-computation. As clusters become larger (and cheaper), and parallel computing becomes the norm, this feature will become especially important.

The Neighbourhood algorithm has also been re-implemented in the Java programming language, to take advantage of Java's cross-platform portability. This means that the Neighbourhood Algorithm can now be distributed in pre-compiled form, and run on any platform without the need for recompilation. In addition, the Java implementation of the Neighbourhood algorithm can now be easily distributed with the CADI Toolkit, and benefit from the point-and-click interface and other plotting and visualisation capabilities of the CADI toolkit.

We plan to make both Fortran and Java implementations of the parallel Neighbourhood Algorithm available from the NA home page. Future development may involve some re-engineering of the second stage of the Neighbourhood Algorithm (the appraisal stage) for more efficient parallel execution.

¹ This may happen in varying circumstances. In some cases, the time taken to evaluate the objective function is parameter dependent. In other cases, the computers performing the computation are not equivalent (in terms of speed), or have varying loads.

² Details of which can be found in:
Shibutani, T., Sambridge, M. &amp Kennett, B., 1996. Genetic algorithm inversion for receiver functions with application to crust and uppermost mantle structure beneath Eastern Australia, Geophys. Res. Lett., 23 (14), 1829-1832.