Improving the Performance of a Landscape Analysis Package through Parallel Processing
Problem Description
Ecologists use landscape metrics to compare landscapes, to evaluate temporal changes within landscapes, and to predict the effects of various disturbance processes. Programs that analyze large datasets in an efficient manner are needed. Our objective was to find a computationally efficient method for calculating landscape metrics at thousands of randomly selected forested sites across the continental US.
Currently there are two software packages available that were designed to calculate landscape metrics - FRAGSTATS (McGarigal and Marks) and APACK (Mladenoff and DeZonia). Both are standalone
programs that run under the Windows 95/98/NT/2000 OS, but neither is designed to process very large datasets. APACK, however, was reported to be able to process larger landscapes significantly faster than FRAGSTATS. For this reason, we contacted the developers to inquire about the possibility of parallelizing their code. They not only welcomed our inquiry, but their code turned out to be quite amenable to parallelization.
APACK is an analysis package written in C++. From an input file, which can be either an ERDAS GIS file or in ASCII text format, it calculates up to 25 different landscape metrics. These include basic measures such as area, shape measures such as fractal dimension, information theoretic measures such as diversity, textural measures such as lacunarity, probabilistic measures such as electivity, and structural measures related to connectivity. APACK's output consists of ASCII files, detailing the metrics collected on the entire input map and by individual input classes(Mladenoff and DeZonia).
Most computer programs are designed to run on one processor. Even when two or more processors are available, programs tend to use only one processor at a time. The computer programmer must explicitly instruct the program when it should use more than one processor, and how many processors to use. There are many advantages to using symmetric multiple processors (SMP), but there are also disadvantages. A well written SMP program can cut wall-time by nearly 50%, but a poorly written one could increase wall-time by 100% or more.
There are multiple techniques for parallelizing programs. The designer can choose from fine or coarse grained levels of parallelization, static threads or dynamic threads, and a myriad of other choices. The problem is that each choice has its own strength and weakness. The designer should consider the outcome of almost every scenario. This requires that the author is closely familiar with the task to be performed and an understanding of how different variables affect the program's runtime execution. The main goal of parallel programming is to achieve the highest percentage of execution time possible using all available CPU's. On some SMP machines this number is 2, on others it could be 64.
Project Methodology
Two parallelized designs were implemented and tested. The first method uses static threads (the number of threads were determined at compile time) and does not consider scheduling. The second method uses dynamic threads (determined at run time) and intelligent scheduling (longest calculations first). The latter method enables the number of processors used to be decided at runtime, allowing the user to optimize the number of threads based on the analysis to be performed. After performing a runtime analysis of APACK, it was determined that relatively few functions account for a very high percentage of APACK's wall-clock time. Scheduling these functions to run on the first available threads, and pushing the shorter functions onto the last available threads makes much more efficient use of available CPU's.
Testing
Goal
To compare single processor versus dual processor implementations of APACK under different conditions. Before implementation of our parallel design, a series of tests were designed. This sequence was constructed in a manner to reveal patterns in our implementation's performance when compared to the performance of APACK on a single processor.
Specifications
All tests were performed on a single machine with dual 2000+ AMD processors and 2700 DDR Ram, powered by SuSe Linux version 8.2 with a custom compiled kernel (v.2.4.21 release candidate 7). All executables were compiled with Intel's C++ compiler version 7.0 for Linux, without optimization flag. The parallel implementations were compiled with the openMP flag set to true. For each test, APACK was run with the -a flag, which instructs APACK to calculate all statistics on the input data.
Results
The Effects of Increased Size
For our initial tests, input imagery was varied by number of pixels. GNU time was used to measure 'wall clock' or 'elapsed' time of program execution. Improved elapsed time is the true measure of increased performance when parallelizing a program.
The figure above shows that in all cases, elapsed time for a single processor was longer than the parallelized versions and dynamic threads tended to be faster than static, particularly as the pixel count increased. As the image size increases, the overhead caused by the creation of dynamic threads becomes less significant when compared to the overall runtime. This test was repeated multiple times to be sure there were no systematic errors caused by the usage of wall-clock time instead of CPU time.
An important measure of increased performance is based on Amdahl's Law and the law of diminishing returns. Amdahl's law says that the amount of relative performance increase possible is limited by the amount of time the area improved is used. We can adapt Amdahl's law to find our own limiting factor for increased performance with the following equation:
The interesting principal of this equation is its behavior as the size of the image increases. Time spent on nonparallel code increases linearly, whereas the time spent in parallel code is a more rapidly increasing function. This means that the larger the image becomes, the more insignificant the amount of time spent in nonparallel code, allowing us to approach perfect performance:
Effects of Increased Complexity by Number of Classes
For this test, image complexity across a consistent region was varied by adjusting class representation. Three images from the same area and extent (4096x4096) were used in this series of tests. These images were created by a series of hierarchical merging of classes.
The first image contains all classes possible in the National Land Cover Dataset (NLCD), 19 of which were present at this extent. The second image was produced by merging the first image into multiple classes based on lifeform; 9 different classes were actually present in this extent. In the final image, only two classes were present, forested and nonforested regions. Complexity of each image is quantified by the number of contiguous regions, or 'patches', reported by APACK. For this test, only single processor and dual processor with dynamic threads are reported.
The results presented in the graph above indicate that the causes for increased runtime on a single processor may not affect the relative increase in performance when run on two processors. As with the previous tests, this test demonstrates that as the task gets longer on a single processor, the relative performance increase moves toward 50%.
Effects of Increased Complexity by Patch Count
For this test, image complexity was varied by number of patches present in each image. A series of progressively simplified test images were generated by the continued merging of a neutral landscape.
A neutral image with class proportions similar to the most complex image in the previous test, but with random class distribution was generated. An image segmentation algorithm eliminated single pixels (minimum mapping unit of 2 pixels) to produce the most complex image for this test. The 19 additional test images were created using the same algorithm with a progressively larger minimum map unit. The nature of this merging algorithm is such that the dominating class (the mosaic) emerges but sparse classes do not disappear.
The results presented in the graph above are consistent with those presented in the previous tests. This indicates that the correlation between the processing times is independent of complexity and size, and instead is a function of elapsed time. The fact that the relative increase in speed approaches 50% as elapsed time gets longer, again suggests that the following equation can be applied to our implementation
Conclusions
By using a parallelization method consisting of OpenMP dynamic threads and intelligent task scheduling, we were able to produce an SMP version of APACK that operates more efficiently on multiple processor machines than nonparallelized versions run on a single processor. Tests have indicated that this speedup of elapsed time is independent of the content or size of the input map, but is a function of the elapsed time on a single processor. In all observed cases, the percent speedup as a function of elapsed time indicates an outer limit approaching 50% from below when run on a dual processor machine.
Our parallel implementation is designed such that the inclusion of up to four processors is expected to provide a proportional increase in performance. After four processors the percentage of increase for each added processor will most likely be negligible or negative due to the structure of our intelligent scheduling algorithm and the number of large functions in APACK.
In conclusion, when using large or complex input maps, APACK's performance is notably improved. The same speedup is not seen for small or trivial inputs. The percent speedup is also dependent on the calculations to be performed. If more than one APACK function is calculated, speedup will occur; if only one function is calculated, no speedup will occur. It is not known to what degree each APACK function affects runtime.
Future Work
As noted earlier, a large percentage of APACK's runtime is spent on three functions: lacunarity, connectivity between patch centroids and connectivity between circular patches. It is reported that each of these functions run in an amount of time proportional to the square of the size of the input map, while all other functions execute in a time proportional to the size of the map itself (Mladenoff and DeZonia). We propose a study to determine how each function is a affected by complexity and how that effects our implementation's speedup when running only a few calculations..
It can be shown that if any of the three large functions listed above are run with any other functions, our current implementation will show speedup. We propose that once the effects of complexity on each function is fully understood this will lead to a new implementation that internalizes parallelization to the larger functions. This will help increase speedup when only one function is used. The program would decide at runtime which parallelization method to use based on the complexity of the input map and the specific calculations that need to be made.
References
David Mladenoff. and Barry Dezonia, APACK 2.17 User's Guide; Electronic Publication (http://landscape.forest.wisc.edu/projects/APACK/apack.html, 4- 03-01).
Kevin McGarigal and Barbara Marks, FRAGSTATS User Guidelines, Version 3; Electronic Publication (http://www.umass.edu/landeco/research/fragstats/documents, 10-01-03)
Funding
This research was funded by The National Science Foundation EPSCOR program (grant #_____) and the Environmental Protection Agency's STAR program (grant #R827673-01-0). Special thanks to Barry DeZonia and David Mladenoff of the Forest Landscape Ecology Lab, University of Wisconsin, Madison, for their cooperation and assistance with the APACK source code.
| Principal Investigators | Roland L. Redmond and Don Morton |
| Senior Image Analyst | J. Chris Winne |
| Chief Research Programmers | Shane C. Mason and Michael Morgan |
|