27-5 Multithreading a simple stencil calculation

Computational science is replete with algorithms that require the entries of an array to be filled in with values that depend on the values of certain already computed neighboring entries, along with other information that does not change over the course of the computation. The pattern of neighboring entries does not change during the computation and is called a stencil. For example, Section 15.4 presents a stencil algorithm to compute a longest common subsequence, where the value in entry c[i,j]c[i, j] depends only on the values in c[i−1,j]c[i - 1, j], c[i,j−1]c[i, j - 1], and c[i−1,j−1]c[i - 1, j - 1], as well as the elements xix_i and yjy_j within the two sequences given as inputs. The input sequences are fixed, but the algorithm fills in the two-dimensional array cc so that it computes entry c[i,j]c[i, j] after computing all three entries c[i−1,j]c[i - 1, j], c[i,j−1]c[i, j - 1], and c[i−1,j−1]c[i - 1, j - 1].

In this problem, we examine how to use nested parallelism to multithread a simple stencil calculation on an n×nn \times n array AA in which, of the values in AA, the value placed into entry A[i,j]A[i, j] depends only on values in A[i′,j′]A[i' , j'], where i′≤ii' \le i and j′≤jj' \le j (and of course, i′≠ii' \ne i or j′≠jj' \ne j). In other words, the value in an entry depends only on values in entries that are above it and/or to its left, along with static information outside of the array. Furthermore, we assume throughout this problem that once we have filled in the entries upon which A[i,j]A[i, j] depends, we can fill in A[i,j]A[i, j] in Θ(1)\Theta(1) time (as in the LCS-LENGTH\text{LCS-LENGTH} procedure of Section 15.4).

We can partition the n×nn \times n array AA into four n/2×n/2n / 2 \times n / 2 subarrays as follows:

A=(A11A12A21A22).(27.11) A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \tag{27.11} \end{pmatrix} .

Observe now that we can fill in subarray A11A_{11} recursively, since it does not depend on the entries of the other three subarrays. Once A11A_{11} is complete, we can continue to fill in A12A_{12} and A21A_{21} recursively in parallel, because although they both depend on A11A_{11} , they do not depend on each other. Finally, we can fill in A22A_{22} recursively.

a. Give multithreaded pseudocode that performs this simple stencil calculation using a divide-and-conquer algorithm SIMPLE-STENCIL\text{SIMPLE-STENCIL} based on the decomposition (27.11)\text{(27.11)} and the discussion above. (Don't worry about the details of the base case, which depends on the specific stencil.) Give and solve recurrences for the work and span of this algorithm in terms of nn. What is the parallelism?

b. Modify your solution to part (a) to divide an n×nn \times n array into nine n/3×n/3n / 3 \times n / 3 subarrays, again recursing with as much parallelism as possible. Analyze this algorithm. How much more or less parallelism does this algorithm have compared with the algorithm from part (a)?

c. Generalize your solutions to parts (a) and (b) as follows. Choose an integer bâ‰Ĩ2b \ge 2. Divide an n×nn \times n array into b2b^2 subarrays, each of size n/b×n/bn / b \times n / b, recursing with as much parallelism as possible. In terms of nn and bb, what are the work, span, and parallelism of your algorithm? Argue that, using this approach, the parallelism must be o(n)o(n) for any choice of bâ‰Ĩ2b \ge 2. (Hint:\textit{Hint:} For this last argument, show that the exponent of nn in the parallelism is strictly less than 11 for any choice of bâ‰Ĩ2b \ge 2.)

d. Give pseudocode for a multithreaded algorithm for this simple stencil calculation that achieves Θ(nlg⁥n)\Theta(n\lg n) parallelism. Argue using notions of work and span that the problem, in fact, has Θ(n)\Theta(n) inherent parallelism. As it turns out, the divide-and-conquer nature of our multithreaded pseudocode does not let us achieve this maximal parallelism.

(Removed)