27-4 Multithreading reductions and prefix computations

A โŠ—\otimes-reduction of an array x[1โ€ฆn]x[1 \ldots n], where โŠ—\otimes is an associative operator, is the value

y=x[1]โŠ—x[2]โŠ—โ‹ฏโŠ—x[n].y = x[1] \otimes x[2] \otimes \cdots \otimes x[n].

The following procedure computes the โŠ—\otimes-reduction of a subarray x[iโ€ฆj]x[i \ldots j] serially.

cpp REDUCE(x, i, j) y = x[i] for k = i + 1 to j y = y โŠ— x[k] return y

a. Use nested parallelism to implement a multithreaded algorithm P-REDUCE\text{P-REDUCE}, which performs the same function with ฮ˜(n)\Theta(n) work and ฮ˜(lgโกn)\Theta(\lg n) span. Analyze your algorithm.

A related problem is that of computing a โŠ—\otimes-prefix computation, sometimes called a โŠ—\otimes-scan, on an array x[1โ€ฆn]x[1 \ldots n], where โŠ—\otimes is once again an associative operator. The โŠ—\otimes-scan produces the array y[1โ€ฆn]y[1 \ldots n] given by

y[1]=x[1],y[2]=x[1]โŠ—x[2],y[3]=x[1]โŠ—x[2]โŠ—x[3],โ‹ฎy[n]=x[1]โŠ—x[2]โŠ—x[3]โŠ—โ‹ฏโŠ—x[n], \begin{aligned} y[1] & = x[1], \\ y[2] & = x[1] \otimes x[2], \\ y[3] & = x[1] \otimes x[2] \otimes x[3], \\ & \vdots \\ y[n] & = x[1] \otimes x[2] \otimes x[3] \otimes \cdots \otimes x[n], \end{aligned}

that is, all prefixes of the array xx "summed" using โŠ—\otimes operator. The following serial procedure SCAN\text{SCAN} performs a โŠ—\otimes-prefix computation:

cpp SCAN(x) n = x.length let y[1..n] be a new array y[1] = x[1] for i = 2 to n y[i] = y[i - 1] โŠ— x[i] return y

Unfortunately, multithreading SCAN\text{SCAN} is not straightforward. For example, changing the for loop to a parallel for loop would create races, since each iteration of the loop body depends on the previous iteration. The following procedure P-SCAN-1\text{P-SCAN-1} performs the โŠ—\otimes-prefix computation in parallel, albeit inefficiently.

cpp P-SCAN-1(x) n = x.length let y[1..n] be a new array P-SCAN-1-AUX(x, y, 1, n) return y

cpp P-SCAN-1-AUX(x, y, i, j) parallel for l = i to j y[l] = P-REDUCE(x, 1, l)

b. Analyze the work, span, and parallelism of P-SCAN-1\text{P-SCAN-1}.

By using nested parallelism, we can obtain a more efficient โŠ—\otimes-prefix computation:

cpp P-SCAN-2(x) n = x.length let y[1..n] be a new array P-SCAN-2-AUX(x, y, 1, n) return y

cpp P-SCAN-2-AUX(x, y, i, j) if i == j y[i] = x[i] else k = floor((i + j) / 2) spawn P-SCAN-2-AUX(x, y, i, k) P-SCAN-2-AUX(x, y, k + 1, j) sync parallel for l = k + 1 to j y[l] = y[k] โŠ— y[l]

c. Argue that P-SCAN-2\text{P-SCAN-2} is correct, and analyze its work, span, and parallelism.

We can improve on both P-SCAN-1\text{P-SCAN-1} and P-SCAN-2\text{P-SCAN-2} by performing the โŠ—\otimes-prefix computation in two distinct passes over the data. On the first pass, we gather the terms for various contiguous subarrays of xx into a temporary array tt, and on the second pass we use the terms in tt to compute the final result yy. The following pseudocode implements this strategy, but certain expressions have been omitted:

cpp P-SCAN-3(x) n = x.length let y[1..n] and t[1..n] be new arrays y[1] = x[1] if n > 1 P-SCAN-UP(x, t, 2, n) P-SCAN-DOWN(x[1], x, t, y, 2, n) return y

cpp P-SCAN-UP(x, t, i, j) if i == j return x[i] else k = floor((i + j) / 2) t[k] = spawn P-SCAN-UP(x, t, i, k) right = P-SCAN-UP(x, t, k + 1, j) sync return ____ // fill in the blank

cpp P-SCAN-DOWN(v, x, t, y, i, j) if i == j y[i] = v โŠ— x[i] else k = floor((i + j) / 2) spawn P-SCAN-DOWN(____, x, t, y, i, k) // fill in the blank P-SCAN-DOWN(____, x, t, y, k + 1, j) // fill in the blank sync

d. Fill in the three missing expressions in line 8 of P-SCAN-UP\text{P-SCAN-UP} and lines 5 and 6 of P-SCAN-DOWN\text{P-SCAN-DOWN}. Argue that with expressions you supplied, P-SCAN-3\text{P-SCAN-3} is correct. (Hint:\textit{Hint:} Prove that the value vv passed to P-SCAN-DOWN(v,x,t,y,i,j)\text{P-SCAN-DOWN}(v, x, t, y, i, j) satisfies v=x[1]โŠ—x[2]โŠ—โ‹ฏโŠ—x[iโˆ’1]v = x[1] \otimes x[2] \otimes \cdots \otimes x[i - 1].)

e. Analyze the work, span, and parallelism of P-SCAN-3\text{P-SCAN-3}.

(Removed)