27-4 Multithreading reductions and prefix computations

A $\otimes$-reduction of an array $x[1 \ldots n]$, where $\otimes$ is an associative operator, is the value

$$y = x[1] \otimes x[2] \otimes \cdots \otimes x[n].$$

The following procedure computes the $\otimes$-reduction of a subarray $x[i \ldots j]$ serially.

cpp REDUCE(x, i, j) y = x[i] for k = i + 1 to j y = y ⊗ x[k] return y

a. Use nested parallelism to implement a multithreaded algorithm $\text{P-REDUCE}$, which performs the same function with $\Theta(n)$ work and $\Theta(\lg n)$ span. Analyze your algorithm.

A related problem is that of computing a $\otimes$-prefix computation, sometimes called a $\otimes$-scan, on an array $x[1 \ldots n]$, where $\otimes$ is once again an associative operator. The $\otimes$-scan produces the array $y[1 \ldots n]$ given by

$$ \begin{aligned} y[1] & = x[1], \\ y[2] & = x[1] \otimes x[2], \\ y[3] & = x[1] \otimes x[2] \otimes x[3], \\ & \vdots \\ y[n] & = x[1] \otimes x[2] \otimes x[3] \otimes \cdots \otimes x[n], \end{aligned} $$

that is, all prefixes of the array $x$ "summed" using $\otimes$ operator. The following serial procedure $\text{SCAN}$ performs a $\otimes$-prefix computation:

cpp SCAN(x) n = x.length let y[1..n] be a new array y[1] = x[1] for i = 2 to n y[i] = y[i - 1] ⊗ x[i] return y

Unfortunately, multithreading $\text{SCAN}$ is not straightforward. For example, changing the for loop to a parallel for loop would create races, since each iteration of the loop body depends on the previous iteration. The following procedure $\text{P-SCAN-1}$ performs the $\otimes$-prefix computation in parallel, albeit inefficiently.

cpp P-SCAN-1(x) n = x.length let y[1..n] be a new array P-SCAN-1-AUX(x, y, 1, n) return y

cpp P-SCAN-1-AUX(x, y, i, j) parallel for l = i to j y[l] = P-REDUCE(x, 1, l)

b. Analyze the work, span, and parallelism of $\text{P-SCAN-1}$.

By using nested parallelism, we can obtain a more efficient $\otimes$-prefix computation:

cpp P-SCAN-2(x) n = x.length let y[1..n] be a new array P-SCAN-2-AUX(x, y, 1, n) return y

cpp P-SCAN-2-AUX(x, y, i, j) if i == j y[i] = x[i] else k = floor((i + j) / 2) spawn P-SCAN-2-AUX(x, y, i, k) P-SCAN-2-AUX(x, y, k + 1, j) sync parallel for l = k + 1 to j y[l] = y[k] ⊗ y[l]

c. Argue that $\text{P-SCAN-2}$ is correct, and analyze its work, span, and parallelism.

We can improve on both $\text{P-SCAN-1}$ and $\text{P-SCAN-2}$ by performing the $\otimes$-prefix computation in two distinct passes over the data. On the first pass, we gather the terms for various contiguous subarrays of $x$ into a temporary array $t$, and on the second pass we use the terms in $t$ to compute the final result $y$. The following pseudocode implements this strategy, but certain expressions have been omitted:

cpp P-SCAN-3(x) n = x.length let y[1..n] and t[1..n] be new arrays y[1] = x[1] if n > 1 P-SCAN-UP(x, t, 2, n) P-SCAN-DOWN(x[1], x, t, y, 2, n) return y

cpp P-SCAN-UP(x, t, i, j) if i == j return x[i] else k = floor((i + j) / 2) t[k] = spawn P-SCAN-UP(x, t, i, k) right = P-SCAN-UP(x, t, k + 1, j) sync return ____ // fill in the blank

cpp P-SCAN-DOWN(v, x, t, y, i, j) if i == j y[i] = v ⊗ x[i] else k = floor((i + j) / 2) spawn P-SCAN-DOWN(____, x, t, y, i, k) // fill in the blank P-SCAN-DOWN(____, x, t, y, k + 1, j) // fill in the blank sync

d. Fill in the three missing expressions in line 8 of $\text{P-SCAN-UP}$ and lines 5 and 6 of $\text{P-SCAN-DOWN}$. Argue that with expressions you supplied, $\text{P-SCAN-3}$ is correct. ($\textit{Hint:}$ Prove that the value $v$ passed to $\text{P-SCAN-DOWN}(v, x, t, y, i, j)$ satisfies $v = x[1] \otimes x[2] \otimes \cdots \otimes x[i - 1]$.)

e. Analyze the work, span, and parallelism of $\text{P-SCAN-3}$.

(Removed)