Algorithm Explanation
Checkerboard
Most parallel matrix multiplication functions use a checkerboard distribution of the matrices. This means that the processes are viewed as a grid, and, rather than assigning entire rows or entire columns to each process, we assign small sub-matrices. For example, if we have four processes, we might assign the element of a 4x4 matrix as shown below, checkerboard mapping of a 4x4 matrix to four processes.
Fox Algorithm
Fox‘s algorithm is a one that distributes the matrix using a checkerboard scheme like the above.
In order to simplify the discussion, lets assume that the matrices have order \(n\), and the number of processes, \(p\), equals \(n^2\). Then a checkerboard mapping assigns \(a_{ij}\), \(b_{ij}\) , and \(c_{ij}\) to process (\(i\), \(j\)).
Fox‘s algorithm takes \(n\) stages for matrices of order n one stage for each term aik bkj in the dot product
\(C_{i,j} = a_{i,0}b_{0,j} + a_{i,1}b_{1,i} +. . . +a_{i,n−1}b_{n−1,j}\)
Initial stage, each process multiplies the diagonal entry of A in its process row by its element of B:
Stage \(0\) on process\((i, j)\): \(c_{i,j} = a_{i,i} b_{i,j}\)
Next stage, each process multiplies the element immediately to the right of the diagonal of A by the element of B directly beneath its own element of B:
Stage \(1\) on process\((i, j)\): \(c_{i,j} = a_{i,i+1} b_{i+1,j}\)
In general, during the kth stage, each process multiplies the element k columns to the right of the diagonal of A by the element \(k\) rows below its own element of B:
Stage \(k\) on process\((i, j)\): \(c_{i,j} = a_{i,i+k} b_{i+k,j}\)
Example 2x2 Fox's Algorithm
Consider multiplying 2x2 block matrices \(A\), \(B\), \(C\):
\[ \begin{aligned} A \times B &= C \\ \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} &= \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix} \end{aligned} \]
Stage 0:
Process (i, i mod 2) | Broadcast along row i |
---|---|
(0,0) | \(a_{00}\) |
(1,1) | \(a_{11}\) |
\[ \begin{bmatrix} a_{00}, b_{00} & a_{00}, b_{01}\\ a_{11}, b_{10} & a_{11}, b_{11}\\ \end{bmatrix} \]
Process (\(i\), \(j\)) computes:
\[ \begin{aligned} C_{\text{Stage 0}} = \begin{bmatrix} 1 \times 1 & 1 \times 1 \\ 1 \times 1 & 1 \times 1 \end{bmatrix} &= \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \end{aligned} \]
Shift-rotate on the columns of \(B\)
Stage 1:
Process (\(i\), (\(i\) + 1) mod 2) | Broadcast along row \(i\) |
---|---|
(0,1) | \(a_{01}\) |
(1,0) | \(a_{10}\) |
\[ \begin{matrix} a_{01}, b_{10} & a_{01}, b_{11}\\ a_{10}, b_{00} & a_{10}, b_{01}\\ \end{matrix} \]
Process (\(i\), \(j\)) computes:
\[ \begin{aligned} C_{\text{Stage 1}} = \begin{bmatrix} 1+1 \times 1 & 1+1 \times 1 \\ 1+1 \times 1 & 1+1 \times 1 \end{bmatrix} &= \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix} \end{aligned} \]