Algorithm Explanation
Checkerboard
Most parallel matrix multiplication functions use a checkerboard distribution of the matrices. This means that the processes are viewed as a grid, and, rather than assigning entire rows or entire columns to each process, we assign small sub-matrices. For example, if we have four processes, we might assign the element of a 4x4 matrix as shown below, checkerboard mapping of a 4x4 matrix to four processes.
Fox Algorithm
Fox‘s algorithm is a one that distributes the matrix using a checkerboard scheme like the above.
In order to simplify the discussion, lets assume that the matrices have order n, and the number of processes, p, equals n2. Then a checkerboard mapping assigns aij, bij , and cij to process (i, j).
Fox‘s algorithm takes n stages for matrices of order n one stage for each term aik bkj in the dot product
Ci,j=ai,0b0,j+ai,1b1,i+...+ai,n−1bn−1,j
Initial stage, each process multiplies the diagonal entry of A in its process row by its element of B:
Stage 0 on process(i,j): ci,j=ai,ibi,j
Next stage, each process multiplies the element immediately to the right of the diagonal of A by the element of B directly beneath its own element of B:
Stage 1 on process(i,j): ci,j=ai,i+1bi+1,j
In general, during the kth stage, each process multiplies the element k columns to the right of the diagonal of A by the element k rows below its own element of B:
Stage k on process(i,j): ci,j=ai,i+kbi+k,j
Example 2x2 Fox's Algorithm
Consider multiplying 2x2 block matrices A, B, C:
A×B=C[1111]×[1111]=[2222]
Stage 0:
Process (i, i mod 2) | Broadcast along row i |
---|---|
(0,0) | a00 |
(1,1) | a11 |
[a00,b00a00,b01a11,b10a11,b11]
Process (i, j) computes:
CStage 0=[1×11×11×11×1]=[1111]
Shift-rotate on the columns of B
Stage 1:
Process (i, (i + 1) mod 2) | Broadcast along row i |
---|---|
(0,1) | a01 |
(1,0) | a10 |
a01,b10a01,b11a10,b00a10,b01
Process (i, j) computes:
CStage 1=[1+1×11+1×11+1×11+1×1]=[2222]