Processing math: 100%

Algorithm Explanation

Checkerboard

Most parallel matrix multiplication functions use a checkerboard distribution of the matrices. This means that the processes are viewed as a grid, and, rather than assigning entire rows or entire columns to each process, we assign small sub-matrices. For example, if we have four processes, we might assign the element of a 4x4 matrix as shown below, checkerboard mapping of a 4x4 matrix to four processes.

Fox Algorithm

Fox‘s algorithm is a one that distributes the matrix using a checkerboard scheme like the above.

In order to simplify the discussion, lets assume that the matrices have order n, and the number of processes, p, equals n2. Then a checkerboard mapping assigns aij, bij , and cij to process (i, j).

Fox‘s algorithm takes n stages for matrices of order n one stage for each term aik bkj in the dot product

Ci,j=ai,0b0,j+ai,1b1,i+...+ai,n1bn1,j

Initial stage, each process multiplies the diagonal entry of A in its process row by its element of B:

Stage 0 on process(i,j): ci,j=ai,ibi,j

Next stage, each process multiplies the element immediately to the right of the diagonal of A by the element of B directly beneath its own element of B:

Stage 1 on process(i,j): ci,j=ai,i+1bi+1,j

In general, during the kth stage, each process multiplies the element k columns to the right of the diagonal of A by the element k rows below its own element of B:

Stage k on process(i,j): ci,j=ai,i+kbi+k,j

Example 2x2 Fox's Algorithm

Consider multiplying 2x2 block matrices A, B, C:

A×B=C[1111]×[1111]=[2222]

Stage 0:

Process (i, i mod 2)Broadcast along row i
(0,0)a00
(1,1)a11

[a00,b00a00,b01a11,b10a11,b11]

Process (i, j) computes:

CStage 0=[1×11×11×11×1]=[1111]

Shift-rotate on the columns of B

Stage 1:

Process (i, (i + 1) mod 2)Broadcast along row i
(0,1)a01
(1,0)a10

a01,b10a01,b11a10,b00a10,b01

Process (i, j) computes:

CStage 1=[1+1×11+1×11+1×11+1×1]=[2222]