Maybe one of the most common operation that a computer does as of today is matrix multiplication. The traditional mathematic way a person will multiply a matrix by hand is to compute an element by scanning the rows of the first matrix and the columns of the seconds.
+- -+ +- -+
| A0 B0 C0 | | X0 Y0 Z0 |
| A1 B1 C1 | | X1 Y1 Z1 |
| A2 B2 C2 | | X2 Y2 Z2 |
+- -+ +- -+
+- -+
| A0*X0 + B0 * X1 + C0 * X2 ... |
| A1*X0 + B1 * X1 + C1 * X2 ... |
| A2*X0 + B2 * X1 + C2 * X2 ... |
+- -+
We can compute the N by N resultant matrix using 3 for loops. The outer loop iterates over N-rows, the middle loop iterates over N-columns, and the inner loop sums N-products to produce the resultant matrix.
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < N; k++) {
M[i][j] += A[i][k] * B[k][j];
}
}
}
Assuming that every element of the resultant matrix is unique and therefore independently computed, the time complexity of the algorithm is O(N^3).
Lets also assume the matrix is stored contiguously in memory and in row-major order.
+-----+ +-----+
| A0 | | X0 |
+-----+ +-----+
| B0 | | Y0 |
+-----+ +-----+
| C0 | | Z0 |
+-----+ +-----+
| A1 | | X1 |
+-----+ +-----+
| B1 | | Y1 |
+-----+ +-----+
| C1 | | Z1 |
+-----+ +-----+
| A2 | | X2 |
+-----+ +-----+
| B2 | | Y2 |
+-----+ +-----+
| C2 | | Z2 |
+-----+ +-----+
Lets assume that little computer ( ͡° ͜ʖ ͡°) that we are working with has a single level of cache where the cache can only hold 3 cache lines. Lets additionally assume that a cache line for us is 3 words of memory.
Assuming both matricies start at the beginning of a cache line, we can visualize them in the following way.
+----+----+----+ +----+----+----+
| A0 | B0 | C0 | | X0 | Y0 | Z0 |
+----+----+----+ +----+----+----+
+----+----+----+ +----+----+----+
| A1 | B1 | C1 | | X1 | Y1 | Z1 |
+----+----+----+ +----+----+----+
+----+----+----+ +----+----+----+
| A2 | B2 | C2 | | X2 | Y2 | Z2 |
+----+----+----+ +----+----+----+
We will represent the resultant matrix in a similar manner
+-----+-----+-----+
| M00 | M01 | M02 |
+-----+-----+-----+
+-----+-----+-----+
| M10 | M11 | M12 |
+-----+-----+-----+
+-----+-----+-----+
| M20 | M21 | M22 |
+-----+-----+-----+
Notice that during the execution of the matrix multiplication based on the code above, since ABC is being scanned in a row-major order, a single cache line is being used to compute a entries of the resultant matrix. However, at the same time, for every element ABC in the cache line, it must be multiplied with an element from XYZ in 3 different cache lines. Since our little computer's cache can only hold 3 cache lines at a time, every multiply will incur a cache miss.
Cache Content: Empty
Operation: M[0,0] += A0 * X0
Cache Content: +----+----+----+
| A0 | B0 | C0 | Miss
+----+----+----+
+-----+-----+-----+
| M00 | M01 | M02 | Miss
+-----+-----+-----+
+----+----+----+
| X0 | Y0 | Z0 | Miss
+----+----+----+
Operation: M[0,0] += B0 * X1
Cache Content: +----+----+----+
| A0 | B0 | C0 | Hit
+----+----+----+
+-----+-----+-----+
| M00 | M01 | M02 | Hit
+-----+-----+-----+
+----+----+----+
| X1 | Y1 | Z1 | Miss
+----+----+----+
Operation: M[0,0] += C0 * X2
Cache Content: +----+----+----+
| A0 | B0 | C0 | Hit
+----+----+----+
+-----+-----+-----+
| M00 | M01 | M02 | Hit
+-----+-----+-----+
+----+----+----+
| X2 | Y2 | Z2 | Miss
+----+----+----+
For computing a single entry in the resultant matrix, we had to access memory 5 times (3 + 1 + 1), which is a total of 4 cache misses.
Lets assume that a cache hit costs 1 time unit or tu, and a cache miss that results in a memory acees costs 200 tu. (This is based on the "Latency Numbers Every Programmer Should Know")
The total time cost of our current algorithm will is 600 + 201 + 201 or 1002 tu for a single entry (or 3 iterations) in the resultant matrix.
To make the matrix multiplication algorithm more efficient, we change the order of the loops.
for (int k = 0; k < N; k++) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
M[i][j] += A[i][k] * B[k][j];
}
}
}
Notice that the layout of the matrix is still the same, the only difference is that the inner loop is now moved
Cache Content: Empty
Operation: M[0,0] += A0 * X0
Cache Content: +----+----+----+
| A0 | B0 | C0 | Miss
+----+----+----+
+-----+-----+-----+
| M00 | M01 | M02 | Miss
+-----+-----+-----+
+----+----+----+
| X0 | Y0 | Z0 | Miss
+----+----+----+
Operation: M[0,1] += A0 * Y0
Cache Content: +----+----+----+
| A0 | B0 | C0 | Hit
+----+----+----+
+-----+-----+-----+
| M00 | M01 | M02 | Hit
+-----+-----+-----+
+----+----+----+
| X0 | Y0 | Z0 | Hit
+----+----+----+
Operation: M[0,2] += A0 * Z0
Cache Content: +----+----+----+
| A0 | B0 | C0 | Hit
+----+----+----+
+-----+-----+-----+
| M00 | M01 | M02 | Hit
+-----+-----+-----+
+----+----+----+
| X0 | Y0 | Z0 | Hit
+----+----+----+
After 3 iterations, we have computed 1/3 of 3 entries. This is equivalent to that amount of work required to compute a single entry. Notice that unlike the first algorithm, there are only 3 misses and the rest are cache hits. This will result in a runtime of (600 + 3 + 3) or 606 tu.
Let do a runtime analysis of the entire matrix muliplication.
Original Algorithm:
cache: {[A0, B0, C0], [M00, M01, M02], [X0, Y0, Z0]}, 3 miss, 0 hits cache: {[A0, B0, C0], [M00, M01, M02], [X1, Y1, Z1]}, 1 miss, 2 hits cache: {[A0, B0, C0], [M00, M01, M02], [X2, Y2, Z2]}, 1 miss, 2 hits cache: {[A0, B0, C0], [M00, M01, M02], [X0, Y0, Z0]}, 1 miss, 2 hits cache: {[A0, B0, C0], [M00, M01, M02], [X1, Y1, Z1]}, 1 miss, 2 hits cache: {[A0, B0, C0], [M00, M01, M02], [X2, Y2, Z2]}, 1 miss, 2 hits cache: {[A0, B0, C0], [M00, M01, M02], [X0, Y0, Z0]}, 1 miss, 2 hits cache: {[A0, B0, C0], [M00, M01, M02], [X1, Y1, Z1]}, 1 miss, 2 hits cache: {[A0, B0, C0], [M00, M01, M02], [X2, Y2, Z2]}, 1 miss, 2 hits
cache: {[A1, B1, C1], [M10, M11, M12], [X0, Y0, Z0]}, 3 miss, 0 hits cache: {[A1, B1, C1], [M10, M11, M12], [X1, Y1, Z1]}, 1 miss, 2 hits cache: {[A1, B1, C1], [M10, M11, M12], [X2, Y2, Z2]}, 1 miss, 2 hits cache: {[A1, B1, C1], [M10, M11, M12], [X0, Y0, Z0]}, 1 miss, 2 hits cache: {[A1, B1, C1], [M10, M11, M12], [X1, Y1, Z1]}, 1 miss, 2 hits cache: {[A1, B1, C1], [M10, M11, M12], [X2, Y2, Z2]}, 1 miss, 2 hits cache: {[A1, B1, C1], [M10, M11, M12], [X0, Y0, Z0]}, 1 miss, 2 hits cache: {[A1, B1, C1], [M10, M11, M12], [X1, Y1, Z1]}, 1 miss, 2 hits cache: {[A1, B1, C1], [M10, M11, M12], [X2, Y2, Z2]}, 1 miss, 2 hits
cache: {[A2, B2, C2], [M20, M21, M22], [X0, Y0, Z0]}, 3 miss, 0 hits cache: {[A2, B2, C2], [M20, M21, M22], [X1, Y1, Z1]}, 1 miss, 2 hits cache: {[A2, B2, C2], [M20, M21, M22], [X2, Y2, Z2]}, 1 miss, 2 hits cache: {[A2, B2, C2], [M20, M21, M22], [X0, Y0, Z0]}, 1 miss, 2 hits cache: {[A2, B2, C2], [M20, M21, M22], [X1, Y1, Z1]}, 1 miss, 2 hits cache: {[A2, B2, C2], [M20, M21, M22], [X2, Y2, Z2]}, 1 miss, 2 hits cache: {[A2, B2, C2], [M20, M21, M22], [X0, Y0, Z0]}, 1 miss, 2 hits cache: {[A2, B2, C2], [M20, M21, M22], [X1, Y1, Z1]}, 1 miss, 2 hits cache: {[A2, B2, C2], [M20, M21, M22], [X2, Y2, Z2]}, 1 miss, 2 hits
Total: 33 miss, 48 hits Runtime: 33(200) + 48(1) = 6648 tu.
Improved Algorithm:
cache: {[A0, B0, C0], [M00, M01, M02], [X0, Y0, Z0]}, 3 miss, 0 hits cache: {[A0, B0, C0], [M00, M01, M02], [X0, Y0, Z0]}, 0 miss, 3 hits cache: {[A0, B0, C0], [M00, M01, M02], [X0, Y0, Z0]}, 0 miss, 3 hits cache: {[A1, B1, C1], [M10, M11, M12], [X0, Y0, Z0]}, 2 miss, 1 hits cache: {[A1, B1, C1], [M10, M11, M12], [X0, Y0, Z0]}, 0 miss, 3 hits cache: {[A1, B1, C1], [M10, M11, M12], [X0, Y0, Z0]}, 0 miss, 3 hits cache: {[A2, B2, C2], [M20, M21, M22], [X0, Y0, Z0]}, 2 miss, 1 hits cache: {[A2, B2, C2], [M20, M21, M22], [X0, Y0, Z0]}, 0 miss, 3 hits cache: {[A2, B2, C2], [M20, M21, M22], [X0, Y0, Z0]}, 0 miss, 3 hits
cache: {[A0, B0, C0], [M00, M01, M02], [X1, Y1, Z1]}, 3 miss, 0 hits cache: {[A0, B0, C0], [M00, M01, M02], [X1, Y1, Z1]}, 0 miss, 3 hits cache: {[A0, B0, C0], [M00, M01, M02], [X1, Y1, Z1]}, 0 miss, 3 hits cache: {[A1, B1, C1], [M10, M11, M12], [X1, Y1, Z1]}, 2 miss, 1 hits cache: {[A1, B1, C1], [M10, M11, M12], [X1, Y1, Z1]}, 0 miss, 3 hits cache: {[A1, B1, C1], [M10, M11, M12], [X1, Y1, Z1]}, 0 miss, 3 hits cache: {[A2, B2, C2], [M20, M21, M22], [X1, Y1, Z1]}, 2 miss, 1 hits cache: {[A2, B2, C2], [M20, M21, M22], [X1, Y1, Z1]}, 0 miss, 3 hits cache: {[A2, B2, C2], [M20, M21, M22], [X1, Y1, Z1]}, 0 miss, 3 hits
cache: {[A0, B0, C0], [M00, M01, M02], [X2, Y2, Z2]}, 3 miss, 0 hits cache: {[A0, B0, C0], [M00, M01, M02], [X2, Y2, Z2]}, 0 miss, 3 hits cache: {[A0, B0, C0], [M00, M01, M02], [X2, Y2, Z2]}, 0 miss, 3 hits cache: {[A1, B1, C1], [M10, M11, M12], [X2, Y2, Z2]}, 2 miss, 1 hits cache: {[A1, B1, C1], [M10, M11, M12], [X2, Y2, Z2]}, 0 miss, 3 hits cache: {[A1, B1, C1], [M10, M11, M12], [X2, Y2, Z2]}, 0 miss, 3 hits cache: {[A2, B2, C2], [M20, M21, M22], [X2, Y2, Z2]}, 2 miss, 1 hits cache: {[A2, B2, C2], [M20, M21, M22], [X2, Y2, Z2]}, 0 miss, 3 hits cache: {[A2, B2, C2], [M20, M21, M22], [X2, Y2, Z2]}, 0 miss, 3 hits
Total: 21 miss, 60 hits Runtime: 21(200) + 60(1) = 4260 tu.
The runtime improved from 6648 to 4260, or a decrease of 35.92%.