Understanding Cache-Friendly Code


Caching is a general term for a family of similar techniques used in many parts of a computer system. Caching comes into play whenever two storage media with different read/write speeds are involved. For example, the CPU has a number of internal registers that perform read and write operations very quickly.

At the same time, the CPU must fetch data from the main memory, which is many times slower than its registers. A caching mechanism is needed to bridge this gap; otherwise, the low speed of the main memory becomes dominant and masks the high computational speed of the CPU.

This article is an excerpt from the book Extreme C by Kamran Amini. C still plays a critical role in 21st-century programming, remaining the core language for precision engineering, aviation, space research, and more. Kamran teaches you to boost your existing C knowledge, giving you new insight into algorithm design, functions, and structures.

Cache

Working with database files is another example. Database files are usually stored on an external hard disk, which is slower than the main memory by orders of magnitude. A caching mechanism is clearly required here; otherwise, the slowest storage becomes dominant and determines the speed of the whole system.

Caching and the details around it deserve a dedicated article of their own, since there are abstract models and specific terminology that need to be explained. Using these models, one can predict how well a cache will behave and how much performance gain can be expected after introducing it. Here, we explain caching in a simple and intuitive manner.

Suppose that you have a slow storage that can hold many items, and a fast storage that can hold only a limited number of them. This is an obvious tradeoff. We can call the faster but smaller storage a cache. It is reasonable to bring items from the slow storage into the fast one and process them there, simply because the fast storage is faster.

From time to time, you have to go back to the slow storage to bring over more items. Obviously, you won't bring over just one item at a time, as that would be very inefficient. Instead, you bring a bucket of items into the faster storage. Usually, it is said that the items are cached into the faster storage.

Suppose that, while processing an item, you need to load another item from the slow storage. The first thing that comes to mind is to search for the required item in the recently fetched bucket, which is currently in the cache storage.

If you find the item in the cache, there is no need to retrieve it from the slow storage; this is called a hit. If the item is missing from the cache, you have to go to the slow storage and read another bucket of items into the cache; this is called a miss. Clearly, the more hits you observe, the more performance you get.
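
To make the model concrete, here is a minimal sketch of this hit/miss bookkeeping (our illustration, not code from the book). It simulates a small direct-mapped cache: reading items sequentially reuses each fetched bucket and produces mostly hits, while jumping to a new bucket on every read produces only misses:

[sourcecode language="plain"]#include <stdio.h>

#define BUCKET_SIZE 8 // Items fetched together from the slow storage
#define NUM_BUCKETS 4 // Capacity of the fast storage (the cache)

static int slots[NUM_BUCKETS]; // Which bucket each cache slot holds
static long hits = 0, misses = 0;

// Simulates reading one item; only the cache bookkeeping is modeled.
void read_item(int index) {
  int bucket = index / BUCKET_SIZE;
  int slot = bucket % NUM_BUCKETS; // Direct-mapped placement
  if (slots[slot] == bucket) {
    hits++;   // The item is already cached: a hit
  } else {
    misses++; // Fetch the whole bucket from slow storage: a miss
    slots[slot] = bucket;
  }
}

void reset(void) {
  hits = misses = 0;
  for (int i = 0; i < NUM_BUCKETS; i++) slots[i] = -1;
}

int main(void) {
  reset();
  for (int i = 0; i < 1000; i++) read_item(i); // Sequential access
  printf("Sequential: %ld hits, %ld misses\n", hits, misses);
  reset();
  for (int i = 0; i < 1000; i++) read_item(i * BUCKET_SIZE); // New bucket every read
  printf("Strided:    %ld hits, %ld misses\n", hits, misses);
  return 0;
}
[/sourcecode]

With these toy parameters, the sequential pass yields 875 hits and 125 misses, while the strided pass yields 1,000 misses and no hits. That contrast is exactly the intuition behind hit rates.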

The preceding description applies directly to the CPU cache and the main memory. The CPU cache stores recently read instructions and data from the main memory, and the main memory is slow compared to the CPU cache.

Cache-friendly code

When the CPU executes an instruction, it first has to fetch all the required data. The data is located in the main memory, at a specific address determined by the instruction, and it has to be transferred to the CPU registers before any computation. Rather than fetching only the requested data, however, the CPU usually brings in a larger block of neighboring memory and puts it in its cache.

The next time a value is needed in the proximity of the previous address, it should already exist in the cache, and the CPU can read it from the cache instead of the main memory, which is far faster. As explained in the previous section, this is a cache hit. If the address is not found in the CPU cache, it is a cache miss, and the CPU has to access the main memory to read the target address and bring in the required data, which is quite slow. In general, higher hit rates result in faster execution.

But why does the CPU fetch the neighboring addresses (the proximity) around an address? Because of the principle of locality: in computer systems, it is usually observed that data located in the same neighborhood is accessed more frequently. The CPU behaves according to this principle and brings in extra data around a locality of reference. If an algorithm can exploit this behavior, the CPU can execute it faster. This is why we refer to such an algorithm as a cache-friendly algorithm.

Example 1.1 demonstrates the performance difference between cache-friendly code and non-cache-friendly code:

[sourcecode language="plain"]#include <stdio.h>  // For the printf function
#include <stdlib.h> // For heap memory functions
#include <string.h> // For the strcmp function

// Fills the matrix so that every element in row i holds the value i + 1.
void fill(int* matrix, int rows, int columns) {
  int counter = 1;
  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < columns; j++) {
      *(matrix + i * columns + j) = counter;
    }
    counter++;
  }
}

// Prints the matrix in its 2D representation.
void print_matrix(int* matrix, int rows, int columns) {
  printf("Matrix:\n");
  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < columns; j++) {
      printf("%d ", *(matrix + i * columns + j));
    }
    printf("\n");
  }
}

// Prints the matrix as it is laid out in memory: one flat, contiguous array.
void print_flat(int* matrix, int rows, int columns) {
  printf("Flat matrix: ");
  for (int i = 0; i < (rows * columns); i++) {
    printf("%d ", *(matrix + i));
  }
  printf("\n");
}

// Row-major traversal: matches the memory layout, hence cache-friendly.
int friendly_sum(int* matrix, int rows, int columns) {
  int sum = 0;
  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < columns; j++) {
      sum += *(matrix + i * columns + j);
    }
  }
  return sum;
}

// Column-major traversal: jumps `columns` elements on every step,
// hence not cache-friendly.
int not_friendly_sum(int* matrix, int rows, int columns) {
  int sum = 0;
  for (int j = 0; j < columns; j++) {
    for (int i = 0; i < rows; i++) {
      sum += *(matrix + i * columns + j);
    }
  }
  return sum;
}

int main(int argc, char** argv) {
  if (argc < 4) {
    printf("Usage: %s [print|friendly-sum|not-friendly-sum] "
           "[number-of-rows] [number-of-columns]\n", argv[0]);
    exit(1);
  }
  char* operation = argv[1];
  int rows = atol(argv[2]);
  int columns = atol(argv[3]);
  int* matrix = (int*)malloc(rows * columns * sizeof(int));
  fill(matrix, rows, columns);
  if (strcmp(operation, "print") == 0) {
    print_matrix(matrix, rows, columns);
    print_flat(matrix, rows, columns);
  } else if (strcmp(operation, "friendly-sum") == 0) {
    int sum = friendly_sum(matrix, rows, columns);
    printf("Friendly sum: %d\n", sum);
  } else if (strcmp(operation, "not-friendly-sum") == 0) {
    int sum = not_friendly_sum(matrix, rows, columns);
    printf("Not friendly sum: %d\n", sum);
  } else {
    printf("FATAL: Unsupported operation!\n");
    exit(1);
  }
  free(matrix);
  return 0;
}
[/sourcecode]

Code Box 1.1: Demonstrating the performance difference between cache-friendly and non-cache-friendly code

The preceding program computes and prints the sum of all elements in a matrix, but it does more than that. The user can pass options to the program to alter its behavior. Suppose that we want to print a 2 by 3 matrix initialized by the algorithm written in the fill function. The user has to pass the print option together with the desired numbers of rows and columns. Next, you can see how these options are passed to the final executable binary:

[sourcecode language="plain"]$ gcc example1_1.c -o ex1_1.out
$ ./ex1_1.out print 2 3
Matrix:
1 1 1
2 2 2
Flat matrix: 1 1 1 2 2 2
$[/sourcecode]

Shell Box 1-1: Output of example 1.1 showing a 2 by 3 matrix

The output consists of two different printouts of the matrix. The first is the 2D representation, and the second is the flat representation of the same matrix. As you can see, the matrix is stored in row-major order in memory; that is, it is stored row by row. So, if something from a row is fetched by the CPU, it is likely that all of the elements in that row will be fetched too. Hence, it is better to do our summation in row-major order rather than column-major order.
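
If you want to verify the row-major layout yourself, a short sketch like the following (our illustration, separate from Example 1.1) prints the address of every element of a small static matrix; the addresses grow by sizeof(int) along each row, confirming that rows are contiguous in memory:

[sourcecode language="plain"]#include <stdio.h>

int main(void) {
  int m[2][3] = { {1, 1, 1}, {2, 2, 2} };
  for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 3; j++) {
      // Addresses increase by sizeof(int) along each row: row-major order.
      printf("m[%d][%d] = %d at %p\n", i, j, m[i][j], (void*)&m[i][j]);
    }
  }
  return 0;
}
[/sourcecode]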

If you look at the code again, you can see that the summation in the friendly_sum function is done in row-major order, while the summation in the not_friendly_sum function is done in column-major order. Next, we compare the time it takes to sum a matrix with 20,000 rows and 20,000 columns. As you can see, the difference is stark:

[sourcecode language="plain"]$ time ./ex1_1.out friendly-sum 20000 20000
Friendly sum: 1585447424

real 0m5.192s
user 0m3.142s
sys 0m1.765s

$ time ./ex1_1.out not-friendly-sum 20000 20000
Not friendly sum: 1585447424

real 0m15.372s
user 0m14.031s
sys 0m0.791s
$[/sourcecode]

Shell Box 1-2: Demonstration of the time difference between the column-major and row-major matrix summation algorithms

The difference between the measured times is about 10 seconds! The program was compiled on a macOS machine using the clang compiler (on macOS, the gcc command is an alias for clang). The difference means that the same logic, using the same amount of memory, can take much longer simply because of the order in which the matrix elements are accessed! This example clearly shows the effect of cache-friendly code. The time utility, available on all Unix-like operating systems, can be used to measure how long a program takes to finish.
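
If you would rather measure a specific region of code than the whole process, one portable POSIX option is clock_gettime with the CLOCK_MONOTONIC clock. The following sketch shows the general pattern; the friendly_sum call mentioned in the comment is just a placeholder for whatever you want to time, and this instrumentation is our suggestion, not part of the book's example:

[sourcecode language="plain"]#include <stdio.h>
#include <time.h>

// Returns the current monotonic time in seconds.
static double now_sec(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
  double start = now_sec();
  // ... the code to be measured goes here, e.g. a call to
  // friendly_sum(matrix, rows, columns) from Example 1.1 ...
  double elapsed = now_sec() - start;
  printf("Elapsed: %.3f seconds\n", elapsed);
  return 0;
}
[/sourcecode]

Unlike the time utility, this measures only the region between the two now_sec calls, so it excludes process startup, argument parsing, and memory allocation.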

Summary

In this article, we discussed what caching is and how writing cache-friendly code can gain you performance. Extreme C acts as a high-intensity guide to the most advanced capabilities of C, for programmers who want to push their limits.

About the Author

Kamran Amini is an expert software architect with more than 10 years of experience in the analysis, design, development, and building of large-scale, distributed enterprise software systems. His skills are not limited to a specific development platform, and his architectural solutions span a variety of technologies, patterns, and concepts based on C, C++, Java, Python, and more. His passion for C and C++ began in his teenage years, when he led his high school's soccer simulation team, and he has made these languages the main axis of his career. More recently, blockchain and cryptocurrencies have been the focus of his research, and thanks to his deep knowledge of classic cryptography and PKI, expanding their possible future uses and exploring alternative blockchains are among his interests.
