
How Are Convolutions Actually Performed Under the Hood?

2 simple tricks that PyTorch & TensorFlow use to speed up convolutions.



Anirudh S


Convolutions have become a fundamental part of modern neural networks because of their ability to capture local information and reduce the number of parameters with weight sharing.

Since almost all vision-based models (and a few NLP models) use convolutions in one form or another, it’s obvious that we would like to make these operations as fast as possible.

To emphasize the need for fast convolutions, here’s a profiler output of a simple network with a single 2D convolution layer followed by a fully connected layer:

Profiler Output for a Simple Conv Network

The convolution layer and the linear layer (addmm) together account for roughly 90% of the total execution time. As a consequence, it’s no surprise that several tricks have been developed to speed up this computation as much as possible.
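A profile like this can be generated with PyTorch’s built-in autograd profiler. Here’s a minimal sketch, using arbitrary layer sizes rather than the exact network from the figure:

import torch
import torch.nn as nn

# Toy model: a single 2D convolution followed by a fully connected layer
conv = nn.Conv2d(1, 1, kernel_size=2)
fc = nn.Linear(27 * 27, 10)

x = torch.randn(1, 1, 28, 28)

with torch.autograd.profiler.profile() as prof:
    out = fc(conv(x).flatten(1))

# Print per-operator timings, sorted by self CPU time
print(prof.key_averages().table(sort_by="self_cpu_time_total"))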

In this blog, we’ll look at 2 tricks that PyTorch and TensorFlow use to make convolutions significantly faster. We’ll use 2D convolutions since they’re the easiest to visualize, but the exact same concept applies to 1D and 3D convolutions.

Naive Convolution Implementation

Let’s start with a naive implementation for 2D convolution. We’ll use a simple 2x2 kernel with a 3x3 input matrix (with 1 channel):

input_matrix
array([[3., 9., 0.],
       [2., 8., 1.],
       [1., 4., 8.]], dtype=float32)

kernel
array([[8., 9.],
       [4., 4.]], dtype=float32)

bias
array([0.06], dtype=float32)

def conv_2d(x, kernel, bias):
    kernel_shape = kernel.shape[0]

    # Assuming padding = 0, stride = 1 and a square input
    output_shape = x.shape[0] - kernel_shape + 1
    result = np.zeros((output_shape, output_shape))

    for row in range(output_shape):
        for col in range(output_shape):
            window = x[row: row + kernel_shape, col: col + kernel_shape]
            result[row, col] = np.sum(np.multiply(kernel, window))
    return result + bias

Naive 2D Convolution

The naive implementation is quite simple to understand: we traverse the input matrix and pull out “windows” that are equal to the shape of the kernel. For each window, we do an element-wise multiplication with the kernel and sum up all the values. Finally, before returning the result, we add the bias term to each element of the output.

We can quickly verify that we’re getting the correct result by comparing the output against PyTorch’s own Conv2d layer.

naive_conv_op = conv_2d(input_matrix, kernel, bias)
print(naive_conv_op)

# torch_conv's weights are assumed to be set to the same kernel, and
# input_matrix converted to a (1, 1, 3, 3) tensor (not shown here)
torch_conv = nn.Conv2d(1, 1, 2)
torch_conv_op = torch_conv(input_matrix)
print(torch_conv_op)

Output:

naive_conv_op
array([[145.06, 108.06],
       [108.06, 121.06]])

torch_conv_op
tensor([[[[145.07, 108.07],
          [108.07, 121.07]]]])

And here’s the execution time for both of them:

%%timeit
conv_2d(input_matrix, kernel, bias)

%%timeit
torch_conv(input_matrix)

Output:

Naive Conv:
26.9 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Torch Conv:
59.5 µs ± 935 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Now let’s check how the execution time changes when the kernel size is kept the same and the size of the input matrix is slowly varied.

Naive Convolution vs PyTorch Convolution
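If you want to reproduce a comparison like this, here’s a rough sketch using timeit and the conv_2d function from above (the input sizes and repeat counts are arbitrary choices):

import timeit
import numpy as np

kernel = np.random.rand(2, 2).astype(np.float32)
bias = np.array([0.06], dtype=np.float32)

# Time the naive implementation for growing input sizes
for n in [50, 100, 250, 500]:
    x = np.random.rand(n, n).astype(np.float32)
    t = timeit.timeit(lambda: conv_2d(x, kernel, bias), number=3)
    print(f"{n} x {n}: {t / 3:.4f} s per call")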

The two for-loops in our implementation are responsible for O(n²) execution time, and as the input size grows beyond 250 x 250, the naive convolution takes 1–3 seconds per matrix. If we had a huge network like Inception Net with hundreds of convolutions and thousands of large input matrices, naive convolution would be an absolutely terrible idea.

However, notice that PyTorch’s own implementation scales very well with the input matrix size. Clearly, PyTorch does convolutions differently.

Trick #1: im2col

While multiplying each window with the kernel we did 2 operations:

  1. Multiplied the terms
  2. Added them all together.

…and we did this for each window in the input matrix.

Now the important question to ask here is: Can we vectorize this entire operation?

The answer is yes, and that’s exactly what im2col (which stands for Image Block to Column) helps us do.

Simply put, im2col is a technique where we take each window, flatten it out, and stack the results as columns in a matrix. Now, if we flatten the kernel into a row vector and do a matrix multiplication between the two, we should get the exact same result after reshaping the output.

Im2Col
Im2Col-Reshaping

Let’s try it out:

def im2col(x, kernel):
    kernel_shape = kernel.shape[0]
    rows = []

    # Assuming padding = 0, stride = 1 and a square input
    output_shape = x.shape[0] - kernel_shape + 1
    for row in range(output_shape):
        for col in range(output_shape):
            window = x[row: row + kernel_shape, col: col + kernel_shape]
            rows.append(window.flatten())

    # Transpose so that each flattened window becomes a column
    return np.transpose(np.array(rows))

Naive Implementation of Im2Col

im2col(input_matrix, kernel)

Output:

array([[3, 9, 2, 8],
       [9, 0, 8, 1],
       [2, 8, 1, 4],
       [8, 1, 4, 8]])

And now we flatten the kernel and do matrix multiplication:

output_shape = (input_matrix.shape[0] - kernel.shape[0]) + 1

im2col_matrix = im2col(input_matrix, kernel)
im2col_conv = np.dot(kernel.flatten(), im2col_matrix) + bias
im2col_conv = im2col_conv.reshape(output_shape, output_shape)
print(im2col_conv)

torch_conv = nn.Conv2d(1, 1, 2)
torch_conv_op = torch_conv(input_matrix)
print(torch_conv_op)

Output:

im2col_conv
array([[145.06, 108.06],
       [108.06, 121.06]])

torch_conv_op
tensor([[[[145.07, 108.07],
          [108.07, 121.07]]]])

Now let’s check how it scales:

Vectorizing definitely helped but there’s still room for improvement. Before we move on to the next trick, let’s see why vectorizing actually helps.

Why does this work?

All modern CPUs and GPUs come with optimized matrix algebra libraries that let code take advantage of hardware acceleration. These libraries fall under the umbrella term BLAS, or Basic Linear Algebra Subprograms. When we vectorize our code and call np.dot(), numpy can hand the computation off to a BLAS library, which executes it much faster.
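You can check which BLAS implementation your own numpy build is linked against, and see the gap between a pure-Python loop and the BLAS-backed np.dot, with a quick sketch like this (the exact output depends on your install):

import numpy as np

# Shows the BLAS/LAPACK libraries this numpy build was compiled against
# (e.g. OpenBLAS or Intel MKL)
np.show_config()

a = np.random.rand(100_000)
b = np.random.rand(100_000)

# Pure-Python dot product vs the BLAS-backed np.dot: same result,
# very different execution time
py_dot = sum(x * y for x, y in zip(a, b))
blas_dot = np.dot(a, b)
print(np.isclose(py_dot, blas_dot))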

In fact, you might have noticed MKLDNN in the earlier profiler output.

MKLDNN stands for Math Kernel Library for Deep Neural Networks, Intel’s library of optimized deep learning primitives (a companion to MKL, Intel’s BLAS library). Since I ran the PyTorch model on my Intel i7, PyTorch automatically dispatched to Intel’s libraries. If you ran this on an Nvidia GPU, PyTorch would have used cuBLAS and cuDNN (Nvidia’s equivalents) instead.

The next trick involves getting rid of the two for-loops and creating the im2col matrix efficiently.

Trick #2: Memory Strides

While creating the windows in im2col we still used two for-loops to index the input matrix, which slows down execution. To understand how to improve this, we need to take a look at how numpy arrays are stored in memory.

Just like all other arrays, numpy arrays are stored as contiguous blocks in memory. Each numpy array also has a .strides attribute that tells us how many bytes need to be jumped to access the next element.

For example:

x = np.arange(10, dtype='int64')
print(x)
print(x.strides)

Output:

x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

x.strides
(8,)

Each element is an int64, i.e. 64 bits or 8 bytes, which is why x.strides tells us we need to jump 8 bytes to access the next element in the array.

When dealing with 2D arrays, we get two stride values: the number of bytes to jump to reach the next row and the number of bytes to jump to reach the next column.

Credit: AndyK on StackOverflow[2]
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(x)
print(x.strides)

Output:

x
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

x.strides
(24, 8)  # Jump 24 bytes to access the next row, 8 bytes to access the next column

Now here’s the interesting part, numpy gives us the ability to change the strides of any numpy array by using a function callednp.lib.stride_tricks.as_strided. Based on what stride values we provide, this function simply changes the way we look at the array in memory and generates a new “view”.

Here’s an example:

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(x)

x_newview = np.lib.stride_tricks.as_strided(x, shape=(5, 4), strides=(8, 8))
print(x_newview)

Output:

x
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

x_newview
array([[1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7],
       [5, 6, 7, 8]])

Instead of jumping 24 bytes (3 elements) to start the next row, we used as_strided to jump only 8 bytes (1 element) when starting the next row. Using the shape parameter, we can also set the output shape as needed.

Note: As mentioned earlier, as_strided only changes the way we look at the array in memory. This means that if we change a value in the “view”, we change the underlying memory, and therefore the corresponding element in the original matrix.

x_newview[1, 3] = -99
print(x_newview)
print(x)

Output:

x_newview
array([[  1,   2,   3,   4],
       [  2,   3,   4, -99],
       [  3,   4, -99,   6],
       [  4, -99,   6,   7],
       [-99,   6,   7,   8]])

x
array([[  1,   2,   3],
       [  4, -99,   6],
       [  7,   8,   9]])

Since as_strided does not use any loops to create these “views”, we can use it to efficiently generate the windows for convolution. All we need to do is calculate the right stride values and output shape, and as_strided does the rest for us.
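For our 3 x 3 input and 2 x 2 kernel, that calculation looks roughly like this (a sketch of doing it by hand, assuming stride 1 and no padding):

k = kernel.shape[0]
out = input_matrix.shape[0] - k + 1      # output height/width
s_row, s_col = input_matrix.strides      # bytes to the next row / next column

# One k x k window per output position: moving the window one step down or
# right uses the same byte jumps as moving one element in the original array
windows = np.lib.stride_tricks.as_strided(
    input_matrix, shape=(out, out, k, k), strides=(s_row, s_col, s_row, s_col)
)
print(windows.shape)   # (2, 2, 2, 2)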

However, if we provide the wrong stride values, as_strided will happily access memory locations outside the array and return junk values. Luckily, the view_as_windows function in the scikit-image library does all the heavy lifting for us: it calculates the shape and stride values automatically while using as_strided in the background:

from skimage.util.shape import view_as_windows

input_matrix = np.array([[3, 9, 0], [2, 8, 1], [1, 4, 8]])
print(input_matrix)

kernel = np.array([[8, 9], [4, 4]])
print(kernel)

windows = view_as_windows(input_matrix, kernel.shape)
print(windows)

Output:

input_matrix
array([[3, 9, 0],
       [2, 8, 1],
       [1, 4, 8]])

kernel
array([[8, 9],
       [4, 4]])

windows
array([[[[3, 9],
         [2, 8]],

        [[9, 0],
         [8, 1]]],

       [[[2, 8],
         [1, 4]],

        [[8, 1],
         [4, 8]]]])

And now we just reshape:

output_shape = (input_matrix.shape[0] - kernel.shape[0]) + 1
windows = windows.reshape(output_shape**2, kernel.shape[0] * kernel.shape[1])
print(windows)

Output:

windows
array([[3, 9, 2, 8],
       [9, 0, 8, 1],
       [2, 8, 1, 4],
       [8, 1, 4, 8]])

Here’s the final function that does all of these:

Im2Col with Memory Strides

Now we can do the matrix multiplication the same way we did previously:

output_shape = (input_matrix.shape[0] - kernel.shape[0]) + 1

mem_strided_mat = memory_strided_im2col(input_matrix, kernel)
mem_strided_conv = np.dot(kernel.flatten(), mem_strided_mat) + bias
mem_strided_conv = mem_strided_conv.reshape(output_shape, output_shape)
print(mem_strided_conv)

torch_conv = nn.Conv2d(1, 1, 2)
torch_conv_op = torch_conv(input_matrix)
print(torch_conv_op)

Output:

mem_strided_conv
array([[145.06, 108.06],
       [108.06, 121.06]])

torch_conv_op
tensor([[[[145.07, 108.07],
          [108.07, 121.07]]]])

Let’s check how it compares against all the other implementations so far:

Plot for Mem Strided Im2Col

Using as_strided has significantly increased the speed of our implementation! In fact, it’s almost as fast as PyTorch.

Also, if you look back at the profiler output, you’ll see that PyTorch calls its own as_strided function before the convolution.
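As an aside, PyTorch also exposes an im2col-style operator directly through torch.nn.functional.unfold; here’s a quick way to look at the column layout for our example input:

import torch
import torch.nn.functional as F

x = torch.tensor([[3., 9., 0.],
                  [2., 8., 1.],
                  [1., 4., 8.]]).reshape(1, 1, 3, 3)

# Each column of the result is one flattened 2 x 2 window
cols = F.unfold(x, kernel_size=2)
print(cols.shape)   # torch.Size([1, 4, 4])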

Memory Trade-off

Since we need to create columns for each window of the input matrix, the im2col matrix ends up consuming more memory than the naive implementation.

Memory Consumption of Strided Im2Col

However, the improvement in speed (shown in the table below) far outweighs the increase in memory consumption.

Summary

Here’s a summary of the execution times for all implementations. The kernel size (2 x 2) was held constant while the input size was changed. I ran all of these on my Intel i7 processor.

Final Results Table

It’s incredible that we were able to get almost 150x improvement over Naive convolution with just 2 simple tricks. The PyTorch implementation is still 2x faster than our Memory Strided im2col implementation. This is most likely because PyTorch has its own tensor implementation that might be optimized for bigger matrices. In fact, our implementation is faster than PyTorch for matrix sizes below 50 x 50.

Although we used only PyTorch here, TensorFlow performs the same set of operations when computing convolutions (docs).

And finally, here’s how our implementation would change when padding, strides or 1D/3D convolutions are used:

Padding: If we added padding it would make no difference to our implementation, as padding is generally applied before convolution. However, the output shape would have to be correctly calculated.

Strides: Here we assumed a stride of 1. A larger stride would simply slide the window with bigger jumps, which means the strides in as_strided would have to be re-calculated. However, the concept remains the same. (In fact, view_as_windows has a step parameter that takes care of strides as well.)

More Filters: In our examples, we assumed a single filter for the kernel. If we had more filters, each filter would be flattened out into a row, giving us a matrix instead of a vector. We would then multiply this matrix with the im2col matrix, i.e. a matrix-by-matrix product instead of a vector-by-matrix product (see the sketch after this list).

1D or 3D Convolution: The columns in the im2col matrix would just be shorter or taller since the size of the window changes (depending on the kernel as well).
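For the multi-filter case, here’s a small sketch of what that matrix-matrix version might look like, reusing memory_strided_im2col from above (the filter values are random, purely to show the shapes):

# 3 hypothetical 2 x 2 filters, flattened into the rows of a (3, 4) matrix
filters = np.random.rand(3, 2, 2).astype(np.float32)
filter_matrix = filters.reshape(3, -1)                              # (3, 4)

# One filter is passed in just to supply the window shape
im2col_matrix = memory_strided_im2col(input_matrix, filters[0])     # (4, 4), windows as columns

out = filter_matrix @ im2col_matrix      # (3, 4): one row per filter
out = out.reshape(3, 2, 2)               # 3 output feature maps, each 2 x 2
print(out.shape)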

I hope you enjoyed and found this useful! Feel free to connect with me for any questions or comments.
