
Listening to the Pixels

Self-supervised Deep Learning for Modelling Audio-Visual Correspondences



Rishab Sharma


Human perception is multidimensional, a balanced combination of hearing, vision, smell, touch, and taste. Recently, much research has tried to push machine perception forward by transitioning from single-modality learning to multimodal learning.

If you are wondering what a modality is, it is a single independent channel of sensory input/output between a computer and a human (vision is one modality, audio is another).

In this blog, we will talk about the use of audio and visual information (representing the two most important perceptual modalities in our daily life) to make our machine perception smarter without using any labeled data (self-supervision).

What is the problem we are solving?

When you hear the voice of a person you know, can you recall their face? Or when you see a person's face, can you recall their voice? This shows how humans can 'hear faces' and 'see voices' by cultivating a mental picture or an acoustic memory of the person. The question is: can you teach a machine to do that?

Our world generates a rich source of auditory and visual signals. The visual signals are a result of light reflections, whereas the sounds originate from object motions and vibrations of the surrounding air.

Often correlated at the time of naturally occurring events, these two modalities combine to jointly affect human perception. In response to this perceptual input, humans show a remarkable ability to connect and integrate signals from these two modalities.

In fact, the interplay among senses is one of the most ancient schemes explaining how the human brain's sensory organization works to understand the complex interactions of the physical world. Inspired by our capability of interpreting sound sources from how objects move visually, we can create learning models that learn to perform this interpretation on their own.

A graphical illustration of the problem of sound source separation and localization (source)

While auditory scene analysis is mostly studied in the fields of environmental sound source separation and recognition, the natural synchronization between sound and vision can provide a rich self-supervisory signal for grounding auditory signals in visual signals, which is all we need for self-supervision to show its magic.

In this blog, we will learn how to leverage this cross-modal context as a self-supervisory signal to extract information beyond the limits established by individual modalities.

We will acknowledge the importance of temporal features that are based on significant changes in each modality and design a probabilistic formalism that can identify temporal coincidences between these features to yield visual localization and cross-modal association.

The intuitive solution

The most intuitive solution which will come to our mind is to design a probabilistic formalism that can exploit the inherent coherence of audio-visual signals from large quantities of unlabelled videos to learn sound localization and separation.

This can be done by building a computational model that learns the relationship between visuals and sounds in an unsupervised way: recognizing objects from the sounds they make, localizing them in images, and separating the audio component coming from each object.

With such inspiration in mind, many researchers have developed models that can effectively do sound localization and sound recognition. We will also work our way to one such solution that can do sound source separation and its visual localization by distinguishing the components of sound and their association with the corresponding objects.

The solution we will work on is two-fold. First, we will use a simple architecture that will rely on static visual information to learn the cross-modal context. Next, we will take a step further to include the motion cues of the video into our solution. The motion signals are of crucial importance for learning the audio-visual correspondences.

This fact becomes clearer with a simple case of sound production from two similar-looking objects. Consider two artists playing a violin duet. From a single picture, it is impossible even for humans to separate the melody from the harmony.

However, if we observe the movements of the artists for a while and try to match these motion cues with the musical beats, we can probably attribute each part of the music to the right player based on this motion-beat correspondence.

This case illustrates the importance of temporal repetition of motion for solving the complex multi-modal reasoning behind sound source separation, even for humans. Our aim is to computationally mimic this ability to reason about the synergy between audio, visual, and motion signals.

Pixel-level sound embedding visualization for a model that learned the cross-modal context (source)

Computational models of this relationship can be utilized as a fundamental unit for many applications like combining videos with automatically generated ambient sound for better immersion in VR or for enabling equal accessibility by linking sound with visual signals for visually impaired people.

The Approaches

For our initial approach, we will construct a three-component network as suggested in [1] for processing video frames and audio signals separately, followed by their features’ combined processing in an audio synthesizer network.

The three-component (VAN, AAN, ASN) network as suggested in [1] (source)

The first component, the Video Analysis Network (VAN), takes the video frames as input and extracts appearance features. For feature extraction, we can use a dilated ResNet-18 with an input of size T×H×W×3 and an output stride of 16, followed by a temporal max-pooling layer that outputs a K-channel feature map. A minimal PyTorch sketch of such a network is shown below.

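The sketch below is only an approximation of the VAN described above: it takes a standard torchvision ResNet-18, removes the stride of the last stage and dilates its 3×3 convolutions to get an output stride of 16, projects to K channels with a 1×1 convolution, and max-pools over the T frames. The class name, the value of K, and the (B, T, 3, H, W) input layout are assumptions for illustration, not the exact implementation of [1].

import torch
import torch.nn as nn
import torchvision.models as models

# A minimal sketch of a Video Analysis Network (VAN): a ResNet-18 backbone
# with a dilated last stage (output stride 16), a 1x1 projection to K channels,
# and temporal max pooling over the T input frames. Names and sizes are assumptions.
class VideoAnalysisNet(nn.Module):
    def __init__(self, K=16):
        super(VideoAnalysisNet, self).__init__()
        backbone = models.resnet18()
        # remove the stride of the last stage and dilate its 3x3 convolutions
        backbone.layer4[0].conv1.stride = (1, 1)
        backbone.layer4[0].downsample[0].stride = (1, 1)
        for m in backbone.layer4.modules():
            if isinstance(m, nn.Conv2d) and m.kernel_size == (3, 3):
                m.dilation, m.padding = (2, 2), (2, 2)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.project = nn.Conv2d(512, K, kernel_size=1)                 # K-channel feature map

    def forward(self, frames):
        # frames: (B, T, 3, H, W) video clip
        B, T, C, H, W = frames.shape
        x = self.project(self.features(frames.view(B * T, C, H, W)))   # (B*T, K, H/16, W/16)
        x = x.view(B, T, *x.shape[1:])
        return x.max(dim=1).values                                     # temporal max pooling

pixel_feats = VideoAnalysisNet(K=32)(torch.randn(2, 3, 3, 224, 224))   # -> (2, 32, 14, 14)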

The second component, the Audio Analysis Network (AAN) takes the sound mixture as input and applies the Short-Time Fourier Transform (STFT) with a log-frequency scale to obtain a sound spectrogram.

Then the obtained spectrogram is fed to a U-Net that yields K feature maps representing different components of the input audio mixture. In the code snippet below, you can find the PyTorch code for an AAN.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Code from Hang Zhao (@hangzhaomit)
class Unet(nn.Module):
    def __init__(self, fc_dim=64, num_downs=5, ngf=64, use_dropout=False):
        super(Unet, self).__init__()

        # construct unet structure
        unet_block = UnetBlock(
            ngf * 8, ngf * 8, input_nc=None,
            submodule=None, innermost=True)
        for i in range(num_downs - 5):
            unet_block = UnetBlock(
                ngf * 8, ngf * 8, input_nc=None,
                submodule=unet_block, use_dropout=use_dropout)
        unet_block = UnetBlock(
            ngf * 4, ngf * 8, input_nc=None,
            submodule=unet_block)
        unet_block = UnetBlock(
            ngf * 2, ngf * 4, input_nc=None,
            submodule=unet_block)
        unet_block = UnetBlock(
            ngf, ngf * 2, input_nc=None,
            submodule=unet_block)
        unet_block = UnetBlock(
            fc_dim, ngf, input_nc=1,
            submodule=unet_block, outermost=True)

        self.bn0 = nn.BatchNorm2d(1)
        self.unet_block = unet_block

    def forward(self, x):
        x = self.bn0(x)
        x = self.unet_block(x)
        return x


# Defines the submodule with skip connection.
# X -------------------identity---------------------- X
#   |-- downsampling -- |submodule| -- upsampling --|
class UnetBlock(nn.Module):
    def __init__(self, outer_nc, inner_input_nc, input_nc=None,
                 submodule=None, outermost=False, innermost=False,
                 use_dropout=False, inner_output_nc=None, noskip=False):
        super(UnetBlock, self).__init__()
        self.outermost = outermost
        self.noskip = noskip
        use_bias = False
        if input_nc is None:
            input_nc = outer_nc
        if innermost:
            inner_output_nc = inner_input_nc
        elif inner_output_nc is None:
            inner_output_nc = 2 * inner_input_nc

        downrelu = nn.LeakyReLU(0.2, True)
        downnorm = nn.BatchNorm2d(inner_input_nc)
        uprelu = nn.ReLU(True)
        upnorm = nn.BatchNorm2d(outer_nc)
        upsample = nn.Upsample(
            scale_factor=2, mode='bilinear', align_corners=True)

        if outermost:
            downconv = nn.Conv2d(
                input_nc, inner_input_nc, kernel_size=4,
                stride=2, padding=1, bias=use_bias)
            upconv = nn.Conv2d(
                inner_output_nc, outer_nc, kernel_size=3, padding=1)

            down = [downconv]
            up = [uprelu, upsample, upconv]
            model = down + [submodule] + up
        elif innermost:
            downconv = nn.Conv2d(
                input_nc, inner_input_nc, kernel_size=4,
                stride=2, padding=1, bias=use_bias)
            upconv = nn.Conv2d(
                inner_output_nc, outer_nc, kernel_size=3,
                padding=1, bias=use_bias)

            down = [downrelu, downconv]
            up = [uprelu, upsample, upconv, upnorm]
            model = down + up
        else:
            downconv = nn.Conv2d(
                input_nc, inner_input_nc, kernel_size=4,
                stride=2, padding=1, bias=use_bias)
            upconv = nn.Conv2d(
                inner_output_nc, outer_nc, kernel_size=3,
                padding=1, bias=use_bias)
            down = [downrelu, downconv, downnorm]
            up = [uprelu, upsample, upconv, upnorm]

            if use_dropout:
                model = down + [submodule] + up + [nn.Dropout(0.5)]
            else:
                model = down + [submodule] + up

        self.model = nn.Sequential(*model)

    def forward(self, x):
        if self.outermost or self.noskip:
            return self.model(x)
        else:
            return torch.cat([x, self.model(x)], 1)
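
To make the shapes concrete, here is a small usage sketch for the U-Net above; all sizes (512 frequency bins, 256 time frames, K = 32, 7 down/up levels) are assumptions, not values from [1].

# Hedged usage sketch with assumed sizes: a single-channel log-magnitude
# spectrogram goes in, a K = fc_dim channel feature map of the same size comes out.
aan = Unet(fc_dim=32, num_downs=7)
mag = torch.rand(4, 1, 512, 256)           # magnitude spectrogram of the mixture
sound_feats = aan(torch.log(mag + 1e-4))   # -> (4, 32, 512, 256) feature maps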

The third component, the Audio Synthesizer Network (ASN), takes the extracted pixel-level appearance features and the audio features as input and predicts a vision-based binary spectrogram mask. The number of predicted masks depends on the number of sound sources to separate in the input mixture.

These binary masks are then multiplied with the input spectrogram to isolate each sound component; the masked magnitude is combined with the phase of the input mixture and passed through an inverse STFT to recover each component's waveform. In the code snippet below, you can find the PyTorch code for an ASN.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Code from Hang Zhao (@hangzhaomit)
class InnerProd(nn.Module):
    def __init__(self, fc_dim):
        super(InnerProd, self).__init__()
        self.scale = nn.Parameter(torch.ones(fc_dim))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, feat_img, feat_sound):
        sound_size = feat_sound.size()
        B, C = sound_size[0], sound_size[1]
        feat_img = feat_img.view(B, 1, C)
        z = torch.bmm(feat_img * self.scale, feat_sound.view(B, C, -1)) \
            .view(B, 1, *sound_size[2:])
        z = z + self.bias
        return z

    def forward_nosum(self, feat_img, feat_sound):
        (B, C, H, W) = feat_sound.size()
        feat_img = feat_img.view(B, C)
        z = (feat_img * self.scale).view(B, C, 1, 1) * feat_sound
        z = z + self.bias
        return z

    # inference purposes
    def forward_pixelwise(self, feats_img, feat_sound):
        (B, C, HI, WI) = feats_img.size()
        (B, C, HS, WS) = feat_sound.size()
        feats_img = feats_img.view(B, C, HI*WI)
        feats_img = feats_img.transpose(1, 2)
        feat_sound = feat_sound.view(B, C, HS * WS)
        z = torch.bmm(feats_img * self.scale, feat_sound) \
            .view(B, HI, WI, HS, WS)
        z = z + self.bias
        return z


class Bias(nn.Module):
    def __init__(self):
        super(Bias, self).__init__()
        self.bias = nn.Parameter(torch.zeros(1))
        # self.bias = nn.Parameter(-torch.ones(1))

    def forward(self, feat_img, feat_sound):
        (B, C, H, W) = feat_sound.size()
        feat_img = feat_img.view(B, 1, C)
        z = torch.bmm(feat_img, feat_sound.view(B, C, H * W)).view(B, 1, H, W)
        z = z + self.bias
        return z

    def forward_nosum(self, feat_img, feat_sound):
        (B, C, H, W) = feat_sound.size()
        z = feat_img.view(B, C, 1, 1) * feat_sound
        z = z + self.bias
        return z

    # inference purposes
    def forward_pixelwise(self, feats_img, feat_sound):
        (B, C, HI, WI) = feats_img.size()
        (B, C, HS, WS) = feat_sound.size()
        feats_img = feats_img.view(B, C, HI*WI)
        feats_img = feats_img.transpose(1, 2)
        feat_sound = feat_sound.view(B, C, HS * WS)
        z = torch.bmm(feats_img, feat_sound) \
            .view(B, HI, WI, HS, WS)
        z = z + self.bias
        return z
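
To see how the pieces fit together, here is a hedged end-to-end sketch: the synthesizer scores one pixel's appearance vector against the sound feature maps, the thresholded score becomes a binary mask, and the masked mixture spectrogram is inverted with the mixture's own phase. All shapes, STFT parameters, and the 0.5 threshold are assumptions for illustration.

# Hedged sketch of mask prediction and waveform recovery (assumed shapes/parameters).
n_fft, hop = 1022, 256
window = torch.hann_window(n_fft)
mixture = torch.randn(1, 65280)                                # dummy mono mixture
spec = torch.stft(mixture, n_fft, hop_length=hop,
                  window=window, return_complex=True)          # (1, 512, 256) complex

synth = InnerProd(fc_dim=32)
pixel_feat = torch.randn(1, 32)                                # appearance vector from the VAN
sound_feats = torch.randn(1, 32, spec.size(1), spec.size(2))   # feature maps from the AAN
mask = (torch.sigmoid(synth(pixel_feat, sound_feats)) > 0.5).float()   # (1, 1, 512, 256)

masked_mag = spec.abs() * mask.squeeze(1)                      # keep only the masked magnitude
separated = torch.polar(masked_mag, torch.angle(spec))         # reuse the mixture phase
waveform = torch.istft(separated, n_fft, hop_length=hop, window=window)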

Now, as I mentioned earlier, this solution might not be enough for separating sound coming from visually similar objects, as the appearance features may get fooled during the synthesizer phase. Therefore we need another network for analyzing the motion of the sound-producing objects. This additional module was proposed by Zhao et al. [2].

The fourth component, the Motion Analysis Network (MAN), takes the video frames as input and predicts a dense trajectory feature map in three major steps. In the first step, we can use a dense optical flow estimator such as PWC-Net (chosen for its lightweight design and speed) to extract dense optical flow vectors for the input frames.

In the next step, the network uses the extracted dense optical flow to predict dense trajectories. In basic terms, let a pixel's spatial location be I_t = (x_t, y_t) and the dense optical flow at time t be ω_t = (u_t, v_t). Then the estimated position at time t+1 is I_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + ω_t(x_t, y_t).

The concatenation of these estimated coordinates (I_t, I_{t+1}, I_{t+2}, …) forms the full trajectory of a pixel. In the third step, the estimated dense trajectories are fed to a CNN to extract deep trajectory features; the choice of CNN is not fixed and can be arbitrary. A small sketch of the trajectory chaining follows the next paragraph.

Zhao et al. [2] propose to use an I3D model, which is well known for capturing spatiotemporal features. I3D has a compact design that inflates a 2D CNN into 3D to bootstrap 3D filters from pre-trained 2D filters.
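
To make the second step concrete, here is a minimal sketch of chaining dense optical flow into per-pixel trajectories. The flow tensor is a placeholder; a real pipeline would obtain it from an estimator such as PWC-Net and would typically use bilinear sampling for sub-pixel positions instead of the nearest-neighbour lookup used here.

import torch

# Sketch: chain per-frame flow into dense trajectories, I_{t+1} = I_t + w_t(I_t).
def dense_trajectories(flows):
    # flows: (T, 2, H, W) flow from frame t to t+1, in pixels
    T, _, H, W = flows.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing='ij')
    pos = torch.stack([xs, ys])                      # (2, H, W) starting positions
    traj = [pos]
    for t in range(T):
        # nearest-neighbour lookup of the flow at the current (rounded) position
        x = pos[0].round().clamp(0, W - 1).long()
        y = pos[1].round().clamp(0, H - 1).long()
        pos = pos + flows[t, :, y, x]                # advance every pixel by its flow
        traj.append(pos)
    return torch.stack(traj)                         # (T+1, 2, H, W) trajectories

trajectories = dense_trajectories(torch.zeros(8, 2, 64, 64))   # dummy zero flow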

A Graphical Illustration of the three steps of the MAN as suggested in [2] (source)

The question that still remains unanswered is how to incorporate these trajectory features in our initial model framework. To do so, first, we have to fuse these features with the appearance features that were generated as a part of the first component (VAN).

A simple way to do this fusion is to derive an attention map from the appearance features by convolving them down to a single channel and passing the result through a sigmoid, yielding a spatial attention map. This attention map can then be multiplied with the trajectory features to focus only on the important trajectories, after which the appearance and attended trajectory features are concatenated.

After this step, we can either use these fused features directly in place of the old appearance features, or go a step further and align the visual and sound features in time by applying Feature-wise Linear Modulation (FiLM) to the sound features and feeding the fused result into the Audio U-Net decoder (as suggested by Zhao et al.). In the second case (using FiLM), we no longer need the audio synthesizer network, and the U-Net decoder can be rewritten to directly predict the binary masks.
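
Below is a hedged sketch of the attention-based fusion just described (the FiLM variant is not shown). The module name AVFusion and the channel sizes are illustrative assumptions.

import torch
import torch.nn as nn

# Sketch: a 1x1 conv + sigmoid turns appearance features into a spatial attention
# map, the attention gates the trajectory features, and the two are concatenated.
class AVFusion(nn.Module):
    def __init__(self, app_channels=32):
        super(AVFusion, self).__init__()
        self.attn = nn.Conv2d(app_channels, 1, kernel_size=1)

    def forward(self, appearance, trajectory):
        # appearance: (B, C_a, H, W) from the VAN, trajectory: (B, C_t, H, W) from the MAN
        attention = torch.sigmoid(self.attn(appearance))   # (B, 1, H, W) spatial attention
        gated = trajectory * attention                     # keep only salient trajectories
        return torch.cat([appearance, gated], dim=1)       # (B, C_a + C_t, H, W)

fused = AVFusion(32)(torch.randn(2, 32, 14, 14), torch.randn(2, 32, 14, 14))  # -> (2, 64, 14, 14)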

The Self-supervised framework

In this blog section, we will discuss two major training frameworks that are necessary for training a model to learn the cross-modal context in a self-supervised way.

Mix and Separate Framework (MSF)

The Mix and Separate Framework as suggested in [1] (source)

The mix-and-separate training framework artificially creates a complex auditory scene for the model under training. MSF forces the model to analyze randomly generated complex auditory scenes and frames a situation in which it has to separate and ground the mixed sounds.

The generated data is not directly present in the training set, so MSF effectively acts as automatic data augmentation. MSF leverages the fact that audio signals are additive, so we can mix sounds from different video samples to generate a complex auditory input for the model.

On the other hand, this framework also creates a self-supervised learning objective for the model: separate the mixture and restore each sound to the original waveform it had before mixing, using the visual input associated with the sound mixture.

For the mix-and-separate framework, we randomly sample N video clips from the training set and, in the simplest case, mix the sound components of any two of them, serving the model with the audio mixture and the corresponding frames. It is important to note that although the training targets are well defined during training, the process is still unsupervised, as we do not use any data labels or data-sampling assumptions. A sketch of this data preparation is shown below.
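
The following is a hedged sketch of the mix-and-separate data preparation for two clips; the waveforms, STFT parameters, and the "dominant source" rule for the binary targets are assumptions for illustration.

import torch

# Sketch: add two waveforms to form the mixture and derive per-source
# binary target masks by comparing their magnitude spectrograms.
n_fft, hop = 1022, 256
window = torch.hann_window(n_fft)

wav_a, wav_b = torch.randn(65280), torch.randn(65280)   # two sampled clips
mixture = wav_a + wav_b                                  # audio is additive

def mag(wav):
    return torch.stft(wav, n_fft, hop_length=hop, window=window,
                      return_complex=True).abs()

# A source "owns" a time-frequency bin if it dominates the mixture there.
mask_a = (mag(wav_a) >= mag(wav_b)).float()              # binary target for clip A
mask_b = 1.0 - mask_a                                    # binary target for clip B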

Curriculum Learning (CL)

By definition, curriculum learning is a training strategy in which the model starts with only easy examples of a task and the difficulty is then gradually increased. CL is a smart sampling technique that can replace the random sampling of the MSF.

Inspired by the observation that models trained on a single class of instruments suffer from overfitting due to class imbalance, we can use a multi-stage training curriculum that starts by sampling easily separable sound sources. Such a curriculum helps bootstrap the model with a good weight initialization for better convergence on the harder cases.

Note: The learning targets (spectrogram masks) can be either binary or ratio masks. For binary masks, we use a per-pixel sigmoid cross-entropy loss; for ratio masks, we use a per-pixel L1 loss. Also, due to possible interference, the values of a ground-truth ratio mask do not necessarily stay within the range [0, 1].
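
As a hedged sketch of these two loss choices (all tensors below are placeholders with assumed shapes):

import torch
import torch.nn.functional as F

# Sketch: per-pixel sigmoid cross-entropy for binary masks, per-pixel L1 for ratio masks.
mask_logits = torch.randn(4, 1, 512, 256)                    # raw (pre-sigmoid) predictions
pred_mask = torch.rand(4, 1, 512, 256)                       # predicted ratio-mask values
binary_target = (torch.rand(4, 1, 512, 256) > 0.5).float()   # binary ground-truth mask
ratio_target = torch.rand(4, 1, 512, 256) * 1.2              # ratio mask, may exceed 1

bce_loss = F.binary_cross_entropy_with_logits(mask_logits, binary_target)  # binary case
l1_loss = F.l1_loss(pred_mask, ratio_target)                               # ratio case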

The Mathematics under the Hood

In deep learning applications, we often tend to rely on the network to learn the mathematical model on its own but if we peek under the hood, we will observe numerous interesting mathematical facts.

In the case of cross-modal association, we assume that each modality generates significant events (onsets). If the generated onsets repeatedly coincide in time (e.g., the movement of a guitar string makes a sound), the two are assumed to be correlated.

In mathematical terms, the more the onsets coincide, the higher the likelihood of cross-modal correspondence; conversely, if onset coincidences are rare, the cross-modal correspondence likelihood is low.

Sounding object localization. Overlaid heatmaps show the predicted sound volume at each pixel location (source)

To understand the process as a likelihood-matching algorithm, we assume that all the onsets of each modality are independent and mutually exclusive. Let the video onset be a binary variable V_on and the audio onset a binary variable A_on (I am using binary values just for the sake of explanation). Now, if we pre-train our network on an optimization (likelihood) function of the form,

L = [((A_on)^T ✕ V_on)-(I^T ✕ V_on)]

that increases as the coincidences increase, we can better explain the likelihood maximization behind the cross-modal association. Assuming that the onsets are random variables that are statistically independent of each other and follow the probability law, we can say that L = ∏(P^(onset_match) ✕ (1-P)^(onset_mismatch)), or for a single instance L(i) = P^(onset_match) ✕ (1-P)^(onset_mismatch). Taking the log, we can rewrite it as:

Log(L(i)) = onset_match ✕ log(P) + onset_mismatch ✕ log(1-P)

Finally, onset_match occurs when V_on and A_on are both 1 or both 0, so onset_match = V_on ✕ A_on + (1-V_on)✕(1-A_on). Therefore, when our network optimizes for cross-modal correspondence modelling, it is indirectly equivalent to maximizing the matching likelihood of features from the cross-modal sources.
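
For readers who prefer compact notation, the same per-instance objective can be written in LaTeX form (same simplified symbols, with onset_match abbreviated as m):

\log L_i = m \log P + (1 - m)\log(1 - P), \qquad m = V_{on} A_{on} + (1 - V_{on})(1 - A_{on})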

Note: Due to the limitation of expressing complex mathematical equations in the blog paragraphs, I have simplified the notations to be easily formattable in paragraph format. “^” stands for power, “T” for matrix transpose, and “I” for the identity matrix.

Conclusion

In this blog, we discussed how we can make a system that can learn from unlabeled videos to separate auditory signals and also locate them in the visual input. We started with a simple architecture and showed how the initial system can be enhanced to model the cross-modal context more accurately even when the sound sources are visually similar.

In the end, I would conclude on the note that the desire to understand the world from a human perspective has drawn the attention of the deep learning community to audio-visual learning. This type of learning will not only help solve many existing problems, but will also lay the foundations for the future development of self-supervised learning and its applications to real-world problems.

My blogs are a reflection of what I worked on and simply convey my understanding of these topics. My interpretation of deep learning can be different from that of yours, but my interpretation can only be as inerrant as I am.
