How Neural Networks hallucinate missing pixels for image inpainting
When we see an object, certain neurons in our brain’s visual cortex light up with activity, but when we take hallucinogenic drugs, these drugs overwhelm our serotonin receptors and lead to a distorted visual perception of colours and shapes.
Similarly, deep neural networks, which are modelled on structures in our brain, store data in huge tables of numeric coefficients that defy direct human comprehension. But when the activations of these networks are overstimulated (virtual drugs), we get phenomena like neural dreams and neural hallucinations.
Dreams are the mental conjectures that are produced by our brain when the perceptual apparatus shuts down, whereas hallucinations are produced when this perceptual apparatus becomes hyperactive. In this blog, we will discuss how this phenomenon of hallucination in neural networks can be utilized to perform the task of image inpainting.
Image and Video Inpainting
Image inpainting is the art of synthesizing alternative contents for the reconstruction of missing or deteriorated parts of an image such that the modification is semantically correct and visually realistic. Image inpainting has received significant attention from the computer vision and image processing community throughout the past years and led to key advances in the research and application field.
Traditionally, inpainting is achieved using either exemplar-based approaches, which reconstruct the missing region one pixel/patch at a time from plausible candidates while maintaining neighbourhood consistency, or diffusion-based approaches, which propagate local semantic structures into the missing parts.
However, irrespective of the employed method, the core challenge of image inpainting is to maintain a global semantic structure and generate realistic texture details for the unknown regions. The traditional approaches fail to achieve this global semantic structure and realistic textures when the missing regions are large or highly irregular.
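To make the diffusion-based idea concrete, here is a minimal sketch of it, assuming the simplest possible variant: every missing pixel is repeatedly replaced by the average of its four neighbours, so known values slowly spread into the hole. The function name `diffusion_inpaint` and the toy ramp image are our own illustrative choices, not part of any particular library or paper.

```python
import numpy as np

def diffusion_inpaint(image, hole, iterations=500):
    """Fill pixels where `hole` is True by repeatedly averaging their 4 neighbours."""
    result = image.astype(float).copy()
    result[hole] = result[~hole].mean()  # neutral starting guess for the hole
    for _ in range(iterations):
        padded = np.pad(result, 1, mode="edge")
        neighbour_mean = (
            padded[:-2, 1:-1] + padded[2:, 1:-1]
            + padded[1:-1, :-2] + padded[1:-1, 2:]
        ) / 4.0
        result[hole] = neighbour_mean[hole]  # known pixels are never modified
    return result

# Toy example: a smooth horizontal ramp with a square hole in the middle.
ramp = np.tile(np.linspace(0.0, 1.0, 12), (12, 1))
hole = np.zeros_like(ramp, dtype=bool)
hole[4:8, 4:8] = True
damaged = ramp.copy()
damaged[hole] = 0.0

restored = diffusion_inpaint(damaged, hole)
max_error = np.abs(restored - ramp).max()
```

A smooth ramp is recovered almost exactly, but the same averaging would blur a textured region, which is the kind of failure described above for large or irregular holes.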
Therefore a component that can provide a plausible hallucination for the missing pixels is needed to tackle such inpainting problems. To design these hallucinative components, researchers generally choose deep neural networks to provide high-order models of natural images.
There are a plethora of use cases that use image inpainting to retouch undesired regions, remove distracting objects or to complete occluded regions in images. It can also be extensively applied to tasks including video un-cropping, re-targeting, re-composition, rotation, and stitching.
Similar to image inpainting, video inpainting aims to fill in a given space-time region with newly synthesized content. It reconstructs the missing regions of a given video sequence with pixels that are both temporally and spatially coherent.
Traditional video inpainting algorithms formulate the problem as a patch-based optimization task: following the traditional image inpainting pipeline, they fill the missing regions by sampling spatio-temporal patches from the known regions and then solve the resulting minimization problem. However, most video inpainting algorithms face an obvious challenge due to the complex motion of objects and cameras.
These challenges are mostly due to the assumption of a smooth and homogeneous optical motion field in the unknown region. Similar to image inpainting, hallucinating a plausible motion field for the missing regions helps to tackle these challenges and generate seamless content for the video sequence, making the alteration almost imperceptible.
Video inpainting is mostly used for video restoration (removing scratches), editing and special-effects workflows (removing unwanted objects, watermarks and logos), and video stabilization.
Hallucinating without any Prior Learning
The excellent performance of convolutional neural networks is generally attributed to their ability to learn realistic image priors from a huge amount of data.
In case you are wondering what “image prior” means, it is the “prior information” about our image dataset that is used to ease the choice of processing parameters and resolve indeterminacies in image processing, like the vector representations that a CNN learns after training. On the contrary, research like  shows that even the structure of a generative CNN is capable of capturing a lot of low-level image statistics prior to any data-intensive learning.
The main idea is that even a randomly-initialized convolutional neural network can be used as a handcrafted prior with high-quality performance in standard inverse problems such as image inpainting and denoising.
This idea not only highlights the inductive bias captured by the generator networks but also bridges the gap between deep learning CNNs and learning-free algorithms based on handcrafted image priors. In this section, we will focus on how this technique of attaining image priors can be used to hallucinate unseen pixels in an image.
It is a consensus that the structure of a CNN plays a key role in the performance of the network and also that the network structure must resonate with the structure of the data. But at the same time, we cannot expect an untrained network F(θ) to know about the specific appearance details of certain object categories.
However, as suggested in , even a sequence of untrained convolutional filters has the ability to capture multi-scale low-level image statistics between pixel neighbourhoods due to their properties of locality and translation invariance.
These statistics are sufficient to model the conditional image distribution p(x_filled|x_missing) required in the image inpainting problem. During formulation, this distribution is written in a more generic manner: it is stated as an energy minimization problem (in our case, the energy can be a loss function such as MSE). We assume that the ground truth belongs to a manifold of points x that have null energy E(x, x_in) = 0.
X = argmin [ E ( x_prior_estimated, x_input_with_missing_pixels ) + R(x_prior_estimated) ] | where the minimum is taken over x_prior_estimated, E can be a loss function like MSE, x_prior_estimated is the output of the randomly initialized network, x_input_with_missing_pixels is the input that needs to be inpainted, and the regularizer R can be omitted during solving, considering the implicit prior captured by the network parameters.
To start the hallucination, we first need an image with missing or occluded pixels marked by a binary mask (M), which is 1 for known pixels and 0 for missing ones. Now if a randomly initialized CNN estimates the whole image, including the missing region, we can calculate the loss as:
Loss = || (x_prior_estimated − x_input_with_missing_pixels) ⦿ M ||² | where ⦿ is element-wise multiplication
The above-mentioned equation is independent of the actual values of the missing pixels, which makes it impossible to optimize it directly over pixel values. Therefore the loss is minimized over the network parameters instead (the reparametrization), and x_prior_estimated is read off as the network output after optimization. The produced hallucination leads to almost perfect results in many cases, with virtually no seams or artefacts.
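The following sketch illustrates why the masked loss cannot be optimized directly over pixel values. It assumes a squared-error loss and a mask that is 1 for known pixels and 0 for missing ones; the helper names `masked_mse` and `masked_mse_grad` are hypothetical and only serve this illustration. The CNN itself is left out on purpose: the point is only that the loss and its gradient carry no information about the hole, so the hole content has to come from the network that produces the estimate.

```python
import numpy as np

def masked_mse(estimate, observed, mask):
    """Squared error counted only where mask == 1 (the known pixels)."""
    diff = (estimate - observed) * mask
    return float(np.sum(diff ** 2))

def masked_mse_grad(estimate, observed, mask):
    """Gradient of masked_mse w.r.t. the estimate (mask is assumed binary)."""
    return 2.0 * (estimate - observed) * mask

rng = np.random.default_rng(0)
observed = rng.random((6, 6))      # stand-in for x_input_with_missing_pixels
mask = np.ones((6, 6))
mask[2:4, 2:4] = 0.0               # a 2x2 hole of missing pixels
estimate = rng.random((6, 6))      # stand-in for x_prior_estimated

loss_a = masked_mse(estimate, observed, mask)

changed = estimate.copy()
changed[2:4, 2:4] += 5.0           # alter only the pixels inside the hole
loss_b = masked_mse(changed, observed, mask)

grad = masked_mse_grad(estimate, observed, mask)
```

Here `loss_a` and `loss_b` are identical and `grad` is zero inside the hole, so gradient descent on the pixels would never touch the missing region. Optimizing the network parameters avoids this, because every output pixel, including those in the hole, depends on shared convolutional weights.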
However, this approach has certain drawbacks: it struggles to inpaint large holes or missing regions with highly semantic content. But these drawbacks are acceptable considering that this method is not trained on any supervised data and works surprisingly well in most other situations.
The achieved hallucinations highlight that:
- The network utilizes the global and local context of the image and interpolates the missing region with textures from the known parts.
- There is a relationship between traditional self-similarity priors and deep learning architectures, which also suggests and explains the benefits of deep architectures with skip connections for general recognition tasks.
Run this colab notebook if you want to try out the network on image inpainting (Courtesy: DmitryUlyanov).
Hallucinating after learning on Images
In this section, we will discuss some relevant network architectural components that help deep neural networks hallucinate. Before we discuss the components, it’s important to investigate the human behaviour that inspires the architecture of these components to achieve better hallucinations for the image restoration task.
The basic process mainly involves two steps, conceptualization and painting, which maintain the global structural consistency and local pixel continuity of the image.
During the painting process, humans generally draw new lines from the end nodes of the previously drawn lines to ensure neighbouring pixel continuity and consistency. Keeping this in mind, we will discuss various components suggested in multiple research papers that aim to fulfil a similar purpose.
Recent image inpainting studies have shown good quality results by utilizing the contextual information using mainly two types of methods. The first family of methods uses spatial attention which utilizes the neighbouring pixel features as a reference to restore the unknown pixels, thus ensuring the semantic consistency of the hallucinated content w.r.t. the global context.
The second family of methods uses conditioned values of the valid pixels to predict the missing pixels. Nevertheless, both types of methods sometimes fail to generate semantically flawless content and artefact-free boundaries. But if we utilize architectures like coherent semantic attention modules and gated convolutions, as suggested in the papers  and , we can overcome most of these challenges.
We will also look into the concept of periodic activation functions like SIREN for generating better implicit neural representations, thus resulting in better neural hallucinations.
Coherent semantic attention
Inspired by the human methodology of conceptualization and painting, the authors of  introduced the coherent semantic attention (CSA) layer. The CSA layer initializes each missing pixel value with the most similar feature pixel in the known region. These initialized pixels are then iteratively optimized using the adjacent pixels’ values, assuming spatial consistency.
The advantages of this process are two-fold: the first benefit is the global semantic consistency introduced by the initialization, and the second is the local feature coherency ensured by the optimization iterations.
The original network first computes a rough prediction (I_p) using a simple autoencoder (so that similarity can be computed for the initialization process) and then feeds the rough prediction (I_p) and the input image (I_in) to a CSA-equipped encoder for refinement. The refinement network performs the two steps mentioned above (initialization and iterative optimization) on the input (I_p + I_in) to output the final result (I_r).
Iterative Optimization in simple terms
pred_pixel = A + B
A = similarity(pred_pixel, adjacent_pixel) × adjacent_pixel
B = similarity(pred_pixel, most_similar_pixel) × most_similar_pixel
where pred_pixel is the pixel to be hallucinated, adjacent_pixel is the adjacent pixel and most_similar_pixel is the pixel with the most similar feature in the known region. The two similarities are normalized so that they sum to one.
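The update rule above can be sketched in a few lines. This is a simplified illustration, not the CSA layer itself: it assumes that features are plain vectors, that similarity is cosine similarity clamped at zero, and that the "adjacent pixel" is simply the previously filled entry in a 1-D sequence. The names `cosine_similarity` and `csa_fill` are made up for this example.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def csa_fill(missing, known):
    """missing: (n, d) rough features inside the hole; known: (m, d) features outside it."""
    filled = np.zeros_like(missing, dtype=float)
    for i, feature in enumerate(missing):
        # Initialization: look up the most similar feature in the known region.
        sims = [cosine_similarity(feature, k) for k in known]
        best = int(np.argmax(sims))
        most_similar = known[best]
        d_max = max(sims[best], 0.0)
        if i == 0:
            filled[i] = most_similar      # no previously filled neighbour yet
            continue
        # Optimization: blend with the previously filled (adjacent) feature.
        adjacent = filled[i - 1]
        d_adj = max(cosine_similarity(feature, adjacent), 0.0)
        total = d_adj + d_max
        if total == 0.0:
            filled[i] = most_similar
            continue
        filled[i] = (d_adj / total) * adjacent + (d_max / total) * most_similar
    return filled

known = np.array([[1.0, 0.0], [0.0, 1.0]])
missing = np.array([[0.9, 0.1], [0.6, 0.4], [0.1, 0.9]])
filled = csa_fill(missing, known)
```

Dividing both similarities by their sum is what keeps each filled feature a weighted average of its neighbour and its best match, so the result stays close to the known region while changing smoothly from one position to the next.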
Gated convolution
When we use vanilla CNNs, the convolutional filters apply the same operation to all pixels, irrespective of whether they are located in the known or the unknown region. This drawback of vanilla CNNs leads to blurry outputs with visual artefacts in the colour and edge domain.
To handle such problems, concepts like partial convolution and gated convolution have been suggested in recent studies [4,3]. The main idea behind gated convolution is to learn a dynamic feature gating mechanism for every spatial location and image channel.
The gating values are nothing but a soft mask that is automatically calculated from the data/features and multiplied back onto the features to regulate the values at certain spatial and channel indices.
For example, if the input feature is I_in and the learnable convolution weight matrix for the gating mechanism is W_g, then the soft mask is calculated as:
gating_values = sigmoid ( ∑ ∑ W_g · I_in )
The calculated soft mask is then multiplied back to the original feature. It is important to note that multiplying the mask before or after convolution is equivalent when convolutions are stacked layer-by-layer in the CNN. Gated convolution has two significant advantages:
- Firstly, it makes the hallucinative components more robust to missing regions of arbitrary shape.
- Secondly, it enables the network to learn to select the feature not only according to the mask and background but also according to the semantic segmentation information in some channels.
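As a rough illustration of the gating mechanism, the sketch below applies two separate convolutions to the same single-channel input: one produces features and the other produces the soft mask through a sigmoid. The second weight matrix (called `w_feature` here), the `tanh` activation, the zero padding and the odd kernel size are our assumptions for the sake of a runnable example; a real layer would work on many channels with learned weights.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Single-channel 2D cross-correlation with zero padding (odd kernel sizes)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_conv(x, w_feature, w_gate):
    feature = conv2d_same(x, w_feature)
    gate = sigmoid(conv2d_same(x, w_gate))   # soft mask with values in (0, 1)
    return np.tanh(feature) * gate, gate     # the gate scales each location

rng = np.random.default_rng(0)
image = rng.random((8, 8))
image[3:5, 3:5] = 0.0                        # hole with missing pixels
w_feature = rng.normal(size=(3, 3))
w_gate = rng.normal(size=(3, 3))

output, gate = gated_conv(image, w_feature, w_gate)
```

Because the gate is computed from the data rather than fixed by a rule, training can push it towards zero where the input is unreliable and towards one where it is trustworthy, for every spatial location independently.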
SIREN — Sinusoidal Representation Networks
The task of image inpainting involves modelling fine-grained details of the image signals. But most commonly used methods fail to learn robust implicit neural representations of the image’s spatial derivatives, which may or may not be important during the generative process (depending on the difficulty of the task).
To tackle this rarely considered issue,  proposes to leverage periodic activations for robustly modelling complex implicit neural representations. Unlike the traditional approach of using discrete representations for modelling different types of signals in images, SIREN uses the sine as a periodic activation function.
As the derivative of the sine is a cosine, i.e. a phase-shifted sine, the derivative of a SIREN is itself a SIREN and inherits its properties, which enables the supervision of any derivative of a SIREN with complicated signals. The authors of SIREN demonstrated its capability on image inpainting by fitting a 5-layer MLP SIREN to an image input and enforcing a prior on the representation.
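To see what such a network looks like, here is a small forward-pass sketch of a sine-activated MLP that maps pixel coordinates to intensities, plus a check of the "derivative is a phase-shifted sine" property for a single unit. The frequency scale `W0 = 30` and the initialization bounds are assumptions of this sketch, and the functions `init_siren`, `siren_forward`, `sine_layer` and `sine_layer_derivative` are illustrative names; no training is performed.

```python
import numpy as np

W0 = 30.0  # assumed frequency scale applied inside every sine

def init_siren(layer_sizes, rng):
    """Random weights; the bounds are an assumed initialization scheme."""
    params = []
    for idx, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        bound = 1.0 / n_in if idx == 0 else np.sqrt(6.0 / n_in) / W0
        w = rng.uniform(-bound, bound, size=(n_in, n_out))
        b = rng.uniform(-bound, bound, size=n_out)
        params.append((w, b))
    return params

def siren_forward(params, coords):
    """coords: (n, 2) pixel coordinates -> (n, 1) predicted intensities."""
    h = coords
    for w, b in params[:-1]:
        h = np.sin(W0 * (h @ w + b))   # periodic activation
    w, b = params[-1]
    return h @ w + b                   # final layer is linear

def sine_layer(x, w, b):
    return np.sin(W0 * (w * x + b))

def sine_layer_derivative(x, w, b):
    # d/dx sin(W0 (w x + b)) = W0 w cos(...), written as a phase-shifted sine
    return W0 * w * np.sin(W0 * (w * x + b) + np.pi / 2)

rng = np.random.default_rng(0)
params = init_siren([2, 32, 32, 32, 32, 1], rng)   # 5 linear layers
axis = np.linspace(-1.0, 1.0, 16)
coords = np.stack(np.meshgrid(axis, axis, indexing="ij"), axis=-1).reshape(-1, 2)
pixels = siren_forward(params, coords)             # one value per coordinate
```

For inpainting, such a network would be fitted only to the known pixels and then queried at the coordinates of the missing ones, which is possible because the representation is continuous in the coordinates.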
The results of the method can be seen in the figure below. SIREN therefore deserves a mention among these components, and it opens several exciting avenues for future work on many types of inverse problems.
Hallucinating after learning on videos
Unlike image inpainting, video inpainting focuses on filling the space-time regions in a given video sequence with generated content. To synthesize this content, most traditional approaches used patch-based synthesis.
But after the rise of learning-based methods, some of the most successful approaches are flow-based approaches that jointly synthesize optical flow and colour to enable high-resolution outputs.
The synthesized colour is generally propagated to the missing spatio-temporal regions along the flow trajectories, which ensures temporal coherence and also alleviates memory problems. In this section, we will discuss the flow-based approach suggested in the paper .
The key component of flow-based video inpainting approaches is the accurate and sharp edge synthesis of the optical flow fields for the object in motion. The method proposed in  “Flow-edge Guided Video Completion” aims specifically at accurate flow completion. To achieve this, the network’s first stage computes forward and backward flow between adjacent and non-adjacent frames of the sequence.
Then, using the computed flow, the flow of the missing pixel regions is computed. To compute the initial flow, they first use a Canny edge detector to extract the edges of the known region and then use EdgeConnect to train a flow edge completion network. This stage of the network is the major hallucinative component of the architecture, whose job is to hallucinate the flow edges in the missing region.
The hallucinated edges of the flow maps are typically the most salient features that serve as the key input to produce piecewise-smooth flow completion. Once the hallucination of the optical flow is complete, the network follows the backward and forward flow trajectories to propagate two candidate pixels for each missing pixel.
The network also obtains three non-local flow vectors from the sequence by checking three temporally distant frames. Finally, the candidate pixel values are fused in the gradient domain for each missing pixel using a confidence-weighted average.
This type of fusion in the gradient domain helps remove visual artefacts and visible colour seams.
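The fusion step can be sketched as a plain confidence-weighted average. This is a simplified illustration under our own assumptions: each missing pixel has five candidates (two from the forward and backward trajectories, three non-local ones), a confidence of zero marks a candidate that could not be found, and the values being fused are scalars that could stand for colour gradients. The function name `fuse_candidates` is hypothetical, and the step that turns fused gradients back into colours is not shown.

```python
import numpy as np

def fuse_candidates(values, confidences, eps=1e-8):
    """values, confidences: (num_candidates, num_pixels) -> fused (num_pixels,)."""
    values = np.asarray(values, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    # Normalize confidences per pixel so that the weights sum to (almost) one.
    weights = confidences / (confidences.sum(axis=0, keepdims=True) + eps)
    return (weights * values).sum(axis=0)

# Five candidates for each of two missing pixels.
values = [
    [0.2, 0.8],   # forward-flow candidate
    [0.4, 0.6],   # backward-flow candidate
    [0.9, 0.1],   # non-local candidate 1
    [0.3, 0.7],   # non-local candidate 2
    [0.5, 0.5],   # non-local candidate 3
]
confidences = [
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]

fused = fuse_candidates(values, confidences)
```

Candidates with zero confidence drop out of the average entirely, so a single unreliable trajectory cannot leak into the result.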
In this blog, we focused on how hallucination in neural networks is utilized to perform the task of image inpainting. We discussed three major scenarios that covered the concepts of hallucinating pixels without any prior learning, after learning on images and after learning on videos.
All of the discussed cases hold deep meaning and reflect the rich history of research in image/video inpainting in their own way. Nevertheless, we emphasized how all of these methods share the common goal of hallucinating unseen pixels and how each of them tackles the inverse problem of image/video inpainting.
The variety of applications where neural hallucinations can be applied is vast and limited only by the ingenuity of their designers.