A 6 Minute Introduction to the Technology Powering Deepfakes
Understanding how an Autoencoder works and building a simple Autoencoder in Python
Deepfakes have lately been garnering widespread media attention, primarily because they are used to spread fake news and hoaxes, commit financial fraud and more. The word deepfake is a portmanteau of Deep Learning and Fake, and it usually refers to an image or a video in which one person is replaced by a different person.
Creating synthetic videos and editing pictures to generate fake content is not a new concept, but this technology uses techniques from deep learning and artificial intelligence to morph images and videos so that the resulting content has a very high potential to deceive. Look at a small example below:
The original video on the left, of the actress Amy Adams, is modified to have the face of actor Nicolas Cage on the right. An early landmark in this area was the 1997 paper Video Rewrite: Driving Visual Speech with Audio by Christoph Bregler, Michele Covell and Malcolm Slaney, in which they modified existing video footage of a person speaking to depict that person mouthing the words contained in a different audio track.
It was the first system to fully automate this kind of facial reanimation, and it did so using machine learning techniques to make connections between the sounds produced by a video’s subject and the shape of the subject’s face.
A lot of academic research has been done on the subject since then, improving the techniques and extending the morphing first to the whole face and then to the whole body. The technology draws on many techniques from artificial intelligence and deep learning, but here we will discuss one of its core components: autoencoders.
What is an Autoencoder?
An autoencoder is simply a neural network that is trained to reproduce its input at its output. Now why would we need such a neural network? The answer will become clear as we dig deeper into autoencoders. Let's look at how autoencoders are designed.
A simple autoencoder is a neural network which can have one or more hidden layers. A typical neural network classifier is trained with the backpropagation algorithm on input examples and their labels. Autoencoders, on the other hand, do not need labels. An autoencoder uses the input data itself as the target it trains on, since it is just trying to copy its input to its output. The training is therefore unsupervised or, more precisely, self-supervised.
This process of learning to copy the input to the output produces a code in the hidden layer, which is a lower-dimensional representation of the original data.
This is also known as the latent space representation of the data. The part of the network before the code is called the encoder and the part after it the decoder. The encoder encodes, or compresses, the data into a smaller dimension, and the decoder decodes, or reconstructs, the original data from this compressed form.
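The encoder, code and decoder described above can be sketched in a few lines of NumPy. This is only an illustrative toy under stated assumptions: the layers are purely linear, the data is synthetic, and the names `w_enc`, `w_dec`, `encode`, `decode` and `reconstruction_loss` are made up for this example. The point to notice is that the training target is the input itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): 200 samples with 8 features that lie on a 2-D subspace
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
x = latent @ mixing

input_size, code_size = 8, 2
w_enc = rng.normal(scale=0.1, size=(input_size, code_size))  # encoder weights
w_dec = rng.normal(scale=0.1, size=(code_size, input_size))  # decoder weights

def encode(batch):
    return batch @ w_enc      # compress to the code (latent representation)

def decode(code):
    return code @ w_dec       # reconstruct the input from the code

def reconstruction_loss(batch):
    return float(np.mean((decode(encode(batch)) - batch) ** 2))

initial_loss = reconstruction_loss(x)
learning_rate = 0.01
for _ in range(2000):
    code = encode(x)
    error = decode(code) - x                      # the target is the input itself
    # Gradients of the squared error, up to a constant factor
    grad_dec = code.T @ error / len(x)            # backpropagate through the decoder
    grad_enc = x.T @ (error @ w_dec.T) / len(x)   # and then through the encoder
    w_dec -= learning_rate * grad_dec
    w_enc -= learning_rate * grad_enc
final_loss = reconstruction_loss(x)

print(encode(x).shape)            # (200, 2)
print(initial_loss, final_loss)   # the reconstruction error should go down
```

No labels appear anywhere in the loop: the same array `x` serves as both input and target, which is what makes the training self-supervised.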
As we have seen, autoencoders are mainly a dimensionality reduction (or compression) algorithm, similar to Principal Component Analysis, with a couple of important properties:
- Autoencoders are data specific, which means they are only able to meaningfully compress data similar to what they have been trained on. As in any neural network, the hidden layers learn features of the data they are trained on. This makes them very different from general-purpose compression algorithms like JPEG. So you can't expect an autoencoder trained on MNIST handwritten digits to compress portrait images.
- Autoencoders are lossy, i.e. the output of the autoencoder will not be exactly the same as the input; it will be a close but slightly degraded representation of the original. A lossless algorithm like gzip reconstructs the complete data from the compressed form without any loss. After all, you don't want to zip a file and later find out that the extracted version has only half of the data you actually zipped!
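The difference between lossless and lossy compression can be seen with a small sketch. The lossless half uses Python's standard `zlib` module, which implements the DEFLATE algorithm that gzip also uses. The lossy half is not an autoencoder; it is a deliberately crude stand-in (averaging neighbouring values) chosen only to show what "close but not identical" means.

```python
import zlib

# Lossless: decompressing returns exactly the original bytes
original = bytes(range(256)) * 4
restored = zlib.decompress(zlib.compress(original))
print(restored == original)       # True

# Lossy (toy stand-in for an autoencoder): keep one average per pair of
# values, then "reconstruct" by repeating each average twice
signal = [0.0, 0.2, 0.9, 1.0, 0.4, 0.0]
code = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
reconstruction = [value for value in code for _ in range(2)]

print(len(signal), len(code))     # 6 3
print(reconstruction == signal)   # False
```

The reconstruction has the right length and roughly the right shape, but the detail inside each pair is gone for good, which is the trade-off an autoencoder makes as well.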
How to build an Autoencoder?
As discussed earlier, a simple autoencoder is built from fully connected layers, just like an ordinary feed-forward neural network.
A simple autoencoder will have an input layer, a hidden layer and an output layer, much like a Single Layer Feed Forward Neural Network or SLFN.
The difference is that here the output layer has the same dimension as the input layer. The hidden layer, which is also the code layer of our autoencoder, has the dimension we want our input to be compressed to. For example, an MNIST handwritten digit image of 28x28 pixels becomes a vector of 784 values when flattened, and we can reduce it to, let's say, a vector of 32 elements.
This will be our code size. Code size is a hyperparameter of our autoencoder architecture. Other choices that help decide the architecture include the number of layers, the number of neurons per layer, the type of loss function and so on. As we have seen, autoencoders are trained using backpropagation. Let's try to code and train a simple autoencoder using Python and Keras.
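Before building the network, the sizes mentioned above can be checked with a few lines of arithmetic. This sketch assumes the single-hidden-layer case (784 inputs, a 32-element code, 784 outputs), and `dense_params` is a hypothetical helper written for this example, not a Keras function.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

input_size = 28 * 28      # a flattened MNIST image
code_size = 32            # the hyperparameter discussed above

compression_ratio = input_size / code_size
total_params = dense_params(input_size, code_size) + dense_params(code_size, input_size)

print(input_size)          # 784
print(compression_ratio)   # 24.5
print(total_params)        # 50992
```

A smaller code size gives stronger compression but leaves the decoder less information to reconstruct from, which is why it is treated as a hyperparameter to tune.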
We will make an autoencoder for the MNIST dataset with the following architecture:
The input layer will have a size of 784, the hidden layers a size of 128, and the code a size of 32. The following gist builds the network in Keras. The full code can be found at this GitHub repo.
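from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

# Load MNIST, flatten each 28x28 image to 784 values and scale pixels to [0, 1]
(x_train, _), _ = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

input_size = 784
hidden_size = 128
code_size = 32
input_img = Input(shape=(input_size,))
hidden_1 = Dense(hidden_size, activation='relu')(input_img)
code = Dense(code_size, activation='relu')(hidden_1)
hidden_2 = Dense(hidden_size, activation='relu')(code)
output_img = Dense(input_size, activation='sigmoid')(hidden_2)
autoencoder = Model(input_img, output_img)
# Optimizer and loss are assumed here; the snippet needs a compile step before fit
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(x_train, x_train, epochs=3)
Let's see how the input and output look for this autoencoder: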
On the left is the input to this autoencoder and on the right is the reconstructed image. They are indeed pretty similar, but not exactly the same. If you look closely at the horizontal line of the digit 7, you will be able to spot the difference. Since this was a simple task, our autoencoder performed pretty well.
Like any other neural network, an autoencoder can be fine-tuned by adjusting hyperparameters such as the number of hidden layers and the number of neurons per layer. It also suffers from the same problems, namely overfitting and underfitting. In this case, overfitting means that the autoencoder memorizes the training data, or simply learns the identity function and produces perfect copies of its input without learning useful features.
Autoencoders in Action
Autoencoders have many applications. A few of the major ones are:
- Image Denoising
- Dimensionality Reduction
- Feature Extraction
- Image Compression
- Image Generation
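Taking the first item, image denoising, as an example: a common setup is to corrupt the inputs with noise and ask the autoencoder to reproduce the clean originals. The sketch below only prepares such training pairs with NumPy. The random array stands in for real images, and `noise_factor` is an assumed value rather than a recommended one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for flattened images scaled to [0, 1]; real data would be e.g. MNIST
x_clean = rng.random((16, 784))

noise_factor = 0.3   # assumed value, tune for the dataset
x_noisy = x_clean + noise_factor * rng.normal(size=x_clean.shape)
x_noisy = np.clip(x_noisy, 0.0, 1.0)   # keep pixels in the valid range

# A denoising autoencoder would then be trained with noisy inputs and clean
# targets, for example autoencoder.fit(x_noisy, x_clean, ...) with a Keras model
print(x_noisy.shape)   # (16, 784)
```

The only change from a plain autoencoder is that input and target are no longer the same array, so the network cannot succeed by copying and has to learn what the clean data looks like.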
Deepfakes use an autoencoder to learn a latent space representation of a face, followed by a decoder which reconstructs the image from that representation. They exploit this architecture by having a universal encoder which encodes any person into the latent space.
The latent representation captures the key features of that person's face and pose. It can then be decoded with a model trained specifically for the target. This means the target's detailed information is superimposed on the underlying facial and body features of the original video, as represented in the latent space. This is the simplest deepfake technique.
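The data flow just described, one shared encoder and one decoder per person, can be sketched as follows. Everything here is an assumption made for illustration: the weights are random and untrained, the layers are single matrices, and the sizes and names (`shared_encoder`, `decoder_a`, `decoder_b`) are hypothetical. A real system would train each decoder together with the shared encoder on faces of its own person.

```python
import numpy as np

rng = np.random.default_rng(0)
face_size, code_size = 64 * 64, 128   # assumed sizes for flattened grayscale faces

def dense(n_in, n_out):
    """Random, untrained weights standing in for a trained layer."""
    return rng.normal(scale=0.01, size=(n_in, n_out))

shared_encoder = dense(face_size, code_size)   # used for both people
decoder_a = dense(code_size, face_size)        # would be trained on person A only
decoder_b = dense(code_size, face_size)        # would be trained on person B only

def encode(face):
    return np.maximum(face @ shared_encoder, 0.0)      # ReLU code

def decode(code, decoder):
    return 1.0 / (1.0 + np.exp(-(code @ decoder)))     # sigmoid pixels in (0, 1)

face_a = rng.random((1, face_size))          # one frame showing person A
code = encode(face_a)                        # latent representation of the frame
reconstructed_a = decode(code, decoder_a)    # normal autoencoding: A in, A out
swapped = decode(code, decoder_b)            # the swap: A's code, B's decoder

print(code.shape, swapped.shape)   # (1, 128) (1, 4096)
```

The swap is nothing more than routing the code through the other person's decoder, which is why the shared encoder matters: both decoders have to understand the same latent space.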
Nowadays a generative adversarial network or GAN (more on GANs in upcoming articles) is attached to the decoder. The generator creates new images from the latent representation of the source material, while the discriminator network of the GAN attempts to determine whether or not an image is generated.
This pushes the generator to create images that mimic reality extremely well, since any defect would be caught by the discriminator. It also makes deepfakes difficult to combat as they are constantly evolving: any time a defect is detected, it can be corrected.
A simple technique which just learns to copy its input to its output can thus be used to generate remarkably powerful images. Unfortunately, like any other technology, it can be used for both good and bad, and deepfakes are being used for deception and spreading fake news quite a lot these days.
Equipping yourself with the right information is the best way to stay cognizant of these risks, and hopefully this article is one baby step towards that.