
Have Deepfakes Influenced the 2020 Election?

A look at the current state of deepfakes, their impact on politics, and how to defend against them.


Eric Hofesmann


Image of Mussolini with his horse handler edited out to make him seem more impressive (source)
“The rise of synthetic media and deepfakes is forcing us towards an important and unsettling realization: our historical belief that video and audio are reliable records of reality is no longer tenable.” -The State of DeepFakes 2019 Report

Media manipulation through images and videos has been around for decades. For example, during WWII Mussolini released a propaganda image of himself on a horse with his horse handler edited out, the goal being to make him appear more impressive and powerful [1]. Tricks like these can have a significant impact given how many people see such images, especially in the internet era. DARPA has an entire program, Media Forensics (MEDIFOR), devoted to developing methods for detecting manipulated media [2].

Fake news may one day pale in comparison to the impact of deepfake news [3]. Deepfakes are a class of computer vision methods that doctor an image or video of a person to make them appear to do or say whatever you want. They have been blowing up in both quality and popularity over the last couple of years. The term deepfake comes from a “fake” image or video generated by a “deep” learning algorithm. You’ve likely seen a video of a movie scene with the actors face-swapped to a scary degree of accuracy.

Deepfake of Arnold Schwarzenegger and Sylvester Stallone in the movie Step Brothers (source)

This technology has the potential to provide attackers with the means to sow unprecedented amounts of disinformation. Fake news is already prevalent, and people readily believe stories backed by little to no evidence [4]. Just recently, a fake video making it look like Joe Biden didn’t know what state he was in was viewed 1 million times on Twitter [5]. Deepfakes could provide “evidence” to people looking to confirm what they already want to believe. This is a real threat, and two bills [H. R. 3230, S. 3805] have already been introduced in Congress to counter the spread of deepfakes used for illegal purposes.

In this blog post, I’m going to go over some background on deepfakes, what they have been used for, and how to counter them. Along the way, I use datasets and models from a Kaggle competition designed to detect deepfakes [6], as well as FiftyOne [7], an open-source tool we at Voxel51 have been developing to visualize datasets and analyze the results of deep learning models.

What are deepfakes?

“A deepfake refers to a specific kind of synthetic media where a person in an image or video is swapped with another person’s likeness.” -Meredith Somers
Progress of GANs over the last few years (source)

Synthetic image and video generation is a growing field of computer vision that gained a lot of momentum with the introduction of generative adversarial networks (GANs) in 2014 [8].

Deepfakes come in a variety of forms according to Mirsky and Lee [9]. Here are examples of the different types of deepfakes on a woman’s face:

Reenactment — using your facial or body movements to dictate the movements of another person

Current work in reenactment aims to minimize the amount of training data needed to generate a modified face. One-shot and few-shot learning have become popular approaches: a model is trained on data from many different faces and then fine-tuned on only a handful of samples of the target face. Some of the leading methods here are MarioNETte and FLNet.

(source)

Replacement — your identity is mapped to another person (for example the face swap filter on Snapchat)

Many replacement works apply either encoder-decoder networks or variational autoencoders to learn to map the source face onto the target while retaining the same expressions (a minimal sketch of this encoder-decoder idea follows this list of types). A common challenge in replacement is handling occlusions, where an object passes in front of the target’s face. Recent replacement models include FaceShifter and FaceSwap-GAN.

(source)

Editing — altering the attributes of a person in some way (like changing their age or glasses)

(source)

Synthesis — generating completely new images or videos with no target person

(source)
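
To make the replacement idea above more concrete, here is a minimal, hypothetical sketch of the shared-encoder, per-identity-decoder setup that many face-swap models build on. The layer sizes, 64x64 input resolution, and training loss are assumptions for illustration; real systems like FaceShifter are far more elaborate.

```python
# Minimal sketch of the shared-encoder / per-identity-decoder idea behind
# many face-replacement models. Shapes and layer sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Compresses a 64x64 RGB face crop into a shared latent code."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),    # 32x32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # 8x8
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs a face from the latent code; one decoder per identity."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 256, 8, 8))

# One shared encoder learns pose/expression; each decoder learns one identity
encoder = Encoder()
decoder_a, decoder_b = Decoder(), Decoder()

# Training: reconstruct each identity through its own decoder
faces_a = torch.rand(8, 3, 64, 64)  # stand-in for a batch of face crops
loss = F.mse_loss(decoder_a(encoder(faces_a)), faces_a)

# Inference: routing A's latent code through B's decoder swaps the identity
# while keeping A's pose and expression
swapped = decoder_b(encoder(faces_a))
```

The key design choice is that the encoder is shared across identities, so it is forced to capture pose and expression, while each decoder memorizes one identity’s appearance; routing one person’s latent code through another person’s decoder performs the swap.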

Reenactment and replacement are the two most prominent types of deepfakes since they have the highest potential for destructive applications. For example, a politician’s face could be reenacted to say something they never actually said, or a person’s face could be replaced into an incriminating video for blackmail purposes. These types of manipulated media (deepfake or not) can be incriminating and/or harmful to the target’s reputation, like the Biden post mentioned earlier, in which signs naming the state were edited to make it look like he didn’t know where he was.

At the core of the concern about deepfakes is the fact that they make reenactment and replacement accessible and cheap; you don’t have to understand the details of the techniques involved to create deepfakes. Tools like DeepFaceLab let anyone take an image or video and replace a face, de-age a face, or manipulate speech. Tools like this generate a high-quality face, but for a truly seamless deepfake you still need some experience with video editing software like Adobe After Effects to composite it into a video. If you don’t have the desire or ability to make deepfakes yourself, you can even find services and marketplaces willing to create them for you. One example is https://deepfakesweb.com/, where you just upload the videos and images you want, and the service creates a deepfake for you in the cloud.

How are deepfakes being used?

The term was first coined by a Reddit user in 2017 who was using deepfakes to create fake pornography of celebrities via face swaps [10]; this content has since been banned from the site. Many deepfakes these days are more benign, like replacing actors in movies. For example, here is a recent post where the faces of Arnold Schwarzenegger and Sylvester Stallone were added to the movie Step Brothers [11].

Deepfake of Arnold Schwarzenegger and Sylvester Stallone in the movie Step Brothers (source)

There have been a few deepfakes made for political reasons. One of the most popular was made by Jordan Peele, who reenacted Obama’s face [12]. This was not done maliciously, but to raise awareness of the potential deepfakes have to shape the political landscape.

Deepfake example (source)

Only a few cases of serious political deepfakes have been uncovered so far, and not all of them are malicious. For example, an Indian politician made a deepfake of one of his announcements in order to translate it into other languages, like English [13].

Left Deepfake (source) | Right Original (source)

Another example of using deepfakes to make a political statement is this pair of videos of Kim Jong-Un and Vladimir Putin discussing the election and the need for a peaceful transition of power.

Left (source) | Right (source)

However, there have been a few instances of malicious political deepfakes. A deepfake of Donald Trump speaking about the Paris climate agreement was produced by a Belgian political group to persuade people to sign a petition calling on the government to take stronger action against climate change [3, 14].

Trump deepfake (source)

Even though the quality of this deepfake is rather poor and it can easily be spotted by looking at his mouth, numerous commenters were deceived and called out Trump for the statements in the video. As deepfakes increase in quality, the concern is not only that you may be unable to tell whether a video was faked, but also that someone can claim a real incriminating video is fake. Last year, a Malaysian politician was jailed over a video of him engaging in homosexual activity (which is illegal in Malaysia) that he claims is a deepfake. Experts were unable to conclusively determine whether the video was faked. In the future, this defense of dismissing a real video as a deepfake could be used for much more serious crimes.

Should I be worried?

As deepfake technology has progressed, so have methods for detecting deepfakes. Deepfake detectors are trained to find a host of different deepfake artifacts in images and videos. For example, Face X-ray is a method designed to look for the blending boundaries left when a deepfake face is blended back into the target image or video. Another method compares the background of the image with the face to check for discrepancies between the two. An emotion recognition network has also been used to detect whether the emotions shown on the face match the context and audio of the scene, and thus whether the behavior of the actor indicates a deepfake. Human biology can even be used to detect deepfakes, and to identify which deepfake model was used, by extracting the target’s heartbeat signal from the video and analyzing its residuals. Most of these deepfake detection works have open-source code available, and there are also commercially available options, for example Sensity, Deepware, and Microsoft Video Authenticator.

There have been multiple competitions hosted on Kaggle designed to produce better deepfake detectors [6]. I took one of the winning models from the Deepfake Detection Challenge and put it to the test on the videos shown above. I downloaded the deepfake videos, as well as corresponding real videos of Obama and Trump, and passed them through the deepfake detection model. The model outputs a score per video, classified as “REAL” if the score is close to 0 and “FAKE” if it is close to 1. I then used FiftyOne to quickly visualize and analyze the results.
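
For intuition, scoring a whole video with a frame-level detector boils down to something like the sketch below. This is not the competition winner’s actual pipeline (which also detects and crops faces before classifying); `model` is a stand-in for any binary real/fake classifier, and the 224x224 input size is an assumption.

```python
# Hypothetical sketch of scoring a video with a frame-level deepfake
# classifier by averaging per-frame fake probabilities.
import cv2
import numpy as np
import torch

def score_video(path, model, num_frames=32):
    """Return the mean fake probability over evenly sampled frames."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    scores = []
    for idx in np.linspace(0, total - 1, num_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        # Real pipelines crop the face here; we just resize the full frame
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        face = cv2.resize(frame, (224, 224)).astype(np.float32) / 255.0
        tensor = torch.from_numpy(face).permute(2, 0, 1).unsqueeze(0)
        with torch.no_grad():
            scores.append(torch.sigmoid(model(tensor)).item())
    cap.release()
    return float(np.mean(scores))  # near 0 => "REAL", near 1 => "FAKE"
```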

Results visualized in FiftyOne. Ground Truth indicates if the video is actually real or fake and Prediction indicates what the model predicted the video as.

The detector was able to correctly classify the videos of Obama and the Indian politician Manoj Tiwari, but failed to detect the deepfakes of Trump and Step Brothers.

The data and predictions can easily be loaded into a FiftyOne dataset with code along the lines of the snippet below. You just need to install FiftyOne and download the dataset zip first.
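
Here is a minimal sketch of that loading step; the file paths and scores below are placeholders, not the actual values.

```python
# Minimal sketch of loading videos, ground truth, and model scores into
# FiftyOne. Paths and scores are placeholders for the downloaded data.
import fiftyone as fo

dataset = fo.Dataset("political_deepfakes")

# (filepath, ground truth label, model's fake probability) -- hypothetical
videos = [
    ("/path/to/obama_deepfake.mp4", "fake", 0.94),
    ("/path/to/obama_real.mp4", "real", 0.08),
]

for filepath, gt, score in videos:
    sample = fo.Sample(filepath=filepath)
    sample["ground_truth"] = fo.Classification(label=gt)
    sample["prediction"] = fo.Classification(
        label="fake" if score > 0.5 else "real", confidence=score
    )
    dataset.add_sample(sample)

# Launch the FiftyOne App to browse the videos and their labels
session = fo.launch_app(dataset)
```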

Check out the full example and follow along here!

I also downloaded the 400-video validation set from the Kaggle competition and evaluated the model on it [15]. I then used code like the snippet below to find the samples where the model was wrong, i.e., where the prediction does not match the ground truth.
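
A minimal sketch of that query, assuming the dataset and fields from the loading snippet above:

```python
# Match samples whose predicted label disagrees with the ground truth
from fiftyone import ViewField as F

mistakes = dataset.match(F("prediction.label") != F("ground_truth.label"))
print(len(mistakes))     # number of incorrectly predicted videos
session.view = mistakes  # show only the mistakes in the App
```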

Incorrect predictions

Only 11 of the 407 validation and manually downloaded videos were predicted incorrectly. Looking through them, we can see that all of the incorrectly predicted samples from the validation set were actually “real” and predicted as “fake”. Of the 7 downloaded videos, the only ones incorrectly predicted were actually “fake” and predicted as “real”.

Additionally, I used FiftyOne to search for samples where the score was between 0.25 and 0.75, indicating that the model was uncertain. We can see that, compared to the ground truth distribution, the model was much more uncertain about “real” videos than “fake” ones.
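
That search is a one-line view in FiftyOne; here is a sketch, again assuming the fields defined earlier:

```python
# Match the "uncertain" band of scores: fake probability in (0.25, 0.75)
from fiftyone import ViewField as F

uncertain = dataset.match(
    (F("prediction.confidence") > 0.25) & (F("prediction.confidence") < 0.75)
)
print(uncertain.count_values("ground_truth.label"))  # real vs. fake breakdown
```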

(Left) Distribution of ground truth labels for all 407 videos | (Right) Distribution of ground truth labels where the prediction was between 0.25 and 0.75

These results indicate that this model is tuned to predict real videos as “fake” more readily than fake videos as “real”. However, when we throw in deepfakes downloaded from the internet that were not designed for this competition, the model failed on 2 of the 7 and called fake videos “real”. In the real world, it is acceptable for a model like this to flag more real videos as “fake”, since that helps sift through large amounts of data to determine which videos are actually fake. However, the fact that it failed on some of the videos we provided means that challenges like these should gather and generate deepfakes from a wider variety of sources and methods.

There is a desperate need for continued research on deepfake detectors before deepfakes become more prevalent. Luckily, high-quality deepfakes still require human intervention to touch up the editing, which will slow their spread and give deepfake detection models time to improve.

As of today, there have not been any notable deepfakes targeting the 2020 US election [3, 4]. Fake news and disinformation are still bigger concerns at this point than deepfakes [5]. Videos created by manual editing, like one showing Biden asleep during a TV interview [16], are currently a much greater concern than anything output by deep learning models. On top of that, the largest sources of disinformation are rumors and forums rather than any individual video [17].

Spot the deepfake

The best way to prevent deepfakes from being used maliciously is to train people to spot the telltale signs of a deepfake so they don’t accidentally fall for one. While some deepfakes are very well made, there are often common traits that give them away.

According to MIT researcher Matt Groh, if you think that you might be seeing a deepfake, you should look particularly closely at the:

  • Face: Are parts of the face seeming to “glitch”? Are their eyebrows and facial muscles moving when they talk? Does their hair seem to be falling naturally?
  • Audio: Does the audio match up with the expressions on the person’s face? Are there any weird cuts or jumps that sound unnatural in the audio?
  • Lighting: Is there a portion of the video where the lighting doesn’t match the rest of the scene? Are you seeing reflections in things like glasses?

You can see how good you are by checking out detectfakes.media.mit.edu. This website provides examples from the Deepfake Detection Kaggle challenge to test whether you can spot deepfakes as well as a deepfake detection model can!

Summary

Deepfakes could lead to a terrifying future of misinformation in the media. However, the technology is still in its infancy, and high-quality deepfakes require a significant human touch. As deepfakes improve, so do the methods and algorithms to counter them. A much greater concern for the current state of politics is the spread of misinformation through manually edited images and videos.

About Me

My name is Eric Hofesmann. I received my master’s in computer science, specializing in computer vision, at the University of Michigan. During my graduate studies, I realized that it was incredibly difficult to thoroughly analyze a new model or dataset without serious scripting to visualize and search through outputs and labels. Working at the computer vision startup Voxel51, I helped develop the tool FiftyOne so that researchers can quickly load up and start looking through datasets and model results. Follow me on Twitter @ehofesmann

References

[1] S. Malm, How Hitler, Mussolini, Lenin and Chairman Mao used photo editing to aid their propaganda: Before-and-after images reveal how they carefully managed their image, Daily Mail (2017)

[2] Media Forensics (MEDIFOR), DARPA

[3] O. Schwartz, You thought fake news was bad? Deep fakes are where truth goes to die, The Guardian (2018)

[4] G. Shao, Fake videos could be the next big problem in the 2020 elections, CNBC (2019)

[5] D. O’Sullivan, False video of Joe Biden viewed 1 million times on Twitter, CNN (2020)

[6] Deepfake Detection Challenge, Kaggle (2020)

[7] Voxel51, FiftyOne: Explore, Analyze and Curate Visual Datasets, (2020)

[8] I. Goodfellow, et al., Generative adversarial nets, NeurIPS (2014)

[9] Y. Mirsky, W. Lee, The Creation and Detection of Deepfakes: A Survey, arXiv (2020)

[10] S. Cole, We Are Truly Fucked: Everyone Is Making AI-Generated Fake Porn Now, Vice (2018)

[11] B. Monarch, Instagram (2020)

[12] J. Vincent, Watch Jordan Peele use AI to make Barack Obama deliver a PSA about fake news, The Verge (2018)

[13] N. Christopher, We’ve Just Seen the First Use of Deepfakes in an Indian Election Campaign, Vice (2020)

[14] G. Holubowicz, Extinction Rebellion takes over deepfakes, Journalism.Design (2020)

[15] B. Dolhansky, et al., The DeepFake Detection Challenge (DFDC) Dataset, arXiv (2020)

[16] White House social media director tweets manipulated video to depict Biden asleep in TV interview, Washington Post (2020)

[17] M. Parks, Fake News: How to spot misinformation, NPR (2019)
