Would you believe me if I were to say that there is a single solution to such different problems as brain decoding in neuroscience, shape reconstruction in computer graphics, color transfer in computer vision, and automatic translation in neural language processing? And if I were to add transfer learning, image registration, and spectral unmixing for musical data to this list and the fact that the solution I am talking about has nothing to do with deep learning?

Well, if you couldn’t guess the right answer, then keep on reading as this article is about Optimal Transport (OT): a mathematical theory dating back to the late 18th century that has flourished recently in both pure mathematics (2 Fields medals in last 12 years!) and machine learning to become one of the most emerging topics to learn about right now.

Join me on this walk through the beautiful history of OT spanning the last three centuries and learn why it became such an important part of today’s machine learning.

The beginning

Like many good things, it all started in France in the late 18th century under the rule of Louis XVI, around a decade before the French revolution of 1789. On that sunny day, Louis XVI looked concerned when discussing with one of the most prominent scientists of his country, Gaspard Monge, a matter of the highest governmental importance.

“You see, Gaspard,” said Louis XVI, “we have three types of cheese certification in France: fermier cheese fully made without leaving the farm, artisanal cheese made from milk from the nearby farms by artisans, and laitier cheese made from milk from the same region at the factories.

The problem is that artisans making artisanal cheese tend to waste too much time riding around and picking up milk from the farms that are too far away. I would like you, Gaspard, to help us to deal with the issue by figuring out which farm gives all of its milk to which artisan so that the cheese is abundant and the life is pretty.

“Oh!” exclaimed Gaspard somewhat excitedly, “it’s like this problem with ore and mines that I’ve been working on recently with my fellow metallurgists.”

“Whatever, Gaspard,” replied his Majesty. “As long as the cheese is on the table and the people are happy.”

Assuming one cow’s milk is enough to produce one cheese, the optimal T here would be an assignment given by the arrows with an overall distance of 19kms. — Assuming **one cow’s milk** is enough to produce **one cheese**, the optimal T here would be an **assignment** given by the arrows with an overall distance of 19kms.

After this conversation with the king, Gaspard immediately figured out that the problem he is dealing with can be expressed using the table on the left. Here, each cell gives the distance between the farms and the artisans, while the number of cows and cheeses indicates their respective production capacities and supply needs.

The goal then would be to find a function T, called the Monge map, that assigns each farm to an artisan in an optimal way by taking into account both the distances between them and their respective demands.

Despite its simplistic appearance, it took mathematicians almost two hundred years to fully characterize this problem (some 140 years less than the famous Fermat’s Last Theorem), until Yann Brenier showed in 1991 that it admits a solution in the general (continuous) case for some common distances used in science. But, before that, the following important thing happened.

The unexpected benefits of communism

Bookkeeper Nina, whose job in the late 1930s was what Google search is for us today, was staring at the shelves of the Leningrad State Library trying to find the Charles Baudelaire’s book “Les Fleurs du mal”. The latter was asked for by Leonid, a poetry-lover in his late twenties now waiting impatiently at the counter.

“We do not have Charles Baudelaire, comrade Kantorovich,” she said when coming back to her workplace. “But here is a nice article about metallurgy that you, as an inventor of linear programming and an honorable Soviet citizen, may find interesting”.

The nice article she referred to was Gaspard Monge’s “Mémoire sur la théorie des déblais et des remblais” that Leonid Kantorovich took somewhat disappointingly and, while reading it back at home, figured out how it can be improved.

In Kantorovich formulation, we are allowed to split the production among the artisans to minimize the distances covered by them. — In Kantorovich formulation, we are allowed to **split the production** among the artisans to minimize the distances covered by them.

“The issue with Monge problem is that those monarchists do not share stuff, while we, Soviet people, share everything with each other,” said Leonid to himself. “Take milk, for instance. Why would we care about not allowing farms to split its production among all factories?”

This brought him to the idea that would later be called the Monge-Kantorovich problem where contrary to seeking for a function T assigning each farm to one artisan, we now want to find a (probabilistic) function Г that is allowed to split the production of farms among all artisans.

This problem, contrary to the Monge’s one, can be shown to 1) always have a solution under some mild assumptions regarding the distance used and 2) earn Nobel Prize, Fields medals and other distinguished awards for a dozen of different researchers working on it in the last 60 years.

Finally, it hints on why all the cheese in post-Soviet countries tastes the same despite being sold under different names and that, my friend, is a great modern mystery to put some light on (believe me, I was born in Ukraine).

Low-level view

As you may have guessed already, all the stories presented so far are simplified (and even made up) for entertainment sake and to provide a general intuition behind OT as the true problems considered by both Monge and Kantorovich were much more complicated than that.

In fact, the true power of OT is hidden in its ability to act as a mechanism of transforming one (continuous) probability distribution into another one with the lowest possible effort.

The latter criterium is measured by some function providing a dissimilarity for any pair of points from the two distributions.

Monge problem with discrete probability distributions represented by histograms and their fitted continuous approximations given by green and purple lines.

Getting back to my example with farms and artisans, I can now define two discrete distributions over them with probabilities given by the production of each farm (3 cows out of 7 give me the probability of 3/7 ) and the demand of each artisan.

I will then obtain two histograms, and the distances considered before would become the costs of transforming the points from my first distribution into those from the second one.

“But why is that so important?” you may ask.

Well, mainly because a whole awful lot of objects manipulated by data science practitioners can be modeled as probability distributions.

For instance, any image can be seen as a distribution over pixels with probabilities given by their intensities normalized to sum to one.

The same holds for text documents that you can see as discrete distributions of words with probabilities given by their frequencies of occurrence. The examples are countless, and OT allows you to deal with all of them whenever you need to compare two such complex objects.

“But we usually use distances to compare objects!” you may rightfully notice.

Well, OT got it all covered once again as Kantorovich formulation of the problem defines a true metric on the space of distributions, called the Wasserstein distance, that obeys the triangle inequality and vanishes when two distributions are equal.

Interpolation between Monge and Kantorovich portraits using OT. The intermediate images show the shortest path for transforming one image into another.

Moreover, as OT problem finds the most efficient way of transforming one distribution into another, you can use its solution to interpolate smoothly between them and obtain intermediate transformations along this path.

To illustrate this, let’s take the image of Gaspard Monge and that of Leonid Kantorovich and solve the OT problem between them.

The result in the figure above shows a geodesic path between the two images given by the most efficient way (that of the shortest path) of gradually transforming one image into another with the geometry (or pairwise distances between the pixels in our case) specified by the cost function you are using.

Pretty cool, heh?! Let’s now finally get back to the modern times to see what exactly this means for machine learning.

Applications’ trailer

Below, I will present only some applications of OT in computer vision and neural language processing with the code used to produce the results covered in detail in “Hands-on guide to Python Optimal Transport toolbox: Part 2”.

Also, check out the introduction to Python Optimal Transport toolbox in this article “Hands-on guide to Python Optimal Transport toolbox: Part 1”.

Top row: left — original day sky image; right — original sunset sky image; middle — the result of color transfer. Bottom row: the coordinates of point in the Blue and red frequencies highlighting the different color styles of the two images. The reader may use the following code to reproduce the results. — **Top row**: **left** — original day sky image; **right** — original sunset sky image; **middle** — the result of color transfer. **Bottom row**: the coordinates of point in the Blue and red frequencies highlighting the different color styles of the two images. The reader may use the following code to reproduce the results.

Color transfer. In this application, we have two images: one with a blue sky over the ocean and one with the sunset. Our goal is to transfer the color style of the first image to the second one so that the colors of the sunset image will look identical to that of the daytime one.

As before, we consider both images as probability distributions over pixels given by 3-dimensional vectors in the RGB cube. We sample 1000 pixels from each image and align them using OT. The final result is given in the middle image on the left and shows a highly realistic sunset image with the daytime colors.

A result of color gradient adaptation with OT for Poisson image editing using the code from here. Note that the final result can be improved by using a more precise image mask.

Image editing. For this task, the goal is to edit an image by replacing a part of it using a patch of another image. For instance, in the image on the left, you can see my face, the Mona Lisa’s face, and the result of a seamless copy of my face on hers.

Optimal transport here is applied to color gradients of the two images, and then the Poisson equation is solved to calculate the edited image. For more cool examples of this, check this paper.

Aligning the embeddings of documents written in different languages using OT provides state-of-the-art results for automatic translation and cross-lingual information retrieval.

Automatic translation. Let’s now turn our attention to something else than images and consider text corpora. Our goal would be to find a matching of words and phrases in two different languages.

Let take the following pair as an example: given English proposition “the cat sits on the mat’’ and its French translation “le chat est assis sur le tapis’’, we would like to find a matching that provides the correspondences “cat”- “chat”, “sits”- “assis” and “mat”- “tapis”.

You see where I am heading with this one, don’t you?! I will define a distribution over each proposition and treat each word as a single point. I will then use the distances between their embeddings to match the two distributions with OT. The result of this matching can be seen on the left.

There are many more applications where OT has shown to be useful, including, for instance, shape reconstruction, multi-label classification, GAN training, brain decoding, image super-resolution, and many, many others. OT is a huge universe in itself, and it only keeps on rising, as confirmed by an increasing number of papers studying it at top machine learning conferences.