Why does transfer learning work or fail?
An (almost) math-free guide to understanding the theory behind transfer learning and domain adaptation.
During the NIPS tutorial talk given in 2016, Andrew Ng said that transfer learning — a subarea of machine learning where the model is learned and then deployed in related, yet different, areas — will be the next driver of machine learning commercial success in the years to come.
This statement is hard to contest: avoiding learning large-scale models from scratch significantly reduces the computational and annotation effort they require and saves data science practitioners lots of time, energy, and, ultimately, money.
As an illustration, consider Facebook’s DeepFace algorithm, which was the first to achieve near-human performance in face verification back in 2014. The neural network behind it was trained on 4.4 million labeled faces, an overwhelming amount of data that had to be collected, annotated, and then trained on for 3 full days, not counting the time needed for fine-tuning.
It won’t be an exaggeration to say that most of the companies and research teams without Facebook’s resources and deep learning engineers would have to put in months or even years of work to complete such a feat, with most of this time spent on collecting an annotated sample large enough to build such an accurate classifier.
This is where transfer learning magically steps in by allowing us to use the same model across related datasets, just as we would if they came from the same source. Despite being quite efficient and helpful for such challenging tasks as computer vision and natural language processing, transfer learning algorithms can also fail badly in practice, and explaining why this may or may not happen is what I will attempt to do below.
Getting back to the roots
To start my brief and painless introduction to transfer learning theory, let me introduce Homer, a guy in his late 30s who got all excited because of the hype around machine learning and decided to automatically classify all the weird stuff he buys on Aliexpress for his online shop.
Homer’s main motivation stemmed from his laziness and from the fact that the translated English descriptions on Aliexpress were usually quite confusing, to say the least, meaning that only the photos of what Homer bought provided any information about the actual item.
And so, Homer downloaded a huge annotated dataset of items sold on Amazon, hoping that a classifier learned on them was going to work well on his images from Aliexpress too.
“What made him think so?” you may ask. Well, first of all, Homer assumed that with that many images from Amazon, he could learn a low-error classifier for them using a state-of-the-art deep neural network with 1 trillion layers. Also, during all the summer vacations spent at his grandmother’s house near the sea, Homer had a chance to read all the latest work of Mr. Vapnik and Mr. Chervonenkis who, (very) broadly speaking, had suggested the following inequality:
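In (very) rough shorthand, a bound of this kind can be sketched as follows. The symbols are illustrative assumptions introduced here, not Vapnik and Chervonenkis's exact statement: $\mathrm{err}_{\mathrm{Amz}}(h)$ is the error of a classifier $h$ on unseen Amazon items, $\widehat{\mathrm{err}}_{\mathrm{Amz}}(h)$ is its error on the $n$ Amazon images used for training, and $\mathrm{complexity}(\mathcal{H})$ measures how rich the family of models $\mathcal{H}$ is.

```latex
\mathrm{err}_{\mathrm{Amz}}(h)
  \;\le\;
  \widehat{\mathrm{err}}_{\mathrm{Amz}}(h)
  + \sqrt{\frac{\mathrm{complexity}(\mathcal{H})}{n}}
```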
Homer knew that the first term on the right-hand side can be made as small as desired due to neural networks’ capacity to learn well from any rubbish fed into them. Also, Homer supposed that the high complexity of the neural network in the numerator of the second term was going to be compensated by the large sample size at his disposal, thus bringing it close to 0 too.
The last piece of the puzzle that bothered Homer was the left-hand side, as he was not sure whether the classification error on unseen items from Amazon was going to be close to that achieved on items from Aliexpress. To deal with this, he made the following simple assumption:
“What kind of distance?” a curious reader will ask and will be right to do so. But Homer did not care about such details and, being now quite happy with himself, proceeded to the following ultimate inequality:
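As a sketch in informal shorthand (the symbols are illustrative assumptions, not standard notation), such an assumption could read as below, where $\mathrm{err}_{\mathrm{Ali}}(h)$ and $\mathrm{err}_{\mathrm{Amz}}(h)$ are the errors of a classifier $h$ on unseen Aliexpress and Amazon items, and $\mathrm{dist}$ is some, as yet unspecified, distance between the two datasets.

```latex
\mathrm{err}_{\mathrm{Ali}}(h)
  \;\le\;
  \mathrm{err}_{\mathrm{Amz}}(h)
  + \mathrm{dist}(\mathrm{Amazon}, \mathrm{Aliexpress})
```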
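Chaining the learning bound with the distance assumption gives, in the same kind of informal shorthand (all symbols are illustrative assumptions: $\widehat{\mathrm{err}}_{\mathrm{Amz}}(h)$ is the error on the $n$ Amazon training images, $\mathrm{complexity}(\mathcal{H})$ the richness of the model family, and $\mathrm{dist}$ the unspecified distance), something along these lines:

```latex
\mathrm{err}_{\mathrm{Ali}}(h)
  \;\le\;
  \widehat{\mathrm{err}}_{\mathrm{Amz}}(h)
  + \sqrt{\frac{\mathrm{complexity}(\mathcal{H})}{n}}
  + \mathrm{dist}(\mathrm{Amazon}, \mathrm{Aliexpress})
```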
“Now I know what to do,” said Homer to himself, meaning transfer learning and not his unsettled life in general. “First, I need to find a way to transform images from Amazon so that they will look as similar as possible to those from Aliexpress, thus reducing the distance between them.
Then, I will learn a low-error classifier on the transformed images, as I still have ground-truth labels for them, and will apply this classifier further on my images from Aliexpress.”
After some moments of thinking, he became uncertain about his idea. “Is there something I am missing here?” he asked himself, while ordering a laser saber umbrella from Aliexpress’ website, and, as it happens, there was.
What is supposed to be called similar?
While Homer’s intuition about transfer learning formalized in the last inequality was generally right, he still lacked a well-defined notion of a distance that he could have used as a measure of transferability between two datasets.
“Roughly speaking,” Homer reasoned with himself, “there are two possible ways of comparing datasets: an unsupervised and a supervised one. If I go with the supervised one, it means that I take into account both images and their labels to measure the distance; if I opt for the unsupervised one, I consider images only.”
Both these approaches bothered Homer, but for different reasons. For the supervised approach, he had to have labels for Aliexpress images, which was something that he was trying to obtain with transfer learning in the first place.
As for the unsupervised one, he thought that it was not accurate enough, as two images can seem similar even when they belong to different categories. “How can that be?” you may wonder. Well, just take a look at the following two items sold on the Aliexpress and Amazon websites and tell me to which category each of them belongs.
Obviously, the one on the left is a sleeping pillow (it was obvious, wasn’t it?!), while that on the right is a salmon fillet. To avoid this sort of confusion, Homer decided to put forward the following assumption:
Homer thought that putting the labeling functions into the equation (functions that output the category of any possible item from their respective online platform) was the most straightforward way to account for both the annotations of the datasets and the actual similarity of their images. Surprisingly, this is how he (almost) came up with a result extremely close to Theorem 1 from the seminal paper on transfer learning theory.
One theory to bind them all
You may be quite surprised if I tell you that most of the papers on transfer learning theory boil down to the inequality derived by Homer from his very basic understanding of machine learning principles. The only difference between those numerous papers and Homer’s down-to-earth reasoning is that Homer got there by making assumptions, whereas the published works actually prove the desired result.
I will now show you a slightly reformulated inequality that captures the essence of both Theorem 1 and Homer’s reasoning, and then we will see how it can be used to justify all those transfer learning algorithms that were discussed on Medium before. My reformulation reads as follows:
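In informal shorthand, an inequality with the three terms discussed next can be sketched as below. The symbols are illustrative assumptions: $\mathrm{err}_{T}(h)$ and $\mathrm{err}_{S}(h)$ are the errors of a classifier $h$ in the target and source domains, $\mathrm{dist}(S, T)$ is an unsupervised distance between the two domains, and $\lambda$ stands for the non-estimable a priori adaptability term.

```latex
\mathrm{err}_{T}(h)
  \;\le\;
  \mathrm{err}_{S}(h)
  + \mathrm{dist}(S, T)
  + \lambda
```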
Here, my target domain is any dataset that I may want to categorize without manually annotating it; in Homer’s example, it consists of images of the items sold on Aliexpress.
My source domain, abundant Amazon images in that same example, is any annotated dataset for which I can produce a low-error model used in the target domain afterwards. The second term on the right-hand side is an unsupervised distance between the two domains that we can usually calculate without knowing the labels of instances in either of them.
Researchers working on transfer learning proposed many different candidates for this term, and most of them took the form of a certain divergence between the (marginal) distributions of the two domains.
Finally, the third term represents what is usually called the a priori adaptability: a non-estimable quantity that we can compute only when the true target domain’s labeling function is known. This latter observation brings us to the following important conclusion.
While a transfer learning algorithm can explicitly minimize the first two terms of our inequality, the a priori adaptability term remains uncontrollable, potentially leading to a failure of transfer learning.
If you are waiting for a magic solution at this point, then I will have to disappoint you: there is no such solution.
You can use kernel-based, moment matching, or adversarial approaches, but it won’t change anything: in the end you will be left at the mercy of the non-estimable term that may ultimately impact the final performance of your model in the target domain with its invisible hand. The good news, however, is that in most cases, it will still work better than doing no transfer at all.
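To make the idea of an unsupervised distance concrete, here is a minimal sketch of one kernel-based candidate, a biased estimate of the squared maximum mean discrepancy (MMD) with a Gaussian kernel. It is an illustration under assumptions, not the method of any particular paper: the function names, the bandwidth of 1.0, and the toy 1-dimensional "features" are all made up for the example. Note that only the inputs are used, never the labels.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of the squared MMD between two unlabeled samples."""
    return (mean_kernel(source, source, bandwidth)
            + mean_kernel(target, target, bandwidth)
            - 2.0 * mean_kernel(source, target, bandwidth))

# Toy 1-D features (made-up numbers, for illustration only)
close_pair = mmd_squared([0.0, 0.2, 0.4], [0.1, 0.3, 0.5])
far_pair = mmd_squared([0.0, 0.2, 0.4], [5.0, 5.2, 5.4])
print(close_pair, far_pair)
```

The first value is much smaller than the second: the estimate grows as the two unlabeled samples drift apart, which is the behaviour expected of the second term of the inequality. It says nothing about the third term.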
Back to real life
I will now present a simple example provided in this paper that highlights one of the pitfalls of transfer learning approaches following the philosophy described above. In this example, we will consider two 1-dimensional datasets, representing the source and target domains, and generated using the code given below.
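import numpy as np

def generate_source_target(xi):
    # Number of points in each domain
    size = int(1./(2*xi))
    source = np.zeros((size,))
    target = np.zeros_like(source)
    for k in range(size):
        # Source points sit at even multiples of xi, target points at odd ones
        source[k] = 2*k*xi
        target[k] = (2*k+1)*xi
    return source, target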
Executing this piece of code produces two sets of points in the interval [0,1], as in the figure below.
For instance, when ξ = 0.1, it will return the following two lists:
source, target = generate_source_target(1./10)
print(source)
[0.  0.2 0.4 0.6 0.8]
print(target)
[0.1 0.3 0.5 0.7 0.9]
I will further attribute label 1 to all points from the source domain and label 0 to those from the target domain. My final learning samples will thus become:
# float() keeps the printed tuples as plain Python numbers
source_sample = [(round(float(i),1),1) for i in source]
target_sample = [(round(float(i),1),0) for i in target]
print(source_sample)
[(0.0, 1), (0.2, 1), (0.4, 1), (0.6, 1), (0.8, 1)]
print(target_sample)
[(0.1, 0), (0.3, 0), (0.5, 0), (0.7, 0), (0.9, 0)]
Is it easy to find a perfect classifier for each of these samples? Yes, it is: in the case of the source domain, it can be done using a threshold function that outputs 1 for points whose coordinate is at most 0.8, and 0 otherwise. The same holds for the target domain, but this time the classifier will output 0 for all points whose coordinate is at most 0.9, and 1 otherwise.
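These two threshold functions are easy to check on the samples for ξ = 0.1. The sketch below is only an illustration: the function names are hypothetical, the samples are copied from the lists above, and the thresholds are taken as inclusive so that the boundary points 0.8 and 0.9 fall on the correct side.

```python
# Labeled samples for xi = 0.1, copied from the lists above
source_sample = [(0.0, 1), (0.2, 1), (0.4, 1), (0.6, 1), (0.8, 1)]
target_sample = [(0.1, 0), (0.3, 0), (0.5, 0), (0.7, 0), (0.9, 0)]

def source_classifier(x):
    """Output 1 for points up to 0.8 (inclusive), 0 otherwise."""
    return 1 if x <= 0.8 else 0

def target_classifier(x):
    """Output 0 for points up to 0.9 (inclusive), 1 otherwise."""
    return 0 if x <= 0.9 else 1

def error_rate(classifier, sample):
    """Fraction of labeled points that the classifier gets wrong."""
    mistakes = sum(1 for x, label in sample if classifier(x) != label)
    return mistakes / len(sample)

print(error_rate(source_classifier, source_sample))  # source classifier on source
print(error_rate(target_classifier, target_sample))  # target classifier on target
print(error_rate(source_classifier, target_sample))  # source classifier on target
```

Each classifier makes no mistakes on its own domain, while the source classifier applied to the target sample gets four of the five points wrong.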
Finally, we will now make sure that the distance between these two domains depends on ξ and thus can be made arbitrarily small by reducing it. To do this, we will use the 1-Wasserstein distance that is exactly equal to ξ in our case. Let’s verify it using the following code:
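from scipy.stats import wasserstein_distance

for xi in [1e-1, 1e-2, 1e-3]:
    source, target = generate_source_target(xi)
    wass_1d = wasserstein_distance(source, target)
    # The distance should match xi up to floating-point error
    print(f"xi = {xi}: 1-Wasserstein distance = {wass_1d:.4f}")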
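Why the distance equals ξ can also be seen without any library. For two 1-dimensional samples of equal size and equal weights, the 1-Wasserstein distance reduces to the mean absolute difference between the sorted values, and here every source point is matched with the target point lying ξ to its right. The sketch below is a pure-Python illustration with hypothetical helper names, not a replacement for the SciPy call.

```python
def generate_points(xi):
    """Same construction as above: source at 2k*xi, target at (2k+1)*xi."""
    size = int(1. / (2 * xi))
    source = [2 * k * xi for k in range(size)]
    target = [(2 * k + 1) * xi for k in range(size)]
    return source, target

def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size, equally weighted
    1-D samples: mean absolute difference of the sorted values."""
    assert len(xs) == len(ys)
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)

for xi in [1e-1, 1e-2, 1e-3]:
    source, target = generate_points(xi)
    print(xi, wasserstein_1d(source, target))
```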
To summarize, we now have a source domain for which we can learn a perfect classifier and a target domain that can be made arbitrarily close to it and for which there exists a perfect classifier too. Now, the question we ask is: Can transfer learning succeed in this seemingly very favorable case?! Well, as you may have already guessed, no, it can’t, and the reason for this hides in the non-estimable term that I have mentioned before.
Indeed, in this setup, a threshold classifier that would be good for both domains simultaneously does not exist, whatever you may do. Actually, the lowest possible combined error of such a classifier (its source error plus its target error) is exactly equal to 1-2ξ which, once again, can be made arbitrarily close to 1 by manipulating ξ accordingly. As put by the authors of the paper establishing the foundation of the transfer learning theory,
“When there is no classifier that performs well on both the source and target domains, we cannot hope to find a good target model by training only on the source domain.”
Sadly enough, this is true for all transfer learning approaches that do not have access to labels in the target domain.
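For the toy example above, the absence of a jointly good classifier can be checked by brute force. The sketch below is an illustration under assumptions of its own: it restricts attention to threshold functions on the line (in both orientations), measures the combined error as the source error rate plus the target error rate, and uses hypothetical helper names.

```python
def threshold_classifier(t, low_label):
    """Classifier that outputs low_label for x <= t and the other label otherwise."""
    return lambda x: low_label if x <= t else 1 - low_label

def error_rate(classifier, points, label):
    """Fraction of points (all sharing the same true label) that are misclassified."""
    return sum(1 for x in points if classifier(x) != label) / len(points)

def best_combined_error(xi):
    """Smallest source error + target error over all threshold classifiers."""
    size = int(1. / (2 * xi))
    source = [2 * k * xi for k in range(size)]        # all labeled 1
    target = [(2 * k + 1) * xi for k in range(size)]  # all labeled 0
    # One threshold below every point, then one at each point:
    # this covers every distinct way a threshold can split the data.
    candidates = [-1.0] + sorted(source + target)
    best = float("inf")
    for t in candidates:
        for low_label in (0, 1):
            h = threshold_classifier(t, low_label)
            combined = error_rate(h, source, 1) + error_rate(h, target, 0)
            best = min(best, combined)
    return best

for xi in [1e-1, 1e-2]:
    print(xi, best_combined_error(xi))
```

The printed value moves towards 1 as ξ shrinks, even though the two domains get closer and closer in the unsupervised sense.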
While this article may seem a little pessimistic, its main goal is to provide a general understanding of transfer learning theory to the interested reader, not to discourage him or her from using transfer learning methods. Indeed, transfer learning algorithms are usually quite efficient in practice as they are often applied to datasets that have a strong semantic connection between them.
In this case, assuming that the non-estimable term is small is reasonable as, most likely, there exists a good classifier that will work well for both domains. However, as it often happens in life, you will never know if this is the case unless you try it, and that’s what makes the beauty of it.