# Probability Theory for Machine/Deep Learning

*A very quick introduction to building a foundation in probability theory for understanding machine/deep learning.*

Nimish Mishra

*This article is a verbatim reproduction of the article published in Towards Data Science by the same author.*

### Introduction

*Why do we need a foundation in Probability Theory in order to understand machine/deep learning algorithms?*

The answer to the above question is the main motivation behind this article. Machine/deep learning often deals with **stochastic (or random)** quantities, which can be thought of as non-deterministic (something which cannot be predicted beforehand or which exhibits random behaviour). The study of these quantities is quite different from that of the deterministic quantities arising in a range of computer science fields. Given this, it is desirable to be able to reason in an environment of uncertainty, and probability theory is the tool that helps us do so.

*Because I do not want to cloud your thoughts with mathematical jargon in the very beginning, I have added a section on the applications of all these things at the very end of the article. That should be your prime motivation for understanding this stuff. Let’s begin then.*

*So what makes any system prone to these uncertainties?*

Let us establish a bit of mathematical literature in terms of which we can base our further discussion. Firstly, a deterministic system can be thought of as something which involves absolutely no randomness in the development of future states. Take for example Newton’s second law of motion. One can *determine* the future state of an accelerating system. The outcome is not random.

On the other hand, a non-deterministic system is one that is not deterministic (or involves a fair amount of randomness in its future states). For instance, flipping a coin is a non-deterministic process, as there is randomness in the result (either heads or tails, with no way to ascertain the outcome beforehand).

*Back to the question.* There can be a variety of ways due to which stochastic behaviour is introduced in systems.

A system might be *inherently stochastic*, like a Quantum Mechanics model. In such a case, there is no way we can make deterministic arguments about the state of the system. Or there might be a system that is deterministic given we have complete knowledge about the variables of the system. Now if we lose some knowledge about these variables, we lose the ability to be able to determine the future states of development of the system. And thus, a deterministic system turns into a non-deterministic one.

### Prerequisites

Reading this article requires a basic understanding of the notion of probability, some idea about frequentist and Bayesian probability, a basic idea about conditional probability, and notions of (in)dependent events. If you feel like refreshing these concepts, I’d recommend checking out this article on TDS by Kirill Dubovikov.

Let’s begin then…

### Random Variable

As detailed above, a non-deterministic system may have more than one possible outcome. For instance, flipping a coin has two different, equally likely outcomes: heads or tails.

A random variable (or Stochastic variable) can be thought of as a variable whose possible values are the outcomes of the non-deterministic system being modelled. For instance, we can define a random variable X that shall denote the outcome of a coin flip. Therefore, X takes the value 1 when the outcome is heads, and X = 0 when the outcome is tails. So the random variable X is said to take on one of the values from {0, 1}.
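As a quick illustration (a minimal sketch in Python; the seed and sample count are arbitrary choices of mine), we can simulate this random variable and check that it only ever takes values from {0, 1}:

```python
import random

def coin_flip_X(rng: random.Random) -> int:
    """Random variable X for a fair coin flip: 1 for heads, 0 for tails."""
    return 1 if rng.random() < 0.5 else 0

rng = random.Random(42)  # fixed seed so the run is reproducible
samples = [coin_flip_X(rng) for _ in range(10_000)]

print(set(samples))                  # X only takes values from {0, 1}
print(sum(samples) / len(samples))   # fraction of heads, close to 0.5
```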

Formally speaking, if S is the sample space of the outcomes of an event with a probability measure and X is a real-valued function defined over the elements of S, then X is a random variable (or a description of the possible states the system can be in on an experiment’s outcome).

A random variable may be **discrete** (if it covers **finite** or **countably infinite** number of states) or **continuous** (if it covers **uncountably infinite** number of states).

**Note**: The notion of countably infinite and uncountably infinite merits a whole article explaining it and is thus omitted here. You can, however, view ideas on set domination online. I am attaching a very brief discussion here.

Consider two sets, X and N (the set of natural numbers), and the usual definitions of mapping and bijection. The set X is said to **strictly dominate** N if there exists a one-to-one mapping from N to a **subset of X** (and not onto the whole of X). In other words, there exists at least one element in X which has no pre-image in N. You can construct a similar condition for the set N strictly dominating the set X. Also, the set X is said to be **equivalent to** the set N when there exists a bijection between the two.

Now, X is finite when N strictly dominates X. X is countably infinite when X is equivalent to N. And X is uncountably infinite when X strictly dominates N.

### Probability Distribution Functions

Simply stated, a probability distribution function tells you how likely a random variable is to take on a particular value. For instance, in our example of flipping a coin, the probability that X = heads is 0.5 (there is a 0.5 probability that the coin comes up heads when the event occurs). Formally stating,

P(X = x) = f(x), denoted by x ~ f(x)

or a PDF can be thought of as a mapping from the value of a state to its probability of occurrence.

#### Probability Mass Function (PMF)

This is the probability distribution function of a discrete random variable. Consider the experiment of throwing two dice, and let X be a random variable depicting the sum of the numbers on the individual dice. Then:

| x | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|----|----|----|
| P(X = x) | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |

You can see here how the possible states of X are mapped to their respective probabilities in the table above. You may find more information about how to go about calculating this here.
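To make this concrete, here is a small sketch (standard library only) that enumerates all 36 equally likely outcomes of two dice and builds the PMF of their sum:

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of two dice and count each sum.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
pmf = {s: Fraction(c, 36) for s, c in counts.items()}

print(pmf[7])             # 1/6: six of the 36 outcomes sum to 7
print(sum(pmf.values()))  # 1: the probabilities of all states sum to one
```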

#### Probability Density Function

This is the probability distribution function of a continuous variable. Contrary to the probability mass function, which associates a probability with X taking on a certain value x, the density function associates with X the probability that X shall land in an *infinitesimal region with measurement dx* (where measurement = length for univariate distributions, area for bivariate distributions, volume for trivariate distributions, and so on). The probability associated is given by *f(x)·dx*.

We can obviously apply integral calculus to calculate the probability that X lands in the *measurement* between any two limits (say a and b such that a ≤ b) by repeatedly adding up probabilities of infinitesimal regions given by *f(x).dx*

i.e. the probability that X takes on a value between a and b is the integral over the infinitesimal probabilities between a and b: P(a ≤ X ≤ b) = ∫ f(x) dx, evaluated from a to b.
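As a rough illustration of this idea (a sketch using the standard normal density and a simple midpoint Riemann sum; the step count is an arbitrary choice), we can approximate the probability that X lands between −1 and 1:

```python
import math

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def prob_between(a: float, b: float, n: int = 100_000) -> float:
    """Approximate the integral of f(x) dx from a to b with a midpoint Riemann sum."""
    dx = (b - a) / n
    return sum(normal_pdf(a + (i + 0.5) * dx) * dx for i in range(n))

print(round(prob_between(-1.0, 1.0), 4))  # ≈ 0.6827, the familiar "68%" of the normal curve
```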

You may also want to read about bivariate distribution functions (joint probability distribution functions (discrete and continuous)) and marginal distribution functions.

### Expectation Value

Expectation value of a random variable can be thought of as the **mean value** the variable takes when it is drawn according to the probability distribution *f(x)*. It is calculated as E[X] = Σ x·f(x) for discrete variables, and E[X] = ∫ x·f(x) dx for continuous ones.

Likewise, the *variance* of a random variable can be seen as a measure of how much the values of a function of the random variable vary when X is drawn from the probability distribution *f(x)*. Variance is the expectation value of the square of (X − mean): Var(X) = E[(X − E[X])^2].

A very detailed theory and practice on expectation values can be found here.
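For instance, reusing the two-dice PMF from earlier, a short sketch can compute both quantities directly from the definitions:

```python
from collections import Counter

# PMF of the sum of two fair dice, built by enumerating all 36 outcomes.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
pmf = {s: c / 36 for s, c in counts.items()}

mean = sum(x * p for x, p in pmf.items())               # E[X] = sum of x * f(x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # E[(X - E[X])^2]

print(round(mean, 4))  # 7.0
print(round(var, 4))   # 5.8333 (= 35/6)
```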

### Covariance

Covariance gives a sense of how much two variables are related to each other. Take for instance a covariance matrix for three variables A, B and C whose first row reads 150, −90, 100: here 150 is the variance of A, −90 is the covariance of A and B, 100 is the covariance of A and C, and so on.

The covariance is calculated as cov(X, Y) = Σ (x − E[X])(y − E[Y])·f(x, y), where f(x, y) is the joint probability distribution of the random variables X and Y. This equation can be simplified to yield *cov(X, Y) = E(XY) − E(X)·E(Y)*.

From this, certain deductions can be drawn. A negative covariance means that as one variable’s value increases, the other tends to take lower values. The reverse holds for a positive covariance (both variables tend to take high values or low values simultaneously). Independent variables have 0 covariance (as they do not depend on each other, one’s value doesn’t affect the other’s). *However, 0 covariance doesn’t necessarily imply independence of the variables.*
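The final caveat deserves a concrete example. In this sketch (variables chosen purely for illustration), Y = X² is completely determined by X, so the two are clearly dependent, yet their covariance comes out to exactly zero:

```python
# X is uniform on {-1, 0, 1} and Y = X^2. Y is fully determined by X
# (so they are clearly dependent), yet their covariance is exactly zero.
support = [-1, 0, 1]
p = 1 / 3

e_x = sum(x * p for x in support)            # E[X]  = 0
e_y = sum(x ** 2 * p for x in support)       # E[Y]  = 2/3
e_xy = sum(x * x ** 2 * p for x in support)  # E[XY] = E[X^3] = 0

cov = e_xy - e_x * e_y                       # cov(X, Y) = E(XY) - E(X)E(Y)
print(cov)  # 0.0, despite Y depending entirely on X
```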

### Predefined Probability Distributions

There are several predefined probability mass and probability density functions. I’ll explain some of them here.

#### Bernoulli Distribution

This is the distribution function over a single, binary, discrete random variable, i.e. a discrete random variable X that can take only 2 values. So, for example, say I have a coin, and, when tossed, the probability it lands heads is *p*. Then the probability that it lands tails is 1 − *p* (there are no other possible outcomes for the coin toss).

Formally, the Bernoulli distribution is parameterised by a single parameter *phi* denoting the probability of success (whose value is equal to *p* if you consider the example in the last paragraph).

Now consider the following: P(X = 1) = *phi* and P(X = 0) = 1 − *phi*.

Here, the probability that X takes the value 1 (or heads is tossed, as in our example) is given by the parameter *phi* (which takes some value between 0 and 1). Likewise, the chance of the other event happening (tails being tossed) is (1 − *phi*). We can combine both these probabilities into a *generalised statement* given by P(X = x) = phi^x · (1 − phi)^(1 − x), for x in {0, 1}.

Utilising the concepts of expectation value and variance as discussed above, one can find the mean and variance of this distribution: E[X] = *phi* and Var(X) = *phi*·(1 − *phi*).
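A minimal sketch of these formulas (the value of *phi* here is arbitrary):

```python
def bernoulli_pmf(x: int, phi: float) -> float:
    """P(X = x) = phi^x * (1 - phi)^(1 - x) for x in {0, 1}."""
    return phi ** x * (1 - phi) ** (1 - x)

phi = 0.3
mean = sum(x * bernoulli_pmf(x, phi) for x in (0, 1))               # = phi
var = sum((x - mean) ** 2 * bernoulli_pmf(x, phi) for x in (0, 1))  # = phi * (1 - phi)

print(round(mean, 6))  # 0.3
print(round(var, 6))   # 0.21
```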

#### Binomial Distribution

A binomial distribution can be thought of as the sum of n **independent** and **identically** distributed Bernoulli random variables, where each Bernoulli random variable takes on one of the two values {0, 1}.

The informal meaning of *sum of n independent and identically distributed Bernoulli variables* is that we repeat the same experiment *n* times, with the outcome of each experiment independent of the outcomes of the others. We also define a parameter *p* (identical to the parameter *phi* in the Bernoulli distribution) that denotes the probability of the random variable taking on the value 1 in each instance of the experiment. The binomial distribution thus goes like: P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), where C(n, k) counts the ways of choosing which k of the n trials come out as 1.

For instance, take 5 tosses of a fair and balanced coin. Now define a random variable X that denotes the number of heads obtained. X can therefore take any value in {0, 1, 2, 3, 4, 5}. Here n = 5 (the number of times the experiment is repeated). Formally stating, if we define the Bernoulli variable X[i] as the outcome of the *i*-th coin toss, we need to add X[1], X[2], …, X[5] in order to get our desired value of X. Also note that X[1], X[2], …, X[5] are pairwise independent, i.e. one coin toss doesn’t depend on the other coin tosses.

Condensing information from the previous paragraph, we can directly calculate Pr(k = 2; n = 5, p = 0.5) by placing these values in the distribution above. This yields C(5, 2) · (0.5)^2 · (0.5)^3 = 10/32 = 0.3125, the probability of finding 2 heads given we toss a balanced (p = 0.5) coin 5 times.
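A quick sketch of this calculation using Python’s `math.comb`:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 5 tosses of a fair coin.
print(binomial_pmf(2, 5, 0.5))  # 0.3125 (= 10/32)
```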

#### Gaussian Distribution (Normal Distribution)

This is the most basic distribution function for continuous random variables. It is parameterised by the mean and variance (denoted by their standard symbols μ and σ^2) of the distribution: f(x) = (1 / sqrt(2πσ^2)) · exp(−(x − μ)^2 / (2σ^2)).

The function graphs as the familiar bell curve, symmetric about the mean.

Gaussian distributions are sensible choices in cases where we don’t know anything about the distribution of the random variable. It is then assumed that values follow a normal distribution, with an equal number of measurements above and below the mean, and with the mean value at the peak of the distribution. For instance, suppose we are given the task of modelling a continuous random variable and we don’t know anything about its distribution. It is sensible, in this case, to make the fewest assumptions about the distribution of the variable and choose the Gaussian distribution function (which introduces the **maximum amount of uncertainty** over the distribution of X among all the distributions with **finite variance**).

Condensing the above paragraph into a statement, *it is highly probable that your continuous random variable X follows a Gaussian distribution with some noise (suggested by the Central Limit theorem). So why not make that assumption beforehand?*
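A minimal sketch of the univariate density (the parameter values here are arbitrary):

```python
import math

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Peak of the standard normal sits at the mean:
print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))  # 0.3989

# The density is symmetric about the mean:
print(gaussian_pdf(1.0, 0.0, 1.0) == gaussian_pdf(-1.0, 0.0, 1.0))  # True
```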

In case we wish to model multivariate distributions, we can have the Gaussian distribution as: f(x) = (1 / sqrt((2π)^n · det(Σ))) · exp(−(1/2) · (x − μ)^T · Σ^(−1) · (x − μ)),

where the mean is now a vector, and in place of variance, we use a covariance matrix symbolised by *capital sigma* (already discussed in an above section).

*Find more about Gaussian distributions in multivariate settings in Gaussian Mixture model discussed below.*

#### Exponential and Laplace Distribution

In deep learning, we need to regularise the parameters of a neural network to prevent overfitting. From a Bayesian perspective, fitting a regularised model can be interpreted as computing the maximum a posteriori (MAP) estimate. Without going into more details here (you can obviously read more about that here), we need a distribution with a very sharp point at x = 0, or at x = mean of the distribution of X, in order to aid regularisation. We hence use the exponential distribution:

p(x; λ) = λ · 1(x ≥ 0) · exp(−λx),

where 1(x ≥ 0) is the indicator function, which serves to assign a 0 probability to all negative values of x.

The exponential distribution describes the time between the events in a *Poisson point* process, i.e. a process in which events occur continuously and independently at a constant average rate.

Similarly, if we wish to model a spike at the mean of the distribution of X, we can use the Laplace distribution function: Laplace(x; μ, γ) = (1 / (2γ)) · exp(−|x − μ| / γ).

This Laplace distribution can be viewed as two exponential distributions spliced together back-to-back, so that we obtain a spike at the mean of the distribution (the spike can be shifted away from zero via the parameter *μ* in the Laplace distribution).
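Both densities can be sketched in a few lines (the parameter names `lam`, `mu`, and `gamma` are my own choices for the rate, mean, and scale):

```python
import math

def exponential_pdf(x: float, lam: float) -> float:
    """lam * exp(-lam * x) for x >= 0; the indicator assigns 0 to negative x."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def laplace_pdf(x: float, mu: float, gamma: float) -> float:
    """(1 / (2 * gamma)) * exp(-|x - mu| / gamma): a sharp spike at x = mu."""
    return math.exp(-abs(x - mu) / gamma) / (2 * gamma)

print(exponential_pdf(-1.0, 1.0))  # 0.0: no mass on negative values
print(laplace_pdf(3.0, 3.0, 1.0))  # 0.5: the peak sits at the mean mu = 3
```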

#### Dirac Delta distribution

The Dirac Delta function serves to cluster all the distribution around a single point and neutralises distributions over all other values of the continuous random variable X.

Defining p(x) = δ(x − μ) tends to collect all the mass around the mean μ. A more concrete picture is a family of sharply peaked functions controlled by a parameter *a*, whose value serves to define the closeness of the peak to the origin (or the concentration of the mass about the origin). As *a* approaches 0, the peak becomes infinitely narrow and infinitely high.

You can also parameterise such a function with the mean, which results in a peak at the desired spot (other than 0).

The Dirac Delta functions find their importance in the next distribution:

### Empirical distribution

The empirical distribution can be thought of as the analogue of the multinoulli distribution for continuous random variables.

The empirical distribution serves to concentrate (1/m) mass (instead of the whole mass, as in the Dirac Delta distribution) over each of m points in the sample space, by utilising the power of the Dirac Delta distribution at these points: p(x) = (1/m) · Σ δ(x − x(i)), summed over i = 1, …, m.

Quite evidently, this distribution centers (1/m) mass over x(1), x(2), … , x(m). Find more explanation on this in the next section…
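A sketch of what this means in practice (the data points below are made up): sampling from the empirical distribution of a dataset amounts to picking one of the m observed points uniformly at random, since each carries 1/m of the mass:

```python
import random

# Sampling from the empirical distribution: each of the m observed points
# carries 1/m of the probability mass, so we pick one uniformly at random.
data = [2.5, 3.1, 3.1, 4.0, 5.2]
rng = random.Random(0)

draws = [rng.choice(data) for _ in range(1000)]
print(all(d in data for d in draws))  # True: samples only land on observed points
```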

### Distribution Mixtures

Congratulations if you have made it this far! We now move towards understanding an ensemble of the above-described distributions. You can combine these distributions to create more complex ones. Let’s take an example of a distribution mixture you have already seen: the *empirical distribution*.

The empirical distribution, as given above, is a mixture of Dirac Delta functions, one over each of the m points we are targeting. In general, a mixture distribution takes the form P(x) = Σ P(c = i) · P(x | c = i), summed over the components i.

Let me unpack that a bit. Here, *c* stands for the different component distributions defined. In our example, we have m points with a Dirac Delta function over each point. You can thus think of the empirical distribution as a mixture of *m* Dirac Delta distributions, where each distribution is parameterised by the point x(i) about which it needs to concentrate the mass.

This means that the final distribution *P(x)* is a multinoulli distribution over the individual distribution components *c*, calculated by finding the **prior probability** (or probability before observing x) of choosing the *i*-th Dirac Delta distribution from the *m* Dirac Delta distributions we have, and then sampling x from that *i*-th Dirac Delta distribution.

Simply stating: *you have m Dirac Delta distributions; you choose one of them and concentrate (1/m) mass over it. You then choose another Dirac Delta distribution and concentrate (1/m) mass over that. Keep doing this until all the Dirac Delta functions are exhausted. Finally, sum all of them to get your empirical distribution.*

Let’s discuss a final distribution mixture, the *Gaussian mixture model*. You can again think of this mixture as having, say, *m* separate components, where each component is a *Gaussian distribution with its own parameters: a mean vector and a covariance matrix*.

In a typical Gaussian mixture you can distinctly observe several different clusters, each of which is a Gaussian distribution, and each parameterised differently (I’ll explain in terms of the covariance matrix; if you need a refresher on this parameter, consult the covariance section above). For example:

- One component may have an **isotropic covariance matrix**, so it has the same amount of variance in all directions.
- Another may have a **diagonal covariance matrix**, which implies only the diagonal elements are non-zero, or that the covariance between the variables is 0, meaning they are independent along the coordinate axes.
- A third may have a **full-rank covariance matrix**, allowing variance along any arbitrary direction.

I hope it makes sense how the same distribution with different parameters can be combined into one mixture model.
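The same two-step recipe (pick a component by its prior probability, then sample from that component’s Gaussian) can be sketched for a one-dimensional mixture; the component means, sigmas, and weights below are made up purely for illustration:

```python
import random

# Ancestral sampling from a 1-D Gaussian mixture: first pick a component c
# with prior probability P(c), then sample x from that component's Gaussian.
components = [
    {"mean": -4.0, "sigma": 0.5, "weight": 0.3},
    {"mean": 0.0, "sigma": 1.0, "weight": 0.4},
    {"mean": 5.0, "sigma": 0.8, "weight": 0.3},
]
weights = [comp["weight"] for comp in components]
rng = random.Random(7)

def sample_mixture() -> float:
    comp = rng.choices(components, weights=weights, k=1)[0]  # choose a component
    return rng.gauss(comp["mean"], comp["sigma"])            # sample from it

samples = [sample_mixture() for _ in range(5000)]
print(min(samples) < 0 < max(samples))  # draws span the well-separated clusters
```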

And finally!!!

### Applications of all this

Exponential distributions can help regularise the parameters of a neural network to prevent overfitting. This is a highly desirable goal, as overfitting leads to poor generalisation performance and needs to be kept in check.

The other day, I was trying to solve the Santander Customer Transaction Prediction challenge on Kaggle, and in exploring the data, I tried to look at the distributions of all 200 (anonymised) variables to see if there were obvious relations that could be seen straight away. The distributions of variables can give vital information about the nature of the data you are trying to handle. I plotted, for each variable, its distribution for the two target groups (denoted by 1 and 0).

A preliminary thought of mine suggested that the first three graphs in the top row indicated different distributions of the same variables for the two groups. On the other hand, the fourth graph in the top row looked like a perfect overlap. So probably the fourth variable doesn’t affect the split into groups 1 and 0, while the first, second, and third variables might do so.

The above inference is preliminary. But it might give a sense of direction for further exploration.

There are loads of other examples, such as the math behind the different algorithms used, that can now be understood in a better way with some background knowledge of these concepts.

And lastly, and more importantly, you shall be able to begin to understand the machine/deep learning literature in a better way, as most things have their foundations in these concepts.

Also, you might want to check out these posts: link and link

I hope you learnt something new in this post. Let me know what you think about this post in the comments below.

Have a good day!
