The rules of probability
At the simplest level, a model, be it a machine learning model or a more classical method such as linear regression, is a mathematical description of how a target variable changes in response to variation in a predictive variable; that relationship could be a linear slope or any of a number of more complex mathematical transformations. In the task of modeling, we usually think of separating the variables in our dataset into two broad classes:
- Independent variables, by which we primarily mean the inputs to a model, are often denoted by X. For example, if we are trying to predict the grades of school students on an end-of-year exam based on their characteristics, we could think of several kinds of features (see the encoding sketch after this list):
- Categorical: If there are six schools in a district, the school that a student attends could be represented by a six-element vector for each student. The elements are all 0, except for one that is 1, indicating which of the six schools they are enrolled in.
- Continuous: The student heights or average prior test scores can be represented as continuous real numbers.
- Ordinal: The rank of the student in their class is not meant to be an absolute quantity (like their height) but rather a measure of relative difference.
- Dependent variables, conversely, are the outputs of our models and are denoted by the letter Y. Note that, in some cases, Y is a “label” that can be used to condition a generative output, such as in a conditional GAN. It can be categorical, continuous, or ordinal, and can be an individual element or a multidimensional matrix (tensor) for each element of the dataset.
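As a minimal sketch of how these variable types might be represented numerically (all names and values here are hypothetical, chosen only for illustration), consider a single student record in NumPy:

```python
import numpy as np

# Hypothetical record for one student; values are invented for illustration.
n_schools = 6
school_index = 2                     # the student attends the third of six schools

# Categorical: one-hot vector that is all 0s except a single 1 for the school attended
school_one_hot = np.zeros(n_schools)
school_one_hot[school_index] = 1.0

# Continuous: real-valued measurements such as height (cm) or prior test average
height_cm = 158.5
prior_score_avg = 71.3

# Ordinal: class rank encodes relative order rather than an absolute magnitude
class_rank = 12

# Independent variables X for this student, and the dependent variable Y (exam grade)
X = np.concatenate([school_one_hot, [height_cm, prior_score_avg, class_rank]])
Y = 84.0
print(X.shape, Y)
```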
How can we describe the data in our model using statistics? In other words, how can we quantitatively describe what values we are likely to see, how frequently, and which values are more likely to appear together than others? One way is by asking how likely it is to observe a particular value in the data, or the probability of that value. For example, if we were to ask what the probability of observing a roll of four on a six-sided die is, the answer is that, on average, we would observe a four once every six rolls. We write this as follows:
P(X = 4) = 1/6 ≈ 16.67%
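To make this concrete, here is a small simulation (a sketch using NumPy's random number generator; the sample size is arbitrary) showing that the empirical frequency of rolling a four approaches 1/6 as the number of rolls grows:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
rolls = rng.integers(low=1, high=7, size=100_000)   # 100,000 fair six-sided die rolls

# Empirical probability of rolling a four; should be close to 1/6 ≈ 0.167
p_four = np.mean(rolls == 4)
print(p_four)
```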
In this notation, P denotes “probability of.” What defines the allowed probability values for a particular dataset? If we imagine the set of all possible values in a dataset, such as the faces of a die, then a probability maps each value to a number between 0 and 1. The minimum is 0 because we cannot have a negative chance of seeing a result; the most unlikely result is one that we never see, a probability of 0%, such as rolling a seven on a six-sided die. Similarly, we cannot have a greater than 100% probability of observing a result, represented by the value 1; an outcome with probability 1 is absolutely certain. The set of values associated with a dataset may consist of discrete classes (such as the faces of a die) or an infinite range of potential values (such as variations in height or weight). In either case, however, the probabilities assigned to these values have to follow certain rules, the probability axioms described by the mathematician Andrey Kolmogorov in 1933:
- The probability of an observation (a die roll, a particular height) is a non-negative, finite number between 0 and 1.
- The probability of at least one of the observations in the space of all possible observations occurring is 1.
- The probability of observing any one of a set of distinct, mutually exclusive events (such as the rolls 1–6 of a die) is the sum of the probabilities of the individual events.
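For a fair die, these three axioms are easy to check numerically; the sketch below assumes a uniform distribution over the six faces:

```python
import numpy as np

faces = np.arange(1, 7)
probs = np.full(6, 1 / 6)                      # uniform probability for each face

# Axiom 1: every probability is a non-negative number between 0 and 1
assert np.all((probs >= 0) & (probs <= 1))

# Axiom 2: the probability that *some* face occurs is 1
assert np.isclose(probs.sum(), 1.0)

# Axiom 3: mutually exclusive events add, e.g. P(roll is 1 or 2) = P(1) + P(2)
assert np.isclose(probs[0] + probs[1], 2 / 6)
```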
While these rules might seem abstract, we will see in Chapter 3 that they have direct relevance to developing neural network models. For example, an application of rule 1 is the softmax function used to predict target classes, which generates a probability between 0 and 1 for each potential outcome. If our model is asked to classify whether an image contains a cat, dog, or horse, each potential class receives a probability between 0 and 1 as the output of a softmax function applied to a deep neural network that performs nonlinear, multi-layer transformations on the input pixels of the image we are trying to classify. Rule 3 ensures that, because these classes are mutually exclusive, their probabilities add, so the outputs can be normalized to sum to 1 (in other words, a real-world image logically cannot be classified as both a dog and a cat, but rather a dog or a cat, with the probabilities of these two outcomes additive). Finally, the second rule provides the theoretical guarantee that we can generate data at all using these models.
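As a minimal sketch of the softmax calculation described above (the logits here are arbitrary numbers standing in for the raw outputs of a network's final layer for the classes cat, dog, and horse):

```python
import numpy as np

def softmax(logits):
    """Map raw network outputs (logits) to probabilities in [0, 1] that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 0.5, -1.0])     # hypothetical scores for cat, dog, horse
probs = softmax(logits)
print(probs)            # rule 1: each value lies between 0 and 1
print(probs.sum())      # rules 2 and 3: the mutually exclusive classes sum to 1
```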
However, in the context of machine learning and modeling, we are not usually interested in just the probability of observing a piece of input data, X; we instead want to know the conditional probability of an outcome Y given the data X. Said another way, we want to know how likely a label for a set of data is, based on that data. We write this as the probability of Y given X, or the probability of Y conditional on X:
P(Y|X)
Another question we could ask about Y and X is how likely they are to occur together—their joint probability—which can be expressed using the preceding conditional probability expression as:
P(X, Y) = P(Y|X)P(X) = P(X|Y)P(Y)
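This identity can be checked directly on a small discrete example; the joint probability table below is invented purely for illustration:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) for X in {0, 1} (rows) and Y in {0, 1} (columns)
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)                 # marginal P(X)
p_y = joint.sum(axis=0)                 # marginal P(Y)

p_y_given_x = joint / p_x[:, None]      # conditional P(Y|X)
p_x_given_y = joint / p_y[None, :]      # conditional P(X|Y)

# Both factorizations recover the joint: P(X, Y) = P(Y|X)P(X) = P(X|Y)P(Y)
assert np.allclose(p_y_given_x * p_x[:, None], joint)
assert np.allclose(p_x_given_y * p_y[None, :], joint)
```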
Either way, we obtain the joint probability of X and Y. In the case of X and Y being completely independent of one another, this joint probability is simply their product:
P(X|Y)P(Y) = P(Y|X)P(X) = P(X)P(Y)
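When X and Y are independent, the joint table is simply the outer product of the two marginal distributions, which the following sketch verifies (again with made-up marginal probabilities):

```python
import numpy as np

p_x = np.array([0.4, 0.6])              # marginal P(X)
p_y = np.array([0.3, 0.7])              # marginal P(Y)

# Under independence the joint factorizes: P(X, Y) = P(X)P(Y)
joint_indep = np.outer(p_x, p_y)

# The conditional then equals the marginal: P(Y|X) = P(Y) for every value of X
p_y_given_x = joint_indep / p_x[:, None]
assert np.allclose(p_y_given_x, p_y)
print(joint_indep)
```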
You will see that these expressions become important in our discussion of complementary priors in Chapter 4, and the ability of restricted Boltzmann machines to simulate independent data samples. They are also important as building blocks of Bayes’ theorem, which we describe next.