In this blog post, we look at uncertainty quantification for machine learning and deep learning. This is by no means a new subject, but the introduction of tools such as Tensorflow Probability and Pyro has made it easier to perform probabilistic modeling and streamline uncertainty calculations. Consider a scenario in which we predict the value of an asset, such as a house, from a number of features in order to drive purchasing decisions. Wouldn’t it be beneficial to know how certain we are of these predicted prices? Tensorflow Probability lets you use the familiar Tensorflow syntax and methodology while adding the ability to work with distributions. In this introductory post, we leave the priors and the Bayesian treatment behind and opt for a simpler probabilistic treatment to illustrate the basic principles. We use the likelihood principle to show how an uncertainty measure can be obtained along with the predicted values in a deep learning regression problem.
Uncertainty can be divided into two types:
- Epistemic uncertainty
- Aleatoric uncertainty
Epistemic uncertainty results from the model lacking information. This information can be obtained by providing new data to the model or by increasing its representation capacity through added complexity, so this type of uncertainty can potentially be addressed and reduced. Aleatoric uncertainty, on the other hand, stems from the inherent stochasticity in the data-generating process. In stochastic processes, there are a number of parameters and only a subset of them is observable. So, theoretically, if there were a way to measure all these parameters, we would be able to reproduce an event exactly. However, in most real-life scenarios, this is not the case. In this work, we are trying to quantify the epistemic uncertainty, which stems from the lack of knowledge in our network or model parameters.
The goal here is to quantify the uncertainty of predictions. In other words, along with getting the predicted values, a measure of certainty or confidence would also be computed for each predicted value. We are going to illustrate this uncertainty analysis using a regression problem. Here we model the relationship between the independent variables and the dependent variable using a neural network. Instead of the neural network outputting a single predicted value y_pred, the network will now predict the parameters of a distribution. This probability distribution is chosen based on the type of the target or dependent variable. For classification, the MaxLike [https://www.nbi.dk/~petersen/Teaching/Stat2015/Week3/AS2015_1201_Likelihood.pdf] principle tells us that the network weights are updated to maximize the likelihood or probability of seeing the true data class given the model (network + weights). A Normal distribution is a baseline; however, it may not be appropriate for all scenarios. For example, if the target variable represents count data, we would choose a Poisson distribution. For a Normal distribution, the neural network would output two values, the parameters of the distribution y_mean and y_std, for every input data point. We are assuming a parametric distribution in the output or target variable, which may or may not be valid. For more complex modeling, you may want to consider a mixture of Gaussians or a Mixture Density network instead.
Normally, the error of the predicted values is computed using a loss function such as the MSE or cross-entropy. Since we have probabilistic outputs, MSE is not an appropriate way to measure the error. Instead, we choose the likelihood function, or rather the Negative Log-likelihood (NLL), as a baseline loss function. In fact, apart from the difference in interpretation, with one being deterministic and the other probabilistic in nature, it can be shown that cross-entropy and NLL are equivalent [REFERENCE]. To illustrate the likelihood, two Normal distributions are plotted in Fig. 1 below, with the dotted lines indicating the likelihood as the probability density at two different data points. The narrower distribution is shown in red, while the wider distribution is plotted in blue. The likelihood of the data point given by x=68 is higher for the narrower distribution, while the likelihood for the point given by x=85 is higher for the wider distribution.
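The comparison in Fig. 1 can be reproduced numerically. The following is a minimal sketch, not the code behind the figure: the shared mean of 70 and the standard deviations of 3 and 10 are assumptions chosen for illustration, since the post does not state the parameters of the two plotted distributions.

```python
import math

def normal_pdf(x, mean, std):
    """Probability density of a Normal(mean, std) distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical parameters: both curves share a mean of 70.
mean, narrow_std, wide_std = 70.0, 3.0, 10.0

for x in (68.0, 85.0):
    narrow = normal_pdf(x, mean, narrow_std)
    wide = normal_pdf(x, mean, wide_std)
    better = "narrow" if narrow > wide else "wide"
    print(f"x={x:.0f}: narrow={narrow:.3g}, wide={wide:.3g} -> {better} is more likely")
```

A point close to the mean is better explained by the narrow distribution, while a point far out in the tail is better explained by the wide one.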
Using the MaxLike principle [REFERENCE] and under assumptions of independence of data points, the objective here is to maximize the likelihood of each data point. As a result of the independence assumption, the total likelihood is therefore the product of the individual likelihoods. For numerical stability, we use the log-likelihood as opposed to the likelihood. The NLLs are summed up for each point to obtain the total loss for each iteration.
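Written out, the steps above take the following form. The notation is introduced here for illustration and is not from the original post: $\theta$ denotes the network weights, $N$ the number of data points, and $\mu_i$ and $\sigma_i$ stand for the y_mean and y_std that the network predicts for input $x_i$.

```latex
\begin{aligned}
\mathcal{L}(\theta) &= \prod_{i=1}^{N} p(y_i \mid x_i, \theta) \\
\mathrm{NLL}(\theta) &= -\log \mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta) \\
-\log p(y_i \mid x_i, \theta) &= \log \sigma_i + \frac{(y_i - \mu_i)^2}{2\sigma_i^2} + \frac{1}{2}\log(2\pi)
\quad \text{(Normal output)}
\end{aligned}
```

The last line shows why a Normal output penalizes both large errors relative to the predicted spread and needlessly large predicted spreads.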
We want to capture the non-linear relationships that may exist between the independent and dependent variables, so we use multiple hidden layers with activation functions for both parameters y_mean and y_std. This allows non-monotonic variations for both parameters. One could simplify this in two ways:
- Fixed variance: only a single parameter y_mean is estimated
- Linear variance: y_std is also estimated, but as a function of a single hidden layer with no activation function
The examples below use a non-linear variation of the standard deviation. The first example illustrates how to fit a linear model (linear variation for y_mean) and is followed by a non-linear variation of y_mean to capture more complex phenomena.
What is Tensorflow Probability (TFP)?
Tensorflow Probability is a framework that builds upon TensorFlow and can work with and perform operations on distributions of data. You can define distributions and sample from them, as shown in the following section.
What distributions are available?
Common distributions such as the Bernoulli, Binomial, Normal, Gamma, etc. are available. More information about these distributions can be found here [https://www.tensorflow.org/probability/api_docs/python/tfp/distributions]
Predictions with uncertainty using Tensorflow Probability on synthetic data
In order to illustrate how TFP can be used to quantify the uncertainty of a prediction, we start off with a synthetic one-dimensional dataset. Synthetic data allows us to perform a controlled experiment and the single dimension makes it easy to visualize the uncertainty associated with each data point and prediction.
Synthetic data generation
The goal here is to generate some synthetic data with non-constant variance. This property of the data is referred to as heteroscedasticity. This data is generated in segments and then concatenated together, as shown below.
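A minimal sketch of segment-wise generation is given here, assuming hypothetical segment sizes, noise levels, and slope. None of these values are taken from the original experiment, and the variable names are deliberately different from those used in the post's own code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical segments: each entry is (number of points, noise std).
segments = [(200, 0.2), (200, 1.0), (200, 3.0)]

# Draw zero-mean noise per segment, then join the segments end to end.
noise_demo = np.concatenate([rng.normal(0.0, std, n) for n, std in segments])
x_demo = np.linspace(0.0, 6.0, len(noise_demo))
y_demo = 2.0 * x_demo + noise_demo  # slope chosen for illustration only

# The spread of the noise grows from segment to segment (heteroscedasticity).
segment_stds = [noise_demo[i * 200:(i + 1) * 200].std() for i in range(3)]
print(segment_stds)
```

Because each segment has its own noise level, a single constant standard deviation cannot describe the whole dataset.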
Fit a linear model with non-constant standard deviation
Some noise is added to the above data, and we generate the target variable ‘y’ from the independent variable ‘x’ and the noise. The relationship between them is:
y = 2.7 * x + noise
This data is then split into a training set and a validation set to assess performance. The relationship between the dependent and independent variables can be visualized in Fig. 2 for both the training set and the validation set.
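    import numpy as np

    # `x` (the concatenated segments) and `x1` (the first segment) are assumed
    # to come from the data-generation step described above.
    np.random.seed(4710)
    noise = np.random.normal(0, x, len(x))  # noise std follows the segment values
    np.random.seed(99)
    first_part = len(x1)
    x11 = np.random.uniform(-1, 1, first_part)
    np.random.seed(97)
    x12 = np.random.uniform(1, 6, len(noise) - first_part)
    x = np.concatenate([x11, x12])
    x = np.sort(x)
    y = 2.7 * x + noise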
Define the model
The model that we build is a fairly simple one with three dense layers applied to the data and two outputs, corresponding to the mean y_mean and the standard deviation y_std. These parameters are concatenated and passed to the distribution function ‘my_dist’.
In the function ‘my_dist,’ the Normal distribution is parameterized by the mean and scale (standard deviation). The mean is the first index in the two-dimensional variable ‘params,’ and the standard deviation is obtained by passing the second index through a softplus operation. This is because the standard deviation is always a positive value, whereas the output of the neural network layer can be positive or negative. The transformation therefore constrains the output to positive values.
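The effect of the softplus transformation can be checked outside the network. This is a minimal NumPy sketch that mirrors the scale expression used in ‘my_dist’ (a 0.05 input scaling and a 1e-3 offset). The helper names `softplus` and `to_scale` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

def to_scale(raw):
    """Map an unconstrained network output to a strictly positive scale."""
    return 1e-3 + softplus(0.05 * raw)

raw_outputs = np.array([-200.0, -1.0, 0.0, 1.0, 200.0])
scales = to_scale(raw_outputs)
print(scales)
```

Every raw value, however negative, maps to a positive scale, and larger raw values map to larger scales. The small offset keeps the scale away from zero, where the log-probability would blow up.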
The function ‘NLL’ computes, as the name indicates, the Negative Log-likelihood (NLL) of the input data given the network parameters and returns it. This will be the loss function.
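To make the loss concrete, the per-point NLL of a Normal distribution can be computed by hand. The sketch below uses NumPy rather than Tensorflow Probability. The function name `normal_nll` and the toy values are assumptions made for illustration only.

```python
import numpy as np

def normal_nll(y, mean, std):
    """Per-point negative log-likelihood of y under Normal(mean, std)."""
    y, mean, std = (np.asarray(a, dtype=float) for a in (y, mean, std))
    return np.log(std) + 0.5 * np.log(2.0 * np.pi) + 0.5 * ((y - mean) / std) ** 2

# Toy values, for illustration only.
y_true = np.array([1.0, 2.0, 3.0])
y_mean = np.array([1.1, 1.5, 3.0])
y_std = np.array([0.5, 0.5, 2.0])

per_point = normal_nll(y_true, y_mean, y_std)
total = per_point.sum()
print(per_point, total)
```

For the same prediction error, a smaller predicted standard deviation gives a larger NLL once the error exceeds that standard deviation, which is how the loss penalizes overconfident predictions.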
Three models are generated:
- Model - outputs y_mean and y_std for the output distribution
- Model_mean - outputs the mean of the distribution returned from ‘my_dist’
- Model_std - outputs the standard deviation of the distribution returned from ‘my_dist’
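    import tensorflow as tf
    import tensorflow_probability as tfp
    from tensorflow.keras.layers import Input, Dense, Concatenate
    from tensorflow.keras.models import Model
    from tensorflow.keras.optimizers import Adam

    tfd = tfp.distributions

    def NLL(y, distr):
        # Negative log-likelihood of the targets under the predicted distribution
        return -distr.log_prob(y)

    def my_dist(params):
        # Column 0 is the mean; column 1 is mapped to a positive scale via softplus
        return tfd.Normal(loc=params[:, 0:1],
                          scale=1e-3 + tf.math.softplus(0.05 * params[:, 1:2]))  # both parameters are learnable

    inputs = Input(shape=(1,))
    hiddena = Dense(30)(inputs)
    hidden1 = Dense(20, activation="relu")(hiddena)
    hidden2 = Dense(20, activation="relu")(hidden1)
    out1 = Dense(1)(hiddena)  # y_mean: no activation on this path, so it stays linear in x
    out2 = Dense(1)(hidden2)  # raw output for y_std, from the non-linear path
    params = Concatenate()([out1, out2])  # combine both outputs into one tensor
    dist = tfp.layers.DistributionLambda(my_dist)(params)
    model_flex_sd = Model(inputs=inputs, outputs=dist)
    model_flex_sd.compile(Adam(learning_rate=0.01), loss=NLL)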
Evaluating the results
Once the model is trained and the convergence plot is inspected, we can also look at the NLL on the test data. This can be used to tune the model and evaluate the fit, but care should be taken not to compare values across different datasets. We will look at this in more detail in the next section. Note that, with its default loss reduction, Keras reports the loss averaged over the evaluated data points rather than a raw sum. It can be computed as shown below.
    model_flex_sd.evaluate(x_test, y_test, verbose=0)
    # Output: 4.0097329248257765
The model fitted on the training and validation data is shown below. Training produced a linear fit, and the black line obtained from y_mean captures this trend. The variance is indicated by the dotted red lines, which align with the variance that was incorporated into the generated data. Finally, the model is evaluated on the test data set.
Fit a non-linear model with non-constant standard deviation
Here, the relationship between the dependent and independent variables varies in a non-linear manner due to the squared term and is shown below.
In order to obtain this non-linear behavior, we add a non-linear activation function to the hidden layer that feeds the y_mean output. Similar to what was done before, we fit the model and plot the predicted mean and standard deviation at each data point for the training, validation and test data points as shown below.
    inputs = Input(shape=(1,))
    hiddena = Dense(30, activation="relu")(inputs)  # the activation makes y_mean non-linear in x
    hidden1 = Dense(20, activation="relu")(hiddena)
    hidden2 = Dense(20, activation="relu")(hidden1)
    out1 = Dense(1)(hiddena)  # y_mean
    out2 = Dense(1)(hidden2)  # raw output for y_std
    params = Concatenate()([out1, out2])  # combine both outputs into one tensor
    dist = tfp.layers.DistributionLambda(my_dist)(params)
Unlike the data generated previously, real-life data tends not to have desirable properties such as unit standard deviation; therefore, preprocessing the data is often a good idea. This is particularly important for techniques that rely on assumptions of Normality in the data distribution to be valid. The dataset used here is the Diabetes dataset [REFERENCE]. This is a regression problem with numerical features and targets.
There are two transformations applied to the data here.
- Standardization
- Power transformation or Quantile transformation
The data is standardized and one of two transforms is applied to the data. The Power transformation can include either the Box-Cox transform [G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964).], which assumes that all values are positive, or the Yeo-Johnson transform [I.K. Yeo and R.A. Johnson, “A new family of power transformations to improve normality or symmetry.” Biometrika, 87(4), pp.954-959, (2000).], which makes no such assumption about the nature of the data. Both the Power and the Quantile transformation map the data to a more Gaussian-like distribution. In the Quantile transformer, the quantile information of each feature is used to map it to the desired distribution, which here is the Normal distribution.
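The idea behind the quantile mapping can be sketched with the Python standard library alone. This is a simplified rank-based illustration, not the scikit-learn implementation, whose details differ. The function name `quantile_to_normal` and the sample values are hypothetical.

```python
from statistics import NormalDist

def quantile_to_normal(values):
    """Map values to standard-Normal scores through their empirical quantiles."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # Mid-rank quantile in (0, 1) avoids the infinite scores at 0 and 1.
        q = (rank + 0.5) / n
        scores[idx] = NormalDist().inv_cdf(q)
    return scores

skewed = [1.0, 2.0, 4.0, 8.0, 100.0]  # heavily right-skewed toy feature
normal_scores = quantile_to_normal(skewed)
print(normal_scores)
```

The ordering of the values is preserved, but the extreme value 100.0 is pulled in to a moderate Normal score, which is why this kind of transform is robust to outliers.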
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PowerTransformer, QuantileTransformer, StandardScaler

    def preprocess_pipeline_power(df, target_column):
        # target_column is unused: the target is taken to be the last column of df
        scaler = StandardScaler()
        power_transform = PowerTransformer(method='yeo-johnson')
        pipeline_power = Pipeline([('s', scaler), ('p', power_transform)])
        res_power = pipeline_power.fit_transform(df)
        x_train, x_test, y_train, y_test = train_test_split(
            res_power[:, 0:-1], res_power[:, -1], test_size=0.2, random_state=123)
        return x_train, x_test, y_train, y_test

    def preprocess_pipeline_quantile(df, target_column):
        # target_column is unused: the target is taken to be the last column of df
        scaler = StandardScaler()
        quantile_transform = QuantileTransformer(n_quantiles=100, output_distribution="normal")
        pipeline_quantile = Pipeline([('s', scaler), ('q', quantile_transform)])
        res_quantile = pipeline_quantile.fit_transform(df)
        x_train, x_test, y_train, y_test = train_test_split(
            res_quantile[:, 0:-1], res_quantile[:, -1], test_size=0.2, random_state=123)
        return x_train, x_test, y_train, y_test
Model fit and evaluate
    import numpy as np
    import pandas as pd
    import tensorflow as tf
    import tensorflow_probability as tfp
    from tensorflow.keras.layers import Input, Dense, Concatenate
    from tensorflow.keras.models import Model
    from tensorflow.keras.optimizers import Adam

    tfd = tfp.distributions

    def NLL(y, distr):
        return -distr.log_prob(y)

    def my_dist(params):
        # both parameters are learnable
        return tfd.Normal(loc=params[:, 0:1],
                          scale=1e-3 + tf.math.softplus(0.05 * params[:, 1:2]))

    def get_model(X):
        # Number of input features
        if isinstance(X, pd.DataFrame):
            Xlen = len(X.columns)
        else:
            Xlen = np.shape(X)[1]
        input1 = Input(shape=(Xlen,))
        hidden1 = Dense(32, activation='relu', name='dense_1')(input1)
        hidden2 = Dense(8, activation='relu', name='dense_2')(input1)
        out1 = Dense(1, activation='relu', name='out_1')(hidden2)  # out1 is mean
        out2 = Dense(1, activation='relu', name='out_2')(hidden1)  # out2 is std
        params = Concatenate()([out1, out2])
        dist = tfp.layers.DistributionLambda(my_dist)(params)
        model = Model(inputs=input1, outputs=dist)
        model.compile(Adam(learning_rate=0.001), loss=NLL)
        model_mean = Model(inputs=input1, outputs=dist.mean())
        model_std = Model(inputs=input1, outputs=dist.stddev())
        model.summary()
        return model, model_mean, model_std

    def fit_model(model, X_data_train, y_data_train, batch_size=128, epochs=1000, validation_split=0.1):
        history = model.fit(X_data_train, y_data_train, batch_size=batch_size,
                            epochs=epochs, validation_split=validation_split)
        return model

    def evaluate_model(model, model_mean, model_std, X_data_test, y_data_test):
        y_out_mean = model_mean.predict(X_data_test)
        y_out_std = model_std.predict(X_data_test)
        y_out_mean_vals = y_out_mean.squeeze(axis=1)
        if isinstance(y_data_test, pd.DataFrame):
            y_test_true_vals = y_data_test.values.squeeze(axis=1)
        else:
            y_test_true_vals = y_data_test
        y_out_std_vals = y_out_std.squeeze(axis=1)
        # Per-point NLL of the true value under the predicted Normal distribution
        neg_log_prob_array = []
        for predicted, true_val, predicted_std in zip(y_out_mean_vals, y_test_true_vals, y_out_std_vals):
            neg_log_prob = -1.0 * tfd.Normal(predicted, predicted_std).log_prob(true_val).numpy()
            neg_log_prob_array.append(neg_log_prob)
        return neg_log_prob_array
Evaluating the results
As mentioned before, apart from the convergence plots, you can evaluate model uncertainty based on the performance on the test set using the sum of NLL. This metric gives us a way to compare different models. We can also look at the distribution of the NLLs that are obtained on the test data set to understand how the model has generalized to new data points. Outliers could have contributed to a large NLL, which would be obvious from inspecting the distribution.
    model, model_mean, model_std = get_model(X_trans)
    model = fit_model(model, X_data_train, y_data_train, epochs=1000)
    neg_log_array = evaluate_model(model, model_mean, model_std, X_data_test, y_data_test)
Here, the NLL for each point is accumulated in the array ‘neg_log_array’ and the histogram is plotted. We compare two scenarios: one where the quantile transformation is applied to the data and one where the power transformation is applied. We want most of the density in the histogram to be close to 0, indicating that most of the data points had a low NLL, i.e. the model fit those points well. Fig. 7 illustrates this for the two transformation techniques. The quantile transformation seems to have marginally better performance if your goal is to minimize the outliers in the uncertainty of model predictions. This can also be used to perform hyperparameter tuning of the model and select an optimal model.
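Alongside the histogram, a few summary statistics of the per-point NLLs can make such a comparison easier. The helper below is a hypothetical sketch (the name `summarize_nll` and the toy values are not from the original post) that could be applied to an array such as ‘neg_log_array’.

```python
import numpy as np

def summarize_nll(neg_log_array):
    """Return the total, median, and 95th percentile of per-point NLLs."""
    values = np.asarray(neg_log_array, dtype=float).ravel()
    return {
        "total": float(values.sum()),
        "median": float(np.median(values)),
        "p95": float(np.percentile(values, 95)),
    }

# Made-up NLL values with one outlier, for illustration only.
toy_nll = [0.4, 0.6, 0.5, 0.7, 9.0]
summary = summarize_nll(toy_nll)
print(summary)
```

A large gap between the median and the upper percentile suggests that a few poorly fit points dominate the total, which matches the outlier effect described above.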
This post shows how uncertainty quantification can be beneficial in predictive modeling. Additionally, we walked through using Tensorflow Probability to quantify uncertainty from a probabilistic perspective on a deep learning problem. This approach avoids a full Bayesian treatment and tends to be a more approachable introduction to uncertainty estimation.
Try out the examples shown here on Databricks on our ML runtimes!