Batch Normalization is used in deep neural networks to speed up training and make it more stable. It is an additional layer in the network that normalizes the values passed on to the next layer. This means that, within each batch, the values are shifted and scaled so that their mean is 0 and their standard deviation is 1.
What problems arise when training deep neural networks?
When training a deep neural network, backpropagation takes place after each forward pass. In this process, the prediction error is propagated backward through the network layer by layer, from the output to the input. The weights of the individual neurons are then changed so that the error is reduced as quickly as possible. Each weight update is computed under the assumption that all other layers remain the same. In practice, however, this assumption holds only to a limited extent, since all layers are updated in every training step. The paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” describes this core problem in more detail.
The consequence is that every change to the weights also changes the statistics of the values a layer passes on. After each training step, the inputs to a layer therefore follow a new distribution with a different mean and standard deviation. This slows training down: each layer has to keep adapting to the new statistical properties of its inputs, and lower learning rates are needed to get good results.
What is Normalization?
Normalization of data is a process that is often used in the preparation of data sets for Machine Learning. The idea is to bring the numerical values of different attributes to a common scale. Normalization then allows numeric features to be combined and merged into a new attribute.
Let’s assume that you want to train a model that learns how different marketing activities affect revenue and the quantity sold. To do this, one could simply use the sum of the quantity sold and the revenue as the dependent variable. However, this can quickly lead to distorted results, for example, if one product series sells many units at a relatively low unit price. In a second series, it can be the other way around, i.e. the products are sold less often, but have a high unit price.
A marketing campaign that leads to 100,000 products sold, for example, should then be rated lower in the product series with low unit prices than in the series with high unit prices. Similar problems arise in other fields, for example, when looking at the private expenditures of individuals. For two different people, food expenditures of €200 can mean very different things when set in relation to their monthly income. Expressing the expenditure as a share of income is also a form of normalization.
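One common way to bring such features to a common scale is to standardize them to a mean of 0 and a standard deviation of 1. The following is a minimal sketch in plain Python; the function name `standardize` and the campaign figures are made up for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

# Hypothetical figures for five campaigns (not real data).
units_sold = [100_000, 80_000, 120_000, 90_000, 110_000]
unit_price = [2.0, 2.5, 1.5, 3.0, 1.0]

z_units = standardize(units_sold)
z_price = standardize(unit_price)

# Both features are now on the same scale and can be compared or combined.
print([round(z, 2) for z in z_units])
print([round(z, 2) for z in z_price])
```

After this step, a value of 1.0 means "one standard deviation above the average" for both features, regardless of their original units.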
What is Batch Normalization?
When training neural networks, the complete data set is divided into so-called batches. Each batch contains a random selection of examples of a fixed size and is used for one training step. Batch sizes of 32 or 64 are common, i.e. there are 32 or 64 individual examples in a batch.
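The splitting into batches can be sketched in a few lines of plain Python. The helper `make_batches` is a hypothetical name used for illustration; deep learning frameworks ship their own data loaders for this.

```python
import random

def make_batches(dataset, batch_size=32, seed=0):
    """Shuffle the data set and cut it into batches of at most batch_size."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    return [
        [dataset[i] for i in indices[start:start + batch_size]]
        for start in range(0, len(indices), batch_size)
    ]

dataset = list(range(100))  # stand-in for 100 training examples
batches = make_batches(dataset, batch_size=32)
print([len(b) for b in batches])  # [32, 32, 32, 4]
```

Note that the last batch is smaller whenever the data set size is not a multiple of the batch size.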
The input data arriving at the input layer of the network has usually already been normalized during data preprocessing. This means that all numerical values have been brought to a uniform scale and are thus comparable. In most cases, the data then have a mean of 0 and a standard deviation of 1.
After the first layer, however, the values have been multiplied by weights and passed through the activation function, which shifts their distribution and thus denormalizes them again. This happens again in every further layer, so that the values reaching the output layer are no longer normalized. Because the weights change in every training step, the distribution of each layer's inputs also keeps changing during training. The paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” calls this change “internal covariate shift”.
For this reason, batch normalization normalizes the values again before each activation function. For each batch, the mean and the standard deviation of the shifted values are calculated, and the batch is normalized with them. The layer can then rescale and shift the result with two learned parameters, usually called gamma and beta.
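The shift caused by an activation function can be seen in a small experiment. The sketch below uses ReLU purely as an example activation and random standard-normal numbers as stand-ins for normalized inputs; both are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)

# 10,000 values that are already standardized (mean ~0, std ~1).
inputs = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# ReLU sets every negative value to zero.
activated = [max(0.0, x) for x in inputs]

print("before:", round(mean(inputs), 2), round(pstdev(inputs), 2))
print("after: ", round(mean(activated), 2), round(pstdev(activated), 2))
```

After the activation, the mean moves to roughly 0.4 and the standard deviation drops to roughly 0.6, so the next layer no longer receives normalized values.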
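The normalization step for a single neuron can be sketched as follows. This is a simplified illustration in plain Python, not the code of any library: `gamma` and `beta` stand in for the layer's learnable scale and offset, `epsilon` is a small constant that avoids division by zero, and the input values are made up.

```python
from statistics import mean, pvariance

def batch_norm(batch, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature across a batch, then scale and shift it."""
    mu = mean(batch)
    var = pvariance(batch, mu)  # variance of this batch only
    return [gamma * (x - mu) / (var + epsilon) ** 0.5 + beta for x in batch]

# Hypothetical pre-activation values of one neuron for a batch of 8 examples.
pre_activations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = batch_norm(pre_activations)
print([round(v, 2) for v in normalized])
```

With `gamma=1.0` and `beta=0.0` the output has a mean of 0 and a standard deviation very close to 1. During training, the network can learn other values for both parameters if a different scale turns out to work better.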
What are the advantages of normalizing the batch?
There are many advantages to using Batch Normalization when training a neural network. The most commonly mentioned are the following:
- Faster training: Normalization makes it possible to use higher learning rates, so the training can be completed in fewer steps.
- Reducing the Internal Covariate Shift: As described earlier, normalizing the batch can at least mitigate the Internal Covariate Shift. A related effect occurs, for example, in image classification where a distinction is to be made between different classes, such as dogs and cats. The different coat colors can lead to different distributions of the image data, which can be smoothed out by Batch Normalization.
- Smoother convergence: Normalization allows the model to converge faster and, more importantly, more independently of the batch in question. As a result, the loss curve is smoother and has fewer outliers. Certain model types, such as transformers, also converge only with difficulty or even not at all without normalization, although they typically use layer normalization rather than batch normalization.
- Reducing overfitting: Because the statistics differ slightly from batch to batch, normalization adds a small amount of noise that acts as a mild regularizer. This can reduce the risk that the model gives good results for the training data but generalizes poorly.
- Different network types possible: Batch normalization can be used with different types of neural networks and often leads to good results. For example, one can add a normalization layer to Convolutional or Feedforward Neural Networks and, with some adaptations, to Recurrent Neural Networks.
How to implement batch normalization in Python?
When using TensorFlow, you can use a predefined Keras layer for batch normalization. It offers a variety of parameters, so the layer can be customized without much effort. The layer is typically inserted between two layers of the network, either before or after the activation function of the preceding layer.
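import tensorflow as tf

# All arguments are shown with their default values.
batch_norm_layer = tf.keras.layers.BatchNormalization(
    axis=-1,                              # axis that holds the features
    momentum=0.99,                        # weight of the old moving averages
    epsilon=0.001,                        # avoids division by zero
    center=True,                          # add the learned offset beta
    scale=True,                           # multiply by the learned factor gamma
    beta_initializer='zeros',
    gamma_initializer='ones',
    moving_mean_initializer='zeros',
    moving_variance_initializer='ones',
    beta_regularizer=None,
    gamma_regularizer=None,
    beta_constraint=None,
    gamma_constraint=None,
)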
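The `momentum` parameter and the two `moving_*` initializers belong to the running statistics that the layer keeps for inference. During training, the layer normalizes with the mean and variance of the current batch and updates moving averages of both; at inference time it uses these moving averages instead. The update rule can be sketched in plain Python as follows. This is a simplified illustration with made-up batch statistics, not the library's actual code, and `update_moving_average` is a hypothetical helper name.

```python
def update_moving_average(moving_value, batch_value, momentum=0.99):
    """Exponential moving average: keep most of the old value, add a bit of the new."""
    return momentum * moving_value + (1.0 - momentum) * batch_value

# Start values match the 'zeros' and 'ones' initializers of the layer.
moving_mean, moving_variance = 0.0, 1.0

# Hypothetical (mean, variance) pairs observed in three training batches.
batch_stats = [(5.0, 4.0), (5.2, 3.8), (4.9, 4.1)]
for batch_mean, batch_variance in batch_stats:
    moving_mean = update_moving_average(moving_mean, batch_mean)
    moving_variance = update_moving_average(moving_variance, batch_variance)

print(round(moving_mean, 4), round(moving_variance, 4))
```

With a momentum of 0.99, the moving averages change slowly, so they only approach the true statistics after many training steps. A lower momentum makes them adapt faster but also more noisily.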
How should the model be constructed?
When building a model with a batch normalization layer, there are several things to consider. Among other things, it is often possible to increase the learning rate once the normalization layer has been added. The normalization makes training more stable, so the model can change its weights in larger steps and still converge.
At the same time, it is worth checking whether a dropout layer is still needed. On the one hand, normalization already has a regularizing effect, which is why the dropout layer may not be necessary at all. On the other hand, combining the two can even worsen the result, since the noise from the batch statistics and the random omission of nodes add up.
Finally, it may be useful to vary the position of the batch normalization and test both the normalization before and after the activation function. Depending on the structure of the model, this can lead to better results.
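Both placements can be compared with a small helper like the one below. This is a sketch under assumptions: the function name `build_model`, the input width of 20, the layer sizes, the loss, and the learning rate are all illustrative choices, not recommendations.

```python
import tensorflow as tf

def build_model(bn_before_activation=True, learning_rate=1e-3):
    """Small feedforward model with one batch normalization layer."""
    layers = [tf.keras.Input(shape=(20,))]
    if bn_before_activation:
        # Dense -> BatchNormalization -> activation.
        # The bias is dropped because the layer's offset beta plays the same role.
        layers += [
            tf.keras.layers.Dense(64, use_bias=False),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Activation("relu"),
        ]
    else:
        # Dense with activation -> BatchNormalization.
        layers += [
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.BatchNormalization(),
        ]
    layers.append(tf.keras.layers.Dense(1))

    model = tf.keras.Sequential(layers)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse",
    )
    return model

model_before = build_model(bn_before_activation=True)
model_after = build_model(bn_before_activation=False)
```

Training both variants on the same data, possibly with several values for `learning_rate`, shows which placement works better for a given problem.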
What are the limitations of batch normalization?
Although batch normalization (BN) is a widely used technique in deep learning, there are some limitations and challenges associated with it. Here are some of the main limitations of batch normalization:
- Limited performance on small batch sizes: Batch normalization works by computing statistics (mean and variance) over mini-batches of training data. However, when the batch size is small, the estimated statistics may not accurately represent the true distribution of the data, which can limit the effectiveness of BN.
- Computational overhead: Batch normalization involves extra computations and memory usage, which can increase the training time and memory requirements of deep learning models.
- Difficulty interpreting model weights: The effect of batch normalization on the weights of a model can be difficult to interpret, since the output of a normalized layer depends not only on the weights but also on the batch statistics and the learned scale and offset.
- Dependence on batch composition: Because each example is normalized with the statistics of its batch, the output for an example during training depends on which other examples happen to be in the same batch. Training the same model with differently composed batches can therefore lead to different results.
- Not always suitable for certain types of models: There are some types of models, such as recurrent neural networks, where batch normalization may not be as effective or may require special adaptations.
- Potential for over-regularization: Batch normalization can sometimes lead to over-regularization of deep learning models, which can limit their ability to generalize to new data.
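The first limitation, unreliable statistics for small batches, can be made visible with a short simulation. The sketch below draws random batches from an artificial data set with a true mean of 0 and measures how much the batch mean fluctuates; the data and the helper `spread_of_batch_means` are made up for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)

# Stand-in for one feature of a full data set (true mean 0, true std 1).
population = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

def spread_of_batch_means(batch_size, n_batches=500):
    """Standard deviation of the batch mean across many random batches."""
    means = [mean(rng.sample(population, batch_size)) for _ in range(n_batches)]
    return pstdev(means)

for batch_size in (2, 8, 64):
    print(batch_size, round(spread_of_batch_means(batch_size), 3))
```

The fluctuation shrinks roughly with the square root of the batch size, so a batch of 2 gives a far noisier estimate of the mean than a batch of 64.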
Despite these limitations, batch normalization remains a powerful tool for improving the performance of deep learning models in many cases. However, it is important to understand its limitations and use it appropriately in different types of models and training scenarios.
This is what you should take with you
- Batch normalization is an additional layer in a neural network that ensures that the numerical input values are normalized.
- It can ensure that the model trains significantly faster and more stably since outliers are smoothed out as far as possible.
- TensorFlow already offers a predefined layer that can be added to an existing model. In addition, there are many parameters with which the layer can be customized.
Other Articles on the Topic of Batch Normalization
- The Keras documentation describes the Batch Normalization Layer and its parameters in detail.
- Machine Learning Mastery’s article on Batch Normalization is also well worth reading and was used as a source for this post.