Mixed Precision Training is a technique that represents the parameters of your model in 16-bit instead of 32-bit, which lets you increase the batch size and thus speed up training. We'll also cover how to adaptively solve problems that may arise during this process, as well as what Automatic Mixed Precision (AMP) is.
If you'd like to read the paper first, here's a link to the paper on Mixed Precision Training.
Background on Mixed Precision
Is single precision (Floating Point 32, FP32) really a hard requirement for deep learning training? If we can use half precision (Floating Point 16, FP16), we can reduce the memory required to train the model and accelerate the computation. Of course, we need to make sure that the performance of the trained model doesn't drop too much compared to FP32.
At its simplest, it's just a matter of reducing the parameters of the model from 32-bit to 16-bit, but this technology is truly groundbreaking when you consider the following three aspects.
- the same model can learn faster and get results faster
- faster learning, resulting in less GPU usage, leading to lower carbon dioxide emissions
- can be applied to any model, regardless of model structure
I'm particularly impressed with points 2 and 3: because the technique applies universally and is not limited to specific models or computations, it can greatly reduce the carbon dioxide emitted by training. Moreover, given how many very large AI models there are nowadays, represented by GPT and especially ChatGPT, and how astronomical the resources used to train them are, the importance of Mixed Precision cannot be overstated.
The problem with Half Precision
If reducing the parameters from 32-bit to 16-bit lowers the precision of the number representation, and thus reduces the amount of computation and speeds up learning, why not just go with 16-bit in the first place?
The answer can be found in the resources below. The figure below shows the distribution of gradients for the parameters of the well-known Single Shot Multibox Detector (SSD) object detection model, from an experiment published on the NVIDIA Technical Blog.
Based on the red line in the figure, the left side is the range of numbers that cannot be represented by FP16 and the right side is the range of numbers that can be represented by FP16. Since more than half of the gradients cannot be represented with FP16, it can be inferred that a lot of information would be lost if training with FP16.
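As a rough illustration of what happens on either side of the representable range, the sketch below rounds a few made-up gradient magnitudes to FP16 using Python's standard struct module (format code "e" is IEEE 754 half precision). The helper name to_fp16 and the sample values are assumptions chosen for illustration; they are not taken from the SSD experiment.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value.

    struct raises OverflowError for magnitudes too large for FP16,
    so those are mapped to +/-inf here, as a cast on hardware would do.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


# Hypothetical gradient magnitudes, picked only to show the different regimes.
sample_grads = [1e-3, 1e-5, 1e-8, 7e4]
fp16_grads = [to_fp16(g) for g in sample_grads]

for g, h in zip(sample_grads, fp16_grads):
    print(f"original {g:.1e} -> FP16 {h}")
```

In this sketch 1e-8 is flushed to 0.0 (underflow) and 7e4 becomes inf (overflow), while 1e-3 and 1e-5 survive with reduced precision. A gradient that becomes 0.0 contributes nothing to the weight update, which is the information loss described above.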
If you're thinking, "What difference does it make if the number of bits is halved?", take a look at the resources below.
The difference between 32-bit and 16-bit is dramatic, as shown in the image above.
If 16-bit is not enough, it is unreasonable to think that you can learn with 8-bit and 4-bit...
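To put numbers on the gap between the two formats, the limits of each can be derived from its bit layout. The sketch below assumes the standard IEEE 754 layouts (5 exponent bits and 10 mantissa bits for FP16, 8 and 23 for FP32); the function name float_format_limits is hypothetical.

```python
def float_format_limits(exponent_bits: int, mantissa_bits: int) -> dict:
    """Derive basic limits of an IEEE 754 binary format from its bit layout."""
    bias = 2 ** (exponent_bits - 1) - 1
    return {
        # Largest finite value: all mantissa bits set, largest normal exponent.
        "max": (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias,
        # Smallest positive normal value.
        "min_normal": 2.0 ** (1 - bias),
        # Smallest positive subnormal value; anything much smaller rounds to 0.
        "min_subnormal": 2.0 ** (1 - bias - mantissa_bits),
        # Gap between 1.0 and the next representable value.
        "epsilon": 2.0 ** -mantissa_bits,
    }


fp16 = float_format_limits(exponent_bits=5, mantissa_bits=10)
fp32 = float_format_limits(exponent_bits=8, mantissa_bits=23)

for name, limits in (("FP16", fp16), ("FP32", fp32)):
    print(name, limits)
```

The FP16 maximum comes out as 65,504 and its smallest positive value as 2^-24, whereas FP32 reaches roughly 3.4e38 and goes down to 2^-149.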
So we want to use FP16 for its memory and speed advantages, but given the problems above, a model trained entirely in FP16 would clearly suffer degraded performance. Mixed Precision was proposed to prevent this.
As a side note, because C does not support FP16, it has been reported that there is no speed difference between 16-bit and 32-bit when using NumPy, which is based on CPython. In fact, 16-bit was reported to be slower, but this seems to have been resolved.
Work around for Mixed Precision (MP)
As the name suggests, it's neither FP32 nor FP16 for all parameters. It's a mixture of both types of training, as illustrated in the paper.
Mixed Precision Training Iteration (Version 1)
- copy the FP32 master weights to create FP16 weights
- run forward propagation in FP16 (the gradients will also be FP16)
- multiply the loss value by the scale factor S
- back propagate the scaled loss to obtain the FP16 gradients
- since the loss was multiplied by S in step 3, divide the weight gradients obtained from back propagation by S
- apply gradient clipping, weight decay, etc.
- update the FP32 master weights
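The steps above can be sketched on a toy model with a single weight. This is a minimal illustration under stated assumptions, not the paper's implementation: the model (pred = w * x with a squared-error loss), the names to_fp16 and train_step, and the value S = 1024 are all made up for the example, and ordinary Python floats stand in for FP32.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (overflow becomes +/-inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def train_step(master_w: float, x: float, y: float, lr: float, scale: float) -> float:
    """One iteration for pred = w * x with loss 0.5 * (pred - y) ** 2."""
    w16 = to_fp16(master_w)              # step 1: FP16 copy of the master weight
    x16 = to_fp16(x)
    pred = to_fp16(w16 * x16)            # step 2: forward pass in FP16
    err = to_fp16(pred - to_fp16(y))
    scaled_err = to_fp16(scale * err)    # step 3: scaling the loss scales d(loss)/d(pred)
    grad16 = to_fp16(scaled_err * x16)   # step 4: backward pass in FP16
    grad = grad16 / scale                # step 5: unscale in higher precision
    # step 6: gradient clipping and weight decay are omitted in this sketch
    return master_w - lr * grad          # step 7: update the master weight


# The true gradient here is about 1e-8, below what FP16 can represent.
w_unscaled = train_step(1.0, 1e-4, 0.0, lr=1.0, scale=1.0)
w_scaled = train_step(1.0, 1e-4, 0.0, lr=1.0, scale=1024.0)

print("weight changed without scaling:", w_unscaled != 1.0)
print("weight changed with scaling:   ", w_scaled != 1.0)
```

Without scaling, the FP16 gradient underflows to zero and the master weight does not move; with S = 1024 the same gradient survives the FP16 backward pass and is restored to its true magnitude before the update.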
However, if the scaling factor S is too large or too small, the multiplication and division may produce Inf or NaN. There are many different models and experimental environments, so always setting S to the same value does not seem like a good approach.
What we can do is keep the maximum gradient value at or below 65,504 (the maximum value in FP16). It is reasonable to start with a large S and then, if an iteration runs into trouble, decrease S, increasing it again once training is stable. We can call this method adaptive. Described as an algorithm:
Mixed Precision Training Iteration (Version 2)
Store the FP32 master weights expressed in FP32.
Set the initial value of S to a large number
For each iteration, do the following
- copy the FP32 weights to create FP16 weights
- perform forward propagation
- multiply the loss value by the scale factor S
- perform backward propagation
- if any of the gradients of the model's weights are inf or NaN:
-> reduce the value of S
-> go to the next iteration without updating the weights (skip the three remaining steps)
- divide the weight gradients obtained from back propagation by S
- perform a weight update, including gradient clipping and weight decay
- increase the value of S if no inf or NaN has occurred in the last N iterations
I've kept this close to the original English description, and it's easy to understand. The only addition over Version 1 is that we decrease S when an inf or NaN occurs and increase it again after N clean iterations.
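The bookkeeping for S in the adaptive procedure can be sketched as a small class. The class name DynamicLossScaler, its parameter names, and the factors 2 and 0.5 are assumptions for illustration; the description above only says to reduce or increase S, without fixing by how much.

```python
import math


class DynamicLossScaler:
    """Tracks the scale factor S and decides whether a step must be skipped."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale                 # S, deliberately large at the start
        self.growth_factor = growth_factor      # how much to increase S
        self.backoff_factor = backoff_factor    # how much to reduce S
        self.growth_interval = growth_interval  # N clean iterations before growing
        self.clean_steps = 0

    def unscale(self, scaled_grads):
        """Return unscaled gradients, or None if this iteration must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale *= self.backoff_factor   # reduce S
            self.clean_steps = 0
            return None                         # caller skips the weight update
        grads = [g / self.scale for g in scaled_grads]
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            self.scale *= self.growth_factor    # increase S after N clean iterations
            self.clean_steps = 0
        return grads


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
first = scaler.unscale([math.inf, 3.0])   # overflow: skip and halve S
scale_after_overflow = scaler.scale
second = scaler.unscale([512.0])          # clean step with S = 512
third = scaler.unscale([512.0])           # second clean step, S grows afterwards
scale_after_growth = scaler.scale
```

A training loop would call unscale once per iteration with the gradients from the scaled backward pass and apply the weight update only when the result is not None.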
Why increase S?
- Multiplying by S moves every bar in the gradient histogram above to the right. As noted earlier, the range to the right of the red line is the range that can be represented in 16-bit, so the point of this method is to move as many gradients as possible to the right of the red line (to be precise, the goal is to place as much of the gradient distribution as possible between the red line in the center and the black line on the far right).
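The shift can be checked numerically. The sketch below builds a made-up set of gradient magnitudes spread over powers of two and counts how many survive a cast to FP16 before and after scaling. The values, the helper names, and the choice S = 2^8 are assumptions for illustration only.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (overflow becomes +/-inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def count_representable(values, scale):
    """Count values that are neither flushed to zero nor overflowed after scaling."""
    cast = [to_fp16(v * scale) for v in values]
    return sum(1 for c in cast if c != 0.0 and not math.isinf(c))


# Hypothetical gradient magnitudes: 2^-10, 2^-12, ..., 2^-30.
grads = [2.0 ** -e for e in range(10, 31, 2)]

before = count_representable(grads, scale=1.0)
after = count_representable(grads, scale=2.0 ** 8)

print(f"representable without scaling: {before} of {len(grads)}")
print(f"representable with S = 2^8:    {after} of {len(grads)}")
```

Multiplying by 2^8 adds 8 to every exponent, so the three smallest values (2^-26, 2^-28, 2^-30) move from below the FP16 limit of 2^-24 into the representable range, while the largest value is still far from overflowing.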
Results with Mixed Precision Training
In our experiments, we can see that training with Mixed Precision does not degrade performance and increases throughput by 1-5 times.
The reported numbers are in the table, and I'll add my two cents.
- Some people say that there are cases where Mixed Precision actually increases performance, but I think the reported difference is within the margin of error. Even for the same model and the same dataset, there are several factors that can change the performance (data order, model initialization), so I think this level of difference is negligible.
- I noticed a performance improvement in MXNet Framework, but I need to think more about why.
- The reason why the number of bits was halved but the throughput was not doubled is likely due to the overhead of casting parameters during the training process and the framework itself.
AllowList, DenyList, and InferList
If you look at NVIDIA's documentation, you'll notice that there is a distinction between AllowList, DenyList, and InferList in Mixed Precision.
Why is this distinction made? Let's take a look at the original documentation.
How is AllowList/DenyList/InferList determined? What are the corresponding ops that are in each list?
- We determine these based on our experience with numeric stability from our research. AllowList operations are operations that take advantage of our GPU Tensor Cores. DenyList operations are operations that may overflow the range of FP16, or require the higher precision of FP32. InferList operations are operations that are safely done in either FP32 or FP16. Typical ops included in each list are:
- AllowList: Convolutions, Fully-connected layers
- DenyList: Large reductions, Cross entropy loss, L1 Loss, Exponential
- InferList: Element-wise operations (add, multiply by a constant)
Of note is DenyList, which can be broken down into two categories
- Linear operation: Large reductions, L1 Loss
- Non-linear operation: Cross entropy loss, Exponential
The non-linear operations can easily be expected to go beyond the range FP16 can represent, because they involve exponentials or logarithms. It is therefore clear that casting them to FP16 would degrade performance when applying Mixed Precision.
The linear operations seem a less obvious risk, since they are simple first-order operations. Still, if you think about it, L1 Loss has mean or sum as reduction options, and you can probably expect performance degradation with a large mini-batch (as I understand it, because the running total can go out of range while many numbers are being added up).
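Both failure modes can be reproduced with a small sketch. It assumes a made-up batch of 1,000 per-sample losses of 100.0 each and an exponential of 12; the helper names are hypothetical, and Python floats stand in for FP32.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (overflow becomes +/-inf)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def fp16_sum(values):
    """Sum values while rounding the running total to FP16 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


losses = [100.0] * 1000             # hypothetical per-sample L1 losses

fp16_total = fp16_sum(losses)       # running total passes 65,504 and overflows
fp32_total = sum(losses)            # higher precision keeps the true sum
fp16_exp = to_fp16(math.exp(12.0))  # exp(12) is about 162,755, beyond FP16 range

print("sum accumulated in FP16:", fp16_total)
print("sum in higher precision:", fp32_total)
print("exp(12) cast to FP16:   ", fp16_exp)
```

Each individual loss fits comfortably in FP16, yet the sum does not, which is consistent with large reductions and exponentials being kept in FP32.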
What is Automatic Mixed Precision (AMP)?
What is Automatic Mixed Precision (AMP) and how can it help with training my model?
- Automatic Mixed Precision (AMP) makes all the required adjustments to train models using mixed precision, providing two benefits over manual operations:
-> Developers need not modify network model code, reducing development and maintenance effort.
-> Using AMP maintains forward and backward compatibility with all the APIs for defining and running models.
To summarize, it means providing Mixed Precision at the framework level to make it easier to apply.
Today, I introduced Mixed Precision, which I think is very important for any company that provides AI services, including DataHunt, because training a deep learning model is a very expensive and time-consuming endeavor. If you can halve the time it takes to train a model, you can halve the material costs, and you can dramatically improve the speed of service development.
That's why Datahunt is also using Mixed Precision for model research and development, and we will continue to follow up on the latest research in this area.
In the next installment, we'll take a hands-on look at how to use AMP in PyTorch with its Automatic Mixed Precision package.