MixedPrecision Training is a technique that allows you to increase the batch size by representing the parameters of your model in 16-bit instead of 32-bit, and thus speed up learning. We'll also cover how to adaptively solve problems that may arise during this process, as well as what Automatic Mixed Precision (AMP) is.
Ifyou'd like to read the paper first, here's a link to the paper on MixedPrecision Training.
Is single precision (Floating Point 32, FP32) really a red line in deep learning training? If we can utilize Half Precision (Floating Point 16, FP16), we can reduce the memory required to train the model and accelerate the computation.Of course, we need to make sure that the performance of the trained model doesn't drop too much compared to FP32.
At its simplest, it's just a matter of reducing the parameters of the model from32-bit to 16-bit, but this technology is truly groundbreaking when you consider the following three aspects.
I'm particularly impressed with points 2 and 3 because they can be applied universally, not limited to specific models or computations, which can greatly reduce the environmental impact of carbon dioxide generation. Moreover, considering that there are many super-sized AI models represented by GPT, especially ChatGPT, nowadays, and the resources used to train them are astronomical, the importance of Mixed Precision cannot be overstated.
If we can reduce the precision of the number representation by reducing the parameter from 32-bit to 16-bit, and thus reduce the amount of computation and speed up learning, why not just go with 16-bit in the first place?
The answer can be found in the resources below. The figure below shows the distribution of gradients for the parameters of the famous Single Shot Multibox Detector (SSD) model in Object Detection, as experimented on the NVIDIA Technical Blog.
Based on the red line in the figure, the left side is the range of numbers that cannot be represented by FP16 and the right side is the range of numbers that can be represented by FP16. Since more than half of the gradients cannot be represented with FP16, it can be inferred that a lot of information would be lost if training with FP16.
If you're thinking, "What difference does it make if the number of bits is halved?", take a look at the resources below.
The difference between 32-bit and 16-bit is dramatic, as shown in the image above.
If 16-bit is not enough, it is unreasonable to think that you can learn with 8-bitand 4-bit...
So we want to utilize FP16 to take advantage of memory and speed, but due to the above problems, the performance degradation of Fully Designed to FP16 model seems obvious. To prevent this, Mixed Precision is proposed.
Asa side note, since C does not support FP16, it has been reported that there is no 16-bit / 32-bit speed difference when utilizing Numpy, which is based on CPython. In fact, it was reported that 16-bit was slower, but this seems to have been resolved.
As the name suggests, it's neither FP32 nor FP16 for all parameters. It's a mixture of both types of training, as illustrated in the paper.
However, if the scaling factor S is too large or too small, the multiplication and division process may cause Inf or NaN. There are so many models and experimental environments in the world, and it doesn't seem like a good way to learn by always setting S to the same value.
But what we can think of is to design it so that the maximum value of the gradient is not more than 65,504 (the maximum value in FP16), so it would be good to set the initial S large, but if there is a problem with learning after an iteration, we can choose to increase or decrease S. We can call this method adaptive. And if we describe this as an algorithm,
Store the FP32 master weights expressed in FP32.
Set the initial value of S to a large number
Foreach iteration, do the following
Divide the weight gradient obtained by back propagating by S
Performa weight update, including gradient clipping and weight decay.
Increase the value of S if no inf or NaN has occurred in the last N iterations
I've translated it verbatim from English, but it's easy to understand. The only addition is that we increase or decrease S if an inf or NaN has occurred.
Why increment S?
In our experiments, we can see that training with Mixed Precision does not degrade performance and increases throughput by 1-5 times.
There ported numbers are in the table, and I'll add my two cents.
If you look at NVIDIA's documentation, you'll notice that there is a distinction between AllowList, DenyList, and InterList in Mixed Precision.
Why is this distinction made? Let's take a look at the original documentation.
How is AllowList/DenyList/InferList determined? What are the corresponding ops that are in each list?
Of note is DenyList, which can be broken down into two categories
Non-linear can easily be expected to go beyond the scope of FP16's representation because it involves exponential operations such as log or e. Therefore, it is clear that casting will cause performance degradation when applying Mixed Precision.
However, Linear seems a bit unlikely, since it is a simple quadratic operation. Still, if you think about it, L1 Loss has mean or sum operations as options, and you can probably expect performance degradation assuming a large mini-batch (I understood that it is because it can go out of range in the process of adding large numbers).
What is Automatic Mixed Precision (AMP) and how can it help with training my model?
To summarize, it means providing Mixed Precision at the framework level to make it easier to apply.
Today, I introducedMixed Precision, which I think is very important for any company that providesAI services, including DataHunt, because training a deep learning model is a very expensive and time-consuming endeavor. If you can halve the time it takes to train a model, you can halve the material costs, and you can dramatically improve the speed of service development.
That's why Datahunt is also using Mixed Precision for model research and development, and we will continue to follow up on the latest research in this area.
In the next installment, we'll take a hands-on look at how to use AMP in PyTorch with the automatic Mixed precision package.