SparseGPT, the first compression algorithm for 100-billion-parameter models, is here: lower compute costs with high accuracy maintained


Author | Li Mei

Editor | Chen Caixian

Since the emergence of GPT-3 in 2020, and especially with the recent explosion of ChatGPT, the generative large language models of the GPT family have returned to the spotlight, showing strong performance across a wide range of tasks.

However, the enormous scale of these models also drives up computing costs and makes deployment difficult.

For example, the GPT-175B model requires at least 320 GB of storage in half-precision (FP16) format, and at least five 80 GB A100 GPUs for inference.
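As a rough sanity check on those numbers (an illustrative back-of-the-envelope sketch, not from the paper), the FP16 footprint of a model is roughly its parameter count times two bytes:

```python
# Back-of-the-envelope FP16 memory estimate for a 175B-parameter model.
params = 175e9              # number of weights
bytes_per_param = 2         # FP16 stores each weight in 2 bytes
weight_gib = params * bytes_per_param / 1024**3
print(f"~{weight_gib:.0f} GiB of weights")                   # ~326 GiB
print(f"~{weight_gib / 80:.1f} 80-GiB A100s for the weights alone")
# Activations, KV cache and framework overhead push this to ~5 GPUs in practice.
```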

Model compression is currently a widely used approach to reducing the computational cost of large models, but so far almost all existing GPT compression methods have focused on quantization, i.e. reducing the numerical precision with which individual weights are represented.
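For intuition only, here is a minimal sketch of what quantization means in this sense: map each weight to a low-bit integer plus a scale. This toy symmetric per-tensor scheme is purely illustrative, not the specific algorithm used by any of the tools discussed below.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 4):
    """Toy symmetric per-tensor quantization: w is approximated by scale * q."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q.to(torch.int8), scale

w = torch.randn(4, 8)
q, scale = quantize_symmetric(w, bits=4)
w_hat = q.float() * scale                         # dequantized approximation
print((w - w_hat).abs().max())                    # small reconstruction error
```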

Another approach to model compression is pruning, i.e. removing network elements, ranging from individual weights (unstructured pruning) to coarser-grained components such as entire rows or columns of weight matrices (structured pruning). This approach works well for vision models and smaller-scale language models, but it causes a loss of accuracy that requires extensive retraining to recover, which becomes far too expensive for GPT-scale models. While some single-shot pruning methods can compress models without retraining, they are too computationally expensive to apply to models with billions of parameters.
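For reference, unstructured magnitude pruning, the baseline the experiments later compare against, simply zeroes the smallest-magnitude weights. A minimal sketch:

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(w.numel() * sparsity)
    if k == 0:
        return w.clone()
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)

w = torch.randn(512, 512)
w_sparse = magnitude_prune(w, 0.5)
print((w_sparse == 0).float().mean())   # ~0.5 sparsity
```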

So for a model the size of GPT-3, is there a way to prune it accurately while keeping the loss of accuracy minimal and reducing the computational cost?

Recently, Elias Frantar and Dan Alistarh, two researchers from the Institute of Science and Technology Austria (ISTA), published a study that, for the first time, proposes an accurate single-shot pruning method, SparseGPT, for models with 10 to 100 billion parameters.

Paper address: https://ift.tt/S06hIzx

SparseGPT can prune the GPT family of models to 50% sparsity in a single pass, without any retraining. For the largest publicly available model, GPT-175B, this pruning takes only a few hours on a single GPU.

Moreover, SparseGPT is highly accurate and keeps the loss of accuracy to a minimum. For example, on OPT-175B and BLOOM-176B, currently the largest open-source models, it reaches 60% sparsity with minimal loss of accuracy.

1
SparseGPT algorithm

Research on very large models has been very active in recent years, but so far no model with more than 10 billion parameters has been sparsified to a high degree while retaining high accuracy.

Existing methods have excessively high computational costs. Take OBC, currently the most accurate post-training method, as an example: it takes more than an hour to compress a model with one billion parameters. AdaPrune, the fastest known post-training pruning method, still takes several minutes to prune a billion-parameter model; at this rate, a GPT-3-scale model is estimated to require hundreds of hours (weeks) of computation.

Most existing pruning methods, such as gradual magnitude pruning, require extensive retraining to restore accuracy after the pruning step, while GPT-scale models typically demand a large amount of computation and hyperparameter tuning to train, which makes retraining-based methods hard to apply. Applying such progressive pruning methods at GPT scale is therefore infeasible.

This work by the ISTA team proposes SparseGPT, a method that can run on a model with more than 100 billion parameters on a single GPU in a few hours, and is accurate enough to prune the model to 50%-60% sparsity without significantly degrading performance.

At the heart of SparseGPT is a new large-scale approximate sparse regression algorithm that generalizes to semi-structured (2:4 and 4:8) patterns and is compatible with existing weight quantization methods.


Legend: Visualization of the SparseGPT reconstruction algorithm. Given a fixed pruning mask M, weights in each column of the weight matrix W are pruned incrementally using a sequence of inverse Hessians (H_Uj), and the remaining weights in the rows located to the "right" of the column being processed are updated. Specifically, weights to the "right" of a pruned weight (dark blue) are updated to compensate for the pruning error, while unpruned weights generate no updates (light blue).
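To make the column-sweep idea concrete, here is a greatly simplified, illustrative sketch in the spirit of the figure above. It is not the paper's implementation: SparseGPT chooses masks adaptively in blocks and updates the inverse Hessian via a Cholesky factorization for efficiency, whereas this toy version fixes the mask up front and uses a single precomputed inverse.

```python
import torch

def sparsegpt_like_prune(W: torch.Tensor, X: torch.Tensor,
                         sparsity: float = 0.5, damp: float = 0.01):
    """Simplified OBS-style column-by-column pruning sketch.

    W: (rows, cols) weights of one linear layer.
    X: (cols, n_samples) calibration inputs to that layer.
    """
    W = W.clone()
    H = X @ X.T                                      # layer Hessian proxy X X^T
    H += damp * H.diag().mean() * torch.eye(H.shape[0])
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))

    rows, cols = W.shape
    # Toy mask choice: per row, prune the weights with the smallest w^2 / [H^-1]_jj.
    scores = W ** 2 / Hinv.diag().unsqueeze(0)
    k = int(cols * sparsity)
    mask = torch.zeros_like(W, dtype=torch.bool)     # True = prune
    mask.scatter_(1, scores.argsort(dim=1)[:, :k], True)

    for j in range(cols):                            # sweep columns left to right
        w_j = W[:, j].clone()
        err = (w_j * mask[:, j]) / Hinv[j, j]        # error from pruning column j
        # compensate by updating this column and everything to its "right"
        W[:, j:] -= err.unsqueeze(1) * Hinv[j, j:].unsqueeze(0)
        W[:, j][mask[:, j]] = 0.0                    # enforce exact zeros
    return W
```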


SparseGPT is a post-training method for GPT-scale models because it does not perform any fine-tuning.

There are many post-training quantization methods for GPT-scale models, such as ZeroQuant, LLM.int8(), and nuQmm, but quantizing activations can be difficult due to the presence of outlier features. GPTQ uses approximate second-order information to accurately quantize weights to 2-4 bits, scales to the largest models, and, combined with efficient GPU kernels, can bring 2-5x inference speedups.

However, since SparseGPT focuses on sparsification rather than quantization, it complements these quantization methods, and the two can be combined.

Besides unstructured pruning, SparseGPT also supports semi-structured patterns, such as the popular n:m sparsity format, whose 2:4 variant can be accelerated on NVIDIA Ampere GPUs.
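For illustration (not the paper's code), a 2:4 pattern keeps the two largest-magnitude weights in every contiguous group of four along a row:

```python
import torch

def prune_n_m(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in each group of m along the last dim."""
    rows, cols = w.shape
    assert cols % m == 0
    groups = w.reshape(rows, cols // m, m)
    # indices of the (m - n) smallest-magnitude weights in each group
    drop = groups.abs().topk(m - n, dim=-1, largest=False).indices
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, drop, False)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w24 = prune_n_m(w, 2, 4)             # 2:4 pattern gives exactly 50% sparsity
print((w24 == 0).float().mean())     # 0.5
```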

2
SparseGPT: High sparsification level, low precision loss

After evaluating the effect of SparseGPT compression, the researchers found that how easily a large language model can be sparsified scales with its size: compared with the existing magnitude pruning method, SparseGPT achieves a higher degree of sparsification while keeping the loss of accuracy to a minimum.

The researchers implemented SparseGPT in PyTorch and used HuggingFace's Transformers library to handle the models and datasets, all on a single NVIDIA A100 GPU with 80 GB of memory. Under these experimental conditions, SparseGPT can fully sparsify a 175-billion-parameter model in about 4 hours.

The researchers sparsify the Transformer layers sequentially, which significantly reduces memory requirements and also greatly improves accuracy compared with processing all layers in parallel. All compression experiments were performed in a single pass, without any fine-tuning.
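A rough sketch of what such a sequential sweep looks like (the function and argument names here are hypothetical placeholders; real HuggingFace decoder layers also take attention masks and return tuples, which is omitted for brevity):

```python
import torch

@torch.no_grad()
def compress_sequentially(blocks, calib_inputs, prune_block):
    """Schematic sequential, layer-by-layer compression loop.

    blocks:       list of Transformer blocks, handled one at a time
    calib_inputs: activations fed to the first block (calibration batch)
    prune_block:  hypothetical callback that sparsifies one block in place
    """
    x = calib_inputs
    for block in blocks:
        block.cuda()                 # only one block resident on the GPU at a time
        prune_block(block, x)        # sparsify using this block's inputs
        x = block(x)                 # propagate activations to the next block
        block.cpu()                  # free GPU memory before moving on
        torch.cuda.empty_cache()
    return blocks
```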

The evaluation mainly targets the OPT family, a set of models ranging from 125 million to 175 billion parameters, which makes it convenient to observe how pruning scales with model size. The 176-billion-parameter variant of BLOOM was also analyzed.

For datasets and evaluation metrics, the experiments use perplexity on the raw WikiText2 test set to evaluate the accuracy of the SparseGPT compression method; for better interpretability, some zero-shot accuracy metrics are also used. The evaluation focuses on the accuracy of the sparse model relative to the dense baseline rather than on absolute numbers.
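As a reference point, perplexity on a held-out corpus such as WikiText2 is just the exponential of the average token-level cross-entropy. A minimal sketch with HuggingFace Transformers, using a small OPT variant and an illustrative sequence length:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_name = "facebook/opt-125m"          # small OPT variant for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

seq_len, nlls, n_tokens = 2048, [], 0
with torch.no_grad():
    for i in range(0, ids.shape[1] - seq_len, seq_len):
        chunk = ids[:, i:i + seq_len]
        loss = model(chunk, labels=chunk).loss    # mean cross-entropy on the chunk
        nlls.append(loss * chunk.shape[1])
        n_tokens += chunk.shape[1]

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"perplexity: {ppl.item():.2f}")
```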

The researchers pruned all linear layers of the entire OPT model family (excluding the standard embeddings and the head) to 50% unstructured sparsity, full 4:8 semi-structured sparsity, or full 2:4 semi-structured sparsity; the results are shown in the figure below.


Legend: The perplexity of the OPT model family on the original WikiText2 test set

It can be seen that models compressed with magnitude pruning have poor accuracy at every size, and the larger the model, the more the accuracy drops.

The trend for models compressed with SparseGPT is different: at 2.7 billion parameters the perplexity loss is < 1 point, and at 66 billion parameters the loss is essentially zero. Moreover, at the very largest model sizes there is even an improvement in accuracy over the dense baseline.

3
Larger models are easier to sparsify

A general trend is that larger models are easier to sparsify: at a fixed sparsity level, the relative accuracy drop of the sparse model compared to the dense model shrinks as model size increases. The authors speculate that this may be due to the higher degree of over-parameterization and overall greater noise resilience of larger models.

Compared to the dense baseline at the largest scale, perplexity grows by only 0.11 and 0.39 when the model is compressed to 4:8 and 2:4 sparsity with SparseGPT, respectively. Such results mean a roughly 2x speedup can be realized in practice, since commercial NVIDIA Ampere GPUs already support 2:4 sparsity.

The authors also studied how the performance of the two hundred-billion-scale models, OPT-175B and BLOOM-176B, varies with the degree of sparsity introduced by SparseGPT; the results are shown in the figures below.


Legend: The left plot shows OPT-175B uniformly compressed to different sparsity levels using SparseGPT and magnitude pruning, respectively. The right plot shows the entire OPT model family compressed to different sparsity levels using SparseGPT.

It can be seen that for the OPT-175B model, magnitude pruning can reach at most 10% sparsity before suffering a large loss of accuracy, whereas SparseGPT can reach 60% sparsity with only a small increase in perplexity.


Legend: The left plot shows BLOOM-176B uniformly compressed to different sparsity levels using SparseGPT and magnitude pruning, respectively. The right plot compares joint compression with 50% sparsity + 4-bit quantization against 3-bit quantization on the OPT model family.

For the BLOOM-176B model, although magnitude pruning can reach 30% sparsity without significant accuracy loss, SparseGPT can reach 50% sparsity, a 1.66x improvement. Moreover, at 80% sparsity the perplexity of models compressed with SparseGPT remains at a reasonable level, whereas with magnitude pruning perplexity already exceeds 100 at 40% sparsity on OPT and 60% sparsity on BLOOM.

Additionally, SparseGPT was able to remove approximately 100 billion weights from these models with limited impact on model accuracy.

To conclude, this study shows for the first time that large Transformer-based pre-trained models can be compressed to high sparsity by one-shot weight pruning, without any retraining and with low loss of accuracy.

It is worth noting that SparseGPT's approach is local: after each pruning step it performs weight updates designed to preserve the input-output relationship of each layer, and these updates are computed without any global gradient information. Thus, the high degree of parameterization of GPT-scale models seems to allow this method to find accurate sparse models directly in the close neighborhood of the dense pretrained model.
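Concretely, the layer-wise view means that for each layer with weights $W_\ell$ and calibration inputs $X_\ell$, a mask $M_\ell$ and updated weights $\widehat{W}_\ell$ are sought that preserve the layer's input-output mapping:

$$
\min_{M_\ell,\;\widehat{W}_\ell}\;\bigl\lVert\, W_\ell X_\ell \;-\; \bigl(M_\ell \odot \widehat{W}_\ell\bigr) X_\ell \,\bigr\rVert_2^2
$$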

Also, since the accuracy metric used in the experiments (perplexity) is very sensitive, the outputs of the resulting sparse models appear to be closely correlated with those of the dense models.

This research is highly significant for alleviating the compute constraints of large models. One direction for future work is to study fine-tuning mechanisms for large models to further recover accuracy; another is to extend the applicability of SparseGPT to the training process, which would reduce the computational cost of training large models.


This article is reproduced from: https://www.leiphone.com/category/academic/P7uOUBEIXhzYBxSi.html