How to tune performance for deep learning training

Original link: http://gaocegege.com/Blog/kubernetes/metrics-survey

TL;DR: To better support the collection of training metrics and the performance profiling of tasks, we designed a small survey on observability in machine learning development. We hope envd will support a profiler that better fits the needs of algorithm engineers. Welcome to participate!

Deep learning training is a workload that may be compute-intensive, data-intensive, or memory-intensive.

This makes profiling it very complicated. When the model uses more GPU memory or RAM than expected, or when training fails to make full use of the GPU's compute power, it is hard to understand what is going on. Of course, there are open-source products and tools that address these problems.

TensorBoard

The most popular of these is undoubtedly TensorBoard. It helps users collect metrics during training and visualize them, and it is very simple to use:

```python
from datetime import datetime

import tensorflow as tf

log_dir = "logs/profile/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir, histogram_freq=1, profile_batch=3
)
model.fit(
    train_data,
    steps_per_epoch=20,
    epochs=5,
    callbacks=[tensorboard_callback],
)
```

Not only TensorFlow and Keras: PyTorch has also added support for TensorBoard.

Nvidia SMI

If you just want some hardware metrics, there is an easier way: nvidia-smi. With the nvidia-smi command you can view the GPU memory usage of different processes. Usually you want the training process to occupy the vast majority of the available GPU memory, which suggests your model is making good use of the GPU.

Figure: Nvidia SMI (smi.png)

You can also pay attention to the power consumption metric: power draw rises when the GPU's compute units are busy, so high power consumption indicates heavy use of the compute units. When you find that GPU memory usage is very low, or power consumption is very low, you can try training with a larger batch size. If you want more metrics, you can obtain them with the nvidia-smi dmon command.
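Besides `nvidia-smi dmon`, nvidia-smi can emit selected metrics in CSV form via `--query-gpu`, which is convenient to parse in a monitoring script. The sketch below parses a captured sample line rather than invoking nvidia-smi (a GPU may not be present); the query fields are real nvidia-smi fields, but the sample values are made up.

```python
import csv
from io import StringIO

# Sample output of:
#   nvidia-smi --query-gpu=memory.used,memory.total,power.draw --format=csv,noheader,nounits
# The values are made up for illustration; on a real machine you would
# capture this line with subprocess.run([...], capture_output=True).
sample = "21012, 24576, 287.45\n"

used_mib, total_mib, power_w = next(csv.reader(StringIO(sample)))
mem_util = int(used_mib) / int(total_mib)

print(f"GPU memory: {int(used_mib)} / {int(total_mib)} MiB ({mem_util:.0%})")
print(f"Power draw: {float(power_w)} W")
```

A low `mem_util` together with a low power draw is the signal discussed above that a larger batch size may help.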

nvidia-smi is also a very commonly used performance debugging tool; its CLI interface is very simple to use.

Nvidia Nsight Systems

Nvidia Nsight Systems is a powerful tool for debugging and optimizing GPU programs. For deep learning training tasks, Nsight Systems can visually analyze memory usage, CUDA kernel execution, and more. Although it is very powerful, it also has a fairly steep learning curve. Engineers who are not familiar with the Nsight family of products, or who are not pursuing extreme optimization, rarely use it for performance tuning.
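As a rough sketch of the typical Nsight Systems workflow from the command line (treat the details as assumptions to verify with `nsys --help`: `train.py` is a hypothetical script name, and the report extension is `.nsys-rep` on recent versions, `.qdrep` on older ones):

```shell
# Record a timeline of a training script (requires Nsight Systems installed)
nsys profile -o train_report python train.py

# Print summary statistics from the recorded report
nsys stats train_report.nsys-rep
```

The recorded report can also be opened in the Nsight Systems GUI for the visual timeline analysis described above.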

Figure: Nvidia Nsight Systems (nsight.png)

Nvidia DLProf

DLProf is a wrapper around Nvidia Nsight. You can run dlprof python main.py to collect metrics during training. It generates sqlite and qdrep files, along with an events_folder. TensorBoard can then visualize the results based on events_folder.
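The workflow described above might look like the following (a sketch: `main.py` stands in for your training script, the output directory name follows the text above, and your DLProf version's flags should be checked with `dlprof --help`):

```shell
# Collect metrics while running the training script under DLProf
dlprof python main.py

# Visualize the generated event files with TensorBoard
tensorboard --logdir ./events_folder
```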

Figure: DLProf (dlprof.png)

How to choose

While there are a variety of tools to choose from, in our opinion they are all difficult to use. In our previous work, GPU-related performance issues were often a headache.

To better support training metrics collection and task performance profiling, we designed a small survey on observability in machine learning development. We hope envd will support a profiler that better fits the needs of algorithm engineers. Welcome to participate!



License

  • This article is licensed under CC BY-NC-SA 3.0.
  • Please contact me for commercial use.

