Original link: http://gaocegege.com/Blog/kubernetes/metrics-survey
TL;
DR In order to better support the collection of training metrics and the performance profiling of tasks, we designed a small-scale survey of machine learning development observability.
envd
Support profiler that is more in line with the needs of algorithm engineers
Function. Welcome to participate!
Deep learning training is rare and may be computationally intensive (Compute
intensive), it may be data intensive, or it may be memory intensive.
intensive) workloads.
This makes it profiling
The process is very complicated. When the model uses more video memory or memory than expected, or when the model training fails to use the computing power of the graphics card, it is difficult to understand what is going on. Of course, in order to solve these problems, there are also some open source products or tools.
TensorBoard
The most popular of these is undoubtedly
TensorBoard . It helps users collect metrics during training and visualize them. It is also very simple to use:
log_dir = "logs/profile/" + datetime . now (). strftime ( "%Y%m%d-%H%M%S" ) tensorboard_callback = tf . keras . callbacks . TensorBoard ( log_dir = log_dir , histogram_freq = 1 , profile_batch = 3 ) model . fit ( train_data , steps_per_epoch = 20 , epochs = 5 , callbacks = [ tensorboard_callback ])
not only
TensorFlow and Keras, PyTorch also began to support TensorBoard .
Nvidia
SMI
If you just want to get some hardware metrics, there is an easier way: nvidia-smi
. You can view the GPU of different processes with the nvidia-smi
command
Video memory usage. Usually you will want the training process to take up the vast majority of the available video memory, which means your model is using it well
GPU.
Nvidia
SMI
In addition, you can pay attention to the power consumption (Power
consumption) indicator, when the GPU
The power consumption will increase when the computing unit of the device performs operations. A high power consumption means a higher frequency of use of the computing unit. When you find that the use of video memory is very small, or the power consumption is very low, you can try to use a larger
batch size for training. If you want to get more indicators, you can pass nvidia-smi
command to obtain.
dmon
nvidia-smi
is also a very commonly used performance debugging tool, it provides CLI
The interface is very simple to use.
Nvidia
Nsight Systems
Nvidia Nsight
Systems are powerful tools for debugging and optimizing GPU programs. For deep learning training tasks, Nsight Systems
Ability to visually analyze memory usage, CUDA Kernel execution, and more. Although it is very powerful, it also has a relatively high learning threshold. If not very understanding
Nsight
Engineers of software series products, or when they are not pursuing extreme optimization, rarely use it for performance optimization.
Nvidia Nsight
Systems
Nvidia DLProf
DLProf
Is a wrapper for Nvidia Nsight. You can pass dlprof python main.py
to collect metrics during training. It generates two files: sqlite
and qdrep
, and events_folder
. can be used next
TensorBoard comes based on events_folder
Make a visual presentation.
DLProf
how to choose
While there are a variety of tools to choose from, in our opinion, these are all difficult to use. In our previous work, we also often encountered
It’s a headache when it comes to GPU-related performance issues.
To be able to better support training metrics collection and task performance
profiling, we designed a small-scale survey machine learning to develop observability small surveys , we hope in
envd
It supports the profiler function that is more in line with the needs of algorithm engineers. Welcome to participate!
references
License
- This article
is licensed under CC
BY-NC-SA 3.0 . - Please contact me for
commercial use.
This article is reproduced from: http://gaocegege.com/Blog/kubernetes/metrics-survey
This site is for inclusion only, and the copyright belongs to the original author.