Original link: https://sanzo.top/Default/mlc/
Chinese Notes: https://mlc.ai/zh/
Chinese timetable: https://mlc.ai/summer22-zh/schedule
Course code: https://github.com/mlc-ai/notebooks
Homework: https://github.com/Sanzo00/mlc-summer22
Machine Learning Compilation Overview
Taking a machine learning model from development to deployment involves many variables: the hardware (ARM or x86), the operating system, the container execution environment, the runtime libraries, and the type of accelerator involved.
Machine learning compilation (MLC): take a machine learning model from the development stage, transform and optimize it, and turn it into a form suitable for the deployment stage.
Machine learning compilation is different from traditional compilation. First, machine learning compilation does not necessarily involve code generation. For example, the deployment form can be a set of predefined library functions, in which case ML compilation only needs to convert the model from its development form into calls to these functions. In addition, the challenges encountered and the solutions used are also different.
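As a minimal sketch of this "compilation without code generation" idea (the names lib_linear and lib_relu are hypothetical stand-ins for vendor library routines, not a real API), the deployed form could simply be a sequence of calls into a prebuilt kernel library:

```python
# Hypothetical sketch: the deployed model is just a sequence of calls into a
# prebuilt kernel library; lib_linear and lib_relu are made-up stand-ins for
# vendor-provided routines.
def deployed_model(x, w, b, lib_linear, lib_relu):
    h = lib_linear(x, w, b)   # predefined library call for the linear layer
    return lib_relu(h)        # predefined library call for the activation
```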
MLC goals
Targets for machine learning compilation:
- Integration and dependency minimization: integrating code and minimizing dependencies reduces the size of the application and allows it to be deployed in more environments.
- Leveraging hardware acceleration: each deployment environment has its own set of native acceleration technologies, many of which are developed specifically for machine learning. One goal of machine learning compilation is to exploit the characteristics of the hardware itself, either by generating deployment code that calls native acceleration libraries or by emitting code that uses native instructions.
- General optimization: Transform model execution in a way that minimizes memory usage and increases efficiency.
Key elements
Key elements of machine learning compilation: tensors and tensor functions.
The process of machine learning compilation converts the content on the left side of the figure below into the content on the right side. Depending on the scenario, this can be done manually, by automatic conversion tools, or by a combination of both.
Abstraction and implementation
In practice, MLC is the process of transforming and assembling tensor functions across the same or different abstractions, and of studying the different abstraction levels of tensor functions and how they work together.
Tensor Program Abstraction
Primitive tensor functions
Primitive tensor functions represent single units of computation in a machine learning model, such as relu, linear, and softmax.
The tensor program abstraction consists of:
- Multidimensional buffers that store the data
- Loop nests that drive the tensor computation
- The computation statements themselves
Additional structures in tensor programs can provide more information for program transformations.
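To make these elements concrete, here is a minimal low-level NumPy sketch of a relu primitive tensor function; the function name lnumpy_relu and the 128x128 shape are illustrative, not taken from the course:

```python
import numpy as np

# Illustrative sketch of a primitive tensor function (relu), showing the
# three elements of the tensor program abstraction.
def lnumpy_relu(X: np.ndarray, Y: np.ndarray):
    # multidimensional arrays that store the data
    assert X.shape == Y.shape == (128, 128)
    # loop nest that drives the tensor computation
    for i in range(128):
        for j in range(128):
            # the computation itself
            Y[i, j] = max(X[i, j], 0.0)
```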
TensorIR
```python
import numpy as np
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T
```
TensorIR is the tensor program abstraction used in Apache TVM, an open source machine learning compilation framework.
The following is an example of the mm_relu function. This first version is implemented in low-level NumPy:
```python
def lnumpy_mm_relu(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    Y = np.empty((128, 128), dtype="float32")
    for i in range(128):
        for j in range(128):
            for k in range(128):
                if k == 0:
                    Y[i, j] = 0
                Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
    for i in range(128):
        for j in range(128):
            C[i, j] = max(Y[i, j], 0)
```
The following code implements the same function in TVMScript, a domain-specific dialect embedded in the Python AST:
```python
@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]):
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        Y = T.alloc_buffer((128, 128), dtype="float32")
        for i, j, k in T.grid(128, 128, 128):
            with T.block("Y"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                vk = T.axis.reduce(128, k)
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
```
Here T.grid is syntactic sugar in TensorIR for writing multiple nested loops. T.block is the basic unit of computation in TensorIR; a block contains a set of block axes (vi, vj, vk) and the computation defined around them.
```python
vi = T.axis.spatial(128, i)
vj = T.axis.spatial(128, j)
vk = T.axis.reduce(128, k)

# General form:
[block_axis] = T.axis.[axis_type]([axis_range], [mapped_value])
```
A block axis contains the following information:
- They define where vi, vj, vk should be bound (i, j, and k in this case);
- They declare the original range of vi, vj, vk (the 128 in T.axis.spatial(128, i));
- They declare the properties of the block axes (spatial, reduce).
Syntactic sugar for block axes: T.axis.remap
```python
# SSR means the properties of each axis are "spatial", "spatial", "reduce"
vi, vj, vk = T.axis.remap("SSR", [i, j, k])

# Equivalent to:
vi = T.axis.spatial(range_of_i, i)
vj = T.axis.spatial(range_of_j, j)
vk = T.axis.reduce(range_of_k, k)
```
Transformation
TensorIR introduces an auxiliary structure called Schedule to help us do program transformations.
Original MyModule:
```python
@tvm.script.ir_module
class Module:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with T.block("root")
        Y = T.alloc_buffer([128, 128], dtype="float32")
        for i, j, k in T.grid(128, 128, 128):
            with T.block("Y"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                T.reads(A[vi, vk], B[vk, vj])
                T.writes(Y[vi, vj])
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])
                T.reads(Y[vi, vj])
                T.writes(C[vi, vj])
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
```
Create a Schedule helper class that takes the given MyModule as input, and get a reference to block Y and its corresponding loops:
```python
sch = tvm.tir.Schedule(MyModule)
block_Y = sch.get_block("Y", func_name="mm_relu")
i, j, k = sch.get_loops(block_Y)
```
Split loop j into two loops, where the inner loop has a length of 4:
```python
j0, j1 = sch.split(j, factors=[None, 4])
```
```python
@tvm.script.ir_module
class Module:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with T.block("root")
        Y = T.alloc_buffer([128, 128], dtype="float32")
        for i, j_0, j_1, k in T.grid(128, 32, 4, 128):
            with T.block("Y"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j_0 * 4 + j_1)
                vk = T.axis.reduce(128, k)
                T.reads(A[vi, vk], B[vk, vj])
                T.writes(Y[vi, vj])
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])
                T.reads(Y[vi, vj])
                T.writes(C[vi, vj])
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
```
Reorder the loops so that k comes before the inner loop j1:
```python
sch.reorder(j0, k, j1)
```
```python
@tvm.script.ir_module
class Module:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with T.block("root")
        Y = T.alloc_buffer([128, 128], dtype="float32")
        for i, j_0, k, j_1 in T.grid(128, 32, 128, 4):
            with T.block("Y"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j_0 * 4 + j_1)
                vk = T.axis.reduce(128, k)
                T.reads(A[vi, vk], B[vk, vj])
                T.writes(Y[vi, vj])
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])
                T.reads(Y[vi, vj])
                T.writes(C[vi, vj])
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
```
Move block C under loop j0 of block Y's loop nest:
```python
block_C = sch.get_block("C", "mm_relu")
sch.reverse_compute_at(block_C, j0)
```
```python
@tvm.script.ir_module
class Module:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with T.block("root")
        Y = T.alloc_buffer([128, 128], dtype="float32")
        for i, j_0 in T.grid(128, 32):
            for k, j_1 in T.grid(128, 4):
                with T.block("Y"):
                    vi = T.axis.spatial(128, i)
                    vj = T.axis.spatial(128, j_0 * 4 + j_1)
                    vk = T.axis.reduce(128, k)
                    T.reads(A[vi, vk], B[vk, vj])
                    T.writes(Y[vi, vj])
                    with T.init():
                        Y[vi, vj] = T.float32(0)
                    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
            for ax0 in T.serial(4):
                with T.block("C"):
                    vi = T.axis.spatial(128, i)
                    vj = T.axis.spatial(128, j_0 * 4 + ax0)
                    T.reads(Y[vi, vj])
                    T.writes(C[vi, vj])
                    C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
```
Separate the initialization of the Y elements from the reduction update
```python
sch.decompose_reduction(block_Y, k)
```
```python
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def mm_relu(A: tir.Buffer[(128, 128), "float32"],
                B: tir.Buffer[(128, 128), "float32"],
                C: tir.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with tir.block("root")
        Y = tir.alloc_buffer([128, 128], dtype="float32")
        for i, j_0 in tir.grid(128, 32):
            for j_1_init in tir.serial(4):
                with tir.block("Y_init"):
                    vi = tir.axis.spatial(128, i)
                    vj = tir.axis.spatial(128, j_0 * 4 + j_1_init)
                    tir.reads()
                    tir.writes(Y[vi, vj])
                    Y[vi, vj] = tir.float32(0)
            for k, j_1 in tir.grid(128, 4):
                with tir.block("Y_update"):
                    vi = tir.axis.spatial(128, i)
                    vj = tir.axis.spatial(128, j_0 * 4 + j_1)
                    vk = tir.axis.reduce(128, k)
                    tir.reads(Y[vi, vj], A[vi, vk], B[vk, vj])
                    tir.writes(Y[vi, vj])
                    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
            for ax0 in tir.serial(4):
                with tir.block("C"):
                    vi = tir.axis.spatial(128, i)
                    vj = tir.axis.spatial(128, j_0 * 4 + ax0)
                    tir.reads(Y[vi, vj])
                    tir.writes(C[vi, vj])
                    C[vi, vj] = tir.max(Y[vi, vj], tir.float32(0))
```
Build and run the program contained in the IRModule:
```python
# Prepare random inputs and a NumPy reference result
# (assumed to have been created earlier in the notes)
a_np = np.random.rand(128, 128).astype("float32")
b_np = np.random.rand(128, 128).astype("float32")
c_mm_relu = np.maximum(a_np @ b_np, 0)

# Call the build function to transform the IRModule into a runtime.Module
rt_lib = tvm.build(MyModule, target="llvm")
# Create three TVM NDArrays to hold the inputs and the output
a_nd = tvm.nd.array(a_np)
b_nd = tvm.nd.array(b_np)
c_nd = tvm.nd.empty((128, 128), dtype="float32")
# Get the runnable function from rt_lib and execute it with the three arrays
func_mm_relu = rt_lib["mm_relu"]
func_mm_relu(a_nd, b_nd, c_nd)
np.testing.assert_allclose(c_mm_relu, c_nd.numpy(), rtol=1e-5)
```
Performance comparison:
```python
f_timer_before = rt_lib.time_evaluator("mm_relu", tvm.cpu())
print("Time cost of MyModule %g sec" % f_timer_before(a_nd, b_nd, c_nd).mean)

# Build the transformed module (assumed to be built from sch.mod, the result
# of the transformations above)
rt_lib_after = tvm.build(sch.mod, target="llvm")
f_timer_after = rt_lib_after.time_evaluator("mm_relu", tvm.cpu())
print("Time cost of transformed sch.mod %g sec" % f_timer_after(a_nd, b_nd, c_nd).mean)
```
```
Time cost of MyModule 0.00330733 sec
Time cost of transformed sch.mod 0.00113919 sec
```
The performance gap is largely explained by the CPU's cache behavior. The CPU has a multi-level cache, and data must be fetched into the cache before it can be accessed; accessing data that is already in the cache is much faster. Because the CPU loads data in cache lines, adjacent data is brought into the cache together.
In the transformed code, the iteration over j1 produces contiguous accesses to the elements of B, and it also moves the computation of C closer to that of Y, which results in better cache behavior.
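A low-level NumPy sketch of the transformed loop nest (a reconstruction mirroring the split/reorder/reverse_compute_at steps above, not the tool's output) makes this access pattern concrete:

```python
import numpy as np

def lnumpy_mm_relu_v2(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    Y = np.empty((128, 128), dtype="float32")
    for i in range(128):
        for j0 in range(32):            # outer half of the split j loop
            for k in range(128):
                for j1 in range(4):     # inner loop touches 4 adjacent columns of B and Y
                    j = j0 * 4 + j1
                    if k == 0:
                        Y[i, j] = 0
                    Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
            for ax0 in range(4):        # C is computed right after this 4-element tile of Y
                j = j0 * 4 + ax0
                C[i, j] = max(Y[i, j], 0)
```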
How to create TensorIR
We can create TensorIR via TVMScript and via tensor expressions.
Tensor Expression (TE) is a domain-specific language that describes a sequence of computations through an expression-like API.
```python
from tvm import te

A = te.placeholder((128, 128), "float32", name="A")
B = te.placeholder((128, 128), "float32", name="B")
k = te.reduce_axis((0, 128), "k")
Y = te.compute((128, 128), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="Y")
C = te.compute((128, 128), lambda i, j: te.max(Y[i, j], 0), name="C")
```
te.compute takes the signature te.compute(output_shape, fcompute). The fcompute function describes how we want to compute the value of element Y[i, j] for a given index (i, j):
```python
lambda i, j: te.sum(A[i, k] * B[k, j], axis=k)
```
Create a function with two input parameters (A, B) and one output parameter (C)
```python
te_func = te.create_prim_func([A, B, C]).with_attr({"global_symbol": "mm_relu"})
MyModuleFromTE = tvm.IRModule({"mm_relu": te_func})
```
```python
@tvm.script.ir_module
class Module:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with T.block("root")
        Y = T.alloc_buffer([128, 128], dtype="float32")
        for i0, i1, i2 in T.grid(128, 128, 128):
            with T.block("Y"):
                i, j, k = T.axis.remap("SSR", [i0, i1, i2])
                T.reads(A[i, k], B[k, j])
                T.writes(Y[i, j])
                with T.init():
                    Y[i, j] = T.float32(0)
                Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
        for i0, i1 in T.grid(128, 128):
            with T.block("C"):
                i, j = T.axis.remap("SS", [i0, i1])
                T.reads(Y[i, j])
                T.writes(C[i, j])
                C[i, j] = T.max(Y[i, j], T.float32(0))
```
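The module created from TE can be transformed with the same schedule primitives used earlier. A minimal sketch, assuming MyModuleFromTE as built above:

```python
# Minimal sketch: apply the earlier transformations to the TE-generated module.
sch_te = tvm.tir.Schedule(MyModuleFromTE)
block_Y = sch_te.get_block("Y", func_name="mm_relu")
i, j, k = sch_te.get_loops(block_Y)
j0, j1 = sch_te.split(j, factors=[None, 4])
sch_te.reorder(j0, k, j1)
block_C = sch_te.get_block("C", func_name="mm_relu")
sch_te.reverse_compute_at(block_C, j0)
rt_lib_te = tvm.build(sch_te.mod, target="llvm")
```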
MLC process: develop, transform, build.
Remaining course topics:
- End-to-end model integration
- Automated program optimization
- Integration with machine learning frameworks
- GPU hardware acceleration
- Computational graph optimization: operator fusion and memory optimization
- Deploying models to a service environment
- Deploying models to edge devices