Original link: https://sanzo.top/Default/mlc/
Chinese Notes: https://mlc.ai/zh/
Chinese timetable: https://mlc.ai/summer22-zh/schedule
Course code: https://github.com/mlc-ai/notebooks
Homework: https://github.com/Sanzo00/mlc-summer22
Machine Learning Compilation Overview
Taking a machine learning model from development to deployment involves many variables: the hardware (ARM or x86), the operating system, the container execution environment, the runtime libraries, and the type of accelerator involved.
Machine learning compilation (MLC): take a machine learning model from the development stage, transform and optimize it, and turn it into a form suitable for the deployment stage.
Machine learning compilation is different from traditional compilation. First, machine learning compilation does not necessarily involve code generation. For example, the deployment form can be a set of predefined library functions, in which case ML compilation only needs to convert the model from its development form into calls to these functions. In addition, the challenges encountered and the solutions used are also different.
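As a minimal sketch of this "compilation without code generation" idea (the names lib_linear and lib_relu are hypothetical stand-ins for vendor library routines, not a real API), the deployed form could simply be a sequence of calls into a prebuilt kernel library:

```python
# Hypothetical sketch: the deployed model is just a sequence of calls into a
# prebuilt kernel library; lib_linear and lib_relu are made-up stand-ins for
# vendor-provided routines.
def deployed_model(x, w, b, lib_linear, lib_relu):
    h = lib_linear(x, w, b)   # predefined library call for the linear layer
    return lib_relu(h)        # predefined library call for the activation
```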
MLC goals
Targets for machine learning compilation:
- Integration and dependency minimization: integrating code and minimizing dependencies reduces the size of the application and allows it to be deployed in more environments.
- Leveraging hardware acceleration: each deployment environment has its own set of native acceleration technologies, many of which are developed specifically for machine learning. One goal of machine learning compilation is to exploit the characteristics of the hardware itself, either by generating deployment code that calls native acceleration libraries or by emitting code that uses native instructions.
- General optimization: Transform model execution in a way that minimizes memory usage and increases efficiency.
Key elements
Key elements of machine learning compilation: tensors and tensor functions.
The process of machine learning compilation converts the content on the left side of the figure below into the content on the right side. Depending on the scenario, this can be done manually, by automatic conversion tools, or by a combination of both.
Abstraction and implementation
In practice, MLC is the process of transforming and assembling tensor functions across the same or different abstractions, and of studying the different abstraction levels of tensor functions and how they work together.
Tensor Program Abstraction
Primitive tensor functions
Primitive tensor functions represent single units of computation in a machine learning model, such as relu, linear, and softmax.
The tensor program abstraction consists of:
- Multidimensional buffers that store the data
- Loop nests that drive the tensor computation
- The computation statements themselves
Additional structures in tensor programs can provide more information for program transformations.
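To make these elements concrete, here is a minimal low-level NumPy sketch of a relu primitive tensor function; the function name lnumpy_relu and the 128x128 shape are illustrative, not taken from the course:

```python
import numpy as np

# Illustrative sketch of a primitive tensor function (relu), showing the
# three elements of the tensor program abstraction.
def lnumpy_relu(X: np.ndarray, Y: np.ndarray):
    # multidimensional arrays that store the data
    assert X.shape == Y.shape == (128, 128)
    # loop nest that drives the tensor computation
    for i in range(128):
        for j in range(128):
            # the computation itself
            Y[i, j] = max(X[i, j], 0.0)
```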
TensorIR
```python
import numpy as np
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T
```
TensorIR is the tensor program abstraction used in Apache TVM, an open source machine learning compilation framework.
The following is an example of the mm_relu function. This first version is implemented in low-level NumPy:
```python
def lnumpy_mm_relu(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    Y = np.empty((128, 128), dtype="float32")
    for i in range(128):
        for j in range(128):
            for k in range(128):
                if k == 0:
                    Y[i, j] = 0
                Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
    for i in range(128):
        for j in range(128):
            C[i, j] = max(Y[i, j], 0)
```
The following code implements the same function in TVMScript, a domain-specific dialect embedded in the Python AST:
```python
@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]):
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        Y = T.alloc_buffer((128, 128), dtype="float32")
        for i, j, k in T.grid(128, 128, 128):
            with T.block("Y"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                vk = T.axis.reduce(128, k)
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j)
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
```
Here T.grid is syntactic sugar in TensorIR for writing multiple nested loops. T.block is the basic unit of computation in TensorIR; a block contains a set of block axes (vi, vj, vk) and the computation defined around them.
```python
vi = T.axis.spatial(128, i)
vj = T.axis.spatial(128, j)
vk = T.axis.reduce(128, k)

# General form:
[block_axis] = T.axis.[axis_type]([axis_range], [mapped_value])
```
A block axis contains the following information:
- They define where vi, vj, vk should be bound (i, j, and k in this case);
- They declare the original range of vi, vj, vk (the 128 in T.axis.spatial(128, i));
- They declare the properties of the block axes (spatial, reduce).
Syntactic sugar for block axes: T.axis.remap
```python
# SSR means the properties of each axis are "spatial", "spatial", "reduce"
vi, vj, vk = T.axis.remap("SSR", [i, j, k])

# Equivalent to:
vi = T.axis.spatial(range_of_i, i)
vj = T.axis.spatial(range_of_j, j)
vk = T.axis.reduce(range_of_k, k)
```
Transformation
TensorIR introduces an auxiliary structure called Schedule to help us do program transformations.
Original MyModule:
```python
@tvm.script.ir_module
class Module:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with T.block("root")
        Y = T.alloc_buffer([128, 128], dtype="float32")
        for i, j, k in T.grid(128, 128, 128):
            with T.block("Y"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                T.reads(A[vi, vk], B[vk, vj])
                T.writes(Y[vi, vj])
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])
                T.reads(Y[vi, vj])
                T.writes(C[vi, vj])
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
```
Create a Schedule helper class that takes the given MyModule as input, and get a reference to block Y and its corresponding loops:
```python
sch = tvm.tir.Schedule(MyModule)
block_Y = sch.get_block("Y", func_name="mm_relu")
i, j, k = sch.get_loops(block_Y)
```
Split loop j into two loops, where the inner loop has a length of 4:
```python
j0, j1 = sch.split(j, factors=[None, 4])
```
```python
@tvm.script.ir_module
class Module:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with T.block("root")
        Y = T.alloc_buffer([128, 128], dtype="float32")
        for i, j_0, j_1, k in T.grid(128, 32, 4, 128):
            with T.block("Y"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j_0 * 4 + j_1)
                vk = T.axis.reduce(128, k)
                T.reads(A[vi, vk], B[vk, vj])
                T.writes(Y[vi, vj])
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])
                T.reads(Y[vi, vj])
                T.writes(C[vi, vj])
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
```
Reorder the loops so that k comes before the inner loop j1:
```python
sch.reorder(j0, k, j1)
```
```python
@tvm.script.ir_module
class Module:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with T.block("root")
        Y = T.alloc_buffer([128, 128], dtype="float32")
        for i, j_0, k, j_1 in T.grid(128, 32, 128, 4):
            with T.block("Y"):
                vi = T.axis.spatial(128, i)
                vj = T.axis.spatial(128, j_0 * 4 + j_1)
                vk = T.axis.reduce(128, k)
                T.reads(A[vi, vk], B[vk, vj])
                T.writes(Y[vi, vj])
                with T.init():
                    Y[vi, vj] = T.float32(0)
                Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
        for i, j in T.grid(128, 128):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])
                T.reads(Y[vi, vj])
                T.writes(C[vi, vj])
                C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
```
Move block C under loop j0 of block Y's loop nest:
```python
block_C = sch.get_block("C", "mm_relu")
sch.reverse_compute_at(block_C, j0)
```
```python
@tvm.script.ir_module
class Module:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with T.block("root")
        Y = T.alloc_buffer([128, 128], dtype="float32")
        for i, j_0 in T.grid(128, 32):
            for k, j_1 in T.grid(128, 4):
                with T.block("Y"):
                    vi = T.axis.spatial(128, i)
                    vj = T.axis.spatial(128, j_0 * 4 + j_1)
                    vk = T.axis.reduce(128, k)
                    T.reads(A[vi, vk], B[vk, vj])
                    T.writes(Y[vi, vj])
                    with T.init():
                        Y[vi, vj] = T.float32(0)
                    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
            for ax0 in T.serial(4):
                with T.block("C"):
                    vi = T.axis.spatial(128, i)
                    vj = T.axis.spatial(128, j_0 * 4 + ax0)
                    T.reads(Y[vi, vj])
                    T.writes(C[vi, vj])
                    C[vi, vj] = T.max(Y[vi, vj], T.float32(0))
```
Separate the initialization of the Y elements from the reduction update
```python
sch.decompose_reduction(block_Y, k)
```
```python
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def mm_relu(A: tir.Buffer[(128, 128), "float32"],
                B: tir.Buffer[(128, 128), "float32"],
                C: tir.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with tir.block("root")
        Y = tir.alloc_buffer([128, 128], dtype="float32")
        for i, j_0 in tir.grid(128, 32):
            for j_1_init in tir.serial(4):
                with tir.block("Y_init"):
                    vi = tir.axis.spatial(128, i)
                    vj = tir.axis.spatial(128, j_0 * 4 + j_1_init)
                    tir.reads()
                    tir.writes(Y[vi, vj])
                    Y[vi, vj] = tir.float32(0)
            for k, j_1 in tir.grid(128, 4):
                with tir.block("Y_update"):
                    vi = tir.axis.spatial(128, i)
                    vj = tir.axis.spatial(128, j_0 * 4 + j_1)
                    vk = tir.axis.reduce(128, k)
                    tir.reads(Y[vi, vj], A[vi, vk], B[vk, vj])
                    tir.writes(Y[vi, vj])
                    Y[vi, vj] = Y[vi, vj] + A[vi, vk] * B[vk, vj]
            for ax0 in tir.serial(4):
                with tir.block("C"):
                    vi = tir.axis.spatial(128, i)
                    vj = tir.axis.spatial(128, j_0 * 4 + ax0)
                    tir.reads(Y[vi, vj])
                    tir.writes(C[vi, vj])
                    C[vi, vj] = tir.max(Y[vi, vj], tir.float32(0))
```
Build and run the program contained in the IRModule:
```python
# Prepare random inputs and a NumPy reference result
# (assumed to have been created earlier in the notes)
a_np = np.random.rand(128, 128).astype("float32")
b_np = np.random.rand(128, 128).astype("float32")
c_mm_relu = np.maximum(a_np @ b_np, 0)

# Call the build function to transform the IRModule into a runtime.Module
rt_lib = tvm.build(MyModule, target="llvm")
# Create three TVM NDArrays to hold the inputs and the output
a_nd = tvm.nd.array(a_np)
b_nd = tvm.nd.array(b_np)
c_nd = tvm.nd.empty((128, 128), dtype="float32")
# Get the runnable function from rt_lib and execute it with the three arrays
func_mm_relu = rt_lib["mm_relu"]
func_mm_relu(a_nd, b_nd, c_nd)
np.testing.assert_allclose(c_mm_relu, c_nd.numpy(), rtol=1e-5)
```
Performance comparison:
```python
f_timer_before = rt_lib.time_evaluator("mm_relu", tvm.cpu())
print("Time cost of MyModule %g sec" % f_timer_before(a_nd, b_nd, c_nd).mean)

# Build the transformed module (assumed to be built from sch.mod, the result
# of the transformations above)
rt_lib_after = tvm.build(sch.mod, target="llvm")
f_timer_after = rt_lib_after.time_evaluator("mm_relu", tvm.cpu())
print("Time cost of transformed sch.mod %g sec" % f_timer_after(a_nd, b_nd, c_nd).mean)
```
```
Time cost of MyModule 0.00330733 sec
Time cost of transformed sch.mod 0.00113919 sec
```
The performance gap is largely explained by the CPU's cache behavior. The CPU has a multi-level cache, and data must be fetched into the cache before it can be accessed; accessing data that is already in the cache is much faster. Because the CPU loads data in cache lines, adjacent data is brought into the cache together.
In the transformed code, the iteration over j1 produces contiguous accesses to the elements of B, and it also moves the computation of C closer to that of Y, which results in better cache behavior.
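A low-level NumPy sketch of the transformed loop nest (a reconstruction mirroring the split/reorder/reverse_compute_at steps above, not the tool's output) makes this access pattern concrete:

```python
import numpy as np

def lnumpy_mm_relu_v2(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    Y = np.empty((128, 128), dtype="float32")
    for i in range(128):
        for j0 in range(32):            # outer half of the split j loop
            for k in range(128):
                for j1 in range(4):     # inner loop touches 4 adjacent columns of B and Y
                    j = j0 * 4 + j1
                    if k == 0:
                        Y[i, j] = 0
                    Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
            for ax0 in range(4):        # C is computed right after this 4-element tile of Y
                j = j0 * 4 + ax0
                C[i, j] = max(Y[i, j], 0)
```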
How to create TensorIR
We can create TensorIR via TVMScript and via tensor expressions.
Tensor Expression (TE) is a domain-specific language that describes a sequence of computations through an expression-like API.
```python
from tvm import te

A = te.placeholder((128, 128), "float32", name="A")
B = te.placeholder((128, 128), "float32", name="B")
k = te.reduce_axis((0, 128), "k")
Y = te.compute((128, 128), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="Y")
C = te.compute((128, 128), lambda i, j: te.max(Y[i, j], 0), name="C")
```
te.compute takes the signature te.compute(output_shape, fcompute). The fcompute function describes how we want to compute the value of element Y[i, j] for a given index (i, j):
```python
lambda i, j: te.sum(A[i, k] * B[k, j], axis=k)
```
Create a function with two input parameters (A, B) and one output parameter (C)
```python
te_func = te.create_prim_func([A, B, C]).with_attr({"global_symbol": "mm_relu"})
MyModuleFromTE = tvm.IRModule({"mm_relu": te_func})
```
```python
@tvm.script.ir_module
class Module:
    @T.prim_func
    def mm_relu(A: T.Buffer[(128, 128), "float32"],
                B: T.Buffer[(128, 128), "float32"],
                C: T.Buffer[(128, 128), "float32"]) -> None:
        # function attr dict
        T.func_attr({"global_symbol": "mm_relu", "tir.noalias": True})
        # body
        # with T.block("root")
        Y = T.alloc_buffer([128, 128], dtype="float32")
        for i0, i1, i2 in T.grid(128, 128, 128):
            with T.block("Y"):
                i, j, k = T.axis.remap("SSR", [i0, i1, i2])
                T.reads(A[i, k], B[k, j])
                T.writes(Y[i, j])
                with T.init():
                    Y[i, j] = T.float32(0)
                Y[i, j] = Y[i, j] + A[i, k] * B[k, j]
        for i0, i1 in T.grid(128, 128):
            with T.block("C"):
                i, j = T.axis.remap("SS", [i0, i1])
                T.reads(Y[i, j])
                T.writes(C[i, j])
                C[i, j] = T.max(Y[i, j], T.float32(0))
```
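The module created from TE can be transformed with the same schedule primitives used earlier. A minimal sketch, assuming MyModuleFromTE as built above:

```python
# Minimal sketch: apply the earlier transformations to the TE-generated module.
sch_te = tvm.tir.Schedule(MyModuleFromTE)
block_Y = sch_te.get_block("Y", func_name="mm_relu")
i, j, k = sch_te.get_loops(block_Y)
j0, j1 = sch_te.split(j, factors=[None, 4])
sch_te.reorder(j0, k, j1)
block_C = sch_te.get_block("C", func_name="mm_relu")
sch_te.reverse_compute_at(block_C, j0)
rt_lib_te = tvm.build(sch_te.mod, target="llvm")
```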
MLC process: develop, transform, build.
Remaining course topics:
- End-to-end model integration
- Automated program optimization
- Integration with machine learning frameworks
- GPU hardware acceleration
- Computational graph optimization: operator fusion and memory optimization
- Deploying models to a service environment
- Deploying models to edge devices