Some recent optimizations of ECS

Original link: https://blog.codingnow.com/2023/06/optimize_ecs.html

Recently we are optimizing our 3d engine. The rendering object management layer of the engine is based on the ECS framework, and the entire engine is designed and constructed based on Lua. In other words, the data in the rendering part can be read and written through Lua. However, for the core rendering loop, Lua’s performance is limited. When there are many objects to be rendered, the performance problem of the previous loop written in Lua is revealed.

We designed luaecs very early on for this purpose. Put the data in the C structure and give the interface for Lua access. In this way, it is convenient to use Lua for rapid development in the early stage, and to refactor and optimize the core cycle in C later. At the beginning of this year, we refactored the rendering core system in C , which basically solved the performance problem.

Recently, we have done some optimization work on the basis of profile. The performance hotspot discovered this time is that there are a large number of objects in the game scene, but the proportion of the scene that needs to be rendered is very small. Before, an optimization has been done for this scenario, and the solution is to add the grouping feature to the ecs framework .

Reasonable grouping can quickly tag the objects to be rendered after the lens is cropped, and quickly filter the objects managed by ECS according to the tags. Screening is carried out in C, so the performance of Lua operation ecs is greatly improved. But when we moved the core loop into C, we found that there was still room for optimization.

The reason is: when the number of objects managed by ECS is much larger than the number of objects that need to be screened out in the rendering loop, these objects are discrete in the storage space, and the time complexity of retrieving a single component is O (Log n). This results in a time complexity of O(n Log n) for the final kernel loop. luaecs has made a very limited optimization for this: it will record the last query position of a component, and use it for reference in the next query. Therefore, sequentially traversing these discrete tags will speed up to a certain extent, but the improvement is limited.

The goal of this optimization is: when the rendering core loops through all related Components of the Entity in the lens, the whole process can achieve O (Log N) in most cases. In order to achieve this, it is natural to trade space for time.

Imagine how this could be done in a non-ECS framework? It is usually necessary to build an additional linear container to store references to all visible objects. Under the ECS framework, objects are dynamically composed of several components, and the relationship between components will be more complicated. If we look at it from the perspective of components as the smallest unit, it is necessary to have an efficient container to put down all the components that need to be processed, and the number is very large (estimated on the order of one hundred thousand). Every access is performance sensitive. This container is obviously a Cache, which will be affected by the change of the lens and the life and death of the object.

Cache invalidation is the most difficult problem to deal with. Here, the scheme of using class smart pointers according to the traditional method must have a great impact on performance. It would be a better solution to use id in the ECS framework. I thought it would not be a big problem to use binary search for ordered ids, but after the actual profile, there is still room for optimization. I additionally implemented an index cache to record the specific location of each component during the traversal process, which is used as a reference for the next traversal process. This location cache does not have to be consistent with the actual situation. Anyway, it will be checked again every time it is searched, and it will be updated if it is wrong. The actual measurement really improved a lot of performance. In the large-scale scene test with a large number of objects on the iPhone 8, the CPU time occupied by the rendering core cycle of each frame can be controlled below 10ms, which can already meet our predetermined performance requirements.

The next little question is: Where should this cache object be placed? It is obviously inappropriate to use a global variable. Although we only have one ecs world now, it is not ruled out that it will become multiple in the future. And global variables themselves are a bad smell.

It seems that the cache object is best bound to the world. But adding directly to the world object in the traditional way is also a bad smell. After all, it is only related to the rendering system, not the infrastructure of the ECS framework.

So, what about being in the world as a standalone component type? That is, as a singleton component, it coexists with other components. The problem here is twofold:

First, in the current ecs framework, the number of independent types is limited. If many similar systems add new types to their own unique global objects, a lot of type resources will be used.

Second, component types are exclusive. If many systems apply for their own types, they must be declared in a unified place to distinguish them from each other. In this way, what is originally implemented in a rendering system must expose itself. When maintaining the code, in addition to writing the code in the implementation file, it is also necessary to add an external interface, which is not a good taste. .

When thinking about a solution, I started with C++’s global variables. C++’s global object is a very complicated thing. It is more complicated than C’s global variable in its construction and destruction process. This requires the linker to do a lot of work. That is, the linker coordinates the global objects in each independent module.

What if what we need is not a global object, but to bind the object to the world and keep it unique in each world? We can open up a global object area (in the form of a component singleton) for the world, and bind all such objects to this area at runtime. This area is an array of object pointers, and each global object has an independent index number. So how should the index number be assigned so that the systems do not conflict with each other? I think it can be done with the help of a global object’s auto-increment id allocator. That is: this self-incrementing id allocator only assigns a unique id to the global object used by each module instead of the object itself; and with the help of id, we can apply for a unique slot in the global object area of each different world bit, holds the pointer to the object. Ultimately, we can achieve that each world has its own independent set of global objects, and accessing them still takes O(1) time complexity.

The above is a little ECS optimization done recently. There is another performance optimization about the material system, which will be discussed next time.

This article is transferred from: https://blog.codingnow.com/2023/06/optimize_ecs.html
This site is only for collection, and the copyright belongs to the original author.