“Clean” code, terrible performance

Author | Casey Muratori

Translator | Nuka-Cola

Planning | Chu Xingjuan

The so-called programming “best practices” taught in many institutions today are, fundamentally, performance disasters waiting to happen.

Many programmers heard this advice back when they were still beginners: the code you write must be “clean.” To that end, many people have done a great deal of reading and studying.

Redux author Dan Abramov was once obsessed with “clean code” and removing duplication. Many years ago, he and his colleagues were developing a graphical editor canvas. When he saw a colleague’s submission, he grumbled, “All this repeated code is really annoying,” and then found a way to delete the duplication.

“It was late at night. I pushed my changes to the master branch and went to bed, proud of having helped clean up my colleague’s messy code.” But reality fell short of his expectations: the next day his boss saw the change, took him aside, and asked him to roll it back.

Dan didn’t understand it at the time. Only after several more years of work did he realize that, teamwork considerations aside, he had sacrificed flexibility to reduce code duplication. “It’s not a good trade-off,” he admitted.

Coincidentally, Casey Muratori, a veteran developer specializing in game engines, recently published an article arguing that many of the so-called “clean” code rules “actually don’t matter, and in most cases do not affect how the code really runs.”

This is based on Casey’s own testing. He said, “If you analyze them carefully, you will find that many of these requirements are quite arbitrary and hard to verify or falsify. But some are truly ‘evil’ and really do affect how the code runs.” We have translated Casey’s write-up for our readers.

Performance testing for “clean code”

Let’s look at a few representative “clean” suggestions:

• Prefer polymorphism to “if/else” and “switch”;

• Code should not know about the internals of the objects it works with;

• Functions should be small, and should do only one thing;

• “DRY”: Don’t Repeat Yourself.

These requirements are fairly specific, and it sounds like following them will lead to “clean” code. But the question is, how well does such code perform?

To test the actual performance of “clean” code more fairly, I decided to use the example code given in the relevant literature itself. That way, no one can accuse me of rigging the examples: I am simply taking the ready-made code others provide and evaluating how well it really performs.

Use polymorphism as much as possible?

I believe that many friends have seen the following “clean” code examples:

/* ========================================================================
   LISTING 22
   ======================================================================== */

class shape_base
{
public:
    shape_base() {}
    virtual f32 Area() = 0;
};

class square : public shape_base
{
public:
    square(f32 SideInit) : Side(SideInit) {}
    virtual f32 Area() {return Side*Side;}
    
private:
    f32 Side;
};

class rectangle : public shape_base
{
public:
    rectangle(f32 WidthInit, f32 HeightInit) : Width(WidthInit), Height(HeightInit) {}
    virtual f32 Area() {return Width*Height;}
    
private:
    f32 Width, Height;
};

class triangle : public shape_base
{
public:
    triangle(f32 BaseInit, f32 HeightInit) : Base(BaseInit), Height(HeightInit) {}
    virtual f32 Area() {return 0.5f*Base*Height;}
    
private:
    f32 Base, Height;
};

class circle : public shape_base
{
public:
    circle(f32 RadiusInit) : Radius(RadiusInit) {}
    virtual f32 Area() {return Pi32*Radius*Radius;}
    
private:
    f32 Radius;
};

This is a class hierarchy: a base class with several concrete shapes derived from it (square, rectangle, triangle, circle), each overriding a virtual function that computes its area.

Just as the rules require, polymorphism is used, the functions are small, and each does only one thing; in short, it fully complies. We end up with a very “clean” class hierarchy, where each derived class knows how to compute its own area and stores the data that calculation needs.

If we want to use this hierarchy in practice, say, to sum the areas of all the input shapes, it looks like this:

/* ========================================================================
   LISTING 23
   ======================================================================== */

f32 TotalAreaVTBL(u32 ShapeCount, shape_base **Shapes)
{
    f32 Accum = 0.0f;
    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
        Accum += Shapes[ShapeIndex]->Area();
    }
    return Accum;
}

As you may have noticed, I didn’t use iterators here; the “clean” rules say nothing about them, and to avoid confusing the compiler and distorting the performance comparison, I decided not to introduce any iterator abstractions.

Note also that this loop walks an array of pointers. This is a direct consequence of using a class hierarchy: we don’t know how big each shape is in memory, so unless we add yet another virtual function call to query each shape’s data size and do some kind of variable-length skipping, we must go through pointers to find where each shape actually starts.

Because this is an accumulation, there is a loop-carried dependency on the accumulator that can slow the loop down. To let the additions be reordered freely, I also wrote a hand-unrolled version, just to be safe:

/* ========================================================================
   LISTING 24
   ======================================================================== */

f32 TotalAreaVTBL4(u32 ShapeCount, shape_base **Shapes)
{
    f32 Accum0 = 0.0f;
    f32 Accum1 = 0.0f;
    f32 Accum2 = 0.0f;
    f32 Accum3 = 0.0f;
    
    u32 Count = ShapeCount/4;
    while(Count--)
    {
        Accum0 += Shapes[0]->Area();
        Accum1 += Shapes[1]->Area();
        Accum2 += Shapes[2]->Area();
        Accum3 += Shapes[3]->Area();
        
        Shapes += 4;
    }
    
    f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
    return Result;
}

A simple timing test of these two routines gives a rough measure of how many CPU clock cycles each shape’s area calculation takes:

The code is tested in two different ways. The first runs only once, measuring the computation in a “cold” state: the data should be in the L3 cache, but L1 and L2 have been flushed and the branch predictor has not been “trained” on the loop.

The second runs the code many times, to see how the loop performs when both the caches and the branch predictor are “hot.” Note that neither method is a truly precise measurement. As you will see, the differences are so large that serious analysis tools are simply unnecessary.

From the results, the two routines are close to each other. When computing shape areas, the “clean” code takes about 35 cycles per shape, sometimes 34 if you’re lucky. In other words, if we strictly follow the principles of “clean” programming, each shape costs us roughly 35 cycles.

But what happens if we ignore the first rule, skip polymorphism, and just use a switch statement?

I’ve written exactly the same code, except that instead of a class hierarchy (i.e., a vtable at runtime), everything is crammed into a single struct with an enum for the shape type:

/* ========================================================================
   LISTING 25
   ======================================================================== */

enum shape_type : u32
{
    Shape_Square,
    Shape_Rectangle,
    Shape_Triangle,
    Shape_Circle,
    
    Shape_Count,
};

struct shape_union
{
    shape_type Type;
    f32 Width;
    f32 Height;
};

f32 GetAreaSwitch(shape_union Shape)
{
    f32 Result = 0.0f;
    
    switch(Shape.Type)
    {
        case Shape_Square: {Result = Shape.Width*Shape.Width;} break;
        case Shape_Rectangle: {Result = Shape.Width*Shape.Height;} break;
        case Shape_Triangle: {Result = 0.5f*Shape.Width*Shape.Height;} break;
        case Shape_Circle: {Result = Pi32*Shape.Width*Shape.Width;} break;
        
        case Shape_Count: {} break;
    }
    
    return Result;
}

This is the old-school way of programming, from before we were all sold on “clean” code.

Note that since each shape variant no longer gets its own data type, any field a given shape doesn’t use (such as “Height” for a square) is simply ignored.

Now, instead of getting the area from a virtual function call, this code gets it from a function with a switch statement, which is completely against “clean” principles. But you can see it is no less concise, and the code barely changes: each case of the switch contains the same code as the corresponding virtual function in the class hierarchy.

As for the summing loop itself, it’s pretty much the same as the “clean” version:

/* ========================================================================
   LISTING 26
   ======================================================================== */

f32 TotalAreaSwitch(u32 ShapeCount, shape_union *Shapes)
{
    f32 Accum = 0.0f;
    
    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
        Accum += GetAreaSwitch(Shapes[ShapeIndex]);
    }
    
    return Accum;
}

f32 TotalAreaSwitch4(u32 ShapeCount, shape_union *Shapes)
{
    f32 Accum0 = 0.0f;
    f32 Accum1 = 0.0f;
    f32 Accum2 = 0.0f;
    f32 Accum3 = 0.0f;
    
    ShapeCount /= 4;
    while(ShapeCount--)
    {
        Accum0 += GetAreaSwitch(Shapes[0]);
        Accum1 += GetAreaSwitch(Shapes[1]);
        Accum2 += GetAreaSwitch(Shapes[2]);
        Accum3 += GetAreaSwitch(Shapes[3]);
        
        Shapes += 4;
    }
    
    f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
    return Result;
}

The only difference is that instead of calling a member function to get the area, we call a regular function. That’s all.

But clearly, the flat structure has many advantages over the class hierarchy: the shapes can sit contiguously in a plain array, with no need for pointers at all. And because every shape is the same size, no further indirection is needed.

Also, the compiler can now see exactly what the loop is doing: it can look at GetAreaSwitch and see the entire code path. It no longer has to guess about virtual Area functions that are only resolved at runtime.

What do these benefits translate into after compilation? Running the same four-shape test, the result is:

The results show something very interesting: just by making the code a bit more “old school,” we got a 1.5x performance boost. Yes, simply not using something as trivial as C++ polymorphism immediately improves performance.

By violating the first (and core) “clean” code principle, we cut the clock cycles per shape-area calculation from 35 to 24. In hardware terms, that is like the difference between an iPhone 14 Pro Max and an iPhone 11 Pro Max: three to four years of hardware evolution, recovered simply by not using polymorphism.

But this is just the beginning.

Ignore object internals?

What happens if we break more rules? For example, drop the second one, that code should not know the internals of the objects it works with. What if we instead use internal knowledge to make our functions more efficient?

Looking back at the switch statement that calculates the area, we can see that all the area calculations follow a similar pattern:

case Shape_Square: {Result = Shape.Width*Shape.Width;} break;
case Shape_Rectangle: {Result = Shape.Width*Shape.Height;} break;
case Shape_Triangle: {Result = 0.5f*Shape.Width*Shape.Height;} break;
case Shape_Circle: {Result = Pi32*Shape.Width*Shape.Width;} break;

That is, each area is a width times a height (or a value times itself), multiplied by a per-type coefficient where needed: 0.5 for the triangle, π for the circle.

This is where I disagree most with “clean” code principles: I think switch statements are great! A switch makes these patterns visible, because when code is organized by operation (rather than by type), common patterns are easy to spot. Looking at the “clean” version, we might never notice the pattern: not only is there more boilerplate, but advocates also recommend putting each class in a separate file.

So structurally, I’m generally not in favor of class hierarchies. But the most important point right now is this: by exploiting the pattern, we can greatly simplify the switch statement.

Remember: I didn’t choose this example. It’s an illustrative example chosen by “clean” code advocates themselves. And many tasks beyond area calculation have a similar algorithmic structure. To take advantage of the pattern, we can build a simple table giving the coefficient for each type. If we let single-parameter shapes like the circle and the square store their one value in both Width and Height, we can write a much simpler area function:

/* ========================================================================
   LISTING 27
   ======================================================================== */

f32 const CTable[Shape_Count] = {1.0f, 1.0f, 0.5f, Pi32};

f32 GetAreaUnion(shape_union Shape)
{
    f32 Result = CTable[Shape.Type]*Shape.Width*Shape.Height;
    return Result;
}

The two summing loops barely change: they call GetAreaUnion instead of GetAreaSwitch, and are otherwise identical.

Let’s see how this version works:

It turns out that by using knowledge of the actual types, effectively converting type-based thinking into table-driven, function-based thinking, we get a dramatic speedup. Continuing the earlier iPhone analogy, our computation has now jumped from phone class to desktop class.

And all we did was a table lookup plus one line of code, nothing else! Not only is it faster, it is also semantically simpler: fewer tokens, fewer operations, fewer lines of code.

So it pays to combine the data model with the operations on it, rather than insisting that we “ignore the internals.” We are now down to only 3.0 to 3.5 cycles per shape-area calculation.

Abandoning the first two rules of “clean” programming has resulted in a 10x performance improvement in our code.

A 10x performance boost is no small feat, considering that even the iPhone 6, introduced years ago (the oldest model supported by modern performance benchmarks), still has about a third of the performance of the iPhone 14 Pro Max.

In single-threaded desktop CPU terms, a 10x gap is like pitting a current CPU against one from 2010. The first two rules of “clean” programming alone wiped out 12 years of hardware evolution.

Should functions be small and do only one thing?

What’s even more striking is how easy it was to get that performance back. So far we haven’t stressed the rules that “functions should be small” and “functions should do one thing”: our test is so simple that it satisfies them naturally. So if we make the problem slightly harder, we should be able to see their actual impact, right?

Here, I’ve added one more trivial function to the original hierarchy, reporting how many corners each shape has:

/* ========================================================================
   LISTING 32
   ======================================================================== */

class shape_base
{
public:
    shape_base() {}
    virtual f32 Area() = 0;
    virtual u32 CornerCount() = 0;
};

class square : public shape_base
{
public:
    square(f32 SideInit) : Side(SideInit) {}
    virtual f32 Area() {return Side*Side;}
    virtual u32 CornerCount() {return 4;}
    
private:
    f32 Side;
};

class rectangle : public shape_base
{
public:
    rectangle(f32 WidthInit, f32 HeightInit) : Width(WidthInit), Height(HeightInit) {}
    virtual f32 Area() {return Width*Height;}
    virtual u32 CornerCount() {return 4;}
    
private:
    f32 Width, Height;
};

class triangle : public shape_base
{
public:
    triangle(f32 BaseInit, f32 HeightInit) : Base(BaseInit), Height(HeightInit) {}
    virtual f32 Area() {return 0.5f*Base*Height;}
    virtual u32 CornerCount() {return 3;}
    
private:
    f32 Base, Height;
};

class circle : public shape_base
{
public:
    circle(f32 RadiusInit) : Radius(RadiusInit) {}
    virtual f32 Area() {return Pi32*Radius*Radius;}
    virtual u32 CornerCount() {return 0;}
    
private:
    f32 Radius;
};

Squares and rectangles have four corners, triangles have three, and circles have none. I then changed the problem from computing the total area of the shapes to computing a corner-weighted sum of areas: each shape’s area is scaled by 1/(1+corner count) before being accumulated.

Like the total area, this corner-weighted sum has no practical meaning. It exists purely to demonstrate a performance difference using the simplest possible extra math.

Here is the “clean” summing loop updated with the extra math and the additional virtual function call:

f32 CornerAreaVTBL(u32 ShapeCount, shape_base **Shapes)
{
    f32 Accum = 0.0f;
    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
        Accum += (1.0f / (1.0f + (f32)Shapes[ShapeIndex]->CornerCount())) * Shapes[ShapeIndex]->Area();
    }
    return Accum;
}

f32 CornerAreaVTBL4(u32 ShapeCount, shape_base **Shapes)
{
    f32 Accum0 = 0.0f;
    f32 Accum1 = 0.0f;
    f32 Accum2 = 0.0f;
    f32 Accum3 = 0.0f;
    
    u32 Count = ShapeCount/4;
    while(Count--)
    {
        Accum0 += (1.0f / (1.0f + (f32)Shapes[0]->CornerCount())) * Shapes[0]->Area();
        Accum1 += (1.0f / (1.0f + (f32)Shapes[1]->CornerCount())) * Shapes[1]->Area();
        Accum2 += (1.0f / (1.0f + (f32)Shapes[2]->CornerCount())) * Shapes[2]->Area();
        Accum3 += (1.0f / (1.0f + (f32)Shapes[3]->CornerCount())) * Shapes[3]->Area();
        
        Shapes += 4;
    }
    
    f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
    return Result;
}

Essentially, we plug in one more function call per shape, adding another layer of indirection. Again, for clarity, no abstractions are used.

On the switch-statement side, I made essentially the same change. First, another switch statement for the corner count, corresponding exactly to the hierarchy version:

/* ========================================================================
   LISTING 34
   ======================================================================== */

u32 GetCornerCountSwitch(shape_type Type)
{
    u32 Result = 0;
    
    switch(Type)
    {
        case Shape_Square: {Result = 4;} break;
        case Shape_Rectangle: {Result = 4;} break;
        case Shape_Triangle: {Result = 3;} break;
        case Shape_Circle: {Result = 0;} break;
        
        case Shape_Count: {} break;
    }
    
    return Result;
}

Then the summing loops are updated in the same way:

/* ========================================================================
   LISTING 35
   ======================================================================== */

f32 CornerAreaSwitch(u32 ShapeCount, shape_union *Shapes)
{
    f32 Accum = 0.0f;
    
    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
        Accum += (1.0f / (1.0f + (f32)GetCornerCountSwitch(Shapes[ShapeIndex].Type))) * GetAreaSwitch(Shapes[ShapeIndex]);
    }
    
    return Accum;
}

f32 CornerAreaSwitch4(u32 ShapeCount, shape_union *Shapes)
{
    f32 Accum0 = 0.0f;
    f32 Accum1 = 0.0f;
    f32 Accum2 = 0.0f;
    f32 Accum3 = 0.0f;
    
    ShapeCount /= 4;
    while(ShapeCount--)
    {
        Accum0 += (1.0f / (1.0f + (f32)GetCornerCountSwitch(Shapes[0].Type))) * GetAreaSwitch(Shapes[0]);
        Accum1 += (1.0f / (1.0f + (f32)GetCornerCountSwitch(Shapes[1].Type))) * GetAreaSwitch(Shapes[1]);
        Accum2 += (1.0f / (1.0f + (f32)GetCornerCountSwitch(Shapes[2].Type))) * GetAreaSwitch(Shapes[2]);
        Accum3 += (1.0f / (1.0f + (f32)GetCornerCountSwitch(Shapes[3].Type))) * GetAreaSwitch(Shapes[3]);
        
        Shapes += 4;
    }
    
    f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
    return Result;
}

As with the total area, the class-hierarchy and switch implementations are basically the same code; the only difference is whether each value comes from a virtual function or a switch statement.

Now look at the table-driven version again. This is where combining operations with data really shines: this version only needs different values in the table. We don’t even need any extra information about the shape; we just fold the corner weighting into the coefficient table, and nearly identical code gives the result:

/* ========================================================================
   LISTING 36
   ======================================================================== */

f32 const CTable[Shape_Count] = {1.0f / (1.0f + 4.0f), 1.0f / (1.0f + 4.0f),
                                 0.5f / (1.0f + 3.0f), Pi32};

f32 GetCornerAreaUnion(shape_union Shape)
{
    f32 Result = CTable[Shape.Type]*Shape.Width*Shape.Height;
    return Result;
}

Running all the corner-area functions shows how the addition of a second shape property affects performance:

As you can see, the “clean” code fares even worse in this test. The switch version is now about 2x the speed of the “clean” version, and the lookup-table version is about 15x faster.

This highlights a deeper problem with “clean” code: the more complex the requirements get, the more these rules hurt performance. Carry this “clean” approach into real-world use cases, and the final performance is bound to suffer badly.

And the more “clean” code you use, the less the compiler can understand your intent. Everything lives in separate translation units, hidden behind virtual function calls. However smart the compiler is, it will struggle to digest such a scattered implementation.

Worse still, such code is hard on human readers too! As the demonstration showed, if a codebase is architected around functions, changes like pulling values from a table or removing a switch statement are easy; if it is architected around types, they become far harder, and the only real remedy may be a large-scale rewrite.

In short, adding a single attribute to the shape calculation widened the gap from 10x to 15x, equivalent to hardware performance suddenly regressing from 2023 to 2008. Erasing 14 years of hardware development with one added property is bold, isn’t it? And we haven’t even touched optimization yet.

Everything so far merely accounted for the loop-carried dependency chain, without attempting any real optimization. Now let’s look at a lightly optimized AVX version of the same calculation:

The speed difference lands in the 20x to 25x range. Of course, the AVX-optimized code ignores every notion of “clean” programming. Four of the five principles have now been debunked; let’s look at the last one.

Don’t repeat yourself?

Honestly, “don’t repeat yourself” makes sense. The versions we tested didn’t have much repetition anyway. Only the 4-wide accumulation routines count as repetition, and those exist purely for demonstration; in a real application we wouldn’t keep both variants.

If “don’t repeat yourself” meant something stricter, such as never building two separately encoded tables of the same coefficients, then I might actually object, since that kind of duplication can sometimes yield better performance. But that isn’t what people mean; they just mean don’t write the same thing twice, which is quite reasonable.

Most importantly, it is perfectly possible to keep code performance reasonable while following rule 5.

Conclusion

So here is my conclusion: of these five principles, only the last is worth following; the first four can be ignored. Why? You may have noticed that today’s software keeps getting slower. Measured against what modern hardware can really do, software performance is dismal.

Ask why software is so slow and you will get many answers, depending largely on the development environment and programming methodology. But from at least one angle, “clean” code bears an inescapable share of the responsibility: however sensible its underlying logic, the performance burden it imposes is more than we can afford.

So when faced with all these rules, even if some believe they improve the maintainability of a codebase, we should at least weigh the cost behind them.

Are we really willing to give up decades of hardware progress just to make programmers’ jobs slightly easier? Our job is to write programs that run well on the hardware. If these principles seriously hurt how software runs, haven’t they defeated the very purpose of our practice?

Of course, we can keep exploring better ways to organize code and improve maintainability and readability; those are perfectly reasonable goals. But these “clean” programming rules are not the way, and they are not reliable. I strongly suggest they come with a big asterisk: “Adopt these rules and your code’s performance may shrink by a factor of ten.”

Do you choose clean code or decent performance? Welcome to leave your opinion in the comment area~

The text and pictures in this article are from InfoQ


This article is transferred from https://www.techug.com/post/clean-code-poor-performance95e209b8c62bf5be96fa/
This site is only for collection, and the copyright belongs to the original author.