Thoughts on Bilibili's SLB Outage Postmortem

Original link: http://afoo.me/posts/2022-07-16-bilibili-slb-broken-postmortem.html

Wang Fuqiang

2022-07-16


The postmortem as a whole is good and detailed, but I can't shake the feeling that the improvement measures at the end don't quite hit the mark.

In fact, there is no need to overemphasize multi-active deployment. If the problem really lies in the access layer, then no number of active access points will help, right?

As for fire drills, no objection there. The earlier you train, the earlier you are prepared!

I think that more attention should be paid to R&D process management, especially the testing and launch of key infrastructure.

The SLB problem this time most likely comes down to the newly added weight-based load balancing feature not being fully tested, especially the precheck. The case of 0 versus “0” strikes me as a typical boundary condition, one that testing should not have missed…
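To make the 0 versus “0” point concrete, here is a minimal sketch of the kind of precheck I have in mind. It is not the code from the postmortem: the function names (`normalize_weight`, `precheck_weights`, `weight_gcd`) and the backend names are hypothetical, and I am assuming weights may arrive either as integers or as numeric strings. The idea is simply to coerce every weight to one numeric type and reject anything else before the balancing logic sees it.

```python
import re
from functools import reduce
from math import gcd


def normalize_weight(raw):
    """Coerce one raw weight to a non-negative int, or raise ValueError."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool):
        raise ValueError(f"invalid weight: {raw!r}")
    if isinstance(raw, int):
        weight = raw
    elif isinstance(raw, str) and re.fullmatch(r"[0-9]+", raw.strip()):
        # The string "0" must end up as the number 0, not as a distinct value.
        weight = int(raw.strip())
    else:
        raise ValueError(f"invalid weight: {raw!r}")
    if weight < 0:
        raise ValueError(f"negative weight: {raw!r}")
    return weight


def precheck_weights(raw_weights):
    """Validate every backend weight before it reaches the balancer."""
    weights = {name: normalize_weight(w) for name, w in raw_weights.items()}
    if not any(weights.values()):
        raise ValueError("no backend has a positive weight")
    return weights


def weight_gcd(weights):
    """GCD of the positive weights; expects output of precheck_weights."""
    return reduce(gcd, (w for w in weights.values() if w > 0))


if __name__ == "__main__":
    # Hypothetical backends mixing 0, "0", ints and numeric strings.
    checked = precheck_weights({"a": 0, "b": "0", "c": 2, "d": "4"})
    print(checked)
    print(weight_gcd(checked))
```

The boundary cases worth putting in a test suite are exactly the ones that are easy to forget: 0, “0”, the empty string, negative values, non-numeric strings, and the case where every backend has weight zero.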

Therefore, tightening R&D process management, routine code review, and pre-launch testing of key infrastructure can greatly reduce the probability of such problems in SLB (and in other key infrastructure).

As for fire drills, they amount to training the team in a prepared, planned way, but I feel Bilibili should have joined the ranks of Chaos Engineering long ago. Going from passive to active, using offense as defense: that is the ultimate stability test ^_-

So, to sum up briefly, I think the three things to do, in order of priority, are:

  1. Strengthen R&D process management, especially the addition, testing and launch of key foundational middleware;
  2. Run fire drills to exercise the team’s emergency response capabilities;
  3. Pursue multi-active deployment, advancing gradually as the situation allows.

That's all.



©Wang Fuqiang Personal Copyright, All Rights Reserved.
Copyright © Wang Fuqiang All Rights Reserved – Since 2004

