“Site Reliability Engineering: How Google Runs Production Systems” reading notes (5)

Testing for reliability

Past behavior can be used to predict future behavior accurately only if one of the following holds:

  1. The system has not changed at all
  2. Every change to the system can be fully described

Testing is the means of demonstrating that specific areas of the system remain equivalent before and after a change.

Types of Software Testing

  • Traditional tests
  • Production tests

Traditional tests

  • Unit tests
  • Integration tests
  • System tests
    • Smoke tests
    • Performance tests
    • Regression tests

Each test has a cost. Unit tests are usually cheap in time, whereas setting up the complete system and testing it usually takes several hours. Paying attention to the cost of testing is an important factor in improving software development efficiency.

Production tests

Production tests interact directly with a business system that is already deployed in production, rather than running in a closed test environment. They are sometimes called black-box tests.

  • Configuration test
  • Stress test
  • Canary test
    • A small number of machines are upgraded first and observed for an incubation period.
    • The new code is exposed to unpredictable live user traffic.
    • A quick rollback must be possible.
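
The canary steps above can be sketched roughly as follows. This is only an illustration, not how the book implements it: `canary_rollout` and its callback parameters (`upgrade`, `rollback`, `is_healthy`) are hypothetical names, and the incubation period is modeled as a fixed number of health checks rather than real waiting time.

```python
from typing import Callable, List


def canary_rollout(
    machines: List[str],
    canary_count: int,
    upgrade: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    incubation_checks: int,
) -> bool:
    """Upgrade a few machines first; continue only if they stay healthy.

    Returns True if the whole fleet was upgraded, False if the canaries
    were rolled back.
    """
    canaries, rest = machines[:canary_count], machines[canary_count:]
    for machine in canaries:
        upgrade(machine)

    # Incubation period: check canary health a fixed number of times.
    # A real rollout would wait between checks; the wait is omitted here.
    for _ in range(incubation_checks):
        if not all(is_healthy(m) for m in canaries):
            # Quick rollback: only the canaries ever ran the new version.
            for machine in canaries:
                rollback(machine)
            return False

    # Canaries survived incubation, so upgrade the remaining machines.
    for machine in rest:
        upgrade(machine)
    return True
```

Because only the canaries run the new version during incubation, a rollback touches a small number of machines, which is what keeps it fast.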

Create a build and test environment

  • Focus testing where it yields the most benefit for the least effort
    • Prioritize
    • Identify the key functions and key classes
    • Identify the APIs that other teams use
  • Pass the smoke tests before every launch
  • Turn every bug found into a test case
  • Build a good testing infrastructure
    • Track code changes
    • Build every time the code changes
  • Use precise builds: rebuild only what changed, and run only the tests affected by the modified code
  • Use tools to visualize or quantify test coverage
  • Money-related systems need more testing

Testing at scale

Unit tests need targeted coverage of the parts where components depend on one another.

Testing tools used at scale

Testing for disaster

Disaster recovery tools can be carefully designed to run offline. Such tools:

  • Compute a checkpoint state that is equivalent to the state of a cleanly stopped service
  • Push the checkpoint state so that existing non-disaster validation tools can load it
  • Support the usual release safety checks

The need for speed

Some tests give different results when they are run repeatedly. For those scenarios, the test needs to be run a certain number of times to tell a real failure from an intermittent one.
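
Repeated runs can be reduced to a pass rate. The helper below is a minimal sketch with hypothetical names (`pass_rate`, `classify`); it assumes the test is a callable returning `True` on success, and it does not say how many runs are enough, which depends on how rarely the test fails.

```python
from typing import Callable


def pass_rate(test: Callable[[], bool], runs: int) -> float:
    """Run a test repeatedly and return the fraction of runs that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if test())
    return passed / runs


def classify(rate: float) -> str:
    """Label a test by its observed pass rate."""
    if rate == 1.0:
        return "stable-pass"
    if rate == 0.0:
        return "stable-fail"
    return "flaky"
```

A test labeled "flaky" passes in some runs and fails in others, so a single run of it says little about whether a change broke anything.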

Pushing to production

Configuration files for production environments are often overlooked by tests.


Writing configuration files in an interpreted language is risky: nothing limits how long the program may run, so a deadline check needs to be added.
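
A deadline check of this kind might look like the sketch below. It assumes, purely for illustration, that the configuration is a Python script that prints JSON to standard output; `evaluate_config` and the two sample scripts are hypothetical. The script runs in a child interpreter so that it can actually be killed when it overruns.

```python
import json
import subprocess
import sys


def evaluate_config(script: str, deadline_seconds: float) -> dict:
    """Run a config script in a child interpreter and parse its JSON output.

    Raises TimeoutError if the script does not finish within the deadline.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            timeout=deadline_seconds,  # the child is killed when this expires
            check=True,                # a non-zero exit raises CalledProcessError
        )
    except subprocess.TimeoutExpired as exc:
        raise TimeoutError(
            f"config evaluation exceeded {deadline_seconds}s"
        ) from exc
    return json.loads(completed.stdout)


# Hypothetical configs: one well-behaved, one that never terminates.
GOOD_CONFIG = 'import json; print(json.dumps({"replicas": 3}))'
RUNAWAY_CONFIG = "while True: pass"
```

With the deadline in place, a test over `RUNAWAY_CONFIG` fails with `TimeoutError` instead of hanging the test run.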

Use a mature syntax (YAML) and a heavily tested parser.

Production probes

Testing verifies that the system behaves acceptably on known data.
Monitoring confirms that the system behaves acceptably on unknown data.

Known good requests should succeed and known bad requests should fail. Replaying known requests shows whether the system is behaving normally.
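
The replay idea can be sketched as a small probe that compares each known request with its expected outcome. Everything here is an assumption for illustration: `ProbeCase`, `run_probe`, and the toy `handle` function stand in for a real service and its request format.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ProbeCase:
    request: str
    should_succeed: bool


def run_probe(handler: Callable[[str], bool], cases: List[ProbeCase]) -> List[ProbeCase]:
    """Replay known requests; return the cases whose outcome did not match."""
    return [case for case in cases if handler(case.request) != case.should_succeed]


def handle(request: str) -> bool:
    """Toy stand-in for the service: only GET requests are accepted."""
    return request.startswith("GET /")


CASES = [
    ProbeCase("GET /index", should_succeed=True),    # known good request
    ProbeCase("DROP /index", should_succeed=False),  # known bad request
]

mismatches = run_probe(handle, CASES)
print("system normal" if not mismatches else f"unexpected outcomes: {mismatches}")
```

An empty mismatch list means the good request succeeded and the bad request was rejected; a service that starts accepting everything would be caught by the second case.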

(This feels like a translation problem in the book. The so-called probe should be a mock service: it is deployed in the production environment and returns a fixed value for certain input parameters, so callers can use it for testing.)


Testing is one of the higher-return investments engineers can make to improve reliability.

This article is reprinted from: https://xilidou.com/2022/05/09/sre6/
This site is for inclusion only, and the copyright belongs to the original author.
