Original link: http://www.rowkey.me/blog/2022/06/04/arch-usage/
An explanation of the purpose of software architecture from the book “Clean Architecture”:
The goal of software architecture is to minimize the human resources required to build and maintain the required system.
In other words, the purpose of software architecture is to minimize the human cost of building and maintaining the system. It follows that the key mode of thinking in architecture design is judgment and trade-off (for program design it is logic and implementation): how to select and combine technologies so that the need for human resources is minimized.
One thing to note: it makes no sense to discuss architecture divorced from the business; a technical architecture and its evolution are driven by business goals.
The six-step approach to architecture
This is the author’s understanding of the six-step architectural thinking method once shared by Xia Huaxia, chief architect of Meituan.
It is especially important that, when faced with a problem, you first try to convert an unknown problem into a known one, rather than inventing a new problem.
Architectural means
The purpose of architecture is to manage complexity, which mainly comes in three forms: high performance, high availability, and scalability. In addition, distributed systems are the typical complex systems encountered in architecture work, and their common problems have well-known solutions.
High availability
High availability = system built on multiple machines = distributed system
- Redundancy: multi-active deployment within the same city or across regions (same-city multi-active / geo multi-active)
- Downgrade: a downgrade plan needs to be established for each key node, covering every node a request passes through, so that most users can still be served when traffic exceeds the estimate. Take a polling-based live streaming system as an example:
Node | Measure | Description |
---|---|---|
Client | Downgrade pull frequency | The server adjusts the polling interval in real time |
. | Avalanche-prevention strategy | After a polling error, the polling interval is increased exponentially |
. | Like-message merging | Merge likes on the client to reduce the number of messages the server must process |
Nginx | Interface rate limiting | Limit QPS per interface |
Business container | Auto-downgrade pull size | The number of items returned per message type can be adjusted online |
. | Upstream frequency downgrade | The frequency limits for likes and comments can be lowered |
Kafka | Disaster-recovery queue | Write to a disaster-recovery queue when Kafka fails |
Message-processing backend | Auto-discard messages | Unimportant messages can be discarded as appropriate |
. | Latency-based degradation | Depending on processing delay, switch between locked serial and unlocked parallel processing |
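The avalanche-prevention strategy in the table (exponentially increasing the polling interval after errors) can be sketched as follows. This is a minimal illustration, not the original system's code; the class and parameter names are invented for the example:

```python
import random

class PollingBackoff:
    """Client-side polling interval with exponential backoff on errors.

    After each failed poll the interval doubles (up to a cap), so a
    struggling server sees exponentially less traffic; a successful
    poll resets the interval to the server-assigned base value.
    """

    def __init__(self, base_seconds=2.0, max_seconds=60.0):
        self.base = base_seconds      # interval the server can adjust in real time
        self.max = max_seconds
        self.interval = base_seconds

    def on_success(self):
        self.interval = self.base

    def on_error(self):
        self.interval = min(self.interval * 2, self.max)

    def next_delay(self, jitter=0.1):
        # A little jitter so clients do not re-poll in lockstep.
        return self.interval * (1 + random.uniform(-jitter, jitter))
```

After two consecutive errors the interval grows from 2s to 8s; after a success it drops back to the base, which is the behavior the downgrade table describes.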
- Full-link service monitoring: add monitoring to every node on the request path, including client APM, error logs, JVM monitoring, QPS, status codes, latency, and basic server-resource monitoring (bandwidth, CPU, memory, IO). An example:
Node | Monitoring content |
---|---|
Client | APM |
Nginx | Error-code monitoring and alerting; QPS, interface latency distribution, bandwidth |
Business applications | Error logs, QPS, status codes, latency; JVM; QPS, status codes, and latency of dependent services |
Kafka | Message backlog |
Message-processing backend | Error logs; JVM; messages processed, message-processing delay |
Basic resources | Bandwidth, CPU utilization, memory, disk |
High performance
By-products of distributed systems
- database cluster
- cache architecture
- load balancing
- NoSQL: do not limit yourself to relational databases; choosing a NoSQL database in a suitable scenario improves performance
- Heterogeneous index: when data is partitioned, queries that do not use the shard key would otherwise require scanning every database and table. Build a heterogeneous index table: first query it to obtain the primary key of the target record, then query by that primary key, avoiding the full scan.
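The heterogeneous-index lookup can be sketched with in-memory dicts standing in for real shards and the index table. All names here are illustrative assumptions (orders sharded by `user_id`, looked up by `order_no`), not the article's actual schema:

```python
# Orders are sharded by user_id; a separate heterogeneous index maps
# order_no -> user_id, so a query by order_no does not need to scan
# every shard.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # shard -> {order_no: order}
order_index = {}                               # order_no -> user_id (heterogeneous index)

def shard_for(user_id):
    return shards[hash(user_id) % NUM_SHARDS]

def insert_order(user_id, order_no, payload):
    shard_for(user_id)[order_no] = {"user_id": user_id, "payload": payload}
    order_index[order_no] = user_id            # keep the index table in sync

def find_by_order_no(order_no):
    user_id = order_index.get(order_no)        # step 1: index-table lookup
    if user_id is None:
        return None
    return shard_for(user_id).get(order_no)    # step 2: targeted shard query
```

In a real system the index table would itself be a database table (often maintained asynchronously via binlog), but the two-step lookup is the same.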
Low-latency solutions [improving system response performance]
- Asynchrony: queue buffering, asynchronous requests.
- Concurrency: use multiple CPUs and multiple threads to execute business logic.
- Proximity principle: caching, tiered storage.
- Reduce IO: merge fine-grained interfaces into coarse-grained ones; for frequent overwrite operations, only the last write needs to take effect. One point deserves special attention: avoid calling external services inside a loop; prefer a single request to a coarse-grained batch interface outside the loop.
- Partitioning: keep frequently accessed data sets to a reasonable size.
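The "reduce IO" point about batching can be sketched as follows. The `fetch_users_batch` service is a hypothetical stand-in for one coarse-grained RPC/HTTP call:

```python
def fetch_users_batch(user_ids):
    # Stand-in for a single coarse-grained call returning many records;
    # in a real system this would be one network round trip.
    return {uid: {"id": uid, "name": f"user-{uid}"} for uid in user_ids}

# Anti-pattern (N network round trips):
#   for uid in ids:
#       user = fetch_user(uid)   # one external call per loop iteration

def load_profiles(user_ids):
    # Preferred: one batched call outside the loop, then local lookups.
    users = fetch_users_batch(user_ids)
    return [users[uid] for uid in user_ids]
```

The behavior is identical, but the number of external calls drops from N to 1, which is usually the dominant latency cost.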
High-throughput solutions
- Layered calls: access layer, logic layer, data layer, with logic-layer clusters managed through a Proxy or Router
- Asynchrony and concurrency
Scalability
- Layered architecture / clean architecture: one-way dependencies, clear responsibilities.
- SOA: Service-oriented architecture with larger service granularity.
- Microkernel: Pluggable architecture for clients.
- Microservices: suitable for complex large-scale systems, fine-grained services.
System scaling approaches
- Scale by cloning -> high availability
- Scale by splitting different things (by function) -> vertical scaling
- Scale by splitting similar things (by data partition) -> horizontal scaling
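The two kinds of splitting can be sketched as routing decisions. The shard and service names below are invented for illustration:

```python
# Horizontal scaling: split *similar* things (e.g. user records) across
# identical shards. Vertical scaling: split *different* things
# (e.g. order vs. payment functionality) into separate deployments.
USER_SHARDS = ["users-db-0", "users-db-1", "users-db-2"]          # horizontal split
SERVICES = {"orders": "orders-svc", "payments": "payments-svc"}   # vertical split

def route_user(user_id):
    # Same kind of data, partitioned by key -> horizontal scaling.
    return USER_SHARDS[user_id % len(USER_SHARDS)]

def route_function(name):
    # Different responsibilities on different clusters -> vertical scaling.
    return SERVICES[name]
```

Cloning (the first approach) would simply run identical replicas behind a load balancer, with no routing by key or function at all.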
Distributed Systems
- Massive request volume: essentially how to achieve high throughput, low latency, and high availability, as described above.
- Massive server management:
  - Failure recovery and scalability: distributed directory services, message-queue services, distributed transaction systems
  - Operations convenience: automated deployment tools, centralized log analysis, full-link monitoring
- Development efficiency:
  - Complex communication programming: microservice frameworks, asynchronous programming tools
  - Division of labor across many modules: IaaS/PaaS/SaaS cloud services
Architectural Principles
- Avoid over-engineering: the simplest architecture is the best architecture. The simplest solution is the easiest to implement and maintain and avoids wasting resources, but the design must still leave room for extension.
- Redundancy design: add redundant nodes for services and databases to ensure high availability, typically via database master-slave replication and application clustering.
- Multi-active data centers: for disaster recovery, building multi-active data centers fundamentally guarantees application availability, preventing the failure of one data center due to uncontrollable factors from making the whole system unavailable.
- Stateless design: APIs, interfaces, etc. must have no ordering dependencies, and a resource must not be affected by changes to other resources. Stateless systems scale better. If state is necessary, either the client manages it or the server keeps it in a distributed cache.
- Rollback: any business operation, especially a critical one, needs a recovery mechanism. Rollback can be implemented with log-based WAL, event-based event sourcing, and similar techniques.
- Degradable/self-protecting: the system has a rate-limiting mechanism that rejects excess requests when upstream traffic exceeds its load capacity. Traffic can be blocked at the front of the application by a manual switch or an automatic one (driven by monitoring for abnormal traffic). Rate-limiting algorithms include token buckets (allow bursts), leaky buckets (smooth traffic to a uniform rate), counters, and semaphores (limit concurrency). In addition, never rely on the reliability of third-party services: any function that depends on one must have degradation measures and circuit-breaker management, for example a timeout on every network operation.
- Problem traceability: when a problem occurs, the system can reconstruct the trajectory of a request and the request information at each step. Distributed tracing systems solve this.
- Monitorable: monitorability is key to stable operation. It covers business-logic monitoring, application-process monitoring, and monitoring of the system resources the application depends on, such as CPU and disk. Every system needs monitoring at all of these levels.
- Fault isolation: isolate the resources (threads, CPUs) and services the system depends on so that the failure of one service does not affect calls to other services. Faults can be isolated through thread pools or separately deployed nodes. In addition, providing separate access channels for different users not only isolates faults but also eases user permission control.
- Mature, controllable technology selection: use mainstream, mature, well-documented, well-supported technologies, and choose what is appropriate rather than what is hottest. When choosing between self-development and open source, consider the degree of fit: if an open-source option fits the functional requirements closely, choose it; if it is a subset or superset of the requirements, weigh the cost of understanding and adapting the open-source technology against the cost of building your own.
- Tiered storage: memory -> SSD -> traditional hard disk -> tape; data can be stored in tiers according to its importance and life cycle.
- Cache design: isolate requests from back-end logic and storage; caching is one mechanism implementing the proximity principle. It includes client caches (pre-delivered resources), Nginx caches, local caches, and distributed caches.
- Asynchronous design: for interfaces whose callers do not need the result, or can accept a delayed result, responding asynchronously through a queue greatly improves system performance; likewise, returning immediately instead of waiting for a downstream service's result improves responsiveness. Asynchronous queues are also a common means of handling distributed transactions.
- Forward-looking design: based on industry experience and business-volume forecasts, design scalability, backward compatibility, and capacity early warning in advance, so that problems do not cascade once system capacity is exceeded.
- Horizontal scaling: prefer solving problems by adding machines over vertical scaling, since it lets the system's load capacity grow almost without limit. Cloud computing can additionally adjust capacity automatically according to load, saving cost while keeping the service available.
- Build and release in small steps: iterate quickly and fail fast; avoid project plans that span too long.
- Automation: automating packaging and testing is continuous integration; automating deployment is continuous deployment. Automation is the basic guarantee of rapid iteration and trial and error.
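Of the rate-limiting algorithms listed under "degradable/self-protecting", the token bucket can be sketched as follows. This is a minimal single-process illustration (the `now` parameter exists only to make the example deterministic), not a production limiter:

```python
import time

class TokenBucket:
    """Token-bucket limiter: refills at `rate` tokens/sec up to
    `capacity`, so short bursts up to `capacity` are allowed while the
    long-term average rate is bounded by the refill rate."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill according to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # excess requests are rejected (self-protection)
```

A leaky bucket differs only in that it drains requests at a fixed rate with no burst allowance; a counter resets a quota per time window; a semaphore bounds concurrent in-flight requests rather than request rate.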
Technology selection principles
- Whether it is a production-grade, mature product. Production-level, operable, manageable, mature, and stable technologies are the first choice. Technology has a life cycle; stay sensitive to new technologies, but do not adopt them too early. The version number, the number of companies using it, the completeness of documentation, and operability (logs, command lines, consoles, fault detection and recovery) are all signs of maturity.
- Introducing new technology must follow the principle that less is more: do not introduce a new technology unless you truly must. New technology carries both learning and maintenance costs, and for a company, the more technology stacks there are, the higher those costs: knowledge cannot be shared across stacks and no coherent technology system can be established, which seriously hurts R&D efficiency and the ability to scale the business. When introduction is truly necessary, there must be a strict technical review process.
- Before introducing a new technology, it is necessary to fully investigate and understand the prerequisites of the new technology , and not blindly introduce it. For those that really need to be introduced but the prerequisites are not yet met, it is necessary to make phased planning, lay a solid foundation first, and then introduce new technologies in a timely manner.
- Don’t blindly follow big companies . Many times the technology that works for a large company is not suitable for a small company. After all, large companies have sufficient manpower, resources and time that small companies cannot match.
- Technology is culture-specific. Technologies popular abroad may not be popular domestically. When selecting, prefer technologies that have a local cultural foundation and have already been adopted successfully. Also, different companies have different technical cultures; consider your company's business model, existing technology ecosystem, and developer skills.
- Use technology you can control . According to the business scale, team size and personnel level, the technology needs to be analyzed after a comprehensive evaluation to decide whether to introduce it.
- For key technologies, find the right people to use and develop them. Giving them to the wrong people will not only fail to solve the problem but create more problems.
- Resist religious belief in technology. There is no absolutely good or bad technology, only technology suitable or unsuitable for a given scenario.
- Practice brings true knowledge . For the introduction of new technologies, it is necessary to run samples, do stress tests, and even read the source code on the basis of careful study of its documentation, and then gradually expand the scale of use after verification by some pilot projects.
- Complex, heavyweight technologies have an adoption life cycle. Consider it comprehensively, formulate an implementation plan, and roll the technology out in stages: introduction, customization, small-scale pilot, then gradually expanded production use.
- Choosing among self-development, open source, and purchase: if a technology is not a core strength and provides no differentiated competitive advantage, use open source directly or purchase it if cost allows; for core technologies on the critical path, you must be able to customize or build them yourself. Startups should use open-source technology or purchased cloud services as much as possible, gradually building customization and self-development capabilities as the business grows.
In addition, for open source technologies, it is necessary to pay attention to:
- Whether it is a product a first-tier Internet company has put into production. For example, much of Alibaba's open-source software has been verified internally in production, closing the feedback loop. Many third-party software vendors only open-source their code and have no internal need for it themselves, so they depend on community feedback to close the loop, which means you end up discovering the pitfalls together with them.
- Whether a large company or organization endorses it. For example, K8S had no technical edge when Google first launched it, but Google's strong appeal and endorsement drove massive adoption and closed the loop, letting K8S all but monopolize the container PaaS market. Similarly, the vast majority of open-source projects under Apache can be trusted.
- Whether the open-source community is active. The number of stars on GitHub is an important indicator, along with the frequency of code and documentation updates (especially in recent years). These indicators directly reflect the vitality of an open-source product.
Data Design Principles
- Pay attention to storage efficiency
- Reduce transactions
- Reduce join table queries
- Use indexes appropriately
- Consider using a cache
- Avoid relying on database-side computation (functions, stored procedures, triggers, etc.); place the load on the business-application side, which is easier to scale
- In data-statistics scenarios, Redis can be used for statistics with high real-time requirements; non-real-time statistics can use a separate table, updated asynchronously via a queue or by scheduled computation. For statistics with high consistency requirements, rely on transactions or periodic reconciliation to ensure accuracy
- Index selectivity rule: columns whose selectivity exceeds 20% should be indexed if they are queried
- Numeric data can be shortened with order-preserving compression, which reduces string length while keeping the ordering unchanged; base-36 conversion is one such order-preserving compression
- For deduplicated counting over massive data, if some error is acceptable, choose a cardinality-estimation algorithm (HyperLogLog, LogLog counting) or a Bloom filter
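The order-preserving base-36 compression mentioned above can be sketched as follows. Note the fixed-width zero padding: without it, lexicographic string order would not match numeric order (the width of 8 is an arbitrary choice for the example):

```python
import string

DIGITS = string.digits + string.ascii_lowercase  # 36 symbols: 0-9, a-z

def to_base36(n, width=8):
    """Encode a non-negative int in base 36, zero-padded to `width`
    so that lexicographic order matches numeric order."""
    if n < 0:
        raise ValueError("non-negative integers only")
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return (out or "0").rjust(width, "0")
```

For example, 123456 encodes to "2n9c" (4 characters instead of 6), while padded values still sort correctly as strings, so they remain usable in range scans and sorted indexes.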
System stability principles
- Gray (canary) releases, to minimize the scope of impact
- Slow-query review
- Defensive programming, trusting no one and no service: automatic circuit breakers, manual downgrades
- SOP (Standard Operating Procedure): tooling, automation
- Routine inspection (!!!): DB, call chains, P90 response time
- Capacity planning: see the capacity-planning section below
SOP example:
Requirements management | Project development | Testing | Release | Monitoring and alerting | Troubleshooting |
---|---|---|---|---|---|
Task management, technical review, review of upstream/downstream dependency changes | Branch management, cross code review, code standards, logging standards, static code checks | Unit testing, smoke testing, regression testing, office-environment testing, online stress testing | Release and verify business results online | Business-indicator monitoring, system monitoring dashboards | Report to superiors as soon as possible and promptly inform business parties of: the problem, its scope of impact, the solution, estimated recovery time, and online service downgrades |
System Capacity Planning
Evaluate and quantify the system and its key modules so that, when capacity is exceeded, the servers are not overwhelmed and most users can still be served.
- Estimate the business volume at a future point in time (QPS, daily data volume, etc.) from the traffic model, historical data, and prediction algorithms.
- Evaluate the maximum capacity of a single node (data volume per database node, concurrency per application server) through performance testing, calculate the number of nodes needed for the estimated business volume, and deploy 1.5 times that number (the DID principle).
- Verify the load capacity of the entire system with performance stress tests.
- Design early warning, rate limiting, and rapid-recovery measures, plus subsequent expansion plans, for when the estimated capacity is reached.
PS: in capacity estimation, the machine count follows the DID principle: Design for 20x, Implement for 3x, Deploy at 1.5x. That is, deploy 1.5 times the number of machines needed to carry the estimated business traffic.
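The deployment step of the DID rule is simple arithmetic; a sketch (the example QPS figures are invented):

```python
import math

def nodes_to_deploy(peak_qps, single_node_qps, deploy_factor=1.5):
    """Number of nodes per the DID deployment rule: provision 1.5x
    the capacity needed for the estimated peak traffic, rounded up."""
    return math.ceil(peak_qps * deploy_factor / single_node_qps)

# e.g. an estimated peak of 30,000 QPS on nodes benchmarked at 2,500 QPS
# each requires ceil(30000 * 1.5 / 2500) = 18 nodes.
```

The single-node QPS figure must come from the performance tests described above, not from vendor specs.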
Architecture hidden-danger analysis
FMEA (Failure Mode and Effects Analysis) form
- Function point: from the user's perspective
- Failure mode: how the system fails, described quantitatively
- Failure impact: how the function point is affected
- Severity: the degree of business impact (fatal/high/medium/low/none)
- Cause of failure: why the failure occurs
- Failure probability: the probability that a specific cause occurs
- Risk level: severity and failure probability considered together; severity × failure probability
- Existing measures: countermeasures in place when a fault occurs, including detection and alerting, fault tolerance, and self-recovery
- Avoidance measures: work done to reduce the probability of failure, both technical and managerial
- Solution: a complete fix for the issue
- Follow-up planning: subsequent improvement plans, technical or managerial; they may be avoidance measures or solutions. The higher the risk level, the higher the priority of addressing the hidden danger
FMEA analysis example
Function point | Failure mode | Failure impact | Severity | Cause of failure | Failure probability | Risk level | Existing measures | Workaround | Solution | Follow-up planning |
---|---|---|---|---|---|---|---|---|---|---|
Log in | User Center MySQL response time exceeds 5 seconds | User login is slow | High | Slow query in MySQL | High | High | Slow-query monitoring | Kill the slow-query process; restart MySQL | None | Optimize the slow-query statements |
Refresh news list | Redis unavailable | When Redis cannot be accessed, Redis-based profiles, content, etc. cannot respond, affecting 100% of users | High | Redis service down | Low | Medium | None | None | None | Relying on UCloud's Redis service is a risk; build self-hosted Redis to spread the risk |
Principles of Refactoring
A system's architecture evolves continuously with the business, so it inevitably accumulates technical debt. Left alone, that debt will one day explode and cause unexpected damage, so refactoring the architecture is often necessary. The principles to follow:
- Determine the purpose and necessity of refactoring: is it driven by business needs, and are there alternatives?
- Define the boundary of "refactoring done"
- Refactor incrementally
- Determine the current state of the architecture
- Don't ignore data
- Manage technical debt
- Stay away from vanity
- Be prepared to face pressure
- Learn the business
- Be prepared for non-technical factors
- Keep code quality under control
Architecture transformation implementation modes
- Demolisher mode: redesign the architecture to fit business needs, i.e., rewrite all the code at once. This mode is costly, supports continuous delivery poorly, and carries a high transformation risk.
- Strangler mode: leave the legacy system unchanged and build new functionality as new systems/services, gradually replacing the legacy system. Suitable for large legacy systems that are hard to change.
- Repairer mode: isolate the old parts/modules to be transformed (by adding an intermediate layer such as abstract classes or interfaces) and gradually transform them within the original legacy system through iteration, ensuring they keep working with the other parts.
Online data migration
Online data migration means migrating data that is serving online traffic from one place to another without affecting the service during the migration. The need arises during system evolution, driven either by business changes or by capacity expansion. Typical steps:
- Online double-write: add code in the business system that writes data to both the old and the new store. Afterwards, verify consistency along both the storage dimension (compare the records in the old and new stores) and the business dimension (compare the data as users see it).
- Historical data migration: migrate historical data from the old store to the new one, either offline or online. Offline means writing batch programs, or relying on the store's synchronization mechanism, to query historical data (written before double-write was enabled) from the old store and insert it into the new one. Online means synchronizing live data through the store's own mechanism, such as MySQL's binlog or MongoDB's oplog. It is recommended to verify consistency after migrating part of the data, and migrate the rest only after that passes.
- Cut reads: switch requests to the new system gradually via gray release, enlarging the share of reads served by the new system through switches embedded in the code. Typical progression: pre-release/Tcpcopy environment (verify the code runs normally) -> office/online environment with a uid whitelist (internal users, verify functionality) -> online at 0.1%, 1%, 10% (further verify functionality, performance, and resource pressure) -> full online traffic. This process is recommended to last one to two weeks.
- Cleanup: after migration verification passes, clean up the double-write and switch code in the business system, the old stored data, supporting systems, and old resources.
In some cases historical data can be migrated first and new data written afterwards. New data generated during the migration must be handled carefully, usually by buffering writes in a queue, a practice known as "chasing data".
In addition, for an online migration of a single function's data, you can refer to the mechanism Redis Cluster uses for data redistribution:
- An offline program migrates the data and maintains a per-record migration status: not migrated, migrating, migrated.
- The business code gates all access. On a read or write: if the record's status is "migrating", block and wait until migration completes before proceeding; if the status is "migrated" or the record is new data, execute the subsequent logic directly; if the status is "not migrated", actively trigger migration or wait for the offline migration to finish.
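The status-gated access logic above can be sketched with in-memory dicts standing in for the old and new stores. This single-threaded illustration cannot actually block, so the "wait for migration" branch is only noted in a comment; all names are invented for the example:

```python
NOT_MIGRATED, MIGRATING, MIGRATED = "not_migrated", "migrating", "migrated"

migration_status = {}    # key -> status; absent keys are new data (new store only)
old_store, new_store = {}, {}

def migrate_key(key):
    """Move one record from the old store to the new one."""
    migration_status[key] = MIGRATING
    if key in old_store:
        new_store[key] = old_store[key]
    migration_status[key] = MIGRATED

def read(key):
    status = migration_status.get(key, MIGRATED)  # new data: go straight to new store
    if status == NOT_MIGRATED:
        migrate_key(key)   # actively trigger migration on access
    elif status == MIGRATING:
        pass               # real system: block/retry until status becomes MIGRATED
    return new_store.get(key)
```

Writes would follow the same gate, updating only the new store once a record's status is "migrated".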
Other
- The system design process has four steps: identify complexity -> design alternatives -> evaluate and select an alternative -> detailed design.
- When discussing technical solutions, judge by whether the solution is sound, not by which requires the least work.
- When vertically splitting a legacy system is too costly, consider cloning the project and deploying it as a separate cluster, separated by service address, with different interfaces using different service addresses.