Talking about caching in system design

It’s been a while since my last update. I originally planned to write “Is the front end really dead?”, but some discussions happened to come up at work, so I’m writing about caching first.
This article is for engineers of all kinds who don’t often design caches.

background story

Today’s scenario: there is a set of country information data, with a structure roughly like [{ region: 'CN', code: 12345, text: '中国' }], an array of countries (the actual fields differ). Before this, the data lived behind an external-network interface provided to the front end. You are a BFF serving the front end, and you want to do secondary processing based on this data.

  // Pseudocode: call an external (public network) API
  if regionList, err := fetchRegionList(); err == nil {
      handle(regionList) // post-process the region list
  }

So how would you optimize it? (Skipping the communication work of getting the underlying service to provide an intranet interface.)

Let’s start with caching

Analyzing this example: the first thing we notice is that every call to the external network returns what looks like unchanged data. Can we put a “cache” in front, read the “cache” directly when it has content, and fetch the original data only when the “cache” is empty?

Great, you’ve figured out how to optimize a time-consuming call. So how do we design the “cache”?

When it comes to caching, the first reaction of most newcomers is probably: “caching? Isn’t that just redis?”

But in fact, when we store data, we can divide it into at least three levels: “local cache”, “redis”, and “DB”. Setting DB aside, since everyone knows it is there to persist data, under what circumstances should we use the remaining two cache levels?

Back to this scenario: if we use redis, we will obviously need:

  1. One network IO per lookup
  2. One redis resource

Point 1 is easy to understand: however many requests come from upstream, that many requests go to redis. And point 2 is no joke either; after all, most companies are still on the road of “reducing costs and improving efficiency”.

For this limited data set, a local cache solves the problem with a small amount of memory. It adds no extra network IO and puts no extra pressure on downstream services, which fully fits the core idea of cost reduction and efficiency improvement.
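Below is a first-cut sketch of such a local cache in Go. It is a minimal sketch under assumptions: fetchRegionList is the external call from the pseudocode above, and the Region fields are illustrative stand-ins for the real ones. It lazily loads the list once and serves it from memory afterwards:

  package main

  import "sync"

  // Region is an illustrative stand-in for the real upstream fields.
  type Region struct {
      Region string
      Code   int
      Text   string
  }

  // fetchRegionList stands in for the external API call above.
  func fetchRegionList() ([]Region, error) {
      return []Region{{Region: "CN", Code: 12345, Text: "中国"}}, nil
  }

  // regionCache keeps a single in-process copy of the data.
  type regionCache struct {
      mu   sync.RWMutex
      data []Region // nil until the first successful fetch
  }

  func (c *regionCache) Get() ([]Region, error) {
      c.mu.RLock()
      if c.data != nil {
          data := c.data
          c.mu.RUnlock()
          return data, nil
      }
      c.mu.RUnlock()

      c.mu.Lock()
      defer c.mu.Unlock()
      if c.data != nil { // re-check: another goroutine may have filled it
          return c.data, nil
      }
      list, err := fetchRegionList()
      if err != nil {
          return nil, err
      }
      c.data = list
      return c.data, nil
  }

Note that this sketch keeps the data forever once fetched, which is exactly the weakness discussed next.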

Is the cache just for reading?

With that done, is the problem really solved? In fact, with caches we think more about “writes” and “updates”: solving data consistency is the hardest part of cache design. In 99.9% of cases the data will be updated sooner or later; the only difference is “how long the shelf life is”.

In the design above, we obviously have not considered expiration, and never expiring is the worst design in caching. A cache is usually stored in memory, and memory is obviously finite. With a bad cache design, entries never expire, and soon you will “happily” find the cache full.

A reasonable cache should be: a reasonable data structure + a reasonable eviction strategy.
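As one concrete example of an eviction strategy, here is a minimal least-recently-used (LRU) cache sketch in Go, built on the standard library’s container/list; the capacity, keys, and values are all illustrative:

  package main

  import "container/list"

  // lru evicts the least recently used entry once capacity is exceeded.
  type lru struct {
      cap   int
      ll    *list.List // front = most recently used
      items map[string]*list.Element
  }

  type entry struct {
      key string
      val interface{}
  }

  func newLRU(capacity int) *lru {
      return &lru{cap: capacity, ll: list.New(), items: map[string]*list.Element{}}
  }

  func (c *lru) Get(key string) (interface{}, bool) {
      if el, ok := c.items[key]; ok {
          c.ll.MoveToFront(el) // touching an entry keeps it alive
          return el.Value.(*entry).val, true
      }
      return nil, false
  }

  func (c *lru) Put(key string, val interface{}) {
      if el, ok := c.items[key]; ok {
          el.Value.(*entry).val = val
          c.ll.MoveToFront(el)
          return
      }
      c.items[key] = c.ll.PushFront(&entry{key, val})
      if c.ll.Len() > c.cap {
          oldest := c.ll.Back()
          c.ll.Remove(oldest)
          delete(c.items, oldest.Value.(*entry).key)
      }
  }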

How the cache is designed

As in all programming, the first step is to analyze the scenario. For example, we analyzed the “scenario” and “update frequency” above and reached a conclusion: “we don’t need redis here.”

For caching, we should consider the following points, and prepare a defense mechanism for cache invalidation:

  • The root cause of the cache: is traffic volume the problem, or is the downstream slow?
  • The cache hit rate: do we even need a cache?
  • Timeliness: how do we update the cache?
  • The cache’s QPS: the hot-key problem

root cause of caching

A cache is not a synonym for fast: it just turns the disk IO of the DB into memory IO, while adding another copy of the data. We also mentioned “cost reduction and efficiency improvement” above; the cache’s own storage is an extra overhead, and it increases the complexity of the link. (If “link complexity” is unfamiliar, leave a comment; if enough people ask, I will cover it separately.) If we want to be clear about the fundamental purpose of a cache, there are usually two points:

  1. My QPS is too high, and my downstream cannot handle that much traffic; some means is needed to keep traffic from being passed straight through. This is the most common scenario.
  2. My downstream responds too slowly, or is too unstable, which hurts my own service quality, and the data is generic (do not cache user-specific data this way), so a cache can speed up my service.

cache hit rate

To quote a passage from my previous “Talking about Front-End SSR System Design Ideas”:

Can we cache SSR output, whether page-level or component-level? As a general solution this is fairly straightforward, and it can indeed effectively reduce CPU overhead. But in any case, a “cache” design must be decided against a specific business model: does my business have real-time requirements; what are my cache granularity and timeout; and even after adding the cache, what is my hit rate? Blindly saying “just use a cache” is meaningless. What’s more, caching affects your entire development model and may introduce extra development costs; it is not a silver bullet either.

Suppose I really do satisfy the “root cause”, but the cache utilization turns out to be poor: a cache that is not effectively used might as well not exist.

There are two types of design ideas here:

  1. Evaluate up front: how generic is the business data, what are its timeliness requirements (described in detail later), and design the key accordingly. This determines the hit rate.
  2. Try it first: in some optimization attempts we may decide to just go for it. After installing the cache, we still need to observe the hit rate; if it didn’t work, re-evaluate (one way to observe it is sketched below).
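A minimal sketch of that observation, with illustrative names: keep two atomic counters around the cache lookup and export their ratio to whatever metrics system you use:

  import "sync/atomic"

  // Bump hits on a cache hit and misses on a miss, e.g. hits.Add(1).
  var hits, misses atomic.Int64

  // hitRate reports the fraction of lookups served from the cache.
  func hitRate() float64 {
      h, m := hits.Load(), misses.Load()
      if h+m == 0 {
          return 0
      }
      return float64(h) / float64(h+m)
  }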

cache timeliness

We said above that “a cache that never expires is the worst design”, so how do we design the timeout, and how should the code be written?

In cache design, we must first design a reasonable cache life cycle. The simplest idea is “go back to the source when it expires”. As long as the life cycle of the business data is measured, minute-level expiry is generally acceptable for data without strong real-time requirements. But there is a problem: if every caller re-fetches after expiry, you may hit “cache breakdown”; and if many keys happen to share the same expiry moment, you may even hit a “cache avalanche”. I won’t introduce “cache breakdown” and “cache avalanche” here; there is plenty to Google. You can see that timeliness is a core part of cache design.
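One small, common defense against the “same expiry moment” problem is to add random jitter to the TTL. A minimal sketch (the 20% jitter ratio is an arbitrary illustrative choice, as is the cache.Set call in the usage comment):

  import (
      "math/rand"
      "time"
  )

  // ttlWithJitter spreads expiry times so entries written together do
  // not all expire together, softening a potential cache avalanche.
  func ttlWithJitter(base time.Duration) time.Duration {
      return base + time.Duration(rand.Int63n(int64(base/5)))
  }

  // usage (hypothetical cache API):
  //   cache.Set(key, value, ttlWithJitter(10*time.Minute))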

The main ideas here are:

  1. Long cache + job update. To keep in sync with DB updates, a common approach is to consume the binlog and refresh the cache from it. In this design the cache TTL is usually long, or the cache never expires.
  2. Distributed locks: use a lock to merge the read-the-database requests, so only one thread does the read while the rest wait for the lock to be released and then take the cached data (a single-process version of this idea is sketched below).
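Within a single process, the “merge the reads” idea is what Go’s golang.org/x/sync/singleflight package provides; across multiple instances you would need a real distributed lock (for example redis SET ... NX). A minimal sketch, reusing Region and fetchRegionList from the earlier example:

  import "golang.org/x/sync/singleflight"

  var group singleflight.Group

  // getRegionList merges concurrent cache misses: only one caller
  // performs the real read; the others block and share its result.
  func getRegionList() ([]Region, error) {
      v, err, _ := group.Do("region_list", func() (interface{}, error) {
          return fetchRegionList()
      })
      if err != nil {
          return nil, err
      }
      return v.([]Region), nil
  }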

How to store the cache

Assuming we have finished selecting the data structure and considered the cache timeout, is the cache design finished? For redis: it is also a service, and although it holds up better than a DB, it too can be overwhelmed. Here we need to analyze hot keys on the basis of a QPS estimate. How to monitor hot keys is usually not a question for the business side to solve; implementations are easy to Google. What matters here is analyzing the hot keys and choosing a reasonable strategy. The example above is really an example of “choosing the cache level”: a local cache suits scenarios with a small amount of data but extremely hot access. Local caches are usually per-process, so a multi-core single machine holds multiple copies, and a single process failing to update cannot be ruled out.

Of course, for a business with many hot keys this may still not be enough. We can consider separate read and write clusters: not only separating cold keys from hot keys, but also splitting the cluster for hot keys.

Guarding against cache pitfalls

Besides the points above, with redis we also need to guard against big keys: reading and writing big keys puts pressure on the cluster and becomes a risk point for the whole link.

The imagination of caching

Most of the cases above use redis as the example, but we do not only have redis and the server’s local cache. If this really is a country list to be consumed by the front end, we can also let the front end fetch a JSON file and rely on the client-side cache or a CDN cache.

From this point of view, we have a new caching option, cache control: for interface data that never changes, I can, for example, serve it as versioned configuration data; for such “eternal” data, the pressure never reaches the server at all.
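A minimal sketch of serving the country list as a versioned, immutable JSON resource in Go (the path, port, and max-age are illustrative choices):

  package main

  import (
      "encoding/json"
      "net/http"
  )

  type Region struct {
      Region string `json:"region"`
      Code   int    `json:"code"`
      Text   string `json:"text"`
  }

  var regionList = []Region{{Region: "CN", Code: 12345, Text: "中国"}}

  func main() {
      // Version the URL, not the cache: when the data changes, publish
      // /regions.v2.json instead of ever invalidating v1.
      http.HandleFunc("/regions.v1.json", func(w http.ResponseWriter, r *http.Request) {
          w.Header().Set("Cache-Control", "public, max-age=31536000, immutable")
          w.Header().Set("Content-Type", "application/json")
          json.NewEncoder(w).Encode(regionList)
      })
      http.ListenAndServe(":8080", nil)
  }

With these headers, browsers and CDNs keep the file for up to a year, and repeat requests never reach the origin.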

Summary

That’s it for this introduction to caching. Later I may use the same example to discuss how to weigh “storing in code”, “configuration center”, and “reading from redis” (but the article shouldn’t get too long).

What I originally wanted to write here was “back-end caching”, but in system design, whether you are a front-end or a back-end engineer, caching is largely a process of cost transfer: weighing the pros and cons across the whole link, upstream and downstream, to decide which layer takes the pressure, how to do it, and whether it is transparent. So the result is not limited to the “back end”. I hope to discuss some usage of and thinking about caching with you.
