NSCSCC2022 Loongson Cup Competition Summary

Original link: https://blog.eastonman.com/blog/2022/08/nscscc2022/

In the blink of an eye it is late August, with only one week of summer vacation left, and ~~of course it was the same last year~~. The last time I wrote anything here was at the beginning of the year. Since then, practically all of my extracurricular time in the first half of the year went into the Loongson Cup and related things. The Loongson Cup is, as the name suggests, a competition co-sponsored by Loongson Zhongke and the Computer Education Research Committee of Colleges and Universities. The main task is to design a complete CPU in RTL, then verify and demonstrate it on an FPGA. 2022 is the sixth year of the Loongson Cup, and this year's competition differs from previous ones: a LoongArch Challenge track has been added.

The new track

As mentioned above, the biggest change in the 2022 Loongson Cup is that, in addition to the individual and team competitions on the MIPS instruction set held in previous years, a LoongArch Challenge track has been added. The rules, schedule and evaluation method of this new track differ considerably from those of the MIPS tracks. The differences are:

Instruction set difference: This is of course the main difference. The LoongArch Challenge uses the LA32 Reduced instruction set, while the individual and team competitions continue to use the MIPS instruction set.

Schedule difference: The team and individual competitions are split into a preliminary round and a final. In 2022 the preliminary round ended on August 5, the final submission was due on August 19, the online Q&A was on August 19, and the final defense was on August 20. The LoongArch track has no separate preliminary and final rounds (~~there were not many teams in the first place~~): the deadline for submitting the performance test was August 14, there was no defense, and the slides and demo video were due on August 18.
Difference in final threshold: The threshold for the final of the MIPS team competition is the performance score. The number of teams in the final is fairly constant from year to year, so a team whose performance ranks high enough gets in. The LoongArch Challenge has no real final, but there is still a threshold: booting Linux.
Differences in evaluation methods: The final score of the MIPS team competition is 50% performance and 50% system presentation and defense, while for LoongArch it is 70% performance and 30% system presentation and defense.
Difference in officially provided development tools: The MIPS team competition only provides a set of tests such as the functional test and the system test, while LoongArch provides a differential testing framework and random instruction streams for verification. By comparison, LoongArch's simulation and verification methodology is more advanced.
Differences in some detailed requirements: For example, the performance test programs of the LoongArch Challenge are different, and the performance test allows neither modifying the SoC nor using a self-compiled Linux. LoongArch also allows graduate students to participate.
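As a small worked example of how the two weightings shift the outcome, the following Python sketch applies the 50/50 and 70/30 splits to the same pair of sub-scores. The sub-scores and the 0-100 scale are made-up assumptions for illustration; the actual scoring rules may normalize things differently.

```python
# Weights in percent: (performance, presentation and defense).
WEIGHTS = {"mips_team": (50, 50), "loongarch": (70, 30)}

def final_score(track, performance, presentation):
    """Weighted sum of two sub-scores, both assumed to be on a 0-100 scale."""
    w_perf, w_pres = WEIGHTS[track]
    return (w_perf * performance + w_pres * presentation) / 100

# Hypothetical team: mediocre performance, strong presentation.
mips_total = final_score("mips_team", 60, 90)
la_total = final_score("loongarch", 60, 90)
```

With these hypothetical numbers the same team scores lower under the LoongArch weighting, which is why performance tuning matters more on that track.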

Of course, what has to be done is basically the same in every track: design a CPU. Most undergraduate teams start from the traditional five-stage pipeline and then add pipeline stages, add in-order dual issue, raise the clock frequency, and so on. Teams with strong members, or with strong predecessors to build on, may make some microarchitectural innovations on top of an existing framework. That is basically the picture.

The LoongArch instruction set also happened to be making its way into the open source community in 2022: many important open source projects such as Linux, QEMU and GCC gained LoongArch support that year. So let's talk about how Loongson's LoongArch instruction set differs from MIPS.

The first and most obvious (and ~~welcome~~) change is the removal of the delay slot. The delay slot was a design over-fitted to the five-stage pipeline microarchitecture that was common when MIPS was created. A feature bound this tightly to one microarchitecture obviously has not kept up with the rapid progress of hardware design. In 2022, the delay slot is an ISA feature that brings no benefit and seriously increases design complexity. Naturally, LoongArch did not adopt this long-criticized feature.
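To show what the delay slot means architecturally, here is a toy Python interpreter that runs the same program with and without delay-slot semantics. The instruction format and the `run` helper are invented for this illustration and do not model any real MIPS or LoongArch encoding.

```python
def run(program, delay_slot):
    """Execute a toy program and return the labels of executed instructions.

    Instructions are ("op", label) or ("jump", target_index).
    With delay_slot=True, the instruction right after a jump still executes
    before control transfers, as on classic MIPS. The sketch assumes the
    slot instruction is always a plain "op".
    """
    trace = []
    pc = 0
    while pc < len(program):
        kind, arg = program[pc]
        if kind == "jump":
            trace.append("jump")
            if delay_slot and pc + 1 < len(program):
                trace.append(program[pc + 1][1])  # slot instruction runs anyway
            pc = arg
        else:
            trace.append(arg)
            pc += 1
    return trace

program = [("op", "a"), ("jump", 3), ("op", "b"), ("op", "c")]
without_slot = run(program, delay_slot=False)  # "b" is skipped
with_slot = run(program, delay_slot=True)      # "b" executes in the slot
```

Every pipeline that resolves jumps at a different stage than the original five-stage design still has to reproduce the second behavior exactly, which is where the extra complexity comes from.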

The second is the removal of the coprocessor concept. Privileged resources are accessed through control and status registers (CSRs), similar to RISC-V, and floating-point operations are no longer performed by a coprocessor. Participating in the Loongson Cup does not require implementing hardware floating point, and in most RISC implementations the floating-point unit, being large and logically complex, is a dedicated deep pipeline decoupled from the main pipeline anyway. Still, I think this is a good design: it removes an unnecessary microarchitectural binding in MIPS and leaves room for more complex designs.

Third, the immediate encoding of MIPS has been changed from a uniform 16 bits to separate formats with a long 20-bit and a short 12-bit immediate. The benefits of this are well demonstrated by RISC-V. Common instructions use the short immediate, which frees up encoding space, while instructions such as long jumps and immediate loads use a dedicated 20-bit immediate, which avoids the code-size increase caused by assembling constants out of short immediates.
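To make the long/short split concrete, here is a minimal Python sketch of how a 32-bit constant decomposes into a 20-bit upper part and a 12-bit lower part. It assumes the LA32R pair `lu12i.w` (places a 20-bit immediate in bits 31:12) followed by `ori` (ORs in a zero-extended 12-bit immediate). The Python function names are hypothetical helpers, not part of any toolchain.

```python
MASK32 = 0xFFFFFFFF

def split_imm32(value):
    """Split a 32-bit constant into (hi20, lo12)."""
    value &= MASK32
    return value >> 12, value & 0xFFF

def lu12i_w(si20):
    """Model of lu12i.w on a 32-bit register: immediate goes to bits 31:12."""
    return ((si20 & 0xFFFFF) << 12) & MASK32

def ori(reg, ui12):
    """Model of ori: OR the register with a zero-extended 12-bit immediate."""
    return (reg | (ui12 & 0xFFF)) & MASK32

def load_const(value):
    """Materialize a 32-bit constant with one long and one short immediate."""
    hi20, lo12 = split_imm32(value)
    return ori(lu12i_w(hi20), lo12)

hi, lo = split_imm32(0xDEADBEEF)
```

Because `ori` zero-extends its immediate, the two halves combine without the sign correction that a sign-extended add would need.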

Besides these advantages over MIPS, LoongArch is also quite different from RISC-V. First, LoongArch considered many extensions from the very beginning of its design, whereas RISC-V is minimalist and extremely modular, and each of its extensions develops slowly. The design of LA is more commercial and pragmatic, with many trade-offs and edge cases, and it has many more instruction formats than RISC-V.

Team situation

I originally thought that, given LoongArch's advanced infrastructure and modern instruction set, the traditionally strong schools would send strong teams to the LA Challenge. In fact, although the LA track did have strong teams, some of which produced complex out-of-order designs, the traditionally strong schools still mostly focused on the MIPS team competition. My guess (a blind guess, please correct me if I'm wrong) is that these schools have infrastructure that has been passed down for several years, their MIPS development process is already advanced and efficient, and they know quite well what ranking a given design can achieve. They are therefore unwilling to give up mature infrastructure and processes to run into the pitfalls of a new track.

By contrast, although our school has won a first prize and a second place in the Loongson Cup before, it seriously lacks MIPS infrastructure and experience passed down from previous students, in the following respects:

There is no differential testing framework. The strong schools that have participated for many years have differential testing frameworks ported by their predecessors, and some even have complete CI infrastructure for automated testing. Our school has none of this.
No previous student has successfully booted Linux. The threshold for the final of the MIPS track is not operating system support; Linux support only guarantees a high baseline in the final score. So, lacking differential testing, previous students of our school usually chose to get good performance first and then rely on fancier peripherals and interaction to earn the system demonstration score. Having no experience in bringing up a system is close to fatal when trying to boot Linux.
There is no mature SoC and peripheral adaptation code or process. Besides the CPU, supporting peripherals under an operating system requires a lot of work on the SoC, including address space allocation, interrupt number allocation, clock allocation and so on. As a result, our school's system demonstrations have basically been limited to bare-metal programs that use polling instead of interrupts, precisely to avoid SoC work.
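For readers unfamiliar with the differential testing mentioned above, the idea can be sketched in a few lines of Python: run a trusted reference model in lockstep with the design under test and stop at the first instruction whose architectural result differs. The toy `addi`-only ISA, the commit record format and all function names here are assumptions for illustration, not the interface of chiplab or of any real framework.

```python
def reference_step(regs, instr):
    """Golden model for a toy ISA with a single instruction:
    ("addi", rd, rs, imm) meaning regs[rd] = regs[rs] + imm (32-bit wrap)."""
    op, rd, rs, imm = instr
    if op != "addi":
        raise ValueError("toy model only knows addi")
    new_regs = dict(regs)
    new_regs[rd] = (regs.get(rs, 0) + imm) & 0xFFFFFFFF
    return new_regs

def diff_test(program, dut_commits):
    """Compare the DUT's commit records (rd, value) against the reference
    after every instruction. Return the index of the first mismatch,
    or None if the whole trace agrees."""
    regs = {}
    for index, (instr, commit) in enumerate(zip(program, dut_commits)):
        regs = reference_step(regs, instr)
        expected = (instr[1], regs[instr[1]])
        if commit != expected:
            return index
    return None

program = [("addi", 1, 0, 5), ("addi", 2, 1, 7)]
good_trace = [(1, 5), (2, 12)]
bad_trace = [(1, 5), (2, 13)]   # DUT computed the second result wrongly
```

The value of the approach is that a failure is reported at the first divergent instruction rather than millions of cycles later, when a test program finally prints a wrong result.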

This year, our team was formed ad hoc: we basically did not know each other before teaming up (at least I did not know the others), and none of us was well versed in embedded development or familiar with CPU design. All of us, including me, only got started with entry-level CPU design after the team was formed.

After the competition, I went through the demo videos of the team competition of the sixth Loongson Cup in 2022 (yes, I don't know why they showed up on Bilibili rather than through an official channel). My personal impression is that no team has yet done a better job than Harry Orange's team did at the third Loongson Cup in 2019. It is true that the out-of-order CPU designed by this year's Tsinghua University team outperforms nontrivials-mips, but judging by the IPC ratio, I don't think the out-of-order core is living up to its potential, and their system adaptation basically follows the 2019 work. ~~This makes me believe that talented people go to Tsinghua, rather than that Tsinghua produces them.~~

Our team's performance this year is only about 60% of Harry Orange's (a rough figure, since there is no direct comparison). Our pipelined DCache had a bug that occurred with low probability and that we never managed to track down, so we could only submit a version with lower frequency and lower IPC. Had the pipelined DCache been usable, we could probably have outperformed them by a small margin.

Competition process

The following is a review of our participation. I hope it is a useful reference for students from our school or elsewhere who take part in the Loongson Cup or are interested in CPU design.

Language selection

Our team chose SystemVerilog as the development language, mainly because our school's digital logic and computer organization courses both use Verilog. In fact, within the IEEE standards Verilog has been absorbed into SystemVerilog: SystemVerilog became IEEE 1800 in 2005, and the separate Verilog standard was merged into it in 2009. Compared with plain Verilog, the main advantages of SV are structs and multi-dimensional arrays. Where there are many signals in a relatively fixed arrangement, such as buses, SV also provides interfaces to simplify the code.

Other teams at our school and elsewhere in the Loongson Cup use other HDLs such as SpinalHDL and Chisel. These high-level hardware description languages usually work by having a compiler translate the high-level description into Verilog, which is then fed to simulation or synthesis tools to produce a netlist. They usually solve the verbose wiring and port declarations of the Verilog family, and some offer more syntactic sugar and features such as OOP for reducing code further.

As for which HDL to use in the Loongson Cup, none is simply better or worse. I summarize some key differences below for those who need a reference:

Advantages of Verilog-family languages:

A complete native toolchain exists, covering simulation, synthesis, implementation, editing and code completion. It is not necessarily pleasant to use, but it is always there.
Straightforward: the generated circuit matches what the code describes.
Quite a few open source projects/open source IP cores use Verilog.

Disadvantages of Verilog-family languages:

The code is verbose, which can reduce development efficiency and keep developers from focusing on functional design.
There are many ambiguous constructs. Where behavior is undefined, different tools may behave inconsistently, which hurts development badly.
Writing testbenches is cumbersome.

Advantages of Chisel-like high-level languages:

The code is simplified and the development efficiency is high.
The generated Verilog is guaranteed to be unambiguous.
Tests can be written in a high-level language, which is more convenient.

Disadvantages of Chisel-like high-level languages:

They are often newer, so toolchain support may be insufficient.
The compiler itself may have bugs, and the generated Verilog is less readable, which makes debugging difficult when something goes wrong.
A new language has to be learned.

Of course, some of these advantages and disadvantages may not apply to your situation. For example, if you are familiar with Java/Scala syntax, a language like Chisel may come easily. The choice of language only needs to fit you and your team; there is no absolute good or bad.

Preliminary preparation

Promotion of and preparation for the Loongson Cup started very early at our school, right after the digital logic design lab course ended in the fall of 2021. I guess this is relatively early compared with most schools. When our school began preparing for the 2022 Loongson Cup, 11 teams and more than 30 individuals signed up. Yes, you read that right, and I also find it unbelievable: in previous years it was just two teams plus a few individuals. Of course, only about 3 MIPS teams, 2 LA teams and 8 individuals remained at the final submission (I am not very sure about the individual competition).

In the early stage the focus should be on learning; don't be too ambitious. Some students read “Superscalar Processor Design” before starting, planned to build an out-of-order CPU right away, and were never heard from again. Until March 2022 our team was basically slacking off and copying code from “Write Your Own CPU”, with almost no other preparation. It was only after I submitted the proposal for the ASC preliminary round on March 3 that everyone started working (~~that I started pushing my teammates hard~~).

At the beginning of April we divided the work. I was responsible for branch prediction, and my teammates were responsible for the main pipeline, the DCache and the bus. I spent April researching and reproducing papers, and finished an RTL reproduction of TAGE at the end of April. During the May Day holiday we asked a teacher to organize an internal exchange within the team (for the ~~first time~~). At that point the DCache reportedly still had a bug and had not been integrated, AXI already had a usable version, the single-issue version of the main pipeline was ready for functional testing, and the conversion to dual issue had begun.
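For readers who have not met TAGE, the following Python sketch shows the core idea: a bimodal base table backed by tagged tables that are indexed with geometrically increasing lengths of global history, with the longest matching table providing the prediction. This is a heavily simplified software toy, not our RTL: the table sizes, the hash functions and the one-bit "useful" flag are all assumptions chosen for brevity, and real TAGE adds folded-history hashing, alternate predictions and proper useful-counter aging.

```python
class TinyTage:
    """Toy TAGE-like predictor (illustrative only)."""

    def __init__(self, hist_lengths=(2, 4, 8), index_bits=6, tag_bits=8):
        self.hist_lengths = hist_lengths
        self.size = 1 << index_bits
        self.tag_mask = (1 << tag_bits) - 1
        self.base = [2] * self.size               # 2-bit counters, weakly taken
        self.tables = [{} for _ in hist_lengths]  # index -> [tag, ctr, useful]
        self.history = 0                          # newest outcome in bit 0

    def _slot(self, pc, i):
        h = self.history & ((1 << self.hist_lengths[i]) - 1)
        index = (pc ^ h ^ (h >> 3)) % self.size   # illustrative hash only
        tag = (pc ^ (h * 3) ^ i) & self.tag_mask
        return index, tag

    def _provider(self, pc):
        """Longest-history table with a tag hit, or (None, None)."""
        for i in reversed(range(len(self.tables))):
            index, tag = self._slot(pc, i)
            entry = self.tables[i].get(index)
            if entry is not None and entry[0] == tag:
                return i, entry
        return None, None

    def predict(self, pc):
        _, entry = self._provider(pc)
        if entry is not None:
            return entry[1] >= 0                  # signed 3-bit counter
        return self.base[pc % self.size] >= 2

    def update(self, pc, taken):
        prediction = self.predict(pc)
        i, entry = self._provider(pc)
        if entry is not None:
            entry[1] = min(3, entry[1] + 1) if taken else max(-4, entry[1] - 1)
            if prediction == taken:
                entry[2] = 1                      # entry proved useful
        else:
            b = self.base[pc % self.size]
            self.base[pc % self.size] = min(3, b + 1) if taken else max(0, b - 1)
        if prediction != taken:
            self._allocate(pc, taken, 0 if i is None else i + 1)
        mask = (1 << max(self.hist_lengths)) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

    def _allocate(self, pc, taken, start):
        """On a misprediction, claim one entry in a longer-history table."""
        for j in range(start, len(self.tables)):
            index, tag = self._slot(pc, j)
            old = self.tables[j].get(index)
            if old is None or old[2] == 0:
                self.tables[j][index] = [tag, 0 if taken else -1, 0]
                return
            old[2] = 0                            # age the entry that blocked us

def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses

# A branch that repeats taken, taken, not-taken: a plain 2-bit counter
# mispredicts every not-taken outcome, while history-tagged entries learn it.
pattern = [True, True, False] * 100
misses = count_mispredictions(TinyTage(), 0x40, pattern)
```

The point of the tagged tables is visible even in this toy: after the first misprediction, an entry keyed by the recent history captures the "not taken after two taken" case, and the base table keeps handling the rest.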

In May I mainly worked with the teammate responsible for the main pipeline on adapting to chiplab and fixing bugs. ~~The other teammates may have been slacking off.~~ At the end of May, the code of the dual-issue CPU was basically complete. ~~This was probably the end of the preliminary work.~~

Next came debug, debug and debug.

Workload and difficulty

As for the workload of participating in the LA track, I think the largest part is actually system adaptation.

Supporting Linux does not strictly have to start from scratch: the official framework chiplab provides a fairly complete single-issue five-stage pipelined example CPU that boots Linux stably, so in theory, developing incrementally on top of it ensures that the CPU stays compatible with Linux. But that CPU is written in an assign-heavy, circuit-level style that is very hard to read, so we chose to write ours from scratch. For a team with no experience in CPU design or system adaptation, this is a heavy workload even with a complete implementation for reference. System software relies heavily on privileged resources, and there are no good simulation workloads for verifying them, so to debug the privileged part our team could basically only boot Linux in simulation, which of course takes a very long time. In addition, booting Linux executes hundreds of millions of instructions (yes, really hundreds of millions), which demands a very high degree of CPU correctness. Even with 60 million random verification sequences, and with the various random factors that chiplab's simulation SoC introduces, some problems still could not be reproduced on any workload other than Linux.

The second is DCache consistency. LoongArch is an ISA in which cache consistency is maintained by software, which greatly reduces hardware design difficulty compared with the hardware-maintained consistency of x86. In a real operating system, however, different kinds of memory access instructions are mixed together and interleaved with cache maintenance instructions, which makes DCache data consistency both critical and very hard to debug. Our team ultimately got stuck on consistency maintenance in the pipelined DCache: Linux booted unstably, failing with low probability, so we had to submit an older version that was not optimal.

Next comes the CPU microarchitecture, or other features. Even for an out-of-order four-issue CPU, I believe that everything except the memory access part can basically be taken from design through coding to debugging within two months. Debugging the memory access part alone may take just as long. Our team made booting the system the highest priority, so our microarchitecture is admittedly crude and dated: just a 10-stage static in-order dual-issue pipeline.

Team timeline

The following are the key dates of our team's participation in the 2022 Loongson Cup:

December 2021: Team formed
January-March 2022: Learning phase; everyone worked through most of “Write Your Own CPU”
April 2022: Division of labor settled; I took branch prediction and the front end, and completed the reproduction of the TAGE paper
May 2022: Main pipeline completed and connected to chiplab for functional testing
June 11, 2022: First Linux boot in simulation, without caches
Late June 2022: Bug fixes based on random verification
Early July 2022: Hit a Verilog ambiguity problem that cost the whole team a week of debugging
July 12, 2022: First Linux boot on the board, with ICache
End of July 2022: Front-end branch prediction and write-through DCache completed
August 6, 2022: Front-end freeze; first submittable version complete
August 7-14, 2022: Debugging the pipelined write-back DCache, unsuccessful
August 14, 2022: Submitted the older version without the bug
August 18, 2022: Final submission of the system presentation, documents, slides, etc.

Impressions

In fact, another team from our school also took part in the LA track. They used Chisel, started preparing much earlier than we did, and already had a five-stage pipelined core after the winter vacation. Their microarchitecture is also much more advanced than ours, with scoreboard-based out-of-order execution (without renaming), but in the end they did not get Linux working and missed the final. Speaking of which, our final design is a static 10-stage in-order dual-issue CPU. Our team members, seniors and teachers all feel that a static 10-stage pipeline is wasteful. Memory access needs a lot of logic, basically 4-5 pipeline stages, but in a static pipeline non-memory instructions have nothing to do while passing through those memory access stages. Our load-to-use problem is also not handled well: we took a conservative approach where a more aggressive and efficient one was possible. Still, I think that for a first-time entry in the LA track the primary goal is to boot Linux, and that was our team's principle from the start. The price is that we did not have enough time for microarchitecture design and could only hastily follow the traditional static long pipeline.
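To give a feel for why load-to-use latency hurts in a long in-order pipeline, here is a small Python model that counts stall cycles for a single-issue in-order machine. It is a sketch under simplifying assumptions, not a model of our CPU: ALU results are assumed to be forwarded with no penalty, a load's result becomes usable a fixed number of extra cycles later, and the penalty value and instruction tuples are hypothetical.

```python
def stall_cycles(program, load_use_penalty):
    """Count stall cycles for an in-order, single-issue toy pipeline.

    program: list of (kind, dest, sources) with kind "load" or "alu".
    A consumer of a load result must issue at least `load_use_penalty`
    cycles later than it could after an ALU producer.
    """
    ready = {}      # register -> earliest cycle at which a consumer may issue
    cycle = 0
    stalls = 0
    for kind, dest, sources in program:
        earliest = max([ready.get(src, 0) for src in sources], default=0)
        if earliest > cycle:
            stalls += earliest - cycle   # pipeline waits for the operand
            cycle = earliest
        extra = load_use_penalty if kind == "load" else 0
        ready[dest] = cycle + 1 + extra
        cycle += 1
    return stalls

back_to_back = [
    ("load", "r1", ["r2"]),
    ("alu", "r3", ["r1"]),    # uses the load result immediately
    ("alu", "r4", ["r3"]),
]
scheduled = [
    ("load", "r1", ["r2"]),
    ("alu", "r5", ["r6"]),    # independent instruction fills one bubble
    ("alu", "r3", ["r1"]),
]
```

With a penalty of 2, the back-to-back sequence stalls for two cycles and the scheduled one for a single cycle; the deeper the memory access pipeline, the larger this penalty and the more it costs when dependent instructions follow loads closely.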

This year’s Loongson Cup has finally come to an end. I have heard that some out-of-order teams whose Linux boot was unstable this year plan to continue next year.

Results

In the end we won the first prize. I am very satisfied with this result (~~there does not seem to be a more satisfying result to have~~). It was actually a bit unexpected. Before submitting, I had seen how other teams in the LA group were doing: many were still debugging Linux and probably had no time left for performance tuning. Although we did not dare to submit the final pipelined DCache version, our front end had been carefully tuned, and the performance was quite satisfactory.

Follow-up

Once the project code has been tidied up, it should be open sourced.

Next, I will also write a few articles describing our front-end design and the detailed reproduction of the TAGE predictor. You can follow this blog or subscribe to RSS.

The post NSCSCC2022 Loongson Cup Competition Summary first appeared on Easton Man’s Blog.
