2022-31: Engineering Efficiency Practices at Databend

Original link: https://xuanwo.io/reports/2022-31/

The importance of engineering efficiency to any engineering team is self-evident: excellent engineering efficiency practices help the team realize its full potential and iterate on products more efficiently, while poor ones often leave the team struggling to move forward, miserable every day, until all enthusiasm finally burns out in silence.

Out of personal interest, I have taken part in engineering efficiency work at every company I have worked for: at QingCloud, I drove the migration of Golang version management to Go Modules for the entire storage product line, and at Databend, I redesigned the CI/CD pipeline and completed the migration of the PR process from the self-deployed fusebots to a fully SaaS-based one. Today's article mainly summarizes my engineering efficiency practice at Databend. I first introduce Databend's specific practices and then share some of my personal experience. I hope it will be helpful to other open source projects.

Databend does not have a full-time engineering efficiency team. The practices described here exist thanks to the efforts of @everpcpc, @ZeaLoVe, @PsiACE, and others, made on top of their regular work.

Engineering Efficiency Practices at Databend

Databend is an open source cloud data warehouse developed in Rust, with over 30 active contributors in the past month and an average of 10 PRs merged per day. When a community is full of talented contributors, it is important to make sure their ideas actually land, rather than letting their enthusiasm burn out in endless test retries. Beyond that, engineering efficiency matters especially to an open source community because the CI time and merge speed of every PR are visible to everyone. If potential contributors are discouraged by hours of CI time or, worse, existing contributors are forced to give up by complicated processes, it is a major blow to building the community.

The Databend community attempts to address these issues at the following levels:

Unobtrusive process
Reliable CI

Unobtrusive process

The best process is no process, or more precisely, an unobtrusive process: one in which contributors can contribute intuitively and smoothly.

When introducing a process, Databend tries to avoid disturbing contributors: as long as everything goes normally, they should not have to think about the process at all, and when something goes wrong, they should get informative, friendly error messages that let them solve the problem on their own.

Take Semantic PR as an example: if the title of a PR does not conform to the specification, the contributor gets feedback through a failed check and a comment explaining what needs to be changed.

The advantage is that contributors can submit PRs intuitively without having to read and memorize the developer documentation beforehand. If the PR meets the requirements, they simply wait for review; if it does not, they can fix it themselves based on the feedback, and they will know what to do for their next PR. It does not matter if they forget, because our bot will kindly give the same information again. After a few rounds of this, a skilled contributor is born. The process requires no manual intervention, which saves the maintainers' energy, spares them from repeatedly asking contributors to fix trivial formatting problems, and indirectly improves the quality of their reviews.
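
As a rough illustration of this kind of self-service feedback, the sketch below checks a PR title against a Conventional-Commits-style pattern and returns a message telling the contributor how to fix it. The list of allowed types, the regular expression, and the function name `check_pr_title` are assumptions made for this example, not Databend's actual rules.

```python
import re
from typing import Optional

# Hypothetical list of accepted types; a real project may use a different set.
ALLOWED_TYPES = ("feat", "fix", "docs", "refactor", "test", "ci", "chore")

# Shape: type, optional "(scope)", colon, space, non-empty summary.
TITLE_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^()]+)\))?: (?P<summary>\S.*)$"
)


def check_pr_title(title: str) -> Optional[str]:
    """Return None if the title is acceptable, otherwise a friendly hint."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        return (
            "PR title should look like 'type(scope): summary', "
            "for example 'fix(query): handle empty result set'."
        )
    if match.group("type") not in ALLOWED_TYPES:
        return (
            f"Unknown type '{match.group('type')}'. "
            f"Please use one of: {', '.join(ALLOWED_TYPES)}."
        )
    return None
```

The point of the design is that the error message itself carries the fix, so the contributor never needs to look up a separate document.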

In addition to the design of the process itself, ongoing maintenance of the process is also important.

In the past, Databend used its self-developed fusebots to maintain custom processes. The problems encountered were:

Insufficient follow-up maintenance: fusebots was initially developed and maintained by BohuTANG, but as his development focus shifted, he no longer worked on fusebots, and there was no clear plan for its future development.
Low adoption in the community: fusebots was designed to solve Databend's specific process problems, so it has little value for reuse or migration. It is hard for other communities to adopt and promote it, which further threatens its future survival.
Self-deployment and operations: fusebots is a server that listens for events and handles them accordingly. It is not offered as an online SaaS service, so users have to deploy and maintain it themselves. Users cannot see the current status of fusebots, and when something goes wrong, it has to be debugged and checked by operations staff.
Not unobtrusive: fusebots requires users to trigger it with commands, which does not meet our expectations of an unobtrusive process.

Later, Databend turned to Mergify for PR automation management. The rules that have been applied so far include:

The title of a contributor's PR must follow the semantic conventions; otherwise an error is reported in PR Checks, and a comment explains the problem and asks for a revision.
The description of a contributor's PR must follow the specification; otherwise an error is reported in PR Checks, and a comment explains the problem and asks for a revision.
Different Checks run depending on the category of the PR, and the final merge requirements differ accordingly:
- Code contributions require two Approvals and must pass all test cases
- Documentation contributions require only one Approval and a successful Vercel build
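
The category-dependent merge requirements above can be read as a small decision function. The sketch below is only a model of that logic for illustration: the `PullRequest` type, its field names, and the `"code"`/`"docs"` labels are hypothetical, and the real rules live in Mergify's configuration rather than in Python.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    category: str        # "code" or "docs" (hypothetical labels)
    approvals: int       # number of approving reviews
    tests_passed: bool   # all test cases are green
    docs_build_ok: bool  # the Vercel build succeeded


def can_merge(pr: PullRequest) -> bool:
    """Apply the merge requirements for the PR's category."""
    if pr.category == "code":
        # Code: two approvals and every test case passing.
        return pr.approvals >= 2 and pr.tests_passed
    if pr.category == "docs":
        # Docs: one approval and a successful Vercel build.
        return pr.approvals >= 1 and pr.docs_build_ok
    # Unknown categories are never merged automatically.
    return False
```

For instance, a documentation PR with one approval and a green Vercel build is mergeable even though the full test suite never ran for it.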

Reliable CI

CI is standard equipment for modern open source projects. In 2022-30: How to maintain an open source project, I emphasized the importance of CI to open source projects:

Open source maintainers must maintain a good integration testing infrastructure: in an open source collaborative environment, it is impossible to have a testing team run tests for every PR, so the community needs a continuous integration testing service to ensure that all PRs pass the necessary tests before being merged.

Reliable CI means several things: the CI infrastructure itself must be reliable and must not fail frequently because of its own problems; and the CI pipeline syntax and its execution must be reliable, with sufficient documentation for reference and behavior that is testable and predictable.

Databend's CI has now fully converged on GitHub Actions: PR-related tests run on Self-hosted Runners sponsored and maintained by Databend Cloud, while tests on the main branch run on GitHub-hosted Runners. This reflects a trade-off between cost and efficiency: the community cares most about how quickly CI completes for a PR, whereas tests on the main branch only need to run correctly.

Databend Cloud deploys a Self-hosted Runner Operator that supports elastic scaling in its test environment: whenever a Databend PR needs a build, it first looks for a currently active Runner and, if there is none, starts a new one immediately. After many rounds of experimentation and tuning by @everpcpc, Databend's current runner configuration is as follows:

Each time, a 16C32G (16 cores, 32 GB of memory) Spot VM is started, with two Runner containers running on it
Each container requests 7.5C12G of resources, with a maximum limit of 12C15G
In addition, each VM is configured with an 80G EBS volume
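
The arithmetic behind these numbers can be checked with a few lines of Python. This is a minimal sketch that reads "C" as CPU cores and "G" as GB of memory, which is an assumption about the notation; the `Resources` type and helper names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu: float        # CPU cores
    memory_gb: float  # memory in GB


VM = Resources(cpu=16, memory_gb=32)        # 16C32G Spot VM
REQUEST = Resources(cpu=7.5, memory_gb=12)  # per-container request
LIMIT = Resources(cpu=12, memory_gb=15)     # per-container limit
RUNNERS_PER_VM = 2


def total(per_container: Resources, count: int) -> Resources:
    """Sum the resources of `count` identical containers."""
    return Resources(per_container.cpu * count, per_container.memory_gb * count)


def fits(needed: Resources, available: Resources) -> bool:
    """True if both CPU and memory fit within what is available."""
    return needed.cpu <= available.cpu and needed.memory_gb <= available.memory_gb


requested = total(REQUEST, RUNNERS_PER_VM)  # 15 cores, 24 GB
burst = total(LIMIT, RUNNERS_PER_VM)        # 24 cores, 30 GB

print(fits(requested, VM))  # True
print(fits(burst, VM))      # False
```

Under this reading, the two requests together leave 1 core and 8 GB of headroom on the VM, while the limits overcommit only CPU (24 cores against 16) and keep memory within the VM's 32 GB. One plausible interpretation is that a single runner may burst on CPU when the other is idle, but that is a guess rather than something the configuration states.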

This configuration also strikes a balance between cost and efficiency: at present, the tests required to merge a Databend PR can generally be completed within 20 minutes, including all static checks, builds for two architectures (x86_64 and aarch64) and two libc variants (gnu and musl), and the full set of logic tests, stateless tests, stateful tests, and so on.

Datafuse Labs planned to offer a Data Cloud SaaS service from its founding, so it understands the value of SaaS very well and actively embraces SaaS when choosing tools. This is easy to understand: Databend's core value lies in doing Data Cloud SaaS well, and every other product serves that goal. Maintaining its own set of CI services would contribute nothing to the core goals of the Databend team and would instead require a dedicated team to invest in and maintain it continuously, which is not cost-effective. After investigating multiple SaaS services on the market, Databend settled on the current solution for two reasons: on the one hand, Databend already has a complete K8s technology stack, so adding Self-hosted Runners brings only a small maintenance burden; on the other hand, fully managed CI SaaS services are very expensive, much more expensive than running self-hosted runners.

In practice, Databend uses services within the AWS intranet whenever possible to optimize costs:

Store the build artifacts generated during CI in a dedicated internal S3 bucket with a lifecycle rule for automatic deletion, so that runner uploads and downloads incur no traffic charges
Use AWS's own container image service, which incurs no public network traffic charges when used on the internal network
Enforce resource isolation so that GitHub-hosted Runners cannot access internal services; for example, GitHub-hosted Runners pull from Dockerhub instead of the AWS container image service
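
The routing and isolation rules described in this section can be sketched as two small functions: one picks the runner kind from the triggering event, the other picks the image registry from the runner kind. Everything named here is illustrative: `registry.internal.example` is a placeholder host rather than a real endpoint, the image name is invented, and the real setup is expressed in GitHub Actions workflows, not Python.

```python
from enum import Enum


class RunnerKind(Enum):
    SELF_HOSTED = "self-hosted"
    GITHUB_HOSTED = "github-hosted"


# Placeholder for the internal AWS-hosted registry; not a real endpoint.
INTERNAL_REGISTRY = "registry.internal.example"
# Docker Hub's registry host.
PUBLIC_REGISTRY = "docker.io"


def runner_for(event: str) -> RunnerKind:
    """PR tests go to self-hosted runners; everything else is GitHub-hosted."""
    if event == "pull_request":
        return RunnerKind.SELF_HOSTED
    return RunnerKind.GITHUB_HOSTED


def image_ref(runner: RunnerKind, image: str) -> str:
    """Only self-hosted runners may pull from the internal registry."""
    if runner is RunnerKind.SELF_HOSTED:
        return f"{INTERNAL_REGISTRY}/{image}"
    return f"{PUBLIC_REGISTRY}/{image}"


print(image_ref(runner_for("pull_request"), "example/build-tool:latest"))
print(image_ref(runner_for("push"), "example/build-tool:latest"))
```

Keeping the registry choice tied to the runner kind means a GitHub-hosted job never needs credentials for, or network access to, the internal services.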

In general, GitHub Actions combined with Self-hosted Runners and a few cost optimization strategies is a very good practice.

Xuanwo’s Engineering Efficiency Experience

From my practice at Databend, I have drawn the following lessons:

Correct positioning
Incremental change
Open and transparent

Correct positioning

To do engineering efficiency work, you must first get your positioning right: engineering efficiency exists to speed up product delivery. Do not put the cart before the horse by blindly adding cumbersome approval steps, generating reports that no one reads, and falling into the rut of formalism. Implementing engineering efficiency in an open source community requires particular attention to empowerment rather than control: give contributors enough information to solve problems themselves, rather than throwing blunt errors at them and ordering them around. When designing a process or adjusting CI, start from how the community actually experiences it, talk extensively with front-line contributors, and only then make changes.

Incremental change

Adjust the process in small steps and over several rounds, changing a little each time and giving the community time to adapt. After a new process goes live, talk with the community, listen to its most direct feedback, and then either optimize the process or roll it back.

Open and transparent

For open source projects, improvements in engineering efficiency tend to come from the bottom up: front-line engineers usually feel engineering efficiency most keenly and are willing to improve their own working experience. It is therefore best to keep both the process and CI open and transparent, so that community members can put forward their own suggestions for improvement. Using an off-the-shelf, widely used CI service also helps contributors bring over their experience from other communities. For example, Databend uses GitHub Actions, and several contributors have submitted PRs that apply best practices from other projects. If we were using a Jenkinsfile, it would be hard to get similar feedback.

Summary

This article summarizes Databend's engineering efficiency practices and some of my personal experience. You are welcome to discuss it with me in the comments~
