Author Peter Gao is the co-founder and Chief Executive Officer (CEO) of Aquarium, a company that builds tools for finding and fixing problems in deep learning datasets. Before Aquarium, Peter worked on machine learning for autonomous vehicles, education, and social media at companies such as Cruise. This article combines his understanding of deep learning with several years of experience putting it into production in industry. Author | Peter Gao Compiled by | Liu Bingyi
Editor | Chen Caixian
When I started my first job after college, I thought I knew a lot about machine learning. I had done two internships at Pinterest and Khan Academy building machine learning systems. In my final year at Berkeley, I did research on deep learning for computer vision and worked on Caffe, one of the first popular deep learning libraries. After graduating, I joined a small startup called Cruise that made self-driving cars. Now I'm at Aquarium, helping multiple companies deploy deep learning models to solve important societal problems. Over the years, I've built some pretty cool deep learning and computer vision stacks. More people are using deep learning in production applications now than when I was doing research at Berkeley, but many of the problems they face are the same ones I faced at Cruise in 2016. I have a lot of lessons about deep learning in production that I want to share with you, in the hope that you don't have to learn them the hard way.
Legend: The author’s team developed the first machine learning model deployed on a car
1
The story of deploying an ML model to a self-driving car

First, let me tell the story of Cruise's first-ever ML model deployed on a car. As we developed the model, the workflow felt a lot like what I was used to from my research days: we trained open source models on open source data, integrated them into the company's product software stack, and deployed them on the car. After a few weeks of work, we merged the final PR and the model was running on the car. "Mission accomplished!" I thought, and we moved on to putting out the next fire. What I didn't know was that the real work had only just begun.

The model went into production, and our QA team started noticing performance issues. But we had other models to build and other tasks to do, so we didn't address those issues right away. Three months later, when we looked into them, we found that the training and validation scripts had all broken because the codebase had changed since our first deployment. After a week of fixes, we looked at the failures from the previous few months and realized that many of the issues observed in production could not be easily fixed by modifying the model code; we needed to collect and label new data from our company's vehicles rather than rely on open source data. That meant we needed to build a labeling process, along with all the tools, operations, and infrastructure such a process requires.

Another three months later, we shipped a new model trained on data randomly sampled from the cars and labeled with our own tools. But as the simple problems got solved, we had to become much sharper about which changes were likely to make a difference. About 90% of the problems were solved by careful data curation of difficult or rare scenarios, rather than by deep changes to the model architecture or hyperparameter tuning. For example, we found that the model performed poorly on rainy days (rare in San Francisco), so we labeled more rainy-day data, retrained the model on the new data, and its performance improved. Similarly, we found that the model performed poorly on green traffic cones (as opposed to orange ones), so we collected more data containing green cones, went through the same process, and the model's performance improved. We needed to establish a process that could quickly identify and resolve these kinds of issues.

It took only a few weeks to assemble the 1.0 version of the model, but another six months to ship an improved version. As we invested more in a few key areas (better labeling infrastructure, cloud data processing, training infrastructure, deployment monitoring), models could be retrained and redeployed on a monthly, then weekly, basis. As we built more model pipelines from scratch and worked to improve them, we started to see common themes. Applying what we had learned to new pipelines made it easier to run better models faster and with less effort.
2
Keep iterating continuously
Legend: Many different deep learning teams working on autonomous driving have fairly similar iteration cycles for their model pipelines. From top to bottom: Waymo, Cruise, and Tesla.

I used to think that machine learning was mostly about the models. In fact, machine learning in industrial production is mostly plumbing. One of the best predictors of success is the ability to iterate efficiently on the model pipeline. That doesn't just mean iterating fast; it means iterating smart. The second part is critical, otherwise your pipeline will quickly produce bad models.

Most traditional software development emphasizes rapid iteration and an agile delivery process: because product requirements are unknown and must be discovered through adaptation, it is better to ship an MVP quickly and iterate than to make exhaustive plans on shaky early assumptions. Just as requirements are complex in traditional software, the domain of data inputs that a machine learning system must handle is truly vast. Unlike normal software development, the quality of a machine learning model depends not only on its implementation in code but also on the data that code relies on. This dependence on data means a machine learning model can "explore" the input domain through dataset construction and curation, allowing it to understand the task requirements and adapt to them over time without any code changes. To take advantage of this property, machine learning needs a notion of continuous learning that emphasizes iterating on both data and code. Machine learning teams must (see the sketch after this list):
- Identify problems in data or model performance
- Diagnose the cause of the problem
- Change data or model code to address these issues
- Validate that the model has improved after retraining
- Deploy new model and repeat
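To make this loop concrete, here is a minimal sketch of one pass through it in Python. The step functions and the `validation_score` metric name are hypothetical placeholders, not any particular team's tooling; the point is simply that the five steps form a single unit you can run on a regular cadence.

```python
from typing import Any, Callable, Dict, List


def continuous_learning_cycle(
    find_failures: Callable[[], List[Any]],        # step 1: identify problems in data or model performance
    diagnose: Callable[[Any], str],                 # step 2: diagnose the cause of each problem
    update_dataset: Callable[[List[str]], None],    # step 3: change the data (or model code) to address them
    retrain_and_validate: Callable[[], Dict[str, float]],  # step 4: retrain and compute validation metrics
    deploy: Callable[[], None],                     # step 5: ship the new model
    baseline_score: float,                          # validation score of the model currently in production
) -> Dict[str, float]:
    """One pass through the iteration loop described above (all callables are placeholders)."""
    failures = find_failures()
    diagnoses = [diagnose(f) for f in failures]
    update_dataset(diagnoses)
    metrics = retrain_and_validate()
    # Only ship the retrained model if it actually beats the one in production.
    if metrics["validation_score"] > baseline_score:
        deploy()
    return metrics
```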
Teams should aim to go through this cycle at least once a month, or once a week if they're doing well. Large companies can complete a model deployment cycle in less than a day, but building the infrastructure to do this quickly and automatically is very difficult for most teams. Updating models less frequently than this can lead to code rot (the model pipeline breaks because of codebase changes) or data domain shift (the model in production fails to generalize as the data changes over time). Done right, however, a team can settle into a good rhythm of deploying improved models into production.
3
Create a feedback loop

Legend: Calibrated model uncertainty is a tantalizing area of research, in which a model can flag where it thinks it might fail.

A key part of iterating effectively on a model is focusing on the most impactful problems. To improve a model, you need to know what's wrong with it and be able to triage the problems according to product and business priorities. There are many ways to build a feedback loop, but it starts with finding and classifying errors.

Take advantage of domain-specific feedback loops. Where available, these can be a very powerful and efficient way to get model feedback. Forecasting tasks, for example, can get labeled data "for free" by training on historical data of what actually happened, which lets them continuously ingest large amounts of new data and adapt fairly automatically to new situations.

Set up a workflow where people can review the outputs of your model and flag errors when they occur. This approach is especially appropriate when human review can easily catch errors across many model inferences. The most common form is when a customer notices an error in the model's output and complains to the machine learning team. This should not be underestimated, because this channel lets you incorporate customer feedback directly into the development cycle. A team can also have humans double-check model outputs that customers might have missed: think of an operator watching a robot sort packages on a conveyor belt and clicking a button whenever they notice an error.

Consider setting up automatic reviews when models run too often for humans to check every output. This is especially useful when it is easy to write "sanity checks" against the model's output. For example, flag whenever the lidar object detector and the 2D image object detector disagree, or whenever the frame-to-frame detector disagrees with the temporal tracking system. When this works, it gives a lot of useful feedback about where failures occur. When it doesn't, it merely exposes a bug in your checking system or misses some of the cases where the system goes wrong; either way, it is low risk and high reward.

The most general (but difficult) solution is to analyze the model's uncertainty on the data it runs on. A simple example is to look at cases where the model produced low-confidence outputs in production. This can show where the model is genuinely uncertain, but it is not 100% reliable: sometimes the model is confidently wrong, and sometimes it is uncertain because the information needed to make a good inference is missing (for example, noisy input data that humans would also struggle to interpret). There are models that address these issues, but this remains an active area of research.

Finally, feedback from the model on its own training set can be leveraged. For example, examining where a model disagrees with its training/validation dataset (i.e., high-loss examples) points to high-confidence failures or labeling mistakes. Neural network embedding analysis can provide a way to understand patterns of failure modes in the training/validation datasets and to discover differences between the distribution of raw data in training and in production.
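To make the last two ideas concrete, the following is a minimal sketch of mining examples for review, assuming a PyTorch-style classifier and an in-order DataLoader. The function name, the thresholds, and the use of max softmax probability as a confidence proxy are illustrative assumptions, not the author's actual tooling.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def mine_examples_for_review(model, data_loader, top_k=100, confidence_threshold=0.5):
    """Return dataset indices worth sending to human review:
    the top-k highest-loss examples (often mislabels or real failure modes)
    plus any example where the model's confidence falls below a threshold.

    Assumes the loader iterates the dataset in order (shuffle=False) so indices line up."""
    model.eval()
    records = []  # (dataset_index, per_example_loss, max_softmax_confidence)
    offset = 0
    for inputs, labels in data_loader:
        logits = model(inputs)
        # Per-example loss: high values point at disagreement with the labels.
        losses = F.cross_entropy(logits, labels, reduction="none")
        # Max softmax probability as a rough confidence proxy.
        confidences = F.softmax(logits, dim=-1).max(dim=-1).values
        for i in range(labels.shape[0]):
            records.append((offset + i, losses[i].item(), confidences[i].item()))
        offset += labels.shape[0]

    high_loss = sorted(records, key=lambda r: r[1], reverse=True)[:top_k]
    low_confidence = [r for r in records if r[2] < confidence_threshold]
    # Union of both criteria, returned as a sorted list of dataset indices to review.
    return sorted({r[0] for r in high_loss} | {r[0] for r in low_confidence})
```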
4
Automation and delegation
Legend: Most of the human time in a typical retraining cycle can easily be removed. Even at the cost of some machine-time efficiency, doing so eliminates a lot of manual pain.

The key to speeding up iteration is reducing the amount of work required to complete one iteration cycle. There are always ways to make things easier, however, so you have to prioritize what to improve. I like to think of effort in two ways: clock time and human time. Clock time is the time it takes to run computing tasks such as data ETL, model training, inference, and metric computation. Human time is the time when a person must actively intervene to push the pipeline along, such as manually reviewing results, running commands, or triggering scripts in the middle of the pipeline.

For example, having to run multiple scripts in sequence while manually moving files between steps is very common but wasteful. Some back-of-the-napkin math: if a machine learning engineer costs $90 per hour and wastes 2 hours a week running scripts by hand, that adds up to $9,360 per person per year! Combining multiple scripts and their human interruptions into a single, fully automated script makes each run of the model pipeline faster and easier, saves a lot of money, and keeps your machine learning engineers sane.

By contrast, clock time usually only needs to be "reasonable" (e.g., the run can finish overnight). The exceptions are when machine learning engineers are running a lot of experiments or when there are extreme cost or scale constraints. This is because clock time is generally proportional to data size and model complexity. Clock time drops significantly when moving from local processing to distributed cloud processing. After that, horizontal scaling in the cloud tends to handle most problems for most teams until the problem becomes very large.

Unfortunately, some tasks cannot be fully automated. Almost all production machine learning applications are supervised learning tasks, and most rely on some amount of human interaction to tell the model what it should do. In some domains, this human interaction is free (for example, social media recommendation use cases and other applications with a lot of direct user feedback). In others, human time is more limited or more expensive, such as trained radiologists "labeling" CT scans for training data. Either way, it is important to minimize the human time and other costs required to improve the model. While early-stage teams may rely on machine learning engineers to manage datasets, it is often more economical (or, in the radiologist case, necessary) to have operations users or domain experts without machine learning knowledge do the heavy lifting of data management. At that point, it becomes important to establish an operational process for labeling, inspecting, improving, and version-controlling datasets with good software tools.
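As a small illustration of cutting human time, here is a minimal sketch of collapsing several manually run scripts into one automated driver. The script names and flags are hypothetical placeholders for whatever stages your pipeline actually has.

```python
import subprocess
import sys

# Hypothetical stage scripts -- substitute whatever your pipeline actually runs.
PIPELINE_STAGES = [
    ["python", "etl.py", "--output", "data/latest"],
    ["python", "train.py", "--data", "data/latest", "--output", "models/candidate"],
    ["python", "evaluate.py", "--model", "models/candidate", "--report", "reports/metrics.json"],
]


def run_pipeline() -> None:
    """Run every stage in order and stop at the first failure, so nobody
    has to sit around running scripts and moving files by hand."""
    for stage in PIPELINE_STAGES:
        print("Running:", " ".join(stage))
        result = subprocess.run(stage)
        if result.returncode != 0:
            sys.exit(f"Stage failed: {' '.join(stage)}")


if __name__ == "__main__":
    run_pipeline()
```

Once something like this exists, the whole cycle can be triggered by a scheduler or a single command instead of an engineer babysitting each step.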
5
Encourage ML engineers to get fit
Legend: As ML engineers lift weights, they are also boosting the weights their models learn.

Building enough tooling to support a new domain or a new group of users can take a lot of time and effort, but when done right, it is very much worth it. At Cruise, one of my engineers was exceptionally smart (some would say lazy). This engineer set up an iteration loop in which a combination of operational feedback and metadata queries would pull data for labeling from places where the model was underperforming. An offshore operations team would then label the data and add it to a new version of the training dataset. The engineer then built infrastructure that let them run a single script on their computer and kick off a series of cloud jobs that automatically retrained and validated a simple model on the newly added data.

Every week, they would run the retraining script, and then, while the model trained and validated itself, they would go to the gym. After a few hours of exercise and dinner, they would come back and check the results. As it happened, the new and improved data would lead to improvements in the model; after a quick double-check to make sure everything made sense, they would ship the new model to production, and the cars' driving performance would improve. They would then spend the rest of the week improving the infrastructure, experimenting with new model architectures, and building new model pipelines. Not only did this engineer get promoted at the end of the quarter, they were also in great shape.
6
Conclusion

To summarize: during the research and prototyping phase, the focus is on building and shipping a model. But as a system moves into production, the core task becomes building a system that can regularly ship improved models with minimal effort. The better you get at this, the more models you can build! To do this, we need to focus on the following:
- Run the model pipeline at a regular cadence and focus on shipping models that are better than ever before. Get a new and improved model into production every week or less!
- Establish a good feedback loop from the model output to the development process. Find out which examples the model is doing poorly on and add more examples to your training dataset.
- Automate particularly onerous tasks in your pipeline and build a team structure that lets your team members focus on their areas of expertise. Tesla's Andrej Karpathy calls the ideal end state "Operation Vacation." I suggest building a workflow where your ML engineers go to the gym while your ML pipeline does the heavy lifting!
Finally, it's important to stress that, in my experience, the vast majority of model performance problems can be solved with data, but some can only be solved by modifying the model code. Those changes tend to be very specific to the model architecture at hand; for example, after working on image object detectors for several years, I spent too much time worrying about the best prior box assignments for certain aspect ratios and about improving feature map resolution for small objects. However, as Transformers show promise as a universal model architecture for many different deep learning tasks, I suspect more of these tricks will become less relevant and the focus of machine learning development will shift further toward improving datasets.

Reference link: https://ift.tt/Hhp0tmi

Kendall, A. & Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? Advances in Neural Information Processing Systems, 5574-5584.