Considering the distribution shift alone is not enough! Real data is complex and “external validity” is essential


By Deborah Raji

Interpretation | Antonio

Editor | Chen Caixian

Data distribution shift is a topic of keen interest in trustworthy AI, with countless related studies published every year. But is focusing on distribution shift alone enough?

Recently, Deborah Raji, a technology researcher at NYU’s AI Now Institute, shared her views on the topic on argmin, the blog of Benjamin Recht at UC Berkeley.

She is concerned that the academic community pays too much attention to distribution shift, and believes that a related concept from statistics, external validity, deserves more consideration.


1
Data distribution shift

Data distribution shift has long been a “killer” for trustworthy AI. For example, the sepsis prediction model developed by Epic Systems and widely used at the University of Michigan hospital had to be urgently deactivated in April 2020 due to frequent false alarms. According to the analysis, changes in the demographic characteristics of the patient population brought on by the COVID-19 pandemic had biased the model.

This is an example of data distribution shift: when the distribution of the test-time data differs from the distribution of the training data, the model fails to transfer to the new application scenario and makes errors.

This stems from the ever-changing nature of the real world: real-world data is dynamic, shifting, and uncertain, affected by software deployment changes, population migration, behavioral changes, language evolution, and so on. If a model does not account for these changes, systematic bias results.
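To make the definition concrete, here is a minimal synthetic sketch (this is not the Epic sepsis model; all numbers are invented for illustration): a classifier is trained on one population, then evaluated both on data from the same distribution and on data from a population that has drifted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, center):
    """Two Gaussian classes positioned around `center` (synthetic, illustrative)."""
    X0 = rng.normal(loc=center - 1.0, scale=1.0, size=(n, 2))
    X1 = rng.normal(loc=center + 1.0, scale=1.0, size=(n, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n + [1] * n)
    return X, y

X_train, y_train = make_data(1000, center=0.0)   # training population
X_iid, y_iid = make_data(1000, center=0.0)       # test set from the same distribution
X_shift, y_shift = make_data(1000, center=2.5)   # the whole population has drifted

model = LogisticRegression().fit(X_train, y_train)
print("accuracy on i.i.d. test set:  ", model.score(X_iid, y_iid))
print("accuracy on shifted test set: ", model.score(X_shift, y_shift))
```

On the in-distribution test set the accuracy stays high; on the drifted test set it collapses toward chance, which is exactly the failure mode described above.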

Benjamin Recht and collaborators published another surprising study. They collected a new test set following the original ImageNet data-collection procedure, evaluated existing models on it, and found the following:

[Figure: each model’s accuracy on the original ImageNet test set vs. on the newly collected test set]

In the figure, the horizontal axis is a model’s accuracy on the original test set and the vertical axis is its accuracy on the new test set. Each blue point is one model, the red line is a linear fit to the points, and the black dashed line y = x marks where the two accuracies would be equal.

Although the two remain linearly correlated (models that perform well on the original dataset also perform well on the new one, and vice versa), there is still a gap of nearly 15% between them, which is attributed to a shift in the data distribution. The shift here may come from different annotator preferences, differences in the data-collection process, and so on.
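The kind of comparison shown in the figure can be sketched in a few lines. The accuracy values below are invented for illustration and are not the paper’s actual results; the point is the recipe: scatter each model’s (original accuracy, new accuracy) pair, fit a line, and compare against y = x.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-model accuracies (illustrative only, not the paper's data).
orig_acc = np.array([0.62, 0.68, 0.71, 0.74, 0.76, 0.79, 0.82])
new_acc = orig_acc - 0.12 + np.random.default_rng(1).normal(0, 0.01, orig_acc.size)

slope, intercept = np.polyfit(orig_acc, new_acc, deg=1)   # the red linear fit
xs = np.linspace(0.60, 0.85, 100)

plt.scatter(orig_acc, new_acc, label="models")
plt.plot(xs, slope * xs + intercept, color="red", label="linear fit")
plt.plot(xs, xs, "k--", label="y = x (no drop)")
plt.xlabel("accuracy on original test set")
plt.ylabel("accuracy on new test set")
plt.legend()
plt.show()
```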

2
Research status

Deborah Raji acknowledges the importance of studying this phenomenon, but believes that ML researchers are so fixated on distribution shift that they attribute almost any failure of a model to it, which she considers inappropriate.

First, she argues that the term “distribution shift” is at once too specific and too vague. Any change in the data can be labeled a “distribution shift”: changes in the features of the data, changes in its labels, or changes in both.
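These kinds of change are commonly given their own names in the ML literature: covariate shift (the input distribution P(x) changes), label shift (the class prior P(y) changes), and concept shift (the relationship P(y|x) changes). A minimal synthetic sketch of two of them, with all distributions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def label_rule(x):
    # Training-time relationship: P(y = 1 | x) is a logistic function of x.
    return (rng.random(x.shape) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(int)

# Training-time world.
x_train = rng.normal(0.0, 1.0, 1000)
y_train = label_rule(x_train)

# Covariate shift: the input distribution P(x) moves, the rule P(y|x) does not.
x_cov = rng.normal(2.0, 1.0, 1000)
y_cov = label_rule(x_cov)

# Concept shift: the inputs look the same, but P(y|x) itself changes
# (here the effective decision boundary moves from 0 to 1).
x_con = rng.normal(0.0, 1.0, 1000)
y_con = (rng.random(x_con.shape) < 1.0 / (1.0 + np.exp(-2.0 * (x_con - 1.0)))).astype(int)

# Label shift would instead change the class prior P(y) while keeping P(x|y) fixed.
```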

On the other hand, the term is too broad and imprecise. The very concept of a “data distribution” presumes that the data is drawn, independently and identically distributed, from some hypothetical “true” distribution, of which the observed data is a sample. But what is that distribution? Nobody knows; real data is messy, disordered, and unpredictable.

The data distribution has shifted, but it is impossible to know exactly which parts changed or why.

Deborah Raji goes on to warn that obsession with the term can limit the growth of the ML community. One manifestation is that the community is keen to build benchmarks that detect shifts in the data distribution and to report the degree of shift on the test data. But these benchmarks are static and idealized, and cannot capture the more complex data of the real world.
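A common building block behind such benchmarks is a two-sample test that compares a training-time reference sample against data seen in deployment, feature by feature. The sketch below uses the Kolmogorov-Smirnov test from SciPy; the data and the significance threshold are illustrative assumptions, not any particular benchmark’s procedure.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(5000, 3))     # training-time reference sample
live = reference + np.array([0.0, 0.0, 0.8])         # feature 2 has drifted in deployment

for j in range(reference.shape[1]):
    stat, p_value = ks_2samp(reference[:, j], live[:, j])
    flag = "DRIFT?" if p_value < 0.01 else "ok"       # illustrative threshold
    print(f"feature {j}: KS statistic={stat:.3f}, p={p_value:.3g} [{flag}]")
```

Tests like this only see the data; they cannot register contextual changes, such as how clinicians interact with the tool, that Raji argues matter just as much.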

Some studies have begun to conclude that an overemphasis on distribution shift has led ML practitioners and policymakers to focus more on retrospective studies than on prospective ones. The former analyze statically collected historical data, while the latter pay more attention to the context in which the system is deployed.


Retrospective and prospective studies

To this end, Deborah Raji hopes research will turn more toward the concept of “validity.” Validity is an important concept in measurement theory in statistics, used to assess how reliable the conclusions drawn from a system are. It comes in several forms, including internal validity and construct validity; when discussing generalization, the focus is on external validity.

3
External validity

External validity measures how well a model generalizes to other settings and scenarios. Such evaluations are usually carried out outside the experiment’s original environment, and they consider more than just changes in the data.

Deborah Raji cites an article published in JAMA Internal Medicine, “External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients,” which carries out exactly this kind of external-validity analysis on the model from the example at the beginning of this article.

[Figure: the paper performing the external-validity analysis of the model]

URL: https://ift.tt/PE162GA

First, the article describes a retrospective study of the sepsis model’s use between December 2018 and October 2019, notably before the onset of the pandemic. The researchers examined 27,697 patients across 38,455 hospitalizations and found that the Epic model’s area under the curve for predicting the onset of sepsis was 0.63, “much worse than the performance reported by its developers.”

Additionally, the tool “did not identify 1,709 sepsis patients (67%)” while also generating a large number of false alarms.
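To make these numbers concrete: the AUC summarizes ranking quality across all thresholds, while the missed cases and false alarms depend on the single alerting threshold actually deployed. The sketch below uses entirely synthetic scores (they are not the Epic model’s outputs) to show how a model can have a mediocre AUC and, at a fixed threshold, both miss many true cases and raise many false alarms.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.random(10_000) < 0.07                       # assume ~7% of stays develop sepsis
scores = rng.normal(0.0, 1.0, 10_000) + 0.5 * y_true     # weakly informative risk scores

print("AUC:", round(roc_auc_score(y_true, scores), 2))   # threshold-free ranking quality

threshold = 1.0                                          # hypothetical alerting threshold
alerts = scores >= threshold
sensitivity = (alerts & y_true).sum() / y_true.sum()     # share of true cases caught
false_alarms = (alerts & ~y_true).sum()                  # alerts on patients without sepsis
print(f"sensitivity at threshold: {sensitivity:.0%}")
print(f"false alarms raised: {false_alarms}")
```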

The researchers rightly frame these problems as questions of “external validity” and study them in detail, which goes well beyond “clinician and dataset shift”, i.e., a static description of how the data distribution has moved.

The Epic system’s own evaluation was based on data from three US health systems collected between 2013 and 2015, which differs from the 2018-2019 patient records from the University of Michigan. But the new evaluation goes beyond the data alone: it examines changes in how clinicians interact with the model and how those changes affect outcomes, as well as other external-validity factors that have little to do with the data. All of this goes well beyond data distribution shift.

Even when discussing substantive changes in the data, the researchers try to describe precisely what changed and analyze the specific differences that arose when the model was deployed in their hospital.

4
About the author


Deborah Raji is a Nigerian-Canadian computer scientist and activist who works on algorithmic bias, AI accountability, and algorithmic auditing. She has worked with Google’s Ethical AI team, collaborating with Timnit Gebru, and was a research fellow at NYU’s AI Now Institute, studying how to bring ethical considerations into machine learning engineering practice. She has won several awards in the field.

Deborah Raji and Ben Recht have had many in-depth discussions on the topic of external validity, and follow-up posts on the issue will also appear on the argmin blog. Interested readers can follow along.

Reference blog:

https://ift.tt/RVd1XPU https://ift.tt/X3icYNg



This article is reprinted from: https://www.leiphone.com/category/academic/D1SBV1ItxoWR0ClS.html
This site is for inclusion only, and the copyright belongs to the original author.
