Diff in the Loop: Supporting Data Comparison in Exploratory Data Analysis

Original link: http://vis.pku.edu.cn/blog/chi2022-ditl/

During data analysis, researchers transform, cluster, or filter data to fit the needs of the analysis task. In this process, both the code that handles the data and the data itself change. However, existing tools track only code changes and cannot tell users how the data changes after the code runs. This paper designs a tool that tracks data changes during exploratory data analysis.

As shown in Figure 1, the code on the left processes a car data set. The highlighted code filters the original data so that the result contains only items with more than 4 cylinders. How did the filter change the data? To find out, researchers must write additional code that plots the information of interest, and then compare the data before and after filtering. The histogram on the right side of Figure 1 shows the distribution of “horsepower” before and after filtering.

Figure 1 After operations are added to the data processing code, the distribution of the data changes
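The manual before-and-after comparison described above can be sketched in a few lines of Python. This is only an illustration of the extra work the analyst does by hand, not the paper's code: the `cars` rows are made-up toy values, and the `histogram` helper and its `bin_width` parameter are hypothetical names introduced here.

```python
from collections import Counter

# Hypothetical toy rows standing in for the car data set (values are made up).
cars = [
    {"cylinders": 4, "horsepower": 65},
    {"cylinders": 4, "horsepower": 90},
    {"cylinders": 4, "horsepower": 95},
    {"cylinders": 6, "horsepower": 105},
    {"cylinders": 6, "horsepower": 110},
    {"cylinders": 8, "horsepower": 150},
    {"cylinders": 8, "horsepower": 165},
    {"cylinders": 8, "horsepower": 220},
]


def histogram(values, bin_width):
    """Count values per fixed-width bin; each key is the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# The filtering step from the example: keep cars with more than 4 cylinders.
before = cars
after = [row for row in cars if row["cylinders"] > 4]

# The "additional code" an analyst writes by hand to see what the filter did.
hist_before = histogram([r["horsepower"] for r in before], bin_width=50)
hist_after = histogram([r["horsepower"] for r in after], bin_width=50)

for edge in hist_before:
    print(f"{edge:>3}-{edge + 49}: "
          f"before={hist_before[edge]} after={hist_after.get(edge, 0)}")
```

Even in this small case, the comparison code is about as long as the analysis step itself, which is the overhead the paper aims to remove.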

To make before-and-after comparison more convenient, this paper designs a tool called DITL (Diff in the Loop). Figure 2 shows the DITL interface. View A records the editing history of the user’s code, where the user can view the modifications made by different users at different times. The modified code is highlighted in the code editor. The user can then select data tables to compare. For each data table, the system displays the distribution of every dimension as a histogram and lists summary statistics for that dimension. To compare the distribution of a dimension across two data tables, the authors use three common comparison methods: explicit difference (C), superposition, that is, overlaying both distributions in one chart (D), and juxtaposition (E).

Figure 2 DITL system interface
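The three comparison methods can be illustrated with a minimal text-based sketch. It assumes each histogram is a plain dict mapping a bin's lower edge to a count. The function names (`align`, `difference`, `superpose`, `juxtapose`) and the sample counts are hypothetical and are not taken from DITL, which renders graphical charts rather than text.

```python
def align(hist_a, hist_b):
    """Put two histograms on a shared, sorted set of bins."""
    bins = sorted(set(hist_a) | set(hist_b))
    a = [hist_a.get(k, 0) for k in bins]
    b = [hist_b.get(k, 0) for k in bins]
    return bins, a, b


def difference(hist_a, hist_b):
    """Explicit difference: per-bin change in count (b minus a)."""
    bins, a, b = align(hist_a, hist_b)
    return {k: y - x for k, x, y in zip(bins, a, b)}


def superpose(hist_a, hist_b):
    """Superposition: one bar per bin; '#' is the shared part,
    'a' or 'b' marks the part found in only one histogram."""
    bins, a, b = align(hist_a, hist_b)
    lines = []
    for k, x, y in zip(bins, a, b):
        shared = "#" * min(x, y)
        extra = "a" * (x - y) if x > y else "b" * (y - x)
        lines.append(f"{k:>4} | {shared}{extra}")
    return lines


def juxtapose(hist_a, hist_b):
    """Juxtaposition: the two bars drawn side by side for each bin."""
    bins, a, b = align(hist_a, hist_b)
    width = max(1, *a, *b)
    return [f"{k:>4} | {'#' * x:<{width}} | {'#' * y}"
            for k, x, y in zip(bins, a, b)]


# Made-up horsepower counts before and after a filter.
hist_before = {50: 3, 100: 2, 150: 2, 200: 1}
hist_after = {100: 2, 150: 2, 200: 1}

print(difference(hist_before, hist_after))
print("\n".join(superpose(hist_before, hist_after)))
print("\n".join(juxtapose(hist_before, hist_after)))
```

All three functions first align the histograms on shared bins, because a per-bin comparison is only meaningful when both distributions use the same bin edges.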

To evaluate the system, the authors recruited 16 data scientists to complete two data analysis tasks with and without DITL, and then obtained user feedback through questionnaires.

Figure 3 User questionnaire survey results

As Figure 3 shows, users do need to compare data tables during data exploration. With DITL, users can compare data tables more conveniently and arrive at findings from the data more easily.

Overall, this paper proposes a tool that supports comparing data tables during data exploration. However, the paper does not discuss in enough depth which tasks need to be supported: it compares only the differences in the per-dimension distributions of the data tables. It also lacks sufficient discussion of the visualization forms it adopts. Finally, the tool supports only the comparison of data and cannot yet compare the differences between visual charts.

References:

April Yi Wang, Will Epperson, Robert A. DeLine, Steven Mark Drucker. Diff in the Loop: Supporting Data Comparison in Exploratory Data Analysis. CHI 2022: 97:1-97:10

