ChartDetective: Easy and Accurate Interactive Data Extraction from Complex Vector Charts

Original link: http://vis.pku.edu.cn/blog/chartdetective%EF%BC%9A%E4%BB%8E%E5%A4%8D%E6%9D%82%E7%9A%84%E7 %9F%A2%E9%87%8F%E5%9B%BE%E4%B8%AD%E8%BD%BB%E6%9D%BE%E8%80%8C%E5%87%86%E7%A1 %AE%E5%9C%B0%E6%8F%90%E5%8F%96%E4%BA%92%E5%8A%A8%E6%95%B0/

Extracting the underlying data from rasterized charts is tedious and inaccurate: values may be partially occluded or indistinguishable, and image quality limits the accuracy of data recovery. To address these issues, the authors developed a semi-automatic system that extracts the underlying data easily and accurately from vector graphics. The system is designed to make the most of vector information, relying on a drag-and-drop interface combined with selection, filtering, and preview functions, as shown in Figure 1. Through a user study, the authors show that participants needed less than 4 minutes to accurately recover thousands of data points from charts published at CHI with different styles, combinations of different encodings, and partially or completely occluded elements. Compared to other methods relying on raster images, ChartDetective [1] successfully recovers all data, even hidden data, with a 78% lower relative error.

Figure 1: ChartDetective system operation flow

So, why do people extract data from charts? First, scientists may want to compare their technique with the state of the art, but if other papers only show charts of their results, the values have to be extracted from those charts. Second, some visualizations make comparisons difficult, and a designer may want to recover the data and redraw the chart with another visualization, or render it in black and white so that colorblind readers can read it. With the extracted data, one can also make the chart interactive and publish it on a website, or run different analyses. Data extraction therefore has many real-life applications.

But how can a chart be turned into a table? One option is to mark each data point manually, one by one, with WebPlotDigitizer [2]. However, this becomes quite difficult when there are many points. Besides manual systems, there has been some research on fully automated methods, but they are very limited in what they can handle. For now, approaches that keep a human in the loop are therefore more feasible. In ChartSense [3], the authors propose a semi-automatic system that takes the best of both worlds. It works top-down: it first detects the type of chart, then chooses the right heuristic to help the user extract the data. However, problems can arise if the chart is unconventional. For example, in a chart that combines lines and bars, what happens if the wrong heuristic is used? If some values also overlap, the chart becomes hard to interpret, and users may want to help the system deal with these problems. More importantly, data extraction from images will always be limited by image precision: when the resolution is low and points overlap, it is impossible to recover accurate data.
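To make the precision limit concrete, here is a minimal sketch, not taken from the paper, of the linear axis calibration that manual digitizing generally relies on: two reference positions with known data values define a mapping from a position to a value. The function name `make_axis_mapper` and all numbers are hypothetical. The sketch assumes a linear axis and models rasterization simply as rounding a position to the nearest pixel.

```python
def make_axis_mapper(pos_a, value_a, pos_b, value_b):
    """Return a function mapping a position along an axis to a data value.

    Assumes a linear axis calibrated by two reference positions
    (for example, two tick marks) whose data values are known.
    """
    if pos_a == pos_b:
        raise ValueError("calibration positions must differ")
    scale = (value_b - value_a) / (pos_b - pos_a)

    def to_value(pos):
        return value_a + (pos - pos_a) * scale

    return to_value


# Hypothetical y axis: position 400 corresponds to 0.0, position 100 to 1.0.
to_value = make_axis_mapper(400, 0.0, 100, 1.0)

exact_pos = 250.4             # position as it could be stored in a vector file
pixel_pos = round(exact_pos)  # the same position snapped to a pixel grid

exact_value = to_value(exact_pos)
raster_value = to_value(pixel_pos)
quantization_error = abs(exact_value - raster_value)
```

In this toy setup the snapped position alone shifts the recovered value by 0.4/300 of the axis range, before any occlusion or anti-aliasing is considered. A real raster image adds further sources of error that this sketch does not model.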

Figure 2: System interface of ChartDetective

Therefore, the authors took a different route with ChartDetective. Instead of focusing on raster images, they considered the vector graphics commonly found on the web and in documents, making this the first semi-automatic tool to work on vector graphics, as shown in Figure 2. The authors' first idea was to take advantage of vector graphics, since they store geometric shapes in floating-point coordinates rather than pixels. Most charting tools can export in a vector format, and this format is often used for sharing charts. As a result, resolution and positional accuracy are much higher, and even hidden or overlapping elements can be recovered. In addition, with vector graphics the shapes are clearly identified and the text is often directly accessible. The authors' second idea was to use a bottom-up approach: focus on the different encodings present in the chart, then apply heuristics based on those encodings. Their third idea was to exploit the information provided by the vector format: colors and shapes are simple to identify, so they can be used as filters to facilitate selection.

Figure 3: Filter function of ChartDetective
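The filtering idea can be sketched as follows. This is an illustrative assumption, not ChartDetective's actual data model: the `Shape` class, its fields, and the `filter_similar` function are hypothetical names. The point is only that once shapes carry an explicit kind and fill color, selecting "everything that looks like this element" reduces to a simple comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    kind: str    # e.g. "rect", "circle", "path"
    fill: tuple  # RGB color as integers in 0..255
    x: float     # position in floating-point page coordinates
    y: float


def filter_similar(shapes, reference, match_kind=True, match_fill=True):
    """Keep shapes that share the reference's kind and/or fill color."""
    selected = []
    for shape in shapes:
        if match_kind and shape.kind != reference.kind:
            continue
        if match_fill and shape.fill != reference.fill:
            continue
        selected.append(shape)
    return selected


# Hypothetical chart content: two blue bars, one orange bar, one blue marker.
shapes = [
    Shape("rect", (31, 119, 180), 10.0, 52.25),
    Shape("rect", (31, 119, 180), 30.0, 40.5),
    Shape("rect", (255, 127, 14), 50.0, 61.0),
    Shape("circle", (31, 119, 180), 30.0, 40.5),
]

blue_bars = filter_similar(shapes, shapes[0])
all_blue = filter_similar(shapes, shapes[0], match_kind=False)
```

Note that the blue circle sits at exactly the same coordinates as the second blue bar. In a raster image one would hide the other, whereas here both remain separate, selectable records.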

So how does ChartDetective work? First, the authors modified a PDF viewer so that it can recover the geometry of the chart. The system then preprocesses the shapes extracted from the chart, cleaning them up and giving them a consistent representation that is easy to interpret. From this representation, the system computes descriptors representing shape and color. These descriptors are used for filtering and need to be resilient to some variation. Finally, specific heuristics are applied depending on the encoding being processed.
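The article does not specify how the descriptors are computed, so the following is only one plausible sketch of a shape descriptor that tolerates changes in position and size. The function name `shape_descriptor`, the normalization scheme, and the rounding precision are all assumptions made for illustration.

```python
def shape_descriptor(points, precision=3):
    """Describe an outline independently of its position and size.

    Points are translated so the bounding box starts at the origin and
    divided by the larger bounding-box side, then rounded so that tiny
    floating-point differences do not break equality tests.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    size = max(max(xs) - min_x, max(ys) - min_y)
    if size == 0:
        # Degenerate outline: every point coincides.
        return tuple((0.0, 0.0) for _ in points)
    return tuple(
        (round((x - min_x) / size, precision), round((y - min_y) / size, precision))
        for x, y in points
    )


# Two hypothetical triangular markers: same outline, different place and size.
marker_a = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
marker_b = [(10.5, 20.5), (13.5, 20.5), (12.0, 23.5)]
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

same_marker = shape_descriptor(marker_a) == shape_descriptor(marker_b)
different = shape_descriptor(marker_a) != shape_descriptor(square)
```

This simple version is sensitive to the order in which points are listed and to rotation. A descriptor meant for real charts would likely need to handle those cases as well.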

Figure 4: Example of a user experiment with 12 participants

The authors then evaluated ChartDetective from two angles. The first question was "Can people use it?" A user study was conducted with 12 participants, who had to extract data from charts taken from CHI papers. These included the most common chart types, such as bar charts, line charts, scatterplots, and boxplots. Participants also had to extract more complex charts, including 4 hybrid charts that combine different encodings, for example a bar chart combined with a line chart, or a boxplot combined with a scatterplot. In addition, they extracted 4 charts with many data points, which sometimes overlapped, as shown in Figure 4.

Figure 5: Evaluation of the results of a data graph extraction task using ChartDetective

Overall, the participants performed the task very successfully. They spent less than 3 minutes per chart on average and achieved very high accuracy on all charts, as shown in Figure 5. Participants also expressed a positive opinion of the tool in the subjective evaluation: they gave it a good system usability score and rated all of its features as effective on a Likert scale. Participants' comments during and after the study were also positive. These results demonstrate that participants can use ChartDetective effectively.

However, this first experiment does not precisely measure the accuracy of the extracted data, and some use cases, such as re-running an analysis, require very accurate data. The authors therefore performed a second experiment to measure the quality of the recovered data. To do this, they collected a dataset of charts for which the underlying data is known. The first half of the dataset was generated from public data with different tools such as Excel, ggplot2, plotly, and matplotlib. For greater diversity, the authors also gathered real charts from papers published at CHI in the past 5 years. They searched for papers with accompanying datasets and found that very few had published their data, so this yielded only 26 charts that could be reconstructed correctly.

Figure 6: Comparison of average error rates using WebPlotDigitizer and ChartDetective

The authors then extracted data from these charts using WebPlotDigitizer and ChartDetective, and compared the two sets of results using the mean relative error over the data points. Overall, the data extraction error of ChartDetective is much lower, as shown in Figure 6. In the first study, on usability, participants rated the system as highly usable and were able to extract even the most challenging charts. The second study, on data accuracy, showed that charts extracted from their vector representation yielded more accurate data than the same charts extracted in raster format using existing tools. These results show that ChartDetective can support most data extraction tasks.
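For reference, a mean relative error can be computed as sketched below. This uses the standard definition, the average of |extracted - true| / |true| over all data points. The article does not detail how the paper handles special cases such as a true value of 0, so rejecting them here is an assumption, and the numbers are made up for illustration.

```python
def mean_relative_error(true_values, extracted_values):
    """Average of |extracted - true| / |true| over all data points."""
    if len(true_values) != len(extracted_values):
        raise ValueError("both sequences must have the same length")
    if not true_values:
        raise ValueError("at least one data point is required")
    total = 0.0
    for true, extracted in zip(true_values, extracted_values):
        if true == 0:
            # Assumption: undefined cases are rejected rather than skipped.
            raise ValueError("relative error is undefined for a true value of 0")
        total += abs(extracted - true) / abs(true)
    return total / len(true_values)


# Made-up numbers for illustration only, not results from the paper.
ground_truth = [10.0, 20.0, 40.0]
extracted = [10.5, 19.0, 40.0]
error = mean_relative_error(ground_truth, extracted)
```

Because each error is divided by the true value, a deviation of 0.5 on a value of 10 counts as much as a deviation of 1 on a value of 20, which makes charts with different axis ranges comparable.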

References:

[1] Damien Masson, Sylvain Malacria, Daniel Vogel, Edward Lank, Géry Casiez. ChartDetective: Easy and Accurate Interactive Data Extraction from Complex Vector Charts. In Proceedings of ACM Conference on Human Factors in Computing Systems (CHI 2023), ACM, Apr. 2023, Hamburg, Germany.

[2] WebPlotDigitizer. https://ift.tt/NE6CIbS

[3] Jung D, Kim W, Song H, et al. ChartSense: Interactive Data Extraction from Chart Images. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI 2017), ACM, 2017, pp. 6706-6717.
