CrossData: Leveraging Text-Data Connections for Authoring Data Documents

Original link: http://vis.pku.edu.cn/blog/crossdata-%E5%88%A9%E7%94%A8%E6%96%87%E5%AD%97%E5%92%8C%E6%95%B0%E6%8D%AE%E4%B9%8B%E9%97%B4%E7%9A%84%E5%85%B3%E8%81%94crossdata-leveraging-text-data-connections-for-authoring-data-documents/

Researchers often use data documents to record the discoveries they make during data exploration. While there are quite a few tools for data exploration, very few tools help with writing data documents, so researchers still have to record and maintain these documents by hand. Moreover, whenever the underlying data is updated, the corresponding data document has to be revised as well, and without tool support this manual revision is tedious and error-prone. To address this problem, researchers from the University of California, San Diego proposed CrossData, which assists the writing of data documents by automatically identifying the associations between data and text.

Through interviews, the researchers first summarized the problems users encounter when exploring data and writing data documents. Users typically switch back and forth between data exploration tools (such as Excel) and text editors (such as Word), and record their findings by taking screenshots and copying and pasting data into the document. Sometimes they also need other tools for parts of the analysis, such as a calculator for computing statistics. When the data is updated or the direction of exploration changes and the data document has to be revised, users can only check and modify the corresponding text manually, line by line, to keep the document up to date and avoid inconsistencies and errors. The researchers argue that the main cause of these problems is that the current way of writing data documents severs the natural relationship between data and text, so their work focuses on how to identify and maintain the associations between data and data documents in order to help users write them.

The CrossData system consists of three parts: an association recognition module that identifies links between text and data, an interaction module that assists data document writing, and an update module that keeps the text consistent with the data. In the association recognition module, the researchers draw on existing results and tools from the NLP field to parse each sentence into a dependency tree. They define two types of entities: independent entities, which are data items contained in the original data, and dependent entities, which are values not contained in the original data that are computed from other data items. Using NLP tools, CrossData identifies the entities in a sentence. For independent entities, it recommends candidates through string matching and semantic matching; for dependent entities, it uses the dependency tree to determine which entities the value depends on and computes the value automatically. As a result, when the user types a sentence, the system can offer candidate entity completions and compute the corresponding dependent values.
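The string-matching step for independent entities can be illustrated with a minimal sketch. It is not the CrossData implementation: the toy table, the function name `candidate_entities`, and the use of Python's `difflib` for similarity are all assumptions made for illustration, and the semantic-matching and dependency-parsing steps are left out.

```python
from difflib import SequenceMatcher

# Hypothetical toy table: row label -> column label -> value.
TABLE = {
    "Apple": {"2020": 120, "2021": 150},
    "Banana": {"2020": 80, "2021": 60},
}

def candidate_entities(token, table, top_k=3):
    """Rank row and column labels by string similarity to a typed token."""
    labels = set(table)
    for row in table.values():
        labels.update(row)
    scored = [
        (SequenceMatcher(None, token.lower(), label.lower()).ratio(), label)
        for label in labels
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for score, label in scored[:top_k] if score > 0]

# A misspelled token still surfaces the intended data item first.
suggestions = candidate_entities("aple", TABLE)
```

Here `suggestions` starts with "Apple", so a typo in the text can still be linked to the right data item. A real system would combine such a score with semantic similarity, as the paragraph above describes.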

The CrossData system helps users write data documents interactively and also supports inserting simple charts.

In the interaction part, CrossData automatically filters the data items and intermediate dependent-entity values that are relevant to the text entered so far and displays them to the user. This reduces switching between data exploration tools and text editors and makes writing data documents considerably easier. CrossData also provides a placeholder feature: users can write a placeholder in place of a dependent entity that needs to be calculated, and once the sentence is complete the system computes the value and fills it in at the placeholder. In this way, users can let the system perform simple calculations without reaching for additional tools, such as a calculator, while writing. In addition, the system provides error-correction interactions that let users fix mistakes made by the automated algorithms, such as wrong dependencies.

Auto-completion and display of associated data items and intermediate values to the user.

The keyword Diff is used as a placeholder for a difference that needs to be calculated. Once the sentence is complete, the system computes the value and replaces the placeholder with it.
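The placeholder behavior can be sketched as follows. This is a simplified stand-in rather than the system's method: CrossData resolves the operands through dependency parsing, whereas this sketch assumes the two operands are simply the first two numbers in the sentence. The function name `fill_diff_placeholder` is hypothetical.

```python
import re

def fill_diff_placeholder(sentence, placeholder="Diff"):
    """Replace the placeholder with the absolute difference of the
    first two numbers in the sentence (a simplifying assumption)."""
    pattern = rf"\b{re.escape(placeholder)}\b"
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", sentence)]
    if not re.search(pattern, sentence) or len(numbers) < 2:
        return sentence  # nothing to fill, or not enough operands
    diff = abs(numbers[0] - numbers[1])
    # The "g" format drops a trailing ".0" for whole numbers.
    return re.sub(pattern, f"{diff:g}", sentence, count=1)

filled = fill_diff_placeholder("Sales rose from 120 to 150, an increase of Diff.")
# filled == "Sales rose from 120 to 150, an increase of 30."
```

Picking operands by position breaks down as soon as a sentence contains other numbers, such as years, which is one reason a dependency-based approach like the one described above is needed in practice.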

The third part, the update module, uses the text-data correspondences established by the first two parts to keep the data document consistent with the data automatically. That is, when the data is updated, or when an entity in a sentence of the data document changes, the corresponding calculated values are updated according to the established associations. The researchers also extended automatic updating to simple charts embedded in data documents, using the same principle as for text sentences.

Automatic updates maintain consistency between data and text. On the left, after the data is modified, the value in the corresponding text is updated; the calculation keyword (increased), which no longer applies after the change, is highlighted. On the right, an entity in the sentence changes and the corresponding calculated value is updated in sync.
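The update behavior can be sketched under strong simplifying assumptions. A sentence is stored as a template bound to two table cells, re-rendered whenever the data changes, and its direction keyword is flagged when it no longer matches the data. The class `BoundSentence`, its fields, and the toy table are hypothetical and do not reflect CrossData's internal representation.

```python
from dataclasses import dataclass

@dataclass
class BoundSentence:
    template: str  # text with {row}, {keyword}, {old}, {new}, {diff} slots
    row: str       # row label of the bound data item
    old_col: str   # column holding the earlier value
    new_col: str   # column holding the later value
    keyword: str   # direction word the author wrote, e.g. "increased"

    def render(self, table):
        """Return the refreshed text and whether the keyword is stale."""
        old = table[self.row][self.old_col]
        new = table[self.row][self.new_col]
        text = self.template.format(
            row=self.row, keyword=self.keyword,
            old=old, new=new, diff=abs(new - old),
        )
        expected = "increased" if new >= old else "decreased"
        return text, self.keyword != expected

table = {"Apple": {"2020": 120, "2021": 150}}
sentence = BoundSentence(
    "{row} sales {keyword} by {diff}, from {old} to {new}.",
    "Apple", "2020", "2021", "increased",
)

before = sentence.render(table)  # keyword still matches the data
table["Apple"]["2021"] = 100     # the underlying data is edited
after = sentence.render(table)   # numbers refresh; keyword is flagged
```

After the edit, the rendered numbers change to 20 and 100, while the second element of `after` becomes `True`. That corresponds to highlighting "increased" for the author to review rather than rewriting the word silently.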

The researchers evaluated the system in two experiments. First, they tested the effectiveness of the automated recommendation algorithm on sentences excerpted from real data analysis reports, such as those from the WHO. They found that for the 529 dependent entity recommendations, 88.8% of the entities appeared in the system's top 5 recommendations. The failures had three main causes: lack of context (for example, when two adjacent sentences are linked, a sentence analyzed in isolation is missing information), numbers expressed in words (such as "about two-fifths" in place of 43%), and calculations not covered by the system.
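For readers unfamiliar with the metric, a top-5 figure like the one above is a hit rate: the share of cases in which the correct entity appears among the first five recommendations. The sketch below shows that calculation on made-up toy data; the function name and the example lists are hypothetical and are not taken from the paper's evaluation.

```python
def top_k_hit_rate(ranked_lists, gold, k=5):
    """Fraction of cases whose correct entity is within the top k
    recommendations. ranked_lists[i] is the ranked list for case i
    and gold[i] is the correct entity for that case."""
    if not gold:
        raise ValueError("at least one test case is required")
    hits = sum(
        1 for ranked, target in zip(ranked_lists, gold)
        if target in ranked[:k]
    )
    return hits / len(gold)

# Toy data: the third case ranks the correct entity sixth, so it misses.
ranked = [
    ["a", "b", "c"],
    ["x", "y", "z"],
    ["p1", "p2", "p3", "p4", "p5", "target"],
    ["m", "n"],
]
gold = ["b", "x", "target", "n"]
rate = top_k_hit_rate(ranked, gold)  # 3 of 4 cases hit
```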

The researchers also conducted a user study. The 8 data analysis experts who took part generally agreed that the system was useful and easy to learn and that it helped with their data analysis. One expert expressed concern about how easy placeholders are to understand and about the cost of learning them. The experts also offered suggestions, such as providing domain-specific computation extension packages so that the system can better adapt to analyses with specific needs.

The paper presents CrossData, a method to aid the writing of data documents, and implements a prototype system. CrossData helps users write data documents quickly by identifying and maintaining the natural relationship between data and documents. It also provides automatic updates that keep the two consistent as the data or text is revised, so users no longer need to update values and correct errors by hand.

References:

[1] Chen, Zhutian, and Haijun Xia. “CrossData: Leveraging Text-Data Connections for Authoring Data Documents.” CHI Conference on Human Factors in Computing Systems. No. 410, pp. 1–15, 2022.

