To Explore What Isn’t There — Glyph-based Visualization for Analysis of Missing Values

Original link: http://vis.pku.edu.cn/blog/%E6%8E%A2%E7%B4%A2%E4%BB%80%E4%B9%88%E6%98%AF%E4%B8%8D%E5%AD%98%E5%9C%A8%E7%9A%84-%E5%9F%BA%E4%BA%8E%E5%9B%BE%E5%85%83%E7%9A%84%E7%BC%BA%E5%A4%B1%E5%80%BC%E5%88%86%E6%9E%90%E5%8F%AF%E8%A7%86%E5%8C%96/

Missing values are a common problem in datasets, and analyzing them is often challenging. To address missingness in multivariate data, this paper proposes a glyph-based visualization, MissiG, that presents three missingness patterns more intuitively: amount missing (AM), joint missingness (JM), and conditional missingness (CM). The goal is to help users understand missing values better. User experiments show that MissiG performs better overall than traditional parallel coordinates (PC) and heat maps (HM) on tasks related to these three patterns.

Figure 1 MissiG

To clarify the goal of the visualization, the authors first surveyed existing classifications of missingness patterns and chose the patterns defined by Fernstad [2] as the basis of this work. That classification distinguishes three patterns: (1) Amount Missing (AM): the relative amount of missing values in a variable; (2) Joint Missingness (JM): the relative number of records in which values are missing in multiple variables at the same time; (3) Conditional Missingness (CM): the relationship between missing values in one variable and the values of other, non-missing variables. Existing missing-data visualizations usually focus on how to represent missing values, or simply use data quality descriptors to summarize missingness, but pay little attention to analyzing patterns in the missing values. Supporting that analysis is exactly the purpose of this paper.
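The three patterns can be made concrete with a few lines of NumPy. This is only a minimal sketch, not the authors' implementation: it assumes missing values are stored as NaN in a records-by-variables array, it assumes JM is normalised by the total number of records, and the function names are hypothetical.

```python
import numpy as np

def amount_missing(data):
    """AM per variable: fraction of records whose value is missing (NaN)."""
    return np.isnan(data).mean(axis=0)

def joint_missing(data, i, j):
    """JM for variables i and j: fraction of records missing in both.
    Normalising by the total record count is an assumption of this sketch."""
    mask = np.isnan(data)
    return float((mask[:, i] & mask[:, j]).mean())

def conditional_values(data, missing_var, other_var):
    """Observed values of other_var, split by whether missing_var is missing.
    CM shows up as a difference between the two returned samples."""
    miss = np.isnan(data[:, missing_var])
    other = data[:, other_var]
    observed = ~np.isnan(other)
    return other[miss & observed], other[~miss & observed]

# Toy dataset: 4 records, 3 variables, NaN marks a missing value.
data = np.array([
    [1.0,    np.nan, 0.2],
    [2.0,    np.nan, np.nan],
    [3.0,    5.0,    0.9],
    [np.nan, 6.0,    0.8],
])

am = amount_missing(data)                        # AM of each variable
jm = joint_missing(data, 1, 2)                   # variables 1 and 2 missing together
when_missing, when_present = conditional_values(data, 1, 2)
```

In the toy data, variable 2 is low (0.2) in the one observed record where variable 1 is missing and high (0.9, 0.8) where it is present, which is the kind of relationship CM describes.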

This paper proposes a new visualization method, MissiG, which uses one glyph to represent each variable. In each glyph, the blue bar in the right half represents the missing values, and its height relative to the entire glyph encodes the relative amount of missing values. The gray histogram in the left half shows the distribution of the non-missing values. When a variable is selected, the records in which it is missing are highlighted in red: the red highlights in the glyphs of the other variables show how the selected variable's missing records are distributed over the missing and non-missing values of those variables. In addition, connecting lines are drawn between the selected glyph and the other glyphs, and their thickness encodes the joint missingness (JM) between the connected variables.
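As a rough illustration of the encoding described above, the quantities that one glyph would need can be derived from the data as follows. This is a sketch under assumptions: NaN marks a missing value, the bin count and the returned field names are hypothetical, and the actual drawing is left out.

```python
import numpy as np

def glyph_data(data, var, selected, bins=4):
    """Quantities for the glyph of `var` while `selected` is the selected variable."""
    mask = np.isnan(data)
    col = data[:, var]
    present = ~mask[:, var]          # records where var has a value
    sel_missing = mask[:, selected]  # records where the selected variable is missing

    # Shared bin edges so the gray and red histograms are comparable.
    edges = np.histogram_bin_edges(col[present], bins=bins)
    gray, _ = np.histogram(col[present], bins=edges)
    red, _ = np.histogram(col[present & sel_missing], bins=edges)

    return {
        "blue_height": float(mask[:, var].mean()),  # AM: blue bar vs. whole glyph
        "gray_hist": gray,                          # distribution of non-missing values
        "red_hist": red,                            # same bins, selected variable missing
        # Records missing in both variables: the red part of the blue bar,
        # and in this sketch also the thickness of the connecting line.
        "joint_count": int((mask[:, var] & sel_missing).sum()),
    }

# Toy dataset: 6 records, 2 variables.
data = np.array([
    [0.0,    1.0],
    [1.0,    np.nan],
    [2.0,    np.nan],
    [3.0,    4.0],
    [np.nan, np.nan],
    [np.nan, 2.0],
])
g = glyph_data(data, var=0, selected=1, bins=2)
```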

MissiG can clearly show the three missingness patterns described above. AM is the height of the blue bar relative to the entire glyph. JM is the height of the red bar relative to the blue bar: the higher it is, the stronger the JM between the selected variable and the glyph's variable. CM is read from how similar the gray and red histograms are in shape: the lower the similarity, the stronger the CM. For example, in Figure 1 the red and gray histograms of x3 differ considerably in shape, which makes it clear that x3 tends to take lower values when x5 is missing.
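In MissiG the comparison of histogram shapes is made by eye. If a number is wanted, one possible stand-in is the total variation distance between the two normalised histograms. This metric is an assumption of this sketch, not something taken from the paper, and the function name is hypothetical.

```python
import numpy as np

def shape_dissimilarity(gray_hist, red_hist):
    """Total variation distance between two histograms over the same bins.
    0 means identical shape, 1 means no overlap at all."""
    gray = np.asarray(gray_hist, dtype=float)
    red = np.asarray(red_hist, dtype=float)
    if gray.sum() == 0 or red.sum() == 0:
        return 0.0  # nothing to compare; treated as "no evidence of CM" here
    p = gray / gray.sum()  # normalise so only the shape matters
    q = red / red.sum()
    return 0.5 * float(np.abs(p - q).sum())

same_shape = shape_dissimilarity([2, 2], [1, 1])
skewed_low = shape_dissimilarity([5, 5, 5, 5], [4, 0, 0, 0])
```

A red histogram concentrated in the lowest bin, as in the x3 example, gives a high value, while a red histogram that is a scaled copy of the gray one gives zero.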

Figure 2 Radial layout

Since each glyph is relatively independent, the layout of the glyphs is very flexible. They can be aligned on a horizontal line, as in Figure 1, to make it easy to compare the relative amount of missing values across variables, or arranged in a radial layout, as in Figure 2, to better compare the relationship between the selected variable and the other variables. This flexibility also means that MissiG can be attached to other high-dimensional data visualizations to strengthen their ability to analyze missing values.
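The two layouts differ only in where the glyph centres are placed. The sketch below assumes that the radial layout puts the selected glyph at the origin and spaces the others evenly on a circle around it; the spacing, the radius, and the function names are hypothetical.

```python
import math

def linear_layout(n, spacing=1.0):
    """Glyph centres on a horizontal line, left to right."""
    return [(i * spacing, 0.0) for i in range(n)]

def radial_layout(n_others, radius=1.0):
    """Centres of the non-selected glyphs, evenly spaced on a circle.
    The selected glyph is assumed to sit at the origin (0, 0)."""
    if n_others == 0:
        return []
    step = 2 * math.pi / n_others
    return [(radius * math.cos(k * step), radius * math.sin(k * step))
            for k in range(n_others)]

line = linear_layout(3)
ring = radial_layout(4)
```

With every other glyph at the same distance from the centre, the connecting lines to the selected glyph all have the same length, so their thickness is easier to compare.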

Figure 3 Parallel coordinates (PC) and heat map (HM) enhanced with MissiG

In the user experiments, the authors compare six visualization methods: MissiG-R (MissiG in radial layout), MissiG-L (MissiG in linear layout), PC, PC+MissiG, HM, and HM+MissiG. Two user experiments were conducted. In the first, the authors selected five of the methods (excluding MissiG-R), each with the necessary interaction. For each method, one question was posed for each of the three missingness patterns, and each question was applied to three different variables (or groups of variables), giving three specific questions per pattern. The evaluation criteria were the users' response time and accuracy, together with their feedback. In the second experiment, the authors used all six methods but removed the interaction, because they believed that differences in interaction between the visualizations might affect the results; for each visualization, one specific question was posed for each of the three missingness patterns. The evaluation criteria were the same as in the first experiment. For the hypotheses, which were based on the results of [2], the experimental results are as follows ("Partial" means that the confidence intervals largely support the hypothesis but the significance is not strong enough):

The experimental results show that for AM and JM tasks, MissiG outperforms PC and is as good as HM; for CM tasks, MissiG performs as well as PC and may outperform HM. Some questions about the current results remain open, however: whether the changed conditions in the second experiment affected the results; whether PC and HM enhanced with MissiG impose a greater cognitive load because they use two views; and whether users need more time to explore MissiG because it is a completely new visualization method. Any of these could bias the experimental results.

References:

[1] S. J. Fernstad and J. J. Westberg, “To Explore What Isn’t There—Glyph-Based Visualization for Analysis of Missing Values,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 10, pp. 3513–3529, Oct. 2022.

[2] S. J. Fernstad, “To identify what isn’t there: A definition of missingness patterns and evaluation of missing value visualization,” Information Visualization, vol. 18, no. 2, pp. 230–250, 2019.
