Integrated Dual Analysis of Quantitative and Qualitative High-Dimensional Data

Original link: http://vis.pku.edu.cn/blog/integrated-dual-analysis-of-quantitative-and-qualitative-high-dimensional-data/

Dual space analysis is an advanced method for analyzing high-dimensional data. It works with two linked spaces, a dimension space and a data item space. The user’s operations in one space are reflected in the other, so that users can explore both at the same time and jointly study the structure of the dimension space and the distribution of the data item space (Figure 1). However, previous work does not treat quantitative and categorical dimensions equally: the latter are usually used only to define subsets of data items, which can cause interesting patterns to be overlooked. To address this limitation, the authors propose two statistical measures that can describe both quantitative and categorical data, extending the existing dual space analysis framework, and develop a prototype system to help users perform joint exploratory analysis [1].

Figure 1. Dual space analysis framework.

The two proposed statistical measures are variability and modality. Variability captures the spread of the data and is measured in two ways: diversity and variation around the mean. The former describes the differences between data items, and the latter describes how the data spread around their central tendency. Diversity is defined as the dissimilarity coefficient, which is the proportion of data item pairs in the sample that are dissimilar. Because a quantitative value can fall anywhere within a range, while a categorical value can only be one of a few categories, the dissimilarity of a quantitative dimension is generally higher than that of a categorical dimension. To compensate, the authors introduce a looser interpretation for quantitative dimensions: a pair of data items is considered “dissimilar” only if the two values differ by more than a set threshold. For variation around the mean, variance and standard deviation are the most familiar measures for quantitative dimensions. For categorical dimensions, the authors use analogous measures: one that corresponds to variance and one, stDev, that corresponds to standard deviation.
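The dissimilarity coefficient described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dissimilarity`, the `threshold` parameter, and the normalization over unordered pairs are choices made here, not details taken from the paper.

```python
from itertools import combinations


def dissimilarity(values, threshold=None):
    """Proportion of data item pairs that count as dissimilar.

    threshold=None: categorical reading, a pair is dissimilar if the values differ.
    threshold=t:    quantitative reading, a pair is dissimilar only if |a - b| > t.
    Assumption: unordered pairs of distinct items are used as the denominator.
    """
    pairs = list(combinations(values, 2))
    if not pairs:
        return 0.0
    if threshold is None:
        dissimilar = sum(1 for a, b in pairs if a != b)
    else:
        dissimilar = sum(1 for a, b in pairs if abs(a - b) > threshold)
    return dissimilar / len(pairs)


# Categorical dimension: 4 of the 6 pairs differ.
categorical = dissimilarity(["a", "a", "b", "b"])

# Quantitative dimension: every pair differs under strict equality,
# but only 2 of the 3 pairs differ by more than the threshold.
strict = dissimilarity([1.0, 1.1, 5.0])
loose = dissimilarity([1.0, 1.1, 5.0], threshold=0.5)
```

The last two calls show why the looser interpretation matters: with strict equality almost every pair of quantitative values is dissimilar, so the coefficient says little about the dimension.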

Modality serves as a reliable measure of central tendency for both quantitative and categorical dimensions, and analyzing the modes helps users understand the shape of the data distribution. For simplicity, the authors focus on the number of modes. For quantitative dimensions, they apply kernel density estimation and take the number of local maxima as the number of modes. For categorical dimensions, they count the high-frequency categories, that is, the categories whose frequency reaches a threshold set to the highest frequency minus a predefined percentage.
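A minimal sketch of both mode counts follows. It makes several assumptions that the summary does not specify: a Gaussian kernel with a caller-supplied `bandwidth`, evaluation on a regular grid, and a categorical threshold in which the percentage is taken relative to the highest frequency. The function names are hypothetical.

```python
import math
from collections import Counter


def count_modes_quantitative(values, bandwidth, grid_size=200):
    """Count local maxima of a Gaussian kernel density estimate on a regular grid.

    The density is left unnormalized, which does not change where the maxima are.
    """
    lo = min(values) - 3 * bandwidth
    hi = max(values) + 3 * bandwidth
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    density = [
        sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]
    return sum(
        1
        for i in range(1, grid_size - 1)
        if density[i] > density[i - 1] and density[i] >= density[i + 1]
    )


def count_modes_categorical(values, percentage=0.2):
    """Count categories whose frequency reaches the high-frequency threshold.

    Assumption: threshold = highest frequency minus `percentage` of that frequency.
    """
    counts = Counter(values)
    top = max(counts.values())
    cutoff = top - percentage * top
    return sum(1 for c in counts.values() if c >= cutoff)


two_clusters = [0.9, 1.0, 1.05, 1.1, 4.9, 5.0, 5.05, 5.1]
quantitative_modes = count_modes_quantitative(two_clusters, bandwidth=0.3)

categories = ["a"] * 10 + ["b"] * 9 + ["c"] * 3
categorical_modes = count_modes_categorical(categories, percentage=0.2)

# An identifier-like dimension: every value is unique, so every category is a mode.
identifier_modes = count_modes_categorical(["id1", "id2", "id3"])
```

The result for quantitative data depends strongly on the bandwidth, so in practice it would need to be chosen per dimension or by a standard bandwidth rule.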

The authors develop a prototype interface for the analysis of quantitative and categorical dimensions in dual spaces. As shown in Figure 2, the interface is divided into three parts. Parts A and C are used for dimension exploration, and Part B is used for data item exploration. In Part B, data items are visualized with parallel coordinates, where missing values are shown at the bottom of each axis. The user can select a subset of data items by brushing in the parallel coordinates or by adjusting the slider in B2. To cope with a large number of dimensions, the authors introduce carousel-inspired parallel coordinates, which can be navigated via a donut-chart-based icon in the lower right corner. The currently displayed dimensions are shown in dark gray, and the first dimension from the left is additionally outlined in black. Dimensions with a subset selection applied turn purple for quick identification.

Figure 2. Interface for dual space analysis.

In Part A, dimensions are visualized in scatter plots whose axes are the statistical measures introduced earlier. Different types of dimensions are encoded with different colors. Although dates are read as quantitative data, they are given a separate color because they may be particularly useful for defining subsets. Opacity indicates the relative frequency of missing data. When the mouse hovers over a dimension, its detailed statistical measures are displayed in a radar chart. Overlapping dimensions are represented by an outlined circle, and hovering over it reveals the exact dimensions it contains. After a subset of data items is selected, changes in the statistical measures are represented by deviation lines, which point from the original value of a measure to its value for the current subset. Part C also visualizes the dimension space as a scatter plot. After a subset is selected, its Y axis represents the dimension type and its X axis represents the overall deviation of the dimension. Overall deviation refers to the sum of the changes in all statistical measures.
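The overall deviation used on the X axis of Part C can be illustrated as follows. This sketch assumes that the changes are summed as absolute differences and that the measures have already been scaled to comparable ranges; neither detail is stated in this summary. The names `overall_deviation` and `rank_dimensions`, and the example numbers, are hypothetical.

```python
def overall_deviation(before, after):
    """Sum of absolute changes across the statistical measures of one dimension.

    `before` and `after` map measure names to values for the full dataset
    and for the selected subset, respectively.
    """
    return sum(abs(after[name] - before[name]) for name in before)


def rank_dimensions(full_stats, subset_stats):
    """Return dimension names ordered from largest to smallest overall deviation."""
    return sorted(
        full_stats,
        key=lambda dim: overall_deviation(full_stats[dim], subset_stats[dim]),
        reverse=True,
    )


# Made-up measures for two dimensions, before and after a subset selection.
full_stats = {
    "age": {"dissimilarity": 0.5, "modes": 2.0, "std": 0.25},
    "sex": {"dissimilarity": 0.5, "modes": 2.0, "std": 0.5},
}
subset_stats = {
    "age": {"dissimilarity": 0.5, "modes": 2.0, "std": 0.5},
    "sex": {"dissimilarity": 0.25, "modes": 1.0, "std": 0.25},
}

age_deviation = overall_deviation(full_stats["age"], subset_stats["age"])
sex_deviation = overall_deviation(full_stats["sex"], subset_stats["sex"])
ranking = rank_dimensions(full_stats, subset_stats)
```

Sorting dimensions by this value is one way to surface the dimensions that a subset selection affects most.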

The authors conduct a case study to demonstrate the usability of the proposed method. They use a clinical dataset of 307 patients with cerebral small vessel disease (CSVD). The dataset contains 193 dimensions that describe demographic information, genetic data, educational background, and more.

First, they use the statistical measures to get an overview of all dimensions. Specifically, they investigate the relationship between the number of modes and the dissimilarity coefficient, as well as the relationship between standard deviation and variance (Figure 3). Looking at the number of modes, they observe four dimensions with a notably high number of modes: pathological findings and three identifiers. Since pathological findings are recorded as free text and identifiers are naturally unique, these dimensions contain almost exclusively unique values. This explanation for the large number of modes is also confirmed by the high dissimilarity of these dimensions. In the plot of standard deviation against variance, four dimensions with high standard deviation and variance form a cluster. These dimensions are free text or dimensions with a wide range of options.

Figure 3. Overview of the analyzed clinical dataset.

The authors also examine the possible impact of education on CSVD (Figure 4). For some high-risk patients, a higher level of education may mitigate the cognitive impact of CSVD pathology. They select patients with 15 or more years of education. By examining the overall deviation, they quickly identify the seven dimensions that are most affected by the selection of the educated subset. They then investigate the changes in the standard deviation and the dissimilarity coefficient of these dimensions. The radar charts show a large change in sex and increasing dissimilarity for Ab 1-40, Ab 1-42, total protein, and total tau, while the standard deviation does not change much. Although changes in SWI and Diagnose were observed, they were ignored because these dimensions did not contain any values after subset selection. As can be seen from the parallel coordinates, higher education levels correspond to reductions in Ab 1-40 and Ab 1-42, suggesting that higher education provides resilience to disorders of amyloid metabolism and Alzheimer’s disease. Higher education levels also correspond to reductions in total protein and total tau, which can be taken as an indicator of fewer pathological changes in the brain and stronger resistance.

Figure 4. Analysis of the effect of education level on CSVD.

Overall, the authors extend the dual space analysis framework previously explored for high-dimensional data with a method for jointly analyzing quantitative and categorical dimensions. They do so by proposing statistical measures that can describe both types of dimensions. Furthermore, retaining data items with missing values plays a significant role in visualizing the uncertainty of a given dataset.

References

1. Juliane Müller, Laura Garrison, Philipp Ulbrich, Stefanie Schreiber, Stefan Bruckner, Helwig Hauser, and Steffen Oeltze-Jafra. Integrated Dual Analysis of Quantitative and Qualitative High-Dimensional Data. IEEE Transactions on Visualization and Computer Graphics, 27(6): 2953-2966, 2021.
