Original link: http://vis.pku.edu.cn/blog/linset-zip/
Based on the existing Linear Diagram, this work uses a certain strategy to compress multiple disjoint sets into one line to make the layout more compact, thereby improving the space efficiency of the view and improving the efficiency of comparison between sets.
background
Linear Diagram [2] is a classic collection visualization method. In Linear Diagram, each row represents a collection, and each overlap (as shown in the figure below, referred to as a column in the following description) represents an aggregation of a type of element, and the width of the column encodes the number of elements in the aggregation. The intersection of the segment of the column and the row indicates that the element in the corresponding aggregate belongs to the set represented by the segment.
Figure 1 Linear Diagram. The red dotted frame and the text next to it in the figure are comments added later, not elements on the original view.
The representation method of Linear Diagram is very intuitive and highly scalable, but there are also some problems:
- In the row space, when the intersection between sets is relatively sparse, there will be a lot of blank space, and the utilization rate of the row space will be very low;
- In column space, comparing two collections that are far apart can be difficult.
algorithm
LinSet.zip solves these problems by improving Linear Diagram. By compressing multiple disjoint collections into one line with a certain strategy, the layout is made more compact, thereby improving the space efficiency of the view and improving the efficiency of comparison between collections. The overall algorithm flow is as follows:
Figure 2 Schematic diagram of algorithm flow
(1) Column sorting algorithm (a)→(b). Reduce the number of segments in each collection by sorting the columns.
In fact, there is also this part of the algorithm in Linear Diagram. Linear Diagram uses the greedy method. First, take the column that contains the most sets, and then select the column with the shortest distance from the current leftmost (right) side column from the remaining columns. Columns are joined on the far left (right) side. The distance calculation method is shown in the figure below: first convert the column into a 0-1 vector, and then calculate the Manhattan distance between the two vectors.
Figure 3 Column distance calculation demonstration
In fact, this problem can be abstracted into a traveling peddler problem (TSP). By constructing an auxiliary node with all 1s as the starting and ending positions, it is solved how to pass through the corresponding nodes of all columns so that the sum of the distances is the shortest. As a classic problem, TSP has many precise algorithms and heuristic algorithms. The difference in the effect of the two algorithms applied here will be mentioned later in the evaluation section.
(2) Line compression algorithm (b) → (c). Make the layout more compact by taking appropriate strategies to put disjoint sets in a row.
There are mainly 3 strategies used here:
- Γ1: It is enough that the sets in the same row are disjoint;
- Γ2: On the basis of Γ1, the segments in any two sets cannot appear alternately;
- Γ3: On the basis of Γ1, the segments of two sets are allowed to appear alternately, but three or more segments cannot alternate with each other.
The schematic diagram is as follows:
Figure 4 Schematic diagram of different compression strategies
To solve Γ1, the problem can be transformed into a graph coloring problem (nodes of the same color cannot be connected), the set is abstracted into nodes in the graph, and the intersection of sets is abstracted into edges in the graph, so how to assign the sets to different The row becomes a matter of how to assign different colors to the nodes. The solution can be achieved by related precise algorithms and heuristic algorithms for graph coloring.
Fig.5 Schematic diagram of transforming Γ1 into graph coloring problem
Solving Γ2 and Γ3 is also very similar. For Γ2, it is only necessary to check all the alternations and add an edge between the two sets where the alternation occurs (indicating that the nodes corresponding to the two sets cannot be assigned the same color, and thus the two sets cannot be assigned to the same row); for Γ3, it is necessary to check all set triples to determine whether a new edge needs to be added.
(3) Color assignment. Assign a color to each collection.
Assigning colors is not a simple matter. Here it is required that (a) the colors of each collection in the same row cannot be the same; (b) the same color can be used between collections in different rows (if the color is not enough). If you want to minimize the number of colors used, it is actually an NP-hard problem that takes a lot to solve. The author finally adopted the method of cyclically assigning colors, which is not much different from the effect of the heuristic algorithm for the above problem.
(4) Render the view.
For Γ2 and Γ3, lines are added between the segments to indicate that these segments belong to the same set (see Figure 4(c) and (d)).
effect evaluation
The evaluation of LinSet.zip is mainly in the following aspects:
- Time Efficiency of Exact Algorithms
- Compare the heuristic algorithm and the exact algorithm in terms of the number of final segments and the compression ratio
- Exploring the compression effect on real data sets
(1) Time efficiency of precise algorithm
In the column ordering algorithm:
(1) The time does not increase exponentially with the expansion of the data scale;
(2) There are only two over 300s.
In the row compression algorithm:
(1) Γ1 (B=+∞) has the shortest time;
(2) The function curves are very similar;
(3) The case of less than 100 sets can be controlled within 5 seconds.
Figure 6 Time efficiency of exact algorithm
(2) Compare the heuristic algorithm and the precise algorithm in terms of the number of final segments and the compression ratio
Heuristic algorithms have limited performance compared to exact algorithms.
Figure 7 Comparison of heuristic and exact algorithms
(3) Research on the compression effect of real data sets
Γ1 < Γ3 < Γ2 < 0.4
Figure 8 Performance on real datasets
user experiment
Two element-related questions and three set-related questions are designed, and the effects of Linear Diagram and LinSet.zip under Γ1-Γ3 are compared. The relevant assumptions and final results of the experiment are as follows:
Figure 9 Hypothesis and conclusion of user experiment
evaluate
- Deficiencies mentioned by the author:
- The collection label is currently simply placed on the segment with the largest collection, and the problem of overlapping cannot be avoided;
- Lack of interactive support. For example, rearrange the columns to reduce the number of segments in the selected collection;
- You can try to expand to radial layout.
- Actually, LinSet.zip is also a trade-off
- Enhanced ability to search for element and set intersections (objective of this article)
- Reduced the readability of the number of collections and the size of the collection (the user experiments in this article support this point of view, as shown in the figure below, Γ0 is the best performance of Linear Diagram on Task 3)
references:
[1] M. Wallinger, A. Dobler, and M. Nöllenburg, “LinSets.zip: Compressing Linear Set Diagrams.” arXiv, Feb. 16, 2023. Accessed: Apr. 12, 2023.
[2] P. Rodgers, G. Stapleton, and P. Chapman, “Visualizing Sets with Linear Diagrams,” ACM Trans. Comput.-Hum. Interact. , vol. 22, no. 6, pp. 1–39, Dec. . 2015.
This article is transferred from: http://vis.pku.edu.cn/blog/linset-zip/
This site is only for collection, and the copyright belongs to the original author.