Original link: http://vis.pku.edu.cn/blog/visual_caption/
In everyday conversation, people often bring up topics that are unfamiliar to others. In online meetings such as Zoom, live captions already help participants follow what is being said; building on this idea, this work proposes using visual images to help convey information. Specifically, the authors designed an AI-assisted plug-in [1] for an online meeting platform that recommends visual images in various ways during a conversation. Through the visuals recommended by Visual Captions, users can further clarify their points and content.
Figure 1: The three main contributions of this work.
Figure 1 shows the three main contributions of this work. First, the authors built, through crowdsourcing, a dataset covering more than 1,500 user intents, which is used to fine-tune a language model. Second, the authors predict visual captions with the fine-tuned language model: given the user's preceding spoken text, the model produces structured output. Finally, the authors design an interface for visual captions, in which users can browse candidate visual images and freely choose which ones to display publicly.
The authors first explore the design space of visual augmentation through a formative study. As shown in Figure 2, they summarize eight design dimensions. This work focuses mainly on the immediacy of the visual captions (dimension 1), the content of the visual images (dimension 3), and the degree of automation of the AI recommendations (dimension 7). The plug-in presented in this work generates visual captions in real time.
Figure 2: The eight design dimensions of the visual-augmentation design space.
Based on this design space, and in particular the degree of automation, the authors design three interaction modes: On-demand-suggest (triggered by user request), Auto-suggest (AI recommendation with user confirmation), and Auto-display (fully automatic). In On-demand-suggest mode, the system suggests visual captions only when the user presses the space bar. In Auto-suggest mode, the system automatically recommends visuals, but the user selectively filters which ones to display. In Auto-display mode, the system automatically generates and displays visuals without any action from the user, as illustrated in the sketch below.
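The following sketch illustrates how the three modes could gate when visuals are suggested versus displayed. The mode names come from the paper, but the helper function and print statements are hypothetical stand-ins for the real recommendation pipeline and meeting UI.

```python
# Minimal sketch of the three interaction modes; suggest_visual and the
# print calls are placeholders, not the paper's actual implementation.
from enum import Enum, auto

class Mode(Enum):
    ON_DEMAND_SUGGEST = auto()  # suggest only when the user presses the space bar
    AUTO_SUGGEST = auto()       # AI proposes; the user filters and decides what to share
    AUTO_DISPLAY = auto()       # AI generates and displays visuals without user action

def suggest_visual(utterance: str) -> str:
    return f"<visual for: {utterance}>"   # placeholder for the LM + CLIP pipeline

def handle_utterance(mode: Mode, utterance: str, space_pressed: bool = False) -> None:
    if mode is Mode.ON_DEMAND_SUGGEST and not space_pressed:
        return                                   # stay idle until explicitly requested
    visual = suggest_visual(utterance)
    if mode is Mode.AUTO_DISPLAY:
        print(f"display publicly: {visual}")     # shown to all participants immediately
    else:
        print(f"offer privately: {visual}")      # user browses and chooses what to display
```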
The pipeline for generating visuals is shown in Figure 3. First, the speaker's speech is converted to text, and the two most recent complete utterances that meet the requirements are kept. These two utterances are fed as a prompt into a language model fine-tuned on the VC1.5K dataset, which outputs a structured sentence containing three attributes: the visual content, the visual type, and the visual source. The system then parses this structured output and retrieves candidate images from the corresponding visual source. Finally, a CLIP model selects the visual image that best matches the sentence description as the output.
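As a rough illustration, the sketch below shows how the final ranking step could be implemented with an off-the-shelf CLIP model via Hugging Face Transformers. The library choice, the query string, and the candidate image paths are assumptions for illustration, not details from the paper.

```python
# Minimal sketch of the rank-by-CLIP step, assuming the Hugging Face CLIP
# implementation; the query and image_paths are placeholders for the parsed
# structured output and the images returned by the chosen visual source.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_candidates(query: str, image_paths: list[str]) -> str:
    """Return the candidate image that best matches the textual description."""
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image holds the similarity of each image to the single text query
    best = outputs.logits_per_image.squeeze(-1).argmax().item()
    return image_paths[best]

# e.g. query parsed from the model's structured output:
# best = rank_candidates("an emoji of a smiling face", ["a.png", "b.png", "c.png"])
```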
To ensure that the model outputs structured sentences in the required format, this work constructs the fine-tuning dataset VC1.5K through crowdsourcing. The raw data comes from the subtitles of 42 YouTube videos and from the DailyDialog dialogue corpus. Crowd workers annotate the visual source, visual content, and visual type corresponding to each conversation, producing one user-intent record per example. The dataset contains 1,595 user intents in total.
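For concreteness, a single annotated entry might be represented roughly as below; the field names and the example utterance are hypothetical, illustrating only the three labels the crowd workers provide.

```python
# Hypothetical illustration of one labelled VC1.5K entry; the exact schema
# is not given in this summary.
example_intent = {
    "utterance": "I finally visited the Eiffel Tower last summer.",
    "visual_content": "the Eiffel Tower",
    "visual_type": "photo",
    "visual_source": "image search",
}
```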
Finally, the work conducts a user study to examine how users use visual captions, how the visuals affect communication, and which AI interaction modes users prefer. Twenty participants were recruited and divided into 10 pairs. Each pair completed four prepared dialogues and 10 minutes of free conversation, after which their opinions on the system and its interaction were collected. The study found that visual captions help users understand unfamiliar concepts, reduce ambiguity in language, make known information more intuitive, and make conversations more interesting, all without distracting users from the conversation. Regarding interaction with the AI system, the work found that people's preferences among the three interaction modes differ greatly across scenarios.
[1] Liu, Xingyu "Bruce", et al. "Visual Captions: Augmenting Verbal Communication With On-the-Fly Visuals." Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 2023.