Author: Wei Yake, Liu Xuemin
Vision and hearing are crucial in human communication and scene understanding. Audio-visual learning, which aims to explore the audio-visual modalities in order to mimic human perception, has become a booming field in recent years. This article introduces the recent audio-visual learning survey "Learning in Audio-Visual Context: A Review, Analysis, and New Perspective".
The review first analyzes the cognitive-science basis of the audio-visual modalities, then systematically analyzes and summarizes recent audio-visual learning work (nearly 300 related papers). Finally, it revisits the progress of the field from the perspective of audio-visual scene understanding and discusses potential development directions.
arXiv link: https://ift.tt/arZC6qQ
Project homepage: https://ift.tt/EdK3VPW
Awesome-list link: https://ift.tt/iWxYNBn
1
Introduction
Visual and auditory information are the main sources through which humans perceive the external world. The human brain obtains a holistic understanding of its surroundings by integrating heterogeneous multimodal information. For example, in a cocktail-party scene with multiple speakers, we can better attend to the speech of a speaker of interest with the help of their lip movements. Audio-visual learning is therefore indispensable for exploring human-like machine perception. Several characteristics set the audio-visual modalities apart from others:
1) Cognitive basis. As the two most widely studied senses, the integration of vision and hearing pervades the human nervous system. On the one hand, the importance of these two senses in human perception provides a cognitive basis for machine perception research based on audio-visual data; on the other hand, the interaction and integration of vision and hearing within the nervous system can serve as a basis for promoting audio-visual learning.
2) Multiple consistencies. In daily life, sight and hearing are closely related. As shown in Figure 1, both a dog's bark and its appearance allow us to associate it with the concept of "dog" (semantic consistency). At the same time, we can determine the dog's exact spatial position from either the sound we hear or the sight we see (spatial consistency). And when a dog barks, we can usually see the dog at the same time (temporal consistency). These multiple consistencies between vision and hearing are the basis of audio-visual learning research.
3) Rich data support. The rapid development of mobile terminals and the Internet has prompted more and more people to share videos on public platforms, which reduces the cost of collecting videos. These rich public videos ease the barriers to data acquisition and provide data support for audio-visual learning.
These characteristics of the audio-visual modalities naturally led to the birth of the field of audio-visual learning. In recent years, the field has developed vigorously: researchers are no longer satisfied with simply introducing an additional modality into an originally single-modality task, and have begun to explore and solve new problems and challenges.
However, existing audio-visual learning work is usually task-oriented, focusing on specific audio-visual problems; a comprehensive work that systematically reviews and analyzes developments across the field is still lacking. Therefore, this paper summarizes the current state of audio-visual learning and then looks forward to its potential development directions.
Due to the close connection between audio-visual learning and human perception, this paper first summarizes the cognitive basis of the visual and auditory modalities, and on this basis divides existing audio-visual learning research into three categories:
1) Audio-visual Boosting. Visual and audio data each have a long research history and a wide range of applications. Although unimodal methods have achieved fairly good results, they exploit only partial information about the things of interest; their performance is therefore limited, and they are susceptible to unimodal noise. Researchers therefore introduce an additional modality into these audio or visual tasks, which not only improves model performance by integrating complementary information but also promotes model robustness.
2) Cross-modal Perception. Because of the consistency between visual and auditory information, humans can picture a related scene when hearing a sound, and imagine a matching sound when seeing a picture. This consistency provides a basis for machines to transfer knowledge across modalities or to generate the data of one modality from the information of another. Therefore, many studies have been devoted to exploring cross-modal perception capabilities and have achieved remarkable results.
3) Audio-visual Collaboration. In addition to fusing signals from different modalities, there are higher-level inter-modal interactions in cortical regions of the human brain to achieve deeper scene understanding. Therefore, human-like perception needs to be explored for the collaboration of audio-visual modalities. To achieve this goal, many studies have proposed more challenging scene understanding problems in recent years, which have gained widespread attention.
Figure 1: Overview of AV Consistency and AV Learning Fields
The consistency between the audio-visual modalities, covering the semantic, spatial, and temporal aspects, makes the above audio-visual research feasible. Therefore, after summarizing recent work, this paper analyzes these multiple audio-visual consistencies. In addition, it reviews the progress of the field from a new perspective of audio-visual scene understanding.
2
Audiovisual Cognition Basics
Vision and hearing are the two core senses for human scene understanding. This chapter summarizes the neural pathways of the visual and auditory senses and the integration of audiovisual modalities in cognitive neuroscience, laying the foundation for subsequent discussions of research in the field of audiovisual learning.
2.1 Visual and auditory neural pathways
Vision is the most widely studied sense, and some views even argue that it dominates human perception. Correspondingly, the neural pathways of vision are also more complex. Reflected light from objects carries visual information, which activates numerous photoreceptors (about 260 million) on the retina. The output of the photoreceptors is sent to the ganglion cells (about 2 million), a process that compresses the visual information. Then, after processing by cells in the lateral geniculate nucleus, the visual information finally reaches the visual-related areas of the cerebral cortex. The visual cortex is an assemblage of functionally distinct regions whose neurons have different visual preferences. For example, neurons in V4 and V5 are sensitive to color and motion, respectively.
In addition to vision, hearing is also an important sense for observing the surrounding environment. Not only does it alert humans to risks (for example, taking action upon hearing the call of a beast), it is also the basis for people to communicate with each other. Sound waves are converted into neural signals in the inner ear. The auditory information is then conveyed through the cochlear nucleus and inferior colliculus of the brainstem. After processing in the medial geniculate nucleus of the thalamus, sounds are ultimately encoded in the primary auditory cortex. The brain then uses the cues contained in the auditory information, such as frequency and timbre, to determine the identity of the sound source. At the same time, the differences in intensity and arrival time between the two ears provide cues to the location of the sound source, which is known as the binaural effect. In practice, human perception often combines multiple senses, especially hearing and vision, which is called multi-channel perception.
2.2 Audiovisual integration in cognitive neuroscience
Each sense provides unique information about its surroundings. Although the information received by the various senses is different, the resulting representation of the environment is a unified experience rather than disparate sensations.
A representative example is the McGurk effect: when semantically inconsistent visual and auditory signals are presented together, a single fused semantic percept is produced. Such phenomena suggest that in human perception, signals from multiple senses are often integrated. In particular, the intersection of the auditory and visual neural pathways combines information from two important human senses, improving the sensitivity and accuracy of perception. For example, visual information related to a sound can improve the efficiency of searching auditory space.
These perceptual phenomena combining multiple sensory information have attracted attention in the field of cognitive neuroscience. A well-studied multichannel sensory area in the human nervous system is the superior colliculus. Many neurons in the superior colliculus have multisensory properties and can be activated by information from sight, hearing, and even touch. This multisensory response tends to be stronger than a single response. The superior temporal sulcus in the cortex is another representative area.
Based on studies in monkeys, it has been observed to connect with multiple senses, including vision, hearing, and somatosensation. More brain regions, including the parietal and frontal lobes and the hippocampus, exhibit similar multi-channel perception phenomena. From studies of multi-channel perception, several key findings can be observed:
1) Multimodal promotion. As noted above, many neurons can respond to fused signals from multiple senses, and when stimuli from a single sense are weak, this enhanced response is more reliable than a single-modal response.
2) Cross-modal plasticity. This phenomenon refers to the fact that deprivation of one sense can affect the development of its corresponding cortical area. For example, it is possible that the auditory-related cortex of deaf people is activated by visual stimuli.
3) Multimodal collaboration. Signals from different senses are more complexly integrated in cortical areas. The researchers found that there are modules in the cerebral cortex that have the ability to integrate multisensory information in a collaborative manner to build awareness and cognition.
Inspired by human cognition, researchers have begun to study how to achieve human-like audio-visual perception capabilities, and more audio-visual research has gradually emerged in recent years.
3
Audio-visual Boosting
Although each modality itself contains sufficient information for learning, and many tasks are based on unimodal data, unimodal data observes only part of the information and is sensitive to unimodal noise (for example, visual information is affected by factors such as lighting and viewing angle). Therefore, inspired by the multimodal promotion phenomenon in human cognition, researchers introduce additional visual (or audio) data into the original single-modality task to improve performance. We divide the related tasks into two parts: recognition and enhancement.
Single-modal recognition tasks have been extensively studied in the past, such as audio-based speech recognition and vision-based action recognition. However, unimodal data only observes part of the information and is susceptible to unimodal noise. Thus, the audio-visual recognition task that integrates multimodal data to facilitate the ability and robustness of models has attracted attention in recent years and covers multiple aspects such as speech recognition, speaker recognition, action recognition, and emotion recognition.
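As a concrete illustration of how complementary audio-visual information can be integrated for recognition, the sketch below shows a minimal late-fusion classifier in PyTorch. It assumes pre-extracted audio and visual features; the module names, dimensions, and fusion scheme are illustrative choices, not the specific design of any surveyed method.

```python
# A minimal late-fusion sketch for audio-visual recognition, assuming
# pre-extracted audio features (e.g., pooled log-mel statistics) and visual
# features (e.g., pooled frame embeddings). All dimensions are illustrative.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, hidden=256, num_classes=10):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_net = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        # Concatenate the two modality embeddings and classify jointly.
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, audio_feat, visual_feat):
        a = self.audio_net(audio_feat)        # (B, hidden)
        v = self.visual_net(visual_feat)      # (B, hidden)
        return self.head(torch.cat([a, v], dim=-1))  # (B, num_classes)

# Toy usage with random tensors standing in for a real feature pipeline.
model = LateFusionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 10])
```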
The consistency of the audio-visual modalities not only provides the basis for multimodal recognition tasks, but also makes it possible to enhance the signal of one modality with the help of the other. For example, multiple speakers are visually distinct, so the visual information of a speaker can be used to aid speech separation. In addition, audio can provide identity cues such as gender and age for reconstructing masked or missing facial information of a speaker. These observations have inspired researchers to use information from the other modality for denoising or enhancement, such as speech enhancement, sound source separation, and face super-resolution and reconstruction.
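To make the idea of visually guided enhancement concrete, the sketch below conditions a time-frequency mask for a target speaker on per-frame lip/face embeddings. The architecture, feature dimensions, and naming are assumptions for illustration only; real speech separation systems are considerably more elaborate.

```python
# A minimal sketch of visually guided speech separation: a speaker's per-frame
# visual embedding conditions a mask predicted over the mixture spectrogram.
import torch
import torch.nn as nn

class VisualGuidedSeparator(nn.Module):
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        self.audio_rnn = nn.GRU(n_freq, hidden, batch_first=True)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # Predict a (0, 1) mask for the target speaker at each time-frequency bin.
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, lip_feat):
        # mix_spec: (B, T, F) magnitude spectrogram of the mixture
        # lip_feat: (B, T, visual_dim) per-frame lip/face embeddings
        a, _ = self.audio_rnn(mix_spec)                   # (B, T, hidden)
        v = self.visual_proj(lip_feat)                    # (B, T, hidden)
        mask = self.mask_head(torch.cat([a, v], dim=-1))  # (B, T, F)
        return mask * mix_spec                            # separated magnitude

sep = VisualGuidedSeparator()
out = sep(torch.rand(2, 100, 257), torch.randn(2, 100, 512))
print(out.shape)  # torch.Size([2, 100, 257])
```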
Figure 2: Audio-visual boosting tasks
4
Cross-modal Perception
The phenomenon of cross-modal plasticity in cognitive neuroscience, together with the consistency between the audio-visual modalities, has facilitated the study of cross-modal perception, which aims to learn and establish associations between the audio and visual modalities, giving rise to tasks such as cross-modal generation, transfer, and retrieval.
Humans can predict information in one modality under the guidance of another. For example, without hearing the sound, we can roughly infer what a person is saying from the visual information of their lip movements. The semantic, spatial, and temporal consistency between audio and vision makes it possible for machines to possess human-like cross-modal generative capabilities. Cross-modal generation tasks currently cover many aspects, including mono audio generation, stereo audio generation, video/image generation, and depth estimation.
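As a toy illustration of cross-modal generation, the following sketch maps per-frame visual features of a silent video to a mel-spectrogram sequence. The encoder-decoder choice and all dimensions are assumptions; practical systems usually add adversarial objectives or a vocoder stage to synthesize waveforms.

```python
# A minimal sketch of cross-modal generation: predicting mel-spectrogram
# frames from silent-video features. Purely illustrative, not a real system.
import torch
import torch.nn as nn

class VideoToAudioGenerator(nn.Module):
    def __init__(self, visual_dim=512, hidden=256, n_mels=80):
        super().__init__()
        self.encoder = nn.GRU(visual_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_mels)

    def forward(self, frame_feats):
        # frame_feats: (B, T, visual_dim) per-frame visual embeddings
        h, _ = self.encoder(frame_feats)   # (B, T, hidden)
        return self.decoder(h)             # (B, T, n_mels) predicted mel frames

gen = VideoToAudioGenerator()
mel = gen(torch.randn(2, 50, 512))
print(mel.shape)  # torch.Size([2, 50, 80])
```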
In addition to cross-modal generation, the semantic consistency between the audio-visual modalities suggests that learning in one modality can be aided by semantic information from the other, which is the goal of audio-visual transfer tasks. This semantic consistency also facilitates the development of cross-modal retrieval tasks.
Figure 3: Cross-modal perception related tasks
5
Audio-visual Collaboration
The human brain integrates the audio-visual information it receives from a scene so that the two modalities cooperate and complement each other, improving scene understanding. Therefore, machines pursuing human-like perception need to explore audio-visual collaboration, rather than merely fusing or predicting multimodal information. To this end, researchers have introduced a variety of new challenges to the field of audio-visual learning, including audio-visual component analysis and audio-visual reasoning.
As a starting point for audio-visual collaboration, how to effectively extract representations from audio-visual data without human annotation is an important topic, because high-quality representations benefit a variety of downstream tasks. For audio-visual data, the semantic, spatial, and temporal consistencies provide natural signals for learning audio-visual representations in a self-supervised manner.
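One common way to exploit this natural supervision is audio-visual contrastive learning, where temporally aligned audio and visual clips form positive pairs and other clips in the batch serve as negatives. The sketch below is a minimal InfoNCE-style formulation with placeholder encoders; it is illustrative rather than the formulation of any particular surveyed paper.

```python
# A minimal sketch of self-supervised audio-visual contrastive learning:
# aligned clips are positives, other clips in the batch are negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    # audio_emb, visual_emb: (B, D) embeddings of temporally aligned clips
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(a.size(0))       # matched pairs lie on the diagonal
    # Symmetric loss: audio-to-visual and visual-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

audio_encoder = nn.Linear(128, 256)   # placeholder audio encoder
visual_encoder = nn.Linear(512, 256)  # placeholder visual encoder
loss = av_contrastive_loss(audio_encoder(torch.randn(8, 128)),
                           visual_encoder(torch.randn(8, 512)))
print(loss.item())
```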
In addition to representation learning, collaboration between audio-visual modalities mainly focuses on scene understanding. Some researchers focus on the analysis and localization of audio-visual components in the scene, including sound source localization, audio-visual saliency detection, audio-visual navigation, etc. Such tasks establish fine-grained connections between audio-visual modalities.
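A typical way to obtain such fine-grained connections for sound source localization is to compare a clip-level audio embedding against every spatial position of a visual feature map. The sketch below computes such a cosine-similarity heatmap; the shapes and names are illustrative assumptions rather than a specific published model.

```python
# A minimal sketch of sound source localization: cosine similarity between a
# global audio embedding and each spatial position of a visual feature map.
import torch
import torch.nn.functional as F

def localization_heatmap(audio_emb, visual_map):
    # audio_emb: (B, D) clip-level audio embedding
    # visual_map: (B, D, H, W) spatial visual features
    a = F.normalize(audio_emb, dim=-1)         # (B, D)
    v = F.normalize(visual_map, dim=1)         # (B, D, H, W), normalized over channels
    heat = torch.einsum('bd,bdhw->bhw', a, v)  # cosine similarity per location
    return heat                                # (B, H, W); higher = more likely source

heat = localization_heatmap(torch.randn(2, 256), torch.randn(2, 256, 14, 14))
print(heat.shape)  # torch.Size([2, 14, 14])
```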
In addition, many audio-visual tasks assume that the audio-visual content of an entire video is always temporally matched, i.e., that picture and sound are consistent at every moment. In practice this assumption does not always hold. For example, in a "playing basketball" sample, the camera sometimes captures content unrelated to the label, such as the auditorium. Therefore, tasks such as audio-visual event localization and parsing are proposed to further disentangle the audio-visual components of a scene along the temporal dimension.
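A simple way to formalize this temporal disentanglement is to predict, for every video segment, which events are audible and which are visible, and to treat events flagged by both streams as audio-visual. The sketch below is a bare-bones version of this idea with illustrative dimensions, not the design of any specific parsing model.

```python
# A minimal sketch of temporal event parsing: per-segment multi-label
# predictions for audible and visible events.
import torch
import torch.nn as nn

class SegmentEventParser(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, hidden=256, num_events=25):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.audio_head = nn.Linear(hidden, num_events)   # audible events per segment
        self.visual_head = nn.Linear(hidden, num_events)  # visible events per segment

    def forward(self, audio_seg, visual_seg):
        # audio_seg: (B, T, audio_dim), visual_seg: (B, T, visual_dim), T segments
        a_logits = self.audio_head(torch.relu(self.audio_proj(audio_seg)))
        v_logits = self.visual_head(torch.relu(self.visual_proj(visual_seg)))
        # An event can be treated as audio-visual in a segment when both scores are high.
        return a_logits, v_logits

parser = SegmentEventParser()
a, v = parser(torch.randn(2, 10, 128), torch.randn(2, 10, 512))
print(a.shape, v.shape)  # torch.Size([2, 10, 25]) torch.Size([2, 10, 25])
```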
Humans can make inferences beyond perception in audiovisual scenes. Although the above audio-visual collaboration tasks have gradually achieved a fine-grained understanding of audio-visual scenes, the reasoning analysis of audio-visual components has not been carried out. Recently, with the development of the field of audio-visual learning, some researchers have begun to pay more attention to audio-visual reasoning, such as audio-visual question answering and dialogue tasks. These tasks aim to answer scene-related questions or generate dialogues about observed audiovisual scenes by performing cross-modal spatiotemporal reasoning about audiovisual scenes.
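As a rough illustration of such reasoning tasks, the sketch below lets a question embedding attend over fused per-segment audio-visual features and classifies the attended context into a fixed answer vocabulary. All module names, dimensions, and the answer-classification formulation are simplifying assumptions for illustration.

```python
# A minimal sketch of audio-visual question answering: a question embedding
# attends over per-segment audio-visual features before answer classification.
import torch
import torch.nn as nn

class SimpleAVQA(nn.Module):
    def __init__(self, av_dim=256, q_dim=300, hidden=256, num_answers=42):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden)
        self.av_proj = nn.Linear(av_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.answer_head = nn.Linear(hidden, num_answers)

    def forward(self, av_feats, question_emb):
        # av_feats: (B, T, av_dim) fused audio-visual segment features
        # question_emb: (B, q_dim) pooled question embedding
        q = self.q_proj(question_emb).unsqueeze(1)  # (B, 1, hidden) as query
        kv = self.av_proj(av_feats)                 # (B, T, hidden)
        ctx, _ = self.attn(q, kv, kv)               # question attends over time
        return self.answer_head(ctx.squeeze(1))     # (B, num_answers)

qa = SimpleAVQA()
logits = qa(torch.randn(2, 10, 256), torch.randn(2, 300))
print(logits.shape)  # torch.Size([2, 42])
```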
Figure 4: Tasks related to AV collaboration
6
Representative Datasets
This section discusses some representative datasets in the field of audio-visual learning.
7
Trends and new perspectives
7.1 Semantic, Spatial and Temporal Consistency
Although audiovisual modalities have heterogeneous data forms, their intrinsic consistency covers semantic, spatial, and temporal aspects, laying the foundation for audiovisual research.
First, the visual and audio modalities depict the thing of interest from different perspectives, so paired audio-visual data are considered semantically consistent. In audio-visual learning, semantic consistency plays an important role in most tasks. For example, it makes it possible to combine audio-visual information for better audio-visual recognition and single-modality enhancement. It also plays an important role in cross-modal retrieval and transfer learning.
Second, both vision and audio can help determine the exact spatial location of sounding objects. This spatial correspondence also has a wide range of applications. For example, in a sound source localization task, this consistency is used to determine the visual location of sound-emitting objects guided by input audio. In the stereo case, it is possible to estimate visual depth information based on binaural audio or to use visual information as an aid to generate stereo audio.
Finally, visual content and the sound it produces are often temporally aligned. This consistency is also widely exploited in most audiovisual learning research, such as fusing or predicting multimodal information in audiovisual recognition or generation tasks.
In practice, these different AV coherences are not isolated, but often co-occur in AV scenes. Therefore, they are often exploited jointly in related tasks. The combination of semantic and temporal consistency is the most common case.
In simple scenarios, audiovisual segments at the same timestamp are considered to be semantically and temporally consistent. However, this strong assumption may fail, e.g., video footage and background sounds at the same timestamp are not semantically consistent. These false positives interfere with training.
Recently, researchers have begun to focus on these cases to improve the quality of scene understanding. A combination of semantic and spatial consistency is also common. For example, sound source localization in video relies on semantic consistency to find the visual spatial location corresponding to the input sound. In early audio-visual navigation tasks, the sounding target emitted a steady, repetitive sound: spatial consistency was satisfied, but the semantic content of the visual and audio streams was unrelated. Subsequently, semantic consistency between the sound and the sounding location was introduced to improve the quality of audio-visual navigation.
In general, the semantic, spatial, and temporal consistency of audio-visual modalities is sufficient to provide solid support for the study of audio-visual learning. The analysis and utilization of these coherences not only improves the performance of existing AV tasks, but also contributes to a better understanding of AV scenes.
7.2 A new perspective on scene understanding
This paper summarizes the cognitive basis of audio-visual modalities, and analyzes the phenomenon of human multi-channel perception. On this basis, the current audio-visual learning research is divided into three categories: Audio-visual Boosting, Cross-modal Perception and Audio-visual Collaboration. In order to review the current development in the field of audiovisual learning from a broader perspective, the article further proposes a new perspective on audiovisual scene understanding:
1) Basic Scene Understanding. Audio-visual boosting and cross-modal perception tasks typically focus on fusing or predicting consistent audio-visual information. The core of these tasks is a basic understanding of audio-visual scenes (e.g., action classification of an input video) or prediction of cross-modal information (e.g., generating the corresponding audio from a silent video). However, natural videos often contain a wide variety of audio-visual components, which are beyond the scope of these basic scene understanding tasks.
2) Fine-grained Scene Understanding. As mentioned above, audio-visual scenes usually contain rich components of different modalities. Therefore, researchers have proposed tasks that strip out target components. For example, the sound source localization task aims to mark the visual region where the target sounding object is located, while the audio-visual event localization and parsing tasks determine the target audible or visible events along the temporal dimension. These tasks strip out the audio-visual components, decouple the audio-visual scene, and achieve a more fine-grained understanding than the previous stage.
3) Causal Scene Understanding. In audiovisual scenes, humans not only perceive objects of interest around them, but also infer the interactions between them. The goal of scene understanding at this stage is closer to the pursuit of human-like perception. Currently, only a few tasks are explored at this stage. Audio-visual question answering and dialogue tasks are representative works. These tasks attempt to explore the association of audio-visual components in videos and perform spatiotemporal reasoning.
Overall, the exploration of these three stages is uneven. From basic scene understanding to causal scene understanding, the diversity and richness of related research gradually decreases; causal scene understanding in particular is still in its infancy. This hints at several potential directions for audio-visual learning:
1) Task integration. Most research in the AV field is task-oriented. These individual tasks only simulate and learn specific aspects of the audiovisual scene. However, the understanding and perception of audiovisual scenes are not isolated. For example, sound source localization tasks emphasize sound-related objects in vision, while event localization and parsing tasks temporally identify target events. These two tasks are expected to be integrated to facilitate refined understanding of audiovisual scenes. The integration of multiple audiovisual learning tasks is a direction worth exploring in the future.
2) Deeper understanding of causal interaction scenarios. Currently, the diversity of research on scene understanding involving reasoning is still limited. Existing tasks, including audio-visual question answering and dialogue, mostly focus on conducting dialogues based on events in videos. More in-depth types of reasoning, such as predicting what audio or visual events might happen next based on the previewed scene, deserve further study in the future.
To better present the content, the review is accompanied by a continuously updated project homepage that illustrates the goals and development of the different audio-visual tasks with additional pictures and videos, so that readers can quickly get acquainted with the field of audio-visual learning.