Author | Li Mei
Editor | Chen Caixian
In today’s data-driven artificial intelligence research, the information provided by single-modal data can no longer meet the need to keep improving machines’ cognitive capabilities. Just as humans perceive the world through multiple senses such as sight, hearing, smell, and touch, machines need to simulate this kind of human synesthesia to improve their cognition.
At the same time, with the explosion of multimodal spatiotemporal data and the growth of computing power, researchers have proposed a large number of methods to meet increasingly diverse needs. However, current multimodal cognitive computing is still limited to imitating surface-level human abilities and lacks a theoretical foundation at the cognitive level. Facing more complex intelligent tasks, the intersection of cognitive science and computational science has become inevitable.
Recently, Professor Li Xuelong of Northwestern Polytechnical University published the article “Multimodal Cognitive Computing” in the journal Science China: Information Science. Based on “Information Capacity”, he established an information-transfer model of the cognitive process, proposed the viewpoint that “multimodal cognitive computing can improve the information extraction ability of machines”, and theoretically unified the tasks of multimodal cognitive computing.
Li Xuelong believes that multimodal cognitive computing is one of the keys to realizing general artificial intelligence and has broad application prospects in fields such as “Vicinagearth Security”. The paper explores a unified cognitive model for humans and machines and offers inspiration for research that advances multimodal cognitive computing.
Citation format: Xuelong Li, “Multi-Modal Cognitive Computing,” SCIENTIA SINICA Informationis, DOI: 10.1360/SSI-2022-0226
Li Xuelong, a professor at Northwestern Polytechnical University, focuses on the relationship between the intelligent acquisition, processing, and management of high-dimensional data, with applications in systems such as “Vicinagearth Security”. He was elected an IEEE Fellow in 2011 and was the first mainland Chinese scholar elected to the Executive Committee of the Association for the Advancement of Artificial Intelligence (AAAI).
AI Technology Review has summarized the main points of “Multimodal Cognitive Computing” below and conducted an in-depth dialogue with Professor Li Xuelong along these lines.
— 1 —
Machine cognitive ability lies in information utilization
Based on information theory, Li Xuelong proposed that multimodal cognitive computing can improve the information extraction ability of machines, and theoretically modeled this point of view (see below).
First, we need to understand how humans extract event information.
In 1948, Shannon, the founder of information theory, proposed the concept of “information entropy” to represent the degree of uncertainty of a random variable. The smaller the probability of an event, the greater the amount of information provided by its occurrence. That is, in a given cognitive task T, the amount of information brought by the occurrence of an event x is inversely related to the probability p(x) of that event:
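In its standard form, this is Shannon’s self-information (the paper’s notation may differ slightly):

    I(x) = log(1/p(x)) = -log p(x)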
This information is carried by the various modalities. Assuming the event space X is a tensor over modality (m), space (s), and time (t), the amount of information an individual obtains from the event space can be defined as:
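One illustrative way to write this, summing the self-information of each modality-space-time component (a sketch, not necessarily the paper’s exact definition):

    I(X) = Σ_{m,s,t} I(x_{m,s,t}) = -Σ_{m,s,t} log p(x_{m,s,t})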
Human attention within a given spatiotemporal range is limited (normalized to 1), so as spatiotemporal events go from unimodal to multimodal, that limited attention has to be allocated toward the unknown, more informative event information in order to obtain the maximum amount of information:
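One way to formalize this (an illustrative reconstruction, not the paper’s exact formula): with attention weights a_{m,s,t} ≥ 0 that sum to the fixed budget of 1, the individual allocates attention to solve

    max_a Σ_{m,s,t} a_{m,s,t} · I(x_{m,s,t})   subject to   Σ_{m,s,t} a_{m,s,t} = 1.

Adding modalities only enlarges the set of terms the attention budget can be spent on, so the achievable maximum cannot decrease.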
It can be seen from this that the more modalities contained in spatiotemporal events, the greater the amount of information an individual acquires, and the higher the cognitive level.
For machines, then, does obtaining a greater amount of information bring them closer to the level of human cognition?
Not necessarily. To measure the cognitive ability of a machine, Li Xuelong uses the notion of “Information Capacity” to express the process by which a machine extracts information from the event space as follows, where D is the data volume of the event space X.
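A sketch of this notion, with notation assumed for illustration (the paper’s definition may be more refined): the machine’s cognitive capability can be measured as the information obtained per unit of data,

    C(X) = I(X) / D,

which grows either by extracting more information I(X) or by needing a smaller data volume D to do so.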
Therefore, the cognitive ability of a machine can be defined as the ability to obtain the maximum amount of information from a unit of data. In this way, the cognitive learning of humans and machines is unified as a process of improving information utilization.
So how can a machine’s utilization of multimodal data be improved so as to strengthen its multimodal cognitive computing capability?
Just as the improvement of human cognition is inseparable from association, reasoning, induction, and deduction about the real world, improving the cognitive ability of machines requires starting from the three corresponding basic analysis tasks: association, generation, and collaboration.
— 2 —
Three main lines of multimodal cognitive computing
The three tasks of multi-modal association, cross-modal generation and multi-modal collaboration have different emphases in processing multi-modal data, but the core is to use as little data as possible to maximize the amount of information.
Multimodal association
How is content originating from different modalities related at the spatial, temporal, and semantic levels? Answering this question is the goal of the multimodal association task and the premise for improving information utilization.
The alignment of multimodal information at the spatial, temporal, and semantic levels is the basis of cross-modal perception, and multimodal retrieval is how that perception is applied in real life. For example, relying on multimedia search technology, we can type in a few words or phrases to retrieve matching video clips.
Legend: Schematic diagram of multimodal alignment
Inspired by the human cross-sensory perception mechanism, AI researchers have used computational models for cross-modal perception tasks such as lip reading and missing modality generation.
Such models also assist cross-modal perception for people with disabilities. In the future, the main application scenarios of cross-modal perception will no longer be limited to sensory substitution for the disabled, but will increasingly combine with human cross-sensory perception to raise the level of human multi-sensory perception.
Today, with the rapid growth of digital multimodal content, the application demands on cross-modal retrieval are becoming ever richer, which presents new opportunities and challenges for multimodal association learning.
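To make the retrieval side of association concrete, here is a minimal Python sketch of text-to-video retrieval in a shared embedding space; the encoders are assumed to already exist and the embeddings are random placeholders, so this illustrates the mechanism rather than any specific method from the paper.

    import numpy as np

    def retrieve(text_embedding, video_embeddings, top_k=5):
        # Rank candidate video clips by cosine similarity to the text query
        # in a shared embedding space learned from paired text-video data.
        q = text_embedding / np.linalg.norm(text_embedding)
        v = video_embeddings / np.linalg.norm(video_embeddings, axis=1, keepdims=True)
        scores = v @ q
        return np.argsort(-scores)[:top_k]

    # Placeholder embeddings standing in for the outputs of trained encoders.
    rng = np.random.default_rng(0)
    text_vec = rng.normal(size=256)
    video_vecs = rng.normal(size=(1000, 256))
    print(retrieve(text_vec, video_vecs))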
Cross-modal generation
When we read the plot of a novel, the corresponding picture will naturally appear in our minds, which is the embodiment of human cross-modal reasoning and generative ability.
Similarly, in multimodal cognitive computing, the goal of cross-modal generation tasks is to give machines the ability to generate entities in an unseen modality. From the perspective of information theory, the essence of this task is to improve the machine’s cognitive ability over multimodal information channels. There are two routes: one is to increase the amount of information, i.e., cross-modal synthesis; the other is to reduce the amount of data, i.e., cross-modal conversion.
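Read in terms of the information-capacity view above (an interpretation of the text, not a formula from the paper), both routes raise the information obtained per unit of data:

    cross-modal synthesis:  I(X) increases while the data volume D stays roughly fixed, so I(X)/D rises;
    cross-modal conversion: D decreases while I(X) is largely preserved, so I(X)/D rises.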
The task of cross-modal synthesis is to enrich existing information when generating a new modal entity, thereby increasing the amount of information. Take text-based image generation as an example: early methods relied mainly on entity association and depended heavily on a retrieval database. Today, image generation is dominated by generative adversarial networks, which can produce realistic, high-quality images. Face image generation remains very challenging, however, because at the information level even a small change in expression can convey a very large amount of information.
Cross-modal conversion, by contrast, transforms a complex modality into a simpler one in search of a more concise form of expression, which reduces the amount of data and improves the ability to obtain information.
Legend: Common cross-modal conversion tasks
As a showcase for combining computer vision and natural language processing, cross-modal conversion can greatly improve the efficiency of online retrieval, for example by giving a lengthy video a brief natural-language description, or by generating an audio signal related to a piece of video information.
At present, the two mainstream generative models, the VAE (variational autoencoder) and the GAN (generative adversarial network), each have their own strengths and weaknesses. Li Xuelong believes that the VAE relies on assumptions while the GAN is poorly interpretable, and that the two need to be combined sensibly. A particularly important point is that the challenge of multimodal generation tasks lies not only in generation quality but also in the semantic and representational gap between modalities; how to carry out knowledge reasoning across this semantic gap is a difficulty that remains to be solved.
Multimodal collaboration
In the human cognitive mechanism, induction and deduction play an important role. We can inductively fuse and jointly reason over multimodal percepts such as what we see, hear, smell, and touch, and use this as the basis for decision-making.
Similarly, multimodal cognitive computing requires two or more modalities of data to work in coordination, cooperating to complete more complex multimodal tasks and to improve accuracy and generalization. From the perspective of information theory, its essence is the mutual fusion of multimodal information so that the modalities complement one another, together with the optimization of attention.
First, modal fusion addresses the differences among multimodal data caused by data formats, spatio-temporal alignment, noise interference, and so on. Current rule-based fusion methods include serial fusion, parallel fusion, and weighted fusion, while learning-based fusion methods include attention-mechanism models, transfer learning, and knowledge distillation.
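As a rough illustration of the difference between the two families, the sketch below contrasts fixed weighted fusion with a simple attention-style fusion over per-modality feature vectors; the feature dimensions, weights, and query vector are illustrative assumptions rather than methods surveyed in the paper.

    import numpy as np

    def weighted_fusion(features, weights):
        # Rule-based fusion: a fixed convex combination of modality features.
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
        return sum(wi * f for wi, f in zip(w, features))

    def attention_fusion(features, query):
        # Learning-based fusion (sketch): the weights are computed from the data
        # itself, here via a softmax over query-feature dot products.
        scores = np.array([query @ f for f in features])
        attn = np.exp(scores - scores.max())
        attn = attn / attn.sum()
        return sum(ai * f for ai, f in zip(attn, features))

    # Placeholder features for three aligned modalities (e.g. image, audio, text).
    rng = np.random.default_rng(0)
    image_feat, audio_feat, text_feat = rng.normal(size=(3, 128))
    fused_fixed = weighted_fusion([image_feat, audio_feat, text_feat], [0.5, 0.3, 0.2])
    fused_attn = attention_fusion([image_feat, audio_feat, text_feat], rng.normal(size=128))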
Secondly, after the multi-modal information fusion is completed, joint learning of the modal information is needed to help the model mine the relationship between the modal data and establish an auxiliary or complementary relationship between the modalities.
Through joint learning, on the one hand, the performance of individual modalities can be improved, as in applications such as vision-guided audio, audio-guided vision, and depth-guided vision; on the other hand, joint modeling across modalities, such as audio-visual guided music generation, points to the future development directions of multimodal cognitive computing.
— 3 —
Opportunities and Challenges
In recent years, deep learning techniques have greatly promoted the theoretical and engineering development of multimodal cognitive computing. But today’s application requirements are more diversified, and the speed of data iteration is also accelerating, which poses new challenges and brings many opportunities for multimodal cognitive computing.
We can look at four levels of improving machine cognitive ability:
At the data level, traditional multimodal research separates data collection and computation into two independent processes, which has drawbacks. The human world is composed of continuous analog signals, while machines deal with discrete digital signals, and the conversion process will inevitably cause information deformation and loss.
In this regard, Li Xuelong believes that intelligent optoelectronics represented by optical neural networks can bring solutions. If the integration of multi-modal data sensing and computing can be completed, the information processing efficiency and intelligence level of machines will be greatly improved.
At the information level, the key to cognitive computing is the processing of high-level semantics in information, such as the positional relationship in vision, the style of images, and the emotion of music. Current multimodal tasks are limited to interactions under simple goals and scenarios, and cannot understand deep logical semantics or subjective semantics. For example, a machine can generate an image of a flower blooming in a meadow, but cannot understand the common sense that flowers and plants wither in winter.
Therefore, building a bridge across modalities for complex logical and affective semantic information, and establishing a measurement system suited to machines, is a major trend of multimodal cognitive computing in the future.
At the fusion-mechanism level, how to optimize high-quality multimodal models composed of heterogeneous components is a current difficulty. Most multimodal cognitive computing today optimizes the model under a single unified learning objective. This strategy lacks targeted adjustment of the heterogeneous components inside the model, which leads to the under-optimization problem of existing multimodal models; it needs to be tackled from several directions, including multimodal machine learning and optimization theory.
At the task level, the cognitive learning methods of machines vary with tasks. We need to design learning strategies for task feedback to improve the ability to solve a variety of related tasks.
In addition, given that current machine learning understands the world mainly from images, text, and other such data, we can draw on the research results of cognitive science. For example, Embodied AI is a potential solution: the agent interacts with the environment in a multimodal manner so as to continuously evolve and form the ability to solve complex tasks.
— 4 —
Dialogue with Li Xuelong
AI Tech Review: Why should we focus on multimodal data and multimodal cognitive computing in AI research? What benefits and obstacles does the growth of multimodal data bring to model performance?
Li Xuelong: Thank you for your question. We pay attention to and study multimodal data because, on the one hand, artificial intelligence is essentially data-dependent, and the information a single modality can provide is always very limited, while multimodal data can provide multi-level, multi-perspective information; on the other hand, the objective physical world is itself multimodal, so the study of many practical problems is inseparable from multimodal data, such as searching for pictures with text or recognizing objects by their sounds.
We analyze multimodal problems from the perspective of cognitive computing, starting from the essence of artificial intelligence, and by building a multimodal analysis system that can simulate human cognitive patterns, we hope that machines can perceive the surrounding environment as intelligently as humans.
Complex and interleaved multimodal information also brings a great deal of noise and redundancy, increasing the burden of model learning; in some cases multimodal data even performs worse than a single modality, which poses a big challenge for model design and optimization.
AI Technology Review: From the perspective of information theory, what are the similarities between human cognitive learning and machine cognitive learning? What is the guiding significance of the research on human cognitive mechanism for multimodal cognitive computing? What difficulties will multimodal cognitive computing face if there is a lack of understanding of human cognition?
Li Xuelong: Aristotle believed that human knowledge of things begins with the senses, while Plato held that what is obtained through the senses cannot be called knowledge.
Humans receive a large amount of external information from birth and gradually establish a cognitive system through perception, memory, reasoning, and so on; the learning ability of machines, in contrast, is achieved by training on large amounts of data, mainly to find the correspondence between perception and human knowledge. In Plato’s sense, what a machine learns is not yet knowledge. In this article we draw on the theory of “Information Capacity” and try to establish a cognitive connection between humans and machines, starting from the ability to extract information.
Humans transmit multimodal information to the brain through perceptual channels such as sight, hearing, smell, taste, and touch, producing joint stimulation of the cerebral cortex. Psychological research has found that the joint action of multiple senses gives rise to cognitive learning modes such as “multi-sensory integration”, “synesthesia”, “perceptual reorganization”, and “perceptual memory”. These human cognitive mechanisms have brought great inspiration to multimodal cognitive computing: they have given rise to typical multimodal analysis tasks such as multimodal collaboration, multimodal association, and cross-modal generation, and they have also spawned typical machine analysis mechanisms such as local sharing, long short-term memory, and attention.
At present, the mechanism of human cognition is still not well understood. Without the guidance of research on human cognition, multimodal cognitive computing will fall into the trap of merely fitting data, and we cannot judge whether a model has learned the knowledge people need. This is also a currently controversial point in artificial intelligence.
AI Technology Review: From the perspective of information theory, what evidence supports your view that “multimodal cognitive computing can improve the information extraction ability of machines” in specific multimodal cognitive computing tasks?
Li Xuelong: This question can be answered from two aspects. First, multimodal information can improve the performance of a single modality in different tasks. A lot of work has verified that when adding sound information, the performance of computer vision algorithms will be significantly improved, such as object recognition, scene understanding, etc. We have also built an environmental camera and found that the image quality of the camera can be improved by fusing multi-modal information from sensors such as temperature and humidity.
Second, the joint modeling of multimodal information makes more complex intelligent tasks possible. For example, we did the work “Listen to the Image”, encoding visual information into sound so that blind people can “see” the scene in front of them. This also demonstrates that multimodal cognitive computing helps machines extract more information.
AI Technology Review: What are the interrelationships among alignment, perception, and retrieval in multimodal association tasks?
Li Xuelong: The relationship between these three is rather complicated in nature, and in the article I only give some of my preliminary views. The premise for associating information from different modalities is that they jointly describe the same or similar objective existence, but this association is difficult to determine when the external information is complex or subject to interference, which requires first aligning the information from the different modalities to determine the correspondence. Then, on the basis of alignment, perception from one modality to another is realized.
It is like being able to tell what a person is saying from lip movements alone; this phenomenon rests on the alignment between visemes and phonemes. In real life, we further apply this cross-modal perception to applications such as retrieval, for instance retrieving pictures or video content of products through text, realizing computable multimodal association applications.
AI Technology Review: Models such as DALL-E, which have been very popular recently, are an example of cross-modal generation tasks. They perform well on text-to-image generation, but their semantic relevance, interpretability, and so on remain limited. How do you think this problem should be solved? Where does the difficulty lie?
Li Xuelong: Generating images from text is a task of “imagination”. A person sees or hears a sentence, understands the semantic information in it, and then relies on the brain’s memory to imagine the most suitable scene, creating a “sense of the picture”. At present, DALL-E is still at the stage of using statistical learning to fit data, inducing and summarizing over large-scale datasets, which is exactly what deep learning is best at.
However, if we really want to learn human “imagination”, we also need to take human cognitive models into account to achieve “high-level” intelligence. This requires the intersection of neuroscience, psychology, and information science, which is both a challenge and an opportunity. In recent years many teams have produced excellent work in this area. Exploring the computability of human cognitive modes through the cross-fertilization of multiple disciplines is also one of the directions our team is working on, and I believe it will bring new breakthroughs for “high-level” intelligence.
AI Tech Review: How do you draw inspiration from cognitive science in your research work? What research in cognitive science are you particularly interested in?
Li Xuelong: “Ask how the channel stays so clear: because living water keeps flowing from its source.” I often observe and think about interesting phenomena in everyday life.
Twenty years ago, I came across a web page with pictures of Jiangnan landscapes. When I clicked on the music on the page, I suddenly felt as if I were there, and I began to think about the relationship between hearing and vision from a cognitive perspective. In the process of studying cognitive science, I learned about the phenomenon of “synesthesia”, and, combining it with my own research direction, I completed an article entitled “Visual Music and Musical Vision”, which was also the first time “synesthesia” was introduced into the information field.
Later, I opened the first cognitive computing course in the information field and founded the IEEE SMC Cognitive Computing Technical Committee, trying to break down the boundary between cognitive science and computational science. At that time I also gave a definition of cognitive computing, which is the description now on the technical committee’s home page. In 2002, I proposed measuring the information capacity per unit of data, which is the concept of “Information Capacity”, and tried to use it to measure the cognitive ability of machines. This work on “cognitive computing” was later recognized with the Tencent Science Exploration Award.
To this day, I continue to follow the latest advances in synesthesia and perception. In nature there are many modalities beyond the five human senses, and possibly latent modalities that are not yet understood. For example, quantum entanglement may suggest that the three-dimensional space we live in is just a projection of a higher-dimensional space; if so, our means of detection are also limited. Perhaps these latent modalities can be tapped to allow machines to approach, or even surpass, human perception.
AI Technology Review: On the issue of how to better combine human cognition and artificial intelligence, you proposed to build a modal interaction network with “Meta-Modal” as the core. Could you please introduce this point of view? What is its theoretical basis?
Li Xuelong: Metamodality is itself a concept originating in cognitive neuroscience. It refers to the brain having an organization that, when performing a certain function or representational operation, makes no specific assumption about the sensory category of the input information and is still able to perform well.
Metamodality is not a whimsical notion; it is essentially a hypothesis that cognitive scientists arrived at after synthesizing phenomena and mechanisms such as cross-modal perception and neuronal plasticity. It also inspires us to construct efficient learning architectures and methods across different modalities so as to achieve more generalized modal representation capabilities.
AI Technology Review: What are the main applications of multimodal cognitive computing in the real world? Could you give some examples?
Li Xuelong: Multimodal cognitive computing is a research that is very close to practical applications. Our team previously had a work on cross-modal perception, encoding visual information into sound signals to stimulate the primary visual cortex of the cerebral cortex. In daily life, we also often use multimodal cognitive computing technology. For example, short video platforms will integrate voice, image and text tags to recommend videos that may be of interest to users.
More broadly, multimodal cognitive computing is widely used in the Vicinagearth Security scenarios mentioned in the article, such as intelligent search and rescue: drones and ground robots collect sound, images, temperature, humidity, and other data; these data are integrated and analyzed from a cognitive point of view, and different search-and-rescue strategies are adopted according to conditions at the scene. There are many similar applications, such as intelligent inspection and cross-domain remote sensing.
AI Technology Review: You mentioned in your article that at present, multimodal tasks are limited to interactions under simple goals and scenarios, and it is difficult once deeper logical semantics or subjective semantics are involved. So, is this an opportunity for a revival of symbolic AI? What other options are there for improving the ability of machines to process high-level semantic information?
Li Xuelong: Russell believed that much of the value of knowledge lies in its uncertainty. The learning of knowledge needs warmth; it must be able to interact with and receive feedback from the outside world. Most of the research we have seen so far is unimodal, passive, and oriented to given data, which can meet the needs of research on simple goals and scenarios. For deeper logical semantics or subjective semantics, however, it is necessary to fully explore settings that span multi-dimensional space-time, are supported by more modalities, and allow active interaction.
To achieve this goal, research methods may draw more from cognitive science. For example, some researchers have introduced the “embodied experience” hypothesis from cognitive science into artificial intelligence, exploring new learning problems and tasks in which machines actively interact with the outside world under multimodal information input, and some promising results have been obtained. This also shows the role and positive significance of multimodal cognitive computing in connecting artificial intelligence and cognitive science.
AI Technology Review: Smart optoelectronics is also one of your research directions. You mentioned in your article that smart optoelectronics can bring exploratory solutions to the digitization of information. What can smart optoelectronics do in the perception and computation of multimodal data?
Li Xuelong: Optical and electrical signals are the main channels through which people understand the world. Most of the information humans receive every day comes from vision, and visual information in turn comes mainly from light. The five human senses of sight, hearing, smell, taste, and touch likewise convert stimuli such as light, sound waves, pressure, and odors into electrical signals for higher-level cognition. Optoelectronics is therefore the main source of information for humans perceiving the world. In recent years, with the help of advanced optoelectronic devices, we can perceive far more than visible light and audible sound waves.
It can be said that optoelectronic devices sit at the forefront of human perception of the world. The intelligent optoelectronics research we are engaged in is devoted to integrating optoelectronic sensing hardware with intelligent algorithms: introducing physical priors into algorithm design and using algorithmic results to guide hardware design, forming a feedback loop between “sensing” and “computing”, expanding the boundary of perception, and ultimately imitating or even surpassing human multimodal perception.
AI Technology Review: What research work are you currently doing in the direction of multimodal cognitive computing? What are your future research goals?
Li Xuelong: Thanks for the question. My current focus is on multimodal cognitive computing in Vicinagearth Security. Security in the traditional sense usually refers to urban security, but human activity space has now expanded to low altitude, the ground, and underwater, so we need to establish a three-dimensional security system in this vicinagearth space to carry out a series of practical tasks such as cross-domain detection and autonomous unmanned systems.
A big problem faced by Vicinagearth Security is how to intelligently process the large amount of multimodal data generated by different sensors, for example enabling machines to understand, from a human perspective, targets observed simultaneously by drones and ground monitoring equipment. This involves multimodal cognitive computing, and the combination of multimodal cognitive computing with intelligent optoelectronics.
In the future, I will continue to study the application of multimodal cognitive computing in Vicinagearth Security, hoping to connect data acquisition and data processing, make sensible use of “positive-incentive noise” (Pi-Noise), and establish a Vicinagearth Security system supported by multimodal cognitive computing and intelligent optoelectronics.
Reference link: https://ift.tt/JqRKkce