【Xiong Yun from Fudan University】How can data be opened up without losing its scarcity or leaking privacy?

IEEE x ATEC

The IEEE x ATEC Technology Thinking Conference is a technology salon jointly sponsored by the professional technical society IEEE and the frontier technology community ATEC. It invites industry experts and scholars to share cutting-edge research and technical practice in support of digital development.

As AI technology continues to mature, medical AI applications have sprung up across the medical field. AI is intelligent and automated: with strong computing power it can make sense of complex data and process data at massive scale, and it plays an important role in medical reform. The third session of the IEEE x ATEC Technology Thinking Conference invited four guests to give independent talks on the theme “Medical Transformation Driven by AI: From Life Science to Medical Management”.

The following is the talk “Medical Big Data: From Simple to Deep, From Complex to Simple” by Xiong Yun, professor and doctoral supervisor at Fudan University, deputy director of the Shanghai Key Laboratory of Data Science, and member of the Senior Advisory Committee of the ATEC Science and Technology Elite Competition.

Speaker | Xiong Yun

Professor/PhD supervisor at Fudan University

Deputy Director of Shanghai Key Laboratory of Data Science

Expert, Senior Advisory Committee of the ATEC Science and Technology Elite Competition

“Medical Big Data: From Simple to Deep, From Complex to Simple”

Hello everyone, I am Xiong Yun from Fudan University. Thank you to the IEEE x ATEC Technology Thinking Conference for having me. I am very happy to share our research progress on medical big data with you.

Today I will cover the following: first the sources, types and characteristics of medical data, and then, in more detail, our work on mining and analysis technology and on open interconnection technology for medical big data.

1. Medical big data

We all know that data has become a new factor of production. Healthcare concerns people’s well-being, and General Secretary Xi has pointed out the need to speed up the development of “Internet + medical health”. Discovering the value in medical data plays a very important role in drug research and development, auxiliary diagnosis and other areas. Digital healthcare offers a feasible way to share high-quality medical resources and to address the uneven distribution of medical resources and high medical costs.

Medical data comes in many types. Common ones are patient medical records (basic patient information, diagnoses, medication, etc.), electronic medical records and diagnostic reports in the form of unstructured text, medical images, laboratory test sheets, and literature data. We analyze these data and mine their value. Medical data should be viewed from different perspectives, and the algorithms should be designed around the characteristics of each kind of data. Accordingly, there are processing methods for single-source data, for multi-source data, and for structured, unstructured, multi-modal and multi-source heterogeneous data.

Big data has two sides: using data to solve problems, and solving the problems of the data itself.

We have just gone through the data types available to medical big data. Now let’s look at its problems and challenges.

Medical big data suffers from complex data types, poor data quality, numerous data islands, weak data security and shallow data application. These problems are common in other fields too, which is why we can bring in existing data mining and machine learning techniques (such as natural language processing and image and vision processing). However, the medical field places higher demands on data quality and on the validity of analytical results, so these general-purpose techniques need to be improved.

For complex data types, we need multimodal data fusion, for example aligning medical images with medical report text across modalities. For poor data quality, we need specialized medical data normalization, such as aligning electronic medical record text with ICD codes. There is also a tension between the high sensitivity and strict privacy requirements of medical data and the need for comprehensive data features in intelligent medical analysis, which calls for more effective sharing and interconnection mechanisms and the technology to support them.

To this end, we have carried out research on the problems above and developed a series of technologies for intelligent analysis and open interconnection of medical data.

2. Mining analysis: from shallow to deep

Below we focus on big data mining and analysis technology and on open interconnection technology.

In analysis and mining, our research has moved from the shallow to the deep: from simple mining of single-source medical data to deep-learning-based feature representation, then to deep learning methods that go from structured data to unstructured and cross-modal data such as medical images and text, and finally to the analysis of multi-source, multi-modal omics data.

We expand on each of these below.

From patients’ medical records we can see their basic information and medication records. Using the most basic approach, a frequent pattern mining algorithm, we can obtain patients’ medication patterns. For example, the three patients in the figure all use the first three drugs, which suggests some association among those three drugs. This intuitive method is of some help in auxiliary diagnosis. In real medical scenarios, however, the order of medication also reflects the patient’s disease state: taking one medicine first and another later may correspond to different treatment principles. The dosage of a drug likewise reflects the treatment regimen for the patient’s symptoms.

Therefore we use different methods, one based on frequency statistics, one that takes order into account, and one that takes dose into account, and the medication patterns they produce differ as well.
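To make the difference between the frequency-based and the order-aware view concrete, here is a minimal sketch in Python. It is not the implementation from the talk: the patient records, the drug names `A` to `E`, and the helper names `frequent_itemsets` and `frequent_ordered_pairs` are all assumptions made for illustration, and the counting is a brute-force stand-in for a real frequent pattern mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical medication records: one ordered drug list per patient.
records = {
    "p1": ["A", "B", "C", "D"],
    "p2": ["A", "B", "C"],
    "p3": ["B", "A", "C", "E"],
}

def frequent_itemsets(records, size, min_support):
    """Unordered drug sets of the given size used by at least min_support patients."""
    counts = Counter()
    for drugs in records.values():
        for combo in combinations(sorted(set(drugs)), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}

def frequent_ordered_pairs(records, min_support):
    """Ordered pairs (x taken before y) seen in at least min_support patients."""
    counts = Counter()
    for drugs in records.values():
        seen = set()
        for i, x in enumerate(drugs):
            for y in drugs[i + 1:]:
                if x != y:
                    seen.add((x, y))
        counts.update(seen)  # each patient contributes at most once per pair
    return {pair: n for pair, n in counts.items() if n >= min_support}

triples = frequent_itemsets(records, size=3, min_support=3)
ordered = frequent_ordered_pairs(records, min_support=3)
print(triples)  # frequency view: which drug sets co-occur
print(ordered)  # order view: which drug tends to precede which
```

In this toy data all three patients share the set A, B, C, but only the orderings "A before C" and "B before C" hold for everyone, which is the kind of difference between methods that the talk describes. A dose-aware variant would carry the dose along with each drug in the same way.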

For a given drug, the related drugs can be displayed. A patient’s medication characteristics reflect the patient’s own characteristics, which helps with personalized and precise treatment. For example, patients with similar medication patterns are more similar to each other and can serve as references for diagnosis. But we also found that this simple approach is limited and insufficient for reflecting patient characteristics.

With deep learning, more patient data can be used to capture more information. For example, the methods above considered only the order of drugs, not the time interval between administrations or the influence of an earlier state on a later one.

To better describe these multiple and complex factors, we model patient behavior as a bipartite graph whose nodes are patients and medications. The edges record the rich interaction behavior: under what conditions and at what time a drug was used, at what dose, and other specifics of the medication. The problem then becomes obtaining a feature vector for each patient node in the graph that describes the patient for downstream tasks such as patient similarity recognition or patient classification. A deep learning model can be used to obtain a feature vector for each node, and two patients are considered similar if their feature vectors are similar.
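The data structure can be sketched as follows. This is only an illustration under stated assumptions: the edge tuples, field names and helper functions are hypothetical, and the "feature vector" here is a hand-built total-dose vector standing in for the embedding that a deep learning model would learn from the graph.

```python
import math
from collections import defaultdict

# Hypothetical interaction edges: (patient, drug, day, dose_mg).
edges = [
    ("p1", "A", 1, 10.0), ("p1", "B", 3, 5.0),
    ("p2", "A", 2, 10.0), ("p2", "B", 5, 5.0),
    ("p3", "C", 1, 20.0),
]

def build_bipartite(edges):
    """Map each patient node to its drug edges; every edge keeps its attributes."""
    graph = defaultdict(list)
    for patient, drug, day, dose in edges:
        graph[patient].append({"drug": drug, "day": day, "dose": dose})
    return graph

def patient_vector(graph, patient, drugs):
    """Stand-in feature vector: total dose per drug (a learned embedding would replace this)."""
    totals = defaultdict(float)
    for edge in graph[patient]:
        totals[edge["drug"]] += edge["dose"]
    return [totals[d] for d in drugs]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

graph = build_bipartite(edges)
drugs = sorted({drug for _, drug, _, _ in edges})
vectors = {p: patient_vector(graph, p, drugs) for p in graph}
sim_p1_p2 = cosine(vectors["p1"], vectors["p2"])
sim_p1_p3 = cosine(vectors["p1"], vectors["p3"])
print(sim_p1_p2, sim_p1_p3)
```

Patients p1 and p2 take the same drugs at the same doses, so their vectors point in the same direction, while p3 shares no drug with p1. The time stamps stored on the edges are not used by this stand-in vector; capturing them is exactly what the learned model is for.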

Graph modeling is adopted first because it captures time-series dependencies better, that is, the dependencies across multiple time points. A patient takes drugs at different periods and time points, and the records may show that drug B tends to follow drug A. The deep learning model is therefore trained mainly to maximize the probability of drug co-occurrence: given that a patient has taken drug A, which drug will be used next?

It can also model the probability of a single event under different conditions, that is, conditional proximity, such as the circumstances under which a patient takes a drug. In other words, our model should maximize the probability of a patient-medication pair under a given condition.

Next, let’s look at technical progress on unstructured and cross-modal data.
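The quantity being modeled, the probability of the next drug given the current one, can be illustrated with simple counting. The sketch below is an assumption-laden stand-in: it estimates first-order transition probabilities from hypothetical drug sequences, whereas the model in the talk learns these dependencies with deep learning on the graph.

```python
from collections import Counter, defaultdict

# Hypothetical per-patient drug sequences in time order.
sequences = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
]

def transition_probs(sequences):
    """Estimate P(next drug | current drug) from adjacent pairs in each sequence."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, nxts in counts.items():
        total = sum(nxts.values())
        probs[prev] = {drug: n / total for drug, n in nxts.items()}
    return probs

probs = transition_probs(sequences)
# Most likely drug to follow A under this count-based estimate.
most_likely_after_a = max(probs["A"], key=probs["A"].get)
print(probs["A"], most_likely_after_a)
```

A learned model replaces the count table with parameters trained to maximize the same kind of probability, which lets it also condition on time intervals, doses and other edge attributes.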

Traditional basic text analysis methods can be applied to medical texts. For electronic medical records, for example, features are extracted and documents that have much in common are grouped to form a common document template. A relatively simple technique such as SimHash can be used to extract the text features. But this is very limited when it comes to extracting the semantic features of the medical content itself.

Medical texts can therefore be understood better if they are normalized against structured information from the medical domain.

Take ICD coding as an example. Medical text is mostly unstructured, but each text is labeled with one or more ICD codes, so the task is to assign a medical text its corresponding ICD codes. This is in fact a multi-label classification problem. Our approach is to learn embedding representations for the words in the text.
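For readers unfamiliar with SimHash, a minimal sketch follows. The tokenization (lowercasing and whitespace splitting), the use of MD5 as the per-token hash, the unit token weights and the sample sentences are all assumptions for illustration; the talk does not say how its SimHash features were configured.

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash fingerprint of whitespace-separated tokens (unit weights)."""
    weights = [0] * bits
    for token in text.lower().split():
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Hypothetical record snippets that differ by one word.
fp_a = simhash("patient reports chest pain and shortness of breath")
fp_b = simhash("patient reports chest pain and mild shortness of breath")
distance = hamming(fp_a, fp_b)
print(distance)
```

Documents whose fingerprints are within a small Hamming distance can be grouped as candidates for a shared template. Because the fingerprint depends only on which tokens occur, it says nothing about medical meaning, which is the limitation noted above.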

We then bring in graph deep learning. A graph represents the hierarchical relationship among the ICD codes to be modeled, and graph convolution is used to obtain a feature representation for each node in the graph. This gives an effective improvement over the original shallow model and over a model without the graph. In this pipeline, however, the text features are still produced by a general-domain convolutional model. Pre-trained models such as BERT can also be used here.

Because general-domain corpora contain little biomedical knowledge, general-domain pre-trained models such as BERT or GPT may not learn biomedical knowledge well. For this reason, some work trains on dedicated biomedical corpora to obtain pre-trained models specialized for the biomedical field.

Building on this, our work is a pre-trained model for medical text that takes into account, in the Chinese context, the semantic relationships among the components of Chinese characters. For many Chinese characters, especially those in disease names, the components carry semantic features. We split each character into a small graph of components, use a graph deep learning model to obtain the semantic features of each component, and combine these with a general-domain BERT to obtain a pre-trained model that better reflects the characteristics of medical text.
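The idea of propagating information along the ICD hierarchy and then scoring each code independently can be sketched in a few lines. Everything below is a simplification under stated assumptions: the two-dimensional label and text vectors are made up, the ICD-style code strings are used only as illustrative labels, and the graph convolution is a single mean-aggregation step with self-loops and no learned weights, not the model from the talk.

```python
import math

# Hypothetical label embeddings keyed by illustrative ICD-style codes.
features = {
    "E11": [1.0, 0.0],
    "E11.9": [0.0, 1.0],
    "E11.2": [0.0, 0.0],
    "I10": [0.0, 4.0],
}
# Hypothetical parent-child edges in the code hierarchy.
edges = [("E11", "E11.9"), ("E11", "E11.2")]

def gcn_layer(features, edges):
    """One mean-aggregation graph convolution step with self-loops, no learned weights."""
    neighbors = {node: {node} for node in features}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    dim = len(next(iter(features.values())))
    return {
        node: [sum(features[n][k] for n in nbrs) / len(nbrs) for k in range(dim)]
        for node, nbrs in neighbors.items()
    }

def predict_codes(text_vec, label_vecs, threshold=0.5):
    """Multi-label decision: each code is scored on its own with a sigmoid."""
    picked = []
    for code, vec in label_vecs.items():
        score = 1 / (1 + math.exp(-sum(a * b for a, b in zip(text_vec, vec))))
        if score >= threshold:
            picked.append(code)
    return sorted(picked)

smoothed = gcn_layer(features, edges)
codes = predict_codes([1.0, -0.5], smoothed)  # hypothetical text embedding
print(smoothed)
print(codes)
```

After the propagation step a child code such as `E11.9` carries part of its parent's representation, so related codes end up with related vectors. In a trained model the text vector would come from the word embeddings and the propagation would include learned weight matrices.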

Beyond single-modal analysis, multi-modal data fusion can mine more value. For example, in addition to traditional disease detection in images, medical report generation has become a hot topic, that is, how to make better use of the text data. The idea comes from general-domain computer vision: given a picture, a model can not only detect which objects are in it but also generate a corresponding description, a task known as image captioning.

What additional challenges exist in medical imaging? First, medical report text tends to be long, and long text brings the problem of long-range dependence. Second, the abnormal region is relatively small, so finding and describing the abnormality is a challenge.

We therefore apply a topic attention mechanism, together with gating units and other deep learning techniques, to the generation of text reports from medical images. Our model produces better descriptive sentences for expressing abnormalities.

We also found another problem: for some diseases the available sample size may be quite small. We therefore proposed a Few-shot GAN method that lets us generate more samples of rare diseases, and we use disease graph convolution to model the intrinsic correlations among diseases, that is, the associations among disease labels. By exploiting the associations between diseases with few samples and diseases with many, this strengthens our semantic representation of diseases, rare diseases included, and further improves the quality of text generation.

For more complex data, the development of heterogeneous network technology has been very effective for making use of omics data. For example, a network like the one in the figure can be built, with node types such as genes and diseases, and even the corresponding drug compounds and their possible side effects. Different types of relationships exist between the nodes.

In this way, when studying the correlation between two genes, we can tell not only whether they are related but also why: because they are associated with the same disease, because they are both target genes of the same disease, or because they both matter to treatment with a certain drug. This is done with semantic paths in the heterogeneous network. As the figure shows, two circular nodes (genes) can be connected by a semantic path through a triangle (disease) or through a square (compound). In this way richer semantic relations can be obtained.

Let’s take a simplified version of this problem: identifying miRNAs similar to given miRNAs. Such a heterogeneous graph lets us consider the different meta-paths, for example whether two miRNAs are similar through genes or through diseases.

Building on the above work, multi-source and multi-modal data can be fused further to study knowledge-graph-based medical image report generation.

As mentioned earlier, medical report generation uses both medical images and medical text. Some labels of the medical text or images correspond to entities in medical knowledge graphs, so a medical knowledge graph can also be brought into learning to obtain better text reports for medical images.

This brings its own challenge, one we are currently studying: there may be knowledge graphs from different fields, and within the medical field there may be several knowledge graphs from different institutions. These medical knowledge graphs need to be aligned, which is again a problem of knowledge standardization and quality processing in the medical field.
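A semantic path can be made concrete by counting, for a pair of genes, how many two-step paths connect them through each intermediate node type. The sketch below is illustrative only: the node identifiers, the `type:name` naming convention and the edges are hypothetical, and real meta-path similarity measures usually normalize these counts.

```python
from collections import defaultdict

# Hypothetical typed edges in a heterogeneous network ("type:name" node ids).
edges = [
    ("gene:g1", "disease:d1"), ("gene:g2", "disease:d1"),
    ("gene:g1", "compound:c1"), ("gene:g2", "compound:c1"),
    ("gene:g1", "compound:c2"), ("gene:g3", "compound:c2"),
]

def node_type(node):
    """Return the type prefix of a 'type:name' node id."""
    return node.split(":", 1)[0]

def build_adjacency(edges):
    """Undirected adjacency sets for the network."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def metapath_count(adj, source, target, via_type):
    """Number of paths source -> (node of via_type) -> target."""
    return sum(
        1 for mid in adj[source]
        if node_type(mid) == via_type and target in adj[mid]
    )

adj = build_adjacency(edges)
g1_g2_via_disease = metapath_count(adj, "gene:g1", "gene:g2", "disease")
g1_g2_via_compound = metapath_count(adj, "gene:g1", "gene:g2", "compound")
g1_g3_via_disease = metapath_count(adj, "gene:g1", "gene:g3", "disease")
g1_g3_via_compound = metapath_count(adj, "gene:g1", "gene:g3", "compound")
print(g1_g2_via_disease, g1_g2_via_compound, g1_g3_via_disease, g1_g3_via_compound)
```

Here g1 and g2 are related through both a disease and a compound, while g1 and g3 are related only through a compound, so the path type tells us why two genes are similar and not merely that they are.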

3. Open interconnection: from complex to simple

As the research above shows, there are methods, applications and optimizations for each type of medical big data, and they have produced very good results. But the medical data itself also raises security issues that have to be taken into account.

Sharing and interconnecting medical data is an open problem, and we have done some exploratory work on the technology. This is the third part: open interconnection.

The aim of open technology is to make the cumbersome process of acquiring data more convenient and simple. Today, obtaining medical data means going through a very complicated application process, and even then access to the data is in most cases very limited. We propose an open model based on data autonomy: data is encapsulated in data boxes, and users access data with the data box as the unit of access. Data owners can decide for themselves which data may be accessed.

In addition, to restrict how data is accessed, the data box includes a function for monitoring data usage behavior. A user may only need some statistics of the data rather than the ability to read every record, and behavior monitoring enforces such limits. This motivates data owners to open up their data more readily and more easily, and the data box is also convenient for users. In this way the rights and interests attached to the data are protected while the data is open. We also use blockchain to record the behavior of every user who has used the data, so that usage can be traced.

We also consider how convenient it is for data owners to provide data, that is, we provide an interface for data interconnection. For example, when several data owners run several systems, software interface technology can link the data: given the configuration requirements, an interface is connected to the corresponding system and the data is brought onto the platform.

In this process, data users are governed by the data interconnection platform, which defines which usage behaviors are allowed and which are not, and records them in logs. In addition, when a user wants to run intelligent analysis on the data, the platform allocates a container for the data, determining which computing power it can use, and algorithm training can then be carried out on the data.

We combine the strengths of data, computing power and methods. This lets data owners share and contribute their data more readily. Data controllers are mainly concerned with protecting data security, while research institutions and enterprises working on AI algorithms care more about how to analyze and study the methods they develop. Through the approach above, real-time, high-quality, interoperable data can be provided efficiently and on demand. So far, a series of technologies for the interconnection and interoperability of medical big data has been formed, and a training testbed for medical AI algorithms has been built.
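The data box idea, statistics-only access plus a traceable usage record, can be sketched as a small class. This is a hypothetical illustration, not the speaker's system: the `DataBox` class, its method names and the sample records are assumptions, and the hash-chained list is a deliberately simplified stand-in for a blockchain ledger.

```python
import hashlib
import json
import statistics

class DataBox:
    """Hypothetical data box: wraps records and exposes only owner-approved aggregates."""

    def __init__(self, records, allowed_ops):
        self._records = list(records)
        self._allowed_ops = set(allowed_ops)
        self.log = []  # hash-chained usage log, a simplified stand-in for a blockchain

    def _append_log(self, user, op, field, allowed):
        prev_hash = self.log[-1]["hash"] if self.log else "0" * 64
        entry = {"user": user, "op": op, "field": field,
                 "allowed": allowed, "prev": prev_hash}
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        entry["hash"] = hashlib.sha256(payload).hexdigest()
        self.log.append(entry)

    def query(self, user, op, field):
        """Run an aggregate operation; every attempt is logged, allowed or not."""
        allowed = op in self._allowed_ops
        self._append_log(user, op, field, allowed)
        if not allowed:
            raise PermissionError(f"operation {op!r} is not permitted on this data box")
        values = [record[field] for record in self._records]
        if op == "mean":
            return statistics.mean(values)
        if op == "count":
            return len(values)
        raise ValueError(f"unknown operation {op!r}")

    def verify_log(self):
        """Recompute the hash chain to detect tampering with past entries."""
        prev_hash = "0" * 64
        for entry in self.log:
            body = {k: v for k, v in entry.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode("utf-8")
            if entry["prev"] != prev_hash or hashlib.sha256(payload).hexdigest() != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True

box = DataBox([{"age": 40}, {"age": 60}], allowed_ops={"mean", "count"})
mean_age = box.query("researcher", "mean", "age")
try:
    box.query("researcher", "read", "age")  # raw reads are not on the allow list
    denied = False
except PermissionError:
    denied = True
print(mean_age, denied, box.verify_log())
```

The owner decides which operations go on the allow list, the user never touches individual records, and both the permitted and the refused request end up in the log, which is the combination of autonomy, restricted use and traceability described above.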

4. Summary and Outlook

Finally, the summary. We have seen that even shallow use of medical data resources has produced great value, and newer technologies can further advance the use and development of medical big data. It is therefore necessary to explore deeper ways of using and developing data resources. The exploration of metaverse technology in the medical industry has also attracted a lot of attention recently, and it poses new challenges for the analysis and use of medical data.

We hope that through in-depth analysis of medical big data and deeper exploration of interconnection technology, we can better support the development of the digital health industry, empower the healthcare of the future, transform medical service models, promote health for all, and strengthen the cornerstone of health. That is all for my talk. Thank you all.

This article is reproduced from: https://www.leiphone.com/category/industrynews/6UjJfrb1lemgh4ib.html