[Shared by Zhuang Fuzhen of Beihang University] Application of NN Models in Financial Risk Control Scenarios

IEEE x ATEC

The IEEE x ATEC Technology Thinking Conference is a technology salon jointly sponsored by the professional technical society IEEE and the cutting-edge technology exploration community ATEC. It invites industry experts and scholars to share frontier explorations and technical practice in support of digital development.

As society digitizes and networked, intelligent services deepen, the various risks those services give rise to cannot be ignored. The theme of this session is “The Risk and Confrontation of Internet Fraud”: five guests share the risks and countermeasures in online fraud scenarios from different technical fields and vantage points.

The following is a speech by Researcher Zhuang Fuzhen, “The Application of NN Model in Financial Risk Control Scenarios”.


Speaker | Zhuang Fuzhen

Researcher at the Institute of Artificial Intelligence, Beihang University

ATEC Technology Elite Senior Advisory Committee Expert

“Application of NN Model in Financial Risk Control Scenarios”

It is a great pleasure to attend the IEEE x ATEC Technology Thinking Conference. The topic I am sharing today is “The Application of NN Models in Financial Risk Control Scenarios”. My speech is divided into three parts: the background, our research work, and a brief summary.

As we all know, the third-party online payment market has developed rapidly over the past ten years. At the same time, criminal activity around online transactions has also increased significantly, and such transaction fraud seriously threatens the online payment industry. In 2016, the Internet Crime Complaint Center received nearly 3.8 million complaints, involving more than $1.3 billion in financial losses. The most common types of online transaction fraud are account theft and card theft. Account theft refers to unauthorized account manipulation or transactions made by fraudsters after taking control of someone’s payment account, usually through compromised credentials. Card theft means that information about someone’s card, such as the card number and billing information, has been obtained by a fraudster and used for unauthorized charges.

Below I will share some of the research work we have done jointly with Ant Group. There are three main pieces of work: the first is user event sequence analysis based on a neural hierarchical factorization machine (SIGIR 2020); the second is fraud detection based on a dual importance-aware factorization machine (AAAI 2021); and the third, on the interpretability side, is cross-domain fraud detection that models user behavior sequences with hierarchical interpretable networks (WWW 2020).

1. User Event Sequence Analysis Based on the Neural Hierarchical Factorization Machine

The first work is user event sequence analysis based on the neural hierarchical factorization machine. In the payment business, every user goes from registering with the system, to logging in, to putting chosen products into the shopping cart, and finally to making a transaction or payment. Based on these account dynamics, we can determine whether the next payment is fraudulent, and the account dynamics carry rich sequential information. However, work that focuses only on feature combinations, or only on sequence information, models user event sequences from a single perspective, either the event representation or the sequence side alone. We want to design a hierarchical model that combines both aspects for fraud detection.


There are two cases in the figure on the right; one is a record of movie reviews on a certain website (as shown in Figure 1), which is also a user behavior sequence. The biggest contribution here is how to represent an event: as we just saw, each event actually contains many features.


As shown in Figure 2, an event contains features X1 through Xn. In our scenario, a user’s event sequence contains T events, e1 through eT, and each event has 56 features: 50 categorical and 6 numerical. Combinations between features within an event are especially discriminative for predicting fraud; for example, a cross-border transaction within one minute is easy to judge as card theft. We use the FM (factorization machine) model to capture these feature combinations: FM automatically performs second-order feature combination in the embedding space. Looking at the event representation (Figure 2): vi and vj are the embedding-space representations of two features, their interaction models the combination of that pair, and xi and xj act as weights. In the end, we obtain an event representation built from the feature interactions.
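
To make this concrete, here is a minimal sketch, assuming PyTorch and illustrative shapes, of how an FM-style layer turns one event's features into a vector through second-order interactions. `fm_event_representation` is a hypothetical helper, not the authors' code:

```python
import torch

def fm_event_representation(x: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """x: (n,) feature values x_1..x_n; V: (n, d) embeddings v_1..v_n.
    Returns a d-dim event representation from all pairwise interactions,
        e = sum_{i<j} (x_i * v_i) * (x_j * v_j)   (elementwise product),
    computed with the standard FM identity in O(n*d) instead of O(n^2 * d)."""
    xv = x.unsqueeze(1) * V              # (n, d): each row is x_i * v_i
    sum_sq = xv.sum(dim=0).pow(2)        # (sum_i x_i v_i)^2, elementwise
    sq_sum = xv.pow(2).sum(dim=0)        # sum_i (x_i v_i)^2, elementwise
    return 0.5 * (sum_sq - sq_sum)       # (d,) event representation

# Example: one event with 56 features embedded in 16 dimensions.
x = torch.randn(56)
V = torch.randn(56, 16)
print(fm_event_representation(x, V).shape)  # torch.Size([16])
```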


Once events are represented, we want a better sequence representation, that is, a better feature representation extracted for the whole sequence. Each user sequence contains multiple events, and pairs of events occurring in combination are more discriminative for fraud detection. Likewise, we want to consider the effect of the order between events: for example, doing event A first and then event B may increase the possibility of fraud, and we want the model to capture such sequential effects. As before, the combination of events, denoted S, is modeled with a factorization machine: different events are combined in pairs, with qi and qj acting as weights. For the sequential effects, we consider two aspects: one is the importance of the event itself, captured by a self-attention mechanism and represented as Sself; the other uses an RNN, specifically a bidirectional LSTM, to model the historical sequence of event information. In the end, the sequence representation is composed of three parts: the pairwise combination of events, the self-attention over events, and the events’ own sequential features. Combining the three gives the overall sequence representation.
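
The sketch below, with assumed layer sizes rather than the paper's exact design, shows how the three parts described above (FM-style pairwise event interactions, self-attention over events, and a bidirectional LSTM) can be combined into one sequence representation:

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=1, batch_first=True)
        self.bilstm = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        """E: (B, T, d) event representations e_1..e_T for a batch of sequences."""
        # (1) FM-style pairwise event interactions -> one vector per sequence.
        s_pair = 0.5 * (E.sum(1).pow(2) - E.pow(2).sum(1))   # (B, d)
        # (2) Self-attention: the importance of each event, in context.
        s_self, _ = self.attn(E, E, E)                       # (B, T, d)
        s_self = s_self.mean(dim=1)                          # (B, d)
        # (3) BiLSTM: order-sensitive summary of the event history.
        h, _ = self.bilstm(E)                                # (B, T, d)
        s_seq = h.mean(dim=1)                                # (B, d)
        # Combine the three views into the overall sequence representation.
        return torch.cat([s_pair, s_self, s_seq], dim=-1)    # (B, 3d)

enc = SequenceEncoder(d=16)
print(enc(torch.randn(4, 10, 16)).shape)  # torch.Size([4, 48])
```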


The figure on the right shows the framework we propose, the Neural Hierarchical Factorization Machine (NHFM). From the bottom up: the features of each event are encoded to obtain the event representation, and from the event representations we learn the sequence representation. Once extracted, the model feeds that representation through a multilayer perceptron for output, and we also apply a linear classifier to the raw features. In the end, the two parts are summed and passed through a sigmoid to obtain an output between 0 and 1. The optimization objective is a cross-entropy loss over all N labeled examples. This is the skeleton of our model.
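
A minimal sketch of this output head, under assumed sizes (the encoder output width and feature count here are illustrative):

```python
import torch
import torch.nn as nn

seq_repr = torch.randn(8, 48)      # sequence representations from the encoder
raw_feats = torch.randn(8, 56)     # raw event features for the linear part
labels = torch.randint(0, 2, (8,)).float()

mlp = nn.Sequential(nn.Linear(48, 32), nn.ReLU(), nn.Linear(32, 1))
linear = nn.Linear(56, 1)          # linear classifier on raw features

logit = mlp(seq_repr) + linear(raw_feats)         # deep part + linear part
prob = torch.sigmoid(logit).squeeze(-1)           # output between 0 and 1
loss = nn.functional.binary_cross_entropy(prob, labels)  # cross-entropy over N examples
print(loss.item())
```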


In this experiment we use real datasets from industry: from an e-commerce platform, we obtained datasets for three regions. The positive examples are fraudulent behaviors and the negative examples are normal transactions; normal transactions vastly outnumber fraudulent ones, so the classes are very imbalanced. We also ran experiments on public datasets, including a movie dataset. As baselines we compare against advanced algorithms such as W&D (Wide & Deep), NFM, DeepFM, xDeepFM, and M3, which uses a mixture of models to learn long-term and short-term sequence dependencies simultaneously.


Our evaluation metric is the recall at the low user-disturbance rates that real industrial scenarios care about. That is, when we produce results, we call the top-ranked users to tell them a transaction may be fraud; if we make 1000 calls, those 1000 should be fraudulent, so the higher that ratio, the better. The metric therefore focuses on the head of the ROC curve (FPR <= 1%). We also ran ablations, for example a variant that only considers the event-interaction module. NHFM, our full model, is superior to all baseline algorithms, and even the ablated versions beat the baselines by a clear margin. Our model can also extract high-risk features and high-risk events, which provides good interpretability for the fraud detection task. For example, changing from a Chinese IP to an American IP in a short period may indicate a fraudulent transaction, as may a consumption amount 10 or 100 times larger than usual. It also surfaces high-risk and low-risk features of the computer operating system being used, as well as time periods, transaction amounts, and so on that clearly point to whether a transaction is fraudulent.
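
A small sketch of this metric, reading the recall (TPR) at the head of the ROC curve (FPR <= 1%) with scikit-learn; the toy labels and scores are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_curve

def recall_at_fpr(y_true, y_score, max_fpr=0.01):
    """Best recall achievable while disturbing at most max_fpr of normal users."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return tpr[fpr <= max_fpr].max()

y_true = np.array([0] * 990 + [1] * 10)   # highly imbalanced toy labels
y_score = np.random.rand(1000)            # model scores (random here)
print(recall_at_fpr(y_true, y_score))
```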


From the figure at the lower left, we can see that this kind of IP change, along with changes in other feature values and field values, leads to some fraudulent behaviors.

2. Fraud Detection Based on the Dual Importance-Aware Factorization Machine


The second work applies a factorization machine based on dual importance awareness to fraud detection. In the first work, you could see the IP constantly changing: we need to account for how the value of a field evolves across a serialized sequence of events. That is, both the evolution of the same field’s value and the interaction between different field values are very important, yet existing work has not attended to both at once. We therefore designed the DIFM model to combine these two aspects simultaneously.


We built the framework on the FM model as well. First, for each field, we use FM to capture how its value evolves between pairs of events. Looking at Figure 3, along the brown direction we take a feature such as f1, which changes across events, and model that evolution; this is our new contribution. On top of the FM modeling, we add a Field Importance-aware module, which uses an attention mechanism to perceive which field’s evolution matters more for the prediction. In the other direction, for each event, the model captures the pairwise interactions between different field values through FM (the blue part of the figure), and an Event Importance-aware module then uses attention to perceive which event matters more (the green part of the figure). Finally, we combine the information produced by the Field Importance-aware and Event Importance-aware modules with the current event’s features to output the prediction. As you can see, the model is relatively simple and practical: in this business scenario it can be deployed online with high efficiency and good results. This is the second work we propose.
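
As a sketch of the importance-aware idea used twice in DIFM, here is a minimal attention-pooling module with assumed dimensions; `ImportanceAware` is a hypothetical name, instantiated once to score field evolutions and once to score events:

```python
import torch
import torch.nn as nn

class ImportanceAware(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, 1))

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        """H: (B, m, d) item representations; returns a (B, d) weighted summary."""
        a = torch.softmax(self.score(H), dim=1)   # (B, m, 1) importance weights
        return (a * H).sum(dim=1)                 # attention-weighted pooling

field_module = ImportanceAware(d=16)   # which field's evolution matters most
event_module = ImportanceAware(d=16)   # which event matters most
print(field_module(torch.randn(4, 56, 16)).shape)  # torch.Size([4, 16])
```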


The experiments for the second work use the same three-region datasets as the first work. Here we added further baselines, including AFM, an LSTM used for fraud detection, and Latent Cross, which integrates context information into an RNN.


As the results show, we again evaluate by the recall at a low user-disturbance rate. DIFM, at the bottom (our model), is much better than all the baselines, including in the ablation experiments: DIFM-α only considers field value evolution, DIFM-β only considers field value interaction, and DIFM, which combines the two sub-models, is superior to all of the comparison algorithms. This is the simple and practical algorithm we mentioned.


In terms of interpretability, this model can also extract high-risk features and high-risk events. In the figure at the upper right, each change is marked with a blue circle: you can see the suffix of the card number falling into different intervals, and changes such as a new card value or a new IP accompany fraudulent behavior, which we can catch. This is what we propose: explicitly modeling how a field’s value changes across events and sequences for fraud detection, which also provides a good basis for explanation. As we all know, interpretability is essential in financial fraud detection: when you tell a user that a transaction is fraud, you must tell them which features may have violated which rules, or which events may have led to the fraudulent behavior. Interpretability is therefore a very important job. In the following work, we look at the whole pipeline from the perspective of interpretability, at the feature level, at the event level, and also at the cross-domain level, building a hierarchical interpretability model. We therefore propose cross-domain fraud detection that uses hierarchical interpretable networks to model user behavior sequences.

3. Cross-Domain Fraud Detection Using Hierarchical Interpretable Networks to Model User Behavior Sequences


The motivation is relatively simple and direct. First, as we saw earlier, user behavior sequences are very important. Second, we want to consider how interpretability can help the business. Third, when an e-commerce platform starts a new business in a different region, the small amount of data there may prevent good modeling, so we hope to transfer from regions with more mature data or more mature models, learning from them to build a cross-domain fraud detection model.


We propose this hierarchical interpretable network. First, we propose a feature-level and event-level interpretability network to detect fraud. The figure on the right shows the framework: as before, features are encoded at the front, the Field-level Extractor produces the event representation, and from the event representations comes the sequence representation. There is also what we call the Wide layer, a linear classifier learned directly from the features, connected in series with a multilayer perceptron. The interpretability of this single-domain model shows up in two places: which fields and features are more important, and which historical events in the sequence are more important.


At each step, the first component is the Look-up embedding, which maps each feature value into a vector. The transformation rules are split into categorical and numerical cases, each applied with its own formula. The Field-level Extractor then produces the event representation: whereas the earlier work only considered pairwise interactions between features to show which feature is more important, here we add a weight w_it that normalizes the importance of feature i at time t. Each event likewise has an importance weight u_t, given by the expression shown. Underneath is the Wide layer, a linear part learned directly from the features. Finally, for prediction we again use an MLP with a sigmoid to map the output between 0 and 1, and learn the whole problem with a cross-entropy loss; this is L(θ).
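
To make the Look-up embedding step concrete, here is a minimal sketch under assumed rules (the paper's exact formulas may differ): categorical values index an embedding table, while each numerical value scales a learned vector for its field. `FieldEmbedding` is a hypothetical module:

```python
import torch
import torch.nn as nn

class FieldEmbedding(nn.Module):
    def __init__(self, vocab_size: int, num_numeric: int, d: int):
        super().__init__()
        self.cat_table = nn.Embedding(vocab_size, d)                # categorical look-up
        self.num_vecs = nn.Parameter(torch.randn(num_numeric, d))   # one vector per numeric field

    def forward(self, cat_ids: torch.Tensor, num_vals: torch.Tensor) -> torch.Tensor:
        """cat_ids: (B, 50) integer ids; num_vals: (B, 6) scalar values."""
        cat_emb = self.cat_table(cat_ids)                   # (B, 50, d)
        num_emb = num_vals.unsqueeze(-1) * self.num_vecs    # (B, 6, d): value scales its vector
        return torch.cat([cat_emb, num_emb], dim=1)         # (B, 56, d) event field embeddings

emb = FieldEmbedding(vocab_size=1000, num_numeric=6, d=16)
out = emb(torch.randint(0, 1000, (4, 50)), torch.randn(4, 6))
print(out.shape)  # torch.Size([4, 56, 16])
```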


We also propose a transfer learning framework. As mentioned, different regions or scenarios may have more or less data, and we want the domain with more data to help the one with less. We call the data-poor side the Target (Target Events) and the data-rich side the Source (Source Events). Here we hope to learn what is specific to the source and target domains as well as what the two share, so that the Source can contribute knowledge that helps the Target learn and predict. We consider several aspects in our scenario: the embedding strategy (and why we propose it), the separation of shared and domain-specific behavior-sequence extractors, and domain attention, which to some extent explains how much the source domain helped the target problem, together with how to align the distributions between different domains (Aligning Distributions). Interpretability here is reflected through the Domain Attention.
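
A minimal sketch, under an assumed design rather than the paper's exact architecture, of splitting the sequence extractor into Domain-Shared and Domain-Specific parts and mixing their outputs with a learned domain-attention weight:

```python
import torch
import torch.nn as nn

class DomainMixEncoder(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.shared = nn.GRU(d, d, batch_first=True)    # shared by source and target
        self.specific = nn.GRU(d, d, batch_first=True)  # one per domain in practice
        self.gate = nn.Linear(2 * d, 1)                 # domain-attention weight

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        """E: (B, T, d) event representations for one domain."""
        _, h_shared = self.shared(E)
        _, h_spec = self.specific(E)
        h_shared, h_spec = h_shared[-1], h_spec[-1]     # (B, d) each
        a = torch.sigmoid(self.gate(torch.cat([h_shared, h_spec], dim=-1)))
        # a is interpretable: how much the shared (transferred) knowledge
        # contributes versus the domain's own representation.
        return a * h_shared + (1 - a) * h_spec

enc = DomainMixEncoder(d=16)
print(enc(torch.randn(4, 10, 16)).shape)  # torch.Size([4, 16])
```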


Why do we propose this embedding strategy? As we all know, the values of the same field may differ across regions. For example, the consumption-amount field differs between China and Vietnam: in China it might range from 0 to 100 yuan, while in Vietnam the range is different. Since field values can differ and users’ behavioral habits differ by region, the same extractor may not be valid for two regions at once, so the behavior-sequence extractor is split into Domain-Specific and Domain-Shared parts. That is, we transfer the domain-invariant features while keeping what is unique to each domain. Domain attention is likewise split into two factors, a domain-specific and a domain-shared representation (Specific and Shared); the calculation formula is shown in the figure. For aligning the distributions between domains, traditional alignment methods do not suit our application because the classes are extremely imbalanced: the positive-to-negative ratio can be as extreme as one in ten thousand, with only one abnormal behavior among ten thousand. We therefore propose a class-aware Euclidean distance: when computing the distance between domains, we compute it per category, considering the different classes separately.
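
A minimal sketch of this class-aware idea: instead of matching whole-domain means, which the extreme imbalance would let the majority class dominate, we match source and target means per class and sum the Euclidean distances. The function below is a hypothetical illustration:

```python
import torch

def class_aware_distance(src, src_y, tgt, tgt_y):
    """src/tgt: (N, d) representations; src_y/tgt_y: (N,) binary labels."""
    loss = src.new_zeros(())
    for c in (0, 1):  # align the normal class and the fraud class separately
        mu_s = src[src_y == c].mean(dim=0)   # source mean for class c
        mu_t = tgt[tgt_y == c].mean(dim=0)   # target mean for class c
        loss = loss + torch.norm(mu_s - mu_t, p=2)
    return loss

src, tgt = torch.randn(100, 16), torch.randn(100, 16)
src_y = torch.randint(0, 2, (100,))
tgt_y = torch.randint(0, 2, (100,))
print(class_aware_distance(src, src_y, tgt, tgt_y))
```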


Further, our transfer learning framework generalizes into a general one. In the right part of the figure, the dotted box contains our hierarchical interpretable network serving as the sequence extractor; we can replace what is inside the dotted box with other models as the event extractor. For example, we can incorporate other baselines into our transfer learning framework as special cases of it. So we only need to define which part serves as the behavior-sequence extractor to obtain such a fraud detection model.


Similarly, we used datasets from an e-commerce platform here. This time we added a region whose data is relatively scarce: the positives may be only one in several hundred or several thousand, against positive and negative examples numbering in the hundreds of thousands overall. As before, we use the region with the least data as the Target Events for our experiments. For baselines, we again choose fraud-detection baselines such as W&D, NFM, LSTM4FD, and M3R as base models. Let’s first look at the single-domain experimental results, again using the recall at a low user-disturbance rate as the evaluation metric.


These two figures show the experimental results in the four regions C1, C2, C3, and C4; our model, the last bar in each group, is much better than the baselines.


We also apply our transfer learning framework to all the base models: we put each baseline’s sequence behavior extractor into the transfer learning framework, replacing the part inside the dotted box. The blue line is the result obtained after using the transfer learning framework, and the results show that transfer learning yields better performance. The horizontal axis goes from less training data to more, for example from one week to two weeks to three weeks, so as the training data increases, the results generally improve. The blue line sitting on top means the transferred model is much better than the original. That is roughly the picture.


Looking at the interpretability of the results: at the feature level, the darker a cell in each row, the more important that feature, so the model clearly catches the important features. Along the vertical Y-axis, the darker the color, the more important the event, so we can also catch the importance of different events. Below, Domain-Shared equals 0.56, which means that when we build the Target model, the knowledge contributed by the shared part is 56% and the Target’s own part is 44%. So we provide interpretability at three levels, from the granularity of features, to events, to domains.


The model we proposed has been deployed in the ATO (account takeover) scenario of the e-commerce website, where it provides account transaction risk analysis and identification for prevention and control, and analyzes the weight values at event and attribute granularity, helping operators judge and reconstruct risk paths. This work has been deployed online.


Finally, to summarize: in the course of this cooperation, we proposed a neural hierarchical factorization machine to analyze user event sequences; we built a fraud detection model that captures field interactions and field value evolution at the same time; and we proposed a general, interpretable transfer learning framework that explains our fraud detection results. We also deployed these models online and landed the applications, which are now working well, especially in scenarios where our algorithms are integrated into the fraud detection module.

That’s it for my sharing, thank you very much.


This article is reproduced from: https://www.leiphone.com/category/industrynews/gk4jhXLZUYBmGDjH.html
