Tencent wins two championships at an authoritative international competition in information retrieval, demonstrating the technical strength of its AI large models

Recently, WSDM (Web Search and Data Mining), a top international academic conference in the field of information retrieval, announced the results of the WSDM Cup 2023 competition. Drawing on breakthroughs in large-scale model pre-training, search ranking, and ensemble learning, the research team from Tencent won the championship in both tasks of the Unbiased Learning to Rank and Pre-training for Web Search track, reflecting its leading technical strength in this field.


The ACM WSDM (Web Search and Data Mining) conference is one of the top conferences in the field of information retrieval. Jointly organized by the four ACM special interest groups SIGIR, SIGKDD, SIGMOD, and SIGWEB, it enjoys a high academic reputation in Internet search and data mining. The 16th ACM International WSDM Conference will be held in Singapore from February 27 to March 3, 2023, with a paper acceptance rate of 17.8%.

The WSDM Cup is held as part of the WSDM conference. More than 400 teams from well-known universities and companies in China, the United States, Singapore, Japan, India, and other countries participated in this year's WSDM Cup. The competition has three tracks: Unbiased Learning to Rank and Pre-training for Web Search, Multilingual Information Retrieval Across a Continuum of Languages, and the Visual Question Answering Challenge.

This time, Tencent's team, the Tencent Machine Learning Platform Department Search Team (TMLPS), entered the Unbiased Learning to Rank and Pre-training for Web Search track and won the championship in both of its sub-tasks (Pre-training for Web Search and Unbiased Learning to Rank).

The code and papers for both results have been published on GitHub (see GitHub – lixsh6/Tencent_wsdm_cup2023).

In deep learning, the quality of data annotation has a significant impact on model performance, but the high cost of labeled data has long been an obstacle for research teams. How to train models with unlabeled data has therefore become a hot topic in both academia and industry.


Paper: Multi-Feature Integration for Perception-Dependent Examination-Bias Estimation

Address: https://arxiv.org/pdf/2302.13756.pdf 


In this competition, for the Pre-training for Web Search task, the Tencent team applied methods such as large-scale model training and user-behavior feature denoising to pre-train a search-ranking model on click logs, and then effectively applied that model to the downstream relevance-ranking retrieval task. Through optimizations in pre-training, model fine-tuning, and ensemble learning, it achieved a large lead on the human-annotated relevance-ranking task.
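The team's released code is linked above; purely as a hedged illustration, the sketch below shows one common way to pre-train a query-document scorer on click logs with a pairwise objective that treats clicked results as noisy positives. The class and function names, the toy encoder, and all dimensions (ClickLogRanker, pairwise_click_loss, hidden_dim) are assumptions for illustration, not the team's implementation.

# A minimal sketch, assuming a toy encoder; not the team's released code.
# Idea: pre-train a query-document scorer on click logs with a pairwise loss
# in which clicked documents should out-score skipped ones for the same query.
import torch
import torch.nn as nn

class ClickLogRanker(nn.Module):
    def __init__(self, vocab_size=30522, hidden_dim=256):
        super().__init__()
        # Stand-in for a large pre-trained text encoder.
        self.embed = nn.EmbeddingBag(vocab_size, hidden_dim)
        self.score = nn.Linear(2 * hidden_dim, 1)

    def forward(self, query_ids, doc_ids):
        q = self.embed(query_ids)   # (batch, hidden_dim)
        d = self.embed(doc_ids)     # (batch, hidden_dim)
        return self.score(torch.cat([q, d], dim=-1)).squeeze(-1)

def pairwise_click_loss(model, query_ids, clicked_ids, skipped_ids):
    """Pairwise logistic loss: clicked docs are treated as noisy positives."""
    pos = model(query_ids, clicked_ids)
    neg = model(query_ids, skipped_ids)
    return nn.functional.softplus(neg - pos).mean()

# Example usage with random token ids (batch of 4 queries, 16 tokens each).
model = ClickLogRanker()
q = torch.randint(0, 30522, (4, 16))
clicked = torch.randint(0, 30522, (4, 16))
skipped = torch.randint(0, 30522, (4, 16))
loss = pairwise_click_loss(model, q, clicked, skipped)
loss.backward()

In practice the pre-trained scorer would then be fine-tuned on human relevance labels, and several fine-tuned models ensembled, as the article describes.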


Paper: Pretraining De-Biased Language Model with Large-scale Click Logs for Document Ranking

Address: https://arxiv.org/pdf/2302.13498.pdf

In Unbiased Learning to Rank, the other task of this track, the team mined click-log information in depth to make full use of features such as document media type, document display height, and the number of scrolls after a click. On this basis it estimated document relevance without bias and proposed a multi-feature integration model that combines multiple bias factors, effectively improving document ranking in search engines.
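As another hedged illustration of what a multi-feature examination-bias model can look like, the sketch below estimates an examination probability from position plus the presentation features named in the text (media type, display height, scroll count) and uses it as an inverse propensity weight for click labels. The architecture and the names ExaminationModel and ipw_loss are assumptions for illustration, not the published model.

# A minimal sketch of multi-feature examination-bias estimation with
# inverse propensity weighting; assumptions only, not the published model.
import torch
import torch.nn as nn

class ExaminationModel(nn.Module):
    """Predicts P(examined) from position plus presentation features."""
    def __init__(self, num_positions=10, num_media_types=5):
        super().__init__()
        self.pos_emb = nn.Embedding(num_positions, 8)
        self.media_emb = nn.Embedding(num_media_types, 8)
        self.mlp = nn.Sequential(nn.Linear(8 + 8 + 2, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, position, media_type, display_height, scroll_count):
        x = torch.cat([
            self.pos_emb(position),
            self.media_emb(media_type),
            display_height.unsqueeze(-1),   # continuous feature
            scroll_count.unsqueeze(-1),     # continuous feature
        ], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)

def ipw_loss(relevance_logits, clicks, exam_prob, eps=0.1):
    """Inverse-propensity-weighted pointwise loss: clicks on rarely examined
    slots get larger weight, which debiases the relevance estimate."""
    weight = 1.0 / exam_prob.clamp(min=eps)
    per_item = nn.functional.binary_cross_entropy_with_logits(
        relevance_logits, clicks, reduction="none")
    return (weight * per_item).mean()

The examination model and the relevance model can be trained jointly or alternately; the key point is that presentation features beyond position are allowed to modulate the propensity, which is the spirit of the multi-feature integration described above.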

It is understood that the winning team's results are built on Tencent's HunYuan AI large model (hereinafter "HunYuan") and the Taiji machine learning platform. Working jointly with the WeChat Search team, these technologies have already been deployed in multiple WeChat Search scenarios and have delivered significant improvements.

An AI large model (also known as a pre-trained model) is a pre-trained, relatively general-purpose model characterized by massive data, massive compute, and massive parameters. By learning the internal patterns and representations of sample data, a large model develops "intelligence" approaching or exceeding human level on some tasks, can analyze and reason, and can recognize text, images, and sound.

In April 2022, Tencent disclosed the development progress of the HunYuan large model for the first time. HunYuan integrates CV (computer vision), NLP (natural language processing), and multimodal understanding capabilities, and successively topped five authoritative benchmarks including MSR-VTT and MSVD, achieving a grand slam in the cross-modal field. In May 2022, it simultaneously topped three leaderboards of the internationally recognized CLUE (Chinese Language Understanding Evaluation) benchmark, breaking three records at once. Recently, HunYuan took another step forward, releasing China's first low-cost, practical trillion-parameter NLP large model and topping CLUE again.

Tencent's Taiji machine learning platform is a high-performance machine learning platform that integrates model training and online inference. It supports training and inference for trillion-parameter models, provides complete end-to-end engineering support for pre-training, inference, and deployment of AI large models, and gives algorithm engineers a one-stop solution to engineering problems such as feature processing, model training, and model serving in building AI applications.

Tencent has long been committed to research on cutting-edge search technology, improving search algorithms and enhancing users' search experience. Its technical teams have rich practical experience in retrieval pre-training, large-scale model training, and the design of objective functions for search and ranking tasks. Their many research results have achieved leading performance in international competitions and at academic conferences, and have been widely applied in business scenarios such as WeChat Search, Tencent Advertising, and games.

This article is reproduced from: https://www.leiphone.com/category/industrynews/IYkfEtgEY27LigaR.html