Keywords: Large Language Models, Instruction Tuning, Feature Engineering, LightGBM, Model Fusion
TL;DR: The 3rd Place Solution for WhoIsWho-IND
Abstract: The ultimate goal of academic data mining is to deepen our
understanding of the development, nature, and trends of science. It
offers the potential to uncover significant scientific, technological,
and educational value. For instance, deep mining of academic data
can assist governments in formulating science policies, support
companies in talent discovery, and help researchers access new
knowledge more effectively. Academic data mining encompasses
many applications centered around academic entities, such as
paper retrieval, expert discovery, and journal recommendation.
However, the lack of data benchmarks related to academic
knowledge graph mining has severely limited the development of
this field. At KDD Cup 2024, the OAG-Challenge was introduced,
consisting of three realistic and challenging academic tasks aimed
at advancing the field of academic knowledge graph mining.
One of these tasks, WhoIsWho-IND, focuses on author name
disambiguation, a problem that has grown increasingly complex with
the rapid growth of online publications. Inaccuracies in existing
disambiguation systems have led to incorrect author rankings and
award fraud. This competition challenges participants to develop a
model that detects misassigned papers for a given author.
In this work, we approached the WhoIsWho-IND task by framing
it as a binary classification problem, determining whether a paper
belongs to an author. We employed two strategies: (1) extracting
fundamental information from papers and authors, deriving their
textual representations, and then using LightGBM for
classification, and (2) fine-tuning a large language model to assess
the relevance of a list of papers to the author's historical publications.
Both methods output a probability indicating the likelihood of
correct paper assignment to the author. Our approach proved
effective, earning us third place in the competition. Our
code is published on GitHub.
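The sketch below illustrates the overall shape of such a pipeline, assuming Python with lightgbm and numpy. It is not the authors' released code: the engineered features, the stubbed LLM probabilities `p_llm`, and the fusion weight `alpha` are all hypothetical placeholders for the components the abstract describes.

```python
# Minimal sketch of the two-strategy pipeline: a LightGBM binary classifier
# over hand-crafted (paper, author) features, fused with probabilities from
# a fine-tuned LLM. All data and parameters below are illustrative.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Hypothetical engineered features per (paper, author) pair, e.g. coauthor
# overlap, venue/organization similarity, title-embedding cosine similarity.
X_train = rng.random((1000, 3))
y_train = rng.integers(0, 2, 1000)       # 1 = paper belongs to the author
X_test = rng.random((200, 3))

# Strategy (1): LightGBM classification over the feature representations.
clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X_train, y_train)
p_gbm = clf.predict_proba(X_test)[:, 1]  # probability of correct assignment

# Strategy (2): a fine-tuned LLM scores each candidate paper against the
# author's historical publications; stubbed here with random values.
p_llm = rng.random(200)

# Model fusion: a simple weighted average of the two probabilities.
alpha = 0.5                              # hypothetical fusion weight
p_final = alpha * p_gbm + (1 - alpha) * p_llm
print(p_final[:5])
```

A weighted average is only the simplest form of model fusion; the actual submission may combine the two probability streams differently.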
Submission Number: 20