Keywords: Large Language Models, Instruction Tuning, Feature Engineering, LightGBM, Model Fusion
TL;DR: The 3rd Place Solution for WhoIsWho-IND
Abstract: The ultimate goal of academic data mining is to deepen our
understanding of the development, nature, and trends of science. It
offers the potential to uncover significant scientific, technological,
and educational value. For instance, deep mining of academic data
can assist governments in formulating science policies, support
companies in talent discovery, and help researchers access new
knowledge more effectively. Academic data mining encompasses
many applications centered around academic entities, such as
paper retrieval, expert discovery, and journal recommendation.
However, the lack of data benchmarks related to academic
knowledge graph mining has severely limited the development of
this field. At KDD Cup 2024, the OAG-Challenge was introduced,
consisting of three realistic and challenging academic tasks aimed
at advancing the field of academic knowledge graph mining.
One of these tasks, WhoIsWho-IND, focuses on author name
disambiguation, a problem that has grown increasingly complex with
the rapid growth of online publications. Inaccuracies in existing
disambiguation systems have led to incorrect author rankings and
award fraud. This competition challenges participants to develop a
model that detects misassigned papers for a given author.
In this work, we approached the WhoIsWho-IND task by framing
it as a binary classification problem, determining whether a paper
belongs to an author. We employed two strategies: (1) extracting
fundamental information from papers and authors, deriving their
textual representations, and then using LightGBM for
classification, and (2) fine-tuning a large language model to assess
the relevance of a list of papers to the author's historical publications.
Both methods output a probability indicating the likelihood of
correct paper assignment to the author. Our approach proved
effective, earning us third place in the competition. Our
code is published on GitHub.
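The sketch below illustrates the overall shape of such a pipeline, assuming Python with lightgbm and numpy. It is not the authors' released code: the engineered features, the stubbed LLM probabilities `p_llm`, and the fusion weight `alpha` are all hypothetical placeholders for the components the abstract describes.

```python
# Minimal sketch of the two-strategy pipeline: a LightGBM binary classifier
# over hand-crafted (paper, author) features, fused with probabilities from
# a fine-tuned LLM. All data and parameters below are illustrative.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Hypothetical engineered features per (paper, author) pair, e.g. coauthor
# overlap, venue/organization similarity, title-embedding cosine similarity.
X_train = rng.random((1000, 3))
y_train = rng.integers(0, 2, 1000)       # 1 = paper belongs to the author
X_test = rng.random((200, 3))

# Strategy (1): LightGBM classification over the feature representations.
clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X_train, y_train)
p_gbm = clf.predict_proba(X_test)[:, 1]  # probability of correct assignment

# Strategy (2): a fine-tuned LLM scores each candidate paper against the
# author's historical publications; stubbed here with random values.
p_llm = rng.random(200)

# Model fusion: a simple weighted average of the two probabilities.
alpha = 0.5                              # hypothetical fusion weight
p_final = alpha * p_gbm + (1 - alpha) * p_llm
print(p_final[:5])
```

A weighted average is only the simplest form of model fusion; the actual submission may combine the two probability streams differently.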
Submission Number: 20