Keywords: data mining, academic citation network, text embedding, graph embedding, imbalanced data learning, KDD Cup
Abstract: In this technical report, we describe the solution that achieved 5th place in the Paper Source Tracing task of the KDD Cup OAG-Challenge. The task is to estimate the most significant references for each academic paper. We extracted information from the provided XML files and from academic databases, and used natural language processing to perform named entity resolution when assembling the data. We generated features using text embedding and graph embedding techniques. Because the dataset was small, we augmented it through oversampling and trained multiple Gradient Boosted Decision Tree (GBDT) models with hyperparameter tuning. We produced the final predictions by ensembling the trained models. Our final submission achieved a Mean Average Precision (MAP) score of $0.44278$ on the leaderboard, securing 5th place in this task. The sample source code is publicly available at https://github.com/ToyotaInfoTech/kddcup2024-oagpst-solution.
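For readers unfamiliar with the metric, the following is a minimal sketch of how Mean Average Precision could be computed for this kind of task, assuming each paper comes with a list of binary labels (1 if a reference is a significant source, 0 otherwise) and a predicted score per reference. The function names `average_precision` and `mean_average_precision` are illustrative and not taken from the authors' repository, and the official evaluation script may handle ties or papers without positive labels differently.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one paper (illustrative helper).

    labels[i] is 1 if reference i is a significant source, else 0.
    scores[i] is the predicted importance of reference i.
    """
    # Rank references from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank of each positive reference.
            precision_sum += hits / rank
    # Assumption: a paper with no positive labels contributes 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    all_labels: Sequence[Sequence[int]],
    all_scores: Sequence[Sequence[float]],
) -> float:
    """Mean of the per-paper average precision values."""
    aps = [average_precision(l, s) for l, s in zip(all_labels, all_scores)]
    return sum(aps) / len(aps) if aps else 0.0
```

Under these assumptions, a paper whose positive references are ranked first and third out of three gets an average precision of (1/1 + 2/3) / 2, and the reported score is the mean of such values over all evaluated papers.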
Submission Number: 19