Make Still Further Progress: Chain of Thoughts for Tabular Data Leaderboard
Abstract: Tabular data, a fundamental data format in machine learning, is widely used in competitions and real-world applications. Gradient-boosted decision trees and neural networks have been extensively studied for tabular prediction. However, the effectiveness of these tabular models varies significantly across datasets, so specialized expert knowledge is required to reach the top of each dataset's leaderboard. Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. Despite their success in the text and image domains, applying LLMs to tabular data has so far been limited by their insensitivity to numerical values, the scarcity of textual descriptions, and the insufficient context length caused by the abundance of tabular features. We propose a carefully designed context that overcomes these challenges and thereby activates LLMs' in-context learning ability on a wide range of tabular data. In particular, for each target instance, we construct a context from the combination of the target instance, its neighbors, and external models. On top of this tabular context, we emulate the thought process of a competition expert and design a novel Chain of Tabular Thoughts (CoT$^2$) approach. CoT$^2$ decomposes the prediction task built on our context into multiple steps, giving LLMs stronger numerical reasoning ability on tabular data. By incorporating the tabular context and CoT$^2$ into the prompt, LLMs can make accurate and interpretable predictions. Without using the datasets' textual descriptions, CoT$^2$ surpasses ensemble baselines and other LLM-based methods that rely on semantic information.
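To make the idea concrete, the following is a minimal sketch, not the paper's actual implementation, of the kind of per-instance context the abstract describes: the target instance, its nearest neighbors, and a prediction from an external tabular model, assembled into a multi-step prompt for an LLM. All names here (`build_cot2_prompt`, the choice of k, the step wording) are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import NearestNeighbors

X, y = load_breast_cancer(return_X_y=True)
X_train, y_train, x_target = X[:-1], y[:-1], X[-1]

# External model: a gradient-boosted tree whose prediction joins the context.
gbdt = GradientBoostingClassifier().fit(X_train, y_train)

# Neighbors: the k training instances closest to the target instance.
knn = NearestNeighbors(n_neighbors=3).fit(X_train)
_, idx = knn.kneighbors(x_target.reshape(1, -1))

def build_cot2_prompt(x, neighbors, labels, external_pred):
    """Assemble a multi-step, chain-of-thought-style prompt (hypothetical format)."""
    lines = ["Step 1: Compare the target instance with its labeled neighbors."]
    for i, (nb, lab) in enumerate(zip(neighbors, labels)):
        lines.append(f"Neighbor {i}: features={nb[:4].round(2).tolist()}..., label={lab}")
    lines.append(f"Target: features={x[:4].round(2).tolist()}...")
    lines.append(f"Step 2: Consider the external model's prediction: {external_pred}.")
    lines.append("Step 3: Reason step by step, then output the final label.")
    return "\n".join(lines)

prompt = build_cot2_prompt(
    x_target, X_train[idx[0]], y_train[idx[0]],
    int(gbdt.predict(x_target.reshape(1, -1))[0]),
)
print(prompt)  # This prompt would then be sent to an LLM for in-context prediction.
```

The design choice sketched here, grounding the LLM in retrieved neighbors and an external model's output rather than in the dataset's textual description, mirrors the abstract's claim that CoT$^2$ works without semantic information.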