Submission Type: Short paper (4 pages)
Keywords: spreadsheet, vision, clustering, chamfer, hausdorff, tabular, embeddings, RAG
TL;DR: Spreadsheets are designed for human readers, so their formatting is often machine-unfriendly. Discovering recurring layout templates within a library of spreadsheets makes that library faster to organise and makes its contents usable by AI systems at scale.
Abstract: Traditional methods for identifying structurally similar spreadsheets fail to capture the spatial layouts and type patterns that define templates. We present a hybrid distance metric combining spatial positioning, data type information, and semantic embeddings to measure similarity between spreadsheets. Our approach transforms spreadsheets into cell-level embeddings, then applies aggregation strategies including Chamfer and Hausdorff distances to compute spreadsheet similarity. Experiments across template families demonstrate superior unsupervised clustering performance compared to the graph-based Mondrian baseline, achieving perfect template reconstruction on the FUSTE dataset (Adjusted Rand Index of $1.00$ versus $0.90$ for the baseline). Our method enables automated template discovery at scale, facilitating downstream applications including bulk data cleaning, model training, and retrieval-augmented generation over tabular collections.
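To illustrate the aggregation step named in the abstract, the following is a minimal sketch of Chamfer and Hausdorff distances between two sets of cell-level vectors. It is not the submission's implementation: the function names, the use of Euclidean distance, the mean-based Chamfer variant, and the toy 2-D "embeddings" (plain row and column positions) are all assumptions made for illustration. The hybrid metric described above would also include data type and semantic components.

```python
import numpy as np

def pairwise_dist(a, b):
    # Euclidean distance between every row of a and every row of b.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def chamfer(a, b):
    # Symmetric Chamfer distance (mean variant, an assumed convention):
    # average nearest-neighbour distance from a to b, plus from b to a.
    d = pairwise_dist(a, b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def hausdorff(a, b):
    # Symmetric Hausdorff distance: the worst-case nearest-neighbour
    # distance, taken over both directions.
    d = pairwise_dist(a, b)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Hypothetical toy sheets: each row is one cell's (row, column) position.
sheet_a = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
sheet_b = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])

print(chamfer(sheet_a, sheet_b))    # averages over cells, so one stray cell is damped
print(hausdorff(sheet_a, sheet_b))  # dominated by the single stray cell
```

The two aggregations respond differently to outliers: Chamfer averages over cells, so one stray cell shifts the score only slightly, whereas Hausdorff reports the single worst mismatch.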
Relevance Comments: This work directly addresses a core challenge in AI for tabular data: organizing and retrieving spreadsheets at scale. Our hybrid distance metric enables template discovery, a critical primitive for table-based RAG systems, foundation model pretraining, and automated data wrangling pipelines, all of which are highlighted in the workshop's scope.
Submission Number: 48