Quantifying Classification Performance through Combinatorial Geometry and Localized Data Analysis

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: lower bound, geometrical insights, local data, classification performance, combinatorics, linear separation
Abstract: Understanding the theoretical boundaries of a learning mechanism and ascertaining its fundamental capabilities remains a persistent challenge in machine learning. While the VC-dimension has been instrumental in quantifying a model's data-fitting abilities, its independence from the data distribution sometimes limits its practicality. In this study, we address the problem of establishing realistic bounds on a model's classification power by harnessing the underlying combinatorial geometry of data using novel tools. We introduce conditions that rely on \emph{local} computations performed on small data subsets to determine the \emph{global} performance of classifiers. Specifically, given a dataset $\{(X_i,y_i)\}_{i=1}^{n}$, where $X_i\in\mathbb{R}^d$ is a feature vector and $y_i$ is the corresponding label, we establish optimal bounds on the training error (in terms of the number of misclassifications) of a linear classifier based on the linear separability of local data subsets, each comprising $(d + 2)$ data points. We also prove an optimal bound on the margin of Support Vector Machines (SVMs) in terms of their performance on $(d+2)$-sized subsets. Furthermore, we extend these results to a non-linear classifier employing hypersphere boundary separation. Our experimental results underscore the significance and applicability of these theoretical bounds in real-world machine learning scenarios. This research contributes valuable insights into assessing the classification potential of both linear and non-linear models for large datasets. By emphasizing local computations on subsets of data with fixed cardinality, it provides a foundation for informed and efficient decision-making in practical machine learning applications.
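The abstract's central idea, that global training error of a linear classifier can be bounded via the linear separability of $(d+2)$-point subsets, can be illustrated with a small sketch. The snippet below is not the authors' code; it simply samples random $(d+2)$-sized subsets, tests each for linear separability with a high-penalty linear SVM, and reports the fraction that are separable. The helper names `is_linearly_separable` and `local_separability_rate` are hypothetical, and the random-subsampling strategy (rather than exhaustive enumeration of all subsets) is an assumption for tractability.

```python
# Minimal sketch, assuming the local-to-global idea from the abstract:
# estimate how often (d + 2)-point subsets of a labeled dataset are
# linearly separable. This fraction is only an illustrative proxy for
# the local conditions the paper analyzes.

import numpy as np
from sklearn.svm import LinearSVC


def is_linearly_separable(X, y):
    """Return True if the small labeled subset (X, y) is linearly separable,
    tested by fitting a large-C linear SVM and checking for zero training errors."""
    if len(set(y)) < 2:  # a single-class subset is trivially separable
        return True
    clf = LinearSVC(C=1e6, max_iter=10000)
    clf.fit(X, y)
    return bool(np.all(clf.predict(X) == y))


def local_separability_rate(X, y, n_subsets=2000, rng=None):
    """Sample random subsets of size d + 2 and return the fraction
    of sampled subsets that are linearly separable."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    k = d + 2
    hits = 0
    for _ in range(n_subsets):
        idx = rng.choice(n, size=k, replace=False)
        hits += is_linearly_separable(X[idx], y[idx])
    return hits / n_subsets


# Example usage on synthetic, nearly separable data:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)
    print(local_separability_rate(X, y, n_subsets=500, rng=0))
```

A high local separability rate would be consistent with a small global training error for a linear classifier, which is the direction of the bounds described in the abstract; the exact quantitative relationship is given by the paper's theorems, not by this sketch.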
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7551