Distributional Structured Pruning by Lower bounding the Total Variation Distance using Witness functions

Chaitanya Murti; Chiranjib Bhattacharyya

24 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: representation learning for computer vision, audio, language, and other modalities

Keywords: Pruning, Structured Pruning, Total Variation Distance

TL;DR: Discrimination-based structured pruning using novel lower bounds on the TV distance.

Abstract: Recent literature introduced the notion of distributional structured pruning (DSP) in deep neural networks, which retains discriminative filters that can effectively differentiate between classes. Crucial to DSP is the ability to estimate the discriminative ability of a filter, defined as the minimum pairwise Total Variation (TV) distance between the class-conditional feature distributions. Since computing the TV distance is generally intractable, existing literature assumes that the class-conditional feature distributions are Gaussian, which enables the use of the tractable Hellinger lower bound to estimate discriminative ability. However, the Gaussian assumption is restrictive and typically does not hold. In this work, we address this gap by deriving a lower bound on the TV distance that depends only on the moments of witness functions. With linear witness functions, the bound establishes new relationships between the TV distance and well-known discriminant-based classifiers, such as Fisher Discriminants and Minimax Probability Machines. Varying the choice of witness function in these lower bounds yields a family of pruning algorithms called WitnessPrune. Empirically, with a polynomial witness function, WitnessPrune achieves up to 7% higher accuracy than the state of the art at similar sparsity in hard-to-prune layers.
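
The idea of scoring a filter by a moment-based lower bound on the TV distance can be illustrated with a minimal sketch. It is not the bound derived in the paper. It assumes an elementary Cauchy-Schwarz bound, TV(P, Q) >= (E_P[f] - E_Q[f])^2 / (2 (E_P[f^2] + E_Q[f^2])), which holds for any witness function f with finite second moments, and it replaces the population moments with sample averages. The function names `tv_lower_bound` and `discriminative_score`, and the use of one scalar feature per sample, are assumptions made for illustration only.

```python
from itertools import combinations


def tv_lower_bound(xs, ys, witness=lambda x: x):
    """Plug-in estimate of (E_P f - E_Q f)^2 / (2 (E_P f^2 + E_Q f^2)).

    xs, ys: samples of a scalar feature from two classes (assumed layout).
    witness: the witness function f; the default is the identity (linear).
    """
    fx = [witness(x) for x in xs]
    fy = [witness(y) for y in ys]
    mean_gap = sum(fx) / len(fx) - sum(fy) / len(fy)
    second_moments = (sum(v * v for v in fx) / len(fx)
                      + sum(v * v for v in fy) / len(fy))
    if second_moments == 0.0:
        # Both witnesses are identically zero, so the bound is vacuous.
        return 0.0
    return mean_gap ** 2 / (2.0 * second_moments)


def discriminative_score(features_by_class, witness=lambda x: x):
    """Minimum of the pairwise lower bounds over all class pairs."""
    classes = sorted(features_by_class)
    return min(
        tv_lower_bound(features_by_class[a], features_by_class[b], witness)
        for a, b in combinations(classes, 2)
    )


# Hypothetical features of one filter for three classes.
features = {0: [1.0, 1.0], 1: [-1.0, -1.0], 2: [1.0, 1.0]}
score = discriminative_score(features)
```

In this toy input, classes 0 and 2 have identical features, so the minimum pairwise estimate is zero and the filter would rank as non-discriminative. This particular bound is not invariant to shifting the witness function, so the choice of f (for example, a centered linear or polynomial function) affects how tight the estimate is.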

Submission Number: 9311
