**Abstract:** Feedforward neural networks with homogeneous activation functions possess an internal symmetry: the functions they compute do not change when the incoming and outgoing weights at any hidden unit are rescaled by reciprocal positive values. This paper makes two contributions to our understanding of these networks. The first is to describe a simple procedure, or *fix*, for balancing the weights in these networks: this procedure computes multiplicative rescaling factors---one at each hidden unit---that rebalance the weights of these networks without changing the end-to-end functions that they compute. Specifically, given an initial network with arbitrary weights, the procedure determines the functionally equivalent network whose weight matrix is of minimal $\ell_{p,q}$-norm; the weights at each hidden unit are said to be balanced when this norm is stationary with respect to rescaling transformations. The optimal rescaling factors are computed in an iterative fashion via simple multiplicative updates, and the updates are notable in that (a) they do not require the tuning of learning rates, (b) they operate in parallel on the rescaling factors at all hidden units, and (c) they converge monotonically to a global minimizer of the $\ell_{p,q}$-norm. The paper's second contribution is to analyze the optimization landscape for learning in these networks. We suppose that the network's loss function consists of two terms---one that is invariant to rescaling transformations, measuring predictive accuracy, and another (a regularizer) that breaks this invariance, penalizing large weights. We show how to derive a weight-balancing *flow* such that the regularizer remains minimal with respect to rescaling transformations as the weights descend in the loss function. These dynamics reduce to an ordinary gradient flow for $\ell_2$-norm regularization, but not otherwise. In this way our analysis suggests a canonical pairing of alternative flows and regularizers.
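The rescaling symmetry and the notion of balanced weights described in the abstract can be illustrated with a minimal sketch. The snippet below (an illustration, not the paper's general algorithm) uses a one-hidden-layer ReLU network, where the $\ell_2$-balancing factors have a closed form: the rescaling $a_i = (\lVert w^{\text{out}}_i \rVert / \lVert w^{\text{in}}_i \rVert)^{1/2}$ equalizes the incoming and outgoing norms at each hidden unit. In deeper networks the optimal factors couple across layers, which is where the paper's iterative multiplicative updates come in.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(W1, W2, x):
    # One-hidden-layer ReLU network: x -> W2 @ relu(W1 @ x).
    return W2 @ relu(W1 @ x)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))   # incoming weights of 5 hidden units (rows)
W2 = rng.normal(size=(2, 5))   # outgoing weights (columns)
x = rng.normal(size=3)

# Rescaling invariance: scale unit i's incoming row by a_i > 0 and its
# outgoing column by 1/a_i. Because ReLU is positively homogeneous,
# the end-to-end function is unchanged.
a = rng.uniform(0.5, 2.0, size=5)
y_before = forward(W1, W2, x)
y_after = forward(a[:, None] * W1, W2 / a[None, :], x)
assert np.allclose(y_before, y_after)

# Balancing for the squared l2-norm: minimizing
#   a_i^2 * ||w_in_i||^2 + ||w_out_i||^2 / a_i^2
# over a_i gives a_i = sqrt(||w_out_i|| / ||w_in_i||), which makes the
# incoming and outgoing norms equal at every hidden unit.
norm_in = np.linalg.norm(W1, axis=1)
norm_out = np.linalg.norm(W2, axis=0)
a_opt = np.sqrt(norm_out / norm_in)
W1b = a_opt[:, None] * W1
W2b = W2 / a_opt[None, :]

# The balanced network computes the same function with smaller norm.
assert np.allclose(forward(W1b, W2b, x), y_before)
assert np.allclose(np.linalg.norm(W1b, axis=1), np.linalg.norm(W2b, axis=0))
```

The per-unit objective is convex in $\log a_i$, so the stationary point above is the global minimizer for this one-layer case; the paper's contribution is the monotonically convergent scheme for general depths and general $\ell_{p,q}$-norms.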

**License:** Creative Commons Attribution 4.0 International (CC BY 4.0)

**Submission Length:** Long submission (more than 12 pages of main content)

**Previous TMLR Submission Url:** https://openreview.net/forum?id=KsWQFVhxLR&referrer=%5BTMLR%5D(%2Fgroup%3Fid%3DTMLR)

**Changes Since Last Submission:**

Sept 2023: The revised manuscript has been extensively edited for brevity. In particular, this version relegates most proofs to appendices, omits some theoretical results altogether, and collects the discussion of related work in one section. The revision also includes minor clarifications and fixes typos pointed out by the reviewers. With these changes, the main part of the manuscript is now 13 pages. Hopefully one page of leeway can be granted to include graphics (p. 2), pseudocode (p. 6), and background (section 3.1) that make the manuscript accessible to a broader audience.

June 2023: The manuscript was updated after previous relevant work (on ENorm) was brought to our attention. The current submission contains much more general results than the previous one. In particular, the procedure in section 2 has been generalized to minimize the entrywise $\ell_{p,q}$-norm of a network's weight matrix (not just the $\ell_p$-norm) with the same convergence guarantees. This generalization is of interest because the max-norm emerges in the limit $q\rightarrow\infty$. Also, the current submission gives a general procedure (in section 3) to derive a weight-balancing flow for any differentiable regularizer. These stronger theoretical results have displaced some of the experimental results in the initial submission.

**Assigned Action Editor:** ~Nadav_Cohen1

**Submission Number:** 1291
