Root Mean Square Layer Normalization (RMSNorm) simplifies layer normalization by dropping the mean-centering step and keeping only the re-scaling invariance. It normalizes the activations of each layer by dividing them by their root mean square (RMS).

The forward pass equation for RMSNorm is:

\(\bar{a}_{i}=\frac{a_{i}}{\text{RMS}(\mathbf{a})}g_{i}\)

Where:

\(a_{i}\) is the \(i\)-th element of the input vector \(\mathbf{a}\).
\(\text{RMS}(\mathbf{a})=\sqrt{\frac{1}{n}\sum _{i=1}^{n}a_{i}^{2}}\) is the root mean square of the input vector \(\mathbf{a}\), which has \(n\) elements.
\(g_{i}\) (also written \(\gamma _{i}\)) is a learnable scaling parameter (similar to gamma in LayerNorm).

Backward pass equations for RMSNorm

The backward pass computes the gradients of the loss \(L\) with respect to the inputs and the learnable scaling parameter (\(g\) or \(\gamma \)). The starting point is the upstream gradient, \(\partial L/\partial \bar{\mathbf{a}}\), which is the gradient of the loss with respect to the output of the RMSNorm layer. The chain rule then gives the gradients for the inputs and the learnable parameter.

Gradient with respect to the learnable scaling parameter (\(g\) or \(\gamma \))

For a single input vector, each \(g_{i}\) affects only \(\bar{a}_{i}\), so:

\(\frac{\partial L}{\partial g_{i}}=\frac{\partial L}{\partial \bar{a}_{i}}\cdot \frac{a_{i}}{\text{RMS}(\mathbf{a})}\)

When the same \(\mathbf{g}\) is shared across several input vectors \(\mathbf{a}^{(b)}\) (for example, the rows of a batch), the complete gradient sums these contributions over the vectors \(b\), not over the elements \(i\):

\(\frac{\partial L}{\partial g_{i}}=\sum _{b}\frac{\partial L}{\partial \bar{a}^{(b)}_{i}}\cdot \frac{a^{(b)}_{i}}{\text{RMS}(\mathbf{a}^{(b)})}\)

Gradient with respect to the input vector (\(a\))

The gradient with respect to the input vector \(\mathbf{a}\) is more involved because \(\mathbf{a}\) appears both in the numerator and in the RMS term in the denominator.
The derivation uses the chain rule and involves the following components:

Derivative of \(\bar{a}_{i}\) with respect to \(a_{i}\), holding the RMS fixed: \(\frac{g_{i}}{\text{RMS}(\mathbf{a})}\)
Derivative of \(\bar{a}_{i}\) with respect to \(\text{RMS}(\mathbf{a})\): \(-\frac{a_{i}g_{i}}{\text{RMS}(\mathbf{a})^{2}}\)
Derivative of \(\text{RMS}(\mathbf{a})\) with respect to \(a_{j}\) (where \(j\) can be equal to \(i\) or not): \(\frac{a_{j}}{n\cdot \text{RMS}(\mathbf{a})}\)

The final equation for the gradient with respect to \(a_{i}\) can be expressed as:

\(\frac{\partial L}{\partial a_{i}}=\sum _{k}\frac{\partial L}{\partial \bar{a}_{k}}\cdot \left(\frac{g_{k}}{\text{RMS}(\mathbf{a})}\cdot \delta _{ik}-\frac{a_{k}g_{k}}{\text{RMS}(\mathbf{a})^{3}}\cdot \frac{a_{i}}{n}\right)\)

Where:

\(\delta _{ik}\) is the Kronecker delta, equal to 1 if \(i=k\) and 0 otherwise.

This simplifies to:

\(\frac{\partial L}{\partial a_{i}}=\frac{g_{i}}{\text{RMS}(\mathbf{a})}\frac{\partial L}{\partial \bar{a}_{i}}-\frac{a_{i}}{n\cdot \text{RMS}(\mathbf{a})^{3}}\sum _{k}a_{k}g_{k}\frac{\partial L}{\partial \bar{a}_{k}}\)

Note: The backward pass for normalization layers like RMSNorm and LayerNorm can be computationally intensive, since it needs a reduction over the whole vector in addition to the element-wise terms. Deep learning frameworks often provide optimized implementations that leverage specific hardware instructions and fuse operations to improve efficiency.
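The equations above can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation: the function names (`rms`, `rmsnorm_forward`, `rmsnorm_backward`, `numerical_grad_a`) and the sample values are hypothetical, it handles a single input vector, and it assumes a nonzero input with no epsilon term in the denominator, matching the formulas as written here. It compares the analytic gradient with respect to \(\mathbf{a}\) against central finite differences of the scalar \(\sum _{k}\frac{\partial L}{\partial \bar{a}_{k}}\bar{a}_{k}\).

```python
import math


def rms(a):
    """Root mean square of a list of floats."""
    return math.sqrt(sum(x * x for x in a) / len(a))


def rmsnorm_forward(a, g):
    """Forward pass: a_bar_i = a_i / RMS(a) * g_i."""
    r = rms(a)
    return [x / r * gi for x, gi in zip(a, g)]


def rmsnorm_backward(a, g, grad_out):
    """Backward pass for one input vector.

    grad_out holds dL/da_bar. Returns (dL/da, dL/dg).
    """
    n = len(a)
    r = rms(a)
    # dL/dg_i = dL/da_bar_i * a_i / RMS(a)
    grad_g = [go * x / r for go, x in zip(grad_out, a)]
    # Reduction shared by every element: sum_k a_k * g_k * dL/da_bar_k
    s = sum(x * gi * go for x, gi, go in zip(a, g, grad_out))
    # dL/da_i = g_i / RMS * dL/da_bar_i - a_i / (n * RMS^3) * s
    grad_a = [gi / r * go - x / (n * r ** 3) * s
              for x, gi, go in zip(a, g, grad_out)]
    return grad_a, grad_g


def numerical_grad_a(a, g, grad_out, h=1e-6):
    """Central finite differences of sum_k grad_out_k * a_bar_k w.r.t. a."""
    grads = []
    for i in range(len(a)):
        plus = a[:i] + [a[i] + h] + a[i + 1:]
        minus = a[:i] + [a[i] - h] + a[i + 1:]
        lp = sum(go * y for go, y in zip(grad_out, rmsnorm_forward(plus, g)))
        lm = sum(go * y for go, y in zip(grad_out, rmsnorm_forward(minus, g)))
        grads.append((lp - lm) / (2 * h))
    return grads


# Arbitrary illustrative values (assumptions, not from the text above).
a = [1.0, -2.0, 3.0, 0.5]
g = [0.5, 1.5, -1.0, 2.0]
grad_out = [0.1, -0.2, 0.3, 0.4]

grad_a, grad_g = rmsnorm_backward(a, g, grad_out)
numeric = numerical_grad_a(a, g, grad_out)
max_err = max(abs(x - y) for x, y in zip(grad_a, numeric))
print("max abs difference between analytic and numeric dL/da:", max_err)
```

A useful sanity property follows from the formula: because the output does not change when \(\mathbf{a}\) is rescaled, the analytic gradient satisfies \(\sum _{i}a_{i}\frac{\partial L}{\partial a_{i}}=0\) up to floating-point error. Practical implementations usually add a small epsilon under the square root for numerical stability, which slightly changes both passes.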
