{
  "metadata": {
    "forum_id": "SyMDXnCcF7",
    "review_id": "HkgmPcrZpX",
    "rebuttal_id": "BJlt5Odvam",
    "title": "A Mean Field Theory of Batch Normalization",
    "reviewer": "AnonReviewer1",
    "rating": 7,
    "conference": "ICLR2019",
    "permalink": "https://openreview.net/forum?id=SyMDXnCcF7&noteId=BJlt5Odvam",
    "annotator": "anno2"
  },
  "review_sentences": [
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 0,
      "text": "This paper develops a mean field theory for batch normalization (BN) in fully-connected networks with randomly initialized weights.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 1,
      "text": "There are a number of interesting predictions made in this paper on the basis of this analysis.",
      "suffix": "",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 2,
      "text": "The main technical results of the paper are Theorems 5-8 which compute the statistics of the covariance of the activations and the gradients.",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_summary",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 3,
      "text": "Comments:",
      "suffix": "\n\n",
      "review_action": "arg_structuring",
      "fine_review_action": "arg-structuring_heading",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 4,
      "text": "1. The observation that gradients explode in spite of BN is quite counter-intuitive. Can you give an intuitive explanation of why this occurs?",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_soundness-correctness",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 5,
      "text": "2. In a similar vein, there a number of highly technical results in the paper and it would be great if the authors provide an intuitive explanation of their theorems.",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_clarification",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 6,
      "text": "3. Can the statistics of activations be controlled using activation functions or operations which break the symmetry?",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 7,
      "text": "For instance, are BSB1 fixed points good for training neural networks?",
      "suffix": "\n\n",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_explanation",
      "aspect": "asp_clarity",
      "polarity": "none"
    },
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 8,
      "text": "4. Mean field analysis, although it lends an insight into the statistics of the activations, needs to connected with empirical observations.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_experiment",
      "aspect": "asp_substance",
      "polarity": "pol_negative"
    },
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 9,
      "text": "For instance, when the authors observe that the structure of the fixed point is such that activations are of identical norm equally spread apart in terms of angle, this is quite far from practice.",
      "suffix": "",
      "review_action": "arg_fact",
      "fine_review_action": "none",
      "aspect": "none",
      "polarity": "none"
    },
    {
      "review_id": "HkgmPcrZpX",
      "sentence_index": 10,
      "text": "It would be good to mention this in the introduction or the conclusions.",
      "suffix": "",
      "review_action": "arg_request",
      "fine_review_action": "arg-request_edit",
      "aspect": "asp_soundness-correctness",
      "polarity": "none"
    }
  ],
  "rebuttal_sentences": [
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 0,
      "text": "Thank you for your careful review and useful comments!",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_global",
        null
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 1,
      "text": "Overall, in response to your review and that of referee 3 we will include a more intuitive discussion of our results in the next revision of our text.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_by-cr",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {
        "manuscript_change": true
      }
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 2,
      "text": "To reply to your other specific comments,",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_structuring",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 3,
      "text": "1) The intuition for batchnorm can be put in a more general setting.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 4,
      "text": "If a function f: X -> Y tends to spread out small clusters in the input space almost evenly in the output space, then one can expect that its gradients will be large typically.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 5,
      "text": "In our case, a batchnorm network can be understood as a function that sends a batch of inputs to a batch of outputs.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 6,
      "text": "In the appendix, we showed that the correlation between two different batches tend to a constant value independent of the input batches.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 7,
      "text": "No matter how close two input batches are, the output batches will have the same \u201cdistance\u201d from each other -- small movements in the input space leads to large movements in the output space.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 8,
      "text": "Thus we can expect the gradients to be large as well.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 9,
      "text": "We have added a new figure to the Appendix to further support this intuition.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 10,
      "text": "In it, we pass through a linear batchnorm network 2 minibatches.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 11,
      "text": "Both minibatches contain points on the same circle and 1 point off the circle that is unique to each minibatch.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 12,
      "text": "While the circle in each minibatch will remain an ellipse as they are propagated through the network, the angle between the planes spanned by them increasingly becomes chaotic with depth.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          4
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 13,
      "text": "3) As observed in [1] and [2], depthwise convergence to covariance fixed points is bad for training, and the best networks are either moderately deep or initialized such that the depthwise convergence rate to the fixed point is as slow as possible.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          6,
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 14,
      "text": "We observe that deep networks whose activation statistics resemble a non-BSB1 fixed point typically feature worse gradient explosion than BSB1 networks.",
      "suffix": "",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          6,
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 15,
      "text": "This seems to be because the nonlinearities that induce these fixed points increase rapidly (for example, polynomials with high degrees), so that the corresponding derivatives are also large, causing gradient explosion.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          6,
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 16,
      "text": "(The reason that rapidly increasing nonlinearities don\u2019t converge to BSB1 fixed points is that, after a spontaneous symmetry-breaking, begins a \u201cwinner-take-all\u201d covariance dynamics, in which the activations of a few examples in the batch suddenly dominates those of the others in the batch, and this dominance persists across each layer.)",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_answer",
      "alignment": [
        "context_sentences",
        [
          6,
          7
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 17,
      "text": "4) We were a bit confused by what was meant by \u201cpractice\u201d here.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_reject-criticism",
      "alignment": [
        "context_sentences",
        [
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 18,
      "text": "We have thoroughly verified that for realistic input distributions (MNIST and CIFAR10) and common initialization strategies (weights that are randomly distributed) our theory makes accurate prediction.",
      "suffix": "",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_refute-question",
      "alignment": [
        "context_sentences",
        [
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 19,
      "text": "Moreover, we have shown that these predictions can be connected to practice in the sense that they predict whether or not the network can be trained.",
      "suffix": "\n\n",
      "rebuttal_stance": "dispute",
      "rebuttal_action": "rebuttal_refute-question",
      "alignment": [
        "context_sentences",
        [
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 20,
      "text": "Having said this, if by practice you meant that the neural network is accurately described by our theory during training then we do not expect this to be true. We are happy to emphasize this in the camera ready.",
      "suffix": "\n\n",
      "rebuttal_stance": "concur",
      "rebuttal_action": "rebuttal_by-cr",
      "alignment": [
        "context_sentences",
        [
          8,
          9,
          10
        ]
      ],
      "details": {
        "manuscript_change": true
      }
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 21,
      "text": "If this did not properly address your question, please feel free to let us to know and we will improve this response!",
      "suffix": "\n\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_social",
      "alignment": [
        "context_sentences",
        [
          8,
          9,
          10
        ]
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 22,
      "text": "[1] S. S. Schoenholz, J. Gilmer, S. Ganguli, J. Sohl-Dickstein.",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 23,
      "text": "Deep Information Propagation (https://arxiv.org/abs/1611.01232)",
      "suffix": "\n",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    },
    {
      "review_id": "HkgmPcrZpX",
      "rebuttal_id": "BJlt5Odvam",
      "sentence_index": 24,
      "text": "[2] L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, J. Pennington. Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks (https://arxiv.org/abs/1806.05393)",
      "suffix": "",
      "rebuttal_stance": "nonarg",
      "rebuttal_action": "rebuttal_other",
      "alignment": [
        "context_in-rebuttal",
        null
      ],
      "details": {}
    }
  ]
}