[
  {
      "paper_id": "3418",
      "conference": "ICLR",
      "year": "2022",
      "paper_title": "Recursive Construction of Stable Assemblies of Recurrent Neural Networks",
      "reviews": [
          {
              "review_id": 0,
              "author": "human",
              "class": "human",
              "rating": "Borderline",
              "parsed_review": "The submission proposes new theorems showing the stability of a class of RNNs. Further by combining these RNNs into hierarchical and feedback superstructures, the submission achieve SOTA performance on a number of tasks.\n\n## Theoretical results\n\nThe theorems 1-5 constitute an evolutionary step in the understanding the conditions of stability. The authors also show a counterexample for the common belief that linear contraction leads a sufficient condition for non-linear stability. These results are then (partly) used to construct provably stable RNN combinations.\n\n## Experimental claims\n\nThe experimental section, while interesting, seems to lack a main takeaway. Also some very dubious choice of reporting in the table.\n\n* The authors claim that for the result shown on the Tab. 1, they run the Perm-MNIST trial 4 times and the results fall between 96.65 and 96.94. I was frankly shocked to find that they choose to only report 96.94 on Tab.1! Running the same experiment multiple times and only reporting the best case scenario is not good practice and leads to misunderstanding at best. I would recommend that the authors report mean + variance. (If one were to only report the best case, one could get much better performance than is achievable on average by running the experiment many many times.)\n\n* Some claims are backed up by only a single data point. For example, the claim that increased modularity benefits performance to some point is only backed up by the fact that 44x8 performs better than 22x16 in Sec. 3.2.1. To draw a significant conclusion and demonstrate a trend, the authors can perhaps look at 50x7 and 39x9. \n\n* In general, more data points and error bars would help convince the reader that the conclusions are real and not flukes. \n\n* The results of 'performance vs network size' and 'performance vs modularity' for Sec. 3.1 and 3.2 are opposite each other. 
In 3.1 increased size makes the network better monotonically but in 3.2 it is inverted U shape. Similarly in 3.2 modularity makes the performance better monotonically, but it is inverted U in 3.1. What is the conclusion to be drawn here? \n\n* In general, the experiment section is a little hard to read and can use a summary of the main conclusions and clearly demarcated paragraphs and sections of the experiment that demonstrate each point.\n\n\n\n### Clarity and other minor points\n\n* The paper is at times very clear and at times very confusing. For example, the discussion of stability and contraction are clear but then in the paragraph at the top of page 3, the authors use the symbol g for two different thing in the same paragraph. \n* Theorem 7 should be slightly reworded so that it is clear that the first inequality is a condition and not a statement (this is rather obvious in hindsight but for a new reader it is very confusing).\n* For denoting multiplication, I would suggest using $\\times$ instead of x (e.g. 22x16 etc.)\n* Why is there a section 3.2.1 instead of just 3.2?\n* I would suggest an overall re-read of the paper to maximize readability.\n\nThe paper constitutes an evolutionary step in understanding and designing stable RNNs. The theoretical results are novel and noteworthy. Unfortunately the experimental results lack a clear conclusion and at times do not follow best practices (i.e. reporting only the best run out of many).",
              "parsed_review_with_titles": "Summary of the paper\nThe submission proposes new theorems showing the stability of a class of RNNs. Further by combining these RNNs into hierarchical and feedback superstructures, the submission achieve SOTA performance on a number of tasks.\n\nMain review\n## Theoretical results\n\nThe theorems 1-5 constitute an evolutionary step in the understanding the conditions of stability. The authors also show a counterexample for the common belief that linear contraction leads a sufficient condition for non-linear stability. These results are then (partly) used to construct provably stable RNN combinations.\n\n## Experimental claims\n\nThe experimental section, while interesting, seems to lack a main takeaway. Also some very dubious choice of reporting in the table.\n\n* The authors claim that for the result shown on the Tab. 1, they run the Perm-MNIST trial 4 times and the results fall between 96.65 and 96.94. I was frankly shocked to find that they choose to only report 96.94 on Tab.1! Running the same experiment multiple times and only reporting the best case scenario is not good practice and leads to misunderstanding at best. I would recommend that the authors report mean + variance. (If one were to only report the best case, one could get much better performance than is achievable on average by running the experiment many many times.)\n\n* Some claims are backed up by only a single data point. For example, the claim that increased modularity benefits performance to some point is only backed up by the fact that 44x8 performs better than 22x16 in Sec. 3.2.1. To draw a significant conclusion and demonstrate a trend, the authors can perhaps look at 50x7 and 39x9. \n\n* In general, more data points and error bars would help convince the reader that the conclusions are real and not flukes. \n\n* The results of 'performance vs network size' and 'performance vs modularity' for Sec. 3.1 and 3.2 are opposite each other. 
In 3.1 increased size makes the network better monotonically but in 3.2 it is inverted U shape. Similarly in 3.2 modularity makes the performance better monotonically, but it is inverted U in 3.1. What is the conclusion to be drawn here? \n\n* In general, the experiment section is a little hard to read and can use a summary of the main conclusions and clearly demarcated paragraphs and sections of the experiment that demonstrate each point.\n\n\n\n### Clarity and other minor points\n\n* The paper is at times very clear and at times very confusing. For example, the discussion of stability and contraction are clear but then in the paragraph at the top of page 3, the authors use the symbol g for two different thing in the same paragraph. \n* Theorem 7 should be slightly reworded so that it is clear that the first inequality is a condition and not a statement (this is rather obvious in hindsight but for a new reader it is very confusing).\n* For denoting multiplication, I would suggest using $\\times$ instead of x (e.g. 22x16 etc.)\n* Why is there a section 3.2.1 instead of just 3.2?\n* I would suggest an overall re-read of the paper to maximize readability.\n\nSummary of the review\nThe paper constitutes an evolutionary step in understanding and designing stable RNNs. The theoretical results are novel and noteworthy. Unfortunately the experimental results lack a clear conclusion and at times do not follow best practices (i.e. reporting only the best run out of many).",
              "parsed_review_no_markdown": "The submission proposes new theorems showing the stability of a class of RNNs. Further by combining these RNNs into hierarchical and feedback superstructures, the submission achieve SOTA performance on a number of tasks.\n\nTheoretical results\n\nThe theorems 1-5 constitute an evolutionary step in the understanding the conditions of stability. The authors also show a counterexample for the common belief that linear contraction leads a sufficient condition for non-linear stability. These results are then (partly) used to construct provably stable RNN combinations.\n\nExperimental claims\n\nThe experimental section, while interesting, seems to lack a main takeaway. Also some very dubious choice of reporting in the table.\n\n The authors claim that for the result shown on the Tab. 1, they run the Perm-MNIST trial 4 times and the results fall between 96.65 and 96.94. I was frankly shocked to find that they choose to only report 96.94 on Tab.1! Running the same experiment multiple times and only reporting the best case scenario is not good practice and leads to misunderstanding at best. I would recommend that the authors report mean + variance. (If one were to only report the best case, one could get much better performance than is achievable on average by running the experiment many many times.)\n\n Some claims are backed up by only a single data point. For example, the claim that increased modularity benefits performance to some point is only backed up by the fact that 44x8 performs better than 22x16 in Sec. 3.2.1. To draw a significant conclusion and demonstrate a trend, the authors can perhaps look at 50x7 and 39x9. \n\n In general, more data points and error bars would help convince the reader that the conclusions are real and not flukes. \n\n The results of 'performance vs network size' and 'performance vs modularity' for Sec. 3.1 and 3.2 are opposite each other. 
In 3.1 increased size makes the network better monotonically but in 3.2 it is inverted U shape. Similarly in 3.2 modularity makes the performance better monotonically, but it is inverted U in 3.1. What is the conclusion to be drawn here? \n\n In general, the experiment section is a little hard to read and can use a summary of the main conclusions and clearly demarcated paragraphs and sections of the experiment that demonstrate each point.\n\n\n\nClarity and other minor points\n\n The paper is at times very clear and at times very confusing. For example, the discussion of stability and contraction are clear but then in the paragraph at the top of page 3, the authors use the symbol g for two different thing in the same paragraph. \n Theorem 7 should be slightly reworded so that it is clear that the first inequality is a condition and not a statement (this is rather obvious in hindsight but for a new reader it is very confusing).\n For denoting multiplication, I would suggest using $\\times$ instead of x (e.g. 22x16 etc.)\n Why is there a section 3.2.1 instead of just 3.2?\n I would suggest an overall re-read of the paper to maximize readability.\n\nThe paper constitutes an evolutionary step in understanding and designing stable RNNs. The theoretical results are novel and noteworthy. Unfortunately the experimental results lack a clear conclusion and at times do not follow best practices (i.e. reporting only the best run out of many).",
              "parsed_review_with_titles_no_markdown": "Summary of the paper\nThe submission proposes new theorems showing the stability of a class of RNNs. Further by combining these RNNs into hierarchical and feedback superstructures, the submission achieve SOTA performance on a number of tasks.\n\nMain review\nTheoretical results\n\nThe theorems 1-5 constitute an evolutionary step in the understanding the conditions of stability. The authors also show a counterexample for the common belief that linear contraction leads a sufficient condition for non-linear stability. These results are then (partly) used to construct provably stable RNN combinations.\n\nExperimental claims\n\nThe experimental section, while interesting, seems to lack a main takeaway. Also some very dubious choice of reporting in the table.\n\n The authors claim that for the result shown on the Tab. 1, they run the Perm-MNIST trial 4 times and the results fall between 96.65 and 96.94. I was frankly shocked to find that they choose to only report 96.94 on Tab.1! Running the same experiment multiple times and only reporting the best case scenario is not good practice and leads to misunderstanding at best. I would recommend that the authors report mean + variance. (If one were to only report the best case, one could get much better performance than is achievable on average by running the experiment many many times.)\n\n Some claims are backed up by only a single data point. For example, the claim that increased modularity benefits performance to some point is only backed up by the fact that 44x8 performs better than 22x16 in Sec. 3.2.1. To draw a significant conclusion and demonstrate a trend, the authors can perhaps look at 50x7 and 39x9. \n\n In general, more data points and error bars would help convince the reader that the conclusions are real and not flukes. \n\n The results of 'performance vs network size' and 'performance vs modularity' for Sec. 3.1 and 3.2 are opposite each other. 
In 3.1 increased size makes the network better monotonically but in 3.2 it is inverted U shape. Similarly in 3.2 modularity makes the performance better monotonically, but it is inverted U in 3.1. What is the conclusion to be drawn here? \n\n In general, the experiment section is a little hard to read and can use a summary of the main conclusions and clearly demarcated paragraphs and sections of the experiment that demonstrate each point.\n\n\n\nClarity and other minor points\n\n The paper is at times very clear and at times very confusing. For example, the discussion of stability and contraction are clear but then in the paragraph at the top of page 3, the authors use the symbol g for two different thing in the same paragraph. \n Theorem 7 should be slightly reworded so that it is clear that the first inequality is a condition and not a statement (this is rather obvious in hindsight but for a new reader it is very confusing).\n For denoting multiplication, I would suggest using $\\times$ instead of x (e.g. 22x16 etc.)\n Why is there a section 3.2.1 instead of just 3.2?\n I would suggest an overall re-read of the paper to maximize readability.\n\nSummary of the review\nThe paper constitutes an evolutionary step in understanding and designing stable RNNs. The theoretical results are novel and noteworthy. Unfortunately the experimental results lack a clear conclusion and at times do not follow best practices (i.e. reporting only the best run out of many).",
              "clean_review": "The submission proposes new theorems showing the stability of a class of RNNs. Further by combining these RNNs into hierarchical and feedback superstructures, the submission achieve SOTA performance on a number of tasks.\n\nTheoretical results\n\nThe theorems 1-5 constitute an evolutionary step in the understanding the conditions of stability. The authors also show a counterexample for the common belief that linear contraction leads a sufficient condition for non-linear stability. These results are then (partly) used to construct provably stable RNN combinations.\n\nExperimental claims\n\nThe experimental section, while interesting, seems to lack a main takeaway. Also some very dubious choice of reporting in the table.\n\n The authors claim that for the result shown on the Tab. 1, they run the Perm-MNIST trial 4 times and the results fall between 96.65 and 96.94. I was frankly shocked to find that they choose to only report 96.94 on Tab.1! Running the same experiment multiple times and only reporting the best case scenario is not good practice and leads to misunderstanding at best. I would recommend that the authors report mean + variance. (If one were to only report the best case, one could get much better performance than is achievable on average by running the experiment many many times.)\n\n Some claims are backed up by only a single data point. For example, the claim that increased modularity benefits performance to some point is only backed up by the fact that 44x8 performs better than 22x16 in Sec. 3.2.1. To draw a significant conclusion and demonstrate a trend, the authors can perhaps look at 50x7 and 39x9. \n\n In general, more data points and error bars would help convince the reader that the conclusions are real and not flukes. \n\n The results of 'performance vs network size' and 'performance vs modularity' for Sec. 3.1 and 3.2 are opposite each other. 
In 3.1 increased size makes the network better monotonically but in 3.2 it is inverted U shape. Similarly in 3.2 modularity makes the performance better monotonically, but it is inverted U in 3.1. What is the conclusion to be drawn here? \n\n In general, the experiment section is a little hard to read and can use a summary of the main conclusions and clearly demarcated paragraphs and sections of the experiment that demonstrate each point.\n\n\n\nClarity and other minor points\n\n The paper is at times very clear and at times very confusing. For example, the discussion of stability and contraction are clear but then in the paragraph at the top of page 3, the authors use the symbol g for two different thing in the same paragraph. \n Theorem 7 should be slightly reworded so that it is clear that the first inequality is a condition and not a statement (this is rather obvious in hindsight but for a new reader it is very confusing).\n For denoting multiplication, I would suggest using $\\times$ instead of x (e.g. 22x16 etc.)\n Why is there a section 3.2.1 instead of just 3.2?\n I would suggest an overall re-read of the paper to maximize readability.\n\nThe paper constitutes an evolutionary step in understanding and designing stable RNNs. The theoretical results are novel and noteworthy. Unfortunately the experimental results lack a clear conclusion and at times do not follow best practices (i.e. reporting only the best run out of many).",
              "claim_extraction": {
                  "factualRestatement": [
                      "The submission proposes new theorems showing the stability of a class of RNNs.",
                      "By combining these RNNs into hierarchical and feedback superstructures, the submission achieves SOTA performance on a number of tasks.",
                      "Theorems 1-5 constitute an evolutionary step in understanding the conditions of stability.",
                      "The authors show a counterexample for the common belief that linear contraction leads to a sufficient condition for non-linear stability.",
                      "These theoretical results are then (partly) used to construct provably stable RNN combinations.",
                      "The authors claim that for the result shown in Tab. 1, they ran the Perm-MNIST trial 4 times and the results fell between 96.65 and 96.94.",
                      "The authors reported only 96.94 on Tab. 1 for the Perm-MNIST trial.",
                      "The claim that increased modularity benefits performance to some point is backed up by only a single data point: 44x8 performs better than 22x16 in Sec. 3.2.1.",
                      "The results of 'performance vs network size' and 'performance vs modularity' for Sec. 3.1 and 3.2 are opposite each other.",
                      "In 3.1, increased size makes the network better monotonically, but in 3.2 it is an inverted U shape.",
                      "In 3.2, modularity makes the performance better monotonically, but it is an inverted U in 3.1.",
                      "The authors use the symbol 'g' for two different things in the same paragraph at the top of page 3."
                  ],
                  "evaluation": [
                      "The theoretical results are novel and noteworthy.",
                      "Theorems 1-5 constitute an evolutionary step in understanding the conditions of stability.",
                      "The experimental section, while interesting, seems to lack a main takeaway.",
                      "There are some very dubious choices of reporting in the table.",
                      "Running the same experiment multiple times and only reporting the best case scenario is not good practice and leads to misunderstanding at best.",
                      "The experiment section is a little hard to read.",
                      "The paper is at times very clear and at times very confusing.",
                      "The discussion of stability and contraction is clear.",
                      "The experimental results lack a clear conclusion and at times do not follow best practices (i.e. reporting only the best run out of many)."
                  ],
                  "constructiveInput": [
                      "I would recommend that the authors report mean + variance for experimental results.",
                      "To draw a significant conclusion and demonstrate a trend for modularity, the authors can perhaps look at 50x7 and 39x9, in addition to 44x8 and 22x16.",
                      "More data points and error bars would help convince the reader that the conclusions are real and not flukes.",
                      "The experiment section can use a summary of the main conclusions and clearly demarcated paragraphs and sections of the experiment that demonstrate each point.",
                      "Theorem 7 should be slightly reworded so that it is clear that the first inequality is a condition and not a statement.",
                      "For denoting multiplication, I would suggest using $\\times$ instead of x (e.g. 22x16 etc.).",
                      "I would suggest an overall re-read of the paper to maximize readability."
                  ],
                  "clarificationDialogue": [
                      "What is the conclusion to be drawn regarding the opposite results of 'performance vs network size' and 'performance vs modularity' for Sec. 3.1 and 3.2?",
                      "Why is there a section 3.2.1 instead of just 3.2?"
                  ],
                  "metaCommentary": [
                      "The paper constitutes an evolutionary step in understanding and designing stable RNNs."
                  ]
              }
          },
          {
              "review_id": 1,
              "author": "human",
              "class": "human",
              "rating": "Accept",
              "parsed_review": "This paper is primarily a theoretical contribution to the construction of assemblies of recurrent neural networks. We know that combinations of learned modular components can be powerful and far more tractable than learning bespoke models from scratch, particularly in applied domains (e.g. AlphaGo). Yet so far, we have no theoretical guarantees that these combinations will actually remain stable. This paper develops the theory behind provably-stable combinations of RNNs using weight constraints and feedback mechanisms. Then, using fixed RNNs generated according to these constraints (leaving the connections between them as antisymmetric learnable parameters), the authors show that their sparse combination network is able to achieve SOTA performance on sequential image classification benchmarks with far fewer learned parameters and the previous stability guarantee.\n\nStrengths:\n- I thought that the empirical results were rather convincing for what is primarily a theoretical contribution. The authors first thoroughly investigate various permutations of their modular sparse combination network framework (# RNNs vs size of each using absolute value weight constraints) and do another investigation of their alternative SVD weight constraint network (which doesn’t perform as well or train as quickly). Most importantly, they then show that they can best SOTA algorithms on some of the common (albeit easier) benchmarks in the field, even under (and perhaps because of) these constraints. \n- The theoretical contribution is quite powerful. There has been a lot of recent work in networks with many individual recurrent components, such as the aforementioned AlphaGo or the more general recurrent independent mechanisms (RIMs) framework, but for the most part, they rely on intuitive explanations and empirical results over theoretical guarantees. 
Clearly specialized RNN modules can be quite powerful, but RNNs are notoriously unstable and difficult to learn, and learning such models end-to-end is tricky. If we can apply these constraint conditions and still achieve good performance (which seems like it could be realistic, particularly in the absolute value constraint case), then we can develop sets of useful modules and mix-and-match to the task in question. This paper doesn’t answer all of the intermediate questions, but the stability analysis is a key step.\n- The proofs in the appendix are well-done and easy-to-follow, given a sufficient math background.\n\nWeaknesses:\n- This paper is very dense and difficult to follow. It took me a few reads to really understand the value of network stability and how it’s achieved in this case. The appendix is a mandatory read as are some of the references. None of the use cases are particularly intuitive. I think I would have liked to see a graphical representation of the sparse combo network (rather than the weight matrices in Figure 2), some pseudocode for the algorithms (tossed in the Appendix), and maybe an example case of an unstable network assembly diverging. I also feel like my familiarity with AlphaGo and other methods gave me more of an insight into how this would help in practice than the actual paper did.\n- As much as I liked the empirical results that were provided, they’re all of a kind: sequential image prediction. I would have liked to see at least one application in a different domain (NLP, RL, continuous control, etc).\n\nOverall, I would accept this paper. Although it was difficult to follow and required a lot of consultation with the literature, I do ultimately think that this is a direction that DL algorithms are going in and that the theoretical and practical results from this work could be quite powerful. To make the paper better, I would like to see some results in a different domain and more effort towards improving the readability. 
Too often, valuable theoretical works go underutilized because they’re difficult to understand or don’t seem relevant to the empiricists and engineers who could build on them.",
              "parsed_review_with_titles": "Summary of the paper\nThis paper is primarily a theoretical contribution to the construction of assemblies of recurrent neural networks. We know that combinations of learned modular components can be powerful and far more tractable than learning bespoke models from scratch, particularly in applied domains (e.g. AlphaGo). Yet so far, we have no theoretical guarantees that these combinations will actually remain stable. This paper develops the theory behind provably-stable combinations of RNNs using weight constraints and feedback mechanisms. Then, using fixed RNNs generated according to these constraints (leaving the connections between them as antisymmetric learnable parameters), the authors show that their sparse combination network is able to achieve SOTA performance on sequential image classification benchmarks with far fewer learned parameters and the previous stability guarantee.\n\nMain review\nStrengths:\n- I thought that the empirical results were rather convincing for what is primarily a theoretical contribution. The authors first thoroughly investigate various permutations of their modular sparse combination network framework (# RNNs vs size of each using absolute value weight constraints) and do another investigation of their alternative SVD weight constraint network (which doesn’t perform as well or train as quickly). Most importantly, they then show that they can best SOTA algorithms on some of the common (albeit easier) benchmarks in the field, even under (and perhaps because of) these constraints. \n- The theoretical contribution is quite powerful. There has been a lot of recent work in networks with many individual recurrent components, such as the aforementioned AlphaGo or the more general recurrent independent mechanisms (RIMs) framework, but for the most part, they rely on intuitive explanations and empirical results over theoretical guarantees. 
Clearly specialized RNN modules can be quite powerful, but RNNs are notoriously unstable and difficult to learn, and learning such models end-to-end is tricky. If we can apply these constraint conditions and still achieve good performance (which seems like it could be realistic, particularly in the absolute value constraint case), then we can develop sets of useful modules and mix-and-match to the task in question. This paper doesn’t answer all of the intermediate questions, but the stability analysis is a key step.\n- The proofs in the appendix are well-done and easy-to-follow, given a sufficient math background.\n\nWeaknesses:\n- This paper is very dense and difficult to follow. It took me a few reads to really understand the value of network stability and how it’s achieved in this case. The appendix is a mandatory read as are some of the references. None of the use cases are particularly intuitive. I think I would have liked to see a graphical representation of the sparse combo network (rather than the weight matrices in Figure 2), some pseudocode for the algorithms (tossed in the Appendix), and maybe an example case of an unstable network assembly diverging. I also feel like my familiarity with AlphaGo and other methods gave me more of an insight into how this would help in practice than the actual paper did.\n- As much as I liked the empirical results that were provided, they’re all of a kind: sequential image prediction. I would have liked to see at least one application in a different domain (NLP, RL, continuous control, etc).\n\nSummary of the review\nOverall, I would accept this paper. Although it was difficult to follow and required a lot of consultation with the literature, I do ultimately think that this is a direction that DL algorithms are going in and that the theoretical and practical results from this work could be quite powerful. 
To make the paper better, I would like to see some results in a different domain and more effort towards improving the readability. Too often, valuable theoretical works go underutilized because they’re difficult to understand or don’t seem relevant to the empiricists and engineers who could build on them.",
              "parsed_review_no_markdown": "This paper is primarily a theoretical contribution to the construction of assemblies of recurrent neural networks. We know that combinations of learned modular components can be powerful and far more tractable than learning bespoke models from scratch, particularly in applied domains (e.g. AlphaGo). Yet so far, we have no theoretical guarantees that these combinations will actually remain stable. This paper develops the theory behind provably-stable combinations of RNNs using weight constraints and feedback mechanisms. Then, using fixed RNNs generated according to these constraints (leaving the connections between them as antisymmetric learnable parameters), the authors show that their sparse combination network is able to achieve SOTA performance on sequential image classification benchmarks with far fewer learned parameters and the previous stability guarantee.\n\nStrengths:\n- I thought that the empirical results were rather convincing for what is primarily a theoretical contribution. The authors first thoroughly investigate various permutations of their modular sparse combination network framework (# RNNs vs size of each using absolute value weight constraints) and do another investigation of their alternative SVD weight constraint network (which doesn’t perform as well or train as quickly). Most importantly, they then show that they can best SOTA algorithms on some of the common (albeit easier) benchmarks in the field, even under (and perhaps because of) these constraints. \n- The theoretical contribution is quite powerful. There has been a lot of recent work in networks with many individual recurrent components, such as the aforementioned AlphaGo or the more general recurrent independent mechanisms (RIMs) framework, but for the most part, they rely on intuitive explanations and empirical results over theoretical guarantees. 
Clearly specialized RNN modules can be quite powerful, but RNNs are notoriously unstable and difficult to learn, and learning such models end-to-end is tricky. If we can apply these constraint conditions and still achieve good performance (which seems like it could be realistic, particularly in the absolute value constraint case), then we can develop sets of useful modules and mix-and-match to the task in question. This paper doesn’t answer all of the intermediate questions, but the stability analysis is a key step.\n- The proofs in the appendix are well-done and easy-to-follow, given a sufficient math background.\n\nWeaknesses:\n- This paper is very dense and difficult to follow. It took me a few reads to really understand the value of network stability and how it’s achieved in this case. The appendix is a mandatory read as are some of the references. None of the use cases are particularly intuitive. I think I would have liked to see a graphical representation of the sparse combo network (rather than the weight matrices in Figure 2), some pseudocode for the algorithms (tossed in the Appendix), and maybe an example case of an unstable network assembly diverging. I also feel like my familiarity with AlphaGo and other methods gave me more of an insight into how this would help in practice than the actual paper did.\n- As much as I liked the empirical results that were provided, they’re all of a kind: sequential image prediction. I would have liked to see at least one application in a different domain (NLP, RL, continuous control, etc).\n\nOverall, I would accept this paper. Although it was difficult to follow and required a lot of consultation with the literature, I do ultimately think that this is a direction that DL algorithms are going in and that the theoretical and practical results from this work could be quite powerful. To make the paper better, I would like to see some results in a different domain and more effort towards improving the readability. 
Too often, valuable theoretical works go underutilized because they’re difficult to understand or don’t seem relevant to the empiricists and engineers who could build on them.",
              "parsed_review_with_titles_no_markdown": "Summary of the paper\nThis paper is primarily a theoretical contribution to the construction of assemblies of recurrent neural networks. We know that combinations of learned modular components can be powerful and far more tractable than learning bespoke models from scratch, particularly in applied domains (e.g. AlphaGo). Yet so far, we have no theoretical guarantees that these combinations will actually remain stable. This paper develops the theory behind provably-stable combinations of RNNs using weight constraints and feedback mechanisms. Then, using fixed RNNs generated according to these constraints (leaving the connections between them as antisymmetric learnable parameters), the authors show that their sparse combination network is able to achieve SOTA performance on sequential image classification benchmarks with far fewer learned parameters and the previous stability guarantee.\n\nMain review\nStrengths:\n- I thought that the empirical results were rather convincing for what is primarily a theoretical contribution. The authors first thoroughly investigate various permutations of their modular sparse combination network framework (# RNNs vs size of each using absolute value weight constraints) and do another investigation of their alternative SVD weight constraint network (which doesn’t perform as well or train as quickly). Most importantly, they then show that they can best SOTA algorithms on some of the common (albeit easier) benchmarks in the field, even under (and perhaps because of) these constraints. \n- The theoretical contribution is quite powerful. There has been a lot of recent work in networks with many individual recurrent components, such as the aforementioned AlphaGo or the more general recurrent independent mechanisms (RIMs) framework, but for the most part, they rely on intuitive explanations and empirical results over theoretical guarantees. 
Clearly specialized RNN modules can be quite powerful, but RNNs are notoriously unstable and difficult to learn, and learning such models end-to-end is tricky. If we can apply these constraint conditions and still achieve good performance (which seems like it could be realistic, particularly in the absolute value constraint case), then we can develop sets of useful modules and mix-and-match to the task in question. This paper doesn’t answer all of the intermediate questions, but the stability analysis is a key step.\n- The proofs in the appendix are well-done and easy-to-follow, given a sufficient math background.\n\nWeaknesses:\n- This paper is very dense and difficult to follow. It took me a few reads to really understand the value of network stability and how it’s achieved in this case. The appendix is a mandatory read as are some of the references. None of the use cases are particularly intuitive. I think I would have liked to see a graphical representation of the sparse combo network (rather than the weight matrices in Figure 2), some pseudocode for the algorithms (tossed in the Appendix), and maybe an example case of an unstable network assembly diverging. I also feel like my familiarity with AlphaGo and other methods gave me more of an insight into how this would help in practice than the actual paper did.\n- As much as I liked the empirical results that were provided, they’re all of a kind: sequential image prediction. I would have liked to see at least one application in a different domain (NLP, RL, continuous control, etc).\n\nSummary of the review\nOverall, I would accept this paper. Although it was difficult to follow and required a lot of consultation with the literature, I do ultimately think that this is a direction that DL algorithms are going in and that the theoretical and practical results from this work could be quite powerful. 
To make the paper better, I would like to see some results in a different domain and more effort towards improving the readability. Too often, valuable theoretical works go underutilized because they’re difficult to understand or don’t seem relevant to the empiricists and engineers who could build on them.",
              "clean_review": "This paper is primarily a theoretical contribution to the construction of assemblies of recurrent neural networks. We know that combinations of learned modular components can be powerful and far more tractable than learning bespoke models from scratch, particularly in applied domains (e.g. AlphaGo). Yet so far, we have no theoretical guarantees that these combinations will actually remain stable. This paper develops the theory behind provably-stable combinations of RNNs using weight constraints and feedback mechanisms. Then, using fixed RNNs generated according to these constraints (leaving the connections between them as antisymmetric learnable parameters), the authors show that their sparse combination network is able to achieve SOTA performance on sequential image classification benchmarks with far fewer learned parameters and the previous stability guarantee.\n\nStrengths:\n- I thought that the empirical results were rather convincing for what is primarily a theoretical contribution. The authors first thoroughly investigate various permutations of their modular sparse combination network framework (# RNNs vs size of each using absolute value weight constraints) and do another investigation of their alternative SVD weight constraint network (which doesn’t perform as well or train as quickly). Most importantly, they then show that they can best SOTA algorithms on some of the common (albeit easier) benchmarks in the field, even under (and perhaps because of) these constraints. \n- The theoretical contribution is quite powerful. There has been a lot of recent work in networks with many individual recurrent components, such as the aforementioned AlphaGo or the more general recurrent independent mechanisms (RIMs) framework, but for the most part, they rely on intuitive explanations and empirical results over theoretical guarantees. 
Clearly specialized RNN modules can be quite powerful, but RNNs are notoriously unstable and difficult to learn, and learning such models end-to-end is tricky. If we can apply these constraint conditions and still achieve good performance (which seems like it could be realistic, particularly in the absolute value constraint case), then we can develop sets of useful modules and mix-and-match to the task in question. This paper doesn’t answer all of the intermediate questions, but the stability analysis is a key step.\n- The proofs in the appendix are well-done and easy-to-follow, given a sufficient math background.\n\nWeaknesses:\n- This paper is very dense and difficult to follow. It took me a few reads to really understand the value of network stability and how it’s achieved in this case. The appendix is a mandatory read as are some of the references. None of the use cases are particularly intuitive. I think I would have liked to see a graphical representation of the sparse combo network (rather than the weight matrices in Figure 2), some pseudocode for the algorithms (tossed in the Appendix), and maybe an example case of an unstable network assembly diverging. I also feel like my familiarity with AlphaGo and other methods gave me more of an insight into how this would help in practice than the actual paper did.\n- As much as I liked the empirical results that were provided, they’re all of a kind: sequential image prediction. I would have liked to see at least one application in a different domain (NLP, RL, continuous control, etc).\n\nOverall, I would accept this paper. Although it was difficult to follow and required a lot of consultation with the literature, I do ultimately think that this is a direction that DL algorithms are going in and that the theoretical and practical results from this work could be quite powerful. To make the paper better, I would like to see some results in a different domain and more effort towards improving the readability. 
Too often, valuable theoretical works go underutilized because they’re difficult to understand or don’t seem relevant to the empiricists and engineers who could build on them.",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper is primarily a theoretical contribution to the construction of assemblies of recurrent neural networks.",
                      "The paper develops the theory behind provably-stable combinations of RNNs using weight constraints and feedback mechanisms.",
                      "Using fixed RNNs generated according to these constraints, the authors show that their sparse combination network is able to achieve SOTA performance on sequential image classification benchmarks.",
                      "The empirical results investigate various permutations of their modular sparse combination network framework and an alternative SVD weight constraint network.",
                      "The paper shows that their method can best SOTA algorithms on common benchmarks with fewer learned parameters and a stability guarantee.",
                      "The theoretical contribution addresses the lack of theoretical guarantees for combinations of learned modular components, which are often used in frameworks like recurrent independent mechanisms (RIMs).",
                      "The proofs in the appendix are well-done and easy-to-follow, given a sufficient math background."
                  ],
                  "evaluation": [
                      "The empirical results were rather convincing for what is primarily a theoretical contribution.",
                      "The authors thoroughly investigate various permutations of their modular sparse combination network framework.",
                      "The SVD weight constraint network doesn’t perform as well or train as quickly.",
                      "The theoretical contribution is quite powerful.",
                      "Clearly specialized RNN modules can be quite powerful, but RNNs are notoriously unstable and difficult to learn.",
                      "The stability analysis is a key step.",
                      "This paper is very dense and difficult to follow.",
                      "It took me a few reads to really understand the value of network stability and how it’s achieved in this case.",
                      "The appendix is a mandatory read as are some of the references.",
                      "None of the use cases are particularly intuitive.",
                      "The empirical results provided are all of a kind: sequential image prediction."
                  ],
                  "constructiveInput": [
                      "I think I would have liked to see a graphical representation of the sparse combo network (rather than the weight matrices in Figure 2).",
                      "I would have liked to see some pseudocode for the algorithms (tossed in the Appendix).",
                      "I would have liked to see maybe an example case of an unstable network assembly diverging.",
                      "I would have liked to see at least one application in a different domain (NLP, RL, continuous control, etc).",
                      "To make the paper better, I would like to see some results in a different domain and more effort towards improving the readability."
                  ],
                  "clarificationDialogue": [
                      "This paper doesn’t answer all of the intermediate questions."
                  ],
                  "metaCommentary": [
                      "Overall, I would accept this paper.",
                      "Although it was difficult to follow and required a lot of consultation with the literature, I do ultimately think that this is a direction that DL algorithms are going in.",
                      "The theoretical and practical results from this work could be quite powerful.",
                      "Too often, valuable theoretical works go underutilized because they’re difficult to understand or don’t seem relevant to the empiricists and engineers who could build on them."
                  ]
              }
          },
          {
              "review_id": 2,
              "author": "human",
              "class": "human",
              "rating": "Borderline",
              "parsed_review": "The authors studied contraction properties of continuous-time recurrent neural networks. They further showed that a network of provably stable RNNs (net of nets) can be trained to reach competitive performance on several benchmarks, including sequential CIFAR10, even when only connections between modules are trained.\n\nStrength\nHow to assemble a network of RNNs is an interesting problem. The theorems on contraction properties are helpful to people thinking about provably stable RNNs.\n\n\nWeaknesses\n(1)\tSection 3.2.1 is pretty dry to read. Reporting results from many individual AxB networks seem unnecessarily.\n\n(2)\tThe authors showed performance comparison with other types of networks in Table 1. I think it would be quite informative to show performance of networks where everything is kept the same, except that the RNNs are no longer provably stable. \n\n(3)\tIt would also be good to know what happens if all connection weights are trained, not just the connections between modules. Does the performance actually decrease despite having more parameters?\n\n(4)\tThe provably stable part is kind of separated from the training modular network part. How closely are they related? Is having stable RNN modules particularly important for sparsely connected modular networks?\n\nOverall, this is an interesting paper that takes a less common approach to RNNs: provable stability and net-of-nets. The results are at places more difficult to read, but overall it is clear.\n\nI want to add that I cannot evaluate whether the mathematical derivations are correct.",
              "parsed_review_with_titles": "Summary of the paper\nThe authors studied contraction properties of continuous-time recurrent neural networks. They further showed that a network of provably stable RNNs (net of nets) can be trained to reach competitive performance on several benchmarks, including sequential CIFAR10, even when only connections between modules are trained.\n\nMain review\nStrength\nHow to assemble a network of RNNs is an interesting problem. The theorems on contraction properties are helpful to people thinking about provably stable RNNs.\n\n\nWeaknesses\n(1)\tSection 3.2.1 is pretty dry to read. Reporting results from many individual AxB networks seem unnecessarily.\n\n(2)\tThe authors showed performance comparison with other types of networks in Table 1. I think it would be quite informative to show performance of networks where everything is kept the same, except that the RNNs are no longer provably stable. \n\n(3)\tIt would also be good to know what happens if all connection weights are trained, not just the connections between modules. Does the performance actually decrease despite having more parameters?\n\n(4)\tThe provably stable part is kind of separated from the training modular network part. How closely are they related? Is having stable RNN modules particularly important for sparsely connected modular networks?\n\nSummary of the review\nOverall, this is an interesting paper that takes a less common approach to RNNs: provable stability and net-of-nets. The results are at places more difficult to read, but overall it is clear.\n\nI want to add that I cannot evaluate whether the mathematical derivations are correct.",
              "parsed_review_no_markdown": "The authors studied contraction properties of continuous-time recurrent neural networks. They further showed that a network of provably stable RNNs (net of nets) can be trained to reach competitive performance on several benchmarks, including sequential CIFAR10, even when only connections between modules are trained.\n\nStrength\nHow to assemble a network of RNNs is an interesting problem. The theorems on contraction properties are helpful to people thinking about provably stable RNNs.\n\n\nWeaknesses\n(1)\tSection 3.2.1 is pretty dry to read. Reporting results from many individual AxB networks seem unnecessarily.\n\n(2)\tThe authors showed performance comparison with other types of networks in Table 1. I think it would be quite informative to show performance of networks where everything is kept the same, except that the RNNs are no longer provably stable. \n\n(3)\tIt would also be good to know what happens if all connection weights are trained, not just the connections between modules. Does the performance actually decrease despite having more parameters?\n\n(4)\tThe provably stable part is kind of separated from the training modular network part. How closely are they related? Is having stable RNN modules particularly important for sparsely connected modular networks?\n\nOverall, this is an interesting paper that takes a less common approach to RNNs: provable stability and net-of-nets. The results are at places more difficult to read, but overall it is clear.\n\nI want to add that I cannot evaluate whether the mathematical derivations are correct.",
              "parsed_review_with_titles_no_markdown": "Summary of the paper\nThe authors studied contraction properties of continuous-time recurrent neural networks. They further showed that a network of provably stable RNNs (net of nets) can be trained to reach competitive performance on several benchmarks, including sequential CIFAR10, even when only connections between modules are trained.\n\nMain review\nStrength\nHow to assemble a network of RNNs is an interesting problem. The theorems on contraction properties are helpful to people thinking about provably stable RNNs.\n\n\nWeaknesses\n(1)\tSection 3.2.1 is pretty dry to read. Reporting results from many individual AxB networks seem unnecessarily.\n\n(2)\tThe authors showed performance comparison with other types of networks in Table 1. I think it would be quite informative to show performance of networks where everything is kept the same, except that the RNNs are no longer provably stable. \n\n(3)\tIt would also be good to know what happens if all connection weights are trained, not just the connections between modules. Does the performance actually decrease despite having more parameters?\n\n(4)\tThe provably stable part is kind of separated from the training modular network part. How closely are they related? Is having stable RNN modules particularly important for sparsely connected modular networks?\n\nSummary of the review\nOverall, this is an interesting paper that takes a less common approach to RNNs: provable stability and net-of-nets. The results are at places more difficult to read, but overall it is clear.\n\nI want to add that I cannot evaluate whether the mathematical derivations are correct.",
              "clean_review": "The authors studied contraction properties of continuous-time recurrent neural networks. They further showed that a network of provably stable RNNs (net of nets) can be trained to reach competitive performance on several benchmarks, including sequential CIFAR10, even when only connections between modules are trained.\n\nStrength\nHow to assemble a network of RNNs is an interesting problem. The theorems on contraction properties are helpful to people thinking about provably stable RNNs.\n\n\nWeaknesses\n(1)\tSection 3.2.1 is pretty dry to read. Reporting results from many individual AxB networks seem unnecessarily.\n\n(2)\tThe authors showed performance comparison with other types of networks in Table 1. I think it would be quite informative to show performance of networks where everything is kept the same, except that the RNNs are no longer provably stable. \n\n(3)\tIt would also be good to know what happens if all connection weights are trained, not just the connections between modules. Does the performance actually decrease despite having more parameters?\n\n(4)\tThe provably stable part is kind of separated from the training modular network part. How closely are they related? Is having stable RNN modules particularly important for sparsely connected modular networks?\n\nOverall, this is an interesting paper that takes a less common approach to RNNs: provable stability and net-of-nets. The results are at places more difficult to read, but overall it is clear.\n\nI want to add that I cannot evaluate whether the mathematical derivations are correct.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The authors studied contraction properties of continuous-time recurrent neural networks.",
                      "They further showed that a network of provably stable RNNs (net of nets) can be trained to reach competitive performance on several benchmarks, including sequential CIFAR10, even when only connections between modules are trained."
                  ],
                  "evaluation": [
                      "How to assemble a network of RNNs is an interesting problem.",
                      "The theorems on contraction properties are helpful to people thinking about provably stable RNNs.",
                      "Section 3.2.1 is pretty dry to read.",
                      "Reporting results from many individual AxB networks seem unnecessarily.",
                      "The results are at places more difficult to read, but overall it is clear."
                  ],
                  "constructiveInput": [
                      "It would be quite informative to show performance of networks where everything is kept the same, except that the RNNs are no longer provably stable.",
                      "It would also be good to know what happens if all connection weights are trained, not just the connections between modules."
                  ],
                  "clarificationDialogue": [
                      "Does the performance actually decrease despite having more parameters?",
                      "How closely are the provably stable part and the training modular network part related?",
                      "Is having stable RNN modules particularly important for sparsely connected modular networks?"
                  ],
                  "metaCommentary": [
                      "This is an interesting paper that takes a less common approach to RNNs: provable stability and net-of-nets.",
                      "I cannot evaluate whether the mathematical derivations are correct."
                  ]
              }
          },
          {
              "review_id": 3,
              "author": "human",
              "class": "human",
              "rating": "Borderline",
              "parsed_review": "In the paper, the authors study stable architectures for RNNs. On the theoretical side, the authors present a series of conditions such that a weight matrix of an RNN is contractive. On the modeling side, the authors propose RNN architectures that have contractive weight matrices. The proposed methods are evaluated on benchmark datasets including sequential MNIST, permuted MNIST, and sequential CIFAR-10.\n\nThe theoretical results seem interesting, although not very surprising. However, I think there is disconnection between the theoretical results and the proposed model: not all theorems are relevant to the proposed model.\n\nFor the proposed model (Section 3.1), I might have missed something, but I fail to fully understand the model; the presentation could be improved to help the readers. For example, the mentioning of \"subnetworks\" at the beginning of Section 3.1 is not defined/explained. It's unclear to me how the subnetworks are combined. I can only infer from the \"recursive construction\" in the title and Figure 2 that the resulting weight matrix is a block matrix. \n\nFurthermore, the idea of parametrizing orthogonal weight matrices by exponentiating skew-symmetric matrices is not novel and has been explored in expRNN [1].\n\nThe writing of the introduction section could also be improved. Instead of discussing AlphaGo and modules in evolution, the reader might benefit from a more thorough literature review of the RNN trainability and long-term dependence.\n\n[1] Lezcano-Casado, Mario, and David Martınez-Rubio. \"Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group.\" International Conference on Machine Learning. PMLR, 2019.\n\nThe theoretical results in the paper seem interesting. However, the presentation of the proposed model is not clear, and the model itself does not seem novel. Overall, I think the paper needs improvement to meet the acceptance threshold.",
              "parsed_review_with_titles": "Summary of the paper\nIn the paper, the authors study stable architectures for RNNs. On the theoretical side, the authors present a series of conditions such that a weight matrix of an RNN is contractive. On the modeling side, the authors propose RNN architectures that have contractive weight matrices. The proposed methods are evaluated on benchmark datasets including sequential MNIST, permuted MNIST, and sequential CIFAR-10.\n\nMain review\nThe theoretical results seem interesting, although not very surprising. However, I think there is disconnection between the theoretical results and the proposed model: not all theorems are relevant to the proposed model.\n\nFor the proposed model (Section 3.1), I might have missed something, but I fail to fully understand the model; the presentation could be improved to help the readers. For example, the mentioning of \"subnetworks\" at the beginning of Section 3.1 is not defined/explained. It's unclear to me how the subnetworks are combined. I can only infer from the \"recursive construction\" in the title and Figure 2 that the resulting weight matrix is a block matrix. \n\nFurthermore, the idea of parametrizing orthogonal weight matrices by exponentiating skew-symmetric matrices is not novel and has been explored in expRNN [1].\n\nThe writing of the introduction section could also be improved. Instead of discussing AlphaGo and modules in evolution, the reader might benefit from a more thorough literature review of the RNN trainability and long-term dependence.\n\n[1] Lezcano-Casado, Mario, and David Martınez-Rubio. \"Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group.\" International Conference on Machine Learning. PMLR, 2019.\n\nSummary of the review\nThe theoretical results in the paper seem interesting. However, the presentation of the proposed model is not clear, and the model itself does not seem novel. 
Overall, I think the paper needs improvement to meet the acceptance threshold.",
              "parsed_review_no_markdown": "In the paper, the authors study stable architectures for RNNs. On the theoretical side, the authors present a series of conditions such that a weight matrix of an RNN is contractive. On the modeling side, the authors propose RNN architectures that have contractive weight matrices. The proposed methods are evaluated on benchmark datasets including sequential MNIST, permuted MNIST, and sequential CIFAR-10.\n\nThe theoretical results seem interesting, although not very surprising. However, I think there is disconnection between the theoretical results and the proposed model: not all theorems are relevant to the proposed model.\n\nFor the proposed model (Section 3.1), I might have missed something, but I fail to fully understand the model; the presentation could be improved to help the readers. For example, the mentioning of \"subnetworks\" at the beginning of Section 3.1 is not defined/explained. It's unclear to me how the subnetworks are combined. I can only infer from the \"recursive construction\" in the title and Figure 2 that the resulting weight matrix is a block matrix. \n\nFurthermore, the idea of parametrizing orthogonal weight matrices by exponentiating skew-symmetric matrices is not novel and has been explored in expRNN [1].\n\nThe writing of the introduction section could also be improved. Instead of discussing AlphaGo and modules in evolution, the reader might benefit from a more thorough literature review of the RNN trainability and long-term dependence.\n\n[1] Lezcano-Casado, Mario, and David Martınez-Rubio. \"Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group.\" International Conference on Machine Learning. PMLR, 2019.\n\nThe theoretical results in the paper seem interesting. However, the presentation of the proposed model is not clear, and the model itself does not seem novel. 
Overall, I think the paper needs improvement to meet the acceptance threshold.",
              "parsed_review_with_titles_no_markdown": "Summary of the paper\nIn the paper, the authors study stable architectures for RNNs. On the theoretical side, the authors present a series of conditions such that a weight matrix of an RNN is contractive. On the modeling side, the authors propose RNN architectures that have contractive weight matrices. The proposed methods are evaluated on benchmark datasets including sequential MNIST, permuted MNIST, and sequential CIFAR-10.\n\nMain review\nThe theoretical results seem interesting, although not very surprising. However, I think there is disconnection between the theoretical results and the proposed model: not all theorems are relevant to the proposed model.\n\nFor the proposed model (Section 3.1), I might have missed something, but I fail to fully understand the model; the presentation could be improved to help the readers. For example, the mentioning of \"subnetworks\" at the beginning of Section 3.1 is not defined/explained. It's unclear to me how the subnetworks are combined. I can only infer from the \"recursive construction\" in the title and Figure 2 that the resulting weight matrix is a block matrix. \n\nFurthermore, the idea of parametrizing orthogonal weight matrices by exponentiating skew-symmetric matrices is not novel and has been explored in expRNN [1].\n\nThe writing of the introduction section could also be improved. Instead of discussing AlphaGo and modules in evolution, the reader might benefit from a more thorough literature review of the RNN trainability and long-term dependence.\n\n[1] Lezcano-Casado, Mario, and David Martınez-Rubio. \"Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group.\" International Conference on Machine Learning. PMLR, 2019.\n\nSummary of the review\nThe theoretical results in the paper seem interesting. However, the presentation of the proposed model is not clear, and the model itself does not seem novel. 
Overall, I think the paper needs improvement to meet the acceptance threshold.",
              "clean_review": "In the paper, the authors study stable architectures for RNNs. On the theoretical side, the authors present a series of conditions such that a weight matrix of an RNN is contractive. On the modeling side, the authors propose RNN architectures that have contractive weight matrices. The proposed methods are evaluated on benchmark datasets including sequential MNIST, permuted MNIST, and sequential CIFAR-10.\n\nThe theoretical results seem interesting, although not very surprising. However, I think there is disconnection between the theoretical results and the proposed model: not all theorems are relevant to the proposed model.\n\nFor the proposed model (Section 3.1), I might have missed something, but I fail to fully understand the model; the presentation could be improved to help the readers. For example, the mentioning of \"subnetworks\" at the beginning of Section 3.1 is not defined/explained. It's unclear to me how the subnetworks are combined. I can only infer from the \"recursive construction\" in the title and Figure 2 that the resulting weight matrix is a block matrix. \n\nFurthermore, the idea of parametrizing orthogonal weight matrices by exponentiating skew-symmetric matrices is not novel and has been explored in expRNN [1].\n\nThe writing of the introduction section could also be improved. Instead of discussing AlphaGo and modules in evolution, the reader might benefit from a more thorough literature review of the RNN trainability and long-term dependence.\n\n[1] Lezcano-Casado, Mario, and David Martínez-Rubio. \"Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group.\" International Conference on Machine Learning. PMLR, 2019.\n\nThe theoretical results in the paper seem interesting. However, the presentation of the proposed model is not clear, and the model itself does not seem novel. Overall, I think the paper needs improvement to meet the acceptance threshold.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The authors study stable architectures for RNNs.",
                      "On the theoretical side, the authors present a series of conditions such that a weight matrix of an RNN is contractive.",
                      "On the modeling side, the authors propose RNN architectures that have contractive weight matrices.",
                      "The proposed methods are evaluated on benchmark datasets including sequential MNIST, permuted MNIST, and sequential CIFAR-10.",
                      "The idea of parametrizing orthogonal weight matrices by exponentiating skew-symmetric matrices has been explored in expRNN [1]."
                  ],
                  "evaluation": [
                      "The theoretical results seem interesting, although not very surprising.",
                      "There is disconnection between the theoretical results and the proposed model: not all theorems are relevant to the proposed model.",
                      "The presentation of the proposed model could be improved to help the readers.",
                      "The model itself does not seem novel.",
                      "The writing of the introduction section could also be improved."
                  ],
                  "constructiveInput": [
                      "The presentation of the proposed model (Section 3.1) could be improved.",
                      "Instead of discussing AlphaGo and modules in evolution, the reader might benefit from a more thorough literature review of the RNN trainability and long-term dependence in the introduction section."
                  ],
                  "clarificationDialogue": [
                      "I might have missed something, but I fail to fully understand the model in Section 3.1.",
                      "The mentioning of \"subnetworks\" at the beginning of Section 3.1 is not defined/explained.",
                      "It's unclear to me how the subnetworks are combined."
                  ],
                  "metaCommentary": [
                      "Overall, I think the paper needs improvement to meet the acceptance threshold."
                  ]
              }
          },
          {
              "review_id": "ai_gemini-2.5-flash_Borderline_2ca800b2",
              "author": "gemini-2.5-flash",
              "class": "ai",
              "prompt": "Borderline",
              "raw_review": "# Summary of paper\n\nThis paper introduces a novel framework for constructing provably stable \"networks of networks\" using contraction theory. The core idea is to leverage established contraction properties of individual continuous-time recurrent neural networks (RNNs) to recursively build larger, stable composite systems. The authors derive five new sufficient conditions for the global contraction of individual RNNs and, crucially, present a theorem (Theorem 6) that describes how to combine multiple independently contracting RNN subnetworks into a larger system while guaranteeing its overall stability. This combination is achieved through a specific parameterization of the inter-module connections. Experimentally, the paper demonstrates the effectiveness of these \"Combo Nets\" on sequential image classification tasks (sequential MNIST, permuted sequential MNIST, and sequential CIFAR10), achieving state-of-the-art performance among provably stable RNNs, often with fewer parameters than competing methods. They also highlight the significant positive impact of sparsity in subnetwork weights on performance and provide control studies validating the importance of their stability-preserving interconnection mechanism.\n\n# Strengths\n\n*   **Novel Contraction Conditions and Combination Framework:** The paper makes a significant theoretical contribution by providing several novel, simple-to-check contraction conditions for continuous-time RNNs (Theorems 1-5). More importantly, Theorem 6 establishes a principled, modular framework for combining these contracting subnetworks into larger, stable systems. This \"network of networks\" approach is highly relevant for building complex, reliable AI systems.\n*   **Strong Experimental Results with Provable Guarantees:** The experimental section demonstrates that these provably stable architectures not only work but also achieve competitive performance on benchmark sequential tasks. 
Setting new state-of-the-art for *stable* RNNs on permuted sequential MNIST and sequential CIFAR10, often with fewer parameters, is a compelling result that bridges the gap between theoretical guarantees and practical utility.\n*   **Emphasis on Modularity and Biological Plausibility:** The motivation for \"networks of networks\" and modularity is well-articulated, connecting to biological systems and general principles of computer science. The use of contraction theory naturally aligns with these ideas, offering a path towards more interpretable and evolvable AI.\n*   **Insightful Analysis of Sparsity:** The discovery that sparsity in subnetwork weights significantly boosts performance and correlates with stronger stability (lower max eigenvalue of |W|) is an interesting and valuable empirical finding. It suggests a practical design principle for these stable architectures.\n*   **Clear Discussion of Stability Nuances:** Section 2.3 provides a valuable discussion on common pitfalls and misconceptions regarding stability in RNNs (e.g., differentiating between autonomous and non-autonomous stability, and critiquing Jacobian eigenvalue conditions), enhancing the paper's rigor and educational value.\n\n# Weaknesses\n\n*   **Limited Scope of Individual RNN Contraction Conditions:** While five conditions are provided, their practical applicability for training *individual* subnetworks is not fully explored. For instance, the 'Sparse Combo Net' relies on randomly generated fixed subnetworks satisfying Theorem 1, implying direct training of `W` under Theorem 1 is hard. Only Theorem 5 (SVD Combo Net) gets a direct parameterization for training. This limits the practical utility of the other theoretical conditions for end-to-end learning within modules.\n*   **Complexity of Stability Guarantees in Practice:** While the paper provides provable stability, the conditions (e.g., `P(g|W| - I) + (g|W| - I)^T P < 0` in Theorem 1) involve finding a positive diagonal matrix `P`. 
The paper mentions \"Solving for M in -I = MA + A^T M will produce a valid metric for any stable LTI system A\", but for the *nonlinear* case, finding such a `P` for the contraction conditions might not always be straightforward or computationally efficient during training, especially for large `W`. The authors fixed `W` for Sparse Combo Net, side-stepping this issue for most of their results.\n*   **Lack of Deeper Analysis on Inter-Module Dynamics:** Theorem 6 provides a powerful way to ensure overall stability, but the choice of the arbitrary matrix `B` (which defines `L`) is only partially explored (e.g., masking diagonal/upper triangular blocks). A deeper understanding or principled guidance on how `B` influences the *dynamics* and information flow *between* modules, beyond just ensuring stability, would enhance the practical application and interpretability of these combination networks.\n*   **Comparison to Non-Stable SOTA:** While the paper sets a new SOTA for *stable* RNNs, the absolute performance on sequential tasks (e.g., 96.94% on permuted seqMNIST vs. 98.54% for CKConv or 98.13% for Trellis) is still behind the overall SOTA achieved by non-stable architectures, especially those with significantly more parameters. While the parameter count comparison is made, the gap still suggests a trade-off between provable stability and peak performance, which could be more explicitly acknowledged and discussed as a future challenge.\n*   **Training Time and Scalability:** The paper mentions that constraining `L` (inter-subnetwork connections) to be antisymmetric \"with respect to the overall network metric\" increases training speed, but details on the computational overhead of maintaining contraction conditions during training, especially for Theorem 5's parameterization and `L`'s constraint, are not thoroughly discussed. 
Scalability to much larger networks or more complex tasks might pose challenges not fully addressed.\n\n# Questions for authors (if any)\n\n*   **Training under Contraction Conditions:** For Theorems 1-4, which do not seem to have a direct parameterization like Theorem 5, what are the challenges and potential approaches for training the individual `W` matrices *while satisfying* these contraction conditions, rather than fixing them as in the Sparse Combo Net? Could projected gradient descent or other optimization techniques be used?\n*   **Influence of `B` on Dynamics:** Theorem 6 states that `B` is an arbitrary square matrix defining `L`. How does the choice of `B` (beyond the limited masking strategies explored) impact the learned inter-module dynamics, information flow, and ultimately the task performance? Are there theoretical guidelines or empirical observations that could inform better choices for `B` in different scenarios?\n*   **Metric Choice and Generalization:** The contraction conditions depend on the existence of a metric `P` (or `M`). How sensitive are the results to the specific choice or discovery of these metrics, especially given that some are fixed during training for the Sparse Combo Net? Does the choice of metric influence the generalization ability of the combined network?\n*   **Scalability to Deeper/Larger Architectures:** The current experiments use a modest number of subnetworks and units. What are the computational implications (e.g., memory, training time) and performance characteristics of scaling this \"network of networks\" approach to significantly deeper or wider architectures, especially given the matrix operations involved in checking/maintaining contraction?\n\n# Final Judgement\n\n*   **Rationale of recommendation**: This paper presents a theoretically sound and empirically strong approach to building provably stable recurrent neural networks. 
The derivation of novel contraction conditions for individual RNNs and, more significantly, a systematic method for combining these into stable \"networks of networks\" are substantial contributions. The experimental results, demonstrating SOTA performance among stable RNNs on benchmark tasks with competitive parameter counts, are compelling and show the practical utility of the framework. The paper is well-written, clearly articulates its contributions, and provides valuable discussions on the nuances of RNN stability. However, the practical implications of training *all* parameters under *all* proposed contraction conditions are not fully explored (relying on fixed subnetworks for the best-performing model), and a deeper understanding of how the inter-module connections' specific parameterization impacts complex dynamics beyond simple stability could further strengthen the work. The performance gap to overall (non-stable) SOTA, while acknowledged, suggests a trade-off that warrants more explicit discussion. These limitations slightly temper the excitement, but the fundamental advancements make it a valuable contribution to the field.\n*   **Recommendation**: Borderline",
              "clean_review": "This paper introduces a novel framework for constructing provably stable \"networks of networks\" using contraction theory. The core idea is to leverage established contraction properties of individual continuous-time recurrent neural networks (RNNs) to recursively build larger, stable composite systems. The authors derive five new sufficient conditions for the global contraction of individual RNNs and, crucially, present a theorem (Theorem 6) that describes how to combine multiple independently contracting RNN subnetworks into a larger system while guaranteeing its overall stability. This combination is achieved through a specific parameterization of the inter-module connections. Experimentally, the paper demonstrates the effectiveness of these \"Combo Nets\" on sequential image classification tasks (sequential MNIST, permuted sequential MNIST, and sequential CIFAR10), achieving state-of-the-art performance among provably stable RNNs, often with fewer parameters than competing methods. They also highlight the significant positive impact of sparsity in subnetwork weights on performance and provide control studies validating the importance of their stability-preserving interconnection mechanism.\n\nStrengths\n\nNovel Contraction Conditions and Combination Framework: The paper makes a significant theoretical contribution by providing several novel, simple-to-check contraction conditions for continuous-time RNNs (Theorems 1-5). More importantly, Theorem 6 establishes a principled, modular framework for combining these contracting subnetworks into larger, stable systems. This \"network of networks\" approach is highly relevant for building complex, reliable AI systems.\nStrong Experimental Results with Provable Guarantees: The experimental section demonstrates that these provably stable architectures not only work but also achieve competitive performance on benchmark sequential tasks. 
Setting new state-of-the-art for stable RNNs on permuted sequential MNIST and sequential CIFAR10, often with fewer parameters, is a compelling result that bridges the gap between theoretical guarantees and practical utility.\nEmphasis on Modularity and Biological Plausibility: The motivation for \"networks of networks\" and modularity is well-articulated, connecting to biological systems and general principles of computer science. The use of contraction theory naturally aligns with these ideas, offering a path towards more interpretable and evolvable AI.\nInsightful Analysis of Sparsity: The discovery that sparsity in subnetwork weights significantly boosts performance and correlates with stronger stability (lower max eigenvalue of |W|) is an interesting and valuable empirical finding. It suggests a practical design principle for these stable architectures.\nClear Discussion of Stability Nuances: Section 2.3 provides a valuable discussion on common pitfalls and misconceptions regarding stability in RNNs (e.g., differentiating between autonomous and non-autonomous stability, and critiquing Jacobian eigenvalue conditions), enhancing the paper's rigor and educational value.\n\nWeaknesses\n\nLimited Scope of Individual RNN Contraction Conditions: While five conditions are provided, their practical applicability for training individual subnetworks is not fully explored. For instance, the 'Sparse Combo Net' relies on randomly generated fixed subnetworks satisfying Theorem 1, implying direct training of W under Theorem 1 is hard. Only Theorem 5 (SVD Combo Net) gets a direct parameterization for training. This limits the practical utility of the other theoretical conditions for end-to-end learning within modules.\nComplexity of Stability Guarantees in Practice: While the paper provides provable stability, the conditions (e.g., P(g|W| - I) + (g|W| - I)^T P < 0 in Theorem 1) involve finding a positive diagonal matrix P. 
The paper mentions \"Solving for M in -I = MA + A^T M will produce a valid metric for any stable LTI system A\", but for the nonlinear case, finding such a P for the contraction conditions might not always be straightforward or computationally efficient during training, especially for large W. The authors fixed W for Sparse Combo Net, side-stepping this issue for most of their results.\nLack of Deeper Analysis on Inter-Module Dynamics: Theorem 6 provides a powerful way to ensure overall stability, but the choice of the arbitrary matrix B (which defines L) is only partially explored (e.g., masking diagonal/upper triangular blocks). A deeper understanding or principled guidance on how B influences the dynamics and information flow between modules, beyond just ensuring stability, would enhance the practical application and interpretability of these combination networks.\nComparison to Non-Stable SOTA: While the paper sets a new SOTA for stable RNNs, the absolute performance on sequential tasks (e.g., 96.94% on permuted seqMNIST vs. 98.54% for CKConv or 98.13% for Trellis) is still behind the overall SOTA achieved by non-stable architectures, especially those with significantly more parameters. While the parameter count comparison is made, the gap still suggests a trade-off between provable stability and peak performance, which could be more explicitly acknowledged and discussed as a future challenge.\nTraining Time and Scalability: The paper mentions that constraining L (inter-subnetwork connections) to be antisymmetric \"with respect to the overall network metric\" increases training speed, but details on the computational overhead of maintaining contraction conditions during training, especially for Theorem 5's parameterization and L's constraint, are not thoroughly discussed. 
Scalability to much larger networks or more complex tasks might pose challenges not fully addressed.\n\nQuestions for authors (if any)\n\nTraining under Contraction Conditions: For Theorems 1-4, which do not seem to have a direct parameterization like Theorem 5, what are the challenges and potential approaches for training the individual W matrices while satisfying these contraction conditions, rather than fixing them as in the Sparse Combo Net? Could projected gradient descent or other optimization techniques be used?\nInfluence of B on Dynamics: Theorem 6 states that B is an arbitrary square matrix defining L. How does the choice of B (beyond the limited masking strategies explored) impact the learned inter-module dynamics, information flow, and ultimately the task performance? Are there theoretical guidelines or empirical observations that could inform better choices for B in different scenarios?\nMetric Choice and Generalization: The contraction conditions depend on the existence of a metric P (or M). How sensitive are the results to the specific choice or discovery of these metrics, especially given that some are fixed during training for the Sparse Combo Net? Does the choice of metric influence the generalization ability of the combined network?\nScalability to Deeper/Larger Architectures: The current experiments use a modest number of subnetworks and units. What are the computational implications (e.g., memory, training time) and performance characteristics of scaling this \"network of networks\" approach to significantly deeper or wider architectures, especially given the matrix operations involved in checking/maintaining contraction?\n\nFinal Judgement\n\nRationale of recommendation: This paper presents a theoretically sound and empirically strong approach to building provably stable recurrent neural networks. 
The derivation of novel contraction conditions for individual RNNs and, more significantly, a systematic method for combining these into stable \"networks of networks\" are substantial contributions. The experimental results, demonstrating SOTA performance among stable RNNs on benchmark tasks with competitive parameter counts, are compelling and show the practical utility of the framework. The paper is well-written, clearly articulates its contributions, and provides valuable discussions on the nuances of RNN stability. However, the practical implications of training all parameters under all proposed contraction conditions are not fully explored (relying on fixed subnetworks for the best-performing model), and a deeper understanding of how the inter-module connections' specific parameterization impacts complex dynamics beyond simple stability could further strengthen the work. The performance gap to overall (non-stable) SOTA, while acknowledged, suggests a trade-off that warrants more explicit discussion. These limitations slightly temper the excitement, but the fundamental advancements make it a valuable contribution to the field.\nRecommendation: Borderline",
              "claim_extraction": {
                  "factualRestatement": [
                      "The paper introduces a novel framework for constructing provably stable \"networks of networks\" using contraction theory.",
                      "The core idea is to leverage established contraction properties of individual continuous-time recurrent neural networks (RNNs) to recursively build larger, stable composite systems.",
                      "The authors derive five new sufficient conditions for the global contraction of individual RNNs.",
                      "Theorem 6 describes how to combine multiple independently contracting RNN subnetworks into a larger system while guaranteeing its overall stability, achieved through a specific parameterization of the inter-module connections.",
                      "The paper demonstrates the effectiveness of these \"Combo Nets\" on sequential image classification tasks (sequential MNIST, permuted sequential MNIST, and sequential CIFAR10).",
                      "The paper achieves state-of-the-art performance among provably stable RNNs, often with fewer parameters than competing methods.",
                      "The paper highlights the significant positive impact of sparsity in subnetwork weights on performance.",
                      "Control studies validate the importance of their stability-preserving interconnection mechanism.",
                      "The paper provides several novel, simple-to-check contraction conditions for continuous-time RNNs (Theorems 1-5).",
                      "Theorem 6 establishes a principled, modular framework for combining contracting subnetworks into larger, stable systems.",
                      "The experimental section demonstrates that these provably stable architectures achieve competitive performance on benchmark sequential tasks.",
                      "The paper sets new state-of-the-art for stable RNNs on permuted sequential MNIST and sequential CIFAR10, often with fewer parameters.",
                      "Section 2.3 provides a discussion on common pitfalls and misconceptions regarding stability in RNNs, differentiating between autonomous and non-autonomous stability and critiquing Jacobian eigenvalue conditions.",
                      "The 'Sparse Combo Net' relies on randomly generated fixed subnetworks satisfying Theorem 1.",
                      "Only Theorem 5 (SVD Combo Net) gets a direct parameterization for training.",
                      "Theorem 6 provides a way to ensure overall stability, but the choice of the arbitrary matrix B (which defines L) is only partially explored (e.g., masking diagonal/upper triangular blocks).",
                      "The paper mentions that constraining L (inter-subnetwork connections) to be antisymmetric \"with respect to the overall network metric\" increases training speed."
                  ],
                  "evaluation": [
                      "The paper makes a significant theoretical contribution by providing several novel, simple-to-check contraction conditions for continuous-time RNNs.",
                      "Theorem 6 establishes a principled, modular framework for combining these contracting subnetworks into larger, stable systems, which is highly relevant for building complex, reliable AI systems.",
                      "The experimental results are strong, demonstrating that these provably stable architectures not only work but also achieve competitive performance, bridging the gap between theoretical guarantees and practical utility.",
                      "Setting new state-of-the-art for stable RNNs on permuted sequential MNIST and sequential CIFAR10, often with fewer parameters, is a compelling result.",
                      "The motivation for \"networks of networks\" and modularity is well-articulated, connecting to biological systems and general principles of computer science.",
                      "The use of contraction theory naturally aligns with ideas of interpretability and evolvable AI.",
                      "The discovery that sparsity in subnetwork weights significantly boosts performance and correlates with stronger stability is an interesting and valuable empirical finding, suggesting a practical design principle.",
                      "Section 2.3 provides a valuable discussion on common pitfalls and misconceptions regarding stability in RNNs, enhancing the paper's rigor and educational value.",
                      "The practical applicability of individual RNN contraction conditions (Theorems 1-4) for training individual subnetworks is not fully explored, limiting their practical utility for end-to-end learning within modules.",
                      "Finding a positive diagonal matrix P for the contraction conditions (e.g., in Theorem 1) might not always be straightforward or computationally efficient during training, especially for large W.",
                      "The authors fixed W for Sparse Combo Net, side-stepping the issue of finding P for most of their results.",
                      "A deeper understanding or principled guidance on how B influences the dynamics and information flow between modules, beyond just ensuring stability, would enhance the practical application and interpretability of these combination networks.",
                      "The absolute performance on sequential tasks is still behind the overall SOTA achieved by non-stable architectures, suggesting a trade-off between provable stability and peak performance.",
                      "Details on the computational overhead of maintaining contraction conditions during training are not thoroughly discussed.",
                      "Scalability to much larger networks or more complex tasks might pose challenges not fully addressed.",
                      "The paper is well-written, clearly articulates its contributions, and provides valuable discussions on the nuances of RNN stability.",
                      "The limitations regarding practical training of all parameters under all proposed contraction conditions and the lack of deeper analysis on inter-module dynamics slightly temper the excitement."
                  ],
                  "constructiveInput": [
                      "The paper could explore the practical applicability of contraction conditions for training individual subnetworks more fully.",
                      "A deeper understanding or principled guidance on how B influences the dynamics and information flow between modules, beyond just ensuring stability, would enhance the practical application and interpretability of these combination networks.",
                      "The trade-off between provable stability and peak performance compared to non-stable SOTA could be more explicitly acknowledged and discussed as a future challenge.",
                      "More thorough discussion on the computational overhead of maintaining contraction conditions during training and scalability to larger networks or more complex tasks would be beneficial."
                  ],
                  "clarificationDialogue": [
                      "For Theorems 1-4, which do not seem to have a direct parameterization like Theorem 5, what are the challenges and potential approaches for training the individual W matrices while satisfying these contraction conditions, rather than fixing them as in the Sparse Combo Net?",
                      "Could projected gradient descent or other optimization techniques be used for training individual W matrices under contraction conditions?",
                      "How does the choice of B (beyond the limited masking strategies explored) impact the learned inter-module dynamics, information flow, and ultimately the task performance?",
                      "Are there theoretical guidelines or empirical observations that could inform better choices for B in different scenarios?",
                      "How sensitive are the results to the specific choice or discovery of these metrics (P or M), especially given that some are fixed during training for the Sparse Combo Net?",
                      "Does the choice of metric influence the generalization ability of the combined network?",
                      "What are the computational implications (e.g., memory, training time) and performance characteristics of scaling this \"network of networks\" approach to significantly deeper or wider architectures, especially given the matrix operations involved in checking/maintaining contraction?"
                  ],
                  "metaCommentary": [
                      "This paper presents a theoretically sound and empirically strong approach to building provably stable recurrent neural networks.",
                      "The derivation of novel contraction conditions for individual RNNs and a systematic method for combining these into stable \"networks of networks\" are substantial contributions.",
                      "The experimental results, demonstrating SOTA performance among stable RNNs on benchmark tasks with competitive parameter counts, are compelling and show the practical utility of the framework.",
                      "The fundamental advancements make it a valuable contribution to the field.",
                      "The recommendation is Borderline."
                  ]
              }
          },
          {
              "review_id": "ai_gemini-2.5-flash_Accept_3e1b9993",
              "author": "gemini-2.5-flash",
              "class": "ai",
              "prompt": "Accept",
              "raw_review": "# Summary of paper\n\nThis paper introduces a novel approach to constructing provably stable recurrent neural networks (RNNs) by leveraging contraction theory. The core idea is to establish strong contraction conditions for individual continuous-time RNN subnetworks and then to demonstrate a modular principle that allows for recursively combining these subnetworks into larger \"networks of networks\" while preserving global stability. The authors present five novel contraction conditions for individual RNNs and, critically, derive Theorem 6, which specifies how to parameterize the inter-module connections ($L$) to guarantee overall system contraction, irrespective of the specific subnetwork contraction condition met. The proposed \"Sparse Combo Net\" and \"SVD Combo Net\" architectures, built upon these principles, achieve state-of-the-art performance among stability-guaranteed RNNs on benchmark sequential tasks (sequential MNIST, permuted sequential MNIST, and sequential CIFAR10) with competitive parameter counts. 
The paper also includes detailed experimental analysis on the impact of sparsity and network size, and control studies to validate the importance of the stability-preserving inter-module connections.\n\n# Strengths\n - **Rigorous Mathematical Foundation for Modular Stability**: The paper makes significant theoretical contributions by deriving novel, practical contraction conditions for continuous-time RNNs (Theorems 1-5) and, most importantly, providing a general theorem (Theorem 6) for composing these stable subnetworks into larger, provably stable \"networks of networks.\" This modularity principle is a critical step towards building complex, yet verifiable, deep learning systems.\n - **State-of-the-Art Performance for Stable RNNs**: The proposed architectures, especially the Sparse Combo Net, achieve new state-of-the-art test accuracies on challenging sequential tasks like permuted sequential MNIST and sequential CIFAR10 *among stability-guaranteed RNNs*. This demonstrates that enforcing strong stability constraints does not necessarily compromise performance, and can even lead to highly competitive results compared to non-stable SOTA models with significantly fewer parameters.\n - **Practical Demonstration of Sparsity Benefits**: The experimental results clearly show that sparsity in the hidden-to-hidden weights of subnetworks leads to substantial performance improvements. This finding is highly relevant for efficient model design and aligns with biological principles, suggesting that the combination of stability and sparsity is a powerful avenue for future research.\n - **Comprehensive Experimental Validation and Controls**: The paper provides a thorough experimental section, including detailed analyses of sparsity levels, network size, and strong control studies that directly validate the importance of the contraction-preserving connections between subnetworks (Section 3.2.3). 
The repeatability studies further strengthen the empirical claims.\n - **Clear Distinction from Prior Work**: The authors effectively differentiate their contribution from previous work on RNN stability, particularly by focusing on continuous-time models and the critical aspect of *combining* stable modules rather than just stabilizing individual RNNs. The discussion in Section 2.3 also clarifies common misconceptions about stability conditions.\n\n# Weaknesses\n - **Limited Exploration of Continuous-Time RNNs**: While the paper focuses on continuous-time RNNs, the experimental section only uses ReLU activation functions. Given that continuous-time models and their stability properties are often discussed in the context of smooth activations (e.g., tanh, sigmoid), exploring the empirical performance and implications of different activation functions (especially smoother ones that naturally fit the continuous-time framework) would strengthen the generalizability of the findings.\n - **Parameterization of Subnetwork Contraction Conditions**: For the Sparse Combo Net, the individual subnetworks are initialized to satisfy Theorem 1, but their weights are *not trained*. This limits the adaptability of the individual modules. For the SVD Combo Net, direct parameterization using Theorem 5 allows training, but Theorem 1's non-trainability is a significant constraint. Further work on direct, trainable parameterizations for *all* presented contraction conditions would enhance practical utility.\n - **Complexity of Contraction Conditions for Training**: Some of the derived contraction conditions (e.g., Theorem 1 and 5) involve finding a positive diagonal matrix P or Phi, which might add complexity or computational overhead during training if these metrics are also learned or need to be continuously verified. 
A discussion on the practical implications or limitations of finding/maintaining these metrics during large-scale training could be beneficial.\n - **Modularity Assumptions in B Matrix**: The choice to only train off-diagonal and lower-triangular blocks of B (to reflect modularity assumptions and speed up training) is explained but could be further justified. While pragmatic, it restricts the general form of inter-module connections implied by Theorem 6. Exploring the impact of more generalized B matrices, even if computationally more intensive, could offer insights into the full potential of the modularity principle.\n\n# Questions for authors (if any)\n - **Activation Functions**: You've primarily used ReLU activation in experiments. How do your stability guarantees and empirical performance hold with other common continuous-time RNN activations like tanh or sigmoid, which might be more naturally aligned with continuous dynamics? Did you explore these, and if so, what were the observations?\n - **Scalability and Parameterization of `P` or `Φ`**: For Theorems 1 and 5, a positive diagonal matrix (P or $\\Phi$) is crucial. How do you handle the selection or learning of this matrix in practice for larger networks, especially given the conditions $(g|\\mathbf{W}| - \\mathbf{I})^T \\mathbf{P} \\prec 0$ (Theorem 1) or $g^2 \\mathbf{W}^T \\mathbf{P} \\mathbf{W} - \\mathbf{P} \\prec 0$ (Theorem 5)? Does this introduce computational challenges or restrict the types of learnable W matrices?\n - **Extending Trainable Parameterizations**: The Sparse Combo Net fixes subnetwork weights after initialization, which is a strong constraint. Are there promising directions or ongoing work to develop trainable parameterizations for Theorem 1 (Absolute Value Restricted Weights) that allow for learning within the subnetworks while maintaining stability?\n - **Generalizing B Matrix Training**: You chose to restrict the training of the B matrix to off-diagonal and lower-triangular blocks. 
What would be the implications (both theoretically and empirically, in terms of performance and stability) of training a fully general B matrix as allowed by Theorem 6? Have you considered or attempted this, and what were the challenges?\n\n# Final Judgement\n - **Rationale of recommendation**: This paper makes significant theoretical and empirical contributions towards developing provably stable recurrent neural networks, particularly focusing on their modular composition. The derivation of novel contraction conditions for individual continuous-time RNNs and the powerful \"network of networks\" theorem (Theorem 6) provides a robust mathematical framework for building complex, verifiable deep learning architectures. The experimental results convincingly demonstrate that these stable models can achieve state-of-the-art performance among stability-guaranteed RNNs, even outperforming some non-stable models with significantly fewer parameters. The findings on the benefits of sparsity are particularly compelling and open promising avenues for future research. While there are some areas for further exploration (e.g., broader activation function analysis, more flexible parameterization of subnetwork weights during training), these do not detract from the core achievements. The work is well-motivated, rigorously executed, and presents a clear path forward for enhancing the trustworthiness and predictability of deep learning systems.\n - **Recommendation**: Accept",
              "clean_review": "This paper introduces a novel approach to constructing provably stable recurrent neural networks (RNNs) by leveraging contraction theory. The core idea is to establish strong contraction conditions for individual continuous-time RNN subnetworks and then to demonstrate a modular principle that allows for recursively combining these subnetworks into larger \"networks of networks\" while preserving global stability. The authors present five novel contraction conditions for individual RNNs and, critically, derive Theorem 6, which specifies how to parameterize the inter-module connections (L) to guarantee overall system contraction, irrespective of the specific subnetwork contraction condition met. The proposed \"Sparse Combo Net\" and \"SVD Combo Net\" architectures, built upon these principles, achieve state-of-the-art performance among stability-guaranteed RNNs on benchmark sequential tasks (sequential MNIST, permuted sequential MNIST, and sequential CIFAR10) with competitive parameter counts. The paper also includes detailed experimental analysis on the impact of sparsity and network size, and control studies to validate the importance of the stability-preserving inter-module connections.\n\nStrengths\nRigorous Mathematical Foundation for Modular Stability: The paper makes significant theoretical contributions by deriving novel, practical contraction conditions for continuous-time RNNs (Theorems 1-5) and, most importantly, providing a general theorem (Theorem 6) for composing these stable subnetworks into larger, provably stable \"networks of networks.\" This modularity principle is a critical step towards building complex, yet verifiable, deep learning systems.\nState-of-the-Art Performance for Stable RNNs: The proposed architectures, especially the Sparse Combo Net, achieve new state-of-the-art test accuracies on challenging sequential tasks like permuted sequential MNIST and sequential CIFAR10 among stability-guaranteed RNNs. 
This demonstrates that enforcing strong stability constraints does not necessarily compromise performance, and can even lead to highly competitive results compared to non-stable SOTA models with significantly fewer parameters.\nPractical Demonstration of Sparsity Benefits: The experimental results clearly show that sparsity in the hidden-to-hidden weights of subnetworks leads to substantial performance improvements. This finding is highly relevant for efficient model design and aligns with biological principles, suggesting that the combination of stability and sparsity is a powerful avenue for future research.\nComprehensive Experimental Validation and Controls: The paper provides a thorough experimental section, including detailed analyses of sparsity levels, network size, and strong control studies that directly validate the importance of the contraction-preserving connections between subnetworks (Section 3.2.3). The repeatability studies further strengthen the empirical claims.\nClear Distinction from Prior Work: The authors effectively differentiate their contribution from previous work on RNN stability, particularly by focusing on continuous-time models and the critical aspect of combining stable modules rather than just stabilizing individual RNNs. The discussion in Section 2.3 also clarifies common misconceptions about stability conditions.\n\nWeaknesses\nLimited Exploration of Continuous-Time RNNs: While the paper focuses on continuous-time RNNs, the experimental section only uses ReLU activation functions. 
Given that continuous-time models and their stability properties are often discussed in the context of smooth activations (e.g., tanh, sigmoid), exploring the empirical performance and implications of different activation functions (especially smoother ones that naturally fit the continuous-time framework) would strengthen the generalizability of the findings.\nParameterization of Subnetwork Contraction Conditions: For the Sparse Combo Net, the individual subnetworks are initialized to satisfy Theorem 1, but their weights are not trained. This limits the adaptability of the individual modules. For the SVD Combo Net, direct parameterization using Theorem 5 allows training, but Theorem 1's non-trainability is a significant constraint. Further work on direct, trainable parameterizations for all presented contraction conditions would enhance practical utility.\nComplexity of Contraction Conditions for Training: Some of the derived contraction conditions (e.g., Theorem 1 and 5) involve finding a positive diagonal matrix P or Phi, which might add complexity or computational overhead during training if these metrics are also learned or need to be continuously verified. A discussion on the practical implications or limitations of finding/maintaining these metrics during large-scale training could be beneficial.\nModularity Assumptions in B Matrix: The choice to only train off-diagonal and lower-triangular blocks of B (to reflect modularity assumptions and speed up training) is explained but could be further justified. While pragmatic, it restricts the general form of inter-module connections implied by Theorem 6. Exploring the impact of more generalized B matrices, even if computationally more intensive, could offer insights into the full potential of the modularity principle.\n\nQuestions for authors (if any)\nActivation Functions: You've primarily used ReLU activation in experiments. 
How do your stability guarantees and empirical performance hold with other common continuous-time RNN activations like tanh or sigmoid, which might be more naturally aligned with continuous dynamics? Did you explore these, and if so, what were the observations?\nScalability and Parameterization of P or Φ: For Theorems 1 and 5, a positive diagonal matrix (P or Φ) is crucial. How do you handle the selection or learning of this matrix in practice for larger networks, especially given the conditions (g|W| - I)^T P ≺ 0 (Theorem 1) or g^2 W^T P W - P ≺ 0 (Theorem 5)? Does this introduce computational challenges or restrict the types of learnable W matrices?\nExtending Trainable Parameterizations: The Sparse Combo Net fixes subnetwork weights after initialization, which is a strong constraint. Are there promising directions or ongoing work to develop trainable parameterizations for Theorem 1 (Absolute Value Restricted Weights) that allow for learning within the subnetworks while maintaining stability?\nGeneralizing B Matrix Training: You chose to restrict the training of the B matrix to off-diagonal and lower-triangular blocks. What would be the implications (both theoretically and empirically, in terms of performance and stability) of training a fully general B matrix as allowed by Theorem 6? Have you considered or attempted this, and what were the challenges?\n\nFinal Judgement\nRationale of recommendation: This paper makes significant theoretical and empirical contributions towards developing provably stable recurrent neural networks, particularly focusing on their modular composition. The derivation of novel contraction conditions for individual continuous-time RNNs and the powerful \"network of networks\" theorem (Theorem 6) provides a robust mathematical framework for building complex, verifiable deep learning architectures. 
The experimental results convincingly demonstrate that these stable models can achieve state-of-the-art performance among stability-guaranteed RNNs, even outperforming some non-stable models with significantly fewer parameters. The findings on the benefits of sparsity are particularly compelling and open promising avenues for future research. While there are some areas for further exploration (e.g., broader activation function analysis, more flexible parameterization of subnetwork weights during training), these do not detract from the core achievements. The work is well-motivated, rigorously executed, and presents a clear path forward for enhancing the trustworthiness and predictability of deep learning systems.\nRecommendation: Accept",
              "claim_extraction": {
                  "factualRestatement": [
                      "The paper introduces a novel approach to constructing provably stable recurrent neural networks (RNNs) by leveraging contraction theory.",
                      "The core idea is to establish strong contraction conditions for individual continuous-time RNN subnetworks and then to demonstrate a modular principle that allows for recursively combining these subnetworks into larger \"networks of networks\" while preserving global stability.",
                      "The authors present five novel contraction conditions for individual RNNs and, critically, derive Theorem 6, which specifies how to parameterize the inter-module connections (L) to guarantee overall system contraction, irrespective of the specific subnetwork contraction condition met.",
                      "The proposed \"Sparse Combo Net\" and \"SVD Combo Net\" architectures, built upon these principles, achieve state-of-the-art performance among stability-guaranteed RNNs on benchmark sequential tasks (sequential MNIST, permuted sequential MNIST, and sequential CIFAR10) with competitive parameter counts.",
                      "The paper also includes detailed experimental analysis on the impact of sparsity and network size, and control studies to validate the importance of the stability-preserving inter-module connections.",
                      "The paper makes significant theoretical contributions by deriving novel, practical contraction conditions for continuous-time RNNs (Theorems 1-5) and, most importantly, providing a general theorem (Theorem 6) for composing these stable subnetworks into larger, provably stable \"networks of networks.\"",
                      "The proposed architectures, especially the Sparse Combo Net, achieve new state-of-the-art test accuracies on challenging sequential tasks like permuted sequential MNIST and sequential CIFAR10 among stability-guaranteed RNNs.",
                      "The experimental results clearly show that sparsity in the hidden-to-hidden weights of subnetworks leads to substantial performance improvements.",
                      "The paper provides a thorough experimental section, including detailed analyses of sparsity levels, network size, and strong control studies that directly validate the importance of the contraction-preserving connections between subnetworks (Section 3.2.3).",
                      "The repeatability studies further strengthen the empirical claims.",
                      "The authors effectively differentiate their contribution from previous work on RNN stability, particularly by focusing on continuous-time models and the critical aspect of combining stable modules rather than just stabilizing individual RNNs.",
                      "The discussion in Section 2.3 also clarifies common misconceptions about stability conditions.",
                      "For the Sparse Combo Net, the individual subnetworks are initialized to satisfy Theorem 1, but their weights are not trained.",
                      "For the SVD Combo Net, direct parameterization using Theorem 5 allows training."
                  ],
                  "evaluation": [
                      "This modularity principle is a critical step towards building complex, yet verifiable, deep learning systems.",
                      "This demonstrates that enforcing strong stability constraints does not necessarily compromise performance, and can even lead to highly competitive results compared to non-stable SOTA models with significantly fewer parameters.",
                      "This finding is highly relevant for efficient model design and aligns with biological principles, suggesting that the combination of stability and sparsity is a powerful avenue for future research.",
                      "The paper provides a thorough experimental section, including detailed analyses of sparsity levels, network size, and strong control studies that directly validate the importance of the contraction-preserving connections between subnetworks (Section 3.2.3).",
                      "The authors effectively differentiate their contribution from previous work on RNN stability, particularly by focusing on continuous-time models and the critical aspect of combining stable modules rather than just stabilizing individual RNNs.",
                      "The discussion in Section 2.3 also clarifies common misconceptions about stability conditions.",
                      "While the paper focuses on continuous-time RNNs, the experimental section only uses ReLU activation functions.",
                      "Given that continuous-time models and their stability properties are often discussed in the context of smooth activations (e.g., tanh, sigmoid), exploring the empirical performance and implications of different activation functions (especially smoother ones that naturally fit the continuous-time framework) would strengthen the generalizability of the findings.",
                      "The non-trainability of Theorem 1 is a significant constraint.",
                      "Further work on direct, trainable parameterizations for all presented contraction conditions would enhance practical utility.",
                      "Some of the derived contraction conditions (e.g., Theorem 1 and 5) involve finding a positive diagonal matrix P or Phi, which might add complexity or computational overhead during training if these metrics are also learned or need to be continuously verified.",
                      "The choice to only train off-diagonal and lower-triangular blocks of B (to reflect modularity assumptions and speed up training) is explained but could be further justified.",
                      "While pragmatic, it restricts the general form of inter-module connections implied by Theorem 6.",
                      "This paper makes significant theoretical and empirical contributions towards developing provably stable recurrent neural networks, particularly focusing on their modular composition.",
                      "The derivation of novel contraction conditions for individual continuous-time RNNs and the powerful \"network of networks\" theorem (Theorem 6) provides a robust mathematical framework for building complex, verifiable deep learning architectures.",
                      "The experimental results convincingly demonstrate that these stable models can achieve state-of-the-art performance among stability-guaranteed RNNs, even outperforming some non-stable models with significantly fewer parameters.",
                      "The findings on the benefits of sparsity are particularly compelling and open promising avenues for future research.",
                      "While there are some areas for further exploration (e.g., broader activation function analysis, more flexible parameterization of subnetwork weights during training), these do not detract from the core achievements.",
                      "The work is well-motivated, rigorously executed, and presents a clear path forward for enhancing the trustworthiness and predictability of deep learning systems."
                  ],
                  "constructiveInput": [
                      "Exploring the empirical performance and implications of different activation functions (especially smoother ones that naturally fit the continuous-time framework) would strengthen the generalizability of the findings.",
                      "Further work on direct, trainable parameterizations for all presented contraction conditions would enhance practical utility.",
                      "A discussion on the practical implications or limitations of finding/maintaining these metrics during large-scale training could be beneficial.",
                      "Exploring the impact of more generalized B matrices, even if computationally more intensive, could offer insights into the full potential of the modularity principle."
                  ],
                  "clarificationDialogue": [
                      "How do your stability guarantees and empirical performance hold with other common continuous-time RNN activations like tanh or sigmoid, which might be more naturally aligned with continuous dynamics?",
                      "Did you explore these, and if so, what were the observations?",
                      "How do you handle the selection or learning of this matrix in practice for larger networks, especially given the conditions (g|W| - I)^T P ≺ 0 (Theorem 1) or g^2 W^T P W - P ≺ 0 (Theorem 5)?",
                      "Does this introduce computational challenges or restrict the types of learnable W matrices?",
                      "Are there promising directions or ongoing work to develop trainable parameterizations for Theorem 1 (Absolute Value Restricted Weights) that allow for learning within the subnetworks while maintaining stability?",
                      "What would be the implications (both theoretically and empirically, in terms of performance and stability) of training a fully general B matrix as allowed by Theorem 6?",
                      "Have you considered or attempted this, and what were the challenges?"
                  ],
                  "metaCommentary": [
                      "Recommendation: Accept"
                  ]
              }
          },
          {
              "review_id": "rewrite_0_gemini-2.5-flash_61d09116",
              "og_review_id": 0,
              "author": "gemini-2.5-flash",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "This submission introduces novel theorems demonstrating the stability of a specific class of Recurrent Neural Networks (RNNs). By integrating these RNNs into hierarchical and feedback architectures, the authors report state-of-the-art performance across several tasks.\n\n**Theoretical Results**\n\nThe theoretical contributions, particularly Theorems 1-5, represent a significant advancement in understanding the conditions governing RNN stability. The authors also present a compelling counterexample to the prevailing assumption that linear contraction is a sufficient condition for non-linear stability. These theoretical findings are subsequently (and partially) leveraged to design demonstrably stable RNN architectures.\n\n**Experimental Claims**\n\nWhile the experimental section presents interesting findings, it appears to lack a clear overarching conclusion. Furthermore, certain reporting practices observed in the tables raise concerns.\n\nFor instance, regarding the results presented in Table 1, the authors state that the Perm-MNIST trial was executed four times, yielding results between 96.65% and 96.94%. It is noted with concern that only the highest value, 96.94%, is reported in Table 1. Reporting only the best-case scenario from multiple experimental runs is not considered best practice and may lead to misinterpretations of the typical performance. It is recommended that the authors report the mean and variance of their experimental results. Relying solely on the best performance from multiple runs could significantly overestimate the average achievable performance.\n\nSome claims are substantiated by only a single data point. For example, the assertion that increased modularity benefits performance up to a certain point is supported solely by the observation that a 44x8 configuration outperforms a 22x16 configuration in Section 3.2.1. 
To draw a more robust conclusion and demonstrate a clear trend, the authors might consider exploring additional configurations, such as 50x7 and 39x9.\n\nIn general, the inclusion of more data points and error bars would enhance the credibility of the conclusions, helping to persuade the reader that the observed results are statistically significant and not merely anomalous.\n\nA discrepancy is observed between the results for 'performance vs. network size' and 'performance vs. modularity' in Sections 3.1 and 3.2. In Section 3.1, increased network size monotonically improves performance, whereas in Section 3.2, it exhibits an inverted U-shape trend. Similarly, in Section 3.2, modularity monotonically enhances performance, while in Section 3.1, it follows an inverted U-shape. The overarching conclusion to be drawn from these seemingly contradictory trends is not explicitly stated.\n\nOverall, the experimental section is somewhat challenging to follow and would benefit from a summary of its main conclusions, along with clearly demarcated paragraphs and sections that individually illustrate each point.\n\n**Clarity and Other Minor Points**\n\nThe paper's clarity varies, being exceptionally clear in certain discussions, such as those on stability and contraction, but becoming confusing in other areas. For instance, in the paragraph at the top of page 3, the symbol 'g' is used to denote two distinct concepts within the same paragraph.\n\nTheorem 7 should be rephrased slightly to explicitly clarify that the first inequality represents a condition rather than a statement. 
While this may be evident upon closer inspection, it can be confusing for a new reader.\n\nFor denoting multiplication, it is suggested to use the symbol '$\\times$' instead of 'x' (e.g., 22$\\times$16).\n\nThe rationale for including a subsection 3.2.1 rather than simply section 3.2 is unclear.\n\nAn overall review of the paper for readability is suggested to maximize comprehension.\n\nThe paper undeniably represents an evolutionary step in the understanding and design of stable RNNs, with novel and noteworthy theoretical results. However, the experimental results lack clear conclusions and, at times, do not adhere to best practices, particularly in the reporting of experimental runs.",
              "clean_review": "This submission introduces novel theorems demonstrating the stability of a specific class of Recurrent Neural Networks (RNNs). By integrating these RNNs into hierarchical and feedback architectures, the authors report state-of-the-art performance across several tasks.\n\nTheoretical Results\n\nThe theoretical contributions, particularly Theorems 1-5, represent a significant advancement in understanding the conditions governing RNN stability. The authors also present a compelling counterexample to the prevailing assumption that linear contraction is a sufficient condition for non-linear stability. These theoretical findings are subsequently (and partially) leveraged to design demonstrably stable RNN architectures.\n\nExperimental Claims\n\nWhile the experimental section presents interesting findings, it appears to lack a clear overarching conclusion. Furthermore, certain reporting practices observed in the tables raise concerns.\n\nFor instance, regarding the results presented in Table 1, the authors state that the Perm-MNIST trial was executed four times, yielding results between 96.65% and 96.94%. It is noted with concern that only the highest value, 96.94%, is reported in Table 1. Reporting only the best-case scenario from multiple experimental runs is not considered best practice and may lead to misinterpretations of the typical performance. It is recommended that the authors report the mean and variance of their experimental results. Relying solely on the best performance from multiple runs could significantly overestimate the average achievable performance.\n\nSome claims are substantiated by only a single data point. For example, the assertion that increased modularity benefits performance up to a certain point is supported solely by the observation that a 44x8 configuration outperforms a 22x16 configuration in Section 3.2.1. 
To draw a more robust conclusion and demonstrate a clear trend, the authors might consider exploring additional configurations, such as 50x7 and 39x9.\n\nIn general, the inclusion of more data points and error bars would enhance the credibility of the conclusions, helping to persuade the reader that the observed results are statistically significant and not merely anomalous.\n\nA discrepancy is observed between the results for 'performance vs. network size' and 'performance vs. modularity' in Sections 3.1 and 3.2. In Section 3.1, increased network size monotonically improves performance, whereas in Section 3.2, it exhibits an inverted U-shape trend. Similarly, in Section 3.2, modularity monotonically enhances performance, while in Section 3.1, it follows an inverted U-shape. The overarching conclusion to be drawn from these seemingly contradictory trends is not explicitly stated.\n\nOverall, the experimental section is somewhat challenging to follow and would benefit from a summary of its main conclusions, along with clearly demarcated paragraphs and sections that individually illustrate each point.\n\nClarity and Other Minor Points\n\nThe paper's clarity varies, being exceptionally clear in certain discussions, such as those on stability and contraction, but becoming confusing in other areas. For instance, in the paragraph at the top of page 3, the symbol 'g' is used to denote two distinct concepts within the same paragraph.\n\nTheorem 7 should be rephrased slightly to explicitly clarify that the first inequality represents a condition rather than a statement. 
While this may be evident upon closer inspection, it can be confusing for a new reader.\n\nFor denoting multiplication, it is suggested to use the symbol '×' instead of 'x' (e.g., 22×16).\n\nThe rationale for including a subsection 3.2.1 rather than simply section 3.2 is unclear.\n\nAn overall review of the paper for readability is suggested to maximize comprehension.\n\nThe paper undeniably represents an evolutionary step in the understanding and design of stable RNNs, with novel and noteworthy theoretical results. However, the experimental results lack clear conclusions and, at times, do not adhere to best practices, particularly in the reporting of experimental runs.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The submission introduces novel theorems demonstrating the stability of a specific class of Recurrent Neural Networks (RNNs).",
                      "The authors report state-of-the-art performance across several tasks by integrating these RNNs into hierarchical and feedback architectures.",
                      "The theoretical contributions, particularly Theorems 1-5, represent a significant advancement in understanding the conditions governing RNN stability.",
                      "The authors present a compelling counterexample to the prevailing assumption that linear contraction is a sufficient condition for non-linear stability.",
                      "These theoretical findings are subsequently (and partially) leveraged to design demonstrably stable RNN architectures.",
                      "Regarding the results presented in Table 1, the authors state that the Perm-MNIST trial was executed four times, yielding results between 96.65% and 96.94%.",
                      "Only the highest value, 96.94%, is reported in Table 1.",
                      "The assertion that increased modularity benefits performance up to a certain point is supported solely by the observation that a 44x8 configuration outperforms a 22x16 configuration in Section 3.2.1.",
                      "A discrepancy is observed between the results for 'performance vs. network size' and 'performance vs. modularity' in Sections 3.1 and 3.2.",
                      "In Section 3.1, increased network size monotonically improves performance, whereas in Section 3.2, it exhibits an inverted U-shape trend.",
                      "In Section 3.2, modularity monotonically enhances performance, while in Section 3.1, it follows an inverted U-shape.",
                      "In the paragraph at the top of page 3, the symbol 'g' is used to denote two distinct concepts within the same paragraph."
                  ],
                  "evaluation": [
                      "The theoretical contributions, particularly Theorems 1-5, represent a significant advancement in understanding the conditions governing RNN stability.",
                      "The authors also present a compelling counterexample to the prevailing assumption that linear contraction is a sufficient condition for non-linear stability.",
                      "While the experimental section presents interesting findings, it appears to lack a clear overarching conclusion.",
                      "Certain reporting practices observed in the tables raise concerns.",
                      "Reporting only the best-case scenario from multiple experimental runs is not considered best practice and may lead to misinterpretations of the typical performance.",
                      "Relying solely on the best performance from multiple runs could significantly overestimate the average achievable performance.",
                      "Some claims are substantiated by only a single data point.",
                      "The experimental section is somewhat challenging to follow.",
                      "The paper's clarity varies, being exceptionally clear in certain discussions, such as those on stability and contraction, but becoming confusing in other areas.",
                      "The paper undeniably represents an evolutionary step in the understanding and design of stable RNNs, with novel and noteworthy theoretical results.",
                      "The experimental results lack clear conclusions and, at times, do not adhere to best practices, particularly in the reporting of experimental runs."
                  ],
                  "constructiveInput": [
                      "It is recommended that the authors report the mean and variance of their experimental results.",
                      "To draw a more robust conclusion and demonstrate a clear trend, the authors might consider exploring additional configurations, such as 50x7 and 39x9.",
                      "In general, the inclusion of more data points and error bars would enhance the credibility of the conclusions, helping to persuade the reader that the observed results are statistically significant and not merely anomalous.",
                      "The experimental section would benefit from a summary of its main conclusions, along with clearly demarcated paragraphs and sections that individually illustrate each point.",
                      "Theorem 7 should be rephrased slightly to explicitly clarify that the first inequality represents a condition rather than a statement.",
                      "For denoting multiplication, it is suggested to use the symbol '×' instead of 'x' (e.g., 22×16).",
                      "An overall review of the paper for readability is suggested to maximize comprehension."
                  ],
                  "clarificationDialogue": [
                      "The overarching conclusion to be drawn from these seemingly contradictory trends is not explicitly stated.",
                      "The rationale for including a subsection 3.2.1 rather than simply section 3.2 is unclear."
                  ],
                  "metaCommentary": []
              }
          },
          {
              "review_id": "rewrite_1_gemini-2.5-flash_60a6f007",
              "og_review_id": 1,
              "author": "gemini-2.5-flash",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Accept",
              "raw_review": "This paper offers a substantial theoretical contribution to the construction of recurrent neural network assemblies. While the efficacy of combining learned modular components, particularly in applied domains such as AlphaGo, is well-established for its power and tractability compared to developing bespoke models, theoretical guarantees regarding the stability of such combinations remain elusive. This work addresses this gap by developing a theoretical framework for provably-stable combinations of RNNs, leveraging weight constraints and feedback mechanisms. Employing fixed RNNs designed under these constraints, with antisymmetric learnable parameters governing inter-module connections, the authors demonstrate that their sparse combination network achieves state-of-the-art (SOTA) performance on sequential image classification benchmarks. Notably, this is accomplished with substantially fewer learned parameters and the additional benefit of theoretical stability guarantees.\n\n**Strengths:**\n*   The empirical results are remarkably convincing for a paper that is primarily theoretical in nature. The authors conduct a thorough investigation into various permutations of their modular sparse combination network framework, exploring the interplay between the number and size of RNNs using absolute value weight constraints. They also examine an alternative SVD weight constraint network, noting its comparatively lower performance and slower training. Crucially, the authors demonstrate that their approach surpasses SOTA algorithms on several common, albeit less complex, benchmarks in the field, even within (and potentially due to) the imposed constraints.\n*   The theoretical contribution presented is highly significant. Recent work, including systems like AlphaGo and the more general Recurrent Independent Mechanisms (RIMs) framework, has explored networks with numerous individual recurrent components. 
However, these often rely on intuitive explanations and empirical validation rather than robust theoretical guarantees. While specialized RNN modules clearly hold considerable potential, RNNs are known for their instability and training challenges, making end-to-end learning of such models particularly difficult. The ability to apply these constraint conditions while maintaining strong performance, particularly evident in the absolute value constraint case, suggests a promising path towards developing reusable modules that can be flexibly combined for specific tasks. Although this paper does not address all intermediate questions, its stability analysis constitutes a pivotal step forward.\n*   The proofs provided in the appendix are well-structured and readily comprehensible for readers possessing a sufficient mathematical background.\n\n**Weaknesses:**\n*   The paper is notably dense and challenging to follow. It required multiple readings to fully grasp the significance of network stability and the mechanisms by which it is achieved in this context. The appendix is an essential read, as are several cited references. The presented use cases lack immediate intuitive clarity. Enhancements such as a graphical representation of the sparse combination network (as an alternative to the weight matrices in Figure 2), the inclusion of pseudocode for the algorithms (perhaps in the Appendix), and an illustrative example of an unstable network assembly diverging, would significantly improve readability. Furthermore, prior familiarity with AlphaGo and similar methodologies provided a greater insight into the practical implications than the paper itself conveyed.\n*   While the provided empirical results are appreciated, they are exclusively focused on sequential image prediction. 
The inclusion of at least one application in a different domain (e.g., Natural Language Processing, Reinforcement Learning, continuous control) would broaden the demonstrated applicability of the proposed framework.\n\n**Overall Recommendation:**\nI recommend this paper for acceptance. Despite its demanding nature and the need for extensive consultation with external literature, I believe this work aligns with the future trajectory of Deep Learning algorithms. The theoretical and practical outcomes presented here possess considerable potential. To further enhance the paper's impact, I suggest incorporating results from an additional domain and dedicating more effort to improving its overall readability. Valuable theoretical contributions are too often underutilized because they are difficult to understand or do not immediately resonate with the empiricists and engineers who could build upon them.",
              "clean_review": "This paper offers a substantial theoretical contribution to the construction of recurrent neural network assemblies. While the efficacy of combining learned modular components, particularly in applied domains such as AlphaGo, is well-established for its power and tractability compared to developing bespoke models, theoretical guarantees regarding the stability of such combinations remain elusive. This work addresses this gap by developing a theoretical framework for provably-stable combinations of RNNs, leveraging weight constraints and feedback mechanisms. Employing fixed RNNs designed under these constraints, with antisymmetric learnable parameters governing inter-module connections, the authors demonstrate that their sparse combination network achieves state-of-the-art (SOTA) performance on sequential image classification benchmarks. Notably, this is accomplished with substantially fewer learned parameters and the additional benefit of theoretical stability guarantees.\n\nStrengths:\nThe empirical results are remarkably convincing for a paper that is primarily theoretical in nature. The authors conduct a thorough investigation into various permutations of their modular sparse combination network framework, exploring the interplay between the number and size of RNNs using absolute value weight constraints. They also examine an alternative SVD weight constraint network, noting its comparatively lower performance and slower training. Crucially, the authors demonstrate that their approach surpasses SOTA algorithms on several common, albeit less complex, benchmarks in the field, even within (and potentially due to) the imposed constraints.\nThe theoretical contribution presented is highly significant. Recent work, including systems like AlphaGo and the more general Recurrent Independent Mechanisms (RIMs) framework, has explored networks with numerous individual recurrent components. 
However, these often rely on intuitive explanations and empirical validation rather than robust theoretical guarantees. While specialized RNN modules clearly hold considerable potential, RNNs are known for their instability and training challenges, making end-to-end learning of such models particularly difficult. The ability to apply these constraint conditions while maintaining strong performance, particularly evident in the absolute value constraint case, suggests a promising path towards developing reusable modules that can be flexibly combined for specific tasks. Although this paper does not address all intermediate questions, its stability analysis constitutes a pivotal step forward.\nThe proofs provided in the appendix are well-structured and readily comprehensible for readers possessing a sufficient mathematical background.\n\nWeaknesses:\nThe paper is notably dense and challenging to follow. It required multiple readings to fully grasp the significance of network stability and the mechanisms by which it is achieved in this context. The appendix is an essential read, as are several cited references. The presented use cases lack immediate intuitive clarity. Enhancements such as a graphical representation of the sparse combination network (as an alternative to the weight matrices in Figure 2), the inclusion of pseudocode for the algorithms (perhaps in the Appendix), and an illustrative example of an unstable network assembly diverging, would significantly improve readability. Furthermore, prior familiarity with AlphaGo and similar methodologies provided a greater insight into the practical implications than the paper itself conveyed.\nWhile the provided empirical results are appreciated, they are exclusively focused on sequential image prediction. 
The inclusion of at least one application in a different domain (e.g., Natural Language Processing, Reinforcement Learning, continuous control) would broaden the demonstrated applicability of the proposed framework.\n\nOverall Recommendation:\nI recommend this paper for acceptance. Despite its demanding nature and the need for extensive consultation with external literature, I believe this work aligns with the future trajectory of Deep Learning algorithms. The theoretical and practical outcomes presented here possess considerable potential. To further enhance the paper's impact, I suggest incorporating results from an additional domain and dedicating more effort to improving its overall readability. Valuable theoretical contributions are too often underutilized because they are difficult to understand or do not immediately resonate with the empiricists and engineers who could build upon them.",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper offers a substantial theoretical contribution to the construction of recurrent neural network assemblies.",
                      "The work addresses the gap in theoretical guarantees regarding the stability of combining learned modular components by developing a theoretical framework for provably-stable combinations of RNNs, leveraging weight constraints and feedback mechanisms.",
                      "The authors employ fixed RNNs designed under these constraints, with antisymmetric learnable parameters governing inter-module connections.",
                      "The authors demonstrate that their sparse combination network achieves state-of-the-art (SOTA) performance on sequential image classification benchmarks with substantially fewer learned parameters and theoretical stability guarantees.",
                      "The authors conduct a thorough investigation into various permutations of their modular sparse combination network framework, exploring the interplay between the number and size of RNNs using absolute value weight constraints.",
                      "They also examine an alternative SVD weight constraint network, noting its comparatively lower performance and slower training.",
                      "The proofs provided in the appendix are well-structured and readily comprehensible for readers possessing a sufficient mathematical background.",
                      "The provided empirical results are exclusively focused on sequential image prediction."
                  ],
                  "evaluation": [
                      "The efficacy of combining learned modular components, particularly in applied domains such as AlphaGo, is well-established for its power and tractability compared to developing bespoke models.",
                      "The empirical results are remarkably convincing for a paper that is primarily theoretical in nature.",
                      "The authors demonstrate that their approach surpasses SOTA algorithms on several common, albeit less complex, benchmarks in the field, even within (and potentially due to) the imposed constraints.",
                      "The theoretical contribution presented is highly significant.",
                      "While specialized RNN modules clearly hold considerable potential, RNNs are known for their instability and training challenges, making end-to-end learning of such models particularly difficult.",
                      "The ability to apply these constraint conditions while maintaining strong performance, particularly evident in the absolute value constraint case, suggests a promising path towards developing reusable modules that can be flexibly combined for specific tasks.",
                      "Although this paper does not address all intermediate questions, its stability analysis constitutes a pivotal step forward.",
                      "The paper is notably dense and challenging to follow.",
                      "The presented use cases lack immediate intuitive clarity."
                  ],
                  "constructiveInput": [
                      "Enhancements such as a graphical representation of the sparse combination network (as an alternative to the weight matrices in Figure 2), the inclusion of pseudocode for the algorithms (perhaps in the Appendix), and an illustrative example of an unstable network assembly diverging, would significantly improve readability.",
                      "The inclusion of at least one application in a different domain (e.g., Natural Language Processing, Reinforcement Learning, continuous control) would broaden the demonstrated applicability of the proposed framework.",
                      "To further enhance the paper's impact, I suggest incorporating results from an additional domain and dedicating more effort to improving its overall readability."
                  ],
                  "clarificationDialogue": [],
                  "metaCommentary": [
                      "It required multiple readings to fully grasp the significance of network stability and the mechanisms by which it is achieved in this context.",
                      "The appendix is an essential read, as are several cited references.",
                      "Prior familiarity with AlphaGo and similar methodologies provided a greater insight into the practical implications than the paper itself conveyed.",
                      "I recommend this paper for acceptance.",
                      "Despite its demanding nature and the need for extensive consultation with external literature, I believe this work aligns with the future trajectory of Deep Learning algorithms.",
                      "The theoretical and practical outcomes presented here possess considerable potential.",
                      "Valuable theoretical contributions are too often underutilized because they are difficult to understand or do not immediately resonate with the empiricists and engineers who could build upon them."
                  ]
              }
          },
          {
              "review_id": "rewrite_2_gemini-2.5-flash_b176ddbf",
              "og_review_id": 2,
              "author": "gemini-2.5-flash",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "This review assesses the paper titled \"[Paper Title, if known, otherwise omit]\".\n\nThe authors investigated the contraction properties of continuous-time recurrent neural networks. Additionally, they demonstrated that a network of provably stable RNNs (referred to as a \"net of nets\") can achieve competitive performance on various benchmarks, including sequential CIFAR10, even when only inter-module connections are trained.\n\n**Strengths**\n\nThe exploration of methods for assembling networks of RNNs presents an interesting research problem. The theorems concerning contraction properties are particularly valuable for researchers working on provably stable RNNs.\n\n**Weaknesses**\n\n1.  Section 3.2.1 could benefit from a more concise presentation. The detailed reporting of results from numerous individual AxB networks might be overly extensive.\n2.  While Table 1 presents a performance comparison with other network architectures, it would be highly informative to include the performance of networks under identical conditions, with the sole exception that the RNNs are not provably stable.\n3.  It would also be beneficial to understand the implications of training all connection weights, rather than exclusively the connections between modules. Specifically, does performance decrease despite the increased number of parameters?\n4.  The discussion of provable stability appears somewhat distinct from the training of modular networks. A clearer exposition on their interrelationship would be beneficial. In particular, is the stability of individual RNN modules critically important for sparsely connected modular networks?\n\n**Overall Assessment**\n\nThis paper presents an interesting approach to RNNs, focusing on provable stability and the \"net-of-nets\" architecture, which represents a less common direction in the field. 
While some sections presenting results are challenging to parse, the overall clarity of the paper is satisfactory.\n\nIt is important to note that I am unable to evaluate the correctness of the mathematical derivations.",
              "clean_review": "The authors investigated the contraction properties of continuous-time recurrent neural networks. Additionally, they demonstrated that a network of provably stable RNNs (referred to as a \"net of nets\") can achieve competitive performance on various benchmarks, including sequential CIFAR10, even when only inter-module connections are trained.\n\nStrengths\n\nThe exploration of methods for assembling networks of RNNs presents an interesting research problem. The theorems concerning contraction properties are particularly valuable for researchers working on provably stable RNNs.\n\nWeaknesses\n\n1. Section 3.2.1 could benefit from a more concise presentation. The detailed reporting of results from numerous individual AxB networks might be overly extensive.\n2. While Table 1 presents a performance comparison with other network architectures, it would be highly informative to include the performance of networks under identical conditions, with the sole exception that the RNNs are not provably stable.\n3. It would also be beneficial to understand the implications of training all connection weights, rather than exclusively the connections between modules. Specifically, does performance decrease despite the increased number of parameters?\n4. The discussion of provable stability appears somewhat distinct from the training of modular networks. A clearer exposition on their interrelationship would be beneficial. In particular, is the stability of individual RNN modules critically important for sparsely connected modular networks?\n\nOverall Assessment\n\nThis paper presents an interesting approach to RNNs, focusing on provable stability and the \"net-of-nets\" architecture, which represents a less common direction in the field. 
While some sections presenting results are challenging to parse, the overall clarity of the paper is satisfactory.\n\nIt is important to note that I am unable to evaluate the correctness of the mathematical derivations.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The authors investigated the contraction properties of continuous-time recurrent neural networks.",
                      "They demonstrated that a network of provably stable RNNs (referred to as a \"net of nets\") can achieve competitive performance on various benchmarks, including sequential CIFAR10, even when only inter-module connections are trained."
                  ],
                  "evaluation": [
                      "The exploration of methods for assembling networks of RNNs presents an interesting research problem.",
                      "The theorems concerning contraction properties are particularly valuable for researchers working on provably stable RNNs.",
                      "Section 3.2.1 could benefit from a more concise presentation.",
                      "The detailed reporting of results from numerous individual AxB networks might be overly extensive.",
                      "While some sections presenting results are challenging to parse, the overall clarity of the paper is satisfactory."
                  ],
                  "constructiveInput": [
                      "Section 3.2.1 could benefit from a more concise presentation.",
                      "It would be highly informative to include in Table 1 the performance of networks under identical conditions, with the sole exception that the RNNs are not provably stable.",
                      "It would also be beneficial to understand the implications of training all connection weights, rather than exclusively the connections between modules.",
                      "A clearer exposition on the interrelationship between provable stability and the training of modular networks would be beneficial."
                  ],
                  "clarificationDialogue": [
                      "Specifically, does performance decrease despite the increased number of parameters when training all connection weights?",
                      "Is the stability of individual RNN modules critically important for sparsely connected modular networks?"
                  ],
                  "metaCommentary": [
                      "This paper presents an interesting approach to RNNs, focusing on provable stability and the \"net-of-nets\" architecture, which represents a less common direction in the field.",
                      "It is important to note that I am unable to evaluate the correctness of the mathematical derivations."
                  ]
              }
          },
          {
              "review_id": "rewrite_3_gemini-2.5-flash_49e87143",
              "og_review_id": 3,
              "author": "gemini-2.5-flash",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "This paper investigates stable architectures for Recurrent Neural Networks (RNNs). From a theoretical perspective, the authors establish a series of conditions under which an RNN's weight matrix exhibits contractive properties. On the modeling front, the paper introduces RNN architectures designed to possess contractive weight matrices. The efficacy of the proposed methods is evaluated using benchmark datasets, specifically sequential MNIST, permuted MNIST, and sequential CIFAR-10.\n\nThe theoretical findings appear interesting, though perhaps not entirely surprising. However, a perceived disconnect exists between the theoretical results and the proposed model, as not all theorems appear directly relevant to the architectural design.\n\nRegarding the proposed model (Section 3.1), the reviewer acknowledges the possibility of overlooking details but found the model's presentation challenging to fully comprehend. Clarification in its exposition would significantly benefit readers. For instance, the term \"subnetworks\" mentioned at the outset of Section 3.1 lacks clear definition or explanation. The method by which these subnetworks are integrated remains ambiguous. It can only be inferred from the \"recursive construction\" in the title and Figure 2 that the resulting weight matrix is structured as a block matrix.\n\nFurthermore, the approach of parametrizing orthogonal weight matrices through the exponentiation of skew-symmetric matrices is not novel, having been previously explored in expRNN [1].\n\nThe introductory section's writing could also be enhanced. Instead of discussing topics such as AlphaGo and evolutionary modules, readers might benefit from a more comprehensive literature review pertaining to RNN trainability and long-term dependency challenges.\n\n[1] Lezcano-Casado, Mario, and David Martínez-Rubio. 
\"Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group.\" International Conference on Machine Learning. PMLR, 2019.\n\nWhile the theoretical results presented in the paper hold interest, the clarity of the proposed model's presentation is deficient, and the model itself does not appear to introduce substantial novelty. Overall, the paper requires notable improvements to meet the acceptance threshold.",
              "clean_review": "This paper investigates stable architectures for Recurrent Neural Networks (RNNs). From a theoretical perspective, the authors establish a series of conditions under which an RNN's weight matrix exhibits contractive properties. On the modeling front, the paper introduces RNN architectures designed to possess contractive weight matrices. The efficacy of the proposed methods is evaluated using benchmark datasets, specifically sequential MNIST, permuted MNIST, and sequential CIFAR-10.\n\nThe theoretical findings appear interesting, though perhaps not entirely surprising. However, a perceived disconnect exists between the theoretical results and the proposed model, as not all theorems appear directly relevant to the architectural design.\n\nRegarding the proposed model (Section 3.1), the reviewer acknowledges the possibility of overlooking details but found the model's presentation challenging to fully comprehend. Clarification in its exposition would significantly benefit readers. For instance, the term \"subnetworks\" mentioned at the outset of Section 3.1 lacks clear definition or explanation. The method by which these subnetworks are integrated remains ambiguous. It can only be inferred from the \"recursive construction\" in the title and Figure 2 that the resulting weight matrix is structured as a block matrix.\n\nFurthermore, the approach of parametrizing orthogonal weight matrices through the exponentiation of skew-symmetric matrices is not novel, having been previously explored in expRNN.\n\nThe introductory section's writing could also be enhanced. 
Instead of discussing topics such as AlphaGo and evolutionary modules, readers might benefit from a more comprehensive literature review pertaining to RNN trainability and long-term dependency challenges.\n\nWhile the theoretical results presented in the paper hold interest, the clarity of the proposed model's presentation is deficient, and the model itself does not appear to introduce substantial novelty. Overall, the paper requires notable improvements to meet the acceptance threshold.",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper investigates stable architectures for Recurrent Neural Networks (RNNs).",
                      "From a theoretical perspective, the authors establish a series of conditions under which an RNN's weight matrix exhibits contractive properties.",
                      "On the modeling front, the paper introduces RNN architectures designed to possess contractive weight matrices.",
                      "The efficacy of the proposed methods is evaluated using benchmark datasets, specifically sequential MNIST, permuted MNIST, and sequential CIFAR-10.",
                      "The approach of parametrizing orthogonal weight matrices through the exponentiation of skew-symmetric matrices has been previously explored in expRNN."
                  ],
                  "evaluation": [
                      "The theoretical findings appear interesting, though perhaps not entirely surprising.",
                      "A perceived disconnect exists between the theoretical results and the proposed model, as not all theorems appear directly relevant to the architectural design.",
                      "The model's presentation was challenging to fully comprehend.",
                      "The approach of parametrizing orthogonal weight matrices through the exponentiation of skew-symmetric matrices is not novel.",
                      "The clarity of the proposed model's presentation is deficient.",
                      "The model itself does not appear to introduce substantial novelty."
                  ],
                  "constructiveInput": [
                      "Clarification in the model's exposition would significantly benefit readers.",
                      "The introductory section's writing could be enhanced.",
                      "Instead of discussing topics such as AlphaGo and evolutionary modules, readers might benefit from a more comprehensive literature review pertaining to RNN trainability and long-term dependency challenges."
                  ],
                  "clarificationDialogue": [
                      "The reviewer acknowledges the possibility of overlooking details but found the model's presentation challenging to fully comprehend.",
                      "For instance, the term \"subnetworks\" mentioned at the outset of Section 3.1 lacks clear definition or explanation.",
                      "The method by which these subnetworks are integrated remains ambiguous.",
                      "It can only be inferred from the \"recursive construction\" in the title and Figure 2 that the resulting weight matrix is structured as a block matrix."
                  ],
                  "metaCommentary": [
                      "Overall, the paper requires notable improvements to meet the acceptance threshold."
                  ]
              }
          },
          {
              "review_id": "ai_gemini-2.5-pro_Accept_a89802b1",
              "author": "gemini-2.5-pro",
              "class": "ai",
              "prompt": "Accept",
              "raw_review": "# Summary of paper\nThis paper introduces a novel framework for constructing provably stable recurrent neural networks (RNNs) by composing smaller, independently contracting subnetworks. Drawing from contraction analysis in nonlinear control theory, the authors first derive five new, simple conditions on the weight matrices of continuous-time RNNs that guarantee global contraction. The main theoretical contribution is a \"network of networks\" theorem that provides a parameterization for the connections between these contracting subnetworks, ensuring the stability of the overall composite system is preserved. The authors experimentally validate this framework by creating two architectures, \"Sparse Combo Net\" and \"SVD Combo Net,\" which they test on sequential image classification benchmarks. The proposed models achieve state-of-the-art results among stability-guaranteed RNNs on permuted sequential MNIST and sequential CIFAR10, demonstrating that strong stability constraints can be imposed without sacrificing, and in some cases even enhancing, performance.\n\n# Strengths\n- **Strong Theoretical Foundation and Novelty**: The paper is built on a solid theoretical foundation of contraction analysis, providing formal stability guarantees. The derivation of five novel contraction conditions for continuous-time RNNs (Theorems 1-5) and the key composition theorem (Theorem 6) are significant and novel contributions. This work provides a principled and elegant solution to the critical problem of ensuring stability in complex, modular systems.\n\n- **Practical and Performant Implementation**: The theoretical results are not merely abstract; they are translated into concrete, trainable network architectures. 
The experimental results are impressive, showing that the proposed \"Sparse Combo Net\" achieves state-of-the-art performance among provably stable RNNs on challenging long-timescale tasks like permuted sequential MNIST and sequential CIFAR10. The fact that this is achieved with fewer parameters than many competitors is a strong testament to the method's efficacy.\n\n- **Thorough and Insightful Experiments**: The experimental section is comprehensive. The authors not only demonstrate the performance of their models but also provide valuable ablation studies and analyses. The investigations into the effects of sparsity and network size/modularity offer compelling insights, particularly the finding that increased sparsity can improve performance. Furthermore, the control study that removes the stabilizing constraint on inter-network connections effectively demonstrates the practical importance of the paper's core theoretical result.\n\n- **Clarity and Contribution to the Field**: The paper is well-written and clearly articulates its contributions. It also provides a valuable service to the community by clarifying common misconceptions regarding stability conditions in RNNs (Section 2.3), distinguishing between autonomous and non-autonomous stability, and formally discussing the limitations of Jacobian eigenvalue analysis.\n\n# Weaknesses\n- **Fixed Subnetworks in the Best-Performing Model**: The highest-performing model, the Sparse Combo Net, relies on randomly initialized, fixed subnetworks, with only the inter-network connections being trained. While this is an interesting and valid design choice that simplifies training, it raises questions about how much the performance depends on a fortunate initialization. 
A discussion on the variance of performance with different random seeds for the fixed weights, and perhaps a brief exploration into fine-tuning these weights post-hoc, could strengthen the results.\n\n- **Limited Scope of Tasks**: The experiments are confined to sequential image classification tasks. While these are standard benchmarks for measuring long-range dependency handling, the stability guarantees offered by this framework could be particularly impactful in other domains like control, robotics, or reinforcement learning. Demonstrating the approach on a simple control task, even in the appendix, would broaden the paper's demonstrated impact.\n\n- **Parameterization of Sparse Combo Net**: Unlike the SVD Combo Net, the Sparse Combo Net lacks a direct parameterization to enforce its stability condition (Theorem 1) during training. The authors rely on a generate-and-test procedure for initialization. While this works well experimentally, the lack of a differentiable parameterization that could allow for end-to-end training of the subnetwork weights is a limitation of the current approach for this specific model.\n\n# Questions for authors\n- **On the Role of Sparsity**: The strong performance of the Sparse Combo Net is fascinating. The paper suggests that sparsity allows for higher magnitude weights while maintaining stability. Could you elaborate on this? Does this imply that the dynamics of these sparse subnetworks are in some way richer or better suited for these tasks, or is the benefit primarily in enabling more expressive (i.e., stronger) trained connections between modules?\n\n- **Continuous-Time vs. Discrete-Time Implementation**: The theoretical results are for continuous-time RNNs, but the implementation presumably uses a discrete-time numerical solver. 
Could you briefly comment on the choice of solver and step size, and whether the stability guarantees formally carry over to the discretized implementation?\n\n- **Sensitivity to Subnetwork Initialization**: For the Sparse Combo Net, how sensitive is the final performance to the random initialization of the fixed subnetwork weights? The repeatability experiments show low variance for a given set of hyperparameters, but what if the subnetworks themselves were re-initialized? This would help clarify whether the success is due to the architecture itself or a potentially lucky initialization scheme.\n\n# Final Judgement\n- **Rationale of recommendation**: This is an excellent paper that makes significant contributions on both theoretical and empirical fronts. It addresses the fundamentally important problem of stability in complex neural networks with a principled and elegant approach based on contraction theory. The theoretical results are novel, and the authors successfully translate them into practical models that achieve state-of-the-art performance on relevant benchmarks. The experimental analysis is thorough and provides convincing evidence for the method's effectiveness. The weaknesses identified are minor and represent opportunities for future work rather than fundamental flaws. The paper is well-written, impactful, and will be of great interest to the community.\n- **Recommendation**: Accept",
              "clean_review": "Summary of paper\nThis paper introduces a novel framework for constructing provably stable recurrent neural networks (RNNs) by composing smaller, independently contracting subnetworks. Drawing from contraction analysis in nonlinear control theory, the authors first derive five new, simple conditions on the weight matrices of continuous-time RNNs that guarantee global contraction. The main theoretical contribution is a \"network of networks\" theorem that provides a parameterization for the connections between these contracting subnetworks, ensuring the stability of the overall composite system is preserved. The authors experimentally validate this framework by creating two architectures, \"Sparse Combo Net\" and \"SVD Combo Net,\" which they test on sequential image classification benchmarks. The proposed models achieve state-of-the-art results among stability-guaranteed RNNs on permuted sequential MNIST and sequential CIFAR10, demonstrating that strong stability constraints can be imposed without sacrificing, and in some cases even enhancing, performance.\n\nStrengths\nStrong Theoretical Foundation and Novelty: The paper is built on a solid theoretical foundation of contraction analysis, providing formal stability guarantees. The derivation of five novel contraction conditions for continuous-time RNNs (Theorems 1-5) and the key composition theorem (Theorem 6) are significant and novel contributions. This work provides a principled and elegant solution to the critical problem of ensuring stability in complex, modular systems.\nPractical and Performant Implementation: The theoretical results are not merely abstract; they are translated into concrete, trainable network architectures. The experimental results are impressive, showing that the proposed \"Sparse Combo Net\" achieves state-of-the-art performance among provably stable RNNs on challenging long-timescale tasks like permuted sequential MNIST and sequential CIFAR10. 
The fact that this is achieved with fewer parameters than many competitors is a strong testament to the method's efficacy.\nThorough and Insightful Experiments: The experimental section is comprehensive. The authors not only demonstrate the performance of their models but also provide valuable ablation studies and analyses. The investigations into the effects of sparsity and network size/modularity offer compelling insights, particularly the finding that increased sparsity can improve performance. Furthermore, the control study that removes the stabilizing constraint on inter-network connections effectively demonstrates the practical importance of the paper's core theoretical result.\nClarity and Contribution to the Field: The paper is well-written and clearly articulates its contributions. It also provides a valuable service to the community by clarifying common misconceptions regarding stability conditions in RNNs (Section 2.3), distinguishing between autonomous and non-autonomous stability, and formally discussing the limitations of Jacobian eigenvalue analysis.\n\nWeaknesses\nFixed Subnetworks in the Best-Performing Model: The highest-performing model, the Sparse Combo Net, relies on randomly initialized, fixed subnetworks, with only the inter-network connections being trained. While this is an interesting and valid design choice that simplifies training, it raises questions about how much the performance depends on a fortunate initialization. A discussion on the variance of performance with different random seeds for the fixed weights, and perhaps a brief exploration into fine-tuning these weights post-hoc, could strengthen the results.\nLimited Scope of Tasks: The experiments are confined to sequential image classification tasks. While these are standard benchmarks for measuring long-range dependency handling, the stability guarantees offered by this framework could be particularly impactful in other domains like control, robotics, or reinforcement learning. 
Demonstrating the approach on a simple control task, even in the appendix, would broaden the paper's demonstrated impact.\nParameterization of Sparse Combo Net: Unlike the SVD Combo Net, the Sparse Combo Net lacks a direct parameterization to enforce its stability condition (Theorem 1) during training. The authors rely on a generate-and-test procedure for initialization. While this works well experimentally, the lack of a differentiable parameterization that could allow for end-to-end training of the subnetwork weights is a limitation of the current approach for this specific model.\n\nQuestions for authors\nOn the Role of Sparsity: The strong performance of the Sparse Combo Net is fascinating. The paper suggests that sparsity allows for higher magnitude weights while maintaining stability. Could you elaborate on this? Does this imply that the dynamics of these sparse subnetworks are in some way richer or better suited for these tasks, or is the benefit primarily in enabling more expressive (i.e., stronger) trained connections between modules?\nContinuous-Time vs. Discrete-Time Implementation: The theoretical results are for continuous-time RNNs, but the implementation presumably uses a discrete-time numerical solver. Could you briefly comment on the choice of solver and step size, and whether the stability guarantees formally carry over to the discretized implementation?\nSensitivity to Subnetwork Initialization: For the Sparse Combo Net, how sensitive is the final performance to the random initialization of the fixed subnetwork weights? The repeatability experiments show low variance for a given set of hyperparameters, but what if the subnetworks themselves were re-initialized? This would help clarify whether the success is due to the architecture itself or a potentially lucky initialization scheme.\n\nFinal Judgement\nRationale of recommendation: This is an excellent paper that makes significant contributions on both theoretical and empirical fronts. 
It addresses the fundamentally important problem of stability in complex neural networks with a principled and elegant approach based on contraction theory. The theoretical results are novel, and the authors successfully translate them into practical models that achieve state-of-the-art performance on relevant benchmarks. The experimental analysis is thorough and provides convincing evidence for the method's effectiveness. The weaknesses identified are minor and represent opportunities for future work rather than fundamental flaws. The paper is well-written, impactful, and will be of great interest to the community.\nRecommendation: Accept",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper introduces a novel framework for constructing provably stable recurrent neural networks (RNNs) by composing smaller, independently contracting subnetworks.",
                      "Drawing from contraction analysis in nonlinear control theory, the authors first derive five new, simple conditions on the weight matrices of continuous-time RNNs that guarantee global contraction.",
                      "The main theoretical contribution is a \"network of networks\" theorem that provides a parameterization for the connections between these contracting subnetworks, ensuring the stability of the overall composite system is preserved.",
                      "The authors experimentally validate this framework by creating two architectures, \"Sparse Combo Net\" and \"SVD Combo Net,\" which they test on sequential image classification benchmarks.",
                      "The proposed models achieve state-of-the-art results among stability-guaranteed RNNs on permuted sequential MNIST and sequential CIFAR10.",
                      "The highest-performing model, the Sparse Combo Net, relies on randomly initialized, fixed subnetworks, with only the inter-network connections being trained.",
                      "The experiments are confined to sequential image classification tasks.",
                      "Unlike the SVD Combo Net, the Sparse Combo Net lacks a direct parameterization to enforce its stability condition (Theorem 1) during training, relying instead on a generate-and-test procedure for initialization.",
                      "The theoretical results are for continuous-time RNNs, but the implementation presumably uses a discrete-time numerical solver."
                  ],
                  "evaluation": [
                      "The paper demonstrates that strong stability constraints can be imposed without sacrificing, and in some cases even enhancing, performance.",
                      "The paper is built on a solid theoretical foundation of contraction analysis, providing formal stability guarantees.",
                      "The derivation of five novel contraction conditions for continuous-time RNNs (Theorems 1-5) and the key composition theorem (Theorem 6) are significant and novel contributions.",
                      "This work provides a principled and elegant solution to the critical problem of ensuring stability in complex, modular systems.",
                      "The theoretical results are not merely abstract; they are translated into concrete, trainable network architectures.",
                      "The experimental results are impressive, showing that the proposed \"Sparse Combo Net\" achieves state-of-the-art performance among provably stable RNNs on challenging long-timescale tasks like permuted sequential MNIST and sequential CIFAR10.",
                      "The fact that this is achieved with fewer parameters than many competitors is a strong testament to the method's efficacy.",
                      "The experimental section is comprehensive.",
                      "The authors not only demonstrate the performance of their models but also provide valuable ablation studies and analyses.",
                      "The investigations into the effects of sparsity and network size/modularity offer compelling insights, particularly the finding that increased sparsity can improve performance.",
                      "The control study that removes the stabilizing constraint on inter-network connections effectively demonstrates the practical importance of the paper's core theoretical result.",
                      "The paper is well-written and clearly articulates its contributions.",
                      "It also provides a valuable service to the community by clarifying common misconceptions regarding stability conditions in RNNs (Section 2.3), distinguishing between autonomous and non-autonomous stability, and formally discussing the limitations of Jacobian eigenvalue analysis.",
                      "While the fixed subnetworks in the Sparse Combo Net are an interesting and valid design choice that simplifies training, it raises questions about how much the performance depends on a fortunate initialization.",
                      "While sequential image classification tasks are standard benchmarks for measuring long-range dependency handling, the stability guarantees offered by this framework could be particularly impactful in other domains like control, robotics, or reinforcement learning.",
                      "While the generate-and-test procedure for initialization works well experimentally, the lack of a differentiable parameterization that could allow for end-to-end training of the subnetwork weights is a limitation of the current approach for this specific model."
                  ],
                  "constructiveInput": [
                      "A discussion on the variance of performance with different random seeds for the fixed weights, and perhaps a brief exploration into fine-tuning these weights post-hoc, could strengthen the results.",
                      "Demonstrating the approach on a simple control task, even in the appendix, would broaden the paper's demonstrated impact."
                  ],
                  "clarificationDialogue": [
                      "Could you elaborate on the suggestion that sparsity allows for higher magnitude weights while maintaining stability? Does this imply that the dynamics of these sparse subnetworks are in some way richer or better suited for these tasks, or is the benefit primarily in enabling more expressive (i.e., stronger) trained connections between modules?",
                      "Could you briefly comment on the choice of solver and step size for the discrete-time implementation, and whether the stability guarantees formally carry over to the discretized implementation?",
                      "For the Sparse Combo Net, how sensitive is the final performance to the random initialization of the fixed subnetwork weights? The repeatability experiments show low variance for a given set of hyperparameters, but what if the subnetworks themselves were re-initialized? This would help clarify whether the success is due to the architecture itself or a potentially lucky initialization scheme."
                  ],
                  "metaCommentary": [
                      "This is an excellent paper that makes significant contributions on both theoretical and empirical fronts.",
                      "It addresses the fundamentally important problem of stability in complex neural networks with a principled and elegant approach based on contraction theory.",
                      "The theoretical results are novel, and the authors successfully translate them into practical models that achieve state-of-the-art performance on relevant benchmarks.",
                      "The experimental analysis is thorough and provides convincing evidence for the method's effectiveness.",
                      "The weaknesses identified are minor and represent opportunities for future work rather than fundamental flaws.",
                      "The paper is well-written, impactful, and will be of great interest to the community.",
                      "Recommendation: Accept"
                  ]
              }
          },
          {
              "review_id": "ai_gemini-2.5-pro_Borderline_7fed9f9e",
              "author": "gemini-2.5-pro",
              "class": "ai",
              "prompt": "Borderline",
              "raw_review": "# Summary of paper\nThis paper proposes a method for constructing provably stable recurrent neural networks (RNNs) by combining smaller, independently stable RNN modules. The authors leverage contraction analysis from nonlinear control theory to first derive several novel sufficient conditions for a continuous-time RNN to be contracting (Theorems 1-5). Their key theoretical contribution is a modularity theorem (Theorem 6) which provides a specific parameterization for the connections between contracting sub-networks that guarantees the stability of the entire \"network of networks\". Based on this framework, they propose two architectures, \"Sparse Combo Net\" and \"SVD Combo Net\", and demonstrate their effectiveness on sequential image classification benchmarks. The proposed Sparse Combo Net achieves state-of-the-art results among provably stable RNNs on permuted sequential MNIST and sequential CIFAR10, while using a relatively small number of parameters.\n\n# Strengths\n- **Strong Theoretical Foundation and Novelty**: The paper is built on a solid foundation of contraction analysis. The derivation of five new stability conditions for continuous-time RNNs (Theorems 1-5) is a valuable theoretical contribution. The main result, Theorem 6, which provides a constructive and stability-preserving method for combining modules, is both elegant and practical. This provides a principled way to build complex, reliable recurrent systems.\n\n- **Impressive Empirical Performance**: The experimental results are strong and compelling. The proposed \"Sparse Combo Net\" achieves state-of-the-art performance among models with stability guarantees on challenging benchmarks like permuted sequential MNIST and sequential CIFAR10. 
That it does so with significantly fewer parameters than many competing models underscores the effectiveness and efficiency of the proposed approach.\n\n- **Important Connection Between Theory and Practice**: The paper successfully bridges the gap between abstract stability theory and practical deep learning. The experiments, particularly the control study showing performance degradation when the stabilizing constraint is removed, clearly demonstrate that the theoretical stability guarantee is not just a certificate but a crucial component for achieving high performance. The finding that increased sparsity (and stronger stability) correlates with better performance is an interesting and non-obvious result.\n\n# Weaknesses\n- **Limited Scope of \"Sparse Combo Net\" Training**: The most successful model, the Sparse Combo Net, operates under a significant constraint: the internal weights of the sub-networks are randomly initialized and then frozen. Only the inter-module connections are trained. While this simplifies satisfying the stability condition and works surprisingly well, it raises questions about the model's scalability and general applicability. It feels more like a clever initialization scheme than a fully trainable model, which may limit its adoption for more complex tasks where learning internal representations is critical.\n\n- **Clarity on Practical Implementation and Parameterization**: While the theory is well-explained, the practical details of implementing the parameterizations could be clearer. For the Sparse Combo Net, the process of randomly generating matrices that satisfy Theorem 1 and then finding a suitable metric `M` is described but feels somewhat ad-hoc. For the SVD Combo Net, the parameterization in Eq. (3) involves a matrix inverse (`Φ⁻¹`) which can be unstable or slow during training; it's unclear how this was handled. 
More detailed pseudo-code or discussion of the practicalities of these parameterizations in the main text would improve reproducibility and understanding.\n\n- **Discussion of Continuous-Time vs. Discrete-Time**: The theoretical results are derived for continuous-time RNNs (neural ODEs), but they are tested on sequential tasks that are inherently discrete. The paper does not discuss the implications of this mismatch. How were the continuous-time dynamics discretized for implementation? Does the stability guarantee from the continuous domain perfectly translate to the discretized version used in practice, especially with finite step sizes? Addressing this would strengthen the connection between the theory and the experiments.\n\n# Questions for authors\n- **On the Role of Untrained Modules**: Could you elaborate on why you believe the Sparse Combo Net, with its fixed, randomly initialized sub-modules, performs so well? Does this suggest that for these tasks, the primary learning challenge is routing information between simpler, fixed processing units, rather than learning complex representations within each unit? Have you experimented with any form of pre-training or fine-tuning of the sub-modules?\n\n- **On the Generality of the Network-of-Networks Theorem**: Theorem 6 provides a specific form for the interconnection matrix `L`. This form seems to enforce a type of anti-symmetry with respect to the metric `M`. Are there other, perhaps less restrictive, forms for `L` that would also preserve overall contraction? Is it possible that this specific structure limits the expressive power of the inter-module connections?\n\n- **On the Choice of Benchmarks**: The experiments focus on pixel-by-pixel image classification, which tests long-term memory but may not require the full dynamic range of a recurrent system. How do you expect these models would perform on tasks that require more complex temporal processing, such as language modeling or time-series forecasting? 
Do you foresee any challenges in applying this framework to such problems?\n\n# Final Judgement\n- **Rationale of recommendation**: This paper presents a very interesting and valuable contribution by bridging nonlinear control theory with modern deep learning. The theoretical results are novel, and the empirical validation is strong, achieving SOTA results for provably stable RNNs. However, the work has some limitations that place it on the borderline. The best-performing model relies on fixed, untrained sub-networks, which raises questions about its generalizability. Furthermore, the practical details of the model parameterizations and the theoretical gap between the continuous-time model and its discrete-time implementation could be clarified. Despite these points, the core idea is compelling and the results are significant enough to warrant serious consideration. With revisions to address the clarity issues and better contextualize the limitations, this could be a strong paper.\n- **Recommendation**: Borderline",
              "clean_review": "Summary of paper\nThis paper proposes a method for constructing provably stable recurrent neural networks (RNNs) by combining smaller, independently stable RNN modules. The authors leverage contraction analysis from nonlinear control theory to first derive several novel sufficient conditions for a continuous-time RNN to be contracting (Theorems 1-5). Their key theoretical contribution is a modularity theorem (Theorem 6) which provides a specific parameterization for the connections between contracting sub-networks that guarantees the stability of the entire \"network of networks\". Based on this framework, they propose two architectures, \"Sparse Combo Net\" and \"SVD Combo Net\", and demonstrate their effectiveness on sequential image classification benchmarks. The proposed Sparse Combo Net achieves state-of-the-art results among provably stable RNNs on permuted sequential MNIST and sequential CIFAR10, while using a relatively small number of parameters.\n\nStrengths\nStrong Theoretical Foundation and Novelty: The paper is built on a solid foundation of contraction analysis. The derivation of five new stability conditions for continuous-time RNNs (Theorems 1-5) is a valuable theoretical contribution. The main result, Theorem 6, which provides a constructive and stability-preserving method for combining modules, is both elegant and practical. This provides a principled way to build complex, reliable recurrent systems.\nImpressive Empirical Performance: The experimental results are strong and compelling. The proposed \"Sparse Combo Net\" achieves state-of-the-art performance among models with stability guarantees on challenging benchmarks like permuted sequential MNIST and sequential CIFAR10. 
That it does so with significantly fewer parameters than many competing models underscores the effectiveness and efficiency of the proposed approach.\nImportant Connection Between Theory and Practice: The paper successfully bridges the gap between abstract stability theory and practical deep learning. The experiments, particularly the control study showing performance degradation when the stabilizing constraint is removed, clearly demonstrate that the theoretical stability guarantee is not just a certificate but a crucial component for achieving high performance. The finding that increased sparsity (and stronger stability) correlates with better performance is an interesting and non-obvious result.\n\nWeaknesses\nLimited Scope of \"Sparse Combo Net\" Training: The most successful model, the Sparse Combo Net, operates under a significant constraint: the internal weights of the sub-networks are randomly initialized and then frozen. Only the inter-module connections are trained. While this simplifies satisfying the stability condition and works surprisingly well, it raises questions about the model's scalability and general applicability. It feels more like a clever initialization scheme than a fully trainable model, which may limit its adoption for more complex tasks where learning internal representations is critical.\nClarity on Practical Implementation and Parameterization: While the theory is well-explained, the practical details of implementing the parameterizations could be clearer. For the Sparse Combo Net, the process of randomly generating matrices that satisfy Theorem 1 and then finding a suitable metric M is described but feels somewhat ad-hoc. For the SVD Combo Net, the parameterization in Eq. (3) involves a matrix inverse (Φ⁻¹) which can be unstable or slow during training; it's unclear how this was handled. 
More detailed pseudo-code or discussion of the practicalities of these parameterizations in the main text would improve reproducibility and understanding.\nDiscussion of Continuous-Time vs. Discrete-Time: The theoretical results are derived for continuous-time RNNs (neural ODEs), but they are tested on sequential tasks that are inherently discrete. The paper does not discuss the implications of this mismatch. How were the continuous-time dynamics discretized for implementation? Does the stability guarantee from the continuous domain perfectly translate to the discretized version used in practice, especially with finite step sizes? Addressing this would strengthen the connection between the theory and the experiments.\n\nQuestions for authors\nOn the Role of Untrained Modules: Could you elaborate on why you believe the Sparse Combo Net, with its fixed, randomly initialized sub-modules, performs so well? Does this suggest that for these tasks, the primary learning challenge is routing information between simpler, fixed processing units, rather than learning complex representations within each unit? Have you experimented with any form of pre-training or fine-tuning of the sub-modules?\nOn the Generality of the Network-of-Networks Theorem: Theorem 6 provides a specific form for the interconnection matrix L. This form seems to enforce a type of anti-symmetry with respect to the metric M. Are there other, perhaps less restrictive, forms for L that would also preserve overall contraction? Is it possible that this specific structure limits the expressive power of the inter-module connections?\nOn the Choice of Benchmarks: The experiments focus on pixel-by-pixel image classification, which tests long-term memory but may not require the full dynamic range of a recurrent system. How do you expect these models would perform on tasks that require more complex temporal processing, such as language modeling or time-series forecasting? 
Do you foresee any challenges in applying this framework to such problems?\n\nFinal Judgement\nRationale of recommendation: This paper presents a very interesting and valuable contribution by bridging nonlinear control theory with modern deep learning. The theoretical results are novel, and the empirical validation is strong, achieving SOTA results for provably stable RNNs. However, the work has some limitations that place it on the borderline. The best-performing model relies on fixed, untrained sub-networks, which raises questions about its generalizability. Furthermore, the practical details of the model parameterizations and the theoretical gap between the continuous-time model and its discrete-time implementation could be clarified. Despite these points, the core idea is compelling and the results are significant enough to warrant serious consideration. With revisions to address the clarity issues and better contextualize the limitations, this could be a strong paper.\nRecommendation: Borderline",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper proposes a method for constructing provably stable recurrent neural networks (RNNs) by combining smaller, independently stable RNN modules.",
                      "The authors leverage contraction analysis from nonlinear control theory to first derive several novel sufficient conditions for a continuous-time RNN to be contracting (Theorems 1-5).",
                      "Their key theoretical contribution is a modularity theorem (Theorem 6) which provides a specific parameterization for the connections between contracting sub-networks that guarantees the stability of the entire \"network of networks\".",
                      "Based on this framework, they propose two architectures, \"Sparse Combo Net\" and \"SVD Combo Net\", and demonstrate their effectiveness on sequential image classification benchmarks.",
                      "The proposed Sparse Combo Net achieves state-of-the-art results among provably stable RNNs on permuted sequential MNIST and sequential CIFAR10, while using a relatively small number of parameters.",
                      "The most successful model, the Sparse Combo Net, operates under a significant constraint: the internal weights of the sub-networks are randomly initialized and then frozen, with only the inter-module connections being trained.",
                      "The theoretical results are derived for continuous-time RNNs (neural ODEs), but they are tested on sequential tasks that are inherently discrete."
                  ],
                  "evaluation": [
                      "The paper is built on a solid foundation of contraction analysis.",
                      "The derivation of five new stability conditions for continuous-time RNNs (Theorems 1-5) is a valuable theoretical contribution.",
                      "The main result, Theorem 6, which provides a constructive and stability-preserving method for combining modules, is both elegant and practical.",
                      "This provides a principled way to build complex, reliable recurrent systems.",
                      "The experimental results are strong and compelling.",
                      "The proposed \"Sparse Combo Net\" achieves state-of-the-art performance among models with stability guarantees on challenging benchmarks like permuted sequential MNIST and sequential CIFAR10.",
                      "That it does so with significantly fewer parameters than many competing models underscores the effectiveness and efficiency of the proposed approach.",
                      "The paper successfully bridges the gap between abstract stability theory and practical deep learning.",
                      "The experiments, particularly the control study showing performance degradation when the stabilizing constraint is removed, clearly demonstrate that the theoretical stability guarantee is not just a certificate but a crucial component for achieving high performance.",
                      "The finding that increased sparsity (and stronger stability) correlates with better performance is an interesting and non-obvious result.",
                      "The limited scope of \"Sparse Combo Net\" training raises questions about the model's scalability and general applicability.",
                      "It feels more like a clever initialization scheme than a fully trainable model, which may limit its adoption for more complex tasks where learning internal representations is critical.",
                      "While the theory is well-explained, the practical details of implementing the parameterizations could be clearer.",
                      "For the Sparse Combo Net, the process of randomly generating matrices that satisfy Theorem 1 and then finding a suitable metric M is described but feels somewhat ad-hoc.",
                      "For the SVD Combo Net, the parameterization in Eq. (3) involves a matrix inverse (Φ⁻¹) which can be unstable or slow during training; it's unclear how this was handled.",
                      "The paper does not discuss the implications of the mismatch between continuous-time theoretical results and discrete-time experimental testing."
                  ],
                  "constructiveInput": [
                      "More detailed pseudo-code or discussion of the practicalities of these parameterizations in the main text would improve reproducibility and understanding.",
                      "Addressing the implications of the mismatch between continuous-time theory and discrete-time implementation would strengthen the connection between the theory and the experiments.",
                      "Revisions to address the clarity issues and better contextualize the limitations would make this a strong paper."
                  ],
                  "clarificationDialogue": [
                      "Could you elaborate on why you believe the Sparse Combo Net, with its fixed, randomly initialized sub-modules, performs so well?",
                      "Does this suggest that for these tasks, the primary learning challenge is routing information between simpler, fixed processing units, rather than learning complex representations within each unit?",
                      "Have you experimented with any form of pre-training or fine-tuning of the sub-modules?",
                      "Are there other, perhaps less restrictive, forms for L that would also preserve overall contraction?",
                      "Is it possible that this specific structure limits the expressive power of the inter-module connections?",
                      "How do you expect these models would perform on tasks that require more complex temporal processing, such as language modeling or time-series forecasting?",
                      "Do you foresee any challenges in applying this framework to such problems?",
                      "How were the continuous-time dynamics discretized for implementation?",
                      "Does the stability guarantee from the continuous domain perfectly translate to the discretized version used in practice, especially with finite step sizes?"
                  ],
                  "metaCommentary": [
                      "This paper presents a very interesting and valuable contribution by bridging nonlinear control theory with modern deep learning.",
                      "The theoretical results are novel, and the empirical validation is strong, achieving SOTA results for provably stable RNNs.",
                      "However, the work has some limitations that place it on the borderline.",
                      "Despite these points, the core idea is compelling and the results are significant enough to warrant serious consideration.",
                      "Recommendation: Borderline"
                  ]
              }
          },
          {
              "review_id": "rewrite_0_gemini-2.5-pro_26adff39",
              "og_review_id": 0,
              "author": "gemini-2.5-pro",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "This submission introduces new theorems concerning the stability of a specific class of Recurrent Neural Networks (RNNs). By integrating these RNNs into hierarchical and feedback-based superstructures, the authors report state-of-the-art performance on several tasks.\n\n**Theoretical Contributions**\n\nTheorems 1-5 represent a notable contribution to the understanding of stability conditions for RNNs. The authors also present a valuable counterexample to the commonly held belief that linear contraction is a sufficient condition for non-linear stability. These theoretical findings are then leveraged, in part, to construct provably stable combinations of RNNs.\n\n**Experimental Evaluation**\n\nWhile the experimental section presents interesting results, it currently lacks a clear, unifying takeaway. Furthermore, certain methodological choices in the reporting of results raise concerns.\n\n*   The authors state that for the Permuted-MNIST results in Table 1, the experiment was run four times, with outcomes ranging from 96.65 to 96.94. However, only the highest value (96.94) is reported in the table. Selectively reporting the best-case result from multiple trials is a departure from standard scientific practice and can be misleading. It is strongly recommended that the authors report the mean and standard deviation (or variance) to provide a more accurate and robust representation of the model's performance.\n\n*   Some claims appear to be supported by limited evidence. For instance, the conclusion that increased modularity is beneficial up to a certain point is substantiated only by a single comparison (44x8 versus 22x16 in Section 3.2.1). 
To establish a convincing trend, it would be beneficial to include additional data points, such as the suggested 50x7 and 39x9 configurations.\n\n*   In general, the inclusion of more data points and error bars across all experiments would significantly strengthen the authors' conclusions and increase reader confidence that the observed effects are genuine and not statistical artifacts.\n\n*   There appears to be a contradiction in the reported trends for Sections 3.1 and 3.2. In Section 3.1, performance increases monotonically with network size, while it follows an inverted U-shape in Section 3.2. Conversely, modularity is shown to have a monotonic benefit in Section 3.2 but an inverted U-shaped relationship in Section 3.1. The manuscript would benefit from a discussion clarifying the conclusions that should be drawn from these seemingly opposing results.\n\n*   The overall structure of the experimental section could be improved for readability. A summary of the main conclusions, along with clearly demarcated sections corresponding to each experimental claim, would help guide the reader through the results.\n\n**Clarity and Minor Points**\n\nThe clarity of the paper is inconsistent. While sections such as the discussion on stability and contraction are well-written, other parts are confusing.\n\n*   On page 3, the symbol 'g' is used to denote two different concepts within the same paragraph, which could lead to confusion.\n*   The phrasing of Theorem 7 could be revised to make it explicit that the first inequality is a precondition rather than a conclusion of the theorem. 
While this becomes clear upon reflection, an initial reading can be confusing.\n*   For denoting multiplication in model configurations (e.g., \"22x16\"), using the standard mathematical symbol `$\\times$` is recommended for professionalism and clarity.\n*   The rationale for having a subsection 3.2.1 without a corresponding 3.2.2 is unclear; it could be integrated directly into Section 3.2.\n*   A thorough proofreading of the entire manuscript is recommended to enhance overall readability and address minor typographical or grammatical errors.\n\n**Summary**\n\nThis paper makes an important contribution to the understanding and design of stable RNNs, with novel and noteworthy theoretical results. However, the experimental section is undermined by a lack of clear conclusions and the use of questionable reporting practices, such as reporting only the best result from multiple trials. Addressing these issues would substantially strengthen the submission.",
              "clean_review": "This submission introduces new theorems concerning the stability of a specific class of Recurrent Neural Networks (RNNs). By integrating these RNNs into hierarchical and feedback-based superstructures, the authors report state-of-the-art performance on several tasks.\n\nTheoretical Contributions\n\nTheorems 1-5 represent a notable contribution to the understanding of stability conditions for RNNs. The authors also present a valuable counterexample to the commonly held belief that linear contraction is a sufficient condition for non-linear stability. These theoretical findings are then leveraged, in part, to construct provably stable combinations of RNNs.\n\nExperimental Evaluation\n\nWhile the experimental section presents interesting results, it currently lacks a clear, unifying takeaway. Furthermore, certain methodological choices in the reporting of results raise concerns.\n\nThe authors state that for the Permuted-MNIST results in Table 1, the experiment was run four times, with outcomes ranging from 96.65 to 96.94. However, only the highest value (96.94) is reported in the table. Selectively reporting the best-case result from multiple trials is a departure from standard scientific practice and can be misleading. It is strongly recommended that the authors report the mean and standard deviation (or variance) to provide a more accurate and robust representation of the model's performance.\n\nSome claims appear to be supported by limited evidence. For instance, the conclusion that increased modularity is beneficial up to a certain point is substantiated only by a single comparison (44x8 versus 22x16 in Section 3.2.1). 
To establish a convincing trend, it would be beneficial to include additional data points, such as the suggested 50x7 and 39x9 configurations.\n\nIn general, the inclusion of more data points and error bars across all experiments would significantly strengthen the authors' conclusions and increase reader confidence that the observed effects are genuine and not statistical artifacts.\n\nThere appears to be a contradiction in the reported trends for Sections 3.1 and 3.2. In Section 3.1, performance increases monotonically with network size, while it follows an inverted U-shape in Section 3.2. Conversely, modularity is shown to have a monotonic benefit in Section 3.2 but an inverted U-shaped relationship in Section 3.1. The manuscript would benefit from a discussion clarifying the conclusions that should be drawn from these seemingly opposing results.\n\nThe overall structure of the experimental section could be improved for readability. A summary of the main conclusions, along with clearly demarcated sections corresponding to each experimental claim, would help guide the reader through the results.\n\nClarity and Minor Points\n\nThe clarity of the paper is inconsistent. While sections such as the discussion on stability and contraction are well-written, other parts are confusing.\n\nOn page 3, the symbol 'g' is used to denote two different concepts within the same paragraph, which could lead to confusion.\nThe phrasing of Theorem 7 could be revised to make it explicit that the first inequality is a precondition rather than a conclusion of the theorem. 
While this becomes clear upon reflection, an initial reading can be confusing.\nFor denoting multiplication in model configurations (e.g., \"22x16\"), using the standard mathematical symbol '×' is recommended for professionalism and clarity.\nThe rationale for having a subsection 3.2.1 without a corresponding 3.2.2 is unclear; it could be integrated directly into Section 3.2.\nA thorough proofreading of the entire manuscript is recommended to enhance overall readability and address minor typographical or grammatical errors.\n\nSummary\n\nThis paper makes an important contribution to the understanding and design of stable RNNs, with novel and noteworthy theoretical results. However, the experimental section is undermined by a lack of clear conclusions and the use of questionable reporting practices, such as reporting only the best result from multiple trials. Addressing these issues would substantially strengthen the submission.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The submission introduces new theorems concerning the stability of a specific class of Recurrent Neural Networks (RNNs).",
                      "The authors report state-of-the-art performance on several tasks by integrating these RNNs into hierarchical and feedback-based superstructures.",
                      "Theorems 1-5 represent a contribution to the understanding of stability conditions for RNNs.",
                      "The authors present a counterexample to the commonly held belief that linear contraction is a sufficient condition for non-linear stability.",
                      "These theoretical findings are leveraged, in part, to construct provably stable combinations of RNNs.",
                      "For the Permuted-MNIST results in Table 1, the experiment was run four times, with outcomes ranging from 96.65 to 96.94, and only the highest value (96.94) is reported.",
                      "The conclusion that increased modularity is beneficial up to a certain point is substantiated only by a single comparison (44x8 versus 22x16 in Section 3.2.1).",
                      "In Section 3.1, performance increases monotonically with network size, while it follows an inverted U-shape in Section 3.2.",
                      "Modularity is shown to have a monotonic benefit in Section 3.2 but an inverted U-shaped relationship in Section 3.1.",
                      "On page 3, the symbol 'g' is used to denote two different concepts within the same paragraph.",
                      "Theorem 7's phrasing implies the first inequality is a conclusion rather than a precondition."
                  ],
                  "evaluation": [
                      "Theorems 1-5 represent a notable contribution.",
                      "The counterexample to the commonly held belief that linear contraction is a sufficient condition for non-linear stability is valuable.",
                      "The experimental section presents interesting results, but currently lacks a clear, unifying takeaway.",
                      "Certain methodological choices in the reporting of results raise concerns.",
                      "Selectively reporting the best-case result from multiple trials is a departure from standard scientific practice and can be misleading.",
                      "Some claims appear to be supported by limited evidence.",
                      "The clarity of the paper is inconsistent; while sections such as the discussion on stability and contraction are well-written, other parts are confusing.",
                      "The use of 'g' for two different concepts on page 3 could lead to confusion.",
                      "The initial reading of Theorem 7 can be confusing due to its phrasing.",
                      "This paper makes an important contribution to the understanding and design of stable RNNs, with novel and noteworthy theoretical results.",
                      "The experimental section is undermined by a lack of clear conclusions and the use of questionable reporting practices, such as reporting only the best result from multiple trials."
                  ],
                  "constructiveInput": [
                      "It is strongly recommended that the authors report the mean and standard deviation (or variance) to provide a more accurate and robust representation of the model's performance.",
                      "To establish a convincing trend for modularity, it would be beneficial to include additional data points, such as the suggested 50x7 and 39x9 configurations.",
                      "The inclusion of more data points and error bars across all experiments would significantly strengthen the authors' conclusions and increase reader confidence that the observed effects are genuine and not statistical artifacts.",
                      "The manuscript would benefit from a discussion clarifying the conclusions that should be drawn from the seemingly opposing results in Sections 3.1 and 3.2.",
                      "The overall structure of the experimental section could be improved for readability, with a summary of the main conclusions and clearly demarcated sections corresponding to each experimental claim.",
                      "The phrasing of Theorem 7 could be revised to make it explicit that the first inequality is a precondition rather than a conclusion of the theorem.",
                      "For denoting multiplication in model configurations (e.g., \"22x16\"), using the standard mathematical symbol '×' is recommended for professionalism and clarity.",
                      "The rationale for having a subsection 3.2.1 without a corresponding 3.2.2 is unclear; it could be integrated directly into Section 3.2.",
                      "A thorough proofreading of the entire manuscript is recommended to enhance overall readability and address minor typographical or grammatical errors.",
                      "Addressing the issues in the experimental section would substantially strengthen the submission."
                  ],
                  "clarificationDialogue": [],
                  "metaCommentary": []
              }
          },
          {
              "review_id": "rewrite_1_gemini-2.5-pro_6e2196f3",
              "og_review_id": 1,
              "author": "gemini-2.5-pro",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Accept",
              "raw_review": "Here is the rewritten review, polished for professional and academic communication.\n\n***\n\n### **Summary**\n\nThis paper presents a primarily theoretical contribution to the construction of stable recurrent neural network (RNN) assemblies. While the combination of learned modular components is recognized as a powerful and tractable alternative to training monolithic models from scratch, particularly in applied domains (e.g., AlphaGo), theoretical guarantees for the stability of such combinations have been lacking. This work addresses this gap by developing a theory for provably-stable combinations of RNNs, achieved through specific weight constraints and feedback mechanisms. The authors empirically validate their theory by constructing networks with fixed RNNs that adhere to these constraints, leaving only the connections between them as learnable antisymmetric parameters. The resulting sparse combination network achieves state-of-the-art performance on sequential image classification benchmarks with significantly fewer learned parameters, while retaining the aforementioned stability guarantee.\n\n### **Strengths**\n\n*   For a primarily theoretical contribution, the empirical results are compelling. The authors conduct a thorough investigation of various permutations of their modular framework, comparing the number of RNNs versus the size of each module under absolute value weight constraints. They also evaluate an alternative SVD weight constraint, though it demonstrates lower performance and slower training. Critically, their proposed model outperforms state-of-the-art methods on several standard sequential image classification benchmarks, demonstrating the efficacy of the proposed constraints.\n*   The theoretical contribution is significant. 
There has been substantial recent work on networks composed of individual recurrent components (e.g., AlphaGo, Recurrent Independent Mechanisms), but these have largely relied on intuitive explanations and empirical validation rather than theoretical guarantees. Given that RNNs are notoriously unstable and difficult to train, learning these modular systems end-to-end presents a considerable challenge. The stability analysis provided in this paper is a key step toward developing robust modular systems. If these constraint conditions can be applied while maintaining strong performance—as the results suggest is plausible—it could enable the development of reusable modules that can be combined for specific tasks.\n*   The proofs provided in the appendix are well-structured and clear for a reader with the necessary mathematical background.\n\n### **Weaknesses**\n\n*   The paper is dense and challenging to parse. A thorough understanding of the contribution requires multiple readings as well as consultation of the appendix and cited references. The practical applications are not immediately intuitive from the text alone; prior familiarity with systems like AlphaGo provides more insight into the potential impact than the paper itself. The manuscript's clarity could be substantially improved with the inclusion of a graphical representation of the sparse combination network (to supplement the weight matrices in Figure 2), pseudocode for the algorithms in the appendix, and a pedagogical example illustrating how an unstable network assembly might diverge.\n*   While the empirical results are strong, they are confined to a single domain: sequential image classification. The paper would be more persuasive if it demonstrated the framework's applicability in at least one other domain, such as natural language processing, reinforcement learning, or continuous control.\n\n### **Recommendation**\n\nI recommend accepting this paper. 
Despite its density and the need for improved readability, the work addresses a critical and promising direction for the field of deep learning. The theoretical and empirical results have the potential to be highly impactful. To strengthen the paper, I would encourage the authors to broaden their empirical evaluation to include a different application domain and to dedicate effort to improving the manuscript's accessibility. Valuable theoretical work is too often underutilized because it is not readily comprehensible or its relevance is not made clear to the empiricists and engineers who could build upon it.",
              "clean_review": "Summary\nThis paper presents a primarily theoretical contribution to the construction of stable recurrent neural network (RNN) assemblies. While the combination of learned modular components is recognized as a powerful and tractable alternative to training monolithic models from scratch, particularly in applied domains (e.g., AlphaGo), theoretical guarantees for the stability of such combinations have been lacking. This work addresses this gap by developing a theory for provably-stable combinations of RNNs, achieved through specific weight constraints and feedback mechanisms. The authors empirically validate their theory by constructing networks with fixed RNNs that adhere to these constraints, leaving only the connections between them as learnable antisymmetric parameters. The resulting sparse combination network achieves state-of-the-art performance on sequential image classification benchmarks with significantly fewer learned parameters, while retaining the aforementioned stability guarantee.\n\nStrengths\nFor a primarily theoretical contribution, the empirical results are compelling. The authors conduct a thorough investigation of various permutations of their modular framework, comparing the number of RNNs versus the size of each module under absolute value weight constraints. They also evaluate an alternative SVD weight constraint, though it demonstrates lower performance and slower training. Critically, their proposed model outperforms state-of-the-art methods on several standard sequential image classification benchmarks, demonstrating the efficacy of the proposed constraints.\nThe theoretical contribution is significant. There has been substantial recent work on networks composed of individual recurrent components (e.g., AlphaGo, Recurrent Independent Mechanisms), but these have largely relied on intuitive explanations and empirical validation rather than theoretical guarantees. 
Given that RNNs are notoriously unstable and difficult to train, learning these modular systems end-to-end presents a considerable challenge. The stability analysis provided in this paper is a key step toward developing robust modular systems. If these constraint conditions can be applied while maintaining strong performance—as the results suggest is plausible—it could enable the development of reusable modules that can be combined for specific tasks.\nThe proofs provided in the appendix are well-structured and clear for a reader with the necessary mathematical background.\n\nWeaknesses\nThe paper is dense and challenging to parse. A thorough understanding of the contribution requires multiple readings as well as consultation of the appendix and cited references. The practical applications are not immediately intuitive from the text alone; prior familiarity with systems like AlphaGo provides more insight into the potential impact than the paper itself. The manuscript's clarity could be substantially improved with the inclusion of a graphical representation of the sparse combination network (to supplement the weight matrices in Figure 2), pseudocode for the algorithms in the appendix, and a pedagogical example illustrating how an unstable network assembly might diverge.\nWhile the empirical results are strong, they are confined to a single domain: sequential image classification. The paper would be more persuasive if it demonstrated the framework's applicability in at least one other domain, such as natural language processing, reinforcement learning, or continuous control.\n\nRecommendation\nI recommend accepting this paper. Despite its density and the need for improved readability, the work addresses a critical and promising direction for the field of deep learning. The theoretical and empirical results have the potential to be highly impactful. 
To strengthen the paper, I would encourage the authors to broaden their empirical evaluation to include a different application domain and to dedicate effort to improving the manuscript's accessibility. Valuable theoretical work is too often underutilized because it is not readily comprehensible or its relevance is not made clear to the empiricists and engineers who could build upon it.",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper presents a primarily theoretical contribution to the construction of stable recurrent neural network (RNN) assemblies.",
                      "The work addresses the lack of theoretical guarantees for the stability of combinations of learned modular components by developing a theory for provably-stable combinations of RNNs, achieved through specific weight constraints and feedback mechanisms.",
                      "The authors empirically validate their theory by constructing networks with fixed RNNs that adhere to these constraints, leaving only the connections between them as learnable antisymmetric parameters.",
                      "The resulting sparse combination network achieves state-of-the-art performance on sequential image classification benchmarks with significantly fewer learned parameters, while retaining the stability guarantee.",
                      "The authors conduct a thorough investigation of various permutations of their modular framework, comparing the number of RNNs versus the size of each module under absolute value weight constraints.",
                      "They also evaluate an alternative SVD weight constraint, though it demonstrates lower performance and slower training.",
                      "The proofs provided in the appendix are well-structured and clear for a reader with the necessary mathematical background.",
                      "The empirical results are confined to a single domain: sequential image classification."
                  ],
                  "evaluation": [
                      "For a primarily theoretical contribution, the empirical results are compelling.",
                      "Critically, their proposed model outperforms state-of-the-art methods on several standard sequential image classification benchmarks, demonstrating the efficacy of the proposed constraints.",
                      "The theoretical contribution is significant.",
                      "The stability analysis provided in this paper is a key step toward developing robust modular systems.",
                      "If these constraint conditions can be applied while maintaining strong performance—as the results suggest is plausible—it could enable the development of reusable modules that can be combined for specific tasks.",
                      "The paper is dense and challenging to parse.",
                      "A thorough understanding of the contribution requires multiple readings as well as consultation of the appendix and cited references.",
                      "The practical applications are not immediately intuitive from the text alone; prior familiarity with systems like AlphaGo provides more insight into the potential impact than the paper itself.",
                      "While the empirical results are strong, they are confined to a single domain: sequential image classification."
                  ],
                  "constructiveInput": [
                      "The manuscript's clarity could be substantially improved with the inclusion of a graphical representation of the sparse combination network (to supplement the weight matrices in Figure 2), pseudocode for the algorithms in the appendix, and a pedagogical example illustrating how an unstable network assembly might diverge.",
                      "The paper would be more persuasive if it demonstrated the framework's applicability in at least one other domain, such as natural language processing, reinforcement learning, or continuous control.",
                      "To strengthen the paper, I would encourage the authors to broaden their empirical evaluation to include a different application domain and to dedicate effort to improving the manuscript's accessibility."
                  ],
                  "clarificationDialogue": [],
                  "metaCommentary": [
                      "I recommend accepting this paper.",
                      "Despite its density and the need for improved readability, the work addresses a critical and promising direction for the field of deep learning.",
                      "The theoretical and empirical results have the potential to be highly impactful.",
                      "Valuable theoretical work is too often underutilized because it is not readily comprehensible or its relevance is not made clear to the empiricists and engineers who could build upon it."
                  ]
              }
          },
          {
              "review_id": "rewrite_2_gemini-2.5-pro_9c2d64dd",
              "og_review_id": 2,
              "author": "gemini-2.5-pro",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "This paper investigates the contraction properties of continuous-time recurrent neural networks (RNNs). The authors demonstrate that a network composed of provably stable RNN modules (a \"net of nets\") can achieve competitive performance on several benchmarks, including sequential CIFAR10, even when training is restricted to the connections between modules.\n\n### Strengths\n\n*   The paper addresses the compelling question of how to effectively assemble modular networks of RNNs, which is a valuable research direction.\n*   The theoretical contributions, particularly the theorems on contraction properties, provide a useful framework for researchers working on provably stable recurrent models.\n\n### Weaknesses\n\n1.  The presentation of experimental results in Section 3.2.1 could be improved. The extensive reporting on numerous individual AxB network configurations is dense and detracts from the section's readability. A more concise summary of these findings would enhance the clarity of the paper's primary results.\n\n2.  To better isolate the contribution of the stability constraints, the analysis would benefit from an ablation study. While Table 1 provides a comparison with other network architectures, a direct comparison against an identical architecture where the individual RNN modules are not constrained to be provably stable would be highly informative.\n\n3.  The experimental setup raises a question regarding the training strategy. The authors demonstrate strong performance when training only the inter-module connections. It would be valuable to understand the impact of training all network weights, including those within the modules. This comparison would clarify whether the sparse training approach is essential for performance or if a fully trained model might yield different results (e.g., improved or degraded performance despite a larger number of trainable parameters).\n\n4.  
The connection between the two main contributions of the paper—the theoretical analysis of stability and the empirical results on modular networks—could be more explicitly established. The manuscript would be strengthened by a clearer discussion of why provably stable modules are particularly advantageous for the proposed \"net of nets\" architecture.\n\n### Overall Assessment\n\nThis is an interesting paper that explores RNNs from the less common perspectives of provable stability and modular \"net-of-nets\" architectures. While some sections of the results are challenging to parse, the overall message is clear and presents a novel approach.\n\n*Disclaimer: My evaluation does not extend to the formal verification of the mathematical derivations.*",
              "clean_review": "The paper addresses the compelling question of how to effectively assemble modular networks of RNNs, which is a valuable research direction.\nThe theoretical contributions, particularly the theorems on contraction properties, provide a useful framework for researchers working on provably stable recurrent models.\nThe presentation of experimental results in Section 3.2.1 could be improved. The extensive reporting on numerous individual AxB network configurations is dense and detracts from the section's readability. A more concise summary of these findings would enhance the clarity of the paper's primary results.\nTo better isolate the contribution of the stability constraints, the analysis would benefit from an ablation study. While Table 1 provides a comparison with other network architectures, a direct comparison against an identical architecture where the individual RNN modules are not constrained to be provably stable would be highly informative.\nThe experimental setup raises a question regarding the training strategy. The authors demonstrate strong performance when training only the inter-module connections. It would be valuable to understand the impact of training all network weights, including those within the modules. This comparison would clarify whether the sparse training approach is essential for performance or if a fully trained model might yield different results (e.g., improved or degraded performance despite a larger number of trainable parameters).\nThe connection between the two main contributions of the paper—the theoretical analysis of stability and the empirical results on modular networks—could be more explicitly established. 
The manuscript would be strengthened by a clearer discussion of why provably stable modules are particularly advantageous for the proposed \"net of nets\" architecture.\nThis is an interesting paper that explores RNNs from the less common perspectives of provable stability and modular \"net-of-nets\" architectures. While some sections of the results are challenging to parse, the overall message is clear and presents a novel approach.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The paper addresses the question of how to effectively assemble modular networks of RNNs.",
                      "The paper includes theoretical contributions, specifically theorems on contraction properties.",
                      "The authors demonstrate strong performance when training only the inter-module connections."
                  ],
                  "evaluation": [
                      "The question of how to effectively assemble modular networks of RNNs is a valuable research direction.",
                      "The theoretical contributions provide a useful framework for researchers working on provably stable recurrent models.",
                      "The presentation of experimental results in Section 3.2.1 could be improved.",
                      "The extensive reporting on numerous individual AxB network configurations is dense and detracts from the section's readability.",
                      "While some sections of the results are challenging to parse, the overall message is clear and presents a novel approach."
                  ],
                  "constructiveInput": [
                      "A more concise summary of the findings in Section 3.2.1 would enhance the clarity of the paper's primary results.",
                      "To better isolate the contribution of the stability constraints, the analysis would benefit from an ablation study, comparing against an identical architecture where individual RNN modules are not constrained to be provably stable.",
                      "The connection between the theoretical analysis of stability and the empirical results on modular networks could be more explicitly established.",
                      "The manuscript would be strengthened by a clearer discussion of why provably stable modules are particularly advantageous for the proposed \"net of nets\" architecture."
                  ],
                  "clarificationDialogue": [
                      "It would be valuable to understand the impact of training all network weights, including those within the modules, to clarify whether the sparse training approach is essential for performance or if a fully trained model might yield different results."
                  ],
                  "metaCommentary": [
                      "This is an interesting paper that explores RNNs from the less common perspectives of provable stability and modular \"net-of-nets\" architectures."
                  ]
              }
          },
          {
              "review_id": "rewrite_3_gemini-2.5-pro_3ff3faf2",
              "og_review_id": 3,
              "author": "gemini-2.5-pro",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "This review assesses a paper on stable architectures for Recurrent Neural Networks (RNNs). The paper's primary contributions are twofold: a theoretical analysis presenting conditions for contractive weight matrices in RNNs, and a new modeling approach proposing RNN architectures that adhere to these conditions. The proposed methods are evaluated on standard benchmark datasets, including sequential MNIST, permuted MNIST, and sequential CIFAR-10.\n\nThe theoretical results are interesting, though perhaps not entirely surprising. A more significant concern is the apparent disconnection between the theoretical analysis and the proposed model. It appears that not all theorems presented are directly relevant to or utilized in the construction of the proposed architecture, which weakens the link between the paper's theoretical and practical contributions.\n\nThe presentation of the proposed model in Section 3.1 lacks clarity, making it difficult for the reader to fully grasp its mechanics. For instance, the term \"subnetworks\" is introduced at the beginning of the section without a clear definition or explanation of their function. The method for combining these subnetworks is also not explicitly detailed. The reviewer had to infer the resulting block matrix structure from the section title (\"recursive construction\") and Figure 2, rather than from a direct explanation in the text. Improving the exposition of the model would significantly benefit the paper.\n\nFurthermore, a core component of the proposed model—parameterizing orthogonal weight matrices by exponentiating skew-symmetric matrices—lacks novelty. This technique has been previously explored in the literature, notably in the expRNN paper by Lezcano-Casado and Martínez-Rubio (2019). The authors should properly situate their work with respect to this prior art.\n\nFinally, the introduction could be strengthened. 
The current discussion of topics like AlphaGo and evolutionary modules could be replaced with a more focused and thorough literature review on RNN trainability and the challenge of capturing long-term dependencies. This would provide more relevant context for the paper's contributions.\n\nIn summary, while the theoretical findings are noteworthy, the paper requires substantial revision. The lack of clarity in the model's description, concerns about its novelty, and the disconnect between theory and application need to be addressed before the paper can meet the threshold for acceptance.",
              "clean_review": "This review assesses a paper on stable architectures for Recurrent Neural Networks (RNNs). The paper's primary contributions are twofold: a theoretical analysis presenting conditions for contractive weight matrices in RNNs, and a new modeling approach proposing RNN architectures that adhere to these conditions. The proposed methods are evaluated on standard benchmark datasets, including sequential MNIST, permuted MNIST, and sequential CIFAR-10.\n\nThe theoretical results are interesting, though perhaps not entirely surprising. A more significant concern is the apparent disconnection between the theoretical analysis and the proposed model. It appears that not all theorems presented are directly relevant to or utilized in the construction of the proposed architecture, which weakens the link between the paper's theoretical and practical contributions.\n\nThe presentation of the proposed model in Section 3.1 lacks clarity, making it difficult for the reader to fully grasp its mechanics. For instance, the term \"subnetworks\" is introduced at the beginning of the section without a clear definition or explanation of their function. The method for combining these subnetworks is also not explicitly detailed. The reviewer had to infer the resulting block matrix structure from the section title (\"recursive construction\") and Figure 2, rather than from a direct explanation in the text. Improving the exposition of the model would significantly benefit the paper.\n\nFurthermore, a core component of the proposed model—parameterizing orthogonal weight matrices by exponentiating skew-symmetric matrices—lacks novelty. This technique has been previously explored in the literature, notably in the expRNN paper by Lezcano-Casado and Martínez-Rubio (2019). The authors should properly situate their work with respect to this prior art.\n\nFinally, the introduction could be strengthened. 
The current discussion of topics like AlphaGo and evolutionary modules could be replaced with a more focused and thorough literature review on RNN trainability and the challenge of capturing long-term dependencies. This would provide more relevant context for the paper's contributions.\n\nIn summary, while the theoretical findings are noteworthy, the paper requires substantial revision. The lack of clarity in the model's description, concerns about its novelty, and the disconnect between theory and application need to be addressed before the paper can meet the threshold for acceptance.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The paper under review concerns stable architectures for Recurrent Neural Networks (RNNs).",
                      "The paper's primary contributions are a theoretical analysis presenting conditions for contractive weight matrices in RNNs, and a new modeling approach proposing RNN architectures that adhere to these conditions.",
                      "The proposed methods are evaluated on standard benchmark datasets, including sequential MNIST, permuted MNIST, and sequential CIFAR-10.",
                      "A core component of the proposed model involves parameterizing orthogonal weight matrices by exponentiating skew-symmetric matrices.",
                      "The expRNN paper by Lezcano-Casado and Martínez-Rubio (2019) has previously explored the technique of parameterizing orthogonal weight matrices by exponentiating skew-symmetric matrices."
                  ],
                  "evaluation": [
                      "The theoretical results are interesting, though perhaps not entirely surprising.",
                      "A significant concern is the apparent disconnection between the theoretical analysis and the proposed model.",
                      "It appears that not all theorems presented are directly relevant to or utilized in the construction of the proposed architecture, which weakens the link between the paper's theoretical and practical contributions.",
                      "The presentation of the proposed model in Section 3.1 lacks clarity, making it difficult for the reader to fully grasp its mechanics.",
                      "The term \"subnetworks\" is introduced at the beginning of Section 3.1 without a clear definition or explanation of their function.",
                      "The method for combining these subnetworks is not explicitly detailed.",
                      "The reviewer had to infer the resulting block matrix structure from the section title (\"recursive construction\") and Figure 2, rather than from a direct explanation in the text.",
                      "The technique of parameterizing orthogonal weight matrices by exponentiating skew-symmetric matrices lacks novelty.",
                      "The theoretical findings are noteworthy."
                  ],
                  "constructiveInput": [
                      "Improving the exposition of the model would significantly benefit the paper.",
                      "The authors should properly situate their work with respect to prior art, specifically the expRNN paper by Lezcano-Casado and Martínez-Rubio (2019).",
                      "The introduction could be strengthened.",
                      "The current discussion of topics like AlphaGo and evolutionary modules could be replaced with a more focused and thorough literature review on RNN trainability and the challenge of capturing long-term dependencies.",
                      "The lack of clarity in the model's description, concerns about its novelty, and the disconnect between theory and application need to be addressed."
                  ],
                  "clarificationDialogue": [],
                  "metaCommentary": [
                      "The paper requires substantial revision.",
                      "The paper needs to address several issues before it can meet the threshold for acceptance."
                  ]
              }
          },
          {
              "review_id": "ai_deepseek-v3.1_Accept_5a2ff595",
              "author": "deepseek-v3.1",
              "class": "ai",
              "prompt": "Accept",
              "raw_review": "# Summary of paper\nThis paper introduces a framework for constructing stable \"networks of networks\" by leveraging contraction theory. It presents several novel sufficient conditions for ensuring contraction in continuous-time RNNs (Theorems 1-5) and proves a combination theorem (Theorem 6) that enables stable interconnection of contracting subnetworks. The authors implement two architectures (Sparse Combo Net and SVD Combo Net) that enforce these stability constraints and demonstrate state-of-the-art performance among provably stable RNNs on sequential image classification tasks, particularly permuted sequential MNIST and sequential CIFAR10, while maintaining fewer parameters than many comparable models.\n\n# Strengths\n- **Theoretical Novelty and Rigor**: The paper provides several new contraction conditions for continuous-time RNNs (Theorems 1-5) and a general combination theorem (Theorem 6) that enables modular, stable construction of complex networks. The proofs are included in supplements, and counterexamples (e.g., Theorem 8) clarify the limitations of previous work.\n- **Practical Implementation and Validation**: The proposed architectures (Sparse Combo Net and SVD Combo Net) are concretely implemented, trained, and evaluated on standard sequential tasks. The Sparse Combo Net achieves SOTA for stable RNNs on permuted seqMNIST (96.94%) and seqCIFAR10 (64.75%), demonstrating that strong stability constraints need not compromise performance.\n- **Modularity and Biological Relevance**: The work aligns with principles of modularity in both engineering and biology (e.g., \"facilitated variation\"), offering a systematic way to build complex systems from stable components. 
The emphasis on sparsity and antisymmetric connections also resonates with neuroscientific insights.\n- **Comprehensive Experiments**: The paper includes ablation studies on sparsity, network size, and modularity, along with repeatability tests and control experiments (e.g., unconstrained L matrix), which robustly support the claims.\n\n# Weaknesses\n- **Limited Comparison to Non-Stable SOTA**: While the models outperform other stable RNNs, they do not match overall SOTA methods (e.g., Trellis networks or large Transformers) on some tasks. The trade-off between stability and performance could be more critically discussed.\n- **Computational Overhead**: The parameterization for stability (e.g., skew-symmetric matrix exponentials for orthogonality in SVD Combo Net) may introduce computational costs during training, though this is not explicitly analyzed.\n- **Task Scope**: Experiments are confined to sequential image classification; broader evaluation on tasks requiring different temporal dynamics (e.g., time-series prediction, reinforcement learning) would strengthen the generalizability claim.\n\n# Questions for authors (if any)\n- **Scalability to Larger Networks**: How does the computational cost of solving for metrics (e.g., in Sparse Combo Net) scale with the number and size of subnetworks? Are there approximate methods for very large networks?\n- **Generalization to Discrete-Time**: The analysis focuses on continuous-time RNNs. Can similar combination theorems be derived for discrete-time models, and how would that affect training and performance?\n- **Theoretical Limitations**: Theorem 7 suggests a less restrictive condition involving time-varying metrics. 
Is there a practical way to enforce this during training, and what improvements might it yield?\n\n# Final Judgement\n- **Rationale of recommendation**: This paper makes significant theoretical contributions to stability-guaranteed RNN design via contraction theory and demonstrates practical utility through state-of-the-art performance among stable models on challenging benchmarks. The work is timely, well-executed, and addresses an important gap in modular and safe AI systems. The weaknesses are minor and do not detract from the overall impact.\n- **Recommendation**: Accept",
              "clean_review": "Summary of paper\nThis paper introduces a framework for constructing stable \"networks of networks\" by leveraging contraction theory. It presents several novel sufficient conditions for ensuring contraction in continuous-time RNNs (Theorems 1-5) and proves a combination theorem (Theorem 6) that enables stable interconnection of contracting subnetworks. The authors implement two architectures (Sparse Combo Net and SVD Combo Net) that enforce these stability constraints and demonstrate state-of-the-art performance among provably stable RNNs on sequential image classification tasks, particularly permuted sequential MNIST and sequential CIFAR10, while maintaining fewer parameters than many comparable models.\n\nStrengths\nTheoretical Novelty and Rigor: The paper provides several new contraction conditions for continuous-time RNNs (Theorems 1-5) and a general combination theorem (Theorem 6) that enables modular, stable construction of complex networks. The proofs are included in supplements, and counterexamples (e.g., Theorem 8) clarify the limitations of previous work.\nPractical Implementation and Validation: The proposed architectures (Sparse Combo Net and SVD Combo Net) are concretely implemented, trained, and evaluated on standard sequential tasks. The Sparse Combo Net achieves SOTA for stable RNNs on permuted seqMNIST (96.94%) and seqCIFAR10 (64.75%), demonstrating that strong stability constraints need not compromise performance.\nModularity and Biological Relevance: The work aligns with principles of modularity in both engineering and biology (e.g., \"facilitated variation\"), offering a systematic way to build complex systems from stable components. 
The emphasis on sparsity and antisymmetric connections also resonates with neuroscientific insights.\nComprehensive Experiments: The paper includes ablation studies on sparsity, network size, and modularity, along with repeatability tests and control experiments (e.g., unconstrained L matrix), which robustly support the claims.\n\nWeaknesses\nLimited Comparison to Non-Stable SOTA: While the models outperform other stable RNNs, they do not match overall SOTA methods (e.g., Trellis networks or large Transformers) on some tasks. The trade-off between stability and performance could be more critically discussed.\nComputational Overhead: The parameterization for stability (e.g., skew-symmetric matrix exponentials for orthogonality in SVD Combo Net) may introduce computational costs during training, though this is not explicitly analyzed.\nTask Scope: Experiments are confined to sequential image classification; broader evaluation on tasks requiring different temporal dynamics (e.g., time-series prediction, reinforcement learning) would strengthen the generalizability claim.\n\nQuestions for authors (if any)\nScalability to Larger Networks: How does the computational cost of solving for metrics (e.g., in Sparse Combo Net) scale with the number and size of subnetworks? Are there approximate methods for very large networks?\nGeneralization to Discrete-Time: The analysis focuses on continuous-time RNNs. Can similar combination theorems be derived for discrete-time models, and how would that affect training and performance?\nTheoretical Limitations: Theorem 7 suggests a less restrictive condition involving time-varying metrics. 
Is there a practical way to enforce this during training, and what improvements might it yield?\n\nFinal Judgement\nRationale of recommendation: This paper makes significant theoretical contributions to stability-guaranteed RNN design via contraction theory and demonstrates practical utility through state-of-the-art performance among stable models on challenging benchmarks. The work is timely, well-executed, and addresses an important gap in modular and safe AI systems. The weaknesses are minor and do not detract from the overall impact.\nRecommendation: Accept",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper introduces a framework for constructing stable \"networks of networks\" by leveraging contraction theory.",
                      "It presents several novel sufficient conditions for ensuring contraction in continuous-time RNNs (Theorems 1-5) and proves a combination theorem (Theorem 6) that enables stable interconnection of contracting subnetworks.",
                      "The authors implement two architectures (Sparse Combo Net and SVD Combo Net) that enforce these stability constraints.",
                      "The models demonstrate state-of-the-art performance among provably stable RNNs on sequential image classification tasks, particularly permuted sequential MNIST and sequential CIFAR10, while maintaining fewer parameters than many comparable models.",
                      "The proposed architectures (Sparse Combo Net and SVD Combo Net) are concretely implemented, trained, and evaluated on standard sequential tasks.",
                      "The Sparse Combo Net achieves SOTA for stable RNNs on permuted seqMNIST (96.94%) and seqCIFAR10 (64.75%).",
                      "The paper includes ablation studies on sparsity, network size, and modularity, along with repeatability tests and control experiments (e.g., unconstrained L matrix).",
                      "Theorem 7 suggests a less restrictive condition involving time-varying metrics.",
                      "The analysis focuses on continuous-time RNNs."
                  ],
                  "evaluation": [
                      "The paper provides several new contraction conditions for continuous-time RNNs (Theorems 1-5) and a general combination theorem (Theorem 6) that enables modular, stable construction of complex networks, demonstrating theoretical novelty and rigor.",
                      "The proofs are included in supplements, and counterexamples (e.g., Theorem 8) clarify the limitations of previous work.",
                      "The practical implementation and validation show that strong stability constraints need not compromise performance.",
                      "The work aligns with principles of modularity in both engineering and biology (e.g., \"facilitated variation\"), offering a systematic way to build complex systems from stable components.",
                      "The emphasis on sparsity and antisymmetric connections also resonates with neuroscientific insights.",
                      "The comprehensive experiments robustly support the claims.",
                      "While the models outperform other stable RNNs, they do not match overall SOTA methods (e.g., Trellis networks or large Transformers) on some tasks.",
                      "The parameterization for stability (e.g., skew-symmetric matrix exponentials for orthogonality in SVD Combo Net) may introduce computational costs during training, though this is not explicitly analyzed.",
                      "Experiments are confined to sequential image classification, which limits the generalizability claim."
                  ],
                  "constructiveInput": [
                      "The trade-off between stability and performance could be more critically discussed.",
                      "Broader evaluation on tasks requiring different temporal dynamics (e.g., time-series prediction, reinforcement learning) would strengthen the generalizability claim."
                  ],
                  "clarificationDialogue": [
                      "How does the computational cost of solving for metrics (e.g., in Sparse Combo Net) scale with the number and size of subnetworks?",
                      "Are there approximate methods for very large networks?",
                      "Can similar combination theorems be derived for discrete-time models, and how would that affect training and performance?",
                      "Is there a practical way to enforce the less restrictive condition involving time-varying metrics during training, and what improvements might it yield?"
                  ],
                  "metaCommentary": [
                      "This paper makes significant theoretical contributions to stability-guaranteed RNN design via contraction theory and demonstrates practical utility through state-of-the-art performance among stable models on challenging benchmarks.",
                      "The work is timely, well-executed, and addresses an important gap in modular and safe AI systems.",
                      "The weaknesses are minor and do not detract from the overall impact.",
                      "The paper is recommended for acceptance."
                  ]
              }
          },
          {
              "review_id": "ai_deepseek-v3.1_Borderline_25f68cdd",
              "author": "deepseek-v3.1",
              "class": "ai",
              "prompt": "Borderline",
              "raw_review": "# Summary of paper\n\nThis paper introduces a framework for constructing provably stable \"networks of networks\" by leveraging contraction theory. The authors derive several novel contraction conditions for continuous-time recurrent neural networks (RNNs) and present a theorem that enables stable combination of multiple contracting subnetworks through a constrained parameterization of inter-module connections. They demonstrate the effectiveness of their approach on sequential image classification tasks (permuted MNIST and CIFAR-10), achieving state-of-the-art performance among stability-guaranteed RNNs while maintaining fewer parameters than many alternatives.\n\n# Strengths\n- **Novel theoretical contributions**: The paper provides multiple new contraction conditions for continuous-time RNNs (Theorems 1-5) and a principled framework for combining stable subnetworks (Theorem 6), advancing the theoretical foundation for stable modular neural networks.\n- **Strong empirical results**: The proposed \"Sparse Combo Net\" achieves impressive performance on challenging sequential tasks (96.94% on permuted MNIST, 64.75% on sequential CIFAR-10), outperforming other stability-guaranteed methods while using relatively few parameters.\n- **Biological and practical relevance**: The work connects to biological principles of modularity and facilitated variation, and addresses important practical concerns around stability in safety-critical applications of machine learning.\n\n# Weaknesses\n- **Limited comparison to non-stable methods**: While the paper shows advantages over other stability-guaranteed approaches, it doesn't sufficiently benchmark against high-performing non-stable methods (e.g., Transformers, Trellis Networks) to contextualize the performance trade-offs of the stability constraints.\n- **Theoretical-clinical gap**: The contraction conditions (especially Theorem 1) appear quite restrictive, and it's not fully explained why such 
strongly constrained networks can achieve competitive performance. The analysis of this apparent paradox (strong stability correlating with good performance) is somewhat superficial.\n- **Implementation details**: The description of how exactly the sparse networks are initialized and the contraction metrics are computed could be more detailed to ensure reproducibility. The training procedure for the sparse networks (where only inter-module connections are trained) seems non-standard and warrants more explanation.\n\n# Questions for authors\n- **Stability-performance tradeoff**: Could you provide more analysis on why the strong stability constraints (particularly the sparse connectivity) don't hamper performance, and perhaps even improve it on these tasks?\n- **Comparison baseline**: Why not include comparisons against top-performing non-stable methods in your main results table to better situate your approach in the broader landscape?\n- **Computational complexity**: What is the computational overhead of maintaining the stability constraints during training compared to standard RNN training?\n- **Generalization**: Have you tested your approach on other sequential tasks beyond image classification to assess its broader applicability?\n\n# Final Judgement\n- **Rationale of recommendation**: This paper makes meaningful theoretical contributions to stable RNN design and demonstrates impressive empirical results on standard benchmarks. However, the comparison to non-stable state-of-the-art methods is insufficient, and the explanation of why strong stability constraints don't hurt performance needs strengthening. The biological motivation is interesting but somewhat disconnected from the technical contributions.\n- **Recommendation**: Borderline",
              "clean_review": "This paper introduces a framework for constructing provably stable \"networks of networks\" by leveraging contraction theory. The authors derive several novel contraction conditions for continuous-time recurrent neural networks (RNNs) and present a theorem that enables stable combination of multiple contracting subnetworks through a constrained parameterization of inter-module connections. They demonstrate the effectiveness of their approach on sequential image classification tasks (permuted MNIST and CIFAR-10), achieving state-of-the-art performance among stability-guaranteed RNNs while maintaining fewer parameters than many alternatives.\n\nStrengths\nNovel theoretical contributions: The paper provides multiple new contraction conditions for continuous-time RNNs (Theorems 1-5) and a principled framework for combining stable subnetworks (Theorem 6), advancing the theoretical foundation for stable modular neural networks.\nStrong empirical results: The proposed \"Sparse Combo Net\" achieves impressive performance on challenging sequential tasks (96.94% on permuted MNIST, 64.75% on sequential CIFAR-10), outperforming other stability-guaranteed methods while using relatively few parameters.\nBiological and practical relevance: The work connects to biological principles of modularity and facilitated variation, and addresses important practical concerns around stability in safety-critical applications of machine learning.\n\nWeaknesses\nLimited comparison to non-stable methods: While the paper shows advantages over other stability-guaranteed approaches, it doesn't sufficiently benchmark against high-performing non-stable methods (e.g., Transformers, Trellis Networks) to contextualize the performance trade-offs of the stability constraints.\nTheoretical-clinical gap: The contraction conditions (especially Theorem 1) appear quite restrictive, and it's not fully explained why such strongly constrained networks can achieve competitive 
performance. The analysis of this apparent paradox (strong stability correlating with good performance) is somewhat superficial.\nImplementation details: The description of how exactly the sparse networks are initialized and the contraction metrics are computed could be more detailed to ensure reproducibility. The training procedure for the sparse networks (where only inter-module connections are trained) seems non-standard and warrants more explanation.\n\nQuestions for authors\nStability-performance tradeoff: Could you provide more analysis on why the strong stability constraints (particularly the sparse connectivity) don't hamper performance, and perhaps even improve it on these tasks?\nComparison baseline: Why not include comparisons against top-performing non-stable methods in your main results table to better situate your approach in the broader landscape?\nComputational complexity: What is the computational overhead of maintaining the stability constraints during training compared to standard RNN training?\nGeneralization: Have you tested your approach on other sequential tasks beyond image classification to assess its broader applicability?\n\nFinal Judgement\nRationale of recommendation: This paper makes meaningful theoretical contributions to stable RNN design and demonstrates impressive empirical results on standard benchmarks. However, the comparison to non-stable state-of-the-art methods is insufficient, and the explanation of why strong stability constraints don't hurt performance needs strengthening. The biological motivation is interesting but somewhat disconnected from the technical contributions.\nRecommendation: Borderline",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper introduces a framework for constructing provably stable \"networks of networks\" by leveraging contraction theory.",
                      "The authors derive several novel contraction conditions for continuous-time recurrent neural networks (RNNs).",
                      "They present a theorem that enables stable combination of multiple contracting subnetworks through a constrained parameterization of inter-module connections.",
                      "They demonstrate the effectiveness of their approach on sequential image classification tasks (permuted MNIST and CIFAR-10).",
                      "The proposed \"Sparse Combo Net\" achieves impressive performance on challenging sequential tasks (96.94% on permuted MNIST, 64.75% on sequential CIFAR-10).",
                      "The work connects to biological principles of modularity and facilitated variation."
                  ],
                  "evaluation": [
                      "The paper achieves state-of-the-art performance among stability-guaranteed RNNs while maintaining fewer parameters than many alternatives.",
                      "The paper provides multiple new contraction conditions for continuous-time RNNs (Theorems 1-5) and a principled framework for combining stable subnetworks (Theorem 6), advancing the theoretical foundation for stable modular neural networks.",
                      "The proposed \"Sparse Combo Net\" outperforms other stability-guaranteed methods while using relatively few parameters.",
                      "The work addresses important practical concerns around stability in safety-critical applications of machine learning.",
                      "The paper doesn't sufficiently benchmark against high-performing non-stable methods (e.g., Transformers, Trellis Networks) to contextualize the performance trade-offs of the stability constraints.",
                      "The contraction conditions (especially Theorem 1) appear quite restrictive.",
                      "The analysis of the apparent paradox (strong stability correlating with good performance) is somewhat superficial.",
                      "The description of how exactly the sparse networks are initialized and the contraction metrics are computed could be more detailed to ensure reproducibility.",
                      "The training procedure for the sparse networks (where only inter-module connections are trained) seems non-standard and warrants more explanation."
                  ],
                  "constructiveInput": [
                      "The description of how exactly the sparse networks are initialized and the contraction metrics are computed could be more detailed to ensure reproducibility.",
                      "The training procedure for the sparse networks (where only inter-module connections are trained) warrants more explanation."
                  ],
                  "clarificationDialogue": [
                      "Could you provide more analysis on why the strong stability constraints (particularly the sparse connectivity) don't hamper performance, and perhaps even improve it on these tasks?",
                      "Why not include comparisons against top-performing non-stable methods in your main results table to better situate your approach in the broader landscape?",
                      "What is the computational overhead of maintaining the stability constraints during training compared to standard RNN training?",
                      "Have you tested your approach on other sequential tasks beyond image classification to assess its broader applicability?"
                  ],
                  "metaCommentary": [
                      "This paper makes meaningful theoretical contributions to stable RNN design and demonstrates impressive empirical results on standard benchmarks.",
                      "The comparison to non-stable state-of-the-art methods is insufficient.",
                      "The explanation of why strong stability constraints don't hurt performance needs strengthening.",
                      "The biological motivation is interesting but somewhat disconnected from the technical contributions.",
                      "Recommendation: Borderline"
                  ]
              }
          },
          {
              "review_id": "rewrite_0_deepseek-v3.1_1d9ae655",
              "og_review_id": 0,
              "author": "deepseek-v3.1",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "The submission introduces novel theorems that establish stability conditions for a class of recurrent neural networks (RNNs). Furthermore, by integrating these RNNs into hierarchical and feedback superstructures, state-of-the-art performance is achieved across multiple tasks.\n\n**Theoretical Contributions**\n\nTheorems 1–5 represent an evolutionary advancement in understanding the conditions for stability. The authors also provide a counterexample to the widely held assumption that linear contraction is a sufficient condition for nonlinear stability. These theoretical insights are partially utilized in the construction of provably stable RNN combinations.\n\n**Experimental Evaluation**\n\nWhile the experimental section presents interesting results, it lacks a clearly defined central message. Additionally, certain reporting choices in Table 1 raise concerns.\n\nFor instance, the authors indicate that the Permuted MNIST trials were conducted four times, with results ranging from 96.65% to 96.94%. It is noteworthy that only the highest value (96.94%) is reported in the table. Selecting only the best-performing run from multiple trials is not considered good practice, as it may misrepresent typical performance. Reporting the mean and variance would be more appropriate and informative. (Reporting only the best outcome after numerous trials could artificially inflate perceived performance beyond what is achievable on average.)\n\nSome claims are supported by limited evidence. For example, the assertion that increased modularity improves performance up to a certain point is based solely on the observation that a 44×8 configuration outperforms a 22×16 configuration in Section 3.2.1. 
To substantiate this conclusion and demonstrate a trend, the authors should consider examining additional configurations, such as 50×7 and 39×9.\n\nIn general, incorporating more data points and error bars would strengthen the validity of the conclusions and help distinguish genuine effects from random variations.\n\nContradictory trends are observed between Sections 3.1 and 3.2: performance increases monotonically with network size in Section 3.1 but follows an inverted U-shape in Section 3.2, while modularity shows an inverted U-shape relationship in Section 3.1 but a monotonic improvement in Section 3.2. These discrepancies warrant clarification regarding the overall conclusions.\n\nOverall, the experimental section would benefit from improved organization, including a summary of key findings and clearly delineated paragraphs or subsections corresponding to each demonstrated point.\n\n**Clarity and Minor Points**\n\nThe clarity of the paper is inconsistent. Discussions of stability and contraction are well articulated, yet certain passages, such as the paragraph at the top of page 3, use the symbol *g* to denote two distinct concepts within the same context.\n\nTheorem 7 could be rephrased to explicitly indicate that the first inequality is a condition rather than a statement. Although apparent in retrospect, this distinction may confuse new readers.\n\nFor mathematical notation, it is recommended to use the multiplication symbol (×) instead of the letter \"x\" (e.g., 22×16 rather than 22x16).\n\nThe rationale for including a subsection 3.2.1 instead of integrating its content into Section 3.2 is unclear. A thorough revision of the manuscript to enhance overall readability is advised.\n\n**Summary**\n\nThis paper constitutes an evolutionary step forward in the understanding and design of stable RNNs, with theoretical contributions that are both novel and significant. 
However, the experimental results lack definitive conclusions and, in some instances, deviate from established reporting practices (e.g., selectively reporting the best performance from multiple runs).",
              "clean_review": "The submission introduces novel theorems that establish stability conditions for a class of recurrent neural networks (RNNs). Furthermore, by integrating these RNNs into hierarchical and feedback superstructures, state-of-the-art performance is achieved across multiple tasks.\n\nTheoretical Contributions\n\nTheorems 1–5 represent an evolutionary advancement in understanding the conditions for stability. The authors also provide a counterexample to the widely held assumption that linear contraction is a sufficient condition for nonlinear stability. These theoretical insights are partially utilized in the construction of provably stable RNN combinations.\n\nExperimental Evaluation\n\nWhile the experimental section presents interesting results, it lacks a clearly defined central message. Additionally, certain reporting choices in Table 1 raise concerns.\n\nFor instance, the authors indicate that the Permuted MNIST trials were conducted four times, with results ranging from 96.65% to 96.94%. It is noteworthy that only the highest value (96.94%) is reported in the table. Selecting only the best-performing run from multiple trials is not considered good practice, as it may misrepresent typical performance. Reporting the mean and variance would be more appropriate and informative. (Reporting only the best outcome after numerous trials could artificially inflate perceived performance beyond what is achievable on average.)\n\nSome claims are supported by limited evidence. For example, the assertion that increased modularity improves performance up to a certain point is based solely on the observation that a 44×8 configuration outperforms a 22×16 configuration in Section 3.2.1. 
To substantiate this conclusion and demonstrate a trend, the authors should consider examining additional configurations, such as 50×7 and 39×9.\n\nIn general, incorporating more data points and error bars would strengthen the validity of the conclusions and help distinguish genuine effects from random variations.\n\nContradictory trends are observed between Sections 3.1 and 3.2: performance increases monotonically with network size in Section 3.1 but follows an inverted U-shape in Section 3.2, while modularity shows an inverted U-shape relationship in Section 3.1 but a monotonic improvement in Section 3.2. These discrepancies warrant clarification regarding the overall conclusions.\n\nOverall, the experimental section would benefit from improved organization, including a summary of key findings and clearly delineated paragraphs or subsections corresponding to each demonstrated point.\n\nClarity and Minor Points\n\nThe clarity of the paper is inconsistent. Discussions of stability and contraction are well articulated, yet certain passages, such as the paragraph at the top of page 3, use the symbol g to denote two distinct concepts within the same context.\n\nTheorem 7 could be rephrased to explicitly indicate that the first inequality is a condition rather than a statement. Although apparent in retrospect, this distinction may confuse new readers.\n\nFor mathematical notation, it is recommended to use the multiplication symbol (×) instead of the letter \"x\" (e.g., 22×16 rather than 22x16).\n\nThe rationale for including a subsection 3.2.1 instead of integrating its content into Section 3.2 is unclear. A thorough revision of the manuscript to enhance overall readability is advised.\n\nSummary\n\nThis paper constitutes an evolutionary step forward in the understanding and design of stable RNNs, with theoretical contributions that are both novel and significant. 
However, the experimental results lack definitive conclusions and, in some instances, deviate from established reporting practices (e.g., selectively reporting the best performance from multiple runs).",
              "claim_extraction": {
                  "factualRestatement": [
                      "The submission introduces novel theorems that establish stability conditions for a class of recurrent neural networks (RNNs).",
                      "By integrating these RNNs into hierarchical and feedback superstructures, state-of-the-art performance is achieved across multiple tasks.",
                      "Theorems 1–5 represent an evolutionary advancement in understanding the conditions for stability.",
                      "The authors provide a counterexample to the widely held assumption that linear contraction is a sufficient condition for nonlinear stability.",
                      "These theoretical insights are partially utilized in the construction of provably stable RNN combinations.",
                      "The Permuted MNIST trials were conducted four times, with results ranging from 96.65% to 96.94%, and only the highest value (96.94%) is reported in Table 1.",
                      "The assertion that increased modularity improves performance up to a certain point is based solely on the observation that a 44×8 configuration outperforms a 22×16 configuration in Section 3.2.1.",
                      "Contradictory trends are observed between Sections 3.1 and 3.2 regarding network size and modularity: performance increases monotonically with network size in Section 3.1 but follows an inverted U-shape in Section 3.2, while modularity shows an inverted U-shape relationship in Section 3.1 but a monotonic improvement in Section 3.2.",
                      "Discussions of stability and contraction are well articulated.",
                      "Certain passages, such as the paragraph at the top of page 3, use the symbol g to denote two distinct concepts within the same context."
                  ],
                  "evaluation": [
                      "The theoretical contributions are both novel and significant.",
                      "The experimental section presents interesting results, but it lacks a clearly defined central message.",
                      "Certain reporting choices in Table 1 raise concerns.",
                      "Selecting only the best-performing run from multiple trials is not considered good practice, as it may misrepresent typical performance.",
                      "Reporting only the best outcome after numerous trials could artificially inflate perceived performance beyond what is achievable on average.",
                      "Some claims are supported by limited evidence.",
                      "The clarity of the paper is inconsistent.",
                      "The experimental results lack definitive conclusions and, in some instances, deviate from established reporting practices (e.g., selectively reporting the best performance from multiple runs)."
                  ],
                  "constructiveInput": [
                      "Reporting the mean and variance would be more appropriate and informative for experimental results.",
                      "To substantiate the conclusion about modularity and demonstrate a trend, the authors should consider examining additional configurations, such as 50×7 and 39×9.",
                      "Incorporating more data points and error bars would strengthen the validity of the conclusions and help distinguish genuine effects from random variations.",
                      "The discrepancies in trends between Sections 3.1 and 3.2 warrant clarification regarding the overall conclusions.",
                      "The experimental section would benefit from improved organization, including a summary of key findings and clearly delineated paragraphs or subsections corresponding to each demonstrated point.",
                      "The authors should clarify the use of the symbol g when it denotes two distinct concepts within the same context.",
                      "Theorem 7 could be rephrased to explicitly indicate that the first inequality is a condition rather than a statement.",
                      "For mathematical notation, it is recommended to use the multiplication symbol (×) instead of the letter \"x\" (e.g., 22×16 rather than 22x16).",
                      "A thorough revision of the manuscript to enhance overall readability is advised."
                  ],
                  "clarificationDialogue": [
                      "The rationale for including a subsection 3.2.1 instead of integrating its content into Section 3.2 is unclear."
                  ],
                  "metaCommentary": [
                      "This paper constitutes an evolutionary step forward in the understanding and design of stable RNNs."
                  ]
              }
          },
          {
              "review_id": "rewrite_1_deepseek-v3.1_ec726e98",
              "og_review_id": 1,
              "author": "deepseek-v3.1",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Accept",
              "raw_review": "This paper presents a theoretical contribution to the construction of assemblies of recurrent neural networks. While combinations of learned modular components are known to be powerful and more tractable than training bespoke models from scratch—as seen in applied domains such as AlphaGo—theoretical guarantees for the stability of such assemblies have been lacking. This work develops a theoretical foundation for provably stable combinations of RNNs through the use of weight constraints and feedback mechanisms. The authors then demonstrate that, by employing fixed RNNs generated under these constraints—with antisymmetric learnable parameters for interconnections—their sparse combination network achieves state-of-the-art performance on sequential image classification benchmarks, using significantly fewer learned parameters while retaining stability guarantees.\n\n**Strengths:**\n- The empirical results are compelling, especially given the paper’s primarily theoretical focus. The authors thoroughly explore variations of their modular sparse combination framework, including comparisons between the number of RNNs and their size under absolute value weight constraints, and investigate an alternative SVD-based constraint approach (which underperforms and trains more slowly). Most notably, the framework outperforms state-of-the-art algorithms on established, though relatively accessible, benchmarks, even—and perhaps because of—the imposed constraints.\n- The theoretical contribution is substantial. Recent work in networks with multiple recurrent components, such as AlphaGo or the recurrent independent mechanisms (RIMs) framework, has largely relied on intuition and empirical validation rather than theoretical assurances. 
Given the notorious instability and training challenges of RNNs, the ability to apply constraint-based conditions and still achieve competitive performance—particularly with absolute value constraints—opens avenues for developing reusable modules tailored to specific tasks. Although several intermediate questions remain, the stability analysis represents a critical advance.\n- The proofs provided in the appendix are rigorous and accessible to readers with adequate mathematical background.\n\n**Weaknesses:**\n- The paper is highly dense and challenging to follow. Multiple readings were required to fully appreciate the significance of network stability and the mechanisms by which it is achieved. The appendix and several references are essential for comprehension. The use cases are not intuitive; inclusion of a graphical representation of the sparse combination network (beyond the weight matrices in Figure 2), pseudocode for key algorithms (perhaps in the appendix), and an illustrative example of an unstable assembly diverging could improve clarity. Personal familiarity with systems like AlphaGo provided more practical insight than the paper itself.\n- Although the empirical results are strong, they are limited to sequential image prediction. Extending validation to at least one additional domain—such as natural language processing, reinforcement learning, or continuous control—would strengthen the generality of the findings.\n\n**Overall Recommendation:**  \nThis paper is recommended for acceptance. Despite its density and the need for supplemental reading, the work aligns with an important direction in deep learning and offers both theoretical and practical value. To enhance the paper, the inclusion of results from another application domain and greater emphasis on improving readability would be beneficial. 
Highly theoretical contributions often remain underutilized when they are challenging to interpret or appear less directly applicable to empiricists and engineers who might build upon them.",
              "clean_review": "This paper presents a theoretical contribution to the construction of assemblies of recurrent neural networks. While combinations of learned modular components are known to be powerful and more tractable than training bespoke models from scratch—as seen in applied domains such as AlphaGo—theoretical guarantees for the stability of such assemblies have been lacking. This work develops a theoretical foundation for provably stable combinations of RNNs through the use of weight constraints and feedback mechanisms. The authors then demonstrate that, by employing fixed RNNs generated under these constraints—with antisymmetric learnable parameters for interconnections—their sparse combination network achieves state-of-the-art performance on sequential image classification benchmarks, using significantly fewer learned parameters while retaining stability guarantees.\n\nStrengths:\nThe empirical results are compelling, especially given the paper’s primarily theoretical focus. The authors thoroughly explore variations of their modular sparse combination framework, including comparisons between the number of RNNs and their size under absolute value weight constraints, and investigate an alternative SVD-based constraint approach (which underperforms and trains more slowly). Most notably, the framework outperforms state-of-the-art algorithms on established, though relatively accessible, benchmarks, even—and perhaps because of—the imposed constraints.\nThe theoretical contribution is substantial. Recent work in networks with multiple recurrent components, such as AlphaGo or the recurrent independent mechanisms (RIMs) framework, has largely relied on intuition and empirical validation rather than theoretical assurances. 
Given the notorious instability and training challenges of RNNs, the ability to apply constraint-based conditions and still achieve competitive performance—particularly with absolute value constraints—opens avenues for developing reusable modules tailored to specific tasks. Although several intermediate questions remain, the stability analysis represents a critical advance.\nThe proofs provided in the appendix are rigorous and accessible to readers with adequate mathematical background.\n\nWeaknesses:\nThe paper is highly dense and challenging to follow. Multiple readings were required to fully appreciate the significance of network stability and the mechanisms by which it is achieved. The appendix and several references are essential for comprehension. The use cases are not intuitive; inclusion of a graphical representation of the sparse combination network (beyond the weight matrices in Figure 2), pseudocode for key algorithms (perhaps in the appendix), and an illustrative example of an unstable assembly diverging could improve clarity. Personal familiarity with systems like AlphaGo provided more practical insight than the paper itself.\nAlthough the empirical results are strong, they are limited to sequential image prediction. Extending validation to at least one additional domain—such as natural language processing, reinforcement learning, or continuous control—would strengthen the generality of the findings.\n\nOverall Recommendation:\nThis paper is recommended for acceptance. Despite its density and the need for supplemental reading, the work aligns with an important direction in deep learning and offers both theoretical and practical value. To enhance the paper, the inclusion of results from another application domain and greater emphasis on improving readability would be beneficial. 
Highly theoretical contributions often remain underutilized when they are challenging to interpret or appear less directly applicable to empiricists and engineers who might build upon them.",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper presents a theoretical contribution to the construction of assemblies of recurrent neural networks.",
                      "Combinations of learned modular components are known to be powerful and more tractable than training bespoke models from scratch.",
                      "This work develops a theoretical foundation for provably stable combinations of RNNs through the use of weight constraints and feedback mechanisms.",
                      "The authors demonstrate that, by employing fixed RNNs generated under these constraints—with antisymmetric learnable parameters for interconnections—their sparse combination network achieves state-of-the-art performance on sequential image classification benchmarks.",
                      "The framework uses significantly fewer learned parameters while retaining stability guarantees.",
                      "The authors thoroughly explore variations of their modular sparse combination framework, including comparisons between the number of RNNs and their size under absolute value weight constraints.",
                      "The paper investigates an alternative SVD-based constraint approach, which underperforms and trains more slowly.",
                      "Recent work in networks with multiple recurrent components, such as AlphaGo or the recurrent independent mechanisms (RIMs) framework, has largely relied on intuition and empirical validation rather than theoretical assurances.",
                      "The proofs provided in the appendix are rigorous and accessible to readers with adequate mathematical background.",
                      "The empirical results are limited to sequential image prediction."
                  ],
                  "evaluation": [
                      "The empirical results are compelling, especially given the paper’s primarily theoretical focus.",
                      "Most notably, the framework outperforms state-of-the-art algorithms on established, though relatively accessible, benchmarks, even—and perhaps because of—the imposed constraints.",
                      "The theoretical contribution is substantial.",
                      "Given the notorious instability and training challenges of RNNs, the ability to apply constraint-based conditions and still achieve competitive performance—particularly with absolute value constraints—opens avenues for developing reusable modules tailored to specific tasks.",
                      "Although several intermediate questions remain, the stability analysis represents a critical advance.",
                      "The paper is highly dense and challenging to follow.",
                      "Multiple readings were required to fully appreciate the significance of network stability and the mechanisms by which it is achieved.",
                      "The appendix and several references are essential for comprehension.",
                      "The use cases are not intuitive.",
                      "Personal familiarity with systems like AlphaGo provided more practical insight than the paper itself.",
                      "Although the empirical results are strong, they are limited to sequential image prediction."
                  ],
                  "constructiveInput": [
                      "Inclusion of a graphical representation of the sparse combination network (beyond the weight matrices in Figure 2), pseudocode for key algorithms (perhaps in the appendix), and an illustrative example of an unstable assembly diverging could improve clarity.",
                      "Extending validation to at least one additional domain—such as natural language processing, reinforcement learning, or continuous control—would strengthen the generality of the findings.",
                      "To enhance the paper, the inclusion of results from another application domain and greater emphasis on improving readability would be beneficial."
                  ],
                  "clarificationDialogue": [],
                  "metaCommentary": [
                      "Theoretical guarantees for the stability of such assemblies have been lacking.",
                      "This paper is recommended for acceptance.",
                      "Despite its density and the need for supplemental reading, the work aligns with an important direction in deep learning and offers both theoretical and practical value.",
                      "Highly theoretical contributions often remain underutilized when they are challenging to interpret or appear less directly applicable to empiricists and engineers who might build upon them."
                  ]
              }
          },
          {
              "review_id": "rewrite_2_deepseek-v3.1_844eac9c",
              "og_review_id": 2,
              "author": "deepseek-v3.1",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "The authors investigate the contraction properties of continuous-time recurrent neural networks and further demonstrate that a provably stable network of such RNNs can be trained to achieve competitive performance across several benchmarks, including sequential CIFAR-10, even when only the inter-module connections are trained.\n\n**Strengths**  \nThe problem of assembling networks of RNNs is compelling, and the theoretical results on contraction properties provide valuable insights for researchers working with provably stable RNNs.\n\n**Weaknesses**  \n(1) Section 3.2.1 is somewhat dense. Presenting results from numerous individual AxB networks may not be strictly necessary.  \n(2) While Table 1 compares performance with other network types, it would be informative to include results from configurations where all aspects are identical except that the RNNs are no longer provably stable.  \n(3) It would also be beneficial to examine the performance when all connection weights, not just those between modules, are trained. Specifically, does performance decline despite the increased number of parameters?  \n(4) The connection between the provable stability analysis and the training of modular networks appears somewhat disjointed. The relationship between these components, such as whether stable RNN modules are particularly advantageous for sparsely connected modular architectures, merits further clarification.\n\nOverall, this paper presents an innovative approach to RNNs by emphasizing provable stability and modular network design. Although certain sections are challenging to follow, the presentation is generally clear.\n\nI note that I am not in a position to assess the correctness of the mathematical derivations.",
              "clean_review": "The problem of assembling networks of RNNs is compelling, and the theoretical results on contraction properties provide valuable insights for researchers working with provably stable RNNs.\n\nSection 3.2.1 is somewhat dense. Presenting results from numerous individual AxB networks may not be strictly necessary.\nWhile Table 1 compares performance with other network types, it would be informative to include results from configurations where all aspects are identical except that the RNNs are no longer provably stable.\nIt would also be beneficial to examine the performance when all connection weights, not just those between modules, are trained. Specifically, does performance decline despite the increased number of parameters?\nThe connection between the provable stability analysis and the training of modular networks appears somewhat disjointed. The relationship between these components, such as whether stable RNN modules are particularly advantageous for sparsely connected modular architectures, merits further clarification.\n\nOverall, this paper presents an innovative approach to RNNs by emphasizing provable stability and modular network design. Although certain sections are challenging to follow, the presentation is generally clear.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The paper addresses the problem of assembling networks of RNNs.",
                      "The paper provides theoretical results on contraction properties."
                  ],
                  "evaluation": [
                      "The problem of assembling networks of RNNs is compelling.",
                      "The theoretical results on contraction properties provide valuable insights for researchers working with provably stable RNNs.",
                      "Section 3.2.1 is somewhat dense.",
                      "The connection between the provable stability analysis and the training of modular networks appears somewhat disjointed.",
                      "This paper presents an innovative approach to RNNs by emphasizing provable stability and modular network design.",
                      "Although certain sections are challenging to follow, the presentation is generally clear."
                  ],
                  "constructiveInput": [
                      "Presenting results from numerous individual AxB networks in Section 3.2.1 may not be strictly necessary.",
                      "It would be informative to include results from configurations where all aspects are identical except that the RNNs are no longer provably stable in Table 1.",
                      "It would be beneficial to examine the performance when all connection weights, not just those between modules, are trained.",
                      "The relationship between the provable stability analysis and the training of modular networks, such as whether stable RNN modules are particularly advantageous for sparsely connected modular architectures, merits further clarification."
                  ],
                  "clarificationDialogue": [
                      "Specifically, does performance decline despite the increased number of parameters when all connection weights are trained?"
                  ],
                  "metaCommentary": []
              }
          },
          {
              "review_id": "rewrite_3_deepseek-v3.1_a509a1f4",
              "og_review_id": 3,
              "author": "deepseek-v3.1",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "In this paper, the authors investigate stable architectures for recurrent neural networks (RNNs). From a theoretical perspective, they derive a set of conditions under which the weight matrix of an RNN is contractive. From a modeling perspective, they introduce RNN architectures that incorporate such contractive weight matrices. The proposed methods are evaluated on established benchmark datasets, including sequential MNIST, permuted MNIST, and sequential CIFAR-10.\n\nWhile the theoretical contributions appear interesting, they are not especially surprising. A more significant concern, however, is the apparent disconnect between the theoretical analysis and the proposed model, as not all theoretical results appear directly relevant to the model architecture.\n\nRegarding the model described in Section 3.1, the presentation could be improved for clarity. For instance, the term \"subnetworks\" is introduced without sufficient explanation, making it difficult to understand how these components are integrated. It can only be inferred from the title and Figure 2 that the resulting weight matrix is constructed in a block-wise manner.\n\nAdditionally, the use of the exponential of a skew-symmetric matrix to parametrize orthogonal weight matrices is not novel, as a similar approach was previously introduced in expRNN [1].\n\nThe introduction could also benefit from refinement. Rather than discussing AlphaGo or evolutionary modules, it would be more helpful to provide a more comprehensive review of prior work on RNN trainability and long-term dependency issues.\n\n[1] Lezcano-Casado, Mario, and David Martínez-Rubio. \"Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group.\" International Conference on Machine Learning. PMLR, 2019.\n\nIn summary, although the theoretical results are of interest, the clarity of the model description and its novelty are limited. 
Overall, the paper requires substantive improvements to meet the standards for acceptance.",
              "clean_review": "While the theoretical contributions appear interesting, they are not especially surprising. A more significant concern, however, is the apparent disconnect between the theoretical analysis and the proposed model, as not all theoretical results appear directly relevant to the model architecture.\n\nRegarding the model described in Section 3.1, the presentation could be improved for clarity. For instance, the term \"subnetworks\" is introduced without sufficient explanation, making it difficult to understand how these components are integrated. It can only be inferred from the title and Figure 2 that the resulting weight matrix is constructed in a block-wise manner.\n\nAdditionally, the use of the exponential of a skew-symmetric matrix to parametrize orthogonal weight matrices is not novel, as a similar approach was previously introduced in expRNN.\n\nThe introduction could also benefit from refinement. Rather than discussing AlphaGo or evolutionary modules, it would be more helpful to provide a more comprehensive review of prior work on RNN trainability and long-term dependency issues.\n\nIn summary, although the theoretical results are of interest, the clarity of the model description and its novelty are limited. Overall, the paper requires substantive improvements to meet the standards for acceptance.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The paper discusses a model in Section 3.1.",
                      "The paper uses the exponential of a skew-symmetric matrix to parametrize orthogonal weight matrices."
                  ],
                  "evaluation": [
                      "The theoretical contributions appear interesting, but not especially surprising.",
                      "There is an apparent disconnect between the theoretical analysis and the proposed model, as not all theoretical results appear directly relevant to the model architecture.",
                      "The presentation of the model described in Section 3.1 could be improved for clarity.",
                      "The term \"subnetworks\" is introduced without sufficient explanation, making it difficult to understand how these components are integrated.",
                      "The use of the exponential of a skew-symmetric matrix to parametrize orthogonal weight matrices is not novel, as a similar approach was previously introduced in expRNN.",
                      "The clarity of the model description and its novelty are limited."
                  ],
                  "constructiveInput": [
                      "The presentation of the model described in Section 3.1 could be improved for clarity.",
                      "The introduction could benefit from refinement.",
                      "Rather than discussing AlphaGo or evolutionary modules, it would be more helpful to provide a more comprehensive review of prior work on RNN trainability and long-term dependency issues."
                  ],
                  "clarificationDialogue": [
                      "It can only be inferred from the title and Figure 2 that the resulting weight matrix is constructed in a block-wise manner."
                  ],
                  "metaCommentary": [
                      "Overall, the paper requires substantive improvements to meet the standards for acceptance."
                  ]
              }
          },
          {
              "review_id": "ai_qwen3-235b-a22b_Accept_6f33e3af",
              "author": "qwen3-235b-a22b",
              "class": "ai",
              "prompt": "Accept",
              "raw_review": "# Summary of paper  \nThis paper establishes novel contraction conditions for continuous-time recurrent neural networks (RNNs) and neural ODEs, enabling the recursive construction of provably stable \"networks of networks.\" The authors derive five sufficient conditions under which individual RNNs are contracting, then prove a key composition theorem (Theorem 6) showing how such contracting subnetworks can be interconnected via a structured feedback matrix **L** that preserves global contraction. They instantiate two architectures—Sparse Combo Nets (fixed sparse subnetworks) and SVD Combo Nets (trainable, singular-value-constrained subnetworks)—and demonstrate state-of-the-art performance on sequential MNIST and CIFAR-10 with strong stability guarantees. Experiments highlight the benefits of sparsity and modularity, and ablation studies confirm the importance of the proposed stability-preserving interconnection structure.\n\n# Strengths  \n- **Strong theoretical contribution**: The derivation of multiple new, verifiable contraction conditions for continuous-time RNNs significantly expands the toolbox for stable recurrent model design. These conditions are both general and practical.\n- **Novel and impactful composition theorem**: Theorem 6 provides a rare constructive method for building stable *modular* systems—enabling learnable, stability-preserving interconnections between pre-verified contracting modules. This is a major step toward scalable, certifiable deep architectures.\n- **State-of-the-art empirical performance with strong guarantees**: The proposed models achieve SOTA results among *provably stable* RNNs on permuted seqMNIST and seqCIFAR10, with competitive accuracy and relatively few parameters. 
This validates the practical relevance of the theory.\n- **Rigorous experimental validation**: The authors conduct extensive ablation studies on sparsity, scalability, repeatability, and the necessity of the stability-preserving interconnects, providing compelling evidence for their claims.\n- **Interdisciplinary significance**: The work bridges control theory, neuroscience (modularity, sparsity), and deep learning, offering tools relevant to safe AI, biological modeling, and modular machine learning.\n\n# Weaknesses  \n- **Limited evaluation on longer sequences or more complex tasks**: While permuted seqMNIST and seqCIFAR10 are standard, they are not extremely long-range or semantically complex. The true advantage of contraction-based stability in handling very long dependencies (e.g., 10k+ steps) remains less explored.\n- **Sparse Combo Net’s fixed internal weights may limit expressivity**: Freezing subnetwork weights avoids training complexity but could restrict representational power. The strong performance suggests this is not fatal, but the trade-off deserves more discussion.\n- **Theorem 8 is under-explained**: The counterexample showing that symmetric negative definiteness of $gW_{\\text{sym}} - I$ does not imply contraction is important, but its implications for practice (e.g., limitations of prior work) could be more clearly articulated.\n- **Scalability of metric computation not discussed**: For large-scale deployments, computing or storing block-diagonal contraction metrics $\\tilde{\\mathbf{M}}$ may become a bottleneck. A brief discussion of computational cost would strengthen the practicality assessment.\n\n# Questions for authors  \n- **Sparsity vs. stability trade-off**: You show that sparser subnetworks perform better, possibly due to allowing larger weight magnitudes under stability constraints. Could you quantify this trade-off analytically or empirically—e.g., by plotting performance vs. 
spectral radius or contraction rate?\n- **Generalization beyond classification**: Your models excel at sequence classification. Have you tested them on generation or regression tasks (e.g., language modeling, time-series prediction), where stability and long-term dynamics are equally critical?\n- **Broader applicability of Theorem 6**: Can the interconnection structure (via **L**) be adapted to other network types (e.g., transformers with recurrence, graph RNNs), or is it inherently tied to continuous-time RNNs?\n\n# Final Judgement  \n- **Rationale of recommendation**: This paper presents a theoretically grounded, practically effective framework for building stable, modular recurrent networks. The contraction analysis is rigorous, the composition theorem is a key innovation, and the experiments convincingly demonstrate both stability and strong performance. Despite minor limitations in task scope and exposition depth, the work represents a significant advance in stable deep learning with broad implications.\n- **Recommendation**: Accept",
              "clean_review": "Summary of paper\nThis paper establishes novel contraction conditions for continuous-time recurrent neural networks (RNNs) and neural ODEs, enabling the recursive construction of provably stable \"networks of networks.\" The authors derive five sufficient conditions under which individual RNNs are contracting, then prove a key composition theorem (Theorem 6) showing how such contracting subnetworks can be interconnected via a structured feedback matrix L that preserves global contraction. They instantiate two architectures—Sparse Combo Nets (fixed sparse subnetworks) and SVD Combo Nets (trainable, singular-value-constrained subnetworks)—and demonstrate state-of-the-art performance on sequential MNIST and CIFAR-10 with strong stability guarantees. Experiments highlight the benefits of sparsity and modularity, and ablation studies confirm the importance of the proposed stability-preserving interconnection structure.\n\nStrengths\nStrong theoretical contribution: The derivation of multiple new, verifiable contraction conditions for continuous-time RNNs significantly expands the toolbox for stable recurrent model design. These conditions are both general and practical.\nNovel and impactful composition theorem: Theorem 6 provides a rare constructive method for building stable modular systems—enabling learnable, stability-preserving interconnections between pre-verified contracting modules. This is a major step toward scalable, certifiable deep architectures.\nState-of-the-art empirical performance with strong guarantees: The proposed models achieve SOTA results among provably stable RNNs on permuted seqMNIST and seqCIFAR10, with competitive accuracy and relatively few parameters. 
This validates the practical relevance of the theory.\nRigorous experimental validation: The authors conduct extensive ablation studies on sparsity, scalability, repeatability, and the necessity of the stability-preserving interconnects, providing compelling evidence for their claims.\nInterdisciplinary significance: The work bridges control theory, neuroscience (modularity, sparsity), and deep learning, offering tools relevant to safe AI, biological modeling, and modular machine learning.\n\nWeaknesses\nLimited evaluation on longer sequences or more complex tasks: While permuted seqMNIST and seqCIFAR10 are standard, they are not extremely long-range or semantically complex. The true advantage of contraction-based stability in handling very long dependencies (e.g., 10k+ steps) remains less explored.\nSparse Combo Net’s fixed internal weights may limit expressivity: Freezing subnetwork weights avoids training complexity but could restrict representational power. The strong performance suggests this is not fatal, but the trade-off deserves more discussion.\nTheorem 8 is under-explained: The counterexample showing that symmetric negative definiteness of gWsym - I does not imply contraction is important, but its implications for practice (e.g., limitations of prior work) could be more clearly articulated.\nScalability of metric computation not discussed: For large-scale deployments, computing or storing block-diagonal contraction metrics M may become a bottleneck. A brief discussion of computational cost would strengthen the practicality assessment.\n\nQuestions for authors\nSparsity vs. stability trade-off: You show that sparser subnetworks perform better, possibly due to allowing larger weight magnitudes under stability constraints. Could you quantify this trade-off analytically or empirically—e.g., by plotting performance vs. spectral radius or contraction rate?\nGeneralization beyond classification: Your models excel at sequence classification. 
Have you tested them on generation or regression tasks (e.g., language modeling, time-series prediction), where stability and long-term dynamics are equally critical?\nBroader applicability of Theorem 6: Can the interconnection structure (via L) be adapted to other network types (e.g., transformers with recurrence, graph RNNs), or is it inherently tied to continuous-time RNNs?\n\nFinal Judgement\nRationale of recommendation: This paper presents a theoretically grounded, practically effective framework for building stable, modular recurrent networks. The contraction analysis is rigorous, the composition theorem is a key innovation, and the experiments convincingly demonstrate both stability and strong performance. Despite minor limitations in task scope and exposition depth, the work represents a significant advance in stable deep learning with broad implications.\nRecommendation: Accept",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper establishes novel contraction conditions for continuous-time recurrent neural networks (RNNs) and neural ODEs, enabling the recursive construction of provably stable \"networks of networks.\"",
                      "The authors derive five sufficient conditions under which individual RNNs are contracting, then prove a key composition theorem (Theorem 6) showing how such contracting subnetworks can be interconnected via a structured feedback matrix L that preserves global contraction.",
                      "They instantiate two architectures—Sparse Combo Nets (fixed sparse subnetworks) and SVD Combo Nets (trainable, singular-value-constrained subnetworks)—and demonstrate state-of-the-art performance on sequential MNIST and CIFAR-10 with strong stability guarantees.",
                      "Experiments highlight the benefits of sparsity and modularity, and ablation studies confirm the importance of the proposed stability-preserving interconnection structure."
                  ],
                  "evaluation": [
                      "The derivation of multiple new, verifiable contraction conditions for continuous-time RNNs significantly expands the toolbox for stable recurrent model design.",
                      "These conditions are both general and practical.",
                      "Theorem 6 provides a rare constructive method for building stable modular systems—enabling learnable, stability-preserving interconnections between pre-verified contracting modules.",
                      "This is a major step toward scalable, certifiable deep architectures.",
                      "The proposed models achieve SOTA results among provably stable RNNs on permuted seqMNIST and seqCIFAR10, with competitive accuracy and relatively few parameters.",
                      "This validates the practical relevance of the theory.",
                      "The authors conduct extensive ablation studies on sparsity, scalability, repeatability, and the necessity of the stability-preserving interconnects, providing compelling evidence for their claims.",
                      "The work bridges control theory, neuroscience (modularity, sparsity), and deep learning, offering tools relevant to safe AI, biological modeling, and modular machine learning.",
                      "While permuted seqMNIST and seqCIFAR10 are standard, they are not extremely long-range or semantically complex.",
                      "The true advantage of contraction-based stability in handling very long dependencies (e.g., 10k+ steps) remains less explored.",
                      "Freezing subnetwork weights avoids training complexity but could restrict representational power.",
                      "The strong performance suggests this is not fatal, but the trade-off deserves more discussion.",
                      "The counterexample showing that symmetric negative definiteness of gW_sym − I does not imply contraction is important."
                  ],
                  "constructiveInput": [
                      "The implications of Theorem 8 for practice (e.g., limitations of prior work) could be more clearly articulated.",
                      "For large-scale deployments, computing or storing block-diagonal contraction metrics M may become a bottleneck. A brief discussion of computational cost would strengthen the practicality assessment."
                  ],
                  "clarificationDialogue": [
                      "You show that sparser subnetworks perform better, possibly due to allowing larger weight magnitudes under stability constraints. Could you quantify this trade-off analytically or empirically—e.g., by plotting performance vs. spectral radius or contraction rate?",
                      "Your models excel at sequence classification. Have you tested them on generation or regression tasks (e.g., language modeling, time-series prediction), where stability and long-term dynamics are equally critical?",
                      "Can the interconnection structure (via L) be adapted to other network types (e.g., transformers with recurrence, graph RNNs), or is it inherently tied to continuous-time RNNs?"
                  ],
                  "metaCommentary": [
                      "This paper presents a theoretically grounded, practically effective framework for building stable, modular recurrent networks.",
                      "The contraction analysis is rigorous, the composition theorem is a key innovation, and the experiments convincingly demonstrate both stability and strong performance.",
                      "Despite minor limitations in task scope and exposition depth, the work represents a significant advance in stable deep learning with broad implications."
                  ]
              }
          },
          {
              "review_id": "ai_qwen3-235b-a22b_Borderline_8c5a9426",
              "author": "qwen3-235b-a22b",
              "class": "ai",
              "prompt": "Borderline",
              "raw_review": "# Summary of paper  \nThis paper proposes a framework for constructing provably stable \"networks of networks\" using continuous-time recurrent neural networks (RNNs) and neural ODEs. It introduces five novel contraction conditions for individual RNNs and leverages contraction theory to derive a parameterization of inter-subnetwork connections—via Theorem 6—that guarantees global stability of the composite system. The authors instantiate two architectures: *Sparse Combo Nets*, where subnetwork weights are fixed and sparse, and *SVD Combo Nets*, where all parameters are trainable under a singular value-based constraint. Experiments on sequential MNIST and CIFAR-10 show strong performance, particularly for Sparse Combo Nets, which achieve state-of-the-art results among stability-guaranteed RNNs with fewer parameters.\n\n---\n\n# Strengths  \n- **Theoretical novelty and rigor**: The contraction conditions (Theorems 1–5) for continuous-time RNNs are well-formulated and expand the toolkit for analyzing nonlinear dynamics in neural networks. 
The use of contraction theory is mathematically sound and builds on a strong control-theoretic foundation.\n- **Modular stability via combination theorem**: Theorem 6 provides a clear, constructive method to combine contracting subnetworks while preserving stability—a valuable contribution toward scalable, verifiable deep architectures.\n- **Strong empirical performance**: The proposed models, especially Sparse Combo Nets, achieve competitive or superior accuracy on benchmark sequence tasks compared to other stable RNNs, despite fewer trainable parameters.\n- **Ablation and sensitivity studies**: The paper includes thorough controls—e.g., testing the importance of the interconnection stability constraint and studying sparsity/size effects—which strengthen the validity of the claims.\n\n---\n\n# Weaknesses  \n- **Limited novelty in combination mechanism**: While Theorem 6 is useful, the form of the interconnection matrix **L** = **B** − **M̃**⁻¹**B**ᵀ**M̃** is a known construction in contraction theory (differential Lyapunov framework). The paper does not sufficiently distinguish this from prior work (e.g., Slotine & Lohmiller, 2001), nor does it compare against simpler or alternative stable interconnection schemes.\n- **Unclear scalability and generalization**: Results are limited to small-scale image classification tasks. There is no evaluation on longer sequences, real-time series, or tasks requiring complex memory dynamics (e.g., algorithmic reasoning), raising questions about broader applicability.\n- **Sparse Combo Net training protocol raises concerns**: Fixing subnetwork weights *a priori* and only training inter-module connections deviates significantly from standard end-to-end learning. 
The performance gains may stem more from architectural inductive bias than the stability framework itself, yet this is not disentangled.\n- **Ambiguous stability claims**: The paper claims \"provably stable\" networks, but Theorem 7 requires time-varying metrics and restrictive derivative bounds on activation slopes (Ḋ condition), which are not enforced during training. This weakens the practical guarantee.\n- **Overstatement of SOTA claims**: While the method outperforms *stable* RNNs, it falls short of overall SOTA (e.g., Trellis Networks, Transformers). Table 1 misleadingly groups models without clearly distinguishing between \"stable\" and \"unstable\" RNNs, potentially overstating impact.\n\n---\n\n# Questions for authors  \n- **Generality of Theorem 6**: Can the interconnection structure in Theorem 6 represent arbitrary modular architectures (e.g., skip connections, hierarchical trees), or is it limited to fully connected, antisymmetric feedback? Are there expressivity trade-offs?\n- **Training feasibility of Theorem 7**: Since the condition on Ḋ(t) in Theorem 7 depends on time derivatives of activations, how is this ensured during training? Is it actively enforced, or is it merely a post-hoc justification?\n- **Initialization sensitivity**: For Sparse Combo Nets, how sensitive are results to the initial selection of fixed subnetworks? Was any optimization or screening used during initialization, or is performance purely due to random sampling?\n- **Comparison to alternative stability methods**: Why not compare against other stable RNN parameterizations (e.g., orthogonal RNNs with trained input/output layers) under the same modular setup? This would better isolate the benefit of the proposed contraction approach.\n\n---\n\n# Final Judgement  \n- **Rationale of recommendation**: The paper presents a solid theoretical framework with promising experimental results on standard benchmarks. 
The integration of contraction theory into modular RNN design is timely and potentially impactful, especially for safety-critical applications. However, the novelty of the combination rule is modest, the empirical validation is narrow, and some theoretical claims are stronger than what is practically enforced. The performance gains, while notable, may be more attributable to sparsity and architectural design than the core stability mechanism. These limitations prevent it from being an outright accept, but the work is technically sound and offers meaningful progress in stable deep learning.\n- **Recommendation**: **Borderline**",
              "clean_review": "Summary of paper\nThis paper proposes a framework for constructing provably stable \"networks of networks\" using continuous-time recurrent neural networks (RNNs) and neural ODEs. It introduces five novel contraction conditions for individual RNNs and leverages contraction theory to derive a parameterization of inter-subnetwork connections—via Theorem 6—that guarantees global stability of the composite system. The authors instantiate two architectures: Sparse Combo Nets, where subnetwork weights are fixed and sparse, and SVD Combo Nets, where all parameters are trainable under a singular value-based constraint. Experiments on sequential MNIST and CIFAR-10 show strong performance, particularly for Sparse Combo Nets, which achieve state-of-the-art results among stability-guaranteed RNNs with fewer parameters.\n\nStrengths\nTheoretical novelty and rigor: The contraction conditions (Theorems 1–5) for continuous-time RNNs are well-formulated and expand the toolkit for analyzing nonlinear dynamics in neural networks. 
The use of contraction theory is mathematically sound and builds on a strong control-theoretic foundation.\nModular stability via combination theorem: Theorem 6 provides a clear, constructive method to combine contracting subnetworks while preserving stability—a valuable contribution toward scalable, verifiable deep architectures.\nStrong empirical performance: The proposed models, especially Sparse Combo Nets, achieve competitive or superior accuracy on benchmark sequence tasks compared to other stable RNNs, despite fewer trainable parameters.\nAblation and sensitivity studies: The paper includes thorough controls—e.g., testing the importance of the interconnection stability constraint and studying sparsity/size effects—which strengthen the validity of the claims.\n\nWeaknesses\nLimited novelty in combination mechanism: While Theorem 6 is useful, the form of the interconnection matrix L = B − M̃⁻¹BᵀM̃ is a known construction in contraction theory (differential Lyapunov framework). The paper does not sufficiently distinguish this from prior work (e.g., Slotine & Lohmiller, 2001), nor does it compare against simpler or alternative stable interconnection schemes.\nUnclear scalability and generalization: Results are limited to small-scale image classification tasks. There is no evaluation on longer sequences, real-time series, or tasks requiring complex memory dynamics (e.g., algorithmic reasoning), raising questions about broader applicability.\nSparse Combo Net training protocol raises concerns: Fixing subnetwork weights a priori and only training inter-module connections deviates significantly from standard end-to-end learning. 
The performance gains may stem more from architectural inductive bias than the stability framework itself, yet this is not disentangled.\nAmbiguous stability claims: The paper claims \"provably stable\" networks, but Theorem 7 requires time-varying metrics and restrictive derivative bounds on activation slopes (Ḋ condition), which are not enforced during training. This weakens the practical guarantee.\nOverstatement of SOTA claims: While the method outperforms stable RNNs, it falls short of overall SOTA (e.g., Trellis Networks, Transformers). Table 1 misleadingly groups models without clearly distinguishing between \"stable\" and \"unstable\" RNNs, potentially overstating impact.\n\nQuestions for authors\nGenerality of Theorem 6: Can the interconnection structure in Theorem 6 represent arbitrary modular architectures (e.g., skip connections, hierarchical trees), or is it limited to fully connected, antisymmetric feedback? Are there expressivity trade-offs?\nTraining feasibility of Theorem 7: Since the condition on Ḋ(t) in Theorem 7 depends on time derivatives of activations, how is this ensured during training? Is it actively enforced, or is it merely a post-hoc justification?\nInitialization sensitivity: For Sparse Combo Nets, how sensitive are results to the initial selection of fixed subnetworks? Was any optimization or screening used during initialization, or is performance purely due to random sampling?\nComparison to alternative stability methods: Why not compare against other stable RNN parameterizations (e.g., orthogonal RNNs with trained input/output layers) under the same modular setup? This would better isolate the benefit of the proposed contraction approach.\n\nFinal Judgement\nRationale of recommendation: The paper presents a solid theoretical framework with promising experimental results on standard benchmarks. 
The integration of contraction theory into modular RNN design is timely and potentially impactful, especially for safety-critical applications. However, the novelty of the combination rule is modest, the empirical validation is narrow, and some theoretical claims are stronger than what is practically enforced. The performance gains, while notable, may be more attributable to sparsity and architectural design than the core stability mechanism. These limitations prevent it from being an outright accept, but the work is technically sound and offers meaningful progress in stable deep learning.\nRecommendation: Borderline",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper proposes a framework for constructing provably stable \"networks of networks\" using continuous-time recurrent neural networks (RNNs) and neural ODEs.",
                      "It introduces five novel contraction conditions for individual RNNs and leverages contraction theory to derive a parameterization of inter-subnetwork connections—via Theorem 6—that guarantees global stability of the composite system.",
                      "The authors instantiate two architectures: Sparse Combo Nets, where subnetwork weights are fixed and sparse, and SVD Combo Nets, where all parameters are trainable under a singular value-based constraint.",
                      "Experiments on sequential MNIST and CIFAR-10 show strong performance, particularly for Sparse Combo Nets, which achieve state-of-the-art results among stability-guaranteed RNNs with fewer parameters.",
                      "The contraction conditions (Theorems 1–5) for continuous-time RNNs are well-formulated and expand the toolkit for analyzing nonlinear dynamics in neural networks.",
                      "Theorem 6 provides a clear, constructive method to combine contracting subnetworks while preserving stability.",
                      "The proposed models, especially Sparse Combo Nets, achieve competitive or superior accuracy on benchmark sequence tasks compared to other stable RNNs, despite fewer trainable parameters.",
                      "The paper includes thorough controls—e.g., testing the importance of the interconnection stability constraint and studying sparsity/size effects—which strengthen the validity of the claims.",
                      "The form of the interconnection matrix L = B − M̃⁻¹BᵀM̃ is a known construction in contraction theory (differential Lyapunov framework).",
                      "Results are limited to small-scale image classification tasks.",
                      "Theorem 7 requires time-varying metrics and restrictive derivative bounds on activation slopes (Ḋ condition), which are not enforced during training.",
                      "Table 1 groups models without clearly distinguishing between \"stable\" and \"unstable\" RNNs."
                  ],
                  "evaluation": [
                      "The use of contraction theory is mathematically sound and builds on a strong control-theoretic foundation.",
                      "Theorem 6 is a valuable contribution toward scalable, verifiable deep architectures.",
                      "The theoretical novelty and rigor of the paper are strong.",
                      "The empirical performance of the proposed models is strong.",
                      "The ablation and sensitivity studies are thorough.",
                      "While Theorem 6 is useful, the form of the interconnection matrix L = B − M̃⁻¹BᵀM̃ is a known construction in contraction theory, and the paper does not sufficiently distinguish this from prior work.",
                      "The paper does not compare against simpler or alternative stable interconnection schemes.",
                      "The scalability and generalization of the method are unclear.",
                      "The performance gains of Sparse Combo Nets may stem more from architectural inductive bias than the stability framework itself, and this is not disentangled.",
                      "The paper's claims of \"provably stable\" networks are ambiguous because Theorem 7's conditions are not enforced during training, which weakens the practical guarantee.",
                      "The overstatement of SOTA claims is a weakness, as the method falls short of overall SOTA (e.g., Trellis Networks, Transformers).",
                      "The paper presents a solid theoretical framework with promising experimental results on standard benchmarks.",
                      "The integration of contraction theory into modular RNN design is timely and potentially impactful, especially for safety-critical applications.",
                      "The novelty of the combination rule is modest.",
                      "The empirical validation is narrow.",
                      "Some theoretical claims are stronger than what is practically enforced.",
                      "The performance gains, while notable, may be more attributable to sparsity and architectural design than the core stability mechanism.",
                      "The work is technically sound and offers meaningful progress in stable deep learning."
                  ],
                  "constructiveInput": [
                      "The paper should sufficiently distinguish the combination mechanism from prior work (e.g., Slotine & Lohmiller, 2001) and compare against simpler or alternative stable interconnection schemes.",
                      "The authors should evaluate the method on longer sequences, real-time series, or tasks requiring complex memory dynamics (e.g., algorithmic reasoning) to address questions about broader applicability.",
                      "The authors should disentangle whether the performance gains of Sparse Combo Nets stem more from architectural inductive bias than the stability framework itself.",
                      "Table 1 should clearly distinguish between \"stable\" and \"unstable\" RNNs to avoid misleadingly grouping models and potentially overstating impact."
                  ],
                  "clarificationDialogue": [
                      "Can the interconnection structure in Theorem 6 represent arbitrary modular architectures (e.g., skip connections, hierarchical trees), or is it limited to fully connected, antisymmetric feedback?",
                      "Are there expressivity trade-offs with the interconnection structure in Theorem 6?",
                      "Since the condition on Ḋ(t) in Theorem 7 depends on time derivatives of activations, how is this ensured during training?",
                      "Is the condition on Ḋ(t) in Theorem 7 actively enforced, or is it merely a post-hoc justification?",
                      "For Sparse Combo Nets, how sensitive are results to the initial selection of fixed subnetworks?",
                      "Was any optimization or screening used during initialization for Sparse Combo Nets, or is performance purely due to random sampling?",
                      "Why not compare against other stable RNN parameterizations (e.g., orthogonal RNNs with trained input/output layers) under the same modular setup?",
                      "Would comparing against other stable RNN parameterizations better isolate the benefit of the proposed contraction approach?"
                  ],
                  "metaCommentary": [
                      "These limitations prevent it from being an outright accept.",
                      "The recommendation is Borderline."
                  ]
              }
          },
          {
              "review_id": "rewrite_0_qwen3-235b-a22b_8ffba6ae",
              "og_review_id": 0,
              "author": "qwen3-235b-a22b",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "Certainly. Here's a more polished, professional, and constructive version of the review while preserving all the original points and critiques:\n\n---\n\n**Summary:**\n\nThe submission presents a theoretical framework establishing stability conditions for a class of RNNs, supported by novel theorems (Theorems 1–5). These results represent a meaningful advancement in understanding the stability of recurrent networks, particularly through the demonstration that linear contraction alone does not guarantee nonlinear stability—a useful correction to a commonly held assumption. The authors further leverage this theoretical foundation to design hierarchical and feedback-based RNN architectures that achieve state-of-the-art performance on several benchmark tasks.\n\nWhile the theoretical contributions are solid and well-motivated, the experimental evaluation raises several concerns that currently limit the paper’s overall impact and clarity.\n\n---\n\n**Theoretical Contributions:**\n\nThe theoretical development is a clear strength of the paper. Theorems 1–5 provide an incremental but valuable progression in the analysis of RNN stability, with precise conditions for contraction and stability in both standalone and interconnected modules. The counterexample illustrating that linear contraction does not suffice for nonlinear stability is instructive and adds nuance to existing intuitions. The extension of these results to modular, hierarchical, and feedback-coupled architectures is well-motivated and theoretically grounded.\n\n---\n\n**Experimental Evaluation:**\n\nThe experimental section, while showcasing strong performance, suffers from several methodological and presentational issues that undermine the reliability and interpretability of the findings.\n\nFirst, the reporting of results raises concerns about reproducibility and fairness. 
For instance, in Table 1, the authors report only the best-performing run (96.94%) of four trials on Perm-MNIST, omitting the mean and variance. Selectively reporting the best result from multiple runs is not consistent with best practices in empirical machine learning research and risks overstating performance. I strongly recommend reporting mean performance with standard deviation or error bars across runs. Without such statistics, it is difficult to assess the robustness and generalizability of the claimed results. Moreover, if many trials were conducted, the best result could reflect overfitting to random variation rather than a true performance gain.\n\nSecond, several key claims are supported by insufficient evidence. For example, the assertion that increased modularity improves performance up to a point is based solely on a comparison between two configurations: 44×8 and 22×16. To establish a meaningful trend and draw a statistically sound conclusion, additional data points (e.g., 50×7, 39×9, or other modular splits) are needed. Similarly, claims regarding performance trends with respect to network size and modularity require more systematic variation and replication.\n\nThird, there appears to be a contradiction in the reported trends between Sections 3.1 and 3.2. In Section 3.1, performance increases monotonically with network size, while in Section 3.2, an inverted U-shaped relationship is observed. Likewise, modularity improves performance monotonically in 3.2 but shows an inverted U-shape in 3.1. The paper does not adequately reconcile these conflicting patterns, leaving the reader uncertain about the underlying principles. A more detailed discussion or controlled analysis to explain this discrepancy would greatly strengthen the work.\n\nFinally, the experimental section as a whole lacks a clear narrative structure. 
It would benefit from better organization—clearly delineated subsections, each focused on a specific hypothesis, along with a summary of key takeaways. The inclusion of error bars, statistical significance tests, and more comprehensive ablation studies would also improve confidence in the conclusions.\n\n---\n\n**Clarity and Presentation:**\n\nThe paper is at times very clear—particularly in its treatment of contraction and stability—but at other points becomes confusing due to notation and presentation issues. For example, on page 3, the symbol $g$ is used for two different purposes within the same paragraph, which is likely to cause confusion. Theorem 7 would also benefit from rewording: the first inequality should be explicitly framed as a condition rather than a statement, to avoid ambiguity for new readers.\n\nMinor typographical improvements are also suggested: use of $\\times$ (via `\\times`) for dimension notation (e.g., $44 \\times 8$) instead of \"x\", and reconsideration of the subsection numbering (e.g., why Section 3.2.1 exists instead of being integrated into 3.2). A thorough proofreading pass is recommended to enhance overall readability and precision.\n\n---\n\n**Final Assessment:**\n\nThe paper makes a valuable and novel contribution to the theoretical understanding of stable RNN design. The stability theorems and their application to modular architectures are well-justified and promising. However, the experimental section currently lacks the rigor and transparency needed to fully support the strong performance claims. With more comprehensive reporting, additional data points, and clearer presentation, the work could make a stronger impact. As it stands, the experimental limitations prevent a fully convincing demonstration of the proposed framework’s advantages.\n\n--- \n\nThis version maintains the critical feedback while presenting it in a more formal and constructive tone appropriate for a peer-review setting.",
              "clean_review": "Summary:\n\nThe submission presents a theoretical framework establishing stability conditions for a class of RNNs, supported by novel theorems (Theorems 1–5). These results represent a meaningful advancement in understanding the stability of recurrent networks, particularly through the demonstration that linear contraction alone does not guarantee nonlinear stability—a useful correction to a commonly held assumption. The authors further leverage this theoretical foundation to design hierarchical and feedback-based RNN architectures that achieve state-of-the-art performance on several benchmark tasks.\n\nWhile the theoretical contributions are solid and well-motivated, the experimental evaluation raises several concerns that currently limit the paper’s overall impact and clarity.\n\nTheoretical Contributions:\n\nThe theoretical development is a clear strength of the paper. Theorems 1–5 provide an incremental but valuable progression in the analysis of RNN stability, with precise conditions for contraction and stability in both standalone and interconnected modules. The counterexample illustrating that linear contraction does not suffice for nonlinear stability is instructive and adds nuance to existing intuitions. The extension of these results to modular, hierarchical, and feedback-coupled architectures is well-motivated and theoretically grounded.\n\nExperimental Evaluation:\n\nThe experimental section, while showcasing strong performance, suffers from several methodological and presentational issues that undermine the reliability and interpretability of the findings.\n\nFirst, the reporting of results raises concerns about reproducibility and fairness. For instance, in Table 1, the authors report only the best-performing run (96.94%) of four trials on Perm-MNIST, omitting the mean and variance. 
Selectively reporting the best result from multiple runs is not consistent with best practices in empirical machine learning research and risks overstating performance. I strongly recommend reporting mean performance with standard deviation or error bars across runs. Without such statistics, it is difficult to assess the robustness and generalizability of the claimed results. Moreover, if many trials were conducted, the best result could reflect overfitting to random variation rather than a true performance gain.\n\nSecond, several key claims are supported by insufficient evidence. For example, the assertion that increased modularity improves performance up to a point is based solely on a comparison between two configurations: 44x8 and 22x16. To establish a meaningful trend and draw a statistically sound conclusion, additional data points (e.g., 50x7, 39x9, or other modular splits) are needed. Similarly, claims regarding performance trends with respect to network size and modularity require more systematic variation and replication.\n\nThird, there appears to be a contradiction in the reported trends between Sections 3.1 and 3.2. In Section 3.1, performance increases monotonically with network size, while in Section 3.2, an inverted U-shaped relationship is observed. Likewise, modularity improves performance monotonically in 3.2 but shows an inverted U-shape in 3.1. The paper does not adequately reconcile these conflicting patterns, leaving the reader uncertain about the underlying principles. A more detailed discussion or controlled analysis to explain this discrepancy would greatly strengthen the work.\n\nFinally, the experimental section as a whole lacks a clear narrative structure. It would benefit from better organization—clearly delineated subsections, each focused on a specific hypothesis, along with a summary of key takeaways. 
The inclusion of error bars, statistical significance tests, and more comprehensive ablation studies would also improve confidence in the conclusions.\n\nClarity and Presentation:\n\nThe paper is at times very clear—particularly in its treatment of contraction and stability—but at other points becomes confusing due to notation and presentation issues. For example, on page 3, the symbol g is used for two different purposes within the same paragraph, which is likely to cause confusion. Theorem 7 would also benefit from rewording: the first inequality should be explicitly framed as a condition rather than a statement, to avoid ambiguity for new readers.\n\nMinor typographical improvements are also suggested: use of × (via \\times) for dimension notation (e.g., 44 × 8) instead of \"x\", and reconsideration of the subsection numbering (e.g., why Section 3.2.1 exists instead of being integrated into 3.2). A thorough proofreading pass is recommended to enhance overall readability and precision.\n\nFinal Assessment:\n\nThe paper makes a valuable and novel contribution to the theoretical understanding of stable RNN design. The stability theorems and their application to modular architectures are well-justified and promising. However, the experimental section currently lacks the rigor and transparency needed to fully support the strong performance claims. With more comprehensive reporting, additional data points, and clearer presentation, the work could make a stronger impact. As it stands, the experimental limitations prevent a fully convincing demonstration of the proposed framework’s advantages.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The submission presents a theoretical framework establishing stability conditions for a class of RNNs, supported by novel theorems (Theorems 1–5).",
                      "The authors further leverage this theoretical foundation to design hierarchical and feedback-based RNN architectures that achieve state-of-the-art performance on several benchmark tasks.",
                      "The theoretical development provides precise conditions for contraction and stability in both standalone and interconnected modules.",
                      "The paper includes a counterexample illustrating that linear contraction does not suffice for nonlinear stability.",
                      "The theoretical results are extended to modular, hierarchical, and feedback-coupled architectures.",
                      "In Table 1, the authors report only the best-performing run (96.94%) of four trials on Perm-MNIST, omitting the mean and variance.",
                      "The assertion that increased modularity improves performance up to a point is based solely on a comparison between two configurations: 44x8 and 22x16.",
                      "In Section 3.1, performance increases monotonically with network size, while in Section 3.2, an inverted U-shaped relationship is observed.",
                      "Modularity improves performance monotonically in Section 3.2 but shows an inverted U-shape in Section 3.1.",
                      "On page 3, the symbol g is used for two different purposes within the same paragraph."
                  ],
                  "evaluation": [
                      "These results represent a meaningful advancement in understanding the stability of recurrent networks, particularly through the demonstration that linear contraction alone does not guarantee nonlinear stability—a useful correction to a commonly held assumption.",
                      "While the theoretical contributions are solid and well-motivated, the experimental evaluation raises several concerns that currently limit the paper’s overall impact and clarity.",
                      "The theoretical development is a clear strength of the paper.",
                      "Theorems 1–5 provide an incremental but valuable progression in the analysis of RNN stability.",
                      "The counterexample illustrating that linear contraction does not suffice for nonlinear stability is instructive and adds nuance to existing intuitions.",
                      "The extension of these results to modular, hierarchical, and feedback-coupled architectures is well-motivated and theoretically grounded.",
                      "The experimental section, while showcasing strong performance, suffers from several methodological and presentational issues that undermine the reliability and interpretability of the findings.",
                      "The reporting of results raises concerns about reproducibility and fairness.",
                      "Selectively reporting the best result from multiple runs is not consistent with best practices in empirical machine learning research and risks overstating performance.",
                      "Without such statistics, it is difficult to assess the robustness and generalizability of the claimed results.",
                      "If many trials were conducted, the best result could reflect overfitting to random variation rather than a true performance gain.",
                      "Several key claims are supported by insufficient evidence.",
                      "The paper does not adequately reconcile these conflicting patterns, leaving the reader uncertain about the underlying principles.",
                      "The experimental section as a whole lacks a clear narrative structure.",
                      "The paper is at times very clear—particularly in its treatment of contraction and stability—but at other points becomes confusing due to notation and presentation issues.",
                      "The use of the symbol g for two different purposes within the same paragraph is likely to cause confusion.",
                      "The paper makes a valuable and novel contribution to the theoretical understanding of stable RNN design.",
                      "The stability theorems and their application to modular architectures are well-justified and promising.",
                      "The experimental section currently lacks the rigor and transparency needed to fully support the strong performance claims.",
                      "As it stands, the experimental limitations prevent a fully convincing demonstration of the proposed framework’s advantages."
                  ],
                  "constructiveInput": [
                      "I strongly recommend reporting mean performance with standard deviation or error bars across runs.",
                      "To establish a meaningful trend and draw a statistically sound conclusion, additional data points (e.g., 50x7, 39x9, or other modular splits) are needed.",
                      "Claims regarding performance trends with respect to network size and modularity require more systematic variation and replication.",
                      "A more detailed discussion or controlled analysis to explain the discrepancy in reported trends would greatly strengthen the work.",
                      "The experimental section would benefit from better organization—clearly delineated subsections, each focused on a specific hypothesis, along with a summary of key takeaways.",
                      "The inclusion of error bars, statistical significance tests, and more comprehensive ablation studies would also improve confidence in the conclusions.",
                      "Theorem 7 would benefit from rewording: the first inequality should be explicitly framed as a condition rather than a statement, to avoid ambiguity for new readers.",
                      "Minor typographical improvements are also suggested: use of × (via \\times) for dimension notation (e.g., 44 × 8) instead of \"x\", and reconsideration of the subsection numbering (e.g., why Section 3.2.1 exists instead of being integrated into 3.2).",
                      "A thorough proofreading pass is recommended to enhance overall readability and precision."
                  ],
                  "clarificationDialogue": [],
                  "metaCommentary": [
                      "With more comprehensive reporting, additional data points, and clearer presentation, the work could make a stronger impact."
                  ]
              }
          },
          {
              "review_id": "rewrite_1_qwen3-235b-a22b_54d791db",
              "og_review_id": 1,
              "author": "qwen3-235b-a22b",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Accept",
              "raw_review": "This paper presents a theoretical advance in the construction of modular assemblies of recurrent neural networks (RNNs), addressing a critical gap in the literature: the lack of stability guarantees when combining learned RNN components. While modular architectures—such as those seen in systems like AlphaGo or the Recurrent Independent Mechanisms (RIMs) framework—have demonstrated empirical success and improved tractability over monolithic models, they typically lack formal assurances of dynamical stability. This work fills that gap by introducing a theoretical framework for provably stable combinations of RNNs, leveraging weight constraints and feedback mechanisms. The authors instantiate their theory in a sparse combination network architecture, where pre-defined, stable RNN modules are interconnected via antisymmetric learnable parameters. Remarkably, this constrained model achieves state-of-the-art performance on sequential image classification tasks with significantly fewer trainable parameters, while maintaining theoretical stability.\n\n**Strengths:**\n\n- **Compelling Empirical Validation for a Theoretical Contribution:** Despite its primary focus on theory, the paper delivers strong empirical results. The authors conduct thorough ablation studies exploring various configurations of their modular architecture—varying the number and size of constituent RNNs under absolute value weight constraints—and compare them with an alternative SVD-based constraint approach. Although the SVD variant underperforms in both accuracy and training speed, the absolute value-constrained model not only remains stable but also surpasses existing state-of-the-art methods on standard benchmarks. 
This demonstrates that stability-inducing constraints need not come at the cost of performance, and may even enhance learning efficiency.\n\n- **Significant Theoretical Contribution:** The paper makes a timely and impactful contribution to the theoretical foundations of modular deep learning. As interest grows in decomposing complex models into specialized, reusable components, ensuring the stability of such compositions becomes paramount—especially given the well-known sensitivity of RNNs to initialization and parameter settings. This work provides rigorous conditions under which modular RNN assemblies remain stable, offering a principled foundation for future work in compositional modeling. The stability analysis is not merely academic; it enables the safe reuse and recombination of trained modules, opening the door to more scalable and interpretable architectures.\n\n- **Clear and Accessible Proofs:** The mathematical derivations in the appendix are well-structured and presented with commendable clarity, assuming a reasonable level of background in linear algebra and dynamical systems. The proofs support the main claims effectively and are approachable for readers with the appropriate technical preparation.\n\n**Weaknesses:**\n\n- **High Cognitive Load and Limited Accessibility:** The paper is dense and challenging to parse, even for readers familiar with RNNs and dynamical systems. Key concepts—such as the implications of network stability and the mechanism by which constraints enforce it—require multiple readings and external references to fully grasp. The reliance on supplementary material (e.g., the appendix) is necessary rather than optional, which suggests that core ideas could benefit from better integration into the main text. 
The presentation would be greatly improved by a more intuitive explanation of the architecture, including a clear schematic diagram of the sparse combination network (beyond the weight matrices in Figure 2), pseudocode for the training procedure in the appendix, and perhaps a demonstration of how an unconstrained assembly might diverge during training.\n\n- **Narrow Scope of Applications:** All experimental evaluations are confined to sequential image classification tasks. While the results are impressive within this domain, the broader relevance of the method would be strengthened by testing in other modalities—such as natural language processing, reinforcement learning, or continuous control—where RNNs are commonly used and stability issues are equally pressing. Demonstrating generalization across domains would bolster confidence in the framework’s versatility and practical utility.\n\n**Overall Assessment:**\n\nI recommend acceptance of this paper. Despite its density, the work represents a meaningful step forward in the theoretical understanding of modular RNNs, with promising empirical validation. The combination of stability guarantees and high performance under parameter efficiency constraints suggests a valuable direction for future deep learning systems—particularly as the field moves toward more modular, interpretable, and reusable architectures. That said, the impact of this work could be significantly enhanced by improved exposition and broader experimental validation. Theoretical advances of this nature risk being underutilized if they remain inaccessible to practitioners. Greater emphasis on clarity, visual explanation, and cross-domain applicability would help bridge the gap between theory and practice, enabling wider adoption by engineers and applied researchers.",
              "clean_review": "This paper presents a theoretical advance in the construction of modular assemblies of recurrent neural networks (RNNs), addressing a critical gap in the literature: the lack of stability guarantees when combining learned RNN components. While modular architectures—such as those seen in systems like AlphaGo or the Recurrent Independent Mechanisms (RIMs) framework—have demonstrated empirical success and improved tractability over monolithic models, they typically lack formal assurances of dynamical stability. This work fills that gap by introducing a theoretical framework for provably stable combinations of RNNs, leveraging weight constraints and feedback mechanisms. The authors instantiate their theory in a sparse combination network architecture, where pre-defined, stable RNN modules are interconnected via antisymmetric learnable parameters. Remarkably, this constrained model achieves state-of-the-art performance on sequential image classification tasks with significantly fewer trainable parameters, while maintaining theoretical stability.\n\nStrengths:\n\nCompelling Empirical Validation for a Theoretical Contribution: Despite its primary focus on theory, the paper delivers strong empirical results. The authors conduct thorough ablation studies exploring various configurations of their modular architecture—varying the number and size of constituent RNNs under absolute value weight constraints—and compare them with an alternative SVD-based constraint approach. Although the SVD variant underperforms in both accuracy and training speed, the absolute value-constrained model not only remains stable but also surpasses existing state-of-the-art methods on standard benchmarks. 
This demonstrates that stability-inducing constraints need not come at the cost of performance, and may even enhance learning efficiency.\n\nSignificant Theoretical Contribution: The paper makes a timely and impactful contribution to the theoretical foundations of modular deep learning. As interest grows in decomposing complex models into specialized, reusable components, ensuring the stability of such compositions becomes paramount—especially given the well-known sensitivity of RNNs to initialization and parameter settings. This work provides rigorous conditions under which modular RNN assemblies remain stable, offering a principled foundation for future work in compositional modeling. The stability analysis is not merely academic; it enables the safe reuse and recombination of trained modules, opening the door to more scalable and interpretable architectures.\n\nClear and Accessible Proofs: The mathematical derivations in the appendix are well-structured and presented with commendable clarity, assuming a reasonable level of background in linear algebra and dynamical systems. The proofs support the main claims effectively and are approachable for readers with the appropriate technical preparation.\n\nWeaknesses:\n\nHigh Cognitive Load and Limited Accessibility: The paper is dense and challenging to parse, even for readers familiar with RNNs and dynamical systems. Key concepts—such as the implications of network stability and the mechanism by which constraints enforce it—require multiple readings and external references to fully grasp. The reliance on supplementary material (e.g., the appendix) is necessary rather than optional, which suggests that core ideas could benefit from better integration into the main text. 
The presentation would be greatly improved by a more intuitive explanation of the architecture, including a clear schematic diagram of the sparse combination network (beyond the weight matrices in Figure 2), pseudocode for the training procedure in the appendix, and perhaps a demonstration of how an unconstrained assembly might diverge during training.\n\nNarrow Scope of Applications: All experimental evaluations are confined to sequential image classification tasks. While the results are impressive within this domain, the broader relevance of the method would be strengthened by testing in other modalities—such as natural language processing, reinforcement learning, or continuous control—where RNNs are commonly used and stability issues are equally pressing. Demonstrating generalization across domains would bolster confidence in the framework’s versatility and practical utility.\n\nOverall Assessment:\n\nI recommend acceptance of this paper. Despite its density, the work represents a meaningful step forward in the theoretical understanding of modular RNNs, with promising empirical validation. The combination of stability guarantees and high performance under parameter efficiency constraints suggests a valuable direction for future deep learning systems—particularly as the field moves toward more modular, interpretable, and reusable architectures. That said, the impact of this work could be significantly enhanced by improved exposition and broader experimental validation. Theoretical advances of this nature risk being underutilized if they remain inaccessible to practitioners. Greater emphasis on clarity, visual explanation, and cross-domain applicability would help bridge the gap between theory and practice, enabling wider adoption by engineers and applied researchers.",
              "claim_extraction": {
                  "factualRestatement": [
                      "This paper presents a theoretical advance in the construction of modular assemblies of recurrent neural networks (RNNs), addressing the lack of stability guarantees when combining learned RNN components.",
                      "The work introduces a theoretical framework for provably stable combinations of RNNs, leveraging weight constraints and feedback mechanisms.",
                      "The authors instantiate their theory in a sparse combination network architecture, where pre-defined, stable RNN modules are interconnected via antisymmetric learnable parameters.",
                      "The constrained model achieves state-of-the-art performance on sequential image classification tasks with significantly fewer trainable parameters, while maintaining theoretical stability.",
                      "The authors conduct thorough ablation studies exploring various configurations of their modular architecture, varying the number and size of constituent RNNs under absolute value weight constraints, and compare them with an alternative SVD-based constraint approach.",
                      "The SVD variant underperforms in both accuracy and training speed, but the absolute value-constrained model surpasses existing state-of-the-art methods on standard benchmarks.",
                      "The mathematical derivations in the appendix are well-structured and presented with commendable clarity, assuming a reasonable level of background in linear algebra and dynamical systems.",
                      "All experimental evaluations are confined to sequential image classification tasks."
                  ],
                  "evaluation": [
                      "Modular architectures, such as those seen in systems like AlphaGo or the Recurrent Independent Mechanisms (RIMs) framework, have demonstrated empirical success and improved tractability over monolithic models, but they typically lack formal assurances of dynamical stability.",
                      "This work fills the gap in stability guarantees by introducing a theoretical framework for provably stable combinations of RNNs.",
                      "Despite its primary focus on theory, the paper delivers strong empirical results.",
                      "The absolute value-constrained model not only remains stable but also surpasses existing state-of-the-art methods on standard benchmarks.",
                      "This demonstrates that stability-inducing constraints need not come at the cost of performance, and may even enhance learning efficiency.",
                      "The paper makes a timely and impactful contribution to the theoretical foundations of modular deep learning.",
                      "The stability analysis is not merely academic; it enables the safe reuse and recombination of trained modules, opening the door to more scalable and interpretable architectures.",
                      "The proofs support the main claims effectively and are approachable for readers with the appropriate technical preparation.",
                      "The paper is dense and challenging to parse, even for readers familiar with RNNs and dynamical systems.",
                      "Key concepts require multiple readings and external references to fully grasp.",
                      "The reliance on supplementary material (e.g., the appendix) is necessary rather than optional, which suggests that core ideas could benefit from better integration into the main text.",
                      "The results are impressive within the domain of sequential image classification tasks."
                  ],
                  "constructiveInput": [
                      "The presentation would be greatly improved by a more intuitive explanation of the architecture, including a clear schematic diagram of the sparse combination network (beyond the weight matrices in Figure 2).",
                      "Pseudocode for the training procedure should be included in the appendix.",
                      "Perhaps a demonstration of how an unconstrained assembly might diverge during training would be beneficial.",
                      "The broader relevance of the method would be strengthened by testing in other modalities—such as natural language processing, reinforcement learning, or continuous control—where RNNs are commonly used and stability issues are equally pressing.",
                      "Greater emphasis on clarity, visual explanation, and cross-domain applicability would help bridge the gap between theory and practice, enabling wider adoption by engineers and applied researchers."
                  ],
                  "clarificationDialogue": [],
                  "metaCommentary": [
                      "I recommend acceptance of this paper.",
                      "Despite its density, the work represents a meaningful step forward in the theoretical understanding of modular RNNs, with promising empirical validation.",
                      "The combination of stability guarantees and high performance under parameter efficiency constraints suggests a valuable direction for future deep learning systems—particularly as the field moves toward more modular, interpretable, and reusable architectures.",
                      "The impact of this work could be significantly enhanced by improved exposition and broader experimental validation.",
                      "Theoretical advances of this nature risk being underutilized if they remain inaccessible to practitioners."
                  ]
              }
          },
          {
              "review_id": "rewrite_2_qwen3-235b-a22b_02c09f6f",
              "og_review_id": 2,
              "author": "qwen3-235b-a22b",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "The authors investigate the contraction properties of continuous-time recurrent neural networks (RNNs) and demonstrate that a modular architecture—composed of provably stable RNN modules, or a \"network of networks\"—can achieve competitive performance on several sequential learning benchmarks, including sequential CIFAR-10, even when only the inter-module connections are trained.\n\n**Strengths:**  \nThe problem of effectively assembling multiple RNN modules into a coherent, trainable system is both challenging and underexplored. The theoretical contributions, particularly the analysis of contraction conditions ensuring stability, provide valuable insights for researchers interested in designing provably stable dynamical systems. These results lay a foundation for building modular RNNs with desirable convergence and robustness properties.\n\n**Weaknesses:**  \n1. Section 3.2.1 is somewhat tedious to follow, as it presents detailed results from numerous individual A×B network configurations, which may distract from the broader insights. A more consolidated or high-level summary could improve readability.\n\n2. While Table 1 compares performance against other network types, it would be informative to include a direct ablation comparing the current modular, stable RNN setup with a variant where the internal RNN dynamics are not constrained to be stable (but otherwise identical in architecture and training). This would help clarify the impact of enforcing provable stability.\n\n3. The paper trains only the inter-module connections, keeping intra-module parameters fixed. It would be valuable to know whether end-to-end training of all weights—including within modules—leads to better, worse, or similar performance. In particular, does performance degrade despite increased model capacity, possibly due to instability or optimization challenges?\n\n4. 
The connection between the theoretical stability analysis and the practical training of the modular network could be further clarified. Specifically, is the use of stable RNN modules especially critical in sparsely connected, modular architectures? If so, how does stability interact with modularity to influence trainability and generalization?\n\nOverall, this is a thought-provoking paper that takes a principled and less conventional approach to RNN design, combining provable stability with a modular \"net-of-nets\" architecture. While some parts of the presentation are dense, the paper is generally clear and offers meaningful contributions to the development of structured, theoretically grounded recurrent systems.\n\nI should note that I am not in a position to verify the correctness of the mathematical derivations.",
              "clean_review": "The problem of effectively assembling multiple RNN modules into a coherent, trainable system is both challenging and underexplored. The theoretical contributions, particularly the analysis of contraction conditions ensuring stability, provide valuable insights for researchers interested in designing provably stable dynamical systems. These results lay a foundation for building modular RNNs with desirable convergence and robustness properties.\n\n1. Section 3.2.1 is somewhat tedious to follow, as it presents detailed results from numerous individual A×B network configurations, which may distract from the broader insights. A more consolidated or high-level summary could improve readability.\n\n2. While Table 1 compares performance against other network types, it would be informative to include a direct ablation comparing the current modular, stable RNN setup with a variant where the internal RNN dynamics are not constrained to be stable (but otherwise identical in architecture and training). This would help clarify the impact of enforcing provable stability.\n\n3. The paper trains only the inter-module connections, keeping intra-module parameters fixed. It would be valuable to know whether end-to-end training of all weights—including within modules—leads to better, worse, or similar performance. In particular, does performance degrade despite increased model capacity, possibly due to instability or optimization challenges?\n\n4. The connection between the theoretical stability analysis and the practical training of the modular network could be further clarified. Specifically, is the use of stable RNN modules especially critical in sparsely connected, modular architectures? 
If so, how does stability interact with modularity to influence trainability and generalization?\n\nOverall, this is a thought-provoking paper that takes a principled and less conventional approach to RNN design, combining provable stability with a modular \"net-of-nets\" architecture. While some parts of the presentation are dense, the paper is generally clear and offers meaningful contributions to the development of structured, theoretically grounded recurrent systems.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The paper addresses the problem of effectively assembling multiple RNN modules into a coherent, trainable system.",
                      "The paper trains only the inter-module connections, keeping intra-module parameters fixed."
                  ],
                  "evaluation": [
                      "The problem of effectively assembling multiple RNN modules into a coherent, trainable system is both challenging and underexplored.",
                      "The theoretical contributions, particularly the analysis of contraction conditions ensuring stability, provide valuable insights for researchers interested in designing provably stable dynamical systems.",
                      "These results lay a foundation for building modular RNNs with desirable convergence and robustness properties.",
                      "Section 3.2.1 is somewhat tedious to follow, as it presents detailed results from numerous individual A×B network configurations, which may distract from the broader insights.",
                      "This is a thought-provoking paper that takes a principled and less conventional approach to RNN design, combining provable stability with a modular \"net-of-nets\" architecture.",
                      "While some parts of the presentation are dense, the paper is generally clear and offers meaningful contributions to the development of structured, theoretically grounded recurrent systems."
                  ],
                  "constructiveInput": [
                      "A more consolidated or high-level summary of Section 3.2.1 could improve readability.",
                      "While Table 1 compares performance against other network types, it would be informative to include a direct ablation comparing the current modular, stable RNN setup with a variant where the internal RNN dynamics are not constrained to be stable (but otherwise identical in architecture and training).",
                      "It would be valuable to know whether end-to-end training of all weights—including within modules—leads to better, worse, or similar performance.",
                      "The connection between the theoretical stability analysis and the practical training of the modular network could be further clarified."
                  ],
                  "clarificationDialogue": [
                      "Does performance degrade despite increased model capacity when training all weights, possibly due to instability or optimization challenges?",
                      "Is the use of stable RNN modules especially critical in sparsely connected, modular architectures?",
                      "If so, how does stability interact with modularity to influence trainability and generalization?"
                  ],
                  "metaCommentary": []
              }
          },
          {
              "review_id": "rewrite_3_qwen3-235b-a22b_efdaf5ea",
              "og_review_id": 3,
              "author": "qwen3-235b-a22b",
              "class": "rewrite",
              "prompt": "rewrite",
              "rating": "Borderline",
              "raw_review": "In this paper, the authors investigate stable architectures for recurrent neural networks (RNNs) by studying conditions under which the weight matrix of an RNN is contractive, thereby promoting stability during training. From a theoretical standpoint, they derive a series of sufficient conditions for contractivity, which provide useful insights into designing stable RNNs. On the modeling side, they propose novel RNN architectures that inherently satisfy these contractivity conditions through structured weight matrices. The proposed models are evaluated on standard sequential benchmarks, including sequential MNIST, permuted MNIST, and sequential CIFAR-10, demonstrating competitive performance.\n\nThe theoretical analysis presented in the paper is sound and contributes meaningfully to the understanding of stability in RNNs. While some of the results are intuitive and may not be entirely surprising, they offer a formal framework that can guide architectural design. That said, there is a noticeable gap between the theoretical developments and their direct application in the proposed model—specifically, not all of the derived theorems appear to be fully leveraged or directly applicable to the architecture introduced in Section 3.1.\n\nRegarding the proposed model, the exposition lacks clarity and could benefit from substantial improvement. The introduction of \"subnetworks\" at the beginning of Section 3.1 is particularly unclear, as the term is neither defined nor elaborated upon. The mechanism by which these subnetworks are combined remains ambiguous, and readers are left to infer structural details from the section title (\"Recursive Construction\") and Figure 2. 
A more precise and self-contained description of the architecture, including the recursive assembly process and the resulting block matrix structure, would significantly enhance readability and reproducibility.\n\nMoreover, the technique of parametrizing orthogonal matrices via the exponential map of skew-symmetric matrices—while effective—is not new and has been previously explored in expRNN (Lezcano-Casado & Martínez-Rubio, ICML 2019). This reduces the novelty of the parameterization strategy employed in the model.\n\nThe introduction could also be strengthened. While the discussion touches on broad themes such as AlphaGo and biological modularity, these examples feel tangential. A more focused and comprehensive review of prior work on RNN trainability, vanishing gradients, and long-term dependency challenges would better situate the contribution within the relevant literature and underscore the motivation for stability-preserving architectures.\n\nIn summary, the paper presents theoretically interesting results on contractive RNNs and proposes architectures aligned with these principles. However, the current presentation of the model is insufficiently clear, and certain technical components lack novelty. To meet the acceptance standards, the authors should clarify the model design, strengthen the connection between theory and practice, and provide a more targeted and informative introduction to the problem space.",
              "clean_review": "The theoretical analysis presented in the paper is sound and contributes meaningfully to the understanding of stability in RNNs. While some of the results are intuitive and may not be entirely surprising, they offer a formal framework that can guide architectural design. That said, there is a noticeable gap between the theoretical developments and their direct application in the proposed model—specifically, not all of the derived theorems appear to be fully leveraged or directly applicable to the architecture introduced in Section 3.1.\n\nRegarding the proposed model, the exposition lacks clarity and could benefit from substantial improvement. The introduction of \"subnetworks\" at the beginning of Section 3.1 is particularly unclear, as the term is neither defined nor elaborated upon. The mechanism by which these subnetworks is combined remains ambiguous, and readers are left to infer structural details from the section title (\"Recursive Construction\") and Figure 2. A more precise and self-contained description of the architecture, including the recursive assembly process and the resulting block matrix structure, would significantly enhance readability and reproducibility.\n\nMoreover, the technique of parametrizing orthogonal matrices via the exponential map of skew-symmetric matrices—while effective—is not new and has been previously explored in expRNN (Lezcano-Casado & Martínez-Rubio, ICML 2019). This reduces the novelty of the parameterization strategy employed in the model.\n\nThe introduction could also be strengthened. While the discussion touches on broad themes such as AlphaGo and biological modularity, these examples feel tangential. 
A more focused and comprehensive review of prior work on RNN trainability, vanishing gradients, and long-term dependency challenges would better situate the contribution within the relevant literature and underscore the motivation for stability-preserving architectures.\n\nIn summary, the paper presents theoretically interesting results on contractive RNNs and proposes architectures aligned with these principles. However, the current presentation of the model is insufficiently clear, and certain technical components lack novelty. To meet the acceptance standards, the authors should clarify the model design, strengthen the connection between theory and practice, and provide a more targeted and informative introduction to the problem space.",
              "claim_extraction": {
                  "factualRestatement": [
                      "The paper presents a theoretical analysis of stability in RNNs.",
                      "The paper proposes architectures aligned with the principles of contractive RNNs.",
                      "The model employs the technique of parametrizing orthogonal matrices via the exponential map of skew-symmetric matrices.",
                      "The introduction discusses broad themes such as AlphaGo and biological modularity."
                  ],
                  "evaluation": [
                      "The theoretical analysis presented in the paper is sound and contributes meaningfully to the understanding of stability in RNNs.",
                      "Some of the theoretical results are intuitive and may not be entirely surprising.",
                      "The theoretical results offer a formal framework that can guide architectural design.",
                      "There is a noticeable gap between the theoretical developments and their direct application in the proposed model.",
                      "Not all of the derived theorems appear to be fully leveraged or directly applicable to the architecture introduced in Section 3.1.",
                      "The exposition of the proposed model lacks clarity and could benefit from substantial improvement.",
                      "The introduction of \"subnetworks\" at the beginning of Section 3.1 is particularly unclear, as the term is neither defined nor elaborated upon.",
                      "The mechanism by which these subnetworks is combined remains ambiguous.",
                      "The technique of parametrizing orthogonal matrices via the exponential map of skew-symmetric matrices, while effective, is not new and has been previously explored in expRNN (Lezcano-Casado & Martínez-Rubio, ICML 2019).",
                      "This reduces the novelty of the parameterization strategy employed in the model.",
                      "The examples of AlphaGo and biological modularity in the introduction feel tangential.",
                      "The current presentation of the model is insufficiently clear.",
                      "Certain technical components lack novelty."
                  ],
                  "constructiveInput": [
                      "A more precise and self-contained description of the architecture, including the recursive assembly process and the resulting block matrix structure, would significantly enhance readability and reproducibility.",
                      "The introduction could be strengthened.",
                      "A more focused and comprehensive review of prior work on RNN trainability, vanishing gradients, and long-term dependency challenges would better situate the contribution within the relevant literature and underscore the motivation for stability-preserving architectures.",
                      "The authors should clarify the model design.",
                      "The authors should strengthen the connection between theory and practice.",
                      "The authors should provide a more targeted and informative introduction to the problem space."
                  ],
                  "clarificationDialogue": [
                      "Readers are left to infer structural details from the section title (\"Recursive Construction\") and Figure 2 regarding the combination of subnetworks."
                  ],
                  "metaCommentary": [
                      "To meet the acceptance standards, the authors should clarify the model design, strengthen the connection between theory and practice, and provide a more targeted and informative introduction to the problem space."
                  ]
              }
          }
      ],
      "paper_content_latex_arxiv": "\\begin{abstract}\nRecurrent neural networks (RNNs) are widely used throughout neuroscience as models of local neural activity. Many properties of single RNNs are well characterized theoretically, but experimental neuroscience has moved in the direction of studying multiple interacting \\textit{areas}, and RNN theory needs to be likewise extended. We take a constructive approach towards this problem, leveraging tools from nonlinear control theory and machine learning to characterize when combinations of stable RNNs will themselves be stable. Importantly, we derive conditions which allow for massive feedback connections between interacting RNNs. We parameterize these conditions for easy optimization using gradient-based techniques, and show that stability-constrained `networks of networks' can perform well on challenging sequential-processing benchmark tasks. Altogether, our results provide a principled approach towards understanding distributed, modular function in the brain.\n\\end{abstract}\n\n\\addtocontents{toc}{\\protect\\setcounter{tocdepth}{0}}\n\n\\section{Introduction}\nThe combination and reuse of primitive ``modules\" has enabled a great deal of progress in computer science, engineering, and biology. Modularity is particularly apparent in the structure of the brain, as different parts are specialized for different functions \\citep{kandel2000principles}. Accordingly, most experimental studies throughout the history of neuroscience have focused on a single brain area in association with a single behavior \\citep{Abbott_Svoboda_2020}. Similarly, RNN models of brain function have mostly been limited to a single RNN modeling a single area. However, neuroscience is entering an age where recording from many different brain areas simultaneously during complex behaviors is possible. 
As experimental neuroscience has shifted towards multi-area recordings, computational techniques for analyzing, modeling, and interpreting these multi-area recordings have blossomed \\citep{michaels2020goal,mashour2020conscious,Abbott_Svoboda_2020, Perich_2021,semedo2019cortical,yang2021towards,Machado_Kauvar_Deisseroth_2022}. Despite this, RNN theory has lagged behind.\n\nThe theoretical question of RNN stability is crucial for understanding information propagation and manipulation \\citep{vogt2020lyapunov,engelken2020lyapunov,kozachkov2022robust}. The conditions under which single, autonomous RNNs are chaotic or stable are well-studied, in particular when the RNN weights are randomly chosen and the number of neurons tends to infinity \\citep{sompolinsky1988chaos,engelken2020lyapunov}. However, there is very little work addressing the theoretical question of stability in `networks of networks'. Two facts make this question challenging. Firstly, connecting two stable systems does not, in general, lead to a stable overall system. This is true even for linear systems. Secondly, there is a massive amount of feedback between brain areas, so one cannot reasonably assume near-decomposability \\citep{Simon_1962,Abbott_Svoboda_2020}.\n\nGiven that the brain seems to dynamically reorganize and adapt interareal connectivity to meet task demands and environmental constraints \\citep{miller2001integrative,sych2022dynamic}, this question of how stability is maintained is of the utmost importance. Here we take a bottom-up approach, more specifically asking ``what stability properties of the individual modules lend themselves to rapid reorganization?\"\n\n\\subsection{Contraction Analysis}\\label{subsection:contraction-intro}\nWe focus on a special type of stability, known as contractive stability \\citep{lohmiller1998contraction}. Loosely, a contracting system is a dynamical system that forgets its initial conditions exponentially quickly. 
Contractive stability is a strong form of dynamical stability which implies many other forms of stability, such as certain types of input-to-state stability \\citep{sontag2010contractive}. See Section \\ref{Appendix:Supplementary Math:supp_math} for a mathematical primer on contraction analysis.\n\nContraction analysis has found wide application in nonlinear control theory \\citep{manchester2017control}, synchronization \\citep{pham2007stable}, and robotics \\citep{chung2009cooperative}, but has only recently begun to find application in neuroscience and machine learning \\citep{boffi2020learning,wensing2020beyond,kozachkov2020achieving,revay2020contracting,jafarpour2021robust,kozachkov2022robust, centorrino2022contraction,burghi2022distributed,kozachkov2022generalization}. Contraction analysis is useful for neuroscience because it is directly applicable to systems with external inputs. It also allows for modular stability-preserving \\emph{combination} properties to be derived (Figure \\ref{fig: Cartoon}). The resulting contracting combinations can involve individual systems with different dynamics, as long as those dynamics are contracting \\citep{slotine2001combos,modular}.\n\nMoreover, modular stability and specifically contractive stability have relevance to evolutionary biology~\\citep{Simon_1962, slotine2001combos}. In particular, it is thought that the majority of traits that have developed over the last 400+ million years are the result of evolutionary forces acting on regulatory elements that combine core components, rather than mutations in the core components themselves. This mechanism of action makes meaningful variation in population phenotypes much more feasible to achieve, and is appropriately titled ``facilitated variation\" \\citep{gerhart2007fv}. 
In addition to the biological evidence for facilitated variation, computational models have demonstrated that this approach produces populations which are better able to generalize to new environments \\citep{parter2008fv}, an ability that will be critical to further develop in deep learning systems. However, the tractability of these evolutionary processes hinges on some mechanism for ensuring stability of combinations. Because contraction analysis tools allow complicated contracting systems to be built up recursively from simpler elements, this form of stability would be well suited for biological systems \\citep{slotine2012links}. Our work with the Sparse Combo Net in Section \\ref{section:experiments-intro} has direct parallels to facilitated variation, in that we train this combination network architecture \\textit{only} through training connections between contracting subnetworks.\n\n\\begin{figure}[h]\n\\centering\n\\includegraphics[width=0.95\\textwidth,keepaspectratio]{figs-final/figures_modular.png}\n\\caption{Contractive stability implies a modularity principle. Because contraction analysis tools allow complicated contracting systems to be built up recursively from simpler elements, this form of stability is well suited for understanding biological systems. Contracting combinations can be made between systems with very different dynamics, as long as those dynamics are contracting.}\n\\label{fig: Cartoon}\n\\end{figure}\n\nUltimately, our contributions are three-fold:\n\\begin{itemize}\n    \\setlength\\itemsep{.1em}\n    \\item A novel parameterization for feedback combination of contracting RNNs that enables direct optimization using standard deep learning libraries.\n    \\item  Novel contraction conditions for continuous-time nonlinear RNNs, to use in conjunction with the combination condition. 
We also identify flaws in stability proofs from prior literature.\n    \\item Experiments demonstrating that our `network of networks' sets a new state of the art for stability-constrained RNNs on benchmark sequential processing tasks.\n\\end{itemize}\n\n\\section{Network of Networks Model}\\label{section:model-math}\nIn this paper we analyze rate-based neural networks. Unlike spiking neural networks, these models are continuous and smooth. We consider the following RNN introduced by \\cite{Wilson_Cowan_1972}, which may be viewed as an approximation to a more biophysically-detailed spiking network:\n\\begin{equation}\\label{eq:RNN}\n\\begin{split}\n\\tau \\dot{\\mathbf{x}} = -\\mathbf{x} + \\mathbf{W}\\phi(\\mathbf{x}) + \\mathbf{u}(t)\n\\end{split}\n\\end{equation}\nHere $\\tau > 0$ is the time-constant of the network \\citep{dayan2005theoretical}, and the vector $\\mathbf{x} \\in \\mathbb{R}^n$ contains the voltages of all $n$ neurons in the network. The voltages are converted into firing-rates through a static nonlinearity $\\phi$. We only consider monotonic activation functions with bounded slope: in other words, $0 \\leq \\phi' \\leq g$ (unless otherwise noted). We do not restrain the sign of the nonlinearity. Common example nonlinearities $\\phi(x)$ that satisfy these constraints are hyperbolic tangent and ReLU. The matrix $\\mathbf{W} \\in \\mathbb{R}^{n \\times n}$ contains the synaptic weights of the RNN. It is this matrix that ultimately determines the stability of \\eqref{eq:RNN}, and will be a main target for our analysis. Finally, $\\mathbf{u}(t)$ is the potentially time-varying external input into the network, capturing both explicit input into the RNN from the external world, as well as unmodeled dynamics from other brain areas. In the context of training, the time-varying inputs $\\mathbf{u}(t)$ come from the task. For example, if the task is sequential image recognition, then the input will be a sequence of pixel intensities. 
The output of our network similarly depends on the task. In classification settings, one may choose to use a softmax output layer; in regression, one may choose a linear readout.\n\nNote that \\eqref{eq:RNN} is equivalent to another commonly used class of RNNs where the term $\\mathbf{W}\\mathbf{x} + \\mathbf{u}$ appears inside the nonlinearity. See Section \\ref{Appendix:Supplementary Math:two_different_rnns} or \\citep{miller2012mathematical} for details. Our mathematical results apply equally well to both types of RNNs.\n\nIn order to extend \\eqref{eq:RNN} into a model of multiple interacting neural populations, we introduce the index $i$, which runs from $1$ to $p$, where $p$ is the total number of RNNs in the collection of RNNs. For now we will assume linear interactions between RNNs, because this is the simplest case. The linearity assumption can also be motivated by the fact that RNNs have been found to be well-approximated by linear systems in many neuroscience contexts \\citep{sussillo2013opening,langdon2022latent}. This leads to the following equation for the `network of networks':\n\\begin{equation}\\label{eq:combo_RNN}\n\\begin{split}\n\\tau \\dot{\\mathbf{x}}_i = -\\mathbf{x}_i + \\mathbf{W}_i\\phi(\\mathbf{x}_i) + \\sum_{j=1}^{p} \\mathbf{\\mathbf{L}}_{ij} \\mathbf{x}_j + \\mathbf{u}_i(t) \\hspace{.5cm}  \\forall i = 1 \\cdots p\n\\end{split}\n\\end{equation}\nwhere the matrix $\\mathbf{L}_{ij}$ captures the interaction between RNN $i$ and RNN $j$. If RNN $i$ has $N_i$ neurons and RNN $j$ has $N_j$ neurons, then $\\mathbf{L}_{ij}$ is an $N_i \\times N_j$ matrix.\n\nWe can now formalize the question posed in the introduction. Namely, \"what types of connections between stable RNNs automatically preserve stability?\" becomes \"what restrictions on $\\mathbf{\\mathbf{L}}_{ij}$ must be met in order to ensure overall stability of the network of networks?\". 
We will now derive two combination `primitives', negative feedback and hierarchical, which allow for recursive connection of contracting modules.\n\n\\subsection{Generalized Negative Feedback Between RNNs Preserves Stability}\nWe set aside for a moment the problem of determining when \\eqref{eq:RNN} is contracting. For now, assume that we have a collection of $p$ contracting RNNs interacting through equation \\eqref{eq:combo_RNN}. Recall that contraction is defined with respect to a \\textit{metric}, a way of measuring distances between trajectories in state space. Thus, the $i$th RNN is contracting with respect to some metric $\\mathbf{M}_i$. We assume for simplicity that this metric is constant, which means that $\\mathbf{M}_i$ is simply a symmetric, positive definite matrix. In the case where every RNN receives feedback from every other, we can preserve stability by ensuring these connections are negative feedback. In the simplest case where all RNN modules are contracting in the identity metric, the negative feedback may be written as:\n\\[\\mathbf{L}_{ij} = -\\mathbf{L}^T_{ji} \\]\nThis is a well known result from the contraction analysis literature \\citep{modular}. Our first novel contribution is a generalization and parameterization of this negative feedback which allows for direct optimization using gradient-based techniques. In particular, if each $\\mathbf{L}_{ij}$ is parameterized as follows:\n\\begin{equation}\\label{eq:feedback_combo}\n\\begin{split}\n\\mathbf{L}_{ij} = \\mathbf{B}_{ij} - \\mathbf{M}^{-1}_i\\mathbf{B}^T_{ji}\\mathbf{M}_j \\hspace{.5cm} \\forall i,j\n\\end{split}\n\\end{equation}\nfor arbitrary matrix $\\mathbf{B}_{ij}$, then the overall network of networks retains the assumed contraction properties of the RNN subnetworks. We provide a detailed proof in Section \\ref{subsection:feedback_combo_proof}, but the basic idea relies on ensuring  skew-symmetry of $\\mathbf{L}_{ij}$ \\textit{in the appropriate metric}. 
This can be achieved via the constraint:\n\\[\\mathbf{M}_i\\mathbf{L}_{ij} = -\\mathbf{L}^T_{ji}\\mathbf{M}_j \\]\nPlugging \\eqref{eq:feedback_combo} into the above expression verifies that it is indeed satisfied. Because contraction analysis relies on analyzing the symmetric part of Jacobian matrices, the skew-symmetry of $\\mathbf{L}_{ij}$ `cancels out' when  computing the symmetric part, and leaves the stability of the subnetwork RNNs untouched. In the remainder of the paper, in the experimental sections, we will use feedback combinations constrained by \\eqref{eq:feedback_combo}. However, it is possible to significantly generalize this condition (see \\ref{subsection:feedback_combo_proof} for details). It is also possible to replace the linear feedback between modules with saturating nonlinearities, by using the \"small-gain\" results in \\citet{slotine2001combos} and \\citet{tabareau2006notes}. In the remainder of the paper, we optimize the $\\mathbf{B}_{ij}$ matrices directly using backpropagation and the Adam optimizer implemented in PyTorch \\citep{kingma2014adam,paszke2019pytorch}.\n\n\\paragraph{Recursive Properties of Contracting Combinations}\nThe feedback combination \\eqref{eq:feedback_combo}, taken together with hierarchical combinations, may be used as combination primitives for recursively constructing complicated networks of networks while automatically maintaining stability.  The recursion comes from the fact that once a modular system is shown to be contracting it may be treated as a single contracting system, which may in turn be combined with other contracting systems, \\textit{ad infinitum}. Note that while we require linear feedback connections \\eqref{eq:feedback_combo}, hierarchical interareal connections may be nonlinear \\citep{lohmiller1998contraction}.\n\n\\section{Various Ways to Achieve Local\nContraction}\\label{section:single-rnn}\nIn this section we return to the question of achieving contraction in the subnetwork RNNs. 
Recall that we wish to find restrictions on $\\mathbf{W}_i$ such that the $i$th subnetwork RNN, described by \\eqref{eq:RNN}, is contracting. Here we derive five such novel conditions (see Section \\ref{Appendix:Proofs} for detailed proofs). As we will discuss, not all contraction conditions are equally useful - for example some conditions are easier to optimize or have higher model capacity than others. In this section we also point out some flaws in existing stability proofs in the RNN literature, and suggest some pathways towards correcting them. \\\\\n\n\\vspace{0.1cm}\n\\begin{theorem}[Absolute Value Restricted Weights]\n\\label{theorem: absolutevaluetheorem}\nLet $|\\mathbf{W}|$ denote the matrix formed by taking the element-wise absolute value of $\\mathbf{W}$.  If there exists a positive, diagonal $\\mathbf{P}$ such that:\n\\[\\mathbf{P}(g|\\mathbf{W}|-\\mathbf{I}) +(g|\\mathbf{W}|-\\mathbf{I})^T\\mathbf{P} \\prec 0 \\]\nwith $g$ being the maximum slope of $\\phi$, then \\eqref{eq:RNN} is contracting in metric $\\mathbf{P}$. If $W_{ii} \\leq 0$, then $|W|_{ii}$ may be set to zero to reduce conservatism.\n\\end{theorem}\nIt is easy to find matrices that satisfy Theorem \\ref{theorem: absolutevaluetheorem}, and given a matrix the condition is as easy to check as linear stability is. Moreover, the condition guarantees we can obtain a metric that the system is known to contract in (see Section \\ref{section:init-info} for details). It is less straightforward to enforce this condition during training, however we found that subnetworks constrained by Theorem \\ref{theorem: absolutevaluetheorem} can achieve high performance in practice by simply fixing $\\mathbf{W}$ and only optimizing the connections \\textit{between} subnetworks (Section \\ref{section:main-results}). 
As there are fewer parameters to optimize, this training technique is faster.\n\n\\vspace{0.2cm}\n\\begin{theorem}[Symmetric Weights]\n\\label{theorem: symmetricweightstheorem}\nIf $ \\ \\mathbf{W} = \\mathbf{W}^T$ and $  \\ g\\mathbf{W} \\prec \\mathbf{I}$, and $ \\ \\phi' > 0$, then (\\ref{eq:RNN}) is contracting.\n\\end{theorem}\n\nIt has been known since the early 1990s that if \\eqref{eq:RNN} is autonomous (i.e., the input $\\mathbf{u}$ is constant) and has symmetric weights with eigenvalues less than $1/g$, then there exists a unique fixed point that the network converges to from any initial condition \\citep{matsuoka1992stability}. Theorem \\ref{theorem: symmetricweightstheorem} generalizes this statement to say that if \\eqref{eq:RNN} has symmetric weights with eigenvalues less than $1/g$, it is contracting. This includes previous results as a special case, because an \\textit{autonomous} contracting system has a unique fixed point which the network converges to from any initial condition.\n\n\\vspace{0.2cm}\n\\begin{theorem}[Product of Diagonal and Orthogonal Weights]\n\\label{theorem: PQPtheorem}\nIf there exist positive diagonal matrices $\\mathbf{P}_1$ and $\\mathbf{P}_2$, as well as $\\mathbf{Q} = \\mathbf{Q} ^T \\succ 0$ such that\n\\[ \\mathbf{W} = -\\mathbf{P}_1 \\mathbf{Q} \\mathbf{P}_2 \\]\nthen (\\ref{eq:RNN}) is contracting in metric $\\mathbf{M} = (\\mathbf{P}_1 \\mathbf{Q} \\mathbf{P}_1)^{-1}$.\n\\end{theorem}\n\nIn contrast to the first two contraction conditions, Theorem \\ref{theorem: PQPtheorem} is very easy to optimize. To meet the constraint that the $\\mathbf{P}$ matrices are positive, one can parameterize their diagonal elements as $P_{ii} = d^2_{i} + \\epsilon$, for some small positive constant $\\epsilon$, and optimize $d_i$ directly. 
To meet the positive definiteness constraint on $\\mathbf{Q}$, one may parameterize it as $\\mathbf{Q} = \\mathbf{E}^T\\mathbf{E} + \\epsilon\\mathbf{I}$ and optimize $\\mathbf{E}$ directly.\n\n\\vspace{0.2cm}\n\\begin{theorem}[Triangular Weights]\n\\label{theorem: triangularweightstheorem}\nIf $g\\mathbf{W}-\\mathbf{I}$ is triangular and Hurwitz, then (\\ref{eq:RNN}) is contracting in a diagonal metric.\n\\end{theorem}\n\nTheorem \\ref{theorem: triangularweightstheorem} follows from the fact that a hierarchy of contracting systems is also contracting.\n\n\\vspace{0.2cm}\n\\begin{theorem}[Singular Value Restricted Weights]\n\\label{theorem: singularvaluetheorem}\nIf there exists a positive diagonal matrix $\\mathbf{P}$ such that:\n\\[g^2\\mathbf{W}^T\\mathbf{P}\\mathbf{W} - \\mathbf{P} \\prec 0 \\]\n\\noindent then (\\ref{eq:RNN}) is contracting in metric $\\mathbf{P}$.\n\\end{theorem}\n\nIn the case of discrete-time RNNs, this contraction condition has been proved by many different authors in many different settings. When $\\mathbf{P} = \\mathbf{I}$, it is known as the echo-state condition for discrete-time RNNs \\citep{jaeger2001echo}. This was then generalized to diagonal $\\mathbf{P}$ by \\cite{buehner2006tighter}. More recently, the original echo-state condition was rediscovered by \\cite{miller2018stable} in the machine learning literature. Following this rediscovery, the condition was generalized to $\\mathbf{P} \\neq \\mathbf{I}$ by \\cite{revay2020contracting}. 
Here we show that it applies to continuous-time RNNs as well.\n\n\\subsection{What do the Jacobian Eigenvalues Tell Us?}\\label{section:stability-in-ml}\nSeveral recent papers in ML, e.g., \\citep{haber2017stable,chang2019antisymmetricrnn}, claim that a sufficient condition for stability of the nonlinear system:\n\\[\\dot{\\mathbf{x}} = \\mathbf{f}(\\mathbf{x},t)\\]\nis that the associated Jacobian matrix $\\mathbf{J}(\\mathbf{x},t) = \\frac{\\partial \\mathbf{f}}{\\partial \\mathbf{x}}$ has eigenvalues whose real parts are strictly negative, i.e.,\n\\[\\max_i \\text{Re}(\\lambda_i(\\mathbf{J}(\\mathbf{x},t))) \\leq -\\alpha\\]\nwith $\\alpha>0$. However, this claim is generally false - see Section 4.4.2 in \\citep{slotine1991applied}.\n\nIn the \\textit{specific} case of the RNN \\eqref{eq:RNN}, it appears that the eigenvalues of the symmetric part of $\\mathbf{W}$ do provide information on global stability in a number of applications. For example, in \\citep{matsuoka1992stability} it was shown that if $\\mathbf{W}_s = \\frac{1}{2}(\\mathbf{W} + \\mathbf{W}^T)$ has all its eigenvalues less than unity, and $\\mathbf{u}$ is constant, then \\eqref{eq:RNN} has a unique, globally asymptotically stable fixed point. This condition also implies that the real parts of the eigenvalues of the Jacobian are uniformly negative. Moreover, in \\citep{chang2019antisymmetricrnn} it was shown that setting the symmetric part $\\mathbf{W}_s = \\frac{1}{2}(\\mathbf{W} + \\mathbf{W}^T)$ almost equal to zero (yet slightly negative) led to rotational, yet stable dynamics in practice. 
This leads us to the following theorem, which shows that if the slopes of the activation functions change sufficiently slowly as a function of time, then the condition in \\citep{matsuoka1992stability} in fact implies global contraction of (\\ref{eq:RNN}).\n\n\\vspace{0.2cm}\n\\begin{theorem}\\label{theorem: Wdiagstabtheorem}\nLet $\\mathbf{D}$ be a positive, diagonal matrix with $D_{ii} = \\frac{d\\phi_i}{dx_i}$, and let $\\mathbf{P}$ be an arbitrary, positive diagonal matrix. If:\n\n\\[ (g\\mathbf{W}-\\mathbf{I})\\mathbf{P} + \\mathbf{P}(g\\mathbf{W}^T-\\mathbf{I}) \\preceq -c\\mathbf{P} \\hspace{.5cm} \\text{and} \\hspace{.5cm} \\dot{\\mathbf{D}} - cg^{-1}\\mathbf{D} \\preceq -\\beta\\mathbf{D}\\]\nfor $c,\\beta > 0$, then (\\ref{eq:RNN}) is contracting in metric $\\mathbf{D}$ with rate $\\beta$.\n\\end{theorem}\n\nWe stress however, that it is an open question whether or not diagonal stability of $\\mathbf{W}$ implies that \\eqref{eq:RNN} is contracting. It has been conjectured that diagonal stability of $g\\mathbf{W}-\\mathbf{I}$ is a sufficient condition for global contraction of \\eqref{eq:RNN} \\citep{revay2020lipschitz}, however this has been difficult to prove. To better characterize this conjecture, we present Theorem \\ref{theorem: Wdiagstabcounterexampletheorem}, which shows by way of counterexample that diagonal stability of $g\\mathbf{W}-\\mathbf{I}$ does not imply global contraction in a \\textit{constant} metric for (\\ref{eq:RNN}).\n\\vspace{0.2cm}\n\\begin{theorem}\n\\label{theorem: Wdiagstabcounterexampletheorem}\nSatisfaction of the condition \\hspace{1mm} $g\\mathbf{W}_{sym} - \\mathbf{I} \\prec 0$ \\hspace{1mm} is \\textbf{not} sufficient to show global contraction of the general nonlinear RNN (\\ref{eq:RNN}) in any \\textbf{constant} metric. 
High levels of antisymmetry in $\\mathbf{W}$ can make it impossible to find such a metric, which we demonstrate via a $2 \\times 2$ counterexample of the following form, with $c \\geq 2$ when $g=1$: $\\mathbf{W} = \\begin{bmatrix}\n    0 & -c \\\\\n    c & 0\n  \\end{bmatrix}$\n\\end{theorem}\n\n\\section{Stability-Constrained Network of Networks Perform Well on Benchmarks}\\label{section:experiments-intro}\nA natural concern is that stability of an RNN may come at the cost of its expressivity, which is particularly relevant for integrating information over long timescales. To investigate whether this might be an issue for our model, we trained a stability-constrained network-of-networks on three benchmark sequential image classification tasks: sequential MNIST, permuted seqMNIST, and sequential CIFAR10. These tasks are often used to measure information processing ability over long sequences \\citep{le2015simple}. Images are presented pixel-by-pixel, and the network makes a prediction at the end of the sequence. In permuted seqMNIST, pixels are input in a fixed but random order.\n\nAll of our experiments were done on networks governed by \\eqref{eq:combo_RNN}. The nonlinear subnetwork RNNs were connected to each other via linear all-to-all negative feedback, given by \\eqref{eq:feedback_combo}. For all subnetworks we use the ReLU activation function. To enforce contraction of each individual subnetwork, we focused on two stability constraints from our theoretical results: Theorems \\ref{theorem: absolutevaluetheorem} and \\ref{theorem: singularvaluetheorem}. In the case of Theorem \\ref{theorem: absolutevaluetheorem}, we did not train the individual subnetworks' weight matrices, but only trained the connections \\textit{between} subnetworks (Figure \\ref{fig:network-cartoon}B). 
For Theorem \\ref{theorem: singularvaluetheorem}, we trained all parameters of the model (Figure \\ref{fig:network-cartoon}C).\n\nWe refer to networks with subnetworks constrained by Theorem \\ref{theorem: absolutevaluetheorem} as `Sparse Combo Nets' and to networks with subnetworks constrained by Theorem \\ref{theorem: singularvaluetheorem} as `SVD Combo Nets'. Throughout the experimental results we use the notation `$p \\times n$ network' - such a network consists of $p$ distinct subnetwork RNNs, with each such subnetwork RNN containing $n$ units.\n\n\\begin{figure}[h]\n\\centering\n\\includegraphics[width=\\textwidth,keepaspectratio]{figs-final/Cartoon+Task.pdf}\n\\caption{Summary of task structure and network architectures. Images from MNIST (or CIFAR10) were flattened into an array of pixels and fed sequentially into the modular `network of networks', with classification based on the output at the last time-step. For MNIST, each image was also permuted in a fixed manner (A). The subnetwork `modules' of our architecture were constrained to meet either Theorem \\ref{theorem: absolutevaluetheorem} via sparse initialization (B) or Theorem \\ref{theorem: singularvaluetheorem} via direct parameterization (C). Linear negative feedback connections were trained between the subnetworks according to \\eqref{eq:feedback_combo}.}\n\\label{fig:network-cartoon}\n\\end{figure}\n\n\\subsection{Network Initialization and Training}\\label{section:init-info}\nFor the Sparse Combo Net we were not able to find a parameterization to continuously update the internal RNN weights during training in a way that preserved contraction. However, it is easy to randomly generate matrices with a particular likelihood of meeting the Theorem \\ref{theorem: absolutevaluetheorem} condition by selecting an appropriate sparsity level and limit on entry magnitude. 
Sparsity in particular is of interest due to its relevance in neurobiology and machine learning, so it is convenient that the condition makes it easy to verify stability of many different sparse RNNs. As $g=1$ for ReLU activation, we check potential subnetwork matrices $\\mathbf{W}$ by simply verifying linear stability of $|\\mathbf{W}| - \\mathbf{I}$.\n\nBecause every RNN meeting the condition has a corresponding well-defined stable LTI system contracting in the same metric, it is also easy to find a metric to use in our training algorithm: solving for $\\mathbf{M}$ in $-\\mathbf{I} = \\mathbf{M}\\mathbf{A} + \\mathbf{A}^{T}\\mathbf{M}$ will produce a valid metric for any stable LTI system $\\mathbf{A}$ \\citep{slotine1991applied}. We utilize the fact that Hurwitz Metzler matrices are diagonally stable to improve efficiency of computing $\\mathbf{M}$ (as well as in our proof of Theorem \\ref{theorem: absolutevaluetheorem}).\n\n  We therefore randomly generated fixed subnetworks satisfying Theorem \\ref{theorem: absolutevaluetheorem} and trained only the linear connections between them (Figure \\ref{figure:example-training}), as well as the linear input and output layers. More information on network initialization, hyperparameter tuning, and training algorithm is provided in Section \\ref{Appendix:Experiments}.\n\n\\begin{figure}[h]\n\\centering\n\\includegraphics[width=\\textwidth,keepaspectratio]{figs-final/ExampleTrainingText.pdf}\n\\caption{Example $3$x$16$ Sparse Combo Net. Nonlinear intra-subnetwork weights are initialized using a set sparsity, and do not change in training (A). 
Linear inter-subnetwork connections are constrained to be antisymmetric with respect to the overall network metric, and are updated in training (B).}\n\\label{figure:example-training}\n\\end{figure}\n\nFor the SVD Combo Net on the other hand, we ensured contraction of each subnetwork RNN by direct parameterization (described in Section \\ref{Appendix:SVDNet}), thus allowing all weights to be trained.\n\n\\subsection{Results}\\label{section:main-results}\nThe Sparse Combo Net architecture achieved the highest overall performance on both permuted seqMNIST and seqCIFAR10, with 96.94\\% and 65.72\\% best test accuracies respectively - thereby setting a new SOTA for stable RNNs (Table \\ref{table:sota}). Furthermore, we were able to reproduce SOTA scores over several repetitions, including 10 trials of seqCIFAR10. Along with repeatability of results, we also show that the contraction constraint on the connections between subnetworks ($\\mathbf{L}$ in \\eqref{eq:feedback_combo}) is important for performance, particularly in the Sparse Combo Net (Section \\ref{section:repeats}).\n\nAdditionally, we profile how various architecture settings impact performance of our networks. In both networks, we found that increasing the total number of neurons improved task performance, but with diminishing returns (Section \\ref{section: size}). We also found that the sparsity of the hidden-to-hidden weights in Sparse Combo Net had a large impact on the final network performance (Section \\ref{section: sparse}).\n\n\\begin{table}[h!]\n\\small\n\\centering\n\\begin{tabular}{ | m{1.75cm} || m{0.7cm} | m{0.8cm} || m{1.3cm} | m{1.3cm} | m{1.3cm} || m{0.85cm} | m{0.85cm} | m{0.85cm} | }\n\\hline\n Name & Stable RNN? 
& Params & sMNIST \\newline Repeats \\newline Mean (n) \\newline [Min] & psMNIST \\newline Repeats \\newline Mean (n) \\newline [Min] & sCIFAR10 \\newline Repeats \\newline Mean (n) \\newline [Min] & Seq \\newline MNIST \\newline Best & PerSeq \\newline MNIST \\newline Best & Seq \\newline CIFAR \\newline Best \\\\\n\\hline\\hline\nLSTM \\citep{chang2019antisymmetricrnn} & & 68K & \\centering --- & \\centering --- & \\centering --- & 97.3\\% & 92.7\\% & 59.7\\% \\\\\n\\hline\nTransformer \\newline \\citep{trinh2018cifar} & & 0.5M & \\centering --- & \\centering --- & \\centering --- & 98.9\\% & 97.9\\% & 62.2\\% \\\\\n\\hline\\hline\nAntisymmetric \\newline \\citep{chang2019antisymmetricrnn} & ? & 36K & \\centering --- & \\centering --- & \\centering --- & 98\\% & 95.8\\% & 58.7\\% \\\\\n\\hline\n\\rowcolor{Gray}\nSparse Combo Net & \\checkmark & 130K & \\centering --- & \\textbf{96.85\\%} (4) \\newline [\\textbf{96.65\\%}] & 64.72\\% (10) \\newline [63.73\\%] & 99.04\\% & \\textbf{96.94\\%} & \\textbf{65.72\\%} \\\\\n\\hline\nLipschitz \\newline \\citep{erichson2021lipschitz} & \\checkmark & 134K & 99.2\\% (10) \\newline [99.0\\%] & 95.9\\% (10) \\newline [95.6\\%] & \\centering --- & \\textbf{99.4\\%} & 96.3\\% & 64.2\\% \\\\\n\\hline\\hline\nCKConv \\newline \\citep{romero2021ckconv} & & 1M & \\centering --- & \\centering --- & \\centering --- & 99.32\\% & 98.54\\% & 63.74\\% \\\\\n\\hline\nS4 \\citep{gu2022s4} & & 7.9M & \\centering --- & \\centering --- & \\centering --- & \\textbf{99.63\\%} & \\textbf{98.7\\%} & \\textbf{91.13\\%} \\\\\n\\hline\nTrellis \\citep{bai2019trellis} & & 8M & \\centering --- & \\centering --- & \\centering --- & 99.2\\% & 98.13\\% & 73.42\\% \\\\\n\\hline\n\\end{tabular}\n\\caption{Published benchmarks for sequential MNIST, permuted MNIST, and sequential CIFAR10 best test accuracy. Architectures are grouped into three categories: baselines, best performing RNNs with claimed stability guarantee*, and networks achieving overall SOTA. 
Within each grouping, networks are ordered by number of trainable parameters (for CIFAR10 if it differed across tasks). Our network is highlighted. Where possible, we include information on repeatability.\\\\\n*For more on stability guarantees in machine learning, see Section \\ref{section:stability-in-ml}}\n\\label{table:sota}\n\\end{table}\n\\subsubsection{Experiments with Network Size} \\label{section: size}\nUnderstanding the effect of size on network performance is important to practical application of these architectures. For both Sparse Combo Net and SVD Combo Net, increasing the number of subnetworks while holding other settings constant (including fixing the size of each subnetwork at 32 units) was able to increase network test accuracy on permuted seqMNIST to a point (Figure \\ref{figure:test-sizes}).\n\nThe greatest performance jump happened when increasing from one module (37.1\\% Sparse Combo Net, 61.8\\% SVD Combo Net) to two modules (89.1\\% Sparse Combo Net, 92.9\\% SVD Combo Net). After that the performance increased steadily with number of modules until saturating at $\\sim 97\\%$ for Sparse Combo Net and $\\sim 95\\%$ for SVD Combo Net.\n\nAs the internal subnetwork weights are not trained in Sparse Combo Net, it is unsurprising that its performance was substantially worse at the smallest sizes. However Sparse Combo Net surpassed SVD Combo Net by the $12 \\times 32$ network size, which contains a modest 384 total units. Due to the better performance of the Sparse Combo Net, we focused additional analyses there. Note also that the SVD Combo Net never reached 55\\% test accuracy for CIFAR10 in our early experiments.\n\nWe then evaluated task performance as the \\emph{modularity} of a Sparse Combo Net (fixed to have 352 total units) was varied. We observed an inverse U shape (Figure \\ref{figure:test-sizes-sup}B), with poor performance of a $1 \\times 352$ net and an $88 \\times 4$ net, and best performance from a $44 \\times 8$ net. 
However, this experiment compared similar sparsity levels, while in practice we can achieve better performance with larger subnetworks by leveraging sparsity in a way not possible for smaller ones.\n\n\n\\begin{figure}[h]\n\\centering\n\\includegraphics[width=0.85\\textwidth,keepaspectratio]{figs/perf_vs_num_sub_nets.png}\n\\caption{Permuted seqMNIST performance plotted against the number of subnetworks. Each subnetwork has 32 neurons. Results are shown for both Sparse Combo Net and SVD Combo Net.}\n\\label{figure:test-sizes}\n\\end{figure}\n\n\\subsubsection{Experiments with Sparsity}\\label{section: sparse}\nBecause of the link between sparsity and stability as well as the biological relevance of sparsity, we explored in detail how subnetwork sparsity affects the performance of Sparse Combo Net. We ran a number of experiments on the permuted seqMNIST task, varying sparsity level while holding network size and other hyperparameters constant. Here we use \"$n\\%$ sparsity level\" to refer to a network with subnetworks that have just $n\\%$ of their weights non-zero.\n\nWe observed a large ($>5$ percentage point) performance boost when switching from a $26.5\\%$ sparsity level to a $10\\%$ sparsity level in the $11 \\times 32$ Sparse Combo Net (Figure \\ref{figure:test-sparsity}), and subsequently decided to test significantly sparser subnetworks in a $16 \\times 32$ Sparse Combo Net. We trained networks with sparsity levels of $5\\%$, $3.3\\%$, and $1\\%$, as well as $10\\%$ for baseline comparison (Figure \\ref{figure:test-sparsity-sup}A). A $3.3\\%$ sparsity level produced the best results, leading to our SOTA performance for stable networks on both permuted seqMNIST and seqCIFAR10. 
With a component RNN size of just 32 units, this sparsity level is small, containing only one or two directional connections per neuron on average (Figure \\ref{figure:example-best-net}).\n\nAs sparsity had such a positive effect on task performance, we did additional analyses to better understand why. We found that decreasing the magnitude of non-zero elements while holding sparsity level constant decreased task performance (Figure \\ref{figure:test-sparsity-sup}B), suggesting that the effect is driven in part by the fact that sparsity enables higher magnitude non-zero elements while still maintaining stability.\n\n\\begin{figure}[h]\n\\centering\n\\includegraphics[width=0.95\\textwidth,keepaspectratio]{figs/perf_during_training_mnist_vs_sparsity.png}\n\\caption{Permuted seqMNIST performance over the course of training for two $11 \\times 32$ Sparse Combo Nets with different sparsity levels.}\n\\label{figure:test-sparsity}\n\\end{figure}\n\nThe use of sparsity in subnetworks to improve performance suggests another interesting direction that could enable better scalability of total network size - enforcing sparsity in the linear feedback weight matrix ($\\mathbf{L}$). We performed pilot testing of this idea, which showed promise in mitigating the saturation effect seen in Figure \\ref{figure:test-sizes}. Those results are detailed in Section \\ref{Appendix:scalability-discuss} (Table \\ref{table:scalability}).\n\n\\subsubsection{Repeatability and Controls}\\label{section:repeats}\nSparse Combo Net does not have the connections within its subnetworks trained, so network performance could be particularly susceptible to random initialization. Thus we ran repeatability studies on permuted sequential MNIST and sequential CIFAR10 using our best network settings ($16 \\times 32$ with subnetwork sparsity level of $3.3\\%$) and an extended training period. 
Mean performance over 4 trials of permuted seqMNIST was 96.85\\% with 0.019 variance, while mean performance over 10 trials of seqCIFAR10 was 64.72\\% with 0.406 variance. Note we also ran a number of additional experiments on size and sparsity settings, described in Section \\ref{Appendix:extended-results}.\n\nAcross the permuted seqMNIST trials, best test accuracy always fell between $96.65\\%$ and $96.94\\%$, a range much smaller than the differences seen with changing sparsity settings and network size. Three of the four trials showed best test accuracy $\\geq 96.88\\%$, despite some variability in early training performance (Figure \\ref{figure:test-reproduce}). Similarly, eight of the ten seqCIFAR10 trials had test accuracy $>64.3\\%$, with all results falling between 63.73\\% and 65.72\\% (Figure \\ref{figure:cifar-reps}). This robustly establishes a new SOTA for stable RNNs, comfortably beating the previously reported (single run) 64.2\\% test accuracy achieved by Lipschitz RNN \\citep{erichson2021lipschitz}.\n\nAs a control study, we also tested how sensitive the Sparse Combo Net was to the stabilization condition on the interconnection matrix ($\\mathbf{L}$ in \\eqref{eq:feedback_combo}). To do so, we initialized the individual RNN modules in a $24 \\times 32$ network as before, but set $\\mathbf{L}=\\mathbf{B}$ and did not constrain $\\mathbf{B}$ at all during training, thus no longer ensuring contraction of the overall system. This resulted in 47.0\\% test accuracy on the permuted seqMNIST task, a stark decrease from the original 96.7\\% test accuracy - thereby demonstrating the utility of the contraction condition.\n\n\\section{Discussion}\nBiologists have long noted that modularity provides organisms with stability and robustness \\citep{Kitano_2004}. The other direction -- that stability and robustness provide modularity -- is well known to engineers \\citep{khalil2002nonlinear,slotine1991applied,modular}, but has been less appreciated in biology. 
We use this principle to build and train provably stable assemblies of recurrent neural networks. Like real brains, the components of our \"RNN of RNNs\" can communicate with one another through a mix of hierarchical and feedback connections. In particular, we theoretically characterized conditions under which an RNN of RNNs will be stable, given that each individual RNN is stable. We also provided several novel stability conditions for single RNNs that are compatible with these stability-preserving interareal connections. Our results contribute towards understanding how the brain maintains stable and accurate function in the presence of massive interareal feedback, as well as external inputs.\n\nThe question of neural stability is one of the oldest questions in computational neuroscience. Indeed, cyberneticists were concerned with this question before the term `computational neuroscience' existed \\citep{wiener2019cybernetics,ashby2013design}. Stability is a central component in several influential neuroscience theories \\citep{hopfield1982neural,seung1996brain,murphy2009balanced}, perhaps the most well-known being that memories are stored as stable point attractors \\citep{hopfield1982neural}. Our work shows that stability continues to be a key concept for computational neuroscience as the field transitions from focusing on single brain areas to many interacting brain areas.\n\nWhile primarily motivated by neuroscience, our approach is also relevant for machine learning. Deep learning models can be as inscrutable as they are powerful. This opacity limits conceptual progress and may be dangerous in safety-critical applications like autonomous driving or human-centered robotics. 
Given that stability is a fundamental property of dynamical systems -- and is intimately linked to concepts of control, generalization, efficiency, and robustness -- the ability to guarantee stability of a recurrent model will be important for ensuring deep networks behave as we expect them to \\citep{richards2018lyapunov,choromanski2020ode,revay2021recurrent,rodriguez2022lyanet}.\n\nIn the case of RNNs, one difficulty is that providing a certificate of \\textit{stability} is often impossible or computationally impractical. However, the stability conditions we derive here allow for recursive construction of complicated RNNs while automatically preserving stability. By parameterizing our conditions for easy optimization using gradient-based techniques, we successfully trained our architecture on challenging sequential processing benchmarks. The high test accuracy our networks achieved with a small number of trainable parameters demonstrates that stability does not necessarily come at the cost of expressivity. Thus, our results likewise contribute towards understanding stability certification of RNNs.\n\nIn future work, we will explore how our contraction-constrained RNNs of RNNs perform on a variety of neuroscience tasks, in particular tasks with multimodal structure \\citep{yang2019task}. Our approach is particularly compatible with \"global workspace\" models, in which different networks communicate via a shared latent space \\citep{tabareau2006notes,newell1972human,baars1993cognitive,dehaene1998neuronal,vanrullen2021deep,goyal2021coordination}. One desideratum for these future models is that they learn representations which are formally similar to those observed in the brain \\citep{yamins2014performance, schrimpf2020brain, williams2021generalized}, in complement with the structural similarities already shared. 
Moreover, a \"network of networks\" approach will be especially relevant to challenging multimodal machine learning problems, such as the simultaneous processing of audio and video. Therefore the advancement of neuroscience theory and machine learning remain hand-in-hand for our next lines of questioning. Indeed, combinations of trained networks have already seen groundbreaking success in DeepMind's AlphaGo \\citep{silver2016mastering}.\n\nAs well as the many potential experimental applications, there are numerous theoretical future directions suggested by our work. Networks with more biologically-plausible weight update rules, such as models discussed in \\citep{kozachkov2020achieving}, would be a fruitful neuroscience context in which to explore our conditions. One promising avenue of study there is to examine input-dependent stability of the learning process. In the context of machine learning, our stability conditions could be applied to the end-to-end training of multidimensional recurrent neural networks \\citep{graves2007multi}, which have clear structural parallels to our RNNs of RNNs but lack known stability guarantees.\n\nIn sum, recursively building network combinations in an effective and stable fashion while also allowing for continual refinement of the individual networks, as nature does for biological networks, will require new analysis tools. Here we have taken a concrete step towards the development of such tools, not only through our theoretical results, but also through their application to create stable combination network architectures that perform well in practice on benchmark tasks.\n\n\\begin{ack}\n This work benefited from stimulating discussions with Michael Happ, Quang-Cuong Pham, and members of the Fiete lab at MIT.\n\\end{ack}",
      "submission_id": "qTBC7E4c454",
      "parsed_pdf_content_blind_submission_open_review": "# ABSTRACT\n\nAdvanced applications of modern machine learning will likely involve combinations of trained networks, as are already used in spectacular systems such as DeepMind's AlphaGo. Recursively building such combinations in an effective and stable fashion while also allowing for continual refinement of the individual networks - as nature does for biological networks - will require new analysis tools. This paper takes a step in this direction by establishing contraction properties of broad classes of nonlinear recurrent networks and neural ODEs, and showing how these quantified properties in turn allow stable networks of networks to be constructed recursively in a systematic fashion. The results can also be used to stably combine recurrent networks and physical systems with quantified contraction properties. Similarly, they may be applied to modular computational models of cognition. We perform experiments with these combined networks on benchmark sequential tasks (e.g. permuted sequential MNIST) to demonstrate their capacity for processing information across a long timescale in a provably stable manner.\n\n# 1 INTRODUCTION\n\nNeuro-inspired machine learning has profoundly altered many fields such as computer vision, natural language processing, and computational neuroscience (Bengio et al., 2017; Hassabis et al., 2017). While models trained with e.g. deep learning are remarkably powerful, they are for the most part 'black boxes'. This opaqueness can be dangerous in safety-critical applications, such as autonomous driving or human-centered robotics, and it limits conceptual progress. In the case of recurrent models, one difficulty is that providing a certificate of *stability* is currently impossible or computationally impractical. 
Given that stability is a fundamental property of dynamical systems – and is intimately linked to concepts of control, generalization, data-efficiency, and robustness – being able to guarantee the stability of a recurrent model is an important step towards making sure deep models behave as we expect them to.\n\nIn this spirit, there has been a recent influx of work focusing on applications of contraction analysis (Lohmiller & Slotine, 1998) to recurrent models. Loosely speaking, a dynamical system is said to be contracting if any two of its trajectories converge to each other exponentially, regardless of initial conditions. A primary advantage of contraction analysis is that it is directly applicable to non-autonomous systems, which the vast majority of recurrent models are, allowing in turn modular contraction-preserving combination properties to be derived (Slotine & Lohmiller, 2001; Slotine, 2003). We include a mathematical primer on contraction analysis in Section A1.\n\nWithin this line of research, our paper has two main aims: 1) to provide simple contraction conditions for *continuous-time* recurrent neural networks and 2) to show how these continuous-time contraction conditions imply a combination property. Building on both aims, we proceed to implement stable combination networks that exhibit state-of-the-art (SOTA) performance on multiple sequential image classification tasks with a small number of trainable parameters.\n\n### 1.1 PREVIOUS WORK ON RNN STABILITY\n\nTo briefly review the current literature on application of contraction analysis to recurrent models, we first note that the 'Echo-State Condition' introduced in Jaeger (2001) is equivalent to discrete-time contraction in the identity metric. A later generalization of this condition included a diagonal metric (Buehner & Young, 2006). 
In the context of neuroscience, contraction analysis has been applied to analyzing the dynamics of winner-take-all networks (Rutishauser et al., 2011; 2015) as well as networks with synaptic plasticity (Kozachkov et al., 2020). In the machine learning context, Miller and Hardt recently rederived an 'echo-state property' for discrete recurrent models, and went on to prove that these contracting recurrent models could be well-approximated by feedforward networks in certain cases (Miller & Hardt, 2018). More recently still, in a series of papers Revay, Wang, and Manchester applied contraction analysis to discrete-time recurrent networks (Revay & Manchester, 2020; Revay et al., 2021; 2020a), expanding the class of models considered in Miller & Hardt (2018).\n\nIn addition to contraction (which amounts to a strong form of non-autonomous exponential stability) there has been a considerable amount of work attempting to enforce weaker forms of stability in RNNs, such as *autonomous* stability (Erichson et al., 2021; Chang et al., 2019). Unfortunately, as we discuss in Section 2.3, autonomous stability does not in general imply non-autonomous stability, so it is not clear what stability properties these RNNs possess when driven with external input.\n\nThere is also a line of work which uses orthogonal weight matrices to avoid the vanishing/exploding gradient problem during training. Orthogonality is typically ensured during training via a parameterization (Arjovsky et al., 2016; Lezcano-Casado & Martínez-Rubio, 2019) – for example, by exploiting the fact that the matrix exponential of a skew-symmetric matrix is orthogonal, as is done in Lezcano-Casado & Martínez-Rubio (2019). These works focus on parameterizations of *individual* RNNs which ensure stability is preserved during training. 
By contrast, our work focuses on 'combined' RNNs (i.e., networks of networks) and provides a parameterization on the *connections between stable subnetworks* such that training preserves the overall network stability (Figure 1).\n\n![](_page_1_Picture_4.jpeg)\n\nFigure 1: Our stability certificate implies a modularity principle. It may be used to recursively construct complicated 'networks of networks' while automatically maintaining stability.\n\n### 1.2 COMBINATION NETWORKS\n\nThe combination and reuse of primitive \"modules\" has enabled a great deal of progress in computer science, and is also a key theme in biological evolution, particularly apparent in the cortical structure of the human brain. In fact, it is thought that the majority of traits that have developed over the last 400+ million years are the result of evolutionary forces acting on regulatory elements that combine core components, rather than mutations to the core components themselves. This mechanism of action makes meaningful variation in population phenotypes much more feasible to achieve, and is appropriately titled \"facilitated variation\" (Gerhart & Kirschner, 2007). In addition to the biological evidence for facilitated variation, computational models have demonstrated that this approach produces populations that are better able to generalize to new environments (Parter et al., 2008), an ability that will be critical to further develop in deep learning systems.\n\nWhile the benefits of building modular systems are clear (Simon, 1962), as in DeepMind's AlphaGo for example (Silver et al., 2016), there is no general guarantee that a combination of stable systems will itself be stable. Thus the tractability of these evolutionary processes hinges on some mechanism for ensuring stability of combinations. 
Because contraction analysis tools allow complicated contracting systems to be built up recursively from simpler elements, this form of stability is well suited for biological systems (Slotine & Lohmiller, 2001; Slotine & Liu, 2012). Note also that contracting combinations can be made between systems with very different dynamics, as long as those dynamics are contracting. Here, we describe two common forms of contracting system combinations – hierarchical and feedback – that automatically guarantee overall system stability (Figure 1).\n\nUltimately, cognitive models are moving increasingly toward study of multi-area dynamics, but many questions remain on how to best train and evaluate such networks (Yang et al., 2018; Yang & Molano-Mazon, 2021). Understanding how different brain regions interact harmoniously to produce a unified percept/action will require new ideas and analysis tools.\n\n# 2 MATHEMATICAL RESULTS\n\n### 2.1 CONTRACTION CONDITIONS FOR INDIVIDUAL, CONTINUOUS-TIME RNNS\n\nWe consider the continuous-time RNN evolving according to the following equation:\n\n$$\\tau \\dot{\\mathbf{x}} = -\\mathbf{x} + \\mathbf{W}\\phi(\\mathbf{x}) + \\mathbf{u}(t) \\tag{1}$$\n\nwhere $\\mathbf{x} \\in \\mathbb{R}^n$, $\\phi$ is a static nonlinearity such that $0 \\le \\phi' \\le g$, $\\mathbf{u}(t)$ is some input (potentially time-varying) and $\\tau > 0$. We do not constrain the sign of $\\phi$. Example nonlinearities $\\phi(x)$ that satisfy the constraints are $\\tanh(ax)$, $\\log(1 + e^x)$, and $\\max(0, x)$, with $g = a, 1, 1$ respectively. Note that this class of RNNs is equivalent to another commonly used class of RNNs where the term $\\mathbf{W}\\mathbf{x} + \\mathbf{u}$ appears inside $\\phi(\\cdot)$. See Section A2 or (Miller & Fumarola, 2012) for details. Our mathematical results therefore apply equally well to both types of RNNs.\n\nWe seek a stability certificate for this system in terms of the recurrent weight matrix $\\mathbf{W}$. We are specifically interested in restricting $\\mathbf{W}$ such that the RNN is globally *contracting*. 
We do this with the goal of recursively combining these contracting RNNs with other contracting RNNs to make large, complicated, stable 'networks of networks'. Here we derive five restrictions on $\\mathbf{W}$ which ensure contraction for the continuous-time RNN defined by equation 1. To the best of our knowledge these contraction conditions are novel. All proofs can be found in the supplemental Section A4.\n\n**Theorem 1** (Absolute Value Restricted Weights). *Let $|\\mathbf{W}|$ denote the matrix formed by taking the element-wise absolute value of $\\mathbf{W}$. If there exists a positive, diagonal $\\mathbf{P}$ such that:*\n\n$$\\mathbf{P}(g|\\mathbf{W}| - \\mathbf{I}) + (g|\\mathbf{W}| - \\mathbf{I})^T \\mathbf{P} \\prec 0$$\n\n*then equation 1 is contracting in metric $\\mathbf{P}$. Moreover, if $W_{ii} \\le 0$, then $|W|_{ii}$ may be set to zero to reduce conservatism.*\n\nNote that when $g = 1$, Theorem 1 can be verified by checking linear stability of $|\\mathbf{W}| - \\mathbf{I}$.\n\n**Theorem 2** (Symmetric Weights). *If $\\mathbf{W} = \\mathbf{W}^T$ and $g\\mathbf{W} \\prec \\mathbf{I}$, then (1) is contracting.*\n\n**Theorem 3** (Product of Diagonal and Orthogonal Weights). *If there exist positive diagonal matrices $\\mathbf{P}_1$ and $\\mathbf{P}_2$, as well as $\\mathbf{Q} = \\mathbf{Q}^T \\succ 0$ such that*\n\n$$\\mathbf{W} = -\\mathbf{P}_1 \\mathbf{Q} \\mathbf{P}_2$$\n\n*then (1) is contracting in metric $\\mathbf{M} = (\\mathbf{P}_1 \\mathbf{Q} \\mathbf{P}_1)^{-1}$.*\n\n**Theorem 4** (Triangular Weights). *If $g\\mathbf{W} - \\mathbf{I}$ is triangular and Hurwitz, then (1) is contracting in a diagonal metric.*\n\n**Theorem 5** (Singular Value Restricted Weights). *If there exists a positive diagonal matrix $\\mathbf{P}$ such that:*\n\n$$g^2 \\mathbf{W}^T \\mathbf{P} \\mathbf{W} - \\mathbf{P} \\prec 0$$\n\n*then (1) is contracting in metric $\\mathbf{P}$.*\n\n### 2.2 THE MODEL: NETWORK OF NETWORKS\n\nThe RNN in equation 1 is a *subnetwork* of our model. Our goal is to combine these subnetworks into a 'network of networks' in a manner that preserves the stability of the underlying modules. 
In particular, we seek to parameterize and learn the inter-module connections. To do this, we prove and make extensive use of the following theorem:\n\n**Theorem 6** (Network of Networks). Consider a collection of $p$ subnetwork RNNs governed by equation 1. Assume that these RNNs each have hidden-to-hidden weight matrices $\\{\\mathbf{W}_1,\\ldots,\\mathbf{W}_p\\}$ and are independently contracting in metrics $\\{\\mathbf{M}_1,\\ldots,\\mathbf{M}_p\\}$. Define the block matrices $\\tilde{\\mathbf{W}} \\equiv \\operatorname{BlockDiag}(\\mathbf{W}_1,\\ldots,\\mathbf{W}_p)$ and $\\tilde{\\mathbf{M}} \\equiv \\operatorname{BlockDiag}(\\mathbf{M}_1,\\ldots,\\mathbf{M}_p)$, as well as the overall state vector $\\tilde{\\mathbf{x}}^T \\equiv (\\mathbf{x}_1^T \\cdots \\mathbf{x}_p^T)$. Then the following 'network of networks' is globally contracting in metric $\\tilde{\\mathbf{M}}$:\n\n$$\\tau \\dot{\\tilde{\\mathbf{x}}} = -\\tilde{\\mathbf{x}} + \\tilde{\\mathbf{W}}\\phi(\\tilde{\\mathbf{x}}) + \\mathbf{u}(t) + \\mathbf{L}\\tilde{\\mathbf{x}}$$\n\n$$\\mathbf{L} = \\mathbf{B} - \\tilde{\\mathbf{M}}^{-1}\\mathbf{B}^{T}\\tilde{\\mathbf{M}} \\tag{2}$$\n\nwhere $\\mathbf{B}$ is an arbitrary square matrix. Note that we are agnostic to the particular contraction condition here – the subnetwork RNNs can satisfy any of the theorems in the preceding section.\n\n### 2.3 ON STABILITY THEOREMS FOR RNNs\n\nSeveral recent papers in machine learning, e.g., (Haber & Ruthotto, 2017; Chang et al., 2019), claim that a sufficient condition for stability of the nonlinear system:\n\n$$\\dot{\\mathbf{x}} = \\mathbf{f}(\\mathbf{x}, t)$$\n\nis that the associated Jacobian matrix $\\mathbf{J}(\\mathbf{x},t)=\\frac{\\partial \\mathbf{f}}{\\partial \\mathbf{x}}$ has eigenvalues whose real parts are strictly negative, i.e.:\n\n$$\\max_{i} \\operatorname{Re}(\\lambda_{i}(\\mathbf{J}(\\mathbf{x}, t))) \\le -\\alpha$$\n\nwith $\\alpha > 0$. This claim is generally false. 
For a counter-example, see Section 4.4.2 in (Slotine & Li, 1991).\n\nHowever, in the *specific* case of the RNN (1), it appears that the eigenvalues of the symmetric part of  $\\mathbf{W}$  do provide information on global stability in a number of applications. For example, in (Matsuoka, 1992) it was shown that if  $\\mathbf{W}_s = \\frac{1}{2}(\\mathbf{W} + \\mathbf{W}^T)$  has all its eigenvalues less than unity, and  $\\mathbf{u}$  is constant, then (1) has a unique fixed point that is globally asymptotically stable. It is easy to see that this condition also implies that the real parts of the eigenvalues of the Jacobian are uniformly negative. Moreover, in (Chang et al., 2019) it was shown that setting the symmetric part of  $\\mathbf{W}_s = \\frac{1}{2}(\\mathbf{W} + \\mathbf{W}^T)$  almost equal to zero (yet slightly negative) led to rotational, yet stable dynamics in practice. This leads us to the following theorem, which shows that if the slopes of the activation functions change sufficiently slowly as a function of time, then the condition in (Matsuoka, 1992) in fact implies global contraction of (1).\n\n**Theorem 7.** Let **D** be a positive, diagonal matrix with  $D_{ii} = \\frac{d\\phi_i}{dx_i}$ , and let **P** be an arbitrary, positive diagonal matrix. If:\n\n$$(g\\mathbf{W} - \\mathbf{I})\\mathbf{P} + \\mathbf{P}(g\\mathbf{W}^T - \\mathbf{I}) \\preceq -c\\mathbf{P}$$\n\nand\n\n$$\\dot{\\mathbf{D}} - cg^{-1}\\mathbf{D} \\preceq -\\beta\\mathbf{D}$$\n\nfor  $c, \\beta > 0$ , then (1) is contracting in metric **D** with rate  $\\beta$ .\n\nIt has been conjectured that diagonal stability of  $g\\mathbf{W} - \\mathbf{I}$  is a sufficient condition for global contraction of the RNN 1 (Revay et al., 2020b), however this has been difficult to prove. 
To better characterize this conjecture, we present Theorem 8, which shows by way of counterexample that diagonal stability of  $g\\mathbf{W} - \\mathbf{I}$  does NOT imply global contraction in a *constant* metric for (1).\n\n**Theorem 8.** Satisfaction of the condition  $g\\mathbf{W}_{sym} - \\mathbf{I} \\prec 0$  is **NOT** sufficient to show global contraction of the general nonlinear RNN (1) in any constant metric. High levels of antisymmetry in  $\\mathbf{W}$  can make it impossible to find such a metric, which we demonstrate via a  $2 \\times 2$  counterexample of the following form, with  $c \\geq 2$ :\n\n$$\\mathbf{W} = \\begin{bmatrix} 0 & -c \\\\ c & 0 \\end{bmatrix}$$\n\n# 3 EXPERIMENTS\n\nIt is possible that imposing a strong stability constraint on the RNNs precludes them from performing well on tasks that require information processing over long timescales. To investigate whether or not this occurs with our stable combination networks, we trained a variety of provably stable RNNs on benchmark sequential image classification tasks - sequential MNIST, permuted sequential MNIST, and sequential CIFAR10 - as these are frequently used to measure an RNN's ability to use information from across a long sequence (Le et al., 2015; Chang et al., 2019). For all the tasks, images are presented pixel-by-pixel, and the network makes a prediction at the end of the sequence. In the case of permuted seqMNIST, the pixels are presented in a fixed but random order.\n\nRecall that our proposal is to constrain the stability of subnetworks in a way that makes training the combined network (using Theorem 6) stability-preserving. The particular choice of subnetwork constraint influences the resulting training process and performance. We selected two stability constraints from our results above to experiment with: Theorems 1 and 5. 
In the case of Theorem 1, we did not train the individual subnetworks' hidden-to-hidden weight matrices, only training the connections *between* subnetworks (Figure 2A). In the case of Theorem 5 we trained all the parameters of the model (Figure 2B).\n\nWe refer to networks with subnetworks constrained by Theorem 1 as 'Sparse Combo Nets' (because, as we will detail below, sparse hidden-to-hidden weights more readily satisfy Theorem 1). Similarly, we refer to networks with subnetworks constrained by Theorem 5 as 'SVD Combo Nets'. Throughout the experimental results we refer to combination networks of different sizes using the notation 'p × n network'. Such a network consists of p distinct subnetwork RNNs, with each such subnetwork RNN containing n units. For all subnetworks we use the ReLU activation function.\n\n![](_page_4_Figure_5.jpeg)\n\nFigure 2: Summary of network architectures. Nonlinear subnetworks are constrained to meet either Theorem 1 via sparse initialization (A) or Theorem 5 via direct parameterization (B). Linear negative feedback connections are trained between the subnetworks according to Theorem 6. See Section A5 for additional details on model definitions.\n\nOur networks performed better than or comparably to other stable RNNs across all tasks, especially when leveraging sparsity in the subnetworks. Our best test accuracies of 96.94% on permuted seqMNIST and 64.75% on seqCIFAR10 not only set a new SOTA for stability-guaranteed RNNs, but are also competitive with the overall SOTA while having many fewer parameters (Table 1).\n\n### 3.1 NETWORK INITIALIZATION AND TRAINING DETAILS\n\nFor the Sparse Combo Net we were not able to find a direct parameterization to use during training, but it is straightforward to randomly generate matrices with a particular likelihood of meeting the condition by selecting an appropriate density level and limit on entry magnitude. 
As every RNN meeting this contraction condition has a well-defined stable LTI system contracting in the same metric, it is not only easy to verify, but also easy to find a contraction metric to use in our training algorithm (Theorem 6). Solving for M in −I = MA + A<sup>T</sup>M will produce a valid metric for any stable LTI system A (Slotine & Li, 1991). We therefore randomly generate fixed subnetworks satisfying Theorem 1 and train only the linear connections between them (Figure 3), as well as the linear input and output layers.\n\nFigure 3: Example  $3 \\times 16$  Sparse Combo Net. The hidden-to-hidden nonlinear weights ( $\\tilde{\\mathbf{W}}$  in Theorem 6) are initialized based on a set sparsity, and do not change over training (A). The linear inter-subnetwork connections ( $\\mathbf{L}$  in Theorem 6) are constrained to be antisymmetric (with respect to the overall network metric), and are updated in the training process (B).\n\nFor the SVD Combo Net, we ensured contraction by directly parameterizing each of the  $\\mathbf{W}_i$  ( $i = 1, 2, \\ldots, p$ ) as:\n\n$$\\mathbf{W}_i = \\mathbf{\\Phi}_i^{-1} \\mathbf{U}_i \\mathbf{\\Sigma}_i \\mathbf{V}_i^T \\mathbf{\\Phi}_i \\tag{3}$$\n\nwhere  $\\Phi_i$  is diagonal and nonsingular,  $\\mathbf{U}_i$  and  $\\mathbf{V}_i$  are orthogonal, and  $\\Sigma_i$  is diagonal with  $\\Sigma_{ii} \\in [0, g^{-1})$ . We ensure orthogonality of  $\\mathbf{U}_i$  and  $\\mathbf{V}_i$  during training by exploiting the fact that the matrix exponential of a skew-symmetric matrix is orthogonal, as was done in (Lezcano-Casado & Martinez-Rubio, 2019). The network constructed from these subnetworks using Theorem 6 is contracting in metric  $\\tilde{\\mathbf{M}} = \\operatorname{BlockDiag}(\\Phi_1^2, \\ldots, \\Phi_p^2)$ .\n\nFor both networks, we constrained the matrix  ${\\bf B}$  in Theorem 6 to reflect underlying modularity assumptions. 
In particular, we only train the off-diagonal blocks of  ${\\bf B}$  and mask the diagonal blocks. We do this to maintain the interpretation of  ${\\bf L}$  as the matrix containing the connection weights between different modules, as diagonal blocks would correspond to self-connections. Furthermore, we only train the lower-triangular blocks of  ${\\bf B}$  while masking the others, to increase training speed.\n\nUnless specified otherwise, all networks were trained for 150 epochs, using an Adam optimizer with initial learning rate 1e-3 and weight decay 1e-5. The learning rate was cut to 1e-4 after 90 epochs and to 1e-5 after 140. After identifying the most promising settings, we ran repetition trials on the best networks for 200 epochs with learning rate cuts after epochs 140 and 190. Detailed information on network initialization and hyperparameter tuning is provided in Section A3.\n\n### 3.2 RESULTS\n\nThe Sparse Combo Net architecture achieved the highest overall performance on permuted seqMNIST and seqCIFAR10 (96.94% and 64.75% best performance, respectively). On both these tasks, the Sparse Combo Net achieved SOTA for stable RNNs (Table 1). Furthermore, we were able to reproduce these scores over several repetitions (Section 3.2.3). Along with repeatability of results, we also show that the contraction constraint on the connections between subnetworks (L in Theorem 6) is important for performance, particularly in the Sparse Combo Net.\n\nAdditionally, we profile how various architecture settings impact performance of our networks. We found that the sparsity of the hidden-to-hidden weights in Sparse Combo Net had a large impact on the final network performance (Section 3.2.1). 
In both networks, we found that increasing the total number of neurons improved task performance, but with diminishing returns (Section 3.2.2).\n\n#### 3.2.1 EXPERIMENTS WITH SPARSITY\n\nBecause of the link between sparsity and stability as well as the biological relevance of sparsity, we explored in more detail how subnetwork sparsity affects the performance of Sparse Combo Net. We ran a number of additional experiments on the permuted seqMNIST task, varying sparsity level while holding network size and other hyperparameters constant. Here we use \"n% sparsity level\" to refer to a network with subnetworks that have just n% of their weights non-zero.\n\nWe observed a large (> 5 percentage point) performance boost when switching from a 26.5% sparsity level to a 10% sparsity level in the  $11 \\times 32$  Sparse Combo Net (Figure 4), and subsequently\n\n| Name | Stable RNN? | Params | Seq MNIST Best | PerSeq MNIST Best | Seq CIFAR Best | sMNIST Repeats Mean (n) [Min] | psMNIST Repeats Mean (n) [Min] | sCIFAR10 Repeats Mean (n) [Min] |\n|---|---|---|---|---|---|---|---|---|\n| LSTM (Chang et al., 2019) | | 68K | 97.3% | 92.7% | 59.7% | — | — | — |\n| Transformer (Trinh et al., 2018) | | 0.5M | 98.9% | 97.9% | 62.2% | — | — | — |\n| Antisymmetric (Chang et al., 2019) | ? | 36K | 98% | 95.8% | 58.7% | — | — | — |\n| Sparse Combo Net | X | 130K | 99.04% | 96.94% | 64.75% | — | 96.85% (4) [96.65%] | 64.37% (3) [64.04%] |\n| Lipschitz (Erichson et al., 2021) | X | 134K | 99.4% | 96.3% | 64.2% | 99.2% (10) [99.0%] | 95.9% (10) [95.6%] | — |\n| Dense IndRNN (Li et al., 2020) | | 83K | 99.48% | 97.2% | — | — | — | — |\n| CKConv (Romero et al., 2021) | | 1M | 99.32% | 98.54% | 63.74% | — | — | — |\n| Trellis (Bai et al., 2019) | | 8M | 99.2% | 98.13% | 73.42% | — | — | — |\n\nTable 1: Published benchmarks for sequential MNIST, permuted MNIST, and sequential CIFAR10 best test accuracy. Networks are grouped into three categories: baselines, best performing RNNs with claimed stability guarantee\\*, and networks achieving the current SOTA on each task. 
Within each grouping, networks are ordered by number of trainable parameters (for CIFAR10 if it differed across tasks). Our network is highlighted. Where possible, we include information on repeatability. \\*For more on stability guarantees in machine learning, see Section 2.3.\n\ndecided to test significantly sparser subnetworks in a 16 × 32 Sparse Combo Net. We trained networks with sparsity levels of 5%, 3.3%, and 1%, as well as 10% for baseline comparison (Figure S3A). A 3.3% sparsity level produced the best results, leading to our SOTA performance on permuted seqMNIST. For a component RNN size of just 32 units, this sparsity level is quite small, with only one or two directional connections per neuron on average (Figure S1).\n\n![](_page_6_Figure_4.jpeg)\n\nFigure 4: Permuted seqMNIST performance over the course of training for two 11 × 32 Sparse Combo Nets with different sparsity levels.\n\nThe 16 × 32 Sparse Combo Net with a 3.3% sparsity level also achieved SOTA performance for stable RNNs on sequential CIFAR10. Not only was the test accuracy the highest to date for a provably stable RNN, it was higher than that of the 1 million parameter CKConv network, which holds the record for permuted seqMNIST accuracy (Table 1). Our network has 130 thousand trainable parameters by comparison.\n\nAs sparsity had such a positive effect on task performance, we did additional analyses to better understand why. We found that decreasing the magnitude of non-zero elements while holding sparsity level constant decreased task performance (Figure S3B), suggesting that the effect is driven in part by the fact that sparsity enables higher magnitude non-zero elements while still maintaining stability.\n\nIt is also worth noting that upon investigation of the subnetwork weight matrices across these trials, the sparser networks had substantially lower maximum eigenvalue of |W|, suggesting that stronger stability can actually correlate with improved performance on sequential tasks. 
This could be due to a mechanism such as that described in (Radhakrishnana et al., 2020).\n\n### 3.2.2 EXPERIMENTS WITH NETWORK SIZE\n\nUnderstanding the effect of size on network performance is important to practical application of these architectures. For both Sparse Combo Net and SVD Combo Net, increasing the number of subnetworks while holding other settings constant (including fixing the size of each subnetwork at 32 units) was able to increase network test accuracy on permuted seqMNIST to a point (Figure 5).\n\nThe greatest performance jump happened when increasing from one module (37.1% Sparse Combo Net, 61.8% SVD Combo Net) to two modules (89.1% Sparse Combo Net, 92.9% SVD Combo Net). After that the performance increased steadily with number of modules until saturating at ∼ 97% for Sparse Combo Net and ∼ 95% for SVD Combo Net. Because the internal subnetwork weights are not trained in Sparse Combo Net, it is unsurprising that its performance was substantially worse at the smallest sizes. However, Sparse Combo Net surpasses SVD Combo Net by the 12 × 32 network size, which contains a modest 384 total units.\n\n![](_page_7_Figure_6.jpeg)\n\nFigure 5: Permuted seqMNIST performance plotted against the number of subnetworks. Each subnetwork has 32 neurons. Results are shown for both Sparse Combo Net and SVD Combo Net.\n\nFor the Sparse Combo Net specifically we did additional experiments, both replicating the observed effect of network size using subnetworks with 16 units each (Figure S2A), and evaluating how task performance varies with modularity of a network fixed to have 352 total units (Figure S2B). In the modularity experiment we observed an inverse U shape, with poor performance of a 1 × 352 net and an 88 × 4 net, and best performance from a 44 × 8 net. Note that this experiment compared similar sparsity levels across the different subnetwork sizes. 
In practice we can achieve better performance with larger subnetworks by leveraging sparsity in a way not possible in smaller subnetworks.\n\nThe use of sparsity in subnetworks to improve performance suggests another interesting direction that could enable better scalability of total network size - enforcing sparsity in the linear feedback weight matrix (L). We performed pilot testing of this idea in a 24 × 32 Sparse Combo Net, varying the number of feedback connections that were fixed at 0 while all other settings remained as they were in our prior experiments (Table S7). We obtained 65.14% test accuracy on sequential CIFAR10 using a network with only 50% of the possible feedback connections non-zero, surpassing our previous best and using a smaller number of epochs to do so.\n\n#### 3.2.3 REPEATABILITY AND CONTROLS\n\nBecause the Sparse Combo Net does not have the connections within its subnetworks trained, network performance could be particularly susceptible to random initialization. Thus we ran repeatability studies on both permuted sequential MNIST and sequential CIFAR10 using our best network settings and an extended training period. Across four permuted seqMNIST trials with the same network settings (16 × 32 with subnetwork sparsity level of 3.3%), best test accuracy always fell between 96.65% and 96.94%, a range much smaller than the differences seen with changing sparsity settings and network size. Three of the four trials showed best test accuracy ≥ 96.88%, despite some variability in early training performance (Figure S4). Similarly, across three seqCIFAR10 trials with the same network settings, best test accuracy always fell between 64.04% and 64.75% (Figure S5). 
We also ran 9 sequential CIFAR10 trials using a shorter training period to demonstrate reproducibility of the early training curve (Figure S6).\n\nAs a control study, we tested how sensitive both Sparse Combo Net and SVD Combo Net were to the stabilization condition on the interconnection matrix (L in Theorem 6). To do this we constrained the hidden-to-hidden weights of the networks to satisfy their respective contraction conditions. However, instead of constraining L in a way that ensures contraction of the overall system, we set L = B and did not constrain B during training. In both cases we found that this hurt network performance on the permuted seqMNIST task, demonstrating the utility of the contraction condition. For the 24 × 32 Sparse Combo Network we saw a decrease from 96.7% test accuracy to 47.0% test accuracy. For the 24 × 32 SVD Combo Network we saw a decrease from 94.9% to 94.56%. The disparity in performance decrease makes sense when considering that the hidden-to-hidden weights of SVD Combo Network are trainable, while those of the Sparse Combo Network are not. Whatever instabilities are introduced by the lack of constraint on the inter-subnetwork connections cannot be adequately compensated for in the Sparse Combo Network.\n\n# 4 DISCUSSION\n\nMost work on stability of task-trained RNNs has focused on *single* RNNs. Here we leverage tools from nonlinear control theory to derive novel single-RNN stability conditions which enable the recursive construction of stable *assemblies* of RNNs. In particular we show that certain stability conditions for individual RNNs allow for a simple parameterization of connections between these RNNs that automatically preserves stability during training. We then show that these modular 'networks of networks' repeatedly perform better than existing stable RNNs on key sequence classification tasks such as permuted seqMNIST and seqCIFAR10. 
We also provide control studies that show the stabilizing parameterization of connections between subnetwork RNNs is important for high performance.\n\nThere are numerous future directions enabled by this work. For example, Theorem [7] suggests that a less restrictive contraction condition on W in terms of the eigenvalues of the symmetric part is possible and desirable. Furthermore, the beneficial impact of sparsity on training these stable models suggests a potential avenue for additional experimental work – in particular adding a regularizing sparsity term during training. As 'network of network' approaches are becoming increasingly popular, our methodology is relevant to a variety of task types, including reinforcement learning applications. Moreover, because sparsity and modularity have important theoretical and empirical implications in neuroscience [\\(Kozachkov et al.,] [2020;] [Slotine & Liu,] [2012\\)] we expect to be able to apply the present work to those topics as well.\n\nUltimately, our work represents a step forward in understanding the stability properties of recurrent neural networks. Stability is a fundamental property of dynamical systems, and is inextricably linked to concepts such as generalization, control, predictability, and robustness. Therefore, as systems trained with deep learning become more modular, complex, and integrated into our lives, understanding the conditions under which these systems are stable will become increasingly important. Furthermore, it will be necessary to identify training techniques that can reliably generate provably stable RNNs with minimal performance loss. We provide a proof-of-concept of the use of combination networks to achieve this goal, and show that sparsity enables additional performance gains.\n\nAcknowledgements This work benefited from stimulating discussions with Michael Happ and Quang-Cuong Pham.\n\n# REFERENCES\n\n- Martin Arjovsky, Amar Shah, and Yoshua Bengio. 
Unitary evolution recurrent neural networks. In *International Conference on Machine Learning*, pp. 1120–1128. PMLR, 2016.\n- Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. *ICLR*, 2019.\n- Yoshua Bengio, Ian Goodfellow, and Aaron Courville. *Deep learning*, volume 1. MIT press Massachusetts, USA:, 2017.\n- Michael Buehner and Peter Young. A tighter bound for the echo state property. *IEEE Transactions on Neural Networks*, 17(3):820–824, 2006.\n- Bo Chang, Minmin Chen, Eldad Haber, and Ed H Chi. Antisymmetricrnn: A dynamical system view on recurrent neural networks. *arXiv preprint arXiv:1902.09689*, 2019.\n- Yimian Dai, Stefan Oehmcke, Fabian Gieseke, Yiquan Wu, and Kobus Barnard. Attention as activation. *arXiv preprint arXiv:2007.07729v2*, 2020.\n- N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney. Lipschitz recurrent neural networks. *ICLR*, 2021.\n- John Gerhart and Marc Kirschner. The theory of facilitated variation. *Proceedings of the National Academy of Sciences*, 104(1):8582–8589, 2007. doi: 10.1073/pnas.0701035104.\n- Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. *Inverse Problems*, 34(1):014004, 2017.\n- Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience-inspired artificial intelligence. *Neuron*, 95(2):245–258, 2017.\n- Herbert Jaeger. The \"echo state\" approach to analysing and training recurrent neural networks-with an erratum note. 2001.\n- Leo Kozachkov, Mikael Lundqvist, Jean-Jacques Slotine, and Earl K Miller. Achieving stable dynamics in neural circuits. *PLoS computational biology*, 16(8):e1007659, 2020.\n- Dmitry Krotov and John Hopfield. Large associative memory problem in neurobiology and machine learning. *arXiv preprint arXiv:2008.06996*, 2020.\n- Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. 
*arXiv preprint arXiv:1504.00941*, 2015.\n- Mario Lezcano-Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In *International Conference on Machine Learning*, pp. 3794–3803. PMLR, 2019.\n- Shuai Li, Chris Cook, and Yanbo Gao. Deep independently recurrent neural network (indrnn). *arXiv preprint arXiv:1910.06251v3*, 2020.\n- Winfried Lohmiller and Jean-Jacques E Slotine. On contraction analysis for non-linear systems. *Automatica*, 34(6):683–696, 1998.\n- Kiyotoshi Matsuoka. Stability conditions for nonlinear continuous neural networks with asymmetric connection weights. *Neural networks*, 5(3):495–500, 1992.\n- John Miller and Moritz Hardt. Stable recurrent models. *arXiv preprint arXiv:1805.10369*, 2018.\n- Kenneth D Miller and Francesco Fumarola. Mathematical equivalence of two common forms of firing rate models of neural networks. *Neural computation*, 24(1):25–31, 2012.\n- Kumpati S. Narendra and Robert Shorten. Hurwitz stability of metzler matrices. *IEEE Transactions On Automatic Control*, 55(6):1484–1487, 2010.\n- Merav Parter, Nadav Kashtan, and Uri Alon. Facilitated variation: How evolution learns from past environments to generalize to new environments. *PLOS Computational Biology*, 4(11), 2008.\n- Adityanarayanan Radhakrishnana, Mikhail Belkin, and Caroline Uhler. Overparameterized neural networks implement associative memory. *PNAS*, 117:27162–27170, 2020.\n- Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, et al. Hopfield networks is all you need. *arXiv preprint arXiv:2008.02217*, 2020.\n- Max Revay and Ian Manchester. Contracting implicit recurrent neural networks: Stable models with improved trainability. In *Learning for Dynamics and Control*, pp. 393–403. PMLR, 2020.\n- Max Revay, Ruigang Wang, and Ian R Manchester. 
A convex parameterization of robust recurrent neural networks. *IEEE Control Systems Letters*, 5(4):1363–1368, 2020a.\n- Max Revay, Ruigang Wang, and Ian R Manchester. Lipschitz bounded equilibrium networks. *arXiv e-prints*, pp. arXiv–2010, 2020b.\n- Max Revay, Ruigang Wang, and Ian R Manchester. Recurrent equilibrium networks: Unconstrained learning of stable and robust dynamical models. *arXiv preprint arXiv:2104.05942*, 2021.\n- David W. Romero, Anna Kuzina, Erik J. Bekkers, Jakub M. Tomczak, and Mark Hoogendoorn. Ckconv: Continuous kernel convolution for sequential data. *arXiv preprint arXiv:2102.02611v1*, 2021.\n- Ueli Rutishauser, Rodney J Douglas, and Jean-Jacques Slotine. Collective stability of networks of winner-take-all circuits. *Neural computation*, 23(3):735–773, 2011.\n- Ueli Rutishauser, Jean-Jacques Slotine, and Rodney Douglas. Computation in dynamically bounded asymmetric systems. *PLoS Comput Biol*, 11(1):e1004039, 2015.\n- David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. *Nature*, 529(7587):484–489, 2016.\n- Herbert A. Simon. The architecture of complexity. *Proceedings of the American Philosophical Society*, 106(6):467–482, 1962. ISSN 0003-049X.\n- Jean-Jacques Slotine. Modular stability tools for distributed computation and control. *Int. J. Adaptive Control and Signal Processing*, 17(6), 2003.\n- Jean-Jacques Slotine and Yang-Yu Liu. The missing link. *Nature Physics*, 8(7):512–513, 2012.\n- Jean-Jacques Slotine and Winfried Lohmiller. Modularity, evolution, and the binding problem: a view from stability theory. *Neural Networks*, 14(2):137–145, 2001.\n- Jean-Jacques E Slotine and Weiping Li. *Applied Nonlinear Control*. Prentice-Hall, 1991.\n- Nicolas Tabareau and Jean-Jacques Slotine. Notes on contraction theory. 
*arXiv preprint nlin/0601011*, 2006.\n- Trieu H. Trinh, Andrew M. Dai, Minh-Thang Luong, and Quoc V. Le. Learning longer-term dependencies in rnns with auxiliary losses. *arXiv preprint arXiv:1803.00144v3*, 2018.\n- Eric W. Weisstein. Positive definite matrix. URL [https://mathworld.wolfram.com/](https://mathworld.wolfram.com/PositiveDefiniteMatrix.html) [PositiveDefiniteMatrix.html](https://mathworld.wolfram.com/PositiveDefiniteMatrix.html).\n- Guangyu R Yang and Manuel Molano-Mazon. Next-generation of recurrent neural network models for cognition, Apr 2021. URL <psyarxiv.com/w34n2>.\n- Guangyu Robert Yang, Igor Ganichev, Xiao-Jing Wang, Jonathan Shlens, and David Sussillo. A dataset and architecture for visual reasoning with a working memory. In *Proceedings of the European Conference on Computer Vision (ECCV)*, pp. 714–731, 2018.",
      "parsed_appendix_content_blind_submission_open_review": "# APPENDIX\n\n# A1 CONTRACTION MATH\n\nIt can be shown that the non-autonomous system\n\n$$\\dot{\\mathbf{x}} = \\mathbf{f}(\\mathbf{x}, t)$$\n\nis contracting if there exists a metric M(x, t) = Θ(x, t)<sup>T</sup>Θ(x, t) ≻ 0 such that uniformly\n\n$$\\dot{\\mathbf{M}} + \\mathbf{M}\\mathbf{J} + \\mathbf{J}^T \\mathbf{M} \\preceq -\\beta \\mathbf{M}$$\n\nwhere J = ∂f/∂x and β > 0. For more details see the main reference (Lohmiller & Slotine, 1998). Similarly, a non-autonomous discrete-time system\n\n$$\\mathbf{x}_{t+1} = \\mathbf{f}(\\mathbf{x}_t, t)$$\n\nis contracting if\n\n$$\\mathbf{J}^T \\mathbf{M}_{t+1} \\mathbf{J} - \\mathbf{M}_t \\preceq -\\beta \\mathbf{M}_t$$\n\n### A1.1 FEEDBACK AND HIERARCHICAL COMBINATIONS\n\nConsider two systems, independently contracting in constant metrics M<sub>1</sub> and M<sub>2</sub>, which are combined in feedback:\n\n$$\\dot{\\mathbf{x}} = \\mathbf{f}(\\mathbf{x}, t) + \\mathbf{B}\\mathbf{y}$$\n\n$$\\dot{\\mathbf{y}} = \\mathbf{g}(\\mathbf{y}, t) + \\mathbf{G}\\mathbf{x} \\qquad \\text{(Feedback Combination)}$$\n\nIf the following relationship between B, G, M<sub>1</sub>, and M<sub>2</sub> is satisfied:\n\n$$\\mathbf{B} = -\\mathbf{M}_1^{-1} \\mathbf{G}^T \\mathbf{M}_2$$\n\nthen the combined system is contracting as well. This may be seen as a special case of the feedback combination derived in (Tabareau & Slotine, 2006). The situation is even simpler for hierarchical combinations. Consider again two systems, independently contracting in some metrics, which are combined in hierarchy:\n\n$$\\dot{\\mathbf{x}} = \\mathbf{f}(\\mathbf{x}, t)$$\n\n$$\\dot{\\mathbf{y}} = \\mathbf{g}(\\mathbf{y}, t) + \\mathbf{h}(\\mathbf{x}, t) \\qquad \\text{(Hierarchical Combination)}$$\n\nwhere h(x, t) is a function with *bounded* Jacobian. Then this combined system is contracting in a diagonal metric, as shown in (Lohmiller & Slotine, 1998). 
By recursion, this extends to hierarchies of arbitrary depth.\n\n# A2 TWO DIFFERENT RNNS\n\nNote that in neuroscience, the variable x in equation [\\(1\\)] is typically thought of as a vector of neural membrane potentials. It was shown in [\\(Miller & Fumarola,] [2012\\)] that the RNN [\\(1\\)] is equivalent via an affine transformation to another commonly used RNN model,\n\n$$\\tau \\dot{\\mathbf{y}} = -\\mathbf{y} + \\phi(\\mathbf{W}\\mathbf{y} + \\mathbf{b}(t)) \\tag{4}$$\n\nwhere the variable y is interpreted as a vector of firing rates, rather than membrane potentials. The two models are related by the transformation x = Wy + b, which yields\n\n$$\\tau \\dot{\\mathbf{x}} = \\mathbf{W}(-\\mathbf{y} + \\phi(\\mathbf{W}\\mathbf{y} + \\mathbf{b})) + \\tau \\dot{\\mathbf{b}} = -\\mathbf{x} + \\mathbf{W}\\phi(\\mathbf{x}) + \\mathbf{v}$$\n\nwhere  $\\mathbf{v} \\equiv \\mathbf{b} + \\tau \\dot{\\mathbf{b}}$ . Thus b is a low-pass filtered version of  $\\mathbf{v}$  (or conversely,  $\\mathbf{v}$  may be viewed as a first order prediction of  $\\mathbf{b}$ ) and the contraction properties of the system are unaffected by the affine transformation. Note that the above equivalence holds even in the case where  $\\mathbf{W}$  is not invertible. In this case, the two models are proven to be equivalent, provided that  $\\mathbf{b}(0)$  and  $\\mathbf{y}(0)$  satisfy certain conditions—which are always possible to satisfy (Miller & Fumarola, 2012). Therefore, any contraction condition derived for the x (or y) system automatically implies contraction of the other system. We exploit this freedom freely throughout the paper.\n\n#### A3 Sparse Combo Net Supplementary Materials\n\nAll networks described in the main text were trained using a single GPU on Google Colab. 
An exported Colab notebook with the code to replicate all experiments is provided in the supplementary attachment.\n\nTo obtain the metric necessary for training the linear connections, scipy.integrate.quad was used with default settings to solve for $\\mathbf{M}$ in the equation $-\\mathbf{I} = \\mathbf{M}\\mathbf{W} + \\mathbf{W}^T\\mathbf{M}$, as described in the main text. This is done by integrating $e^{\\mathbf{W}^Tt}\\mathbf{Q}e^{\\mathbf{W}t}dt$ from 0 to $\\infty$, with $\\mathbf{Q} = \\mathbf{I}$ to match the left-hand side of the equation above. For efficiency reasons, and due to the guaranteed existence of a diagonal metric in the case of Theorem 1, integration was only performed to solve for the diagonal elements of $\\mathbf{M}$. Therefore a check was added prior to training to confirm that the initialized network indeed satisfied Theorem 1 with metric $\\mathbf{M}$. However, this check was never triggered by our initialization method.\n\nInitial training hyperparameter tuning was done primarily with $10 \\times 16$ combination networks on the permuted seqMNIST task, starting with settings based on existing literature on this task, and verifying promising settings using a $15 \\times 16$ network. Initialization settings were held the same throughout, matching what was later done for the size comparison trials (described below). The results of all of the attempted trials are reported in Table S1. Once hyperparameters were decided upon, the trials reported on in the main text began.\n\nTo report on the number of trainable parameters, we used the following formula:\n\n$$\\frac{n^2 - M * C^2}{2} + i * n + n * o + n + o$$\n\nwhere $n$ is the total number of units in the $M \\times C$ combination network, $o$ is the total number of output nodes for the task, and $i$ is the total number of input nodes for the task. 
Thus for the $16 \\times 32$ networks highlighted here, we have 129034 trainable parameters for the MNIST tasks, and 130058 trainable parameters for sequential CIFAR10.\n\nNote that the naive estimate for the number of trainable parameters would be $n^2 + i*n + n*o + n + o$, corresponding to the number of weights in $\\mathbf{L}$, the number of weights in the feedforward linear input layer, the number of weights in the feedforward linear output layer, and the bias terms for the input and output layers, respectively. However, because of the combination property constraints on $\\mathbf{L}$, only the lower triangular portion of a block off-diagonal matrix is actually trained, and $\\mathbf{L}$ is then defined in terms of this matrix and the metric $\\mathbf{M}$. Thus we subtract $M*C^2$ to remove the block diagonal portions corresponding to nonlinear RNN components, and then divide by 2 to obtain only the lower half.\n\nAfter training was completed, we inspected the state of all networks described in the main text, pulling both the nonlinear ($\\mathbf{W}$) and linear ($\\mathbf{L}$) weight matrices from both initialization time and the final model. For $\\mathbf{W}$, we confirmed it did not change over training, and inspected the max real part of the eigenvalues of $|\\mathbf{W}|$ in accordance with Theorem 1. The densest tested matrices tended to have $\\lambda_{max}(|\\mathbf{W}|) > 0.9$, while the sparsest ones tended to have $\\lambda_{max}(|\\mathbf{W}|) < 0.1$. For $\\mathbf{L}$, we checked the maximum element and the maximum singular value before and after training. In general, both went up over the course of training, but by a modest amount.\n\n### A3.1 NETWORK SIZE AND MODULARITY COMPARISON\n\nFor the size comparison trials (Figure 5A), the nonlinear RNN weights were set by drawing uniformly from between -0.4 and 0.4 with 40% density using scipy.sparse.random, and then zeroing out the diagonal entries. 
These settings were chosen because they resulted in $\\sim 1\\%$ of 16 by 16 weight matrices meeting the Theorem 1 condition. During initialization, only the matrices meeting this condition were kept, and sampling stopped once the desired number of component RNNs had been set, producing a block diagonal $\\mathbf{W}$ as pictured in Figure 3. This same initialization process was used throughout our experiments.\n\n![](_page_13_Figure_1.jpeg)\n\nFigure S1: Weight matrices for each of the 32 unit nonlinear component RNNs that were used in the best performing 16 × 32 network on permuted sequential MNIST.\n\nFirst, we held static the number of units and initialization settings for each component RNN, and tested the effect of changing the number of components in the combination network. 1, 3, 5, 10, 15, 20, 22, 25, and 30 component RNNs were tested in this experiment (Table S2). Increasing the number of components initially led to great improvements in test accuracy, but had diminishing returns: test accuracy consistently hit ∼ 93% with a large enough number of modules, but neither loss nor accuracy showed meaningful improvement past the 22 × 16 network (Figure S2A). Interestingly, early training loss and accuracy became substantially worse once the number of components increased past a certain point, falling from 70% to 43% epoch 1 test accuracy between the 22 × 16 and 30 × 16 networks.\n\nTo better understand how the modularity of the combination networks affects performance, the next experiment held the number of total units constant at 352, selected due to the prior success of the 22 × 16 network, and tested different allocations of these units amongst component RNNs. Thus 1 × 352, 11 × 32, 44 × 8, and 88 × 4 networks were trained to compare against the 22 × 16 (Table S3). 
Increasing the modularity improved performance to a point, with the 44 × 8 network resulting in final test accuracy of 94.44%, while conversely the 11 × 32 network reached only 89.12% (Figure S2B). However, the 88 × 4 network was unable to learn, and a 352 × 1 network would theoretically just be a scaled linear anti-symmetric network.\n\nBecause larger networks require different sparsity settings to meet the Theorem 1 condition, these were not held constant between trials in the modularity comparison experiment (Figure 5B), but rather selected in the same way between trials: we looked for settings that keep density and scalar balanced and result in ∼ 1% of the matrices meeting the condition. The scalar was applied after sampling non-zero entries from a uniform distribution between -1 and 1. The resulting settings were 7.5% density and 0.077 scalar for the 352-unit component RNN, 26.5% density and 0.27 scalar for the 32-unit component RNN, 60% density and 0.7 scalar for the 8-unit component RNN, and 100% density and 1.0 scalar for the 4-unit component RNN.\n\n![](_page_14_Figure_1.jpeg)\n\nFigure S2: Performance of Sparse Combo Nets on the Permuted seqMNIST task by combination network size. We test the effects on final and first epoch test accuracy of both total network size and network modularity. The former is assessed by varying the number of subnetworks while each subnetwork is fixed at 16 units (A), and the latter by varying the distribution of units across different numbers of subnetworks with the total sum of units in the network fixed at 352 (B). Note that these experiments were run prior to optimizing the sparsity initialization settings. Experiments on total network size were later repeated with the final sparsity settings (Figure 5). 
The results of the two size experiments are consistent.\n\n### A3.2 SPARSITY SETTINGS COMPARISON\n\nDensity and scalar settings for the component nonlinear RNNs were initially chosen for each network size using the percentage of random networks that met the Theorem 1 condition. For scalar $s$, a component network would have non-zero entries sampled uniformly between $-s$ and $s$.\n\nWhen we began experimenting with sparsity in the initialization, we split the previously described scalar setting into two different scalars: one applied before a random matrix was checked against the Theorem 1 condition, and one applied after a matrix was selected. Of course the latter must be ≤ 1 to guarantee stability is preserved. The scalar was separated out after we noticed that at 5% density, random 32 by 32 weight matrices met the condition roughly 1% of the time whether the scalar was 10 or 100000; ∼ 85% of sampled matrices using scalar 10 would continue to meet the condition even if multiplied by a factor of 10000. Therefore we wanted a mechanism that could bias selection towards matrices that are stable due to their sparsity and not due to magnitude constraints, while still keeping the elements to a reasonable size for training purposes.\n\nUltimately, both sparsity and magnitude had a clear effect on performance (Figure S3). Increases in both had a positive correlation with accuracy and loss through most parameters tested, seemingly driven by the increase in non-zero element magnitude enabled by sparsity. Best test accuracy overall was 96.79%, which was obtained by both a 16 × 32 network with 5% density and entries between -5 and 5, and a 16 × 32 network with 3.3% density and entries between -6 and 6. The latter also achieved the best epoch 1 test accuracy observed of 86.79%.\n\n![](_page_15_Figure_2.jpeg)\n\nFigure S3: Permuted seqMNIST performance by component RNN initialization settings. 
Test accuracy is plotted over the course of training for four 16 × 32 networks with different density levels and entry magnitudes (A), highlighting the role of sparsity in network performance. Test accuracy is then plotted over the course of training for two 3.3% density 16 × 32 networks with different entry magnitudes (B), to demonstrate the role of the scalar. When the magnitude becomes too high, however, performance falls outside the current axis limits.\n\n### A3.3 REPEATABILITY\n\nTo further improve performance once network settings were explored on permuted seqMNIST, an extended training run was tested on the best performing option. Settings were kept the same as above using a 3.3% density 16 × 32 network, except training now ran for over 200 epochs, with just a single learning rate cut occurring after epoch 200 (exact number of epochs varied based on runtime limit). This experiment was repeated four times and resulted in 96.94% best test accuracy (Figure S4).\n\nWe performed a similar repetition experiment for the seqCIFAR10 task. To avoid hitting runtime limits, we performed three trials where each network was trained for 200 epochs with learning rate cuts at epochs 140 and 190. All networks exceeded 64% test accuracy, with the best performing network achieving 64.75% (Figure S5). We also trained a larger number of networks over a smaller number of epochs (Figure S6), as well as explored additional hyperparameter tuning on this task (Table S6).\n\n![](_page_16_Figure_1.jpeg)\n\nFigure S5: seqCIFAR10 performance on repeated trials. Three different 16 × 32 networks with 3.3% density and entries between -6 and 6 were trained for 200 epochs, with learning rate divided by 10 after epochs 140 and 190. (A) depicts test accuracy for each of the networks over the course of training. 
(B) depicts the training loss for the same networks.\n\n![](_page_17_Figure_1.jpeg)\n\nFigure S6: seqCIFAR10 performance on repeated trials with shorter training (done to complete more trials). Nine different 16 × 32 networks with 3.3% density and entries between -6 and 6 were set up to train for 150 epochs, with learning rate divided by 10 after epochs 90 and 140. Most of these networks hit the runtime limit before completing; however, they all got through at least 100 epochs and all exceeded 61% test accuracy. This figure depicts test accuracy for each of the networks over the course of training.\n\n### A3.4 TABLES OF RESULTS BY TRIAL\n\nTable S1 shows all trials run on permuted sequential MNIST before beginning the more systematic experiments reported on in the main text, presented in chronological order. Notably, our networks did not require an extensive hyperparameter tuning process.\n\nTables S2 and S3 report additional details on the size and modularity experiments (Figure S2).\n\nTables S4 and S5 report additional details on the sparsity experiments described in the main text (Figure 4), including results of all relevant trials, as some were left out of the main text for brevity.\n\nTable S6 reports the results of all trials of different hyperparameters on the sequential CIFAR10 task, in chronological order.\n\nFinally, Table S7 reports the results of our pilot trial on introducing sparsity into the linear feedback connection matrix.\n\n| Size | Epochs | Adam WD | Initial LR | LR Schedule | Final Test Acc. |\n|-------|--------|---------|------------|-------------|-----------------|\n| 10×16 | 150 | 5e-5 | 5e-3 | 0.1 after 91 | 84% |\n| 10×16 | 150 | 1e-5 | 1e-2 | 0.1 after 50,100 | 85% |\n| 15×16 | 150 | 2e-4 | 5e-3 | 0.1 after 50,100 | 84% |\n| 10×16 | 150 | 2e-4 | 1e-2 | 0.5 every 10 | 81% |\n| 10×16 | 200 | 2e-4 | 1e-2 | 0.5 after 10 then every 30 | 81% |\n| 10×16 | 171* | 5e-5 | 1e-2 | 0.75 after 10,20,60,100 then every 15 | 84% |\n| 15×16 | 179* | 1e-5 | 1e-3 | 0.1 after 100,150 | 90% |\n\nTable S1: Training hyperparameter tuning trials, presented in chronological order. \\* indicates that training was cut short by the 24 hour Colab runtime limit. LR Schedule describes the scalar the learning rate was multiplied by, and at what epochs. The best performing network is highlighted, and represents the training settings we used throughout most of the main text.\n\n| Size | Final Test Acc. | Epoch 1 Test Acc. | Final Train Loss |\n|-------|--------|--------|--------|\n| 1×16 | 38.69% | 24.61% | 1.7005 |\n| 3×16 | 70.56% | 40.47% | 0.9033 |\n| 5×16 | 77.86% | 47.99% | 0.7104 |\n| 10×16 | 85.82% | 61.38% | 0.4736 |\n| 15×16 | 90.28% | 69.09% | 0.3156 |\n| 20×16 | 92.26% | 71.72% | 0.2392 |\n| 22×16 | 93.01% | 70.11% | 0.2073 |\n| 25×16 | 92.99% | 61.81% | 0.2017 |\n| 30×16 | 93.16% | 43.21% | 0.1991 |\n\nTable S2: Results for combination networks containing different numbers of component 16-unit RNNs. Training hyperparameters and network initialization settings were kept the same across all trials.\n\n| Size | Final Test Acc. | Epoch 1 Test Acc. | Final Train Loss |\n|-------|--------|--------|--------|\n| 1×352 | 40.17% | 26.97% | 1.662 |\n| 11×32 | 89.12% | 61.29% | 0.3781 |\n| 22×16 | 93.01% | 70.11% | 0.2073 |\n| 44×8 | 94.44% | 25.78% | 0.1500 |\n| 88×4 | 10.99% | 10.99% | 2E+35 |\n\nTable S3: Results for different distributions of 352 total units across a combination network. This number was chosen based on prior 22×16 network performance. For each component RNN size tested, the same procedure was used to select appropriate density and scalar settings.\n\n| Size | Density | Scalar | Final Test Acc. | Epoch 1 Test Acc. | Final Train Loss |\n|-------|---------|--------|--------|--------|--------|\n| 11×32 | 26.5% | 0.27 | 89.12% | 61.29% | 0.3781 |\n| 11×32 | 10% | 1.0 | 94.86% | 70.67% | 0.1278 |\n| 22×16 | 40% | 0.4 | 93.01% | 70.11% | 0.2073 |\n| 22×16 | 20% | 1.0 | 95.27% | 76.58% | 0.0924 |\n| 22×16 | 10% | 1.0 | 94.26% | 71.53% | 0.1425 |\n| 44×8 | 60% | 0.7 | 94.44% | 25.78% | 0.1500 |\n| 44×8 | 50% | 1.0 | 95.05% | 30.52% | 0.1180 |\n\nTable S4: Results for different initialization settings - varying sparsity and magnitude of the component RNNs for different network sizes.\n\n| Density | Pre-select Scalar | Post-select Scalar | Final Test Acc. | Epoch 1 Test Acc. | Final Train Loss |\n|---------|------|------|--------|--------|---------|\n| 10% | 1.0 | 1.0 | 95.87% | 73.67% | 0.074 |\n| 5% | 10.0 | 0.1 | 95.11% | 73.10% | 0.1311 |\n| 5% | 10.0 | 0.2 | 96.15% | 82.50% | 0.0051 |\n| 5% | 10.0 | 0.5 | 96.69% | 75.76% | 0.0001 |\n| 5% | 6.0 | 1.0 | 96.41% | 21.55% | 3.3E-5 |\n| 5% | 7.5 | 1.0 | 16.75% | 11.39% | 3068967 |\n| 3.3% | 30.0 | 0.1 | 96.24% | 83.89% | 0.0005 |\n| 3.3% | 30.0 | 0.2 | 96.54% | 86.79% | 4E-5 |\n| 1% | 10.0 | 1.0 | 96.04% | 81.2% | 0.0001 |\n\nTable S5: Further optimizing the sparsity settings for high performance using a 16×32 network. The final scalar is the product of the pre-selection and post-selection scalars. Note that the 5% density and 7.5 scalar network was killed after 18 epochs due to exploding gradient.\n\n| Density | Pre-select Scalar | Post-select Scalar | Epochs | Adam WD | Initial LR | LR Schedule | Best Test Acc. |\n|---------|----|-----|------|------|------|-------------------|--------|\n| 3.3% | 30 | 0.2 | 150 | 1e-5 | 1e-3 | 0.1 after 90,140 | 64.63% |\n| 3.3% | 30 | 0.2 | 34* | 1e-5 | 5e-3 | 0.1 after 90,140 | 35.42% |\n| 5% | 6 | 1 | 150 | 1e-5 | 1e-3 | 0.1 after 90,140 | 60.9% |\n| 5% | 10 | 0.5 | 150 | 1e-5 | 1e-4 | 0.1 after 90,140 | 54.86% |\n| 3.3% | 30 | 0.2 | 150 | 1e-5 | 5e-4 | 0.1 after 90,140 | 61.83% |\n| 3.3% | 30 | 0.2 | 200 | 1e-6 | 2e-3 | 0.1 after 140,190 | 62.31% |\n| 3.3% | 30 | 0.2 | 186* | 1e-5 | 1e-3 | 0.1 after 140,190 | 64.75% |\n| 3.3% | 30 | 0.2 | 132* | 1e-6 | 1e-3 | 0.1 after 140,190 | 62.31% |\n| 5% | 10 | 0.5 | 195* | 1e-5 | 1e-3 | 0.1 after 140,190 | 64.68% |\n\nTable S6: Additional hyperparameter tuning for the CIFAR10 task, presented in chronological order. \\* indicates that training was cut short by the 24 hour Colab runtime limit, or in the case of high learning rate was killed intentionally due to exploding gradient. LR Schedule describes the scalar the learning rate was multiplied by, and at what epochs. The best performing network is highlighted.\n\n| Size | Feedback Density | Epochs | Best Overall Test Acc. | Best Test Acc. Through 85 Epochs |\n|----------------|-------|-----|--------|--------|\n| $24 \\times 32$ | 100% | 86 | 52.7% | 52.7% |\n| $24 \\times 32$ | 75% | 88 | 56.49% | 56.48% |\n| $24 \\times 32$ | 66.6% | 89 | 58.84% | 58.84% |\n| $24 \\times 32$ | 50% | 124 | 65.14% | 58.01% |\n| $24 \\times 32$ | 33.3% | 129 | 61.86% | 56.05% |\n| $24 \\times 32$ | 25% | 92 | 54.26% | 50.54% |\n| $24 \\times 32$ | 0% | 130 | 39.8% | 38.38% |\n| $16 \\times 32$ | 100% | 150 | 64.63% | 55.82% |\n\nTable S7: Results from pilot testing on the sparsity of negative feedback connections in a $24 \\times 32$ Sparse Combo Net. Feedback Density refers to the percentage of possible subnetwork pairings that were trained in negative feedback, while the remaining inter-network connections were held at 0. All networks were trained with the same 150 epoch training paradigm as mentioned in the main text, but were stopped after hitting a 24 hour runtime limit. Decreasing Feedback Density is a promising path towards further improving performance as the size of Sparse Combo Nets is scaled.\n\n# A4 PROOFS FOR MAIN RESULTS\n\n### A4.1 PROOF OF THEOREM 1\n\nOur first theorem is motivated by the observation that if the y-system is to be interpreted as a vector of firing rates, it must stay positive for all time. For a linear, time-invariant system with positive states, diagonal stability is equivalent to stability. 
Therefore a natural question is whether diagonal stability of a linearized y-system implies anything about stability of the nonlinear system. More formally, given an excitatory neural network (i.e., $W_{ij} \\ge 0$ for all $i, j$), if the linear system\n\n$$\\dot{\\mathbf{x}} = -\\mathbf{x} + g\\mathbf{W}\\mathbf{x}$$\n\nis stable, then there exists a positive diagonal matrix $\\mathbf{P}$ such that:\n\n$$\\mathbf{P}(g\\mathbf{W} - \\mathbf{I}) + (g\\mathbf{W} - \\mathbf{I})^T \\mathbf{P} \\prec 0$$\n\nThe following theorem shows that the nonlinear system (1) is indeed contracting in metric $\\mathbf{P}$, and extends this result to a more general $\\mathbf{W}$ by considering only the magnitudes of the weights.\n\nTheorem 1. *Let $|\\mathbf{W}|$ denote the matrix formed by taking the element-wise absolute value of $\\mathbf{W}$. If there exists a positive, diagonal $\\mathbf{P}$ such that:*\n\n$$\\mathbf{P}(g|\\mathbf{W}| - \\mathbf{I}) + (g|\\mathbf{W}| - \\mathbf{I})^T \\mathbf{P} \\prec 0$$\n\n*then equation (1) is contracting in metric $\\mathbf{P}$. Moreover, if $W_{ii} \\le 0$, then $|W|_{ii}$ may be set to zero to reduce conservatism.*\n\nThis condition is particularly straightforward in the common special case where the network does not have any self weights, with the leak term driving stability. While it can be applied to a more general $\\mathbf{W}$, the condition will of course not be met if the network was relying on highly negative values on the diagonal of $\\mathbf{W}$ for linear stability. As demonstrated by counterexample in the proof of Theorem 1, it can be impossible to use the same metric $\\mathbf{P}$ for the nonlinear RNN in such cases.\n\nTheorem 1 allows many weight matrices with low magnitudes or a generally sparse structure to be verified as contracting in the nonlinear system equation (1), by simply checking a linear stability condition (as linear stability is equivalent to diagonal stability for Metzler matrices too (Narendra & Shorten, 2010)).\n\nBeyond verifying contraction, Theorem 1 actually provides a metric, with little need for additional computation. 
Not only is it of inherent interest that the same metric can be shared across systems in this case, it is also of use in machine learning applications, where stability certificates are becoming increasingly necessary. Critically, it is feasible to encourage the condition during training via L2 regularization on $\\mathbf{W}$. More generally, there are a variety of systems of interest that meet this condition but do not meet the well-known maximum singular value condition, including those with a hierarchical structure.\n\n*Proof.* Consider the differential, quadratic Lyapunov function:\n\n$$V = \\delta \\mathbf{x}^T \\mathbf{P} \\delta \\mathbf{x}$$\n\nwhere $\\mathbf{P} \\succ 0$ is diagonal. The time derivative of $V$ is:\n\n$$\\dot{V} = 2\\delta \\mathbf{x}^T \\mathbf{P} \\dot{\\delta \\mathbf{x}} = 2\\delta \\mathbf{x}^T \\mathbf{P} \\mathbf{J} \\delta \\mathbf{x} = -2\\delta \\mathbf{x}^T \\mathbf{P} \\delta \\mathbf{x} + 2\\delta \\mathbf{x}^T \\mathbf{P} \\mathbf{W} \\mathbf{D} \\delta \\mathbf{x}$$\n\nwhere $\\mathbf{D}$ is a diagonal matrix such that $D_{ii} = \\frac{d\\phi_i}{dx_i} \\ge 0$. We can upper bound the quadratic form on the right as follows:\n\n$$\\delta \\mathbf{x}^T \\mathbf{PWD} \\delta \\mathbf{x} = \\sum_{ij} P_i W_{ij} D_j \\delta x_i \\delta x_j \\le$$\n\n$$\\sum_{i} P_{i} W_{ii} D_{i} |\\delta x_{i}|^{2} + \\sum_{ij,i \\neq j} P_{i} |W_{ij}| D_{j} |\\delta x_{i}| |\\delta x_{j}| \\leq g |\\delta \\mathbf{x}|^{T} \\mathbf{P} |\\mathbf{W}| |\\delta \\mathbf{x}|$$\n\nIf $W_{ii} \\le 0$, the term $P_i W_{ii} D_i |\\delta x_i|^2$ contributes non-positively to the overall sum, and can therefore be set to zero without disrupting the inequality. 
Now using the fact that $\\mathbf{P}$ is positive and diagonal, and therefore $\\delta\\mathbf{x}^T \\mathbf{P} \\delta\\mathbf{x} = |\\delta\\mathbf{x}|^T \\mathbf{P} |\\delta\\mathbf{x}|$, we can upper bound $\\dot{V}$ as:\n\n$$\\dot{V} \\leq |\\delta\\mathbf{x}|^T (-2\\mathbf{P} + g\\mathbf{P}|\\mathbf{W}| + g|\\mathbf{W}|^T\\mathbf{P})|\\delta\\mathbf{x}| = |\\delta\\mathbf{x}|^T [\\mathbf{P}(g|\\mathbf{W}| - \\mathbf{I}) + (g|\\mathbf{W}|^T - \\mathbf{I})\\mathbf{P}]|\\delta\\mathbf{x}|$$\n\nwhere $|W|_{ij} = |W_{ij}|$ for $i \\ne j$, and $|W|_{ii} = 0$ if $W_{ii} \\le 0$ and $|W|_{ii} = |W_{ii}|$ if $W_{ii} > 0$. The right-hand side is negative whenever the condition of Theorem 1 holds, which completes the proof. Note that $g|\\mathbf{W}| - \\mathbf{I}$ is Metzler, and therefore will be Hurwitz stable if and only if $\\mathbf{P}$ exists (Narendra & Shorten, 2010).\n\nIt is also worth noting that highly negative diagonal values in $\\mathbf{W}$ will prevent the same metric $\\mathbf{P}$ from being used for the nonlinear system. Therefore the method used in this proof cannot feasibly be adapted to further relax the treatment of the diagonal part of $\\mathbf{W}$. The intuitive reason behind this is that in the symmetric part of the Jacobian, $\\frac{\\mathbf{P}\\mathbf{W}\\mathbf{D}+\\mathbf{D}\\mathbf{W}^T\\mathbf{P}}{2} - \\mathbf{P}$, the diagonal self weights will also be scaled down by small $\\mathbf{D}$, while the leak portion $-\\mathbf{P}$ remains untouched by $\\mathbf{D}$. Now we actually demonstrate a counterexample, presenting a $2\\times 2$ symmetric Metzler matrix $\\mathbf{W}$ that is contracting in the identity in the linear system, but cannot be contracting *in the identity* in the nonlinear system equation (1):\n\n$$\\mathbf{W} = \\begin{bmatrix} -9 & 2.5 \\\\ 2.5 & 0 \\end{bmatrix}$$\n\nTo see that it is not possible for the more general nonlinear system with these weights to be contracting in the identity, take $\\mathbf{D} = \\begin{bmatrix} 0 & 0 \\\\ 0 & 1 \\end{bmatrix}$. 
Now\n\n$$(\\mathbf{WD})_{sym} - \\mathbf{I} = \\begin{bmatrix} -1 & 1.25\\\\ 1.25 & -1 \\end{bmatrix}$$\n\nwhich has a positive eigenvalue of $\\frac{1}{4}$.\n\n### A4.2 PROOF OF THEOREM 2\n\nWhile regularization may push networks towards satisfying Theorem 1, strictly enforcing the condition during optimization is not straightforward. This motivates the rest of our theorems, which derive contraction results for specially structured weight matrices. Unlike Theorem 1, these results have direct parameterizations which can easily be plugged into modern optimization libraries.\n\n**Theorem 2.** If $\\mathbf{W} = \\mathbf{W}^T$ and $g\\mathbf{W} \\prec \\mathbf{I}$, then (1) is contracting.\n\nWhen $\\mathbf{W}$ is symmetric, (1) may be seen as a continuous-time Hopfield network. Continuous-time Hopfield networks with symmetric weights were recently shown to be closely related to Transformer architectures (Krotov & Hopfield, 2020; Ramsauer et al., 2020). Specifically, the dot-product attention rule may be seen as a discretization of the continuous-time Hopfield network with softmax activation function (Krotov & Hopfield, 2020). Our results here provide a simple sufficient (and nearly necessary, see above remark) condition for global exponential stability of a given *trajectory* for the Hopfield network. In the case where the input into the network is constant, this trajectory is a fixed point. Moreover, each trajectory associated with a unique input is guaranteed to be unique. Finally, we note that our results are flexible with respect to activation functions so long as they satisfy the slope-restriction condition. 
This flexibility may be useful when, for example, considering recent work showing that standard activation functions may be advantageously replaced by attention mechanisms (Dai et al., 2020).\n\n*Proof.* We begin by writing $\\mathbf{W} = \\mathbf{R} - \\mathbf{P}$ for some unknown $\\mathbf{R} = \\mathbf{R}^T$ and $\\mathbf{P} = \\mathbf{P}^T \\succ 0$. The approach of this proof is to show by construction that the condition $g\\mathbf{W} \\prec \\mathbf{I}$ implies the existence of an $\\mathbf{R}$ and $\\mathbf{P}$ such that the system is contracting in metric $\\mathbf{P}$. We consider the y version of the RNN, which as discussed above is equivalent to the x version via an affine transformation.\n\nConsider the contraction condition:\n\n$$-2\\mathbf{M} + \\mathbf{M}\\mathbf{D}\\mathbf{W} + \\mathbf{W}^T\\mathbf{D}\\mathbf{M} \\preceq -\\beta\\mathbf{M}$$\n\nwith $\\beta > 0$. Substituting $\\mathbf{W} = \\mathbf{R} - \\mathbf{P}$ and $\\mathbf{M} = \\mathbf{P}$, this condition becomes:\n\n$$-2\\mathbf{P} + \\mathbf{P}\\mathbf{D}(\\mathbf{R} - \\mathbf{P}) + (\\mathbf{R} - \\mathbf{P})\\mathbf{D}\\mathbf{P} \\preceq -\\beta\\mathbf{P}$$\n\nSimplifying the terms and collecting them all on one side, the above may be written as:\n\n$$(\\beta - 2)\\mathbf{P} + \\mathbf{R}\\mathbf{D}\\mathbf{P} + \\mathbf{P}\\mathbf{D}\\mathbf{R} - 2\\mathbf{P}\\mathbf{D}\\mathbf{P} \\preceq 0$$\n\nVia the Schur complement, the above condition will be satisfied if:\n\n$$(2 - \\beta)\\mathbf{P} - \\mathbf{R}\\mathbf{D}\\mathbf{P}(2\\mathbf{P}\\mathbf{D}\\mathbf{P})^{-1}\\mathbf{P}\\mathbf{D}\\mathbf{R} = (2 - \\beta)\\mathbf{P} - \\frac{1}{2}(\\mathbf{R}\\mathbf{D}\\mathbf{R}) \\succeq (2 - \\beta)\\mathbf{P} - \\frac{g}{2}(\\mathbf{R}\\mathbf{R}) \\succeq 0$$\n\nWe continue by setting $\\mathbf{P} = \\gamma^2 \\mathbf{R}\\mathbf{R}$ with $\\gamma^2 = \\frac{g}{2(2-\\beta)}$, so that the above inequality is satisfied. 
At this point, we have shown that if $\\mathbf{W}$ can be written as:\n\n$$\\mathbf{W} = \\mathbf{R} - \\gamma^2 \\mathbf{R} \\mathbf{R}$$\n\nthen (1) is contracting in metric $\\mathbf{M} = \\gamma^2 \\mathbf{R}\\mathbf{R}$. What remains to be shown is that if the condition:\n\n$$g\\mathbf{W} - \\mathbf{I} \\prec 0$$\n\nis satisfied, then this implies the existence of an $\\mathbf{R}$ such that the above is true. To show that this is indeed the case, assume that:\n\n$$\\frac{1}{4\\gamma^2}\\mathbf{I} - \\mathbf{W} \\succeq 0$$\n\nSubstituting in the definition of $\\gamma$, this is just the statement that:\n\n$$\\frac{2(2-\\beta)}{4g}\\mathbf{I} - \\mathbf{W} \\succeq 0$$\n\nSetting $\\beta = 2\\lambda > 0$, this yields:\n\n$$(1-\\lambda)\\mathbf{I} \\succeq g\\mathbf{W}$$\n\nwhich holds for sufficiently small $\\lambda > 0$ whenever $g\\mathbf{W} \\prec \\mathbf{I}$. Since $\\mathbf{W}$ is symmetric, we have the eigendecomposition:\n\n$$\\frac{1}{4\\gamma^2}\\mathbf{I} - \\mathbf{W} = \\mathbf{V}(\\frac{1}{4\\gamma^2}\\mathbf{I} - \\mathbf{\\Lambda})\\mathbf{V}^T$$\n\nwhere $\\mathbf{V}^T \\mathbf{V} = \\mathbf{I}$ and $\\mathbf{\\Lambda}$ is a diagonal matrix containing the eigenvalues of $\\mathbf{W}$. Denote the symmetric square-root of this expression as $\\mathbf{S}$:\n\n$$\\mathbf{S} = \\mathbf{V} \\sqrt{(\\frac{1}{4\\gamma^2}\\mathbf{I} - \\mathbf{\\Lambda})} \\mathbf{V}^T = \\mathbf{S}^T$$\n\nwhich implies that:\n\n$$\\frac{1}{4\\gamma^2}\\mathbf{I} - \\mathbf{W} = \\mathbf{S}^T \\mathbf{S}$$\n\nWe now define $\\mathbf{R}$ in terms of $\\mathbf{S}$ as follows:\n\n$$\\mathbf{R} = \\frac{1}{\\gamma}\\mathbf{S} + \\frac{1}{2\\gamma^2}\\mathbf{I}$$\n\nwhich means that:\n\n$$\\frac{1}{4\\gamma^2}\\mathbf{I} - \\mathbf{W} = (\\gamma \\mathbf{R} - \\frac{1}{2\\gamma}\\mathbf{I})(\\gamma \\mathbf{R} - \\frac{1}{2\\gamma}\\mathbf{I})$$\n\nExpanding out the right side, we get:\n\n$$\\frac{1}{4\\gamma^2}\\mathbf{I} - \\mathbf{W} = \\gamma^2 \\mathbf{R} \\mathbf{R} - \\mathbf{R} + \\frac{1}{4\\gamma^2} \\mathbf{I}$$\n\nSubtracting $\\frac{1}{4\\gamma^2}\\mathbf{I}$ from both sides yields:\n\n$$\\mathbf{W} = \\mathbf{R} - \\gamma^2 \\mathbf{R} \\mathbf{R}$$\n\nas desired.\n\n### A4.3 PROOF OF THEOREM 3\n\nTheorem 3. 
*If there exist positive diagonal matrices*  $\\mathbf{P}_1$  *and*  $\\mathbf{P}_2$ *, as well as*  $\\mathbf{Q} = \\mathbf{Q}^T \\succ 0$  *such that*\n\n$$\\mathbf{W} = -\\mathbf{P}_1 \\mathbf{Q} \\mathbf{P}_2$$\n\n*then (1) is contracting in metric*  $\\mathbf{M} = (\\mathbf{P}_1\\mathbf{Q}\\mathbf{P}_1)^{-1}$ *.*\n\n*Proof.* Consider again a differential Lyapunov function:\n\n$$V = \\delta \\mathbf{x}^T \\mathbf{M} \\delta \\mathbf{x}$$\n\nThe time derivative is equal to:\n\n$$\\dot{V} = -2V + \\delta \\mathbf{x}^T \\mathbf{MWD} \\delta \\mathbf{x}$$\n\nSubstituting in the definitions of  $\\mathbf{W}$  and  $\\mathbf{M}$ , we get:\n\n$$\\dot{V} = -2V - \\delta \\mathbf{x}^T \\mathbf{P}_1^{-1} \\mathbf{P}_2 \\mathbf{D} \\delta \\mathbf{x} \\le -2V$$\n\nTherefore  $V$  converges exponentially to zero.\n\n#### A4.4 PROOF OF THEOREM 4\n\nTheorem 4. *If*  $g\\mathbf{W} - \\mathbf{I}$  *is triangular and Hurwitz, then (1) is contracting in a diagonal metric.*\n\nNote that in the case of a triangular weight matrix, the system (1) may be seen as a feedforward (i.e., hierarchical) network. Therefore, this result follows from the combination properties of contracting systems. However, our proof provides a means of explicitly finding a metric for this system.\n\n*Proof.* Without loss of generality, assume that  $\\mathbf{W}$  is lower triangular. This implies that  $W_{ij} = 0$  if  $i < j$ . Now consider the generalized Jacobian:\n\n$$\\mathbf{F} = -\\mathbf{I} + \\mathbf{\\Gamma} \\mathbf{W} \\mathbf{D} \\mathbf{\\Gamma}^{-1}$$\n\nwith  $\\mathbf{\\Gamma}$  diagonal and  $\\Gamma_{ii} = \\epsilon^i$ , where  $\\epsilon > 0$ . Because  $\\mathbf{\\Gamma}$  and  $\\mathbf{D}$  are both diagonal, they commute, and the generalized Jacobian is equal to:\n\n$$\\mathbf{F} = -\\mathbf{I} + \\mathbf{\\Gamma} \\mathbf{W} \\mathbf{\\Gamma}^{-1} \\mathbf{D}$$\n\nNow note that:\n\n$$(\\mathbf{\\Gamma}\\mathbf{W}\\mathbf{\\Gamma}^{-1})_{ij} = \\epsilon^i W_{ij} \\epsilon^{-j} = W_{ij} \\epsilon^{i-j}$$\n\nWhere  $i < j$ , we have  $W_{ij} = 0$  by assumption. Therefore, the only nonzero entries are where  $i \\geq j$ . 
This means that by making  $\\epsilon$  arbitrarily small, we can make  $\\mathbf{\\Gamma}\\mathbf{W}\\mathbf{\\Gamma}^{-1}$  approach a diagonal matrix with  $W_{ii}$  along the diagonal. Therefore, if:\n\n$$\\max_{i} gW_{ii} - 1 < 0$$\n\nthe nonlinear system is contracting. Since  $\\mathbf{W}$  is triangular, the  $W_{ii}$  are the eigenvalues of  $\\mathbf{W}$ , meaning that this condition is equivalent to  $g\\mathbf{W} - \\mathbf{I}$  being Hurwitz.\n\n### A4.5 PROOF OF THEOREM 5\n\nTheorem 5. *If there exists a positive diagonal matrix*  $\\mathbf{P}$  *such that:*\n\n$$g^2 \\mathbf{W}^T \\mathbf{P} \\mathbf{W} - \\mathbf{P} \\prec 0$$\n\n*then (1) is contracting in metric*  $\\mathbf{P}$ *.*\n\nNote that this is equivalent to the discrete-time diagonal stability condition developed in (Revay & Manchester, 2020), for a constant metric. Note also that when  $\\mathbf{M}=\\mathbf{I}$ , Theorem 5 is identical to checking the maximum singular value of  $\\mathbf{W}$ , a previously established condition for stability of equation 1. However, a much larger set of weight matrices is found via the condition when  $\\mathbf{M}=\\mathbf{P}$  instead.\n\n*Proof.* Consider the generalized Jacobian:\n\n$$\\mathbf{F} = \\mathbf{P}^{1/2} \\mathbf{J} \\mathbf{P}^{-1/2} = -\\mathbf{I} + \\mathbf{P}^{1/2} \\mathbf{W} \\mathbf{P}^{-1/2} \\mathbf{D}$$\n\nwhere **D** is a diagonal matrix with  $\\mathbf{D}_{ii} = \\frac{d\\phi_i}{dx_i}$ . 
Using the subadditivity of the matrix measure  $\\mu_2$  of the generalized Jacobian we get:\n\n$$\\mu_2(\\mathbf{F}) \\le -1 + \\mu_2(\\mathbf{P}^{1/2}\\mathbf{W}\\mathbf{P}^{-1/2}\\mathbf{D})$$\n\nNow using the fact that  $\\mu_2(\\cdot) \\leq ||\\cdot||_2$  we have:\n\n$$\\mu_2(\\mathbf{F}) \\le -1 + ||\\mathbf{P}^{1/2}\\mathbf{W}\\mathbf{P}^{-1/2}\\mathbf{D}||_2 \\le -1 + g||\\mathbf{P}^{1/2}\\mathbf{W}\\mathbf{P}^{-1/2}||_2$$\n\nUsing the definition of the 2-norm, requiring this upper bound to be negative, so that  $\\mu_2(\\mathbf{F}) < 0$ , may be written:\n\n$$g^2 \\mathbf{W}^T \\mathbf{P} \\mathbf{W} - \\mathbf{P} \\prec 0$$\n\nwhich completes the proof.\n\n#### A4.6 PROOF OF THEOREM 6\n\n**Theorem 6** (Network of Networks). Consider a collection of p subnetwork RNNs governed by equation 1. Assume that these RNNs each have hidden-to-hidden weight matrices  $\\{\\mathbf{W}_1,\\ldots,\\mathbf{W}_p\\}$  and are independently contracting in metrics  $\\{\\mathbf{M}_1,\\ldots,\\mathbf{M}_p\\}$ . Define the block matrices  $\\tilde{\\mathbf{W}} \\equiv BlockDiag(\\mathbf{W}_1,\\ldots,\\mathbf{W}_p)$  and  $\\tilde{\\mathbf{M}} \\equiv BlockDiag(\\mathbf{M}_1,\\ldots,\\mathbf{M}_p)$ , as well as the overall state vector  $\\tilde{\\mathbf{x}}^T \\equiv (\\mathbf{x}_1^T \\cdots \\mathbf{x}_p^T)$ . 
Then the following 'network of networks' is globally contracting in metric  $\\tilde{\\mathbf{M}}$ :\n\n$$\\tau \\dot{\\tilde{\\mathbf{x}}} = -\\tilde{\\mathbf{x}} + \\tilde{\\mathbf{W}}\\phi(\\tilde{\\mathbf{x}}) + \\mathbf{u}(t) + \\mathbf{L}\\tilde{\\mathbf{x}}$$\n\n$$\\mathbf{L} = \\mathbf{B} - \\tilde{\\mathbf{M}}^{-1}\\mathbf{B}^{T}\\tilde{\\mathbf{M}}$$ (5)\n\nwhere **B** is an arbitrary square matrix.\n\n*Proof.* To see that this overall system is contracting in constant metric  $\\tilde{\\mathbf{M}}$ , consider the corresponding differential Lyapunov equation (Lohmiller & Slotine, 1998):\n\n$$\\lambda_{max}(\\beta \\tilde{\\mathbf{M}} + \\tilde{\\mathbf{M}} \\mathbf{J} + \\mathbf{J}^{T} \\tilde{\\mathbf{M}}) \\leq 0$$\n\n$$\\lambda_{max}(2(\\beta \\tilde{\\mathbf{M}} - \\tilde{\\mathbf{M}} + \\tilde{\\mathbf{M}} \\tilde{\\mathbf{W}} \\tilde{\\mathbf{D}})_{s}) + \\lambda_{max}(\\tilde{\\mathbf{M}} \\mathbf{B} - \\mathbf{B}^{T} \\tilde{\\mathbf{M}}) \\leq 0$$ (6)\n\nwhere **J** is the Jacobian of equation 5 and  $\\tilde{\\mathbf{D}}$  is a diagonal matrix with entries  $\\tilde{D}_{ii} = \\phi'(\\tilde{x}_i)$ , the slopes of the activation functions. The last inequality follows from the original assumption that the individual subnetworks are contracting in metrics  $\\{\\mathbf{M}_1, \\dots, \\mathbf{M}_p\\}$  with rate  $\\beta$ , together with the fact that  $\\tilde{\\mathbf{M}}\\mathbf{L} = \\tilde{\\mathbf{M}}\\mathbf{B} - \\mathbf{B}^{T}\\tilde{\\mathbf{M}}$  is skew-symmetric and so contributes nothing to the symmetric part.\n\n#### A4.7 PROOF OF THEOREM 7\n\n**Theorem 7.** Let **D** be a positive, diagonal matrix with  $D_{ii} = \\frac{d\\phi_i}{dx_i}$ , and let **P** be an arbitrary, positive diagonal matrix. 
If:\n\n$$(g\\mathbf{W} - \\mathbf{I})\\mathbf{P} + \\mathbf{P}(g\\mathbf{W}^T - \\mathbf{I}) \\prec -c\\mathbf{P}$$\n\nand\n\n$$\\dot{\\mathbf{D}} - cg^{-1}\\mathbf{D}^2 \\preceq -\\beta\\mathbf{D}$$\n\nfor  $c, \\beta > 0$ , then (1) is contracting in metric **D** with rate  $\\beta$ .\n\n*Proof.* Consider the differential, quadratic Lyapunov function:\n\n$$V = \\delta \\mathbf{x}^T \\mathbf{P} \\mathbf{D} \\delta \\mathbf{x}$$\n\nwhere  $\\mathbf{D} \\succ 0$  is as defined above. The time derivative of V is:\n\n$$\\dot{V} = \\delta \\mathbf{x}^T \\mathbf{P} \\dot{\\mathbf{D}} \\delta \\mathbf{x} + \\delta \\mathbf{x}^T (-2\\mathbf{P}\\mathbf{D} + \\mathbf{P}\\mathbf{D}\\mathbf{W}\\mathbf{D} + \\mathbf{D}\\mathbf{W}^T \\mathbf{D}\\mathbf{P}) \\delta \\mathbf{x}$$\n\nThe second term on the right can be factored as:\n\n$$\\begin{split} \\delta\\mathbf{x}^T (-2\\mathbf{P}\\mathbf{D} + \\mathbf{P}\\mathbf{D}\\mathbf{W}\\mathbf{D} + \\mathbf{D}\\mathbf{W}^T\\mathbf{D}\\mathbf{P})\\delta\\mathbf{x} &= \\\\ \\delta\\mathbf{x}^T\\mathbf{D} (-2\\mathbf{P}\\mathbf{D}^{-1} + \\mathbf{P}\\mathbf{W} + \\mathbf{W}^T\\mathbf{P})\\mathbf{D}\\delta\\mathbf{x} &\\leq \\\\ \\delta\\mathbf{x}^T\\mathbf{D} (-2\\mathbf{P}g^{-1} + \\mathbf{P}\\mathbf{W} + \\mathbf{W}^T\\mathbf{P})\\mathbf{D}\\delta\\mathbf{x} &= \\\\ \\delta\\mathbf{x}^T\\mathbf{D} [\\mathbf{P}(\\mathbf{W} - g^{-1}\\mathbf{I}) + (\\mathbf{W}^T - g^{-1}\\mathbf{I})\\mathbf{P}]\\mathbf{D}\\delta\\mathbf{x} &\\leq \\\\ -cg^{-1}\\delta\\mathbf{x}^T\\mathbf{P}\\mathbf{D}^2\\delta\\mathbf{x} & \\end{split}$$\n\nwhere the last inequality was obtained by substituting in the first assumption above. 
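\n\nThe first inequality in this chain relies on a bound on the activation slopes. A short sketch of that step, assuming (as the use of  $g$  elsewhere in this appendix suggests) that  $0 < D_{ii} \\leq g$  for all  $i$ :\n\n$$D_{ii}^{-1} \\geq g^{-1} \\;\\Rightarrow\\; \\mathbf{P}\\mathbf{D}^{-1} \\succeq g^{-1}\\mathbf{P} \\;\\Rightarrow\\; -2\\mathbf{P}\\mathbf{D}^{-1} \\preceq -2g^{-1}\\mathbf{P}$$\n\nBoth  $\\mathbf{P}$  and  $\\mathbf{D}$  are positive and diagonal, so the comparison holds entry by entry on the diagonal, and multiplying on both sides by  $\\mathbf{D}$  preserves the ordering of the resulting quadratic forms.\n\n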
Combining this with the expression for  $\\dot{V}$ , we have:\n\n$$\\dot{V} \\leq \\delta \\mathbf{x}^T \\mathbf{P} \\dot{\\mathbf{D}} \\delta \\mathbf{x} - c g^{-1} \\delta \\mathbf{x}^T \\mathbf{P} \\mathbf{D}^2 \\delta \\mathbf{x}$$\n\nSubstituting in the second assumption, we have:\n\n$$\\dot{V} \\le \\delta \\mathbf{x}^T \\mathbf{P} (\\dot{\\mathbf{D}} - cg^{-1} \\mathbf{D}^2) \\delta \\mathbf{x} \\le -\\beta \\delta \\mathbf{x}^T \\mathbf{P} \\mathbf{D} \\delta \\mathbf{x} = -\\beta V$$\n\nand thus V converges exponentially to 0 with rate  $\\beta$ .\n\n#### A4.8 PROOF OF THEOREM 8\n\n**Theorem 8.** Satisfaction of the condition\n\n$$g\\mathbf{W}_{sym} - \\mathbf{I} \\prec 0$$\n\nis **NOT** sufficient to show global contraction of the general nonlinear RNN (1) in any constant metric. High levels of antisymmetry in  $\\mathbf{W}$  can make it impossible to find such a metric, which we demonstrate via a  $2\\times 2$  counterexample of the form\n\n$$\\mathbf{W} = \\begin{bmatrix} 0 & -c \\\\ c & 0 \\end{bmatrix}$$\n\nwith  $c \\geq 2$ .\n\nNote that  $g\\mathbf{W}_{sym} - \\mathbf{I} = g\\frac{\\mathbf{W} + \\mathbf{W}^T}{2} - \\mathbf{I} \\prec 0$  is equivalent to the condition for contraction of the system with *linear* activation in the identity metric.\n\nThe main intuition behind this counterexample is that high levels of antisymmetry can prevent a constant metric from being found in the nonlinear system. This is because  $\\mathbf{D}$  is a diagonal matrix with values between 0 and 1, so the primary functionality it can have in the symmetric part of the Jacobian is to downweight the outputs of certain neurons selectively. In the extreme case of all 0 or 1 values, we can think of this as selecting a subnetwork of the original network, and taking each of the remaining neurons to be single unit systems receiving input from the subnetwork. 
For a given static configuration of  $\\mathbf{D}$  (think linear gains), this is a hierarchical system that will be stable if the subnetwork is stable. But as  $\\mathbf{D}$  can evolve over time when a nonlinearity is introduced, we would need to find a constant metric that can serve completely distinct hierarchical structures simultaneously, which is not always possible.\n\nPut in terms of matrix algebra,  $\\mathbf{D}$  can zero out columns of  $\\mathbf{W}$ , but not their corresponding rows. So for a given weight pair  $w_{ij}$ ,  $w_{ji}$ , which has entry  $\\frac{w_{ij} + w_{ji}}{2}$  in  $\\mathbf{W}_{sym}$ , if  $D_i = 0$  and  $D_j = 1$ , the i, j entry in  $(\\mathbf{WD})_{sym}$  will be guaranteed to have lower magnitude if the signs of  $w_{ij}$  and  $w_{ji}$  are the same, but guaranteed to have higher magnitude if the signs are different. Thus, if the linear system would be stable based on magnitudes alone,  $\\mathbf{D}$  poses no real threat, but if the linear system requires antisymmetry to be stable,  $\\mathbf{D}$  can make proving contraction quite complicated (if possible at all).\n\n*Proof.* The nonlinear system is globally contracting in a *constant* metric if there exists a symmetric, positive definite  $\\mathbf{M}$  such that the symmetric part of the Jacobian for the system,  $(\\mathbf{MWD})_{sym} - \\mathbf{M}$ , is negative definite uniformly. 
Therefore  $(\\mathbf{MWD})_{sym} - \\mathbf{M} \\prec 0$  must hold for all possible  $\\mathbf{D}$  if  $\\mathbf{M}$  is a constant metric the system *globally* contracts in with any allowed activation function, as some combination of settings to obtain a particular  $\\mathbf{D}$  can always be found.\n\nThus to prove the main claim, we present here a simple 2-neuron system that is contracting in the identity metric with linear activation function, but can be shown to have no  $\\mathbf{M}$  that simultaneously satisfies the  $(\\mathbf{MWD})_{sym} - \\mathbf{M} \\prec 0$  condition for two different possible  $\\mathbf{D}$  matrices.\n\nTo begin, take\n\n$$\\mathbf{W} = \\begin{bmatrix} 0 & -2 \\\\ 2 & 0 \\end{bmatrix}$$\n\nNote that any off-diagonal magnitude  $\\geq 2$  would work, as this is the point at which  $\\frac{1}{2}$  of one of the weights (found in  $\\mathbf{W}_{sym}$  when the other is zeroed) will have magnitude too large for  $(\\mathbf{WD})_{sym} - \\mathbf{I}$  to be stable.\n\nLooking at the linear system, we can see it is contracting in the identity because\n\n$$\\mathbf{W}_{sym} - \\mathbf{I} = \\begin{bmatrix} -1 & 0\\\\ 0 & -1 \\end{bmatrix} \\prec 0$$\n\nNow consider  $(\\mathbf{MWD})_{sym} - \\mathbf{M}$  with  $\\mathbf{D}$  taking two possible values of\n\n$$\\mathbf{D}_1 = \\begin{bmatrix} 1 & 0 \\\\ 0 & 0 \\end{bmatrix} \\quad and \\quad \\mathbf{D}_2 = \\begin{bmatrix} 0 & 0 \\\\ 0 & 1 \\end{bmatrix}$$\n\nWe want to find some symmetric, positive definite  $\\mathbf{M} = \\begin{bmatrix} a & m \\\\ m & b \\end{bmatrix}$  such that  $(\\mathbf{MWD}_1)_{sym} - \\mathbf{M}$  and  $(\\mathbf{MWD}_2)_{sym} - \\mathbf{M}$  are both negative definite.\n\nWorking out the matrix multiplication, we get\n\n$$(\\mathbf{MWD}_1)_{sym} - \\mathbf{M} = \\begin{bmatrix} 2m - a & b - m \\\\ b - m & -b \\end{bmatrix}$$\n\nand\n\n$$(\\mathbf{MWD}_2)_{sym} - \\mathbf{M} = \\begin{bmatrix} -a & -(a+m) \\\\ -(a+m) & -2m-b \\end{bmatrix}$$\n\nWe can now 
check necessary conditions for negative definiteness on these two matrices, as well as for positive definiteness on  $\\mathbf{M}$ , to try to find an  $\\mathbf{M}$  that will satisfy all these conditions simultaneously. In this process we will reach a contradiction, showing that no such  $\\mathbf{M}$  can exist.\n\nA necessary condition for positive definiteness in a real, symmetric  $n \\times n$  matrix  $\\mathbf{X}$  is  $x_{ii} > 0$ , and for negative definiteness  $x_{ii} < 0$ . Another well known necessary condition for definiteness of a real symmetric matrix is  $|x_{ii} + x_{jj}| > |x_{ij} + x_{ji}| = 2|x_{ij}| \\;\\; \\forall i \\neq j$ . See (Weisstein) for more info on these conditions.\n\nThus we will require  $a$  and  $b$  to be positive, and can identify the following conditions as necessary for our 3 matrices to all meet the requisite definiteness conditions:\n\n$$2m < a \\tag{7}$$\n\n$$-2m < b \\tag{8}$$\n\n$$|2m - (a+b)| > 2|b-m| \\tag{9}$$\n\n$$|-2m - (a+b)| > 2|a+m| \\tag{10}$$\n\nNote that the necessary condition for  $\\mathbf{M}$  to be PD,  $a + b > 2|m|$ , is not listed, as it is automatically satisfied if equation 7 and equation 8 are.\n\nIt is easy to see that if  $m = 0$ , equation 9 and equation 10 will result in the contradictory conditions  $a > b$  and  $b > a$  respectively, so we will require a metric with off-diagonal elements. To make the absolute values easier to deal with, we will check the  $m > 0$  and  $m < 0$  cases independently.\n\nFirst we take  $m > 0$ . By condition equation 7 we must have  $a > 2m$ , so between that and knowing the signs of all unknowns are positive, we can reduce many of the absolute values. Condition equation 9 becomes  $a + b - 2m > |2b - 2m|$ , and condition equation 10 becomes  $a + b + 2m > 2a + 2m$ , which is equivalent to  $b > a$ . If  $b > a$  we must also have  $b > m$ , so condition equation 9 further reduces to  $a + b - 2m > 2b - 2m$ , which is equivalent to  $a > b$ . Therefore we have again reached contradictory conditions.\n\nA very similar approach can be applied when  $m < 0$ . 
Using condition equation 8 and the known signs we reduce condition equation 9 to  $2|m| + a + b > 2b + 2|m|$ , i.e.  $a > b$ . Meanwhile condition equation 10 works out to  $a + b - 2|m| > 2a - 2|m|$ , i.e.  $b > a$ .\n\nTherefore it is impossible for a single constant  $\\mathbf{M}$  to accommodate both  $\\mathbf{D}_1$  and  $\\mathbf{D}_2$ , so that no constant metric can exist for  $\\mathbf{W}$  to be contracting in when a nonlinearity is introduced that can possibly have derivative reaching both of these configurations. One real world example of such a nonlinearity is ReLU. Given a sufficiently high negative input to one of the units and a sufficiently high positive input to the other,  $\\mathbf{D}$  can reach one of these configurations. The targeted inputs could then flip at any time to reach the other configuration.\n\nAn additional condition we could impose on the activation function is to require it to be a strictly increasing function, so that the activation function derivative can never actually reach 0. We will now show that a very similar counterexample applies in this case, by taking\n\n$$\\mathbf{D}_{1*} = \\begin{bmatrix} 1 & 0 \\\\ 0 & \\epsilon \\end{bmatrix} \\quad \\text{and} \\quad \\mathbf{D}_{2*} = \\begin{bmatrix} \\epsilon & 0 \\\\ 0 & 1 \\end{bmatrix}$$\n\nNote here that the  $\\mathbf{W}$  used above produced a  $(\\mathbf{WD})_{sym} - \\mathbf{I}$  that just barely avoided being negative definite with the original  $\\mathbf{D}_1$  and  $\\mathbf{D}_2$ , so we will have to increase the values on the off-diagonals a bit for this next example. In fact anything with magnitude larger than 2 will have some  $\\epsilon > 0$  that will cause a constant metric to be impossible, but for simplicity we will now take\n\n$$\\mathbf{W}_* = \\begin{bmatrix} 0 & -4 \\\\ 4 & 0 \\end{bmatrix}$$\n\nNote that with  $\\mathbf{W}_*$ , even just halving one of the off-diagonals while keeping the other intact will produce a  $(\\mathbf{WD})_{sym} - \\mathbf{I}$  that is not negative definite. Anything less than halving however will keep the identity metric valid. 
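\n\nThe halving claim can be checked directly. The following is a worked sketch, assuming  $\\mathbf{D} = \\mathrm{diag}(1, \\epsilon)$  with  $0 < \\epsilon \\leq 1$  (the other ordering gives the same eigenvalues):\n\n$$(\\mathbf{W}_*\\mathbf{D})_{sym} - \\mathbf{I} = \\begin{bmatrix} -1 & 2(1-\\epsilon) \\\\ 2(1-\\epsilon) & -1 \\end{bmatrix}, \\qquad \\lambda_{\\pm} = -1 \\pm 2(1-\\epsilon)$$\n\nThe larger eigenvalue  $\\lambda_+ = 1 - 2\\epsilon$  is negative only when  $\\epsilon > 0.5$ , so the identity metric remains valid in that range and fails for  $\\epsilon \\leq 0.5$ .\n\n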
Therefore, we expect that taking  $\\epsilon$  in  $\\mathbf{D}_{1*}$  and  $\\mathbf{D}_{2*}$  to be in the range  $0.5 \\geq \\epsilon > 0$  will also cause issues when trying to obtain a constant metric.\n\nWe will now actually show via a similar proof to the above that  $\\mathbf{M}$  is impossible to find for  $\\mathbf{W}_*$  when  $\\epsilon \\leq 0.5$ . This result is compelling because it not only shows that  $\\epsilon$  does not need to be a particularly small value, but it also drives home the point about antisymmetry: the larger in magnitude the antisymmetric weights are, the larger the  $\\epsilon$  where we will begin to encounter problems.\n\nWorking out the matrix multiplication again, we now get\n\n$$(\\mathbf{M}\\mathbf{W}_*\\mathbf{D}_{1*})_{sym} - \\mathbf{M} = \\begin{bmatrix} 4m - a & 2b - m - 2a\\epsilon \\\\ 2b - m - 2a\\epsilon & -4m\\epsilon - b \\end{bmatrix}$$\n\nand\n\n$$(\\mathbf{M}\\mathbf{W}_*\\mathbf{D}_{2*})_{sym} - \\mathbf{M} = \\begin{bmatrix} 4m\\epsilon - a & -(2a + m - 2b\\epsilon) \\\\ -(2a + m - 2b\\epsilon) & -4m - b \\end{bmatrix}$$\n\nresulting in two new main necessary conditions:\n\n$$|4m - a - b - 4m\\epsilon| > 2|2b - m - 2a\\epsilon| \\tag{11}$$\n\n$$|4m\\epsilon - a - b - 4m| > 2|2a + m - 2b\\epsilon| \\tag{12}$$\n\nas well as new conditions on the diagonal elements:\n\n$$4m - a < 0 \\tag{13}$$\n\n$$-4m - b < 0 \\tag{14}$$\n\nWe will now proceed with trying to find  $a, b, m$  that can simultaneously meet all conditions, setting  $\\epsilon = 0.5$  for simplicity.\n\nLooking at  $m = 0$ , we can see again that  $\\mathbf{M}$  will require off-diagonal elements, as condition equation 11 is now equivalent to the condition  $a + b > |4b - 2a|$  and condition equation 12 is similarly now equivalent to  $a + b > |4a - 2b|$ .\n\nEvaluating these conditions in more detail, if we assume  $4b > 2a$  and  $4a > 2b$ , we can remove the absolute value and the conditions work out to the contradicting  $3a > 3b$  and  $3b > 3a$  respectively. 
As an aside, if  $\\epsilon > 0.5$ , this would no longer be the case, whereas with  $\\epsilon < 0.5$ , the conditions would be pushed even further in opposite directions.\n\nIf we instead assume  $2a > 4b$ , this means  $4a > 2b$ , so the latter condition would still lead to  $b > a$ , contradicting the original assumption of  $2a > 4b$ . The case  $2b > 4a$  causes a contradiction analogously. Trying  $4b = 2a$  will lead to the other condition becoming  $b > 2a$ , once again a contradiction. Thus a diagonal  $\\mathbf{M}$  is impossible.\n\nSo now we again break down the conditions into  $m > 0$  and  $m < 0$  cases, first looking at  $m > 0$ . Using condition equation 13 and knowing all unknowns have positive sign, condition equation 11 reduces to  $a + b - 2m > |4b - 2(a + m)|$  and condition equation 12 reduces to  $a + b + 2m > |4a - 2(b - m)|$ . This looks remarkably similar to the  $m = 0$  case, except now condition equation 11 has  $-2m$  added to both sides (inside the absolute value), and condition equation 12 has  $2m$  added to both sides in the same manner. If  $4b > 2(a + m)$  the  $-2m$  term on each side will simply cancel, and similarly if  $4a > 2(b-m)$  the  $+2m$  terms will cancel, leaving us with the same contradictory conditions as before.\n\nTherefore we check  $2(a+m) > 4b$ . This rearranges to  $2a > 2(2b-m) > 2(b-m)$ , so that from condition equation 12 we get  $b > a$ . Substituting condition equation 13 into  $2(a+m) > 4b$  gives  $8b < 4a+4m < 5a$ , i.e.  $b<\\frac{5}{8}a$ , a contradiction. The analogous issue arises if trying  $2(b-m) > 4a$ . Trying  $2(a+m) = 4b$  gives  $m = 2b-a$ , which in condition equation 12 results in  $5b-a > 2a+2b$ , i.e.  $b > a$ , while condition equation 13 leads to  $5a > 8b$ , i.e.  $b<\\frac{5}{8}a$ . These two requirements contradict each other. Thus there is no way for  $m > 0$  to work.\n\nFinally, trying  $m < 0$ , we now use condition equation 14 and the signs of the unknowns to reduce condition equation 11 to  $a+b+2|m| > |4b-2(a-|m|)|$  and condition equation 12 to  $a+b-2|m| > |4a-2(b+|m|)|$ . 
These two conditions are clearly directly analogous to those in the  $m > 0$  case, where  $b$  now plays the role of  $a$ , with condition equation 14 being  $b > 4|m|$ . Therefore the proof is complete.\n\n#### A5 CODE FOR MODELS USED IN EXPERIMENTS\n\n```\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom pytorch_lightning import LightningModule\n\n# generate_initial_W_M and create_mask_given_A are helper functions that are not shown in this listing.\n\n# class definitions for the sparse combo networks\nclass rnnAssemblyCell_Thm1(LightningModule):\n    \"\"\"\n    Pytorch module for training the following system:\n    tau*dx/dt = -x + W*phi(x) + L*x + u(t)\n    where tau > 0, phi is a nonlinearity, W is block diagonal,\n    L is some 'contracting' combination matrix and u is some input.\n    \"\"\"\n    def __init__(self, input_size, hidden_sizes, output_size, alpha, A, density, pre_select_mult, post_select_mult):\n        super(rnnAssemblyCell_Thm1, self).__init__()\n        self.input_size = input_size\n        self.hidden_sizes = hidden_sizes\n        self.hidden_size = int(np.sum(hidden_sizes))\n        self.output_size = output_size\n        self.alpha = alpha\n        # Trainable input, output and bias weights\n        self.weight_ih = nn.Parameter(torch.normal(0, 1/np.sqrt(self.hidden_size), (self.hidden_size, self.input_size)))\n        self.weight_ho = nn.Parameter(torch.normal(0, 1/np.sqrt(self.hidden_size), (self.output_size, self.hidden_size)))\n        self.bias_oh = nn.Parameter(torch.normal(0, 1/np.sqrt(self.hidden_size), (1, self.output_size)))\n        self.bias_hh = nn.Parameter(torch.normal(0, 1/np.sqrt(self.hidden_size), (1, self.hidden_size)))\n\n        #self.register_buffer(\"V_mask\", create_mask_given_A(torch.eye(len(ns)),ns).bool())\n        # Specify W and M here based on the random initialization mentioned\n        W, M = generate_initial_W_M(self.hidden_sizes, density, pre_select_mult, post_select_mult)\n        M_inv = torch.inverse(M)\n        # W, M and M_inv are fixed buffers, not trained\n        self.register_buffer(\"W\", W)\n        self.register_buffer(\"M\", M)\n        self.register_buffer(\"M_inv\", M_inv)\n        L_mask = create_mask_given_A(A, self.hidden_sizes).bool()\n        self.register_buffer(\"L_mask\", L_mask)\n        # Only the masked inter-subnetwork weights are trained\n        self.L_train = nn.Parameter(self.L_mask*torch.normal(0, 1/np.sqrt(np.mean(self.hidden_sizes)), (self.hidden_size, self.hidden_size)))\n\n    def forward(self, input):\n        L_masked = self.L_train*self.L_mask\n        state = torch.zeros((input.shape[0], self.hidden_size), device=self.device)\n        state = state.type_as(state)\n        inputs = input.unbind(1)\n        for i in range(len(inputs)):\n            # Euler step of the dynamics, with L = B - M^{-1} B^T M applied to the state\n            fx = -state + F.relu(state @ self.W.T + inputs[i] @ (self.weight_ih.T) + self.bias_hh) + state @ (L_masked.T - (self.M @ L_masked) @ self.M_inv)\n            state = state + self.alpha*fx\n        hy = state @ (self.weight_ho.T)\n        #outputs += [hy]\n        return hy, state\n```\n\nFigure S7: Pytorch Lightning code for Sparse Combo Net Cell\n\nFigure S8: Pytorch Lightning code for SVD Combo Net cell.\n\n### A5.1 NETWORK DIAGRAMS\n\n![](_page_31_Figure_1.jpeg)\n\nFigure S9: Detailed architecture diagrams for Sparse Combo Net (A) and SVD Combo Net (B)."
  }
]