Revision History for Oracle-Efficient Reinforcement Learning for Max Value Ensembles

Camera Ready Revision Edit by Authors

  • 15 Jan 2025, 16:33 Coordinated Universal Time
  • Title: Oracle-Efficient Reinforcement Learning for Max Value Ensembles
  • Authors: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
  • Authorids: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
  • Keywords: Reinforcement Learning Theory, Ensembling, Max-Following, Learning Theory
  • TLDR: We provide an efficient algorithm to learn an approximate max-following policy using K constituent policies in large state spaces.
  • Abstract:

    Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of base or constituent policies (possibly heuristic) upon which we would like to improve in a scalable manner. In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds.

  • PDF: pdf
  • Supplementary Material: zip
  • Primary Area: reinforcement_learning

    Edit Info


    Readers: Everyone
    Writers: NeurIPS 2024 Conference, NeurIPS 2024 Conference Submission13146 Authors
    Signatures: NeurIPS 2024 Conference Submission13146 Authors
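
To make the notion concrete, the max-following policy described in the abstract above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's algorithm: it assumes the K constituent policies and per-policy value estimates are available as callables, and all names below are hypothetical (in the paper, the value estimates are learned via an ERM oracle rather than given).

```python
# Hypothetical names throughout; a sketch of the idea, not the paper's algorithm.

def max_following_action(state, constituent_policies, value_estimates):
    """Act as the max-following policy at `state`.

    constituent_policies: list of K callables, each mapping a state to an action.
    value_estimates:      list of K callables; value_estimates[k](state) is an
                          estimate of V^{pi_k}(state) for constituent policy k.
                          (The paper learns these with an ERM oracle; here they
                          are simply assumed to be given.)
    """
    k_star = max(range(len(value_estimates)), key=lambda k: value_estimates[k](state))
    return constituent_policies[k_star](state)


# Toy usage: two constant-action policies on a scalar state.
policies = [lambda s: "left", lambda s: "right"]
values = [lambda s: -s, lambda s: s]  # "right" looks better whenever s > 0
print(max_following_action(0.5, policies, values))  # -> right
```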

Camera Ready Revision Edit by Authors

  • 15 Jan 2025, 15:58 Coordinated Universal Time
  • Title: Oracle-Efficient Reinforcement Learning for Max Value Ensembles
  • Authors: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
  • Authorids: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
  • Keywords: Reinforcement Learning Theory, Ensembling, Max-Following, Learning Theory
  • TLDR: We provide an efficient algorithm to learn an approximate max-following policy using K constituent policies in large state spaces.
  • Abstract:

    Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of base or constituent policies (possibly heuristic) upon which we would like to improve in a scalable manner. In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds.

  • PDF: pdf
  • Supplementary Material: zip
  • Primary Area: reinforcement_learning

    Edit Info


    Readers: Everyone
    Writers: NeurIPS 2024 Conference, NeurIPS 2024 Conference Submission13146 Authors
    Signatures: NeurIPS 2024 Conference Submission13146 Authors

Camera Ready Revision Edit by Authors

  • 15 Jan 2025, 15:35 Coordinated Universal Time
  • Title: Oracle-Efficient Reinforcement Learning for Max Value Ensembles
  • Authors: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
  • Authorids: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
  • Keywords: Reinforcement Learning Theory, Ensembling, Max-Following, Learning Theory
  • TLDR: We provide an efficient algorithm to learn an approximate max-following policy using K constituent policies in large state spaces.
  • Abstract:

    Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of base or constituent policies (possibly heuristic) upon which we would like to improve in a scalable manner. In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds.

  • PDF: pdf
  • Supplementary Material: zip
  • Primary Area: reinforcement_learning

    Edit Info


    Readers: Everyone
    Writers: NeurIPS 2024 Conference, NeurIPS 2024 Conference Submission13146 Authors
    Signatures: NeurIPS 2024 Conference Submission13146 Authors

Camera Ready Revision Edit by Authors

  • 15 Jan 2025, 01:26 Coordinated Universal Time
  • Title: Oracle-Efficient Reinforcement Learning for Max Value Ensembles
  • Authors: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
  • Authorids: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
  • Keywords: Reinforcement Learning Theory, Ensembling, Max-Following, Learning Theory
  • TLDR: We provide an efficient algorithm to learn an approximate max-following policy using K constituent policies in large state spaces.
  • Abstract:

    Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of base or constituent policies (possibly heuristic) upon which we would like to improve in a scalable manner. In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds.

  • PDF: pdf
  • Supplementary Material: zip
  • Primary Area: reinforcement_learning

    Edit Info


    Readers: Everyone
    Writers: NeurIPS 2024 Conference, NeurIPS 2024 Conference Submission13146 Authors
    Signatures: NeurIPS 2024 Conference Submission13146 Authors