Abstract: One of the core challenges in building general reasoning systems lies in generating diverse, human-aligned solution trajectories: different yet valid paths by which a problem can be solved. Prior approaches often rely on handcrafted templates, rule-based augmentations, or human demonstrations, which are limited in scalability and stylistic diversity. To address this, we explore the use of Generative Flow Networks (GFlowNets) for automated solution augmentation in reasoning tasks. We propose a framework that learns to generate diverse reasoning trajectories with probabilities proportional to their quality, guided by a human-inspired reward function and a novel geometric forward policy. This enables the generation of multiple plausible solution paths without manual supervision. Moreover, our method supports efficient test-time augmentation from input-output examples alone, without access to ground-truth programs or external demonstrations, making it suitable for zero-shot settings. We evaluate our framework on the Abstraction and Reasoning Corpus (ARC-AGI), a benchmark designed to test compositional and abstract reasoning. Our results show that GFlowNets can effectively explore the space of valid reasoning processes, producing a variety of plausible reasoning trajectories, much as different individuals might solve the same problem through different intermediate steps. These trajectories are generated at scale (over 100k per task in under an hour) and follow a logarithmic yield trend, enabling practical tradeoffs between augmentation volume and novelty. Furthermore, fine-tuning a large language model (LLaMA 3.1 Instruct 8B) on these synthetic trajectories leads to a 28.6% improvement in reasoning accuracy on ARC tasks, demonstrating the downstream utility of our method. These findings suggest that GFlowNets offer a promising foundation for modeling structured reasoning in automated trajectory generation. Our code is available at: https://anonymous.4open.science/r/GFN_to_ARC-B500/
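For readers unfamiliar with GFlowNets, the sketch below illustrates the mechanics the abstract names: a forward policy trained with the trajectory-balance objective so that sampling probability becomes proportional to reward, plus a fixed per-step stop probability (one plausible reading of a "geometric forward policy", since it makes trajectory lengths geometrically distributed). The toy 8-action DSL, vector state, deterministic transition, and reward here are all hypothetical stand-ins, not the paper's ARC environment or actual reward function.

```python
# Minimal sketch of GFlowNet trajectory-balance (TB) training with a
# geometrically terminated forward policy. The 8-action "DSL", the toy
# vector state, the transition, and the reward are hypothetical
# stand-ins, NOT the paper's ARC environment or reward.
import torch
import torch.nn as nn

N_ACTIONS, STATE_DIM = 8, 16
STOP = N_ACTIONS              # extra index for the terminate action
P_STOP = 0.2                  # fixed stop mass => geometric trajectory lengths

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, N_ACTIONS + 1))
log_Z = nn.Parameter(torch.zeros(1))   # learned log-partition function
opt = torch.optim.Adam([*policy.parameters(), log_Z], lr=1e-3)

def apply_action(state, action):
    # toy deterministic transition; a real env would edit an ARC grid
    nxt = state.clone()
    nxt[action % STATE_DIM] += 1.0
    return nxt

def sample_trajectory(max_len=20):
    state, log_pf, length = torch.zeros(STATE_DIM), torch.zeros(1), 0
    for _ in range(max_len):
        p = torch.softmax(policy(state), dim=-1)
        # mix in a constant stop probability: one reading of a
        # "geometric forward policy"
        p = torch.cat([(1 - P_STOP) * p[:STOP],
                       (1 - P_STOP) * p[STOP:] + P_STOP])
        a = torch.multinomial(p, 1).item()
        log_pf = log_pf + torch.log(p[a])
        if a == STOP:
            break
        state, length = apply_action(state, a), length + 1
    return state, log_pf, length

for _ in range(500):
    state, log_pf, length = sample_trajectory()
    # placeholder "human-inspired" reward: favor short trajectories
    # that make progress (progress = sum of the toy state)
    log_reward = state.sum().log1p() - 0.5 * length
    # deterministic transitions => sum log P_B = 0, so TB reduces to:
    loss = (log_Z + log_pf - log_reward).pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the stop mass is constant per step, lengths under this policy follow a (truncated) geometric distribution, biasing generation toward the shorter solutions that a length-penalized reward also favors.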
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: ---
### 1. Strengthened Analyses and Empirical Evidence
- **Comparison with Human Trajectories (Figure 5):** To empirically support our proposed heuristics, we have added **Figure 5 (page 12)**, which directly compares the length distribution of our GFlowNet-generated trajectories with those from actual human solvers. (Responds to Reviewer VQao)
- **In-depth Analysis of Exploration Depth (Section 4.2.2):** To provide a more balanced perspective beyond the 'shorter is better' assumption, the analysis from the former Appendix B.3 has been integrated into the main body as **Section 4.2.2 (page 18)**. (Responds to Reviewer VQao)
- **Analysis of Cycle Constraint (Table 7):** To address the comment on the marginal utility of the Cycle constraint, we have added **Table 7 (page 17)** and its accompanying analysis, demonstrating that explicitly controlling for cycles is more effective than relying on length penalties alone. (Responds to Reviewer VQao)
- **Competitive Analysis (Appendix D):** To clarify the novelty of our work, we have created a new **Appendix D (page 38)**, which provides a qualitative and quantitative comparison of our framework against other ARC datasets and augmentation methods. (Responds to Reviewer XpcR)
- **Downstream LLM Validation (Section 4.3):** As the most significant change, we have added a new section that empirically validates the practical utility of our generated trajectories. **Section 4.3 (page 22)** demonstrates that fine-tuning a LLaMA 3.1 8B model on our data leads to a **28.6% improvement in accuracy** on ARC tasks, confirming the downstream value of our framework. (Responds to Reviewers XpcR and 43UG)
- **In-depth Efficiency and Scalability Analysis (Appendix E):** We have added a new **Appendix E (page 40)** that provides a deep dive into our framework's performance. It includes a quantitative analysis of generation speed (**Table 22**) and of the logarithmic yield of diverse trajectories (**Figure 16**), highlighting the method's practical scalability; a toy sketch of this yield measurement follows below. (Responds to all reviewers by strengthening the paper's claims)
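To make the logarithmic-yield measurement concrete, the sketch below counts distinct trajectories as a function of samples drawn. The random sampler and the 10k checkpoint interval are placeholders; in the paper this would draw from the trained GFlowNet's forward policy.

```python
# Hypothetical yield-curve measurement: how many DISTINCT trajectories
# have appeared after n samples? Under a logarithmic yield trend, the
# unique count grows roughly linearly in log(n).
import math
import random

def sample_trajectory():
    # placeholder sampler over a toy 8-action DSL; stands in for the
    # trained GFlowNet's forward policy
    return tuple(random.randrange(8) for _ in range(random.randint(1, 6)))

seen = set()
for n in range(1, 100_001):
    seen.add(sample_trajectory())
    if n % 10_000 == 0:
        print(f"samples={n:>7}  unique={len(seen):>7}  "
              f"unique/log(n)={len(seen) / math.log(n):.1f}")
```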
---
### 2. Clarifications and Revisions to Framing
- **Refined Framing and Tone:** In response to feedback that our claims about 'human-like reasoning' could be overstated, we have revised the phrasing throughout the paper from 'mimicking deep human strategies' to **'capturing syntactic features of efficient problem-solving'**. (Responds to Reviewer VQao)
- **Strengthened AGI Motivation:** We have expanded the relevant discussion in the **Introduction (page 2)** to clarify that our goal is to model **'diverse reasoning processes'**. (Responds to Reviewer 43UG)
- **Clarified Experimental Setup:** To improve the transparency and reproducibility of our experiments, we now clearly describe the specific reward functions used in **RQ2 and RQ3 (pages 19-21)**. (Responds to Reviewer VQao)
- **Emphasis on 'Test-Time Augmentability':** We have highlighted our framework's core advantage—the ability to generate solutions for unseen tasks on the fly—throughout the **Abstract, Introduction, and Methods (Section 3.1)** to better differentiate our work. (Responds to all reviewers by clarifying contribution)
- **Updated List of Contributions:** To reflect our new analyses, we have added **'Test-Time Efficiency and Yield Analysis'** as a fourth point to the list of **Contributions (page 3)**.
---
### 3. Final Corrections for Consistency
- **Corrections to Table Numbering and Task IDs:** We have corrected minor inconsistencies introduced during revision, including updated appendix table numbers and mislabeled Task IDs from the rebuttal, to ensure the final manuscript's accuracy.
Assigned Action Editor: ~Sungsoo_Ahn1
Submission Number: 4808