Abstract: Venture capital (VC) returns are highly skewed: most investments underperform, while a few yield outsized gains. Accurately predicting startup success is thus crucial. Graph-based models confirm the value of structural signals but offer limited reasoning, whereas large language models (LLMs) provide strong reasoning and broad knowledge yet hallucinate without domain grounding. A core challenge is therefore to align LLM reasoning with explicit multi-hop graph paths and fuse these paths with unstructured evidence. Classical retrieval-augmented generation (RAG) mitigates this via textual evidence but overlooks high-order investor-company relations. Embedding-based graph RAG encodes such relations while discarding the explicit chains LLMs exploit. We propose MIRAGE-VC, a multi-perspective RAG framework for VC prediction. Our approach couples semantic retrieval with an information-gain–guided, stepwise path retriever that selects a compact set of cross-typed paths as explicit evidence. Specialized agents analyze heterogeneous sources, and a learnable gate weights their signals before a final LLM decision. On a real-world VC dataset, MIRAGE-VC achieves state-of-the-art performance with a 5.0% relative F1 gain and a 16.6% relative Precision@5 gain over the best baseline. Our implementation is available.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: financial/business NLP, retrieval-augmented generation, startup success prediction
Contribution Types: NLP engineering experiment
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: We provide the model used in Appendix A.5
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: We explain in the Ethical Considerations section that our data is used under a license from PitchBook.
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: We provide details of the data used in the Ethical Considerations section.
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: N/A
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: We provide an estimate of the computational cost of our approach in Appendix A.6.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: We provide more detailed parameter settings in Appendices A.3, A.4, and A.5.
C3 Descriptive Statistics: Yes
C3 Elaboration: We explain in Section 5.5 that our experimental results are averaged over multiple calls.
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: We used ChatGPT-4o-mini to help polish Section 4 and refine figure captions; all final content was author-edited and verified.
Author Submission Checklist: yes
Submission Number: 347