Abstract: We propose a structured framework for generating and evaluating synthetic intelligence reports using large language models (LLMs), specifically GPT-3.5 and GPT-4. Our approach integrates recursive prompting with symbolic and spatial grounding via knowledge graphs and map metadata to produce multi-perspective, JSON-formatted reports that emulate real-world intelligence workflows. To assess quality, we conduct a human evaluation using a rubric based on five analytic dimensions: clarity, objectivity, comprehensiveness, rigor, and relevance. Results show that GPT-4 produces more coherent and reliable outputs than GPT-3.5, while GPT-3.5, when scaffolded with structured input, performs competitively in analytical depth and relevance. Our framework extends prior LLM benchmarks by targeting long-form synthesis and structured reasoning in complex, mission-oriented domains.
Paper Type: Long
Research Area: Generation
Research Area Keywords: Natural Language Generation, Evaluation and Evaluation Metrics, Generation, Resources and Evaluation, Large Language Models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 4173