# GLARE: Scalable Neuro-Symbolic Reward Shaping for LLM Agents via Group-Level Automata

This repository contains the implementation code for **GLARE**, a scalable neuro-symbolic reward shaping framework for Large Language Model (LLM) agents, utilizing group-level automata.

> **Note:** This codebase is currently provided for reviewer verification of our work's authenticity. Upon acceptance, the repository will be fully reorganized and polished.

## Overview
Reinforcement Learning (RL) with Group Relative Policy Optimization (GRPO) shows great promise for enhancing LLM reasoning, but it remains challenged by sparse and unstable rewards in long-horizon tasks. Existing approaches to reward shaping struggle to balance semantic expressiveness, reliability, and computational efficiency: heuristic rules lack flexibility, while LLM-as-a-Judge incurs high computational cost and suffers from inconsistent and misaligned scoring signals in long-context settings. To address these challenges, we introduce GLARE, a neuro-symbolic reward framework that decouples semantic abstraction from credit assignment. Specifically, to leverage semantic understanding while avoiding unstable direct scoring, we employ a small LLM to extract semantic events and translate them into Linear Temporal Logic (LTL) formulas that symbolically formalize the reward logic. These formulas are compiled into deterministic automata that perform reward shaping over the trajectories, yielding dense and consistent reward signals under long horizons while significantly reducing computational cost. Empirical results on ALFWorld show that GLARE outperforms GRPO by 12.1% in success rate, while achieving an 8.1% improvement over conventional LLM-based judges using only 15% of their computational cost.
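
The compile-then-shape idea can be illustrated with a minimal sketch. It assumes a hand-written three-state automaton for the hypothetical formula `F(pick & F(place))` rather than one produced by Spot; the event names and the helpers `step`, `DIST`, and `shape_trace` are illustrative and are not part of this repository.

```python
from typing import Iterable, List, Set, Tuple

# Hand-written deterministic automaton for the hypothetical LTL formula
# F(pick & F(place)): "eventually pick, and from then on eventually place".
# State 0 = nothing observed, 1 = picked, 2 = placed after picking (accepting).
# DIST[s] is the number of transitions still needed to reach acceptance.
DIST = {0: 2, 1: 1, 2: 0}


def step(state: int, events: Set[str]) -> int:
    """Advance the automaton by one trajectory step."""
    if state == 0 and "pick" in events:
        state = 1
    if state == 1 and "place" in events:
        state = 2
    return state


def shape_trace(trace: Iterable[Set[str]], bonus: float = 0.1) -> Tuple[List[float], int]:
    """Return one shaped reward per step plus the final automaton state."""
    state, rewards = 0, []
    for events in trace:
        nxt = step(state, events)
        # Reward each step by how much closer it moved to the accepting state.
        rewards.append(bonus * (DIST[state] - DIST[nxt]))
        state = nxt
    return rewards, state


# Events as a small LLM might extract them from four agent steps (made up here).
trace = [set(), {"pick"}, set(), {"place"}]
rewards, final_state = shape_trace(trace)
print(rewards, final_state == 2)
```

In this sketch only the steps that advance the automaton receive a non-zero reward, so the signal is dense along the trajectory and deterministic for a given event sequence.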

## Project Structure

The core logic lives mainly in the `trpo/` directory, which is named after the project's early internal codename.

```text
.
├── trpo/
│   ├── core_trpo.py          # Core implementation of the TRPO/GLARE algorithm
│   ├── core_reward.py        # Reward calculation logic
│   ├── ltl_reward_engine.py  # LTL formula processing and automata operations (via Spot)
│   ├── llm_client_factory.py # Factory for asynchronous LLM client interactions
│   ├── trace_utils.py        # Utilities for handling and reconstructing agent traces
│   ├── aps/
│   │   └── llm_extract_aps.py # Neuro-symbolic extractor for Atomic Propositions
│   ├── eval/                 # Evaluation scripts for APs and LTL formulas
│   ├── prompt/               # Prompt templates for LLM agents (e.g., ALFWorld)
│   └── trpo_trainer/         # Shell scripts to launch training and experiments
└── verl/                     # Integration with the VeRL (Volcano RL) trainer
```

## Dependencies

This project is built upon **verl-agent**. To run this code, you must have the `verl` environment set up.

Key dependencies include:
- `verl-agent` (Base RL training framework)
- `spot` (For LTL and automata manipulation)
- `torch`, `numpy`
- `vllm` (For efficient LLM inference)

## Usage

### Training

To reproduce the experiments on ALFWorld, use the provided shell scripts in `trpo/trpo_trainer/`.

Example command to start training with the GLARE configuration:

```bash
bash trpo/trpo_trainer/run_alfworld_trpo.sh
```

### Configuration Details

The training script allows customization of the reward shaping mechanism:
- `algorithm.trpo.ltl_reward`: Positive reward for satisfying LTL properties.
- `algorithm.trpo.ltl_penalty`: Penalty for violating properties.
- `algorithm.trpo.ltl_trend_reward`: Reward for progressing towards the goal state in the automaton.
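
One way the three coefficients might combine into a per-step shaping term is sketched below. This is an assumption for illustration only: the `LTLRewardConfig` class, the `step_reward` function, the default values, and the convention that the penalty is given as a positive magnitude are all hypothetical and may differ from what `core_reward.py` does.

```python
from dataclasses import dataclass


@dataclass
class LTLRewardConfig:
    # Field names mirror the config keys above; the defaults are made-up values.
    ltl_reward: float = 1.0
    ltl_penalty: float = 1.0
    ltl_trend_reward: float = 0.1


def step_reward(cfg: LTLRewardConfig, dist_before: int, dist_after: int,
                accepted: bool, violated: bool) -> float:
    """Shaping term for one automaton transition (hypothetical combination rule).

    dist_before / dist_after: distance to the accepting state before and after
    the transition; accepted / violated: whether the transition satisfied or
    irrecoverably violated the LTL property.
    """
    if violated:
        # Assumption: the penalty is stored as a positive magnitude.
        return -cfg.ltl_penalty
    reward = 0.0
    if dist_after < dist_before:
        # Progress towards the goal state in the automaton.
        reward += cfg.ltl_trend_reward
    if accepted:
        reward += cfg.ltl_reward
    return reward


cfg = LTLRewardConfig()
progress = step_reward(cfg, dist_before=2, dist_after=1, accepted=False, violated=False)
success = step_reward(cfg, dist_before=1, dist_after=0, accepted=True, violated=False)
failure = step_reward(cfg, dist_before=1, dist_after=1, accepted=False, violated=True)
print(progress, success, failure)
```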

