# Backwards Time Investigation Environment Design Document

## Background

The Backwards Time Investigation Environment casts the agent as a detective who must investigate in reverse chronological order. Each episode begins at the moment when a crime's immediate effects are visible and fully documented: for example, a theft has been discovered, evidence collected, and suspects initially identified. The true underlying cause, the original triggering event that set the criminal sequence in motion, remains hidden earlier in the timeline. The agent moves backwards through this fixed temporal sequence, uncovering progressively earlier events and establishing causal relationships to reconstruct the complete chain of events. The environment thus models investigations in which outcomes are known but their origins must be uncovered by systematically examining antecedent conditions.

## Objective

The agent's primary goal is to correctly identify the unique root cause event, specifically termed "the original crime trigger," before the investigation timeline reaches its earliest boundary at time index 0. This identification must be precise, encompassing three critical components: the exact time index when the triggering event occurred, the specific perpetrator who initiated the sequence, and the particular action they performed. Success requires not merely collecting clues or eliminating suspects, but synthesizing discovered information into a complete and accurate reconstruction of the causal chain that led to the observed criminal outcome.

## State Setup

The environment initializes each episode with a predetermined timeline whose time indices run in descending order from 40 to 0, giving the agent at most 40 backward steps. At episode start, the agent observes the state at time index 40, where the crime's immediate effects are fully visible and documented. The initial state includes a crime scene report detailing all discovered evidence, witness statements describing observed activities, and a preliminary suspect list of 4-6 individuals with varying degrees of apparent involvement. The environment also establishes several unresolved effects: specific inconsistencies or unexplained elements that demand causal explanation, such as unexplained access to secured areas, missing objects with unclear disappearance timing, or contradictions in witness testimony. Each level maintains consistent complexity by incorporating exactly one true root cause, 6-8 essential clues distributed across the timeline, and 2-3 plausible decoy suspects whose involvement appears significant but ultimately proves tangential to the core criminal act.
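The structural parameters above can be captured in a small configuration object. The following is a minimal sketch, not the environment's actual implementation: the names `EpisodeConfig`, `START_INDEX`, `END_INDEX`, and `validate` are hypothetical, and only the numeric ranges come from this section.

```python
from dataclasses import dataclass

START_INDEX = 40  # index observed at episode start (from this section)
END_INDEX = 0     # earliest boundary of the timeline (from this section)


@dataclass(frozen=True)
class EpisodeConfig:
    """Hypothetical container for one episode's structural parameters."""
    num_suspects: int          # documented range: 4-6
    num_essential_clues: int   # documented range: 6-8
    num_decoys: int            # documented range: 2-3
    num_root_causes: int = 1   # documented: exactly one

    def validate(self) -> None:
        """Raise ValueError if any count falls outside its documented range."""
        checks = {
            "num_suspects": (self.num_suspects, 4, 6),
            "num_essential_clues": (self.num_essential_clues, 6, 8),
            "num_decoys": (self.num_decoys, 2, 3),
            "num_root_causes": (self.num_root_causes, 1, 1),
        }
        for name, (value, low, high) in checks.items():
            if not low <= value <= high:
                raise ValueError(f"{name}={value} outside [{low}, {high}]")


# Example: a mid-range episode passes validation silently.
cfg = EpisodeConfig(num_suspects=5, num_essential_clues=7, num_decoys=2)
cfg.validate()
```

Validating at construction time is one possible design choice; a level generator could equally enforce these ranges when sampling.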

## Actions

The agent operates through five investigative action types, each designed to reveal or organize information from earlier points in the timeline. A separate terminal action, IdentifyRootCause, submits the final answer and is covered under Rewards and Termination.

- ExamineScene(location): investigates a specific location during the time period immediately preceding the current observation point, uncovering physical evidence, environmental clues, or traces of previous activity that were not initially apparent.
- InterrogatePerson(person): provides access to an individual's activities, knowledge, and movements during the earlier time period, revealing their actions and potentially their awareness of other participants' behavior.
- TraceObject(object): follows the ownership history and movement of a specific item, document, or tool backwards through time, exposing how it came to be in its observed position.
- ConnectClues(clueA, clueB): proposes a causal relationship between two discovered pieces of evidence. The environment definitively validates or rejects the proposal based on its consistency with the true timeline.
- JumpEarlier(steps): moves the agent 1-3 time indices further back without performing an investigative action, which is useful when current evidence suggests that more productive investigation opportunities lie earlier in the timeline.
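One way to represent the five investigative actions described above is as small immutable record types. This is an illustrative sketch only: the field names (`location`, `person`, `obj`, `clue_a`, `clue_b`, `steps`) and the helper `is_well_formed` are assumptions, and the only constraint checked is the 1-3 step range stated for JumpEarlier.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ExamineScene:
    location: str


@dataclass(frozen=True)
class InterrogatePerson:
    person: str


@dataclass(frozen=True)
class TraceObject:
    obj: str


@dataclass(frozen=True)
class ConnectClues:
    clue_a: str
    clue_b: str


@dataclass(frozen=True)
class JumpEarlier:
    steps: int


Action = Union[ExamineScene, InterrogatePerson, TraceObject, ConnectClues, JumpEarlier]


def is_well_formed(action: Action) -> bool:
    """Check only the constraint this section states: a jump covers 1-3 indices."""
    if isinstance(action, JumpEarlier):
        return 1 <= action.steps <= 3
    return True


# Example: a two-step jump is acceptable, a four-step jump is not.
print(is_well_formed(JumpEarlier(steps=2)), is_well_formed(JumpEarlier(steps=4)))
```

Whether a location, person, or object actually exists in the episode cannot be checked from the action alone, so that part of validation would have to live in the environment.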

## State Transition Rule

State transitions follow one principle: every agent action targets the timeline period immediately before the current observation point, so the investigation only ever moves backwards. When an ExamineScene or InterrogatePerson action executes, the environment reveals information that existed during the targeted earlier period, adds the discoveries to the persistent timeline ledger, and moves the agent's current time index one step back (decreasing it by 1). TraceObject actions uncover ownership chains and movement histories, which may reveal connections spanning multiple earlier time periods, and update the timeline ledger accordingly. ConnectClues actions are evaluated immediately: valid causal relationships are permanently recorded in the timeline ledger and become part of the established fact base, while invalid connections are rejected without affecting the ledger. JumpEarlier actions decrease the time index by the specified number of steps without revealing new information, allowing rapid movement to potentially more productive investigation periods. Throughout all transitions, the environment maintains strict logical consistency: newly revealed information never contradicts previously validated causal relationships, and the timeline remains causally coherent.
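The time-index bookkeeping in the rule above can be sketched as a single function. Note the assumption: this section states the one-step cost only for ExamineScene and InterrogatePerson, so charging TraceObject and ConnectClues one step as well is an interpretation based on the "one action opportunity per time index" constraint under Special Features, not something the document confirms. The function name `next_time_index` and the string-based action kinds are hypothetical.

```python
# Actions assumed to cost exactly one time index (see the lead-in for caveats).
ONE_STEP_ACTIONS = {"ExamineScene", "InterrogatePerson", "TraceObject", "ConnectClues"}


def next_time_index(current: int, action_kind: str, steps: int = 1) -> int:
    """Return the time index after one action, never going below 0."""
    if action_kind == "JumpEarlier":
        if not 1 <= steps <= 3:
            raise ValueError("JumpEarlier moves 1-3 indices")
        delta = steps
    elif action_kind in ONE_STEP_ACTIONS:
        delta = 1
    else:
        raise ValueError(f"unknown action kind: {action_kind}")
    # Index 0 is the earliest boundary, so clamp there.
    return max(0, current - delta)


# Example: examining a scene at index 40, then jumping three indices back.
t = next_time_index(40, "ExamineScene")   # 39
t = next_time_index(t, "JumpEarlier", 3)  # 36
print(t)
```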

## Rewards

This environment employs a binary reward structure, delivering either complete success or complete failure without intermediate reinforcement. The agent receives a reward of +1 exclusively when it correctly identifies all three components of the original crime trigger: the precise time index of occurrence, the specific perpetrator involved, and the exact action performed. This identification must be submitted through a formal IdentifyRootCause action and successfully validated by the environment's internal truth model. All other outcomes, including partial correctness, near-miss identifications, or failure to submit an identification before timeline exhaustion, result in a reward of 0. No intermediate rewards are provided for clue discovery, suspect elimination, or causal relationship establishment, requiring the agent to develop strategies that optimize for complete problem resolution rather than incremental progress indicators.
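The all-or-nothing rule above reduces to an exact comparison of three fields. The sketch below assumes the root cause can be represented as a `(time_index, perpetrator, action)` triple compared by strict equality; the names `RootCause` and `binary_reward` are hypothetical, and the environment's internal truth model may compare submissions differently.

```python
from typing import NamedTuple, Optional


class RootCause(NamedTuple):
    time_index: int
    perpetrator: str
    action: str


def binary_reward(submitted: Optional[RootCause], truth: RootCause) -> int:
    """Return 1 only when all three components match; otherwise 0."""
    if submitted is None:  # no identification before the timeline ran out
        return 0
    return 1 if submitted == truth else 0


# Example: the right perpetrator and action at the wrong time index earns nothing.
truth = RootCause(time_index=7, perpetrator="clerk", action="copied the key")
near_miss = RootCause(time_index=8, perpetrator="clerk", action="copied the key")
print(binary_reward(truth, truth), binary_reward(near_miss, truth))
```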

## Observation

Agent observations are structured to support pattern recognition and strategic decision-making while preserving the investigative challenge. At each time step, the agent receives the current time index; a timeline ledger showing all revealed events organized by occurrence time, with validated causal relationships clearly marked; and a dynamic list of unresolved effects that still require causal explanation, each described with the type of evidence that might address it. The observation also includes an inventory of all collected clues, categorized by type and discovery location, with metadata indicating their potential relevance to different aspects of the investigation. The current suspect list shows each individual's known involvement level, supporting evidence, and a likelihood score based on discovered connections. Finally, the observation provides action feedback: what new information the most recent action revealed, whether a proposed causal connection was validated or rejected, and how the overall investigation state changed as a result. This design lets agents track their progress, understand the consequences of their actions, and identify productive next steps, while still requiring them to synthesize complex temporal and causal relationships.
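The observation components listed above could be grouped into one structured record. This is a rough sketch under assumed names: every class and field here (`Clue`, `Suspect`, `Observation`, `events_latest_first`, and so on) is hypothetical, and the document does not specify concrete types or encodings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Clue:
    clue_id: str
    kind: str       # clue type, e.g. "physical" or "testimony"
    found_at: str   # discovery location


@dataclass
class Suspect:
    name: str
    involvement: str
    supporting_clues: List[str] = field(default_factory=list)
    likelihood: float = 0.0


@dataclass
class Observation:
    time_index: int
    # Revealed events keyed by the time index at which they occurred.
    ledger: Dict[int, List[str]] = field(default_factory=dict)
    # Pairs of clue ids whose causal link the environment has validated.
    validated_links: List[Tuple[str, str]] = field(default_factory=list)
    unresolved_effects: List[str] = field(default_factory=list)
    clues: List[Clue] = field(default_factory=list)
    suspects: List[Suspect] = field(default_factory=list)
    feedback: Optional[str] = None  # result of the most recent action

    def events_latest_first(self) -> List[Tuple[int, str]]:
        """Flatten the ledger, ordered from the latest time index to the earliest."""
        return [(t, e) for t in sorted(self.ledger, reverse=True) for e in self.ledger[t]]


obs = Observation(time_index=38, ledger={37: ["door forced"], 39: ["alarm raised"]})
print(obs.events_latest_first())
```

Ordering the ledger latest-first mirrors the order in which the agent discovers events; an implementation could just as reasonably present it chronologically.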

## Termination

Episodes terminate under three conditions. First, the episode ends immediately when the agent submits an IdentifyRootCause action, regardless of its accuracy, with the binary reward determined by the correctness of the identification. Second, the episode terminates automatically when the time index reaches 0, which exhausts the available investigation timeline; because any submission would already have ended the episode, this outcome always carries a reward of 0. Third, the episode terminates immediately with a reward of 0 if the agent attempts an invalid action, such as examining a location that does not exist, interrogating a person not present in the timeline, or proposing a causal connection between incompatible evidence types. Together, these conditions push agents toward both accuracy and efficiency: they must balance thorough investigation against the shrinking timeline while keeping every action valid.
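The three conditions above can be expressed as one check that returns a `(done, reward)` pair. This is a minimal sketch with hypothetical names; it assumes `time_index` is the index after the current action has been applied, and it checks the invalid-action case first, an ordering the document does not specify.

```python
from typing import Tuple


def check_termination(
    time_index: int,
    action_valid: bool,
    submitted: bool,
    submission_correct: bool = False,
) -> Tuple[bool, int]:
    """Return (done, reward) for the state reached after one action."""
    if not action_valid:
        return True, 0  # invalid action: immediate termination, reward 0
    if submitted:
        # IdentifyRootCause always ends the episode; reward depends on correctness.
        return True, 1 if submission_correct else 0
    if time_index <= 0:
        return True, 0  # timeline exhausted without a submission
    return False, 0     # episode continues


# Example: a correct submission at index 12 ends the episode with reward 1.
print(check_termination(12, action_valid=True, submitted=True, submission_correct=True))
```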

## Special Features

The environment incorporates several unique mechanics that distinguish it from conventional investigation scenarios while supporting consistent agent learning. The backwards causality system ensures that effects are always observed before their causes are revealed, requiring agents to develop reverse reasoning capabilities and hypothesis-driven investigation strategies. Conservation of truth maintains that once a causal relationship is validated and recorded in the timeline ledger, it cannot be contradicted by subsequently revealed information, providing reliable anchoring points for agent reasoning and ensuring logical consistency across the entire timeline. The limited horizon constraint creates strategic pressure by providing exactly one action opportunity per time index, requiring agents to balance thorough investigation of current evidence against the need to explore earlier time periods. Level consistency guarantees that all episodes maintain identical structural complexity with one root cause, comparable clue distribution patterns, and similar suspect pool characteristics, ensuring that successful strategies developed on one level transfer effectively to others. Finally, the deterministic truth mapping ensures that the relationship between correct causal reconstruction and reward achievement remains constant across all episodes, providing the stable learning foundation necessary for effective reinforcement learning while randomizing surface-level details to prevent memorization and encourage generalizable investigation skills.