# Entropy Reversal Engineering Environment Design Document

## Background

The Entropy Reversal Engineering Environment simulates a complex clustered system where an agent must manipulate the fundamental forces of order and chaos across interconnected domains. The setting represents an advanced engineering facility managing four critical subsystems: the Thermal Grid responsible for temperature regulation, the Data Archive handling information storage and processing, the Crystal Lattice maintaining structural integrity, and the Bio-Habitat supporting life systems. These domains exist in a state of thermodynamic coupling where any manipulation of entropy in one system creates cascading effects throughout the entire network. The agent takes on the role of a system engineer who has discovered techniques to locally reverse entropy by expending energy, but must carefully balance these interventions to prevent catastrophic system failures while achieving global optimization objectives.

## Objective

The agent must raise the Global Order Score from its starting value near 0 to a target of +100 while keeping every domain stable. The Global Order Score is the cumulative excess order across all domains: the sum of the four domain Order levels minus a baseline of 200. Stability means keeping the Chaos level in each of the four domains at or below 70 units; exceeding this threshold in any domain risks triggering catastrophic instability. The agent therefore has to maximize global order while respecting local stability constraints, and it must reach the target within a strict limit of 40 operational steps.
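
As a minimal sketch of the score definition and stability constraint described above, the following Python computes both from per-domain values. The function names, dictionary keys, and constant names are illustrative assumptions and not part of the specification.

```python
from typing import Mapping

BASELINE_ORDER = 200  # baseline subtracted from the summed domain order (from the text)
TARGET_SCORE = 100    # target Global Order Score (from the text)
CHAOS_LIMIT = 70      # per-domain stability threshold (from the text)


def global_order_score(order: Mapping[str, int]) -> int:
    """Sum of the domain Order levels minus the baseline."""
    return sum(order.values()) - BASELINE_ORDER


def is_stable(chaos: Mapping[str, int]) -> bool:
    """True when every domain's Chaos level is at or below the limit."""
    return all(level <= CHAOS_LIMIT for level in chaos.values())


# Hypothetical domain keys; four domains at Order 50 sit exactly on the baseline.
order = {"thermal_grid": 50, "data_archive": 50, "crystal_lattice": 50, "bio_habitat": 50}
print(global_order_score(order))  # 0
```

Under this definition the target of +100 corresponds to a combined Order of 300 across the four domains.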

## State Setup

The environment initializes with four domains, each containing three primary state variables that determine system behavior. The Thermal Grid, Data Archive, Crystal Lattice, and Bio-Habitat each begin with an Order value between 40 and 60 units, Energy reserves between 80 and 120 units, and a Chaos level between 20 and 40 units. Because the four Order values sum to between 160 and 240, the starting Global Order Score falls between -40 and +40, that is, near zero, while each domain retains sufficient operational headroom. The shared Entropy Tokens reservoir initializes with 200 units, representing the available energy budget for entropy manipulation operations. Global state tracking includes the Global Order Score derived from the individual domain Order levels, the current step count starting at zero, and the system-wide coupling effects that propagate changes across domains. Each domain maintains its own state while participating in the coupled dynamics that create the core challenge of balancing local interventions against global consequences.
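
The setup above can be sketched as a small reset routine. This is an illustration under stated assumptions: the text gives only the ranges, so uniform integer sampling, the `DomainState` and `EnvState` classes, the `reset` function, and the domain keys are all hypothetical choices.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical identifiers for the four domains named in the text.
DOMAINS = ("thermal_grid", "data_archive", "crystal_lattice", "bio_habitat")


@dataclass
class DomainState:
    order: int
    energy: int
    chaos: int


@dataclass
class EnvState:
    domains: Dict[str, DomainState]
    entropy_tokens: int = 200  # shared reservoir starts at 200 units
    step: int = 0              # step count starts at zero


def reset(seed: Optional[int] = None) -> EnvState:
    """Sample an initial state; uniform integer sampling is an assumption."""
    rng = random.Random(seed)
    domains = {
        name: DomainState(
            order=rng.randint(40, 60),    # Order: 40 to 60
            energy=rng.randint(80, 120),  # Energy: 80 to 120
            chaos=rng.randint(20, 40),    # Chaos: 20 to 40
        )
        for name in DOMAINS
    }
    return EnvState(domains=domains)


state = reset(seed=0)
start_score = sum(d.order for d in state.domains.values()) - 200
```

Seeding the generator keeps initial configurations reproducible, which fits the document's emphasis on deterministic dynamics.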

## Actions

The agent selects one primary action per step from five available intervention types, each with discrete parameter options and specific targeting requirements. Inject Energy operations allow the agent to add 10, 20, or 30 units to the shared Entropy Tokens reservoir by targeting a specific domain, providing the energy budget necessary for other operations while introducing manageable chaos increases to the selected domain. Reverse Entropy actions represent the core entropy manipulation capability, allowing the agent to spend 5, 10, or 15 Entropy Tokens to directly increase order in a target domain, though this process generates chaos spillover effects in non-targeted domains. Redistribute Order operations enable the agent to transfer 5 or 10 units of order between domains without direct chaos penalties, providing a mechanism for rebalancing order distribution across the system. Vent Chaos actions allow the agent to reduce chaos levels in a target domain by 5, 10, or 15 units by expending equivalent Entropy Tokens from the shared reservoir, offering direct chaos management capabilities. Lock-down operations provide temporary system isolation for a target domain, preventing cross-domain coupling effects for one step at the cost of 10 Entropy Tokens, creating opportunities for strategic intervention timing.
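
To make the size of the discrete action space concrete, the sketch below enumerates every action implied by the parameter options above. The `Action` tuple, the action-kind strings, and the domain keys are hypothetical, and the enumeration assumes that Redistribute Order requires distinct source and destination domains.

```python
from itertools import permutations, product
from typing import List, NamedTuple, Optional

# Hypothetical identifiers for the four domains named in the text.
DOMAINS = ("thermal_grid", "data_archive", "crystal_lattice", "bio_habitat")


class Action(NamedTuple):
    kind: str
    target: str                   # destination domain for redistribute_order
    amount: int = 0               # unused for lock_down (fixed 10-token cost)
    source: Optional[str] = None  # only used by redistribute_order


def enumerate_actions() -> List[Action]:
    """List every discrete action under the assumptions stated above."""
    actions: List[Action] = []
    single_target = (
        ("inject_energy", (10, 20, 30)),
        ("reverse_entropy", (5, 10, 15)),
        ("vent_chaos", (5, 10, 15)),
    )
    for kind, amounts in single_target:
        actions += [Action(kind, d, a) for d, a in product(DOMAINS, amounts)]
    # Ordered (source, destination) pairs with source != destination (assumed).
    for (src, dst), amount in product(permutations(DOMAINS, 2), (5, 10)):
        actions.append(Action("redistribute_order", dst, amount, src))
    actions += [Action("lock_down", d) for d in DOMAINS]
    return actions


print(len(enumerate_actions()))  # 64
```

Under these assumptions the count breaks down as 12 Inject Energy, 12 Reverse Entropy, 12 Vent Chaos, 24 Redistribute Order, and 4 Lock-down actions.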

## State Transition Rule

State transitions follow deterministic rules that govern both direct action effects and coupled system dynamics. Inject Energy increases the Entropy Tokens reservoir by the specified amount and raises chaos in the target domain by one-tenth of the injected energy, rounded up (1, 2, or 3 units for injections of 10, 20, or 30). Reverse Entropy consumes the specified number of Entropy Tokens to increase order in the target domain by an equal amount, while distributing chaos increases to all other domains based on half the tokens consumed, rounded up. Redistribute Order transfers the specified amount of order from the source domain to the destination domain without an immediate chaos penalty, though domains with chaos above 60 may suffer additional order degradation from instability cascades. Vent Chaos reduces the target domain's chaos by the specified amount when sufficient Entropy Tokens are available; a venting attempt that fails for lack of tokens instead results in doubled chaos increases in the target domain. Lock-down suspends cross-domain coupling effects for the targeted domain during the current step and consumes its fixed token cost. At the end of each step, automatic coupling adds chaos to each domain based on the average chaos level of the other domains, so the agent must continually manage system-wide entropy propagation.
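
The sketch below illustrates three of these rules: energy injection, entropy reversal, and end-of-step coupling. It fills several gaps with assumptions that the text does not settle: each non-target domain receives the full rounded-up half of the tokens spent (rather than a share of it), the coupling strength is a hypothetical one-tenth of the other domains' average chaos, a locked-down domain is only exempt from receiving coupling chaos, and values are not clamped. Venting, redistribution, and failure cases are omitted because their exact magnitudes are not specified.

```python
from typing import Dict, FrozenSet, Tuple

COUPLING_DIVISOR = 10  # hypothetical: coupling adds 1/10 of the others' average chaos


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, avoiding floating-point error."""
    return -(-a // b)


def inject_energy(chaos: Dict[str, int], tokens: int, target: str,
                  amount: int) -> Tuple[Dict[str, int], int]:
    """Add tokens; the target gains chaos equal to one-tenth of the amount, rounded up."""
    new_chaos = dict(chaos)
    new_chaos[target] += ceil_div(amount, 10)
    return new_chaos, tokens + amount


def reverse_entropy(order: Dict[str, int], chaos: Dict[str, int], tokens: int,
                    target: str, cost: int) -> Tuple[Dict[str, int], Dict[str, int], int]:
    """Spend tokens to raise the target's order; other domains gain spillover chaos."""
    if cost > tokens:
        raise ValueError("insufficient Entropy Tokens")  # failure handling is unspecified
    new_order, new_chaos = dict(order), dict(chaos)
    new_order[target] += cost
    spill = ceil_div(cost, 2)  # assumed to apply in full to each non-target domain
    for name in new_chaos:
        if name != target:
            new_chaos[name] += spill
    return new_order, new_chaos, tokens - cost


def couple(chaos: Dict[str, int], locked: FrozenSet[str] = frozenset()) -> Dict[str, int]:
    """End-of-step coupling; locked-down domains receive no coupling chaos (assumed)."""
    new_chaos = dict(chaos)
    for name in chaos:
        if name in locked:
            continue
        others = [chaos[o] for o in chaos if o != name]
        new_chaos[name] += ceil_div(sum(others), COUPLING_DIVISOR * len(others))
    return new_chaos
```

Each function returns new dictionaries instead of mutating its inputs, which makes it straightforward to compare a candidate action's outcome against the current state before committing to it.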

## Rewards

The environment employs a cumulative reward structure that provides continuous feedback for agent learning while maintaining clear success criteria. Order Creation rewards provide 2 points for each net unit increase in the Global Order Score, directly incentivizing progress toward the primary objective while allowing agents to understand the value of different order-increasing strategies. Stability Sustained rewards grant 1 point each step when all four domains maintain chaos levels at or below 50 units, encouraging proactive chaos management and system stability maintenance. Efficient Reversal bonuses award 0.5 points when entropy reversal operations succeed without pushing any domain's chaos level above 60, promoting careful timing and strategic intervention planning. Negative feedback includes Chaos Spike penalties of 3 points whenever any domain's chaos exceeds 70 units, providing clear signals about dangerous system states that threaten mission success. Domain Collapse events trigger severe 25-point penalties when chaos levels exceed 90 units, reflecting the catastrophic nature of system failures while immediately terminating the episode. Goal Achievement provides a substantial 50-point bonus when the Global Order Score reaches or exceeds 100 units, clearly marking successful mission completion and immediately ending the episode to prevent unnecessary risk-taking after success.
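
The reward components above can be combined into a single per-step function, sketched below. Several details are assumptions: a decrease in the Global Order Score is taken to earn no Order Creation reward (rather than a negative one), the Chaos Spike penalty is applied once per step instead of once per offending domain, and a collapse incurs both the spike and collapse penalties. The function and parameter names are hypothetical.

```python
from typing import Mapping


def step_reward(prev_score: int, new_score: int, chaos: Mapping[str, int],
                reversal_succeeded: bool = False) -> float:
    """Per-step reward under the assumptions stated in the lead-in."""
    levels = list(chaos.values())
    reward = 0.0
    # Order Creation: 2 points per net unit increase (no penalty for decreases, assumed).
    reward += 2.0 * max(0, new_score - prev_score)
    # Stability Sustained: all four domains at or below 50 chaos.
    if all(c <= 50 for c in levels):
        reward += 1.0
    # Efficient Reversal: a successful reversal with no domain above 60 chaos.
    if reversal_succeeded and all(c <= 60 for c in levels):
        reward += 0.5
    # Chaos Spike: any domain above 70 chaos (applied once per step, assumed).
    if any(c > 70 for c in levels):
        reward -= 3.0
    # Domain Collapse: any domain above 90 chaos.
    if any(c > 90 for c in levels):
        reward -= 25.0
    # Goal Achievement: score at or above 100.
    if new_score >= 100:
        reward += 50.0
    return reward


calm = {"thermal_grid": 30, "data_archive": 30, "crystal_lattice": 30, "bio_habitat": 30}
print(step_reward(0, 10, calm, reversal_succeeded=True))  # 21.5
```

In the usage line, a 10-unit score gain contributes 20 points, the stability reward adds 1, and the efficient reversal bonus adds 0.5.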

## Observation

The environment is fully observable: the agent sees every state variable, so the difficulty comes from system complexity rather than hidden information. For each of the four domains, the observation includes the current Order level (0 to 100 units), available Energy reserves (0 to 200 units), and current Chaos level (0 to 100 units). Global observations provide the calculated Global Order Score, the current level of the shared Entropy Tokens reservoir, and the episode step count, which let the agent track progress toward the objective and manage resources. Domain-specific data is presented in a standardized format, with the global metrics that drive strategic planning highlighted alongside it. The state carries enough detail for the agent to check action preconditions, predict the immediate effects of interventions, and identify patterns in coupling behavior. Because the representation is consistent from step to step, the agent can correlate its actions with the resulting state changes and develop strategies for trading off local interventions against global stability.
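
One possible encoding of this observation is a flat vector of 15 raw values: three per domain followed by the three global metrics. The field ordering, the nested-dictionary input format, and the `build_observation` name are assumptions made for illustration; the text does not prescribe a layout or any normalization.

```python
from typing import Dict, List

# Hypothetical identifiers for the four domains named in the text.
DOMAINS = ("thermal_grid", "data_archive", "crystal_lattice", "bio_habitat")
BASELINE_ORDER = 200  # baseline used by the Global Order Score


def build_observation(domains: Dict[str, Dict[str, int]],
                      entropy_tokens: int, step: int) -> List[int]:
    """Flatten the full state into [order, energy, chaos] * 4 + [score, tokens, step]."""
    obs: List[int] = []
    for name in DOMAINS:  # fixed domain order keeps the layout consistent across steps
        d = domains[name]
        obs.extend((d["order"], d["energy"], d["chaos"]))
    score = sum(domains[name]["order"] for name in DOMAINS) - BASELINE_ORDER
    obs.extend((score, entropy_tokens, step))
    return obs


example = {name: {"order": 50, "energy": 100, "chaos": 30} for name in DOMAINS}
print(len(build_observation(example, entropy_tokens=200, step=0)))  # 15
```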

## Termination

Episodes conclude under three distinct conditions that capture both success and failure scenarios while maintaining appropriate time constraints for learning efficiency. Success termination occurs immediately when the Global Order Score reaches or exceeds 100 units, representing achievement of the primary objective and triggering the substantial goal completion reward before ending the episode. Failure termination results from catastrophic system collapse when any domain's chaos level exceeds 90 units, reflecting the critical importance of maintaining system stability throughout the optimization process and immediately ending the episode with associated penalties. Time-based termination occurs after exactly 40 steps regardless of current system state, creating pressure for efficient decision-making while ensuring episodes conclude within reasonable computational bounds for training purposes. The termination conditions work together to create clear success criteria, meaningful failure consequences, and bounded episode lengths that support effective reinforcement learning while maintaining the challenging nature of the optimization task.
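
The three termination conditions can be sketched as a single check run after each step. The order in which the conditions are tested when more than one holds on the same step is not stated in the text; checking collapse first is an arbitrary assumption here, as are the function name and the returned labels.

```python
from typing import Mapping, Optional

TARGET_SCORE = 100   # success threshold for the Global Order Score
COLLAPSE_CHAOS = 90  # a domain collapses when its chaos exceeds this level
MAX_STEPS = 40       # episode time limit


def termination_reason(score: int, chaos: Mapping[str, int],
                       steps_taken: int) -> Optional[str]:
    """Return why the episode ends, or None if it continues."""
    if any(level > COLLAPSE_CHAOS for level in chaos.values()):
        return "collapse"  # checked first by assumption
    if score >= TARGET_SCORE:
        return "success"
    if steps_taken >= MAX_STEPS:
        return "time_limit"
    return None


chaos = {"thermal_grid": 45, "data_archive": 60, "crystal_lattice": 30, "bio_habitat": 90}
print(termination_reason(score=40, chaos=chaos, steps_taken=12))  # None
```

Note that a Chaos level of exactly 90 does not end the episode, since collapse requires the level to exceed 90.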

## Special Features

The Entropy Reversal Engineering Environment incorporates several mechanics that distinguish it from traditional optimization challenges while keeping its rules consistent and learnable across configurations. The entropy coupling mechanism lends the system a degree of thermodynamic realism: local order improvements necessarily generate disorder elsewhere, which forces agents to develop balancing strategies rather than rely on greedy optimization. Cross-domain propagation follows deterministic rules that agents must discover through experimentation, creating emergent complexity without random elements that would hinder learning. The shared Entropy Tokens economy creates strategic decision points where agents must weigh immediate intervention capability against future operational flexibility. Lock-down operations add tactical timing control by letting agents temporarily decouple a domain when planning multi-step interventions. Rule parameters stay fixed across initial configurations, so learned strategies transfer and difficulty scales through initial state variation rather than rule changes. Discretized action parameters create clear decision boundaries that support exploration while limiting combinatorial complexity. The cumulative reward structure provides continuous learning signals alongside clear success criteria, so agents can optimize both intermediate progress and the final objective.