# Terraforming Project Management Environment Design Document

## Background

The Terraforming Project Management Environment simulates an interstellar colonization scenario where the agent serves as the chief manager of a planetary transformation program. Set in a distant future where humanity has discovered alien technology capable of manipulating planetary systems, the environment presents the challenge of converting barren, hostile worlds into habitable ecosystems. Each planet represents a complex system with interconnected atmospheric, hydrological, geological, and biological components that respond to terraforming interventions through sophisticated feedback mechanisms. The alien technology operates through non-linear environmental interactions, meaning that simple cause-and-effect relationships are replaced by complex cascading effects that require strategic thinking and careful resource management. The environment maintains scientific plausibility while abstracting complex planetary science into learnable patterns that remain consistent across different worlds.

## Objective

The primary objective is to raise a planet's Habitability Index from its initial value of zero percent to one hundred percent within forty steps while preventing catastrophic system collapse. Success requires balancing multiple competing factors: atmospheric composition must be carefully adjusted without triggering temperature extremes, water systems need activation without overwhelming existing infrastructure, biological seeding requires proper environmental preparation, and geological stability must be maintained throughout the transformation process. The agent must achieve complete habitability while keeping the planet's Instability Index below the critical threshold of one hundred percent, as reaching this threshold results in irreversible planetary collapse and mission failure.

## State Setup

The environment initializes each episode with a structured state representation encompassing five major planetary systems plus global metrics. The atmospheric system tracks oxygen percentage, carbon dioxide percentage, atmospheric pressure, and surface temperature, with initial values typically representing hostile conditions such as toxic gas compositions and extreme temperatures. The hydrosphere monitors surface water coverage percentage, subsurface ice reserves percentage, and pH levels, usually starting with minimal water availability and acidic conditions. The lithosphere tracks soil fertility levels and tectonic stress indicators, often beginning with sterile soil and unstable geological activity. The biosphere seed system monitors dormant microbe mass and dormant flora mass, representing the potential for life that requires activation through proper environmental conditions. Infrastructure status includes the number and upgrade levels of terraforming stations ranging from zero to three, along with current energy reserves that power all terraforming operations. Global metrics encompass the crucial Habitability Index starting at zero percent, the Instability Index beginning at a low but non-zero value, and a step counter tracking episode progress. Each planet uses different initial seed values within predefined ranges, ensuring variety while maintaining consistent underlying dynamics and difficulty levels across all scenarios.
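The state layout described above could be captured in a single record type. The following is a minimal sketch, not the environment's actual implementation: the class name `PlanetState`, all field names, and every numeric value in `example_hostile_state` are hypothetical placeholders. The source does not say whether the zero-to-three range applies to the station count, the upgrade level, or both, so the sketch simply stores one integer for each.

```python
from dataclasses import dataclass


@dataclass
class PlanetState:
    """Hypothetical container for the five planetary systems plus global metrics."""
    # Atmosphere
    oxygen_pct: float
    co2_pct: float
    pressure: float
    temperature: float
    # Hydrosphere
    surface_water_pct: float
    subsurface_ice_pct: float
    ph: float
    # Lithosphere
    soil_fertility: float
    tectonic_stress: float
    # Biosphere seeds
    dormant_microbe_mass: float
    dormant_flora_mass: float
    # Infrastructure
    station_count: int
    station_level: int
    energy: float
    # Global metrics
    habitability_pct: float = 0.0  # always starts at zero percent
    instability_pct: float = 5.0   # "low but non-zero"; 5.0 is a placeholder
    step: int = 0


def example_hostile_state() -> PlanetState:
    """Build one illustrative starting state; every value here is made up."""
    return PlanetState(
        oxygen_pct=1.0, co2_pct=90.0, pressure=0.3, temperature=-60.0,
        surface_water_pct=2.0, subsurface_ice_pct=40.0, ph=4.0,
        soil_fertility=0.0, tectonic_stress=60.0,
        dormant_microbe_mass=100.0, dormant_flora_mass=100.0,
        station_count=0, station_level=0, energy=100.0,
    )
```

Keeping the global metrics as defaulted fields makes the fixed starting conditions (zero habitability, step zero) explicit while leaving the planet-specific values to the caller.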

## Actions

The agent selects from seven distinct actions each step, with each action consuming exactly one time unit and potentially affecting multiple state variables through complex feedback mechanisms. Deploy Atmospheric Processor actively modifies atmospheric composition by adjusting oxygen and carbon dioxide levels while influencing temperature and pressure systems. Release Water Catalysts activates hydrological processes by converting subsurface ice to surface water, adjusting pH levels, and potentially triggering atmospheric humidity changes. Seed Microbial Life introduces biological processes by activating dormant microbe populations, which subsequently influence atmospheric composition, soil fertility, and overall ecosystem stability. Stabilize Tectonics directly addresses geological instability by reducing tectonic stress levels, which indirectly affects infrastructure integrity and long-term planetary stability. Construct or Upgrade Terraforming Station builds new stations or enhances existing ones, improving the effectiveness of future actions while requiring significant energy investment and potentially increasing short-term instability during construction. Divert Energy to Shields provides emergency stabilization by reducing the Instability Index at the cost of energy reserves, serving as a crucial safety mechanism when other actions create excessive system stress. Passive Observation allows the agent to skip active intervention, letting current processes continue their natural progression while conserving energy and observing system evolution, which can be valuable for understanding delayed effects or waiting for optimal intervention timing.
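The seven actions map naturally onto a discrete action space. The enumeration below is an illustrative sketch; the identifier names and the integer ordering are assumptions, since the source names the actions but does not assign indices.

```python
from enum import IntEnum


class Action(IntEnum):
    """The seven actions named in the design; the integer codes are hypothetical."""
    DEPLOY_ATMOSPHERIC_PROCESSOR = 0
    RELEASE_WATER_CATALYSTS = 1
    SEED_MICROBIAL_LIFE = 2
    STABILIZE_TECTONICS = 3
    BUILD_OR_UPGRADE_STATION = 4
    DIVERT_ENERGY_TO_SHIELDS = 5
    PASSIVE_OBSERVATION = 6
```

Using `IntEnum` lets a policy emit a plain integer while the environment code still refers to actions by name.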

## State Transition Rule

State transitions follow deterministic non-linear dynamics where each action triggers immediate effects and delayed cascading consequences across interconnected planetary systems. Atmospheric interventions create complex feedback loops where reducing carbon dioxide levels rapidly decreases atmospheric temperature, potentially freezing surface water and reducing habitability despite improved air quality. Hydrological changes affect atmospheric humidity and soil conditions, with water catalyst releases increasing surface water availability while potentially destabilizing local geology through increased erosion and pressure changes. Biological seeding requires compatible environmental conditions to succeed, with microbes and flora responding positively to appropriate oxygen levels, water availability, and soil fertility, while contributing to atmospheric regulation and further soil development once established. Geological stabilization efforts immediately reduce tectonic stress but require sustained energy input to maintain effectiveness, with insufficient ongoing support leading to gradual stress accumulation. Infrastructure construction and upgrades provide permanent benefits to action effectiveness while creating temporary instability spikes during implementation phases. Energy diversion to shields provides immediate instability reduction with linear effectiveness, consuming energy reserves without contributing to long-term habitability improvements. The Instability Index increases from excessive action frequency, insufficient energy reserves, incompatible environmental conditions during interventions, and cascading failures between systems, while decreasing through shield activation, balanced system development, and maintaining adequate energy reserves. Time delays between actions and their full consequences mean that aggressive early-game strategies often create mid-game crisis situations, requiring agents to develop patience and strategic planning capabilities.
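One simple way to model immediate effects plus delayed cascading consequences, while keeping transitions deterministic, is a queue of scheduled state changes. The sketch below illustrates the carbon dioxide example from the paragraph above. The `DelayedEffects` class, the two-step delay, and every coefficient are assumptions chosen for illustration; the source does not specify the actual dynamics.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Effect = Tuple[str, float]  # (state variable name, additive delta)


class DelayedEffects:
    """Deterministic queue of effects that land at a future step."""

    def __init__(self) -> None:
        self._pending: Dict[int, List[Effect]] = defaultdict(list)

    def schedule(self, due_step: int, variable: str, delta: float) -> None:
        self._pending[due_step].append((variable, delta))

    def apply_due(self, step: int, state: Dict[str, float]) -> None:
        # Apply and discard every effect scheduled for this step.
        for variable, delta in self._pending.pop(step, []):
            state[variable] += delta


def deploy_atmospheric_processor(
    state: Dict[str, float], effects: DelayedEffects, step: int
) -> None:
    """Hypothetical coefficients: CO2 falls now, the cooling arrives two steps later."""
    state["co2_pct"] = max(0.0, state["co2_pct"] - 5.0)
    state["oxygen_pct"] += 2.0
    effects.schedule(step + 2, "temperature", -8.0)
```

Because the queue is keyed by step number and holds no randomness, replaying the same action sequence from the same initial state reproduces the same trajectory.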

## Rewards

The environment employs a cumulative reward structure where multiple reward components stack each step, providing dense feedback signals that guide learning while maintaining bounded total episode rewards. Stability maintenance rewards provide +0.1 points per step for keeping the Instability Index below fifty percent, encouraging conservative resource management and sustainable development practices. Atmospheric development rewards grant +0.05 points per percentage-point permanent increase in oxygen levels between fifteen and twenty-five percent, promoting balanced atmospheric composition without over-oxygenation. Hydrological development provides +0.1 points per percentage-point permanent increase in surface water between thirty and seventy percent, encouraging proper water system activation while avoiding flooding scenarios. Habitability progress rewards offer +0.2 points per percentage-point permanent increase in the Habitability Index, directly incentivizing the primary objective while rewarding incremental progress. Mission completion provides a substantial +20 point bonus upon reaching one hundred percent habitability, strongly reinforcing successful mission outcomes. Instability penalties impose -0.2 points per percentage-point of Instability Index above seventy percent, creating increasingly severe consequences for reckless actions that threaten planetary stability. Catastrophic failure results in an immediate -40 point penalty when the Instability Index reaches one hundred percent, representing irreversible planetary collapse and total mission failure. This reward structure ensures that every action produces immediate, interpretable feedback while maintaining episode reward bounds approximately between negative sixty and positive sixty points, enabling stable learning dynamics across different reinforcement learning algorithms.
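The reward components above can be combined into a single per-step function. The sketch below uses the coefficients stated in the text but makes two assumptions: "permanent increase" is read as progress beyond a running high-water mark, and the instability penalty is applied every step. The function names and dictionary keys are hypothetical.

```python
from typing import Dict


def band_gain(prev_best: float, value: float, low: float, high: float) -> float:
    """Percentage points of new progress inside [low, high] beyond the previous best."""
    clipped_prev = min(max(prev_best, low), high)
    clipped_new = min(max(value, low), high)
    return max(0.0, clipped_new - clipped_prev)


def step_reward(best: Dict[str, float], state: Dict[str, float]) -> float:
    """Sum all reward components for one step and update the high-water marks."""
    reward = 0.0
    if state["instability"] < 50.0:
        reward += 0.1  # stability maintenance
    reward += 0.05 * band_gain(best["oxygen"], state["oxygen"], 15.0, 25.0)
    reward += 0.1 * band_gain(best["water"], state["water"], 30.0, 70.0)
    reward += 0.2 * band_gain(best["habitability"], state["habitability"], 0.0, 100.0)
    if state["instability"] > 70.0:
        reward -= 0.2 * (state["instability"] - 70.0)  # instability penalty
    if state["habitability"] >= 100.0:
        reward += 20.0  # mission completion bonus
    if state["instability"] >= 100.0:
        reward -= 40.0  # catastrophic failure
    # Assumption: progress is rewarded only once, so remember the best values seen.
    for key in ("oxygen", "water", "habitability"):
        best[key] = max(best[key], state[key])
    return reward
```

Under this reading, the positive components sum to at most 40 × 0.1 + 0.05 × 10 + 0.1 × 40 + 0.2 × 100 + 20 = 48.5 points per episode, which sits inside the stated upper bound.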

## Observation

The agent receives complete state information presented as a structured vector containing all planetary system variables, infrastructure status, and global metrics, ensuring full observability while maintaining strategic complexity through system interactions rather than hidden information. Atmospheric observations include precise numerical values for oxygen percentage, carbon dioxide percentage, atmospheric pressure, and surface temperature, providing clear feedback on atmospheric intervention effectiveness and enabling agents to identify optimal composition ranges for habitability. Hydrosphere data presents surface water percentage, subsurface ice reserves, and pH levels, allowing agents to track water system development and understand the relationship between water availability and overall habitability progress. Lithosphere information displays soil fertility and tectonic stress levels, enabling agents to monitor geological stability and understand how infrastructure development and biological seeding depend on proper geological conditions. Biosphere seed data shows dormant microbe and flora mass levels, helping agents time biological interventions appropriately and understand how environmental preparation affects biological success rates. Infrastructure observations detail the number and upgrade levels of terraforming stations plus current energy reserves, providing crucial information for resource management and action planning. Global metrics present the current Habitability Index, the Instability Index, and the remaining steps (the forty-step limit minus the step counter), offering clear progress tracking and risk assessment information. The observation design provides sufficient granularity for agents to identify patterns in system responses, understand delayed consequences of previous actions, and develop strategic approaches to balancing competing objectives, while the complete state visibility ensures that learning difficulties stem from strategic complexity rather than information limitations.
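A fixed field order makes the observation vector unambiguous for the agent. The following sketch flattens a state dictionary into a list of floats. The field names, their ordering, and the use of a single count and a single level for stations are assumptions; the source specifies which quantities are observed but not their layout.

```python
from typing import Dict, List

MAX_STEPS = 40  # forty-step limit from the design

# Hypothetical field order for the observation vector.
OBSERVATION_FIELDS = [
    "oxygen_pct", "co2_pct", "pressure", "temperature",        # atmosphere
    "surface_water_pct", "subsurface_ice_pct", "ph",           # hydrosphere
    "soil_fertility", "tectonic_stress",                       # lithosphere
    "dormant_microbe_mass", "dormant_flora_mass",              # biosphere seeds
    "station_count", "station_level", "energy",                # infrastructure
    "habitability_pct", "instability_pct", "remaining_steps",  # global metrics
]


def to_observation(state: Dict[str, float]) -> List[float]:
    """Flatten the full state into a vector, deriving remaining steps from the counter."""
    values = dict(state, remaining_steps=MAX_STEPS - state["step"])
    return [float(values[name]) for name in OBSERVATION_FIELDS]
```

The raw step counter is deliberately left out of the vector here, since the remaining-step count carries the same information.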

## Termination

Episodes terminate under three distinct conditions that provide clear success and failure signals while maintaining consistent episode length expectations. Successful termination occurs immediately when the Habitability Index reaches or exceeds one hundred percent, representing complete mission success and triggering final reward calculation including the substantial completion bonus. Catastrophic termination happens instantly when the Instability Index reaches one hundred percent, indicating irreversible planetary collapse that ends the episode with maximum penalty and failure status. Timeout termination activates when the step counter reaches the forty-step limit without achieving either success or failure conditions, forcing episode conclusion and final evaluation based on accumulated rewards and final state metrics. Upon any termination condition, the environment returns the final cumulative reward, sets the terminal flag, and prepares for episode reset with a new planetary scenario. This termination structure ensures that episodes maintain bounded length for practical training purposes while providing clear learning signals about success, failure, and progress toward objectives.
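The three termination conditions reduce to a short check run after every step. This is a sketch under one explicit assumption: the source does not say which outcome wins if habitability and instability both reach one hundred percent on the same step, so the code follows the order in which the conditions are listed and checks success first. The `Outcome` enum and function name are hypothetical.

```python
from enum import Enum
from typing import Optional

MAX_STEPS = 40  # forty-step limit from the design


class Outcome(Enum):
    SUCCESS = "success"
    COLLAPSE = "collapse"
    TIMEOUT = "timeout"


def check_termination(
    habitability_pct: float, instability_pct: float, step: int
) -> Optional[Outcome]:
    """Return the terminal outcome, or None if the episode continues."""
    if habitability_pct >= 100.0:  # assumption: success takes precedence
        return Outcome.SUCCESS
    if instability_pct >= 100.0:
        return Outcome.COLLAPSE
    if step >= MAX_STEPS:
        return Outcome.TIMEOUT
    return None
```

Returning `None` for the non-terminal case keeps the terminal flag and the reason for termination in a single value.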

## Special Features

The environment incorporates several unique mechanics that distinguish it from standard control problems while maintaining learnability and strategic depth. Non-linear feedback systems create complex cause-and-effect relationships where simple actions produce cascading consequences across multiple planetary systems, requiring agents to develop sophisticated understanding of system interactions rather than relying on direct action-outcome mappings. Coupled subsystem dynamics ensure that atmospheric, hydrological, geological, and biological components influence each other through realistic physical relationships, creating emergent complexity where optimal strategies must consider holistic planetary development rather than optimizing individual metrics in isolation. Energy budget constraints force strategic resource allocation decisions where infrastructure development, active interventions, and emergency stabilization compete for limited energy reserves, adding economic planning elements to environmental management challenges. The dual-index system creates constant tension between habitability progress and stability maintenance, requiring agents to balance aggressive development against conservative risk management throughout the terraforming process. Deterministic dynamics with seed-based variation ensure that identical action sequences produce identical outcomes for the same initial conditions while providing sufficient scenario diversity across the ten fixed planetary levels for robust policy development. Rule consistency across all levels guarantees that learned terraforming strategies transfer directly between planets, enabling agents to develop generalizable expertise rather than memorizing planet-specific solutions. 
The forty-step time horizon provides sufficient opportunity for complex multi-phase strategies while maintaining tractable credit assignment for reinforcement learning algorithms, supporting both reactive tactical decisions and long-term strategic planning within the same framework.
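Seed-based variation with deterministic outcomes can be sketched by deriving each level's initial values from a random generator seeded with the level index. The ranges below, the choice of variables, and the use of the level index as the seed are all hypothetical; the source states only that there are ten fixed levels drawn from predefined ranges.

```python
import random
from typing import Dict

NUM_LEVELS = 10  # ten fixed planetary levels from the design

# Hypothetical per-variable ranges; the real ranges are not given in the source.
INITIAL_RANGES = {
    "co2_pct": (60.0, 95.0),
    "temperature": (-80.0, 120.0),
    "subsurface_ice_pct": (20.0, 60.0),
    "tectonic_stress": (30.0, 70.0),
    "instability_pct": (2.0, 10.0),
}


def initial_state(level: int) -> Dict[str, float]:
    """Deterministically generate the starting values for one fixed level."""
    if not 0 <= level < NUM_LEVELS:
        raise ValueError(f"level must be in [0, {NUM_LEVELS}), got {level}")
    rng = random.Random(level)  # same level index, same planet
    state = {name: rng.uniform(lo, hi) for name, (lo, hi) in INITIAL_RANGES.items()}
    state["habitability_pct"] = 0.0
    state["step"] = 0
    return state
```

Because each level owns its own generator, resetting to a given level always yields the same planet regardless of which episodes ran before it.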