# Bioluminescent Signal Decoding Environment Design Document

## Background

The environment simulates a deep-sea research scenario where an agent controls a sophisticated submersible equipped with programmable bioluminescent arrays. The submersible operates in the abyssal depths where unique organisms have evolved complex communication systems using coordinated light patterns. These creatures possess intelligence and follow consistent but alien communication protocols that differ fundamentally from human linguistic structures. The agent's mission involves deciphering these non-human languages through direct interaction, establishing meaningful communication bridges between species separated by millions of years of evolution. Each encounter represents a critical opportunity to unlock the secrets of deep-sea intelligence, requiring patience, pattern recognition, and adaptive communication strategies.

## Objective

The agent must successfully complete three consecutive handshake sequences with a single deep-sea organism within one communication session. Each handshake consists of receiving an incoming bioluminescent pattern from the creature, analyzing its structure and meaning, then responding with an appropriately formatted reply pattern that demonstrates understanding of the underlying communication protocol. The creature evaluates each response according to its internal rule system, either accepting the communication attempt and proceeding to the next exchange, or rejecting it entirely and terminating the session. Success requires not only decoding the initial pattern but maintaining consistent application of the discovered protocol across all three exchanges, as any single failure immediately ends the episode.

## State Setup

The environment initializes each episode by randomly selecting one of four protocol families that govern the creature's communication: color-mirroring, where responses must reflect specific color relationships; duration-inversion, which requires temporal transformations; intensity-parity, based on arithmetic relationships between brightness levels; and sequence-based, which follows patterns such as Fibonacci progressions. The selected family receives two randomly generated parameters that fix its specific implementation, such as a color offset, a modulus, or a sequence starting position. The creature then generates the first incoming pattern according to these hidden rules. A pattern is a sequence of 2 to 6 light pulses, and each pulse has three attributes: color (one of four values), duration (short or long), and intensity (low or high). The environment also maintains a history log that holds the last five complete exchanges, an energy counter starting at full capacity, and a step countdown starting at 40.
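The setup described above could be represented roughly as follows. This is a minimal sketch, not the environment's actual implementation: the names `Pulse`, `EpisodeState`, `FAMILIES`, and `reset` are hypothetical, the parameter range (0 to 3) and the energy value of 1.0 for "full capacity" are assumptions, and the first pattern is drawn uniformly at random because the document does not specify how each family generates patterns.

```python
import random
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class Duration(Enum):
    SHORT = 0
    LONG = 1


class Intensity(Enum):
    LOW = 0
    HIGH = 1


@dataclass(frozen=True)
class Pulse:
    color: int            # one of four colors, encoded 0..3
    duration: Duration
    intensity: Intensity


# Hypothetical identifiers for the four protocol families.
FAMILIES = ("color_mirror", "duration_inversion", "intensity_parity", "sequence")


@dataclass
class EpisodeState:
    family: str                       # hidden from the agent
    params: tuple                     # two hidden parameters
    incoming: list                    # current incoming pattern
    history: deque = field(default_factory=lambda: deque(maxlen=5))
    energy: float = 1.0               # assumption: 1.0 means full capacity
    steps_remaining: int = 40
    handshakes: int = 0


def random_pattern(rng):
    # Assumption: uniform sampling stands in for family-specific generation.
    return [
        Pulse(rng.randrange(4), rng.choice(list(Duration)), rng.choice(list(Intensity)))
        for _ in range(rng.randint(2, 6))
    ]


def reset(seed=None):
    rng = random.Random(seed)
    family = rng.choice(FAMILIES)
    params = (rng.randrange(4), rng.randrange(4))  # assumed parameter range
    return EpisodeState(family, params, random_pattern(rng))


state = reset(seed=0)
```

Using `deque(maxlen=5)` makes the log drop its oldest entry automatically once five exchanges are stored.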

## Actions

The agent constructs response patterns by making a series of discrete choices that define each aspect of the transmitted signal. Initially, the agent selects the total sequence length, choosing between 2 and 6 pulses for the complete response pattern. For each individual pulse within this sequence, the agent specifies three attributes independently. Color selection offers four distinct options that correspond to the creature's available palette, duration specifies either short or long pulse timing to match the temporal characteristics of the communication system, and intensity determines low or high brightness levels appropriate for deep-sea visual communication. This combinatorial action space provides sufficient flexibility for the agent to construct responses matching any of the possible protocol requirements while maintaining discrete, manageable decision points that support systematic exploration and learning.
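To give a sense of scale, the number of distinct response patterns follows directly from the figures above: each pulse has 4 × 2 × 2 = 16 possible attribute combinations, and a response contains 2 to 6 pulses. The sketch below counts them and validates a candidate response; the tuple encoding and the function names are illustrative assumptions.

```python
NUM_COLORS = 4
DURATIONS = ("short", "long")
INTENSITIES = ("low", "high")
MIN_LEN, MAX_LEN = 2, 6


def is_valid_response(pulses):
    """Check that a response is 2 to 6 pulses of (color, duration, intensity)."""
    if not MIN_LEN <= len(pulses) <= MAX_LEN:
        return False
    return all(
        color in range(NUM_COLORS) and duration in DURATIONS and intensity in INTENSITIES
        for color, duration, intensity in pulses
    )


def action_space_size():
    """Count every distinct response pattern across all allowed lengths."""
    per_pulse = NUM_COLORS * len(DURATIONS) * len(INTENSITIES)  # 16
    return sum(per_pulse ** n for n in range(MIN_LEN, MAX_LEN + 1))


print(action_space_size())  # 17895680
```

With nearly 18 million possible responses and termination on the first rejection, guessing is not a viable strategy; the agent has to infer the rule from the incoming pattern.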

## State Transition Rule

When the agent submits a response pattern, the environment processes this input through the active protocol family using the episode's fixed parameters. The evaluation system applies the hidden rules deterministically, comparing the agent's response against the mathematically correct pattern that the protocol demands for the given incoming sequence. If the response satisfies all protocol requirements, the creature accepts the communication attempt, incrementing the successful handshake counter and generating a new incoming pattern for the next exchange. The historical log updates to include the complete interaction tuple, remaining steps decrease by one, and energy levels reduce accordingly. If the response fails to meet protocol requirements, the creature immediately rejects the attempt, triggering episode termination without generating additional patterns. Throughout this process, the underlying protocol rules remain completely stable, ensuring that successful pattern recognition translates directly into continued communication success.
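The transition can be sketched as a single `step` function. Everything here is illustrative: the document does not define the actual protocol rules, so `invert_durations` is a hypothetical stand-in for a duration-inversion rule, new patterns are drawn uniformly at random, the energy cost of 1 per submission is an assumed value, and the bookkeeping is assumed to run on every submission, including rejected ones.

```python
import random
from collections import deque

FLIP = {"short": "long", "long": "short"}


def invert_durations(incoming):
    # Hypothetical hidden rule: flip every duration, keep color and intensity.
    return [(color, FLIP[duration], intensity) for color, duration, intensity in incoming]


def random_pattern(rng):
    # Assumption: the next incoming pattern is sampled uniformly.
    return [
        (rng.randrange(4), rng.choice(("short", "long")), rng.choice(("low", "high")))
        for _ in range(rng.randint(2, 6))
    ]


def step(state, response, rng, expected_fn=invert_durations, energy_cost=1):
    """Apply one submitted response and return (reward, done)."""
    accepted = list(response) == expected_fn(state["incoming"])
    # Bookkeeping, assumed to apply to every submission.
    state["history"].append((state["incoming"], list(response), accepted))
    state["steps_remaining"] -= 1
    state["energy"] -= energy_cost
    if not accepted:
        return 0, True  # rejection ends the episode; no new pattern is generated
    state["handshakes"] += 1
    done = state["handshakes"] >= 3 or state["steps_remaining"] <= 0
    if not done:
        state["incoming"] = random_pattern(rng)
    return 1, done


rng = random.Random(0)
state = {
    "incoming": [(0, "short", "low"), (2, "long", "high")],
    "history": deque(maxlen=5),
    "steps_remaining": 40,
    "energy": 100,  # assumed value for full capacity
    "handshakes": 0,
}
reward, done = step(state, [(0, "long", "low"), (2, "short", "high")], rng)
```

Because `expected_fn` is a pure function of the incoming pattern and the fixed rule, the same response to the same pattern always receives the same verdict.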

## Rewards

This environment uses a binary per-handshake reward. Each handshake in which the creature accepts the agent's response yields an immediate reward of 1, providing clear positive reinforcement for correct protocol application. A rejected response yields a reward of 0 and terminates the episode immediately, so every decision is high-stakes and the agent should be confident in its reading of the protocol before responding. The reward contains nothing beyond the individual handshake successes, with no partial credit and no shaping term, so the maximum episode return is 3. This all-or-nothing scoring reflects the nature of interspecies communication, where partial understanding is often insufficient for meaningful exchange and each accepted reply marks a clear breakthrough for the research mission.
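As a small worked example of this scheme, the sketch below maps a sequence of accept/reject verdicts to the rewards an episode would emit. The function name `episode_rewards` is hypothetical, and the step limit is ignored here for simplicity.

```python
def episode_rewards(verdicts, required=3):
    """Return the per-handshake rewards for a sequence of accept/reject verdicts.

    The episode stops at the first rejection or once `required` handshakes
    have been accepted, so later verdicts are never reached.
    """
    rewards = []
    for accepted in verdicts:
        rewards.append(1 if accepted else 0)
        if not accepted or sum(rewards) == required:
            break
    return rewards


print(episode_rewards([True, True, True]))   # [1, 1, 1]
print(episode_rewards([True, False, True]))  # [1, 0]
```

The episode return is the sum of the list, so it ranges from 0 to 3.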

## Observation

The agent observes everything it needs for pattern recognition except the rules themselves. The current incoming pattern is shown in full, with the color, duration, and intensity of each pulse in a structured format that supports systematic analysis. The history log records previous exchanges, including the incoming pattern, the agent's response, and the creature's accept-or-reject decision, so the agent can check that a candidate rule holds across multiple interactions. The step counter and energy level provide session-management information that helps the agent work within time constraints. The observation excludes any direct information about the active protocol family or its parameters; the agent must deduce these hidden rules from pattern relationships and feedback. This keeps the task challenging while still giving the agent enough data to identify the protocol and respond correctly.
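One way to enforce this separation is to build the observation from an explicit allow-list of fields, so that hidden state cannot leak by accident. The sketch below assumes a dictionary-based state with hypothetical key names (`family`, `params`, `incoming`, `history`, `steps_remaining`, `energy`); none of these names come from the specification.

```python
VISIBLE_KEYS = ("incoming", "history", "steps_remaining", "energy")


def make_observation(state):
    """Return only the agent-visible part of the state.

    Lists are copied so the agent cannot mutate the environment's own state.
    """
    obs = {key: state[key] for key in VISIBLE_KEYS}
    obs["incoming"] = list(obs["incoming"])
    obs["history"] = list(obs["history"])
    return obs


full_state = {
    "family": "duration_inversion",   # hidden
    "params": (1, 3),                 # hidden
    "incoming": [(0, "short", "low"), (2, "long", "high")],
    "history": [],
    "steps_remaining": 40,
    "energy": 100,
}
observation = make_observation(full_state)
```

An allow-list is safer than removing known hidden keys, because any field added to the state later stays hidden until it is deliberately exposed.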

## Termination

Episodes end under one of three conditions. Successful termination occurs when the agent completes all three required handshakes, demonstrating mastery of the creature's protocol and achieving the primary research objective. Failure termination triggers immediately upon any rejected response, reflecting how a misunderstood signal can end contact with a cautious deep-sea organism. Time-based termination occurs when the 40-step limit expires, representing the practical limits of submersible operating time and creature attention span. The step limit keeps episodes to a bounded duration while leaving room for pattern analysis across the three exchanges.
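The three conditions can be checked in one place, as in the sketch below. The function name, the string labels, and the order in which the conditions are tested are assumptions; the document does not say which condition wins if two hold at once.

```python
def termination_reason(handshakes, last_accepted, steps_remaining, required=3):
    """Return why the episode ended, or None if it should continue.

    `last_accepted` is None before any response has been submitted.
    """
    if last_accepted is False:
        return "rejected"   # any rejection ends the episode immediately
    if handshakes >= required:
        return "success"    # all required handshakes completed
    if steps_remaining <= 0:
        return "timeout"    # the 40-step budget has run out
    return None


print(termination_reason(handshakes=3, last_accepted=True, steps_remaining=37))  # success
```

Returning a reason rather than a bare boolean makes it straightforward to log success rates separately from rejections and timeouts.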

## Special Features

The environment includes several mechanisms that support learning while keeping the challenge consistent. Protocol stability is a core feature: the selected family and its parameters remain fixed throughout each episode, so a correctly inferred rule keeps working rather than requiring readaptation. Parameter randomization occurs at episode initialization, drawing from the same distributions in every episode, which keeps difficulty comparable across sessions and prevents agents from memorizing specific solutions instead of learning general pattern recognition. The deterministic evaluation system guarantees that identical responses to the same incoming sequence always receive identical feedback, so agents can build reliable causal models of communication success. History preservation supports pattern analysis across multiple exchanges by keeping detailed interaction records. Energy tracking adds atmosphere to the research simulation without extra mechanical complexity, while the step counter also enforces the 40-step limit described under Termination.