**Magnetic Field Pattern Recognition Environment Design**

**Background**

The Magnetic Field Pattern Recognition Environment presents agents with a cryptographic challenge disguised as a physics visualization task. In this environment, each episode displays a static magnetic field pattern rendered as a grid of interconnected lines, intersections, and empty spaces. However, beneath this scientific facade lies a deterministic encoding system where geometric primitives within the field pattern serve as a visual cipher. The magnetic field lines are not randomly distributed for aesthetic purposes, but rather follow a precise encoding scheme where specific spatial arrangements of loops, straight segments, intersections, and cusps correspond to binary information. This binary data is systematically organized into groups of four bits that represent hexadecimal characters, creating a hidden message within the apparent chaos of magnetic field visualization. The environment maintains complete consistency in its encoding rules across all episodes, ensuring that agents can develop transferable pattern recognition skills rather than memorizing individual solutions.

**Objective**

The agent's primary goal is to decode the hidden hexadecimal message embedded within the magnetic field pattern and submit the correct four-character hexadecimal string within a limited time budget of 40 steps. The challenge requires the agent to develop an understanding of the underlying encoding system that maps visual geometric patterns to binary digits, then apply this knowledge to extract the encoded information from each unique field configuration. Success demands not only pattern recognition capabilities but also strategic decision-making regarding when to explore potential character mappings versus when to commit to a final answer. The agent must balance learning the encoding rules through trial and error with efficiently utilizing the limited interaction budget, as only one complete answer submission is permitted per episode.

**State Setup**

Each episode initializes with a completely static 9×9 grid representing the magnetic field visualization, where the spatial arrangement encodes a specific four-character hexadecimal message. The grid is systematically divided into 2×2 sub-regions that are processed in raster order to generate a 36-bit binary sequence, with the first 16 bits representing the target message as four 4-bit hexadecimal characters. The environment maintains a fixed but hidden lookup table that consistently maps each possible 2×2 pattern configuration to a specific two-bit binary value throughout all episodes. The agent begins each episode with a cursor positioned at the first character slot, and all four character slots are initially empty, awaiting the agent's input. The step counter starts at zero and increments with each action taken, providing the agent with awareness of the remaining interaction budget.
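The decoding path described above (2×2 patterns to bit pairs, then the first 16 bits to four hexadecimal characters) can be sketched as follows. This is a minimal illustration, not the environment's implementation. The names `decode_message`, `Tile`, and `TOY_TABLE` are hypothetical, and the toy table is invented, since the real lookup table is hidden. The sketch takes tiles already listed in raster order, because the layout of 2×2 sub-regions over the 9×9 grid is not specified here. It also assumes most-significant-bit-first ordering within each 4-bit group.

```python
from typing import Dict, List, Sequence, Tuple

# A 2x2 tile flattened as (top-left, top-right, bottom-left, bottom-right).
Tile = Tuple[int, int, int, int]

HEX_DIGITS = "0123456789ABCDEF"

# ASSUMPTION: an invented toy table; the environment's real table is hidden.
TOY_TABLE: Dict[Tile, Tuple[int, int]] = {
    (0, 0, 0, 0): (0, 0),
    (1, 0, 0, 0): (0, 1),
    (0, 1, 0, 0): (1, 0),
    (1, 1, 0, 0): (1, 1),
}


def decode_message(tiles: Sequence[Tile], table: Dict[Tile, Tuple[int, int]]) -> str:
    """Map raster-ordered tiles to bits and read the first 16 bits as 4 hex chars."""
    bits: List[int] = []
    for tile in tiles:
        bits.extend(table[tile])  # each tile contributes two bits

    message_bits = bits[:16]  # only the first 16 bits carry the message
    if len(message_bits) < 16:
        raise ValueError("need at least 8 tiles (16 bits) to decode a message")

    chars = []
    for i in range(0, 16, 4):
        b3, b2, b1, b0 = message_bits[i:i + 4]
        # ASSUMPTION: most significant bit comes first within each nibble.
        chars.append(HEX_DIGITS[b3 * 8 + b2 * 4 + b1 * 2 + b0])
    return "".join(chars)
```

With the toy table, two `(0, 1, 0, 0)` tiles produce the bits `1010`, which is the character `A`. Any tiles after the first eight are ignored by the message decoding.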

**Actions**

The environment provides a discrete action space of 19 actions covering both character input and navigation. Actions 0 through 15 enter hexadecimal characters 0 through F respectively into the currently selected character slot, allowing the agent to input any valid hexadecimal digit. Action 16 moves the cursor to the right across the four character slots with circular wrapping, so moving right from the fourth slot returns to the first slot. Action 17 moves the cursor to the left with the same circular wrapping behavior, enabling bidirectional navigation through the character positions. Action 18 serves as the finalization command, immediately submitting the current four-character sequence as the agent's final answer and terminating the episode regardless of remaining steps. This action structure allows agents to explore different character combinations, revise their inputs by navigating between slots, and strategically decide when to commit to their decoded solution.
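As a worked example of the action encoding, the sketch below turns a decoded answer into the shortest action sequence that enters it and finalizes. The names `plan_actions`, `MOVE_RIGHT`, `MOVE_LEFT`, and `FINALIZE` are hypothetical labels for the action indices listed above. The sketch assumes the cursor starts at the first slot, as stated in the state setup.

```python
from typing import List

HEX_DIGITS = "0123456789ABCDEF"

# Hypothetical names for the action indices described in the text.
MOVE_RIGHT = 16
MOVE_LEFT = 17
FINALIZE = 18


def plan_actions(answer: str) -> List[int]:
    """Return actions that enter a 4-character hex answer and finalize it.

    Assumes the cursor starts on the first slot and all slots are empty.
    """
    if len(answer) != 4:
        raise ValueError("answer must be exactly four hexadecimal characters")

    actions: List[int] = []
    for i, ch in enumerate(answer.upper()):
        # Actions 0-15 enter digits 0-F; index() raises ValueError on non-hex input.
        actions.append(HEX_DIGITS.index(ch))
        if i < 3:
            actions.append(MOVE_RIGHT)  # advance to the next slot
    actions.append(FINALIZE)
    return actions
```

Entering a full answer this way costs 4 character inputs, 3 cursor moves, and 1 finalization, so 8 of the 40 available steps.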

**State Transition Rule**

State transitions in this environment are deterministic and depend only on the type of action selected. When the agent chooses actions 0-15 to input a hexadecimal character, that character is placed in the currently selected slot, overwriting any previous character in that position; the grid and cursor position are unchanged, and the step counter increments by one. Navigation actions 16 and 17 modify only the cursor position according to the circular movement rules, leaving the grid and character inputs unchanged while the step counter increments by one. Action 18 triggers immediate episode termination while preserving the current character sequence for final evaluation. Throughout all transitions, the magnetic field grid remains completely static, as the core challenge focuses on pattern interpretation rather than dynamic system control. Because the step counter increments with every action, the agent always knows how much of the interaction budget remains.
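The transition rule can be summarized in a few lines. The following is a minimal sketch under stated assumptions, not the environment's actual code. `EpisodeState` and `apply_action` are hypothetical names, empty slots are represented as `None`, and the finalization action is assumed to count as a step like any other action.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NUM_SLOTS = 4
MAX_STEPS = 40


@dataclass
class EpisodeState:
    # Hypothetical container; None marks a slot that has not been filled yet.
    slots: List[Optional[int]] = field(default_factory=lambda: [None] * NUM_SLOTS)
    cursor: int = 0
    step: int = 0
    done: bool = False


def apply_action(state: EpisodeState, action: int) -> EpisodeState:
    """Apply one action in place; the grid is static, so it is not modeled here."""
    if state.done:
        raise RuntimeError("episode has already terminated")
    if not 0 <= action <= 18:
        raise ValueError("action must be in the range 0..18")

    if action <= 15:
        state.slots[state.cursor] = action               # overwrite selected slot
    elif action == 16:
        state.cursor = (state.cursor + 1) % NUM_SLOTS    # move right, wrapping
    elif action == 17:
        state.cursor = (state.cursor - 1) % NUM_SLOTS    # move left, wrapping
    else:
        state.done = True                                # action 18: finalize

    # ASSUMPTION: every action, including finalize, increments the counter.
    state.step += 1
    if state.step >= MAX_STEPS:
        state.done = True                                # step budget exhausted
    return state
```

Moving left from the first slot wraps to the fourth, so `apply_action(EpisodeState(), 17)` leaves the cursor at index 3.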

**Rewards**

The environment uses a partial-credit reward system rather than demanding perfect accuracy. Agents receive 0.25 points for each correctly positioned character in their 4-character answer, for a maximum reward of 1.0 when the complete answer is correct. This structure turns the task from an all-or-nothing challenge into one where agents can build understanding incrementally. Reward evaluation occurs only at episode termination, which keeps the task episodic while still giving granular feedback on answer quality. Agents can therefore learn from partial successes and gradually improve their pattern recognition capabilities.
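The per-character scoring can be written as a short function. This is a sketch with hypothetical names (`partial_credit_reward`). It assumes that an empty slot, represented as `None`, never matches and that comparison is case-insensitive.

```python
from typing import Optional, Sequence


def partial_credit_reward(answer: Sequence[Optional[str]], target: str) -> float:
    """Return 0.25 for each slot whose character matches the target at that position."""
    if len(answer) != 4 or len(target) != 4:
        raise ValueError("answer and target must both have four positions")

    matches = sum(
        1
        for entered, expected in zip(answer, target)
        # ASSUMPTION: empty slots (None) score zero; case is ignored.
        if entered is not None and entered.upper() == expected.upper()
    )
    return 0.25 * matches
```

For a target of `A5F0`, the answer `A0F` with the last slot left empty matches in the first and third positions and earns 0.5.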

**Observation**

The agent receives comprehensive observational information designed to support effective learning while maintaining appropriate challenge levels. The primary observation consists of the 9×9 integer matrix representing the magnetic field pattern, where each cell contains a value of 0 for empty space, 1 for a field line segment, or 2 for line intersections where multiple segments cross. This encoding provides clear visual distinction between different geometric elements while maintaining computational efficiency. Additionally, the agent observes the current step index ranging from 0 to 39, enabling strategic planning regarding the remaining interaction budget. The observation includes the current cursor position indicating which character slot is selected for input, as well as the current state of all four character slots showing previously entered hexadecimal digits. 

Crucially, the observation also includes encoding hints that reveal some example pattern-to-bit mappings, providing agents with concrete examples of how specific 2×2 patterns correspond to binary values. These hints give agents a foundation for understanding the broader encoding scheme while still requiring them to apply this knowledge creatively. This hybrid approach balances the cryptographic challenge with sufficient guidance to make the task learnable by advanced AI systems.
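One way to picture the observation is as a single record with the fields listed above. The layout below is a hypothetical sketch. The field names, the `validate` helper, and especially the format of the hints (a mapping from a flattened 2×2 pattern to a bit pair) are assumptions, since the design does not specify how hints are represented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# A 2x2 pattern flattened as (top-left, top-right, bottom-left, bottom-right).
Tile = Tuple[int, int, int, int]


@dataclass
class Observation:
    grid: List[List[int]]                 # 9x9, cells: 0 empty, 1 segment, 2 intersection
    step: int                             # current step index, 0..39
    cursor: int                           # selected character slot, 0..3
    slots: List[Optional[int]]            # entered digits 0..15, None if empty
    hints: Dict[Tile, Tuple[int, int]]    # ASSUMED format: pattern -> two bits


def validate(obs: Observation) -> None:
    """Raise ValueError if the observation violates the ranges described in the text."""
    if len(obs.grid) != 9 or any(len(row) != 9 for row in obs.grid):
        raise ValueError("grid must be 9x9")
    if any(cell not in (0, 1, 2) for row in obs.grid for cell in row):
        raise ValueError("grid cells must be 0, 1, or 2")
    if not 0 <= obs.step <= 39:
        raise ValueError("step index must be in 0..39")
    if not 0 <= obs.cursor <= 3:
        raise ValueError("cursor must be in 0..3")
    if len(obs.slots) != 4 or any(s is not None and not 0 <= s <= 15 for s in obs.slots):
        raise ValueError("slots must hold four entries, each None or 0..15")
    for tile, bits in obs.hints.items():
        if len(tile) != 4 or any(c not in (0, 1, 2) for c in tile):
            raise ValueError("hint patterns must be 2x2 with cells 0, 1, or 2")
        if len(bits) != 2 or any(b not in (0, 1) for b in bits):
            raise ValueError("each hint must map to exactly two bits")
```

Keeping the hints in the same pattern-to-bits shape as the lookup table would let an agent seed a partial table directly from the observation.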

**Termination**

Episodes terminate under two specific conditions that provide clear boundaries for agent interaction. The primary termination trigger occurs when the agent executes action 18 to finalize their answer, immediately ending the episode regardless of how many steps remain in the budget. This allows confident agents to submit solutions early instead of spending the rest of the step budget. The secondary termination condition activates when the step counter reaches 40, automatically ending the episode and evaluating whatever four-character sequence the agent has constructed at that point. Both termination mechanisms trigger the same reward evaluation process, where the agent's final character sequence is compared position by position against the true encoded message to compute the reward. The dual termination system balances agent autonomy in decision-making with necessary computational constraints, preventing indefinitely long episodes while allowing strategic timing of answer submission.

**Special Features**

The environment incorporates several unique mechanisms that distinguish it from conventional reinforcement learning tasks while ensuring consistent learnability. The fixed encoding table represents the core special feature, maintaining identical pattern-to-bit mappings across all episodes to enable transferable learning. The full table remains hidden from agents; apart from the example mappings revealed in the encoding hints, it must be discovered through interaction. Difficulty scaling across different levels is achieved exclusively through visual complexity variations, such as adding decorative intersections or irrelevant line segments that do not affect the underlying 36-bit encoding, ensuring that fundamental rules remain consistent while challenging pattern recognition capabilities. The immutable observation system creates a unique learning paradigm where agents must develop interpretive skills rather than control strategies, shifting the focus from sequential decision-making to cryptographic analysis. The limited interaction budget of 40 steps creates strategic tension between exploration and exploitation, forcing agents to balance learning the encoding system with efficiently applying that knowledge. Rule consistency is rigorously enforced through standardized grid dimensions, action spaces, reward structures, and step limits across all levels. Stochasticity is limited exclusively to cosmetic elements that do not impact the encoded information, so learned patterns transfer between episodes and support systematic skill development in visual pattern recognition and cryptographic reasoning.