env_describe_full = {
    'Alien-v5':'The task is a reinforcement learning problem where an agent controls an astronaut navigating through a dangerous alien world. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. In the environment, the agent receives +50 points for defeating an alien and +100 points for clearing a level. Small rewards like +10 points are given for collecting power-ups, while penalties include -50 points for taking damage and -100 points for losing a life. The game ends when the agent loses all lives, with the goal being to maximize cumulative rewards through effective combat, exploration, and survival.',

    'Amidar-v5':'The task is a reinforcement learning problem where an agent controls a character navigating a maze to avoid enemies and complete objectives by marking sections of the maze. The action space is discrete with 10 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up and fire, 7: move right and fire, 8: move left and fire, 9: move down and fire}}. In the environment, the "fire" action has no functional effect, as the primary objective is to move through the maze. The observation space consists of raw pixel values representing the game screen, showing the character, enemies, and the maze layout. The agent receives +10 points for marking a section of the maze and +50 points for completing an entire maze level. Additionally, the agent earns +100 points for capturing an enemy while in a powered-up state, and +20 points for collecting special bonus items scattered throughout the environment. However, the agent is penalized with -50 points for being caught by an enemy, and an additional -5 points for excessive inaction or idling for too long. The game ends when the agent loses all lives or completes the entire maze. The goal is to maximize the score by navigating the maze efficiently while avoiding enemies.',

    'Hero-v5':'The task is a reinforcement learning problem where an agent controls a hero navigating through an underground cave system filled with enemies and obstacles. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, showing the hero, enemies, environmental hazards, and collectible items. The reward mechanism is designed to incentivize the exploration of the cave and the collection of various items, such as treasure. The agent earns points for defeating enemies and gathering treasures scattered throughout the cave. The hero may also gain points by rescuing trapped miners. There are penalties for losing health due to enemy attacks or environmental hazards. The game ends when all lives are lost. The primary objective is to maximize cumulative rewards by skillfully navigating the cave system, defeating enemies, avoiding hazards, and collecting valuable items.',

    'Assault-v5':'The task is a reinforcement learning problem where an agent controls a spaceship that must shoot down waves of enemy ships and avoid being hit by their projectiles. The action space is discrete with 7 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move right and fire, 6: move left and fire}}. The observation space consists of raw pixel values representing the game screen, where the spaceship, enemies, and projectiles are displayed. In the environment, the agent receives rewards based on the in-game score, which increases primarily when the agent shoots and destroys enemy ships. There are no explicit negative rewards (like point deductions) for taking damage, but when the agent gets hit by enemy fire, it loses a life. Once all lives are lost, the game ends. Therefore, the agent\'s main goal is to maximize its cumulative score by shooting down enemies and surviving for as long as possible while avoiding enemy attacks.',

    'Asterix-v5':'The task is a reinforcement learning problem where an agent controls a character navigating through a colorful world filled with obstacles and enemies. The action space is discrete with 9 options: {{0: no operation, 1: move up, 2: move right, 3: move left, 4: move down, 5: move up-right, 6: move up-left, 7: move down-right, 8: move down-left}}. The observation space consists of raw pixel values representing the game screen, displaying the character, various enemies, and collectible items. The reward mechanism is designed to incentivize the collection of various items, each providing different point values. Asterix can earn 50 points for collecting a Cauldron, 100 points for a Helmet, 200 points for a Shield, and 300 points for a Lamp. On the other hand, Obelix can earn higher rewards, receiving 400 points for an Apple and 500 points for each of the following: Fish, Wild Boar Leg, Mug, and Surprise Object. The game ends when all three lives are lost. The primary objective is to maximize cumulative rewards by skillfully navigating the environment, avoiding hazards, and strategically interacting with enemies and collectibles.',

    'BankHeist-v5': 'The task is a reinforcement learning problem where an agent controls a character involved in a bank heist, navigating through a dynamic environment filled with guards and obstacles. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, showing the agent, guards, and loot. In this environment, the agent receives rewards for successfully stealing loot and evading or neutralizing guards. The game ends when the agent loses all lives, and the primary objective is to maximize cumulative rewards through stealthy navigation, effective shooting, and strategic interactions with the environment.',

    'BattleZone-v5': 'The task is a reinforcement learning problem where an agent controls a tank in a strategic battlefield filled with enemies and obstacles. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, showing the tank, various enemy units, and environmental features. In this environment, the agent receives rewards for successfully destroying enemy tanks and avoiding damage from enemy fire. Points are awarded for each enemy destroyed, while penalties may occur if the agent is hit or loses lives. The game ends when all lives are lost. The primary objective is to maximize cumulative rewards through effective maneuvering, strategic firing, and careful navigation across the battlefield to outsmart enemy units.',

    'Boxing-v5': 'The task is a reinforcement learning problem where an agent controls a boxer in a ring, fighting against an opponent to score points by landing punches while avoiding incoming strikes. The action space is discrete with 18 options: {{0: no operation, 1: punch, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and punch, 11: move right and punch, 12: move left and punch, 13: move down and punch, 14: move up-right and punch, 15: move up-left and punch, 16: move down-right and punch, 17: move down-left and punch}}. The observation space consists of raw pixel values representing the boxing ring, the agent, and the opponent. The agent earns points for successfully landing punches on the opponent, with more points awarded for strategic combinations and avoiding getting hit. The game ends after a set number of rounds or when one of the boxers is knocked out. The primary objective is to maximize cumulative points by skillfully managing offensive and defensive movements, landing punches, and dodging the opponent’s attacks.',

    'Breakout-v5': 'The task is a reinforcement learning problem where an agent controls a paddle at the bottom of the screen, aiming to hit a ball and break bricks at the top. The action space is discrete with 4 options: {{0: no operation, 1: fire (launch the ball), 2: move right, 3: move left}}. The observation space consists of raw pixel values representing the game screen, displaying the paddle, the ball, and the bricks. The reward mechanism is designed to incentivize the destruction of bricks, with the agent earning points each time the ball breaks a brick. Each brick color is assigned a specific point value: red and orange bricks yield 7 points, yellow and green bricks grant 4 points, while aqua and blue bricks provide 1 point each. The game ends when the agent loses all its lives by failing to catch the ball with the paddle. The primary objective is to maximize cumulative rewards by strategically controlling the paddle to keep the ball in play and target higher-value bricks while avoiding misses.',

    'ChopperCommand-v5': 'The task is a reinforcement learning problem where an agent controls a helicopter navigating through a desert environment filled with enemy vehicles and aircraft. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, displaying the helicopter, enemy vehicles, aircraft, and fuel depots. In this reward mechanism, players earn points by shooting down enemy aircraft: 100 points for each enemy helicopter and 200 points for each enemy jet. A bonus is awarded for destroying an entire wave of hostile aircraft, calculated by multiplying the number of remaining trucks in the convoy by the wave number (from one to ten) and then by 100. This system incentivizes players to maximize their score through both individual kills and strategic gameplay. The game ends when the agent runs out of fuel or is hit by enemy fire and loses all lives. The primary objective is to maximize cumulative rewards by skillfully navigating the environment, destroying enemies, collecting fuel, and avoiding hazards to survive as long as possible.',

    'CrazyClimber-v5': 'The task is a reinforcement learning problem where an agent controls a climber scaling the side of a tall building while avoiding various obstacles. The action space is discrete with 9 options: {{0: no operation, 1: move up, 2: move right, 3: move left, 4: move down, 5: move up-right, 6: move up-left, 7: move down-right, 8: move down-left}}. The observation space consists of raw pixel values representing the game screen, displaying the climber, the building, windows, and various obstacles such as falling objects. In the reward mechanism, players earn points in two ways: climbing points for each row of windows climbed and bonus points for reaching the top of each skyscraper. The climbing points vary by building, with 100 points per row for Building 1, 200 for Building 2, 300 for Building 3, and 400 for Building 4. Bonus points serve as a timer; they start at a maximum value when climbing a new building and decrease by 100 points every ten seconds. To retain bonus points, players must reach the top and grab the helicopter within 30 seconds, as bonus points continue to decline until the helicopter is reached. The maximum bonus points also increase with each building, ranging from 100,000 points for Building 1 to 400,000 points for Building 4. The game ends when the climber falls or loses all lives. The primary objective is to maximize cumulative rewards by skillfully navigating the vertical environment, dodging hazards, and climbing as high as possible without falling.',

    'DemonAttack-v5': 'The task is a reinforcement learning problem where an agent controls a spaceship shooting down waves of demons in a space-themed environment. The action space is discrete with 6 options: {{0: no operation, 1: fire, 2: move right, 3: move left, 4: move right and fire, 5: move left and fire}}. The observation space consists of raw pixel values representing the game screen, displaying the spaceship, enemy demons, and their projectiles. The reward mechanism is designed to incentivize the destruction of enemy demons, with the agent earning points for each demon successfully shot down. The game becomes progressively harder as more enemies appear and fire projectiles. There are no explicit negative rewards, but the agent loses a life if hit by enemy fire. The game ends when all lives are lost. The primary objective is to maximize cumulative rewards by skillfully avoiding enemy fire, destroying as many demons as possible, and surviving through increasingly difficult waves of enemies.',

    'Freeway-v5': 'The task is a reinforcement learning problem where an agent controls a character attempting to cross a busy highway filled with fast-moving cars. The action space is discrete with 3 options: {{0: no operation, 1: move up, 2: move down}}. The observation space consists of raw pixel values representing the game screen, displaying the character, various lanes of traffic, and the road. The reward mechanism is designed to incentivize the successful crossing of the highway. The agent earns points for reaching the other side of the road, with each successful crossing awarding a fixed number of points. There are no explicit negative rewards, but the agent loses time and progress when hit by a car, as it is sent back to the starting point. The game ends when a time limit is reached. The primary objective is to maximize cumulative rewards by skillfully navigating through the traffic, avoiding cars, and making as many successful crossings as possible before time runs out.',

    'Frostbite-v5': 'The task is a reinforcement learning problem where an agent controls a character navigating through an icy environment, building an igloo while avoiding various obstacles and enemies. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, displaying the character, platforms of ice, and hazards such as polar bears. The reward mechanism incentivizes the collection of ice blocks used to build the igloo. The agent earns points for each ice block collected and placed correctly to complete the igloo. Additional points can be gained by avoiding enemies and hazards. The game ends when the character loses all lives by falling into icy waters or being caught by enemies. The primary objective is to maximize cumulative rewards by skillfully navigating the icy platforms, building the igloo, avoiding hazards, and strategically interacting with enemies.',

    'Gopher-v5': 'The task is a reinforcement learning problem where an agent controls a character defending a farm from gophers attempting to steal crops. The action space is discrete with 8 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move up and fire, 6: move right and fire, 7: move left and fire}}. The observation space consists of raw pixel values representing the game screen, displaying the character, gophers, and crops. The reward mechanism incentivizes the agent to shoot gophers and protect the crops. Players earn 100 points for each Gopher they successfully bonk, which encourages active participation and skillful gameplay. Additionally, for every section of tunnel that players fill, they receive 20 points. There are no explicit negative rewards, but the game ends when a certain number of crops are stolen by gophers, or the player loses all lives. The primary objective is to maximize cumulative rewards by skillfully moving and shooting gophers, preventing them from stealing crops, and surviving through increasingly challenging waves of gophers.',

    'Jamesbond-v5': 'The task is a reinforcement learning problem where an agent controls James Bond navigating through various action-packed levels filled with enemies and obstacles. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, displaying James Bond, various enemies, vehicles, and obstacles. In this reward system, players earn points by collecting various targets, each with the following point value: a Diamond is worth 50 points, while the Frogman, Space Shuttle, and Submarine each provide 200 points. The Poison Bomb and Torpedo are worth 100 points each. The Spinning Satellite offers the highest reward at 500 points, while the Rapid Rocket and Fire Bomb also contribute 100 points each. Completing the mission yields a substantial bonus of 5,000 points. This design encourages players to explore actively and prioritize collecting high-value targets to maximize their cumulative score. The game ends when all lives are lost. The primary objective is to maximize cumulative rewards by skillfully navigating the levels, shooting enemies, and strategically completing missions while avoiding hazards and enemy attacks.',

    'Kangaroo-v5': 'The task is a reinforcement learning problem where an agent controls a kangaroo navigating through a colorful environment filled with obstacles, enemies, and platforms. The action space is discrete with 18 options: {{0: no operation, 1: fire (punch), 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, displaying the kangaroo, enemies like monkeys, platforms, and collectible items such as fruit. The reward mechanism is designed to incentivize jumping between platforms, collecting items, and avoiding enemies. The agent earns points for successfully jumping to higher platforms, collecting fruit, and defeating enemies by punching them. Penalties include losing points or lives when the kangaroo is hit by enemies or falls off platforms. The game ends when all lives are lost. The primary objective is to maximize cumulative rewards by skillfully navigating through the environment, avoiding enemies, collecting items, and jumping to safety.',

    'Krull-v5': 'The task is a reinforcement learning problem where an agent controls a character navigating through a vibrant fantasy world filled with enemies, moving platforms, and obstacles. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, displaying the character, various enemies, laser barriers, and collectible items such as gems and keys. The reward mechanism is designed to incentivize progressing through different rooms by collecting keys to unlock doors and defeating enemies with laser shots. The agent earns points for defeating enemies, collecting gems, and clearing levels. The game becomes progressively more difficult with more enemies and complex rooms to navigate. The game ends when all lives are lost or when the player completes all levels. The primary objective is to maximize cumulative rewards by skillfully navigating the environment, defeating enemies, avoiding hazards, and collecting items to progress through the world.',

    'KungFuMaster-v5': 'The task is a reinforcement learning problem where an agent controls a martial artist navigating through various floors of a building, defeating enemies and avoiding obstacles. The action space is discrete with 14 options: {{0: no operation, 1: move up, 2: move right, 3: move left, 4: move down, 5: move down-right, 6: move down-left, 7: right and attack, 8: left and attack, 9: down and attack, 10: up-right and attack, 11: up-left and attack, 12: down-right and attack, 13: down-left and attack}}. The observation space consists of raw pixel values representing the game screen, displaying the martial artist, various enemies, and obstacles. The reward mechanism is designed to incentivize defeating enemies and progressing through the levels. The agent earns points for successfully attacking and defeating enemies, while avoiding obstacles and enemy attacks. Higher rewards are earned for defeating stronger enemies and completing each level. The game becomes progressively harder as the agent encounters more enemies and traps. The game ends when all lives are lost. The primary objective is to maximize cumulative rewards by skillfully navigating the environment, defeating enemies, and avoiding hazards to clear as many floors as possible.',

    'MsPacman-v5': 'The task is a reinforcement learning problem where an agent controls Ms. Pacman navigating through a maze filled with pellets, power-ups, and enemy ghosts. The action space is discrete with 9 options: {{0: no operation, 1: move up, 2: move right, 3: move left, 4: move down, 5: move up-right, 6: move up-left, 7: move down-right, 8: move down-left}}. The observation space consists of raw pixel values representing the game screen, displaying Ms. Pacman, pellets, power pellets, and ghosts moving around the maze. The reward mechanism is designed to incentivize the collection of pellets and the strategic use of power-ups. Ms. Pacman earns points for each pellet collected and additional points for eating ghosts after consuming a power pellet. However, if she gets caught by a ghost without the power-up, a life is lost. The game ends when all lives are lost or when all pellets in the maze are collected. The primary objective is to maximize cumulative rewards by skillfully navigating the maze, avoiding or chasing ghosts when appropriate, and collecting as many pellets and power-ups as possible.',

    'PrivateEye-v5': 'The task is a reinforcement learning problem where an agent controls a private detective navigating through various scenes in search of clues, solving cases, and capturing criminals. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, displaying the detective, various criminals, vehicles, and interactive elements in the environment. The reward mechanism is designed to incentivize the discovery of clues and solving mysteries. The agent earns points for capturing criminals, collecting clues, and advancing through different levels by solving cases. Penalties may occur if the detective misses important clues or gets caught by enemies. The game ends when the detective loses all lives or fails to solve the case in time. The primary objective is to maximize cumulative rewards by skillfully navigating through scenes, finding clues, and capturing criminals while avoiding dangers.',

    'Pong-v5':'The task is a reinforcement learning problem where an agent controls a paddle to hit a ball and score points by getting the ball past the opponent\'s paddle. The action space is discrete with 6 options: {{0: no operation, 1: fire, 2: move the paddle up, 3: move the paddle down, 4: right fire, 5: left fire}}. In the environment, the "fire" action has no functional effect, as the paddle can only move up and down. The observation space consists of raw pixel values representing the game screen. The agent receives a reward of +1 for scoring and -1 when the opponent scores. The game ends when either side reaches 21 points.',

    'Qbert-v5': 'The task is a reinforcement learning problem where an agent controls Qbert, a character navigating through a pyramid of cubes while avoiding enemies and hazards. The action space is discrete with 6 options: {{0: no operation, 1: fire (jump), 2: move up, 3: move right, 4: move left, 5: move down}}. The observation space consists of raw pixel values representing the game screen, displaying Qbert, enemies, and the pyramid of cubes that Qbert must jump on to change their color. The reward mechanism is designed to incentivize jumping on cubes and avoiding enemies. Qbert earns points for each successful jump that changes the color of a cube, and additional points for completing a level by changing all cubes to the desired color. Penalties occur if Qbert is hit by enemies or falls off the pyramid, resulting in a lost life. The game ends when all lives are lost. The primary objective is to maximize cumulative rewards by skillfully navigating the pyramid, changing the colors of cubes, avoiding enemies, and completing levels efficiently.',

    'RoadRunner-v5': 'The task is a reinforcement learning problem where an agent controls the Road Runner navigating through a desert environment filled with obstacles and enemies, including the persistent Wile E. Coyote. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, displaying the Road Runner, Wile E. Coyote, various obstacles like rocks and traps, and collectible items. The reward mechanism is designed to incentivize speed and avoidance of obstacles. The agent earns points by collecting bird seeds scattered throughout the environment and for progressing through the level while avoiding capture by Wile E. Coyote. Penalties occur if the Road Runner is caught by Wile E. Coyote or falls into traps, leading to a loss of life. The game ends when all lives are lost. The primary objective is to maximize cumulative rewards by skillfully navigating through the environment, collecting bird seeds, and avoiding hazards and Wile E. Coyote.',

    'Seaquest-v5': 'The task is a reinforcement learning problem where an agent controls a submarine navigating through an underwater world filled with enemy submarines, divers, and obstacles. The action space is discrete with 18 options: {{0: no operation, 1: fire, 2: move up, 3: move right, 4: move left, 5: move down, 6: move up-right, 7: move up-left, 8: move down-right, 9: move down-left, 10: move up and fire, 11: move right and fire, 12: move left and fire, 13: move down and fire, 14: move up-right and fire, 15: move up-left and fire, 16: move down-right and fire, 17: move down-left and fire}}. The observation space consists of raw pixel values representing the game screen, displaying the submarine, enemies, friendly divers, and the underwater environment. The reward mechanism is designed to incentivize the destruction of enemy submarines and the rescue of divers. The agent earns points for shooting enemy submarines and other hostile underwater threats, as well as for rescuing divers and bringing them safely to the surface. Penalties occur if the submarine is hit by enemy fire or runs out of oxygen, which results in a loss of life. The game ends when all lives are lost. The primary objective is to maximize cumulative rewards by skillfully navigating the underwater environment, avoiding enemies, rescuing divers, and managing oxygen levels effectively.',

    'UpNDown-v5': 'The task is a reinforcement learning problem where an agent controls a car navigating through a colorful, fast-paced world filled with other vehicles and obstacles on winding roads. The action space is discrete with 6 options: {{0: no operation, 1: fire, 2: move up, 3: move down, 4: move up and fire, 5: move down and fire}}. The observation space consists of raw pixel values representing the game screen, displaying the agent\'s car, other vehicles, and road obstacles. The reward mechanism is designed to incentivize avoiding collisions and overtaking other vehicles. The agent earns points for passing other cars on the road and avoiding crashes. Higher rewards are earned by overtaking more cars and successfully navigating tricky sections of the road. The game ends when the agent collides with another car or falls off the road, resulting in a loss of life. The primary objective is to maximize cumulative rewards by skillfully maneuvering the car, avoiding collisions, overtaking as many vehicles as possible, and progressing through the levels without losing lives.'
}

env_describe_name = {
    'Alien-v5':'The task is the Alien-v5 game in Atari environments.',

    'Amidar-v5':'The task is the Amidar-v5 game in Atari environments.',

    'ChopperCommand-v5':'The task is the ChopperCommand-v5 game in Atari environments.',

    'Hero-v5':'The task is the Hero-v5 game in Atari environments.',

    'Pong-v5':'The task is the Pong-v5 game in Atari environments.',

    'Freeway-v5':'The task is the Freeway-v5 game in Atari environments.',

    'MsPacman-v5':'The task is the MsPacman-v5 game in Atari environments.',
}