You are a vision-language model with advanced decision-making abilities.
Your task is to carefully observe the current image and the goal image, and then **give an action for the next step** to reach the desired goal.
You are given the following inputs:
- A **current observation image** that shows the current state of the environment.
- A **goal image** that describes the intended goal.

### Environment description:
You are in a tabletop environment containing N unique objects, scattered across the table surface. These objects differ in both color and shape.

### Action requirements:
- The action should be either 'move {object A} to {object B}' or 'move {object A} to {location}'.
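
For anyone consuming the model's reply downstream, the action format above can be checked with a small parser. This is only a minimal sketch: the names `ACTION_PATTERN` and `parse_action` are hypothetical, and it assumes object and location names are free-form text that the caller validates separately.

```python
import re

# Hypothetical pattern for 'move {object A} to {object B}' or
# 'move {object A} to {location}'. The source is matched lazily so that
# it ends at the first ' to ' in the string.
ACTION_PATTERN = re.compile(r"^move (?P<source>.+?) to (?P<target>.+)$")


def parse_action(text: str):
    """Return (source, target) if text follows the action format, else None."""
    match = ACTION_PATTERN.match(text.strip())
    if match is None:
        return None
    return match.group("source"), match.group("target")
```

The parser does not distinguish an object target from a location target; telling them apart would require the list of objects in the scene, which is not specified here.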

**You should give the action directly. Do not include any other output.**