
# TODO
Instructions for Hugging Face
- First install awq if you want to run AWQ models
- Then install the latest version of `transformers` to be able to run Qwen-VL
- The order matters because awq downgrades `transformers` to a version where Qwen-VL is not available

# Change the log file saving to the same dir as the results. Create a dir that holds logs
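
One possible way to do this, as a minimal sketch: the helper name `setup_run_logger`, the `logs` sub-directory name, and the log format are assumptions for illustration, not existing code in this repo. It creates a `logs/` directory inside the results directory and attaches a file handler to a named logger.

```python
import logging
import os
from pathlib import Path


def setup_run_logger(result_dir, name="run"):
    """Create <result_dir>/logs/<name>.log and return (logger, log_path).

    Hypothetical helper: names and layout are assumptions.
    """
    log_dir = Path(result_dir) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / f"{name}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Avoid attaching a second handler for the same file on repeated calls.
    target = os.path.abspath(str(log_path))
    already_attached = any(
        isinstance(h, logging.FileHandler) and h.baseFilename == target
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger, log_path
```

With this layout, the logs of a run travel together with its results and saved config.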


# Agentic:
## - Possibly select and caption images from an external database
## - Smart State selection
## - Each agent can have a propose and critique loop inside
## - Example selector
## - Memory consolidator
## - Memory retriever


## Logistics
- Documentation for `experiments_utils`
- Cleanup of Prompt Constructors
- Improve naming and pattern for results organization considering the Agent configuration (currently based on provider and model used).
    - Why: models can differ among sub-agents; the current pattern does not tell whether, e.g., Critique was used (the config is saved within the folders though, so it is possible to find out from them)
- Organize the Agent and PromptConstructor classes. Create one class for each agent, their prompt constructors and other components.
- In the ExecutorAgent, change the way that feedback is given for parsing error:
    - Move the logic to create the verbal feedback from env_utils to the Agent file
    - Use the act_parse_retry function from the Agent class

- in `prompt_constructor.py`:
    - Refactor the 'instruction' structure. Instead, use a directory structure separating `examples`, `system_prompts`, interaction-level `prompt_template`, etc.
    - Implement dynamic examples

- Refactor the captioning logic. Files: image_utils, captioner_utils, host_captioner.py
- Modularize away the data collection code from `test` loop in `run.py`
- Modularize prompt creation for HuggingFace in `prompt_constructor.py`

- Code and test for other generation algorithms, e.g., beam search? Currently: greedy decoding or temperature with p/k sampling
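
As a rough sketch of how the decoding strategies could be switched in one place: the function `build_generation_kwargs` and the strategy names are hypothetical. The returned keys (`max_new_tokens`, `do_sample`, `temperature`, `top_p`, `top_k`, `num_beams`, `early_stopping`) are standard keyword arguments of `generate` in Hugging Face `transformers`; the default values here are placeholders.

```python
def build_generation_kwargs(
    strategy,
    max_new_tokens=256,
    temperature=1.0,
    top_p=1.0,
    top_k=50,
    num_beams=4,
):
    """Return keyword arguments for `model.generate(**kwargs)`.

    Hypothetical helper; `strategy` is one of "greedy", "sampling", "beam".
    """
    base = {"max_new_tokens": max_new_tokens}
    if strategy == "greedy":
        # Deterministic: pick the most likely token at every step.
        return {**base, "do_sample": False, "num_beams": 1}
    if strategy == "sampling":
        # Temperature sampling restricted by top-p / top-k.
        return {
            **base,
            "do_sample": True,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
        }
    if strategy == "beam":
        # Beam search without sampling.
        return {
            **base,
            "do_sample": False,
            "num_beams": num_beams,
            "early_stopping": True,
        }
    raise ValueError(f"Unknown generation strategy: {strategy!r}")
```

Usage would look like `model.generate(**inputs, **build_generation_kwargs("beam"))`, assuming `inputs` comes from the model's processor or tokenizer.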

## Bugs on the environment
- SOM click failures in cases where elements are not exactly at the center
    - for the `click` action, there is an ad-hoc solution that works. Check whether this can also be a problem for `type` and other actions that require clicking but don't change the URL
    - file: browser_env/actions.py -> execute_action

- shopping 309: the image input sent to GPT was too zoomed in. Why?
- in the SOM marker, there are elements such as [A] which are not explained in the prompt. They seem to be 'anchor' elements from the HTML / DOM tree. Give more context to the model.

- in browser_env/actions.py , `create_id_based_action` 
    - the `type` action automatically adds the `enter` command. 
    - This helps in some test cases, for example when the agent has to 'search' for an object
    - But it is a problem in others (like shopping 237, where the Agent has to fill in the user address).
    - Possible solution: Let the agent decide when to press enter?
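
A minimal sketch of the "let the agent decide" option, assuming the action string gains an optional trailing `[0]`/`[1]` flag. The function `parse_type_action`, the flag syntax, and the returned dict are hypothetical and not the current behavior of `create_id_based_action`.

```python
import re

# type [<element id>] [<text>] with an optional trailing [0] or [1] flag.
_TYPE_RE = re.compile(r"^type \[([^\]]+)\] \[(.*?)\](?: \[([01])\])?$")


def parse_type_action(action, default_enter=True):
    """Parse a `type` action string into its parts (hypothetical helper).

    If the flag is missing, fall back to `default_enter`, which mirrors the
    current behavior of always pressing Enter after typing.
    """
    match = _TYPE_RE.match(action.strip())
    if match is None:
        raise ValueError(f"Not a valid type action: {action!r}")
    element_id, text, flag = match.groups()
    press_enter = default_enter if flag is None else flag == "1"
    return {"element_id": element_id, "text": text, "press_enter": press_enter}
```

For example, `type [7] [red blanket]` keeps the auto-Enter default for search boxes, while `type [12] [Main Street 1] [0]` would fill an address field without submitting the form. The prompt would also need to explain the flag to the agent.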

- in `type` actions, it is difficult for the agent to clear the element before typing
    - in `browser_env/actions.py` there is a 'clear' action, but it is not used anywhere nor passed in the prompt
    - there is a commented section in the `browser_env/actions.py`, `execute_type` that clears the element first
    - possible solution: 
        - 1. allow the use of the `clear` action directly
        - 2. automatically clear before typing
        - 3. Assume the agent has to clear the field first. What is misleading is that there is a direct `type` instruction, unlike for humans, who have to click on a field before typing. The Agent has to understand that it must click on the field and clear it with a combination of key presses (Ctrl+A, then Delete, each being its own action), and then `type` (but never `click` then `type`, since that is not necessary).
        - See also the notes on problems with key presses below.
    - This makes tasks that need to fill in 'qty' fields in `shopping` almost impossible.
        - Example: shopping 255 gpt4o state aware execution
            - (i) not clearing the field makes model add 21 items instead of 2
            - (ii) the automatic 'enter' adds the 21 items to the cart. If the model planned to click 'add to cart' after filling the field, it gets confused by the messages that appear on the screen (after adding 21, no more can be added and a 'quantity not available' warning is shown)
    - This makes tasks that change the address almost impossible.
        - Example: shopping 339 gpt4o state aware execution
            - (i) Not clearing the text input field makes this task unreasonably difficult.
            - (ii) the automatic ENTER after entering text changes the screen unexpectedly
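
Option 2 (automatically clear before typing) could look roughly like the sketch below. The helper name `clear_and_type` is hypothetical, and it assumes the target field is already focused. It only relies on `page.keyboard.press` and `page.keyboard.type`, which are part of Playwright's Python API.

```python
def clear_and_type(page, text, press_enter=False):
    """Clear the focused input, type `text`, optionally press Enter.

    Hypothetical helper. `page` is expected to be a Playwright `Page`
    (or any object exposing the same `keyboard` interface).
    """
    page.keyboard.press("Control+a")  # select the existing content
    page.keyboard.press("Backspace")  # delete the selection
    page.keyboard.type(text)
    if press_enter:
        page.keyboard.press("Enter")
```

Two caveats: on macOS, select-all is bound to `Meta+a` rather than `Control+a`, and keeping `press_enter=False` as the default changes the current auto-Enter behavior, so search-style tasks would need the flag set explicitly.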

- Cannot leave review stars when using SoM input

- Problem in key press, `browser_env/actions.py`, `execute_key_press`
    - something like `press [ctrl+a]` will have no effect. Playwright requires the format: `'Control+a'`
    - Note: key names must always have the first letter capitalized, e.g. `Backspace`
    - Solution: create functions to translate ctrl, shift, etc to playwright equivalents
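
A minimal sketch of such a translation function. The name `to_playwright_key` and the alias table are assumptions for illustration, and the table is deliberately incomplete.

```python
# Hypothetical alias table: lowercase agent spellings -> Playwright key names.
_KEY_ALIASES = {
    "ctrl": "Control",
    "control": "Control",
    "shift": "Shift",
    "alt": "Alt",
    "meta": "Meta",
    "cmd": "Meta",
    "enter": "Enter",
    "backspace": "Backspace",
    "delete": "Delete",
    "del": "Delete",
    "tab": "Tab",
    "esc": "Escape",
    "escape": "Escape",
    "pageup": "PageUp",
    "pagedown": "PageDown",
}


def to_playwright_key(combo):
    """Translate e.g. 'ctrl+a' into 'Control+a' (hypothetical helper)."""
    parts = [p.strip() for p in combo.split("+") if p.strip()]
    normalized = []
    for part in parts:
        alias = _KEY_ALIASES.get(part.lower())
        if alias is not None:
            normalized.append(alias)
        elif len(part) == 1:
            # Single characters such as 'a' are passed through unchanged.
            normalized.append(part)
        else:
            # Fallback: capitalize the first letter, keep the rest as given.
            normalized.append(part[0].upper() + part[1:])
    return "+".join(normalized)
```

The fallback only capitalizes the first letter, so multi-word keys typed in all lowercase (e.g. `arrowdown`) would still need their own alias entries, and a literal `+` key is not handled.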


## Done:
- Fixed bug: tab info not shown in `get_observation` when text input from the Image processor overwrites text input from TextProcessor
- Fixed bug in `click` when SOM action tag
- Define subset of representative tasks for evaluation
- Give more context in SOM previous actions hint
    - Originally, previous action is like: `click [14] where [14]`
    - Change this to give more context; example: the alt-text and type of element
    - the numbers of SOM change from page to page. 14 in one page may be a totally different thing
    -  `type [7] [red blanket] where [7]`
        - why not saying 7 was a search bar?
- Remove the LMConfig logic from the codebase
- In critique modes, compute the scores without the revisions during the navigation
- add token counting per agent
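
For reference, the richer SOM previous-action hint described above could be produced along these lines. This is only an illustrative sketch, not the implemented code: the function name `describe_action`, the `elements` mapping, and its `tag` / `text` / `alt` keys are assumptions.

```python
import re


def describe_action(action, elements):
    """Append a short description of the target element to an action string.

    Hypothetical helper. `elements` maps SOM ids (as strings) to dicts with
    optional 'tag', 'text' and 'alt' entries for the page where the action ran.
    """
    match = re.search(r"\[([^\]]+)\]", action)
    if match is None:
        return action  # action without an element id, e.g. a scroll
    element_id = match.group(1)
    info = elements.get(element_id)
    if info is None:
        return action  # unknown id: leave the action untouched
    tag = info.get("tag", "element")
    label = info.get("text") or info.get("alt") or ""
    description = f"{tag} '{label}'" if label else tag
    return f"{action} where [{element_id}] is {description}"
```

With `elements = {"7": {"tag": "input", "text": "Search"}}`, the action `type [7] [red blanket]` becomes `type [7] [red blanket] where [7] is input 'Search'`, which stays meaningful even after the SOM numbering changes on the next page.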


## Model / Agent implementations

### Planning
- implement the planner, with one-pass, two-pass
- implement dynamic planner

### Dynamic examples
- Model generates examples to use as in-context learning? Such as to solve the keyboard combination problem


### Critique
- add options to select just first and last state
- conditional state critique (as of Dec 12, all critique is "unconditional", i.e., conditioned only on the intent)
- Improve the logic for the different types of execution history to make it more organized (see p_critique.py)
