# TALES - Text Adventure Learning Environment Suite

TALES is a benchmark consisting of a diverse collection of synthetic and human-written text-adventure games designed to evaluate the reasoning capabilities of Large Language Model (LLM)-based agents.

### WHAT CAN TALES DO

TALES was developed to evaluate LLM-based agents’ ability to solve text-adventure games. Text-adventure games are goal-oriented environments in which an agent interacts with a game engine over multiple steps to understand the goal, explore the game world, find clues, and plan its way to a solution. We curated the games in TALES to cover a diverse spectrum of reasoning skills that an LLM-based agent may need when solving real-world tasks, such as inductive, deductive, spatial, and grounded reasoning. We believe that, while being much more cost-efficient than realistic tasks, testing LLM-based agents on TALES can provide useful insights into the agents from different angles, including LLM backbones, agent architecture design, and prompt engineering. These insights can further guide practitioners in developing agents for use cases beyond text-adventure games.
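The multi-step interaction described above can be illustrated with a minimal sketch. Everything here is hypothetical and is not the TALES API: `ToyGame` is a two-action stand-in for a game engine, `scripted_policy` is a placeholder for an LLM call, and `run_episode` is an assumed name for the observe-act loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyGame:
    """Hypothetical one-room game: take the key, then unlock the door."""
    max_steps: int = 10
    has_key: bool = field(default=False, init=False)
    steps: int = field(default=0, init=False)

    def reset(self) -> str:
        """Start a new episode and return the first observation."""
        self.has_key = False
        self.steps = 0
        return "You are in a hall. A key lies on the floor. A locked door leads north."

    def step(self, command: str) -> Tuple[str, float, bool]:
        """Apply one text command; return (observation, reward, done)."""
        self.steps += 1
        out_of_steps = self.steps >= self.max_steps
        if command == "take key" and not self.has_key:
            self.has_key = True
            return "You pick up the key.", 0.5, out_of_steps
        if command == "unlock door":
            if self.has_key:
                return "The door swings open. You win!", 0.5, True
            return "The door is locked and you have no key.", 0.0, out_of_steps
        return "Nothing happens.", 0.0, out_of_steps


def scripted_policy(observation: str, history: List[str]) -> str:
    """Stand-in for an LLM call: pick a command from the latest observation."""
    if "key lies" in observation:
        return "take key"
    return "unlock door"


def run_episode(env: ToyGame, policy: Callable[[str, List[str]], str]) -> Tuple[float, List[str]]:
    """Run the observe-act loop until the game ends; return (score, commands)."""
    observation = env.reset()
    history: List[str] = []
    total = 0.0
    done = False
    while not done:
        command = policy(observation, history)
        history.append(command)
        observation, reward, done = env.step(command)
        total += reward
    return total, history


score, commands = run_episode(ToyGame(), scripted_policy)
print(score, commands)
```

In an actual evaluation, the policy would be replaced by a call to the agent under test, and the step budget keeps an agent that never finds the solution from looping indefinitely.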

A detailed discussion of TALES, including how it was developed and tested, can be found in our paper at: "Anonymized for Review"


### INTENDED USES

TALES is best suited for evaluating AI agents’ capability to solve text-adventure games.

TALES is being shared with the research community to facilitate reproduction of our results and foster further research in this area.

TALES is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.

### OUT-OF-SCOPE USES

TALES is designed exclusively for evaluation; it is not well suited for training AI agents.

We developed TALES for research purposes only; the benchmark does not cover all criteria necessary for real-world decision making. We do not recommend using TALES in any way to make real-world decisions.

### LIMITATIONS 

TALES was developed for research and experimental purposes. The games in the benchmark were selected exclusively to test LLM-based agents’ inductive, deductive, spatial, and grounded reasoning capabilities. We acknowledge that in real-world scenarios, the decision-making process may require additional context, more complex reasoning, and combinations of multiple reasoning types. We do not claim that our research findings can be directly transferred to real-world decision making. Further testing and validation are needed before considering the application of TALES in commercial or real-world scenarios.


TALES was designed and tested using the English language. Performance in other languages may vary and should be assessed by someone who is both an expert in the expected outputs and a native speaker of that language. 

Outputs generated by AI may include factual errors, fabrication, or speculation. Users are responsible for assessing the accuracy of generated content. All decisions leveraging outputs of the system should be made with human oversight and not be based solely on system outputs.

### BEST PRACTICES 

We strongly encourage users to use LLMs/MLLMs that support robust Responsible AI (RAI) mitigations, such as Azure OpenAI (AOAI) services. Such services continually update their safety and RAI mitigations to follow the latest industry standards for responsible use. For more on AOAI’s best practices when employing foundation models for scripts and applications, see:

"Anonymized for Review"

"Anonymized for Review"

"Anonymized for Review"

[OpenAI’s Usage policies](https://openai.com/policies/usage-policies)

"Anonymized for Review"

TALES contains a set of text-adventure games curated specifically to support our research on LLM-based agents’ capability to perform certain types of reasoning. We refer practitioners to our paper "Anonymized for Review" for detailed guidance on how to use this benchmark properly and how to interpret an LLM-based agent’s results on it correctly. Additionally, we recommend that practitioners use TALES in concert with other benchmarks to understand LLM-based agents’ performance and capabilities from multiple perspectives and thus gain a less biased view.

### LICENSE

TALES is released under the MIT license; please see the [license file](https://github.com/"Anonymized for Review"/blob/main/LICENSE).

### CONTACT

We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us via [GitHub issues](https://github.com"Anonymized for Review"/issues) or at "Anonymized for Review"
