ToolFuzz - Correctness Failure Report

Raw results file.

{% for tool, tool_results in test_results.items() %} {% set tool_loop = loop %}

{% for result in tool_results %} {% set result_loop = loop %} {% set highlight = 'text-danger' if not result.tool_arguments_inconsistency and not result.tool_output_inconsistency and result.individual_run_test_results | selectattr('unexpected_agent_output', 'lt', 5) | list | length > 0 else '' %}

LLM Output Expectation:

{{ result.llm_output_expectation }}

Individual Test Runs:

Tool output, agent output, and agent trace are truncated for brevity. The full contents are available in the raw results file.

{% for test in result.individual_run_test_results %} {% set test_loop = loop %} {% set highlight = 'text-danger' if not result.tool_arguments_inconsistency and not result.tool_output_inconsistency and test.unexpected_agent_output < 5 else '' %}

  • {% if test.input_bucket != -1 %} {{ test.input_bucket }} {% endif %} Input arguments for tool invocation: {{ test.tool_arguments }}
  • {% if test.output_bucket != -1 %} {{ test.output_bucket }} {% endif %} Tool Output:
    {{ test.tool_output | safe }}
  • Agent Output:
    {{ test.agent_output | safe }}
  • Observed Runtime Tool Failure: {{ test.tool_failure }}
  • LLM Oracle relevancy reasoning: {{ test.llm_agent_out_reason }}. {% set llm_eval = 'text-danger' if test.unexpected_agent_output < 5 else '' %} Score: {{ test.unexpected_agent_output }}
  • Trace:
    {{ test.trace | safe }}
{% endfor %}
{% endfor %}
{% endfor %}