# PPT-Eval 

A benchmark for evaluating Computer-Use Agents on slide deck editing tasks in Microsoft PowerPoint Online.

Note: PowerPoint Online was chosen as the slide deck editor because it is the closest match to desktop PowerPoint, including features that open source implementations lack, and it is still available for free with a consumer OneDrive account.

## Installation
This project has been tested on:
- Windows with WSL 2 and Python 3.12 (no COM support)
- Windows with Python 3.11+

## Framework
This project uses HuggingFace ScreenEnv to set up a virtualized sandbox where agents can carry out various tasks. The orchestrator uploads the data files to your OneDrive account and opens them in PowerPoint from that account, in a browser running inside the sandbox.

## Setup
This section details the setup process. Please note that you will need the following:
1. A free-tier OneDrive account
2. A free-tier Microsoft Entra account to set up automation access to OneDrive
3. The ability to run Docker containers on your machine

### Step 1: Python Environment
It is recommended to create a conda or similar virtual environment for the project.

From the repository root, install the officearena package:

```sh
pip install -e .
```

Then install the AI Rubric package:

```sh
cd ./rubric/
pip install -e .
```

### Step 2: Docker

- Install Docker (https://docs.docker.com/) and **ensure the Docker daemon is running**.
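
A quick way to confirm that the daemon is reachable before starting a run is to call `docker info`, which exits with a non-zero status when it cannot contact the daemon. The helper below is a minimal sketch and is not part of this project; the function name `docker_daemon_running` is hypothetical.

```python
import subprocess
from typing import Sequence


def docker_daemon_running(
    cmd: Sequence[str] = ("docker", "info"), timeout: float = 20.0
) -> bool:
    """Return True if `cmd` exits with status 0.

    With the default command this means the Docker CLI is installed
    and was able to contact the daemon.
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,  # only the exit status matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # OSError: the executable was not found or could not be started.
        # TimeoutExpired: the command did not answer in time.
        return False
    return result.returncode == 0
```

Calling `docker_daemon_running()` with no arguments runs the real check; the `cmd` parameter exists only so that the helper can be exercised without Docker.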

## Logging into a Microsoft Account
To use the Office applications, you need a valid personal Microsoft account, which provides access to OneDrive for consumers.

Useful tip: Create a free account that is separate from your personal account so your OneDrive does not get cluttered with files from the benchmark.

### Providing access to the OneDrive account
First, create a free Entra account, then register an app:
#### Register an app in Azure Portal
1. Go to the Microsoft Entra admin center (https://entra.microsoft.com/#home) and log in with a **personal** Microsoft account.
2. Navigate to App registrations > New registration.
3. Provide a name for your application (APPLICATION_NAME).
4. Choose "Accounts in any organizational directory and personal Microsoft accounts".
5. Click Register.
6. Copy the Application (client) ID into the environment file and save it as CLIENT_ID. This application will be used to access your OneDrive.
7. Click Authentication in the sidebar and set "Allow public client flows" to Yes.
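
To illustrate why the client ID and the public client setting matter, here is a minimal sketch of a device-code sign-in using the MSAL library for Python (`pip install msal`). This is an assumption about how such access can work, not a description of this project's code: the authority URL, the `Files.ReadWrite` scope, and the helper names `device_flow_settings` and `acquire_token` are all illustrative.

```python
from typing import Any, Dict, List

# Assumptions, not confirmed by this README: the authority and the scope.
AUTHORITY = "https://login.microsoftonline.com/consumers"  # personal accounts
SCOPES: List[str] = ["Files.ReadWrite"]  # Microsoft Graph delegated permission


def device_flow_settings(client_id: str) -> Dict[str, Any]:
    """Collect the values needed for a public-client (no secret) sign-in."""
    cleaned = client_id.strip()
    if not cleaned:
        raise ValueError("CLIENT_ID is empty; copy it from the app registration")
    return {"client_id": cleaned, "authority": AUTHORITY, "scopes": SCOPES}


def acquire_token(client_id: str) -> str:
    """Run the device-code flow and return an access token."""
    import msal  # third-party; imported lazily so the helper above needs no install

    settings = device_flow_settings(client_id)
    app = msal.PublicClientApplication(
        settings["client_id"], authority=settings["authority"]
    )
    flow = app.initiate_device_flow(scopes=settings["scopes"])
    if "user_code" not in flow:
        raise RuntimeError(f"Could not start the device flow: {flow}")
    print(flow["message"])  # tells you which URL to open and which code to enter
    result = app.acquire_token_by_device_flow(flow)  # blocks until sign-in completes
    if "access_token" not in result:
        raise RuntimeError(result.get("error_description", "sign-in failed"))
    return result["access_token"]
```

The device-code flow uses no client secret, which is why the app registration has to allow public client flows.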

### Creating your .env file
In the root folder, create a .env file with the following keys:
```sh
CLIENT_ID=<client ID from the previous step>
RUBRIC_DEFAULT_LLM=<your value>
AZURE_API_BASE=<your value>
AZURE_OPENAI_ENDPOINT=<your value>
AZURE_OPENAI_API_VERSION=<your value>
AZURE_API_VERSION=<your value>
AZURE_OPENAI_DEPLOYMENT_NAME=<your value>
OPENAI_API_KEY=<your value>
OPENAI_API_VERSION=<your value>
UITARS_ENDPOINT_URL=<your value>
UITARS_TOKEN=<your value>
CUA_API_KEY=<your value>
CUA_ENDPOINT=<your value>
CUA_BASE_URL=<your value>
CUA_MODEL_NAME=<your value>
ANTHROPIC_API_KEY=<your value>
ANTHROPIC_BASE_URL=<your value>
```

- Note 1: This project used "anthropic/claude-sonnet-4-20250514" as the default evaluation LLM.
- Note 2: The codebase supports direct access to Claude, Azure OpenAI, and OpenAI.
- Note 3: UITARS-7B was deployed on cloud compute as a REST endpoint that requires a key. Your endpoint implementation may differ, and the adapter may need changes if your endpoint expects a different input format.
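
Before a long run, it can help to confirm that the keys you rely on are actually set in the .env file. The following is a minimal sketch using only the standard library; it is not part of this project. The helper names are hypothetical, and which keys count as required depends on the models you run, so the list is passed in by the caller.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim whitespace and one layer of surrounding quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent or empty, in the order given."""
    return [key for key in required if not values.get(key)]
```

For example, `missing_keys(parse_env_file(Path(".env")), ["CLIENT_ID", "ANTHROPIC_API_KEY"])` returns an empty list when both keys have values. Note that this only checks for empty values; it does not detect placeholder text left in the file.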


### Install Playwright Browser (optional)
Installing Playwright on your host machine lets the verification step download the slides as images without needing PowerPoint installed locally.

Note: the project can also render slides with COM (the Microsoft PowerPoint desktop application) or with LibreOffice + Poppler/GhostScript instead of this online option, but open source tools may not render the files exactly as PowerPoint intended.

`playwright install`

### Download the PowerPoint sample files from Internet Archive
Run `python download_data_files.py` to download the files from Internet Archive.

**Source Attribution:**
The files provided from Internet Archive are licensed for use with source attribution. The attributions are listed in ATTRIBUTION.md.

**!! Important !!**
After you download the files, upload them to your OneDrive, open each one in PowerPoint, and then download it back to your local machine.
If you skip this step, the evaluation code will detect extraneous changes in the files, because PowerPoint Online modifies a file when it opens it (it changes the order of certain elements in the underlying XML), and your tasks will score lower than expected.
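
If you want to see which parts of a deck changed during this round trip, note that a .pptx file is a zip package of XML parts, so two copies can be compared part by part. The sketch below is a standalone illustration using only the standard library; it is not the project's evaluation code, and the helper names are hypothetical.

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Dict, List


def part_hashes(pptx_path: Path) -> Dict[str, str]:
    """Map each part name inside the package to the SHA-256 of its bytes."""
    with zipfile.ZipFile(pptx_path) as package:
        return {
            info.filename: hashlib.sha256(package.read(info.filename)).hexdigest()
            for info in package.infolist()
            if not info.is_dir()
        }


def changed_parts(before: Path, after: Path) -> List[str]:
    """List parts that were added, removed, or whose bytes differ."""
    old, new = part_hashes(before), part_hashes(after)
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )
```

Comparing the file from Internet Archive with the copy downloaded from PowerPoint Online in this way shows which parts were rewritten; a byte-level comparison also flags reordered XML elements, since reordering changes the bytes.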

After uploading each file and opening it in PowerPoint, download both:
1. `<file>.pptx`: The file itself (File > Create a copy > Download a copy)
2. `<file>.zip`: Images of the slides (File > Export > Export as images)

**Save both `<file>.pptx` and `<file>.zip` into data/files/PowerPoint**

Having the zip file of slide images vastly speeds up running the benchmark. While the images can also be obtained locally, using the exported zip is more reliable.
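
Since every deck needs a matching zip of slide images, a short check can catch files that were missed. This is a minimal sketch, not part of the project; the function name is hypothetical, and it assumes each zip shares the base name of its .pptx file, as described above.

```python
from pathlib import Path
from typing import List


def decks_missing_images(folder: Path) -> List[str]:
    """Return the names of .pptx files that have no same-named .zip beside them."""
    return sorted(
        deck.name
        for deck in folder.glob("*.pptx")
        if not deck.with_suffix(".zip").exists()
    )
```

Running `decks_missing_images(Path("data/files/PowerPoint"))` from the repository root should return an empty list once every deck has its exported images.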

### Running the benchmark

```sh
python -m officearena.run_benchmark --models cua uitars claude --registry task_registry --results-dir results --use-cached-original-images --num-concurrent 3 --mode=evaluate --onedrive-path /OfficeArena
```

Note: Increasing concurrency past 3 might lead to more errors due to rate limits from OneDrive.

Note: The first run will be slow because the Docker image needs to be downloaded.
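
If you launch runs from a script, the command can be assembled programmatically. The sketch below only rearranges the flags shown in the command above; the helper name `build_benchmark_command` is hypothetical, and whether `--models` accepts a subset of the three models is an assumption.

```python
import sys
import warnings
from typing import List, Sequence


def build_benchmark_command(
    models: Sequence[str],
    num_concurrent: int = 3,
    results_dir: str = "results",
    onedrive_path: str = "/OfficeArena",
    registry: str = "task_registry",
) -> List[str]:
    """Build the argument list for officearena.run_benchmark."""
    if not models:
        raise ValueError("at least one model is required")
    if num_concurrent < 1:
        raise ValueError("num_concurrent must be at least 1")
    if num_concurrent > 3:
        # Mirrors the note above: higher values may hit OneDrive rate limits.
        warnings.warn("concurrency above 3 may trigger OneDrive rate limits")
    return [
        sys.executable, "-m", "officearena.run_benchmark",
        "--models", *models,
        "--registry", registry,
        "--results-dir", results_dir,
        "--use-cached-original-images",
        "--num-concurrent", str(num_concurrent),
        "--mode=evaluate",
        "--onedrive-path", onedrive_path,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. Using `sys.executable` keeps the run inside the same virtual environment in which the packages were installed.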



