# Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

This repository contains the code for the paper "Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives".


## Paper abstract

State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it: they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we introduce an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, to completely eliminate the financial incentive to strategize, we introduce a simple incentive-compatible token pricing mechanism. Under this mechanism, the price users pay for an output provided by a model depends on the number of characters of the output: they pay a fixed price per character. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the ``Llama``, ``Gemma`` and ``Mistral`` families, and input prompts from the LMSYS Chatbot Arena platform.
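To make the incentive concrete, here is a toy sketch with hypothetical tokens and prices (arbitrary integer credits, not values from the paper or this repository). The same output string, reported as more tokens, costs more under pay-per-token pricing, while the pay-per-character charge depends only on the string itself.

```python
# Toy comparison of pay-per-token vs. pay-per-character pricing.
# Prices are hypothetical, in arbitrary integer "credits".
PRICE_PER_TOKEN = 4
PRICE_PER_CHAR = 1


def charge_per_token(tokens: list[str]) -> int:
    """Charge proportional to the number of (reported) tokens."""
    return PRICE_PER_TOKEN * len(tokens)


def charge_per_character(tokens: list[str]) -> int:
    """Charge proportional to the number of characters of the output string."""
    return PRICE_PER_CHAR * len("".join(tokens))


# Tokenization the model actually generated vs. a longer one a provider could report.
faithful = ["Hello", " world"]
misreported = ["He", "llo", " wor", "ld"]
assert "".join(faithful) == "".join(misreported)  # same output string

print(charge_per_token(faithful), charge_per_token(misreported))          # 8 16
print(charge_per_character(faithful), charge_per_character(misreported))  # 11 11
```

Misreporting doubles the per-token charge in this example but leaves the per-character charge unchanged, which is why pricing by characters removes the incentive to inflate the token count.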

## Dependencies

All the experiments were performed using Python 3.11.2. In order to create a virtual environment and install the project dependencies you can run the following commands:

```bash
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```



## Repository structure

```
├── data
├── figures
├── notebooks
├── outputs
│   ├── cpt
│   ├── energy_outputs
│   └── heuristic
├── scripts
└── src
    ├── energy.py
    ├── heuristic_misreporting.py
    ├── LMSYS_generation.py
    ├── tokenizations.py
    └── utils.py
```

- `data` contains the processed set of LMSYS prompts used in the experiments. The original dataset has been omitted due to size constraints.
- `figures` contains all the figures presented in the paper.
- `notebooks` contains Jupyter notebooks that generate all the figures included in the paper:
    - `appendix_example.ipynb` generates the example outputs of the heuristic policy presented in the appendix.
    - `cpt.ipynb` analyzes the generated outputs and the effect of pay-per-character pricing across languages.
    - `energy_plots_profit.ipynb` analyzes the energy cost of generation and verification, and plots the increase in the provider's utility as a function of their margin.
    - `plot_profit_no_transparency.ipynb` analyzes the effect of a policy that randomly splits tokens, plotting the increase in overcharged tokens and the likelihood of finding plausible tokenizations.
    - `plots_heur.ipynb` analyzes the results of the heuristic policy, determining how much users can be overcharged and how likely the heuristic policy is to find a plausible tokenization.
    - `process_ds.ipynb` builds the LMSYS dataset.
- `outputs` contains intermediate output files generated by the experiment scripts and analyzed in the notebooks. They can be regenerated using the scripts in the `src` folder.
    - `cpt` contains answers to the LMSYS prompts, used to estimate the number of characters per token. Due to size constraints, not all languages analyzed in the paper are included; you can use `LMSYS_generation.py` to generate the data for the remaining languages.
    - `energy_outputs` contains the results of `energy.py`, used to analyze the energy cost of generation and verification across models.
    - `heuristic` contains the results of running the heuristic algorithm `heuristic_misreporting.py`.
- `scripts` contains the scripts used to run all the experiments presented in the paper.
- `src` contains all the code necessary to reproduce the results in the paper. Specifically:
  - `energy.py` is used to measure the energy consumption of the models for generation and verification.
  - `heuristic_misreporting.py` is the main script used in the paper. It implements the misreporting heuristic based on token indices, runs it on prompts from the LMSYS dataset for multiple iterations, checks plausibility in the last step, and returns the number of plausible longer tokenizations found.
  - `LMSYS_generation.py` is used to generate outputs to prompts from the LMSYS dataset across different languages.
  - `tokenizations.py` contains auxiliary functions for tokenization operations, including finding all possible tokenizations of a string, computing the cumulative autoregressive probability of a token sequence, or verifying if a token sequence is top-p/k plausible.
  - `utils.py` contains auxiliary functions.
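As an illustrative sketch of the operations that `tokenizations.py` covers, the snippet below enumerates all tokenizations of a string, computes the cumulative autoregressive probability of a token sequence, and checks top-p plausibility. Everything here is hypothetical: the toy vocabulary, the toy "model" (a table of next-token distributions keyed by prefix), and the function names are ours, not the module's API. It also assumes that a sequence is top-p plausible when every token lies in the smallest set of most likely next tokens whose probability mass reaches p.

```python
from functools import lru_cache


def all_tokenizations(text: str, vocab: frozenset[str]) -> list[list[str]]:
    """Enumerate every way to write `text` as a sequence of vocabulary tokens."""

    @lru_cache(maxsize=None)
    def from_position(start: int) -> tuple[tuple[str, ...], ...]:
        if start == len(text):
            return ((),)  # one (empty) tokenization of the empty suffix
        found = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocab:
                found.extend((piece,) + rest for rest in from_position(end))
        return tuple(found)

    return [list(t) for t in from_position(0)]


def top_p_set(probs: dict[str, float], p: float) -> set[str]:
    """Smallest set of most likely tokens whose total probability reaches p."""
    kept, cumulative = set(), 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.add(token)
        cumulative += prob
        if cumulative >= p:
            break
    return kept


def is_top_p_plausible(tokens, next_token_probs, p: float) -> bool:
    """True if every token lies in the top-p set given the tokens before it."""
    return all(
        tok in top_p_set(next_token_probs(tuple(tokens[:i])), p)
        for i, tok in enumerate(tokens)
    )


def sequence_probability(tokens, next_token_probs) -> float:
    """Cumulative autoregressive probability: product of next-token probabilities."""
    prob = 1.0
    for i, tok in enumerate(tokens):
        prob *= next_token_probs(tuple(tokens[:i])).get(tok, 0.0)
    return prob


# Hypothetical toy vocabulary and "model" (next-token distributions keyed by prefix).
VOCAB = frozenset({"a", "b", "c", "ab", "bc", "abc"})
TOY_MODEL = {
    (): {"ab": 0.5, "a": 0.375, "abc": 0.125},
    ("ab",): {"c": 1.0},
    ("a",): {"b": 0.875, "bc": 0.125},
    ("a", "b"): {"c": 1.0},
}


def toy_next(prefix):
    return TOY_MODEL.get(prefix, {})


candidates = all_tokenizations("abc", VOCAB)
plausible = [t for t in candidates if is_top_p_plausible(t, toy_next, p=0.8)]
print(plausible)  # [['a', 'b', 'c'], ['ab', 'c']]
```

Here `['a', 'b', 'c']` passes the check even though it has more tokens than `['ab', 'c']`: this is the kind of longer, plausible tokenization a provider could report without raising suspicion. Enumerating all tokenizations grows quickly with the string length, so this brute-force approach only suits short strings.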


## Instructions

### Downloading the models

Our experiments use LLMs from the Llama, Gemma and Mistral families, which are "gated" models, that is, you must accept their license before using them.
You can request access at: [https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), [https://huggingface.co/google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) and [https://huggingface.co/mistralai/Ministral-8B-Instruct-2410](https://huggingface.co/mistralai/Ministral-8B-Instruct-2410).
Once you have access, you can download any model in the Llama, Gemma and Mistral families.
Then, before running the scripts, authenticate with your Hugging Face account by running `huggingface-cli login` in the terminal.
Each model should be downloaded to the `models/` folder.
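A minimal download sketch using `snapshot_download` from the `huggingface_hub` package. The `models/<model name>` sub-folder layout and the helper names `local_dir_for` and `download_model` are hypothetical; check which paths the scripts actually expect before relying on them.

```python
from pathlib import Path


def local_dir_for(repo_id: str, models_dir: str = "models") -> Path:
    """Hypothetical layout: models/<model name>, e.g. models/Llama-3.2-3B-Instruct."""
    return Path(models_dir) / repo_id.split("/")[-1]


def download_model(repo_id: str, models_dir: str = "models") -> Path:
    # Imported lazily so that local_dir_for works without huggingface_hub installed.
    # Gated repositories require accepting the license and running
    # `huggingface-cli login` beforehand.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, models_dir)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

For example, `download_model("meta-llama/Llama-3.2-3B-Instruct")` would place the model files under `models/Llama-3.2-3B-Instruct`.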


### LMSYS experiment
The script [heuristic_misreporting.py](src/heuristic_misreporting.py) generates the outputs needed to reproduce all results based on the heuristic misreporting policy. You can run it in your local Python environment, or on a cluster using the Slurm submission script [script_slurm_heur.sh](scripts/script_slurm_heur.sh) adapted to your machine's specifications. When run through [script_slurm_heur.sh](scripts/script_slurm_heur.sh), the script automatically uses the LMSYS prompts in the file [LMSYS.txt](data/LMSYS.txt). You can use the flag `--model` to select a specific model, such as `meta-llama/Llama-3.2-1B-Instruct`, `--temperature` to set the temperature, `--p` to set the top-p parameter, `--prompts` to pass a list of strings as prompts, and `splits` to select how many iterations of the heuristic to run.
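A heavily simplified, hypothetical sketch of what a heuristic "based on token indices" could look like: at each iteration, split the token with the highest vocabulary ID that can be written as two in-vocabulary tokens, and check plausibility only once, after the last iteration. One possible rationale is that in BPE vocabularies higher IDs often correspond to tokens created by later merges. The tie-breaking rule, the function names and the `is_plausible` callback are our assumptions, not the repository's implementation; see `src/heuristic_misreporting.py` for the actual policy.

```python
from typing import Callable

Tokens = list[str]


def split_highest_index_token(tokens: Tokens, vocab_ids: dict[str, int]) -> Tokens:
    """Split the highest-ID splittable token into two in-vocabulary pieces."""
    # Visit positions from the highest vocabulary ID to the lowest.
    order = sorted(range(len(tokens)), key=lambda i: vocab_ids[tokens[i]], reverse=True)
    for i in order:
        tok = tokens[i]
        pieces = [
            (tok[:k], tok[k:])
            for k in range(1, len(tok))
            if tok[:k] in vocab_ids and tok[k:] in vocab_ids
        ]
        if pieces:
            # Assumed tie-breaking: prefer the split whose pieces have the lowest maximum ID.
            left, right = min(pieces, key=lambda lr: max(vocab_ids[lr[0]], vocab_ids[lr[1]]))
            return tokens[:i] + [left, right] + tokens[i + 1 :]
    return tokens  # nothing left to split


def misreport(
    tokens: Tokens,
    vocab_ids: dict[str, int],
    splits: int,
    is_plausible: Callable[[Tokens], bool],
) -> Tokens:
    """Apply `splits` iterations, then check plausibility once at the end."""
    candidate = list(tokens)
    for _ in range(splits):
        candidate = split_highest_index_token(candidate, vocab_ids)
    # Fall back to the faithful tokenization if the longer one is not plausible.
    return candidate if is_plausible(candidate) else list(tokens)


# Hypothetical toy vocabulary (token -> ID).
VOCAB_IDS = {"a": 0, "b": 1, "c": 2, "d": 3, "ab": 4, "cd": 5, "abcd": 6}
print(misreport(["abcd"], VOCAB_IDS, splits=2, is_plausible=lambda t: True))
# ['ab', 'c', 'd']
```

In practice, `is_plausible` would wrap a check such as top-p plausibility under the actual model; here it is a stub so that the sketch stays self-contained.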
Similarly, the script [energy.py](src/energy.py) measures the energy consumption during generation and verification of token sequences. You can run it in your local Python environment, or on a cluster using the Slurm submission script [script_slurm_lmsys_ennergy.sh](scripts/script_slurm_lmsys_ennergy.sh) adapted to your machine's specifications; the Slurm script automatically uses the LMSYS prompts in the file [LMSYS.txt](data/LMSYS.txt).

The script [LMSYS_generation.py](src/LMSYS_generation.py) generates model outputs to prompts from the LMSYS dataset, which the paper uses to measure the effect of random splitting policies and of pay-per-character pricing. You can run it in your local Python environment, or on a cluster using the Slurm submission script [script_slurm_lmsys_generation_loop.sh](scripts/script_slurm_lmsys_generation_loop.sh) adapted to your machine's specifications; this Slurm script also automatically uses the LMSYS prompts in the file [LMSYS.txt](data/LMSYS.txt).
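The pay-per-character analysis relies on characters-per-token estimates such as those stored in `outputs/cpt`. The sketch below computes an aggregate characters-per-token ratio and derives a per-character price that yields the same revenue as a given per-token price at that ratio. Pooling over outputs and the revenue-matching price rule are our assumptions, not necessarily the paper's exact procedure, and the sample pairs are hypothetical.

```python
def chars_per_token(samples: list[tuple[str, int]]) -> float:
    """Aggregate characters-per-token ratio over (output_text, num_tokens) pairs."""
    total_chars = sum(len(text) for text, _ in samples)
    total_tokens = sum(num_tokens for _, num_tokens in samples)
    return total_chars / total_tokens


def per_character_price(price_per_token: float, cpt: float) -> float:
    """Per-character price that matches per-token revenue at the given ratio."""
    return price_per_token / cpt


# Hypothetical (output, token count) pairs; real counts would come from the tokenizer.
samples = [("Hello world", 2), ("abcd", 3)]
cpt = chars_per_token(samples)          # 15 characters / 5 tokens = 3.0
print(per_character_price(6.0, cpt))    # 2.0
```

Since the ratio can differ across languages, the same computation can be repeated per language.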
To reproduce all the figures, run the [notebooks](notebooks/).





