# Tokenizer Robustness Experiments
This directory contains scripts and results for tokenizer-level robustness tests. The experiments analyze how different tokenizers process invisible Unicode characters and emojis, with a focus on tokenization stability and robustness.

## Scripts

### tokenize_all

Evaluates all supported tokenizers on invisible Unicode characters and emojis.  
Generates two main outputs:
- `emoji_tokenization_analysis_merged`
- `invisible_char_token_counts_merged`

These outputs record the number of tokens produced by each tokenizer for each character.
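
The per-character counting can be sketched roughly as follows. This is a minimal illustration, not the actual `tokenize_all` code: the two toy tokenizers, the `count_tokens` helper, and the character list are all assumptions made so the example runs without downloading any real tokenizer.

```python
# Toy stand-ins for real tokenizers (hypothetical, for illustration only).
def utf8_byte_tokenizer(text):
    """Return one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def code_point_tokenizer(text):
    """Return one token per Unicode code point."""
    return list(text)


TOKENIZERS = {
    "utf8_bytes": utf8_byte_tokenizer,
    "code_points": code_point_tokenizer,
}

# A small, assumed sample of invisible characters and emojis.
CHARACTERS = {
    "ZERO WIDTH SPACE": "\u200b",
    "ZERO WIDTH JOINER": "\u200d",
    "GRINNING FACE": "\U0001F600",
}


def count_tokens(tokenizers, characters):
    """Map each (tokenizer, character) pair to its token count."""
    return {
        (tok_name, char_name): len(tokenize(char))
        for tok_name, tokenize in tokenizers.items()
        for char_name, char in characters.items()
    }


token_counts = count_tokens(TOKENIZERS, CHARACTERS)
for (tok_name, char_name), n in sorted(token_counts.items()):
    print(f"{tok_name:12s} {char_name:18s} {n}")
```

With a real tokenizer, only the callables in `TOKENIZERS` would change; the counting loop stays the same.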

### aggregate_tokenization

Aggregates the results produced by `tokenize_all`:
- **Aggregate by character**: histograms of token counts per character.
- **Aggregate by tokenizer**: histograms of token counts per tokenizer.
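
As a rough sketch of the two aggregation modes, the snippet below builds token-count histograms keyed either by character or by tokenizer. The record layout `(tokenizer, character, token_count)`, the `aggregate` helper, and the sample values are assumptions made for illustration; they are placeholders, not measured results or the actual file format.

```python
from collections import Counter, defaultdict

# Hypothetical records: (tokenizer, character, token_count).
# The numbers are placeholders, not experimental results.
records = [
    ("tok_a", "U+200B", 1),
    ("tok_a", "U+200D", 3),
    ("tok_b", "U+200B", 3),
    ("tok_b", "U+200D", 3),
]


def aggregate(records, key_index):
    """Build a histogram {token_count: frequency} for each key.

    key_index=0 groups by tokenizer, key_index=1 groups by character.
    """
    histograms = defaultdict(Counter)
    for record in records:
        token_count = record[2]
        histograms[record[key_index]][token_count] += 1
    return {key: dict(hist) for key, hist in histograms.items()}


by_character = aggregate(records, key_index=1)
by_tokenizer = aggregate(records, key_index=0)

print(by_character)
print(by_tokenizer)
```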

### plot*

Generates visualizations comparing emojis and invisible characters, including:
- Token count histograms per tokenizer
- Cross-tokenizer comparisons
- Analysis of the most robust characters across tokenizers

## Files

### emoji_tokenization_analysis, invisible_char_token_counts

Outputs generated by `tokenize_all`.  
For each tokenizer–character pair, these files record:
- The number of tokens produced when tokenizing the character alone
- The increase in token count when the character is inserted between two normalized characters
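
The two recorded quantities can differ, because an inserted character may also break a merge between its neighbors. The sketch below illustrates this with a toy greedy longest-match tokenizer with byte fallback. The vocabulary, the `toy_tokenize` and `insertion_record` helpers, and the choice of `a` and `b` as the surrounding characters are all assumptions for illustration, not the repository's implementation.

```python
# Hypothetical vocabulary: "ab" is a single merged token.
VOCAB = {"ab", "a", "b"}
MAX_PIECE_LEN = max(len(piece) for piece in VOCAB)


def toy_tokenize(text):
    """Greedy longest-match tokenization with UTF-8 byte fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary piece first.
        for length in range(min(MAX_PIECE_LEN, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            # Unknown character: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def insertion_record(tokenize, char, left="a", right="b"):
    """Return the standalone count and the increase caused by insertion."""
    alone = len(tokenize(char))
    baseline = len(tokenize(left + right))
    with_char = len(tokenize(left + char + right))
    return {"alone": alone, "increase": with_char - baseline}


record = insertion_record(toy_tokenize, "\u200b")  # ZERO WIDTH SPACE
print(record)
```

In this toy setup the zero-width space costs 3 tokens alone, but inserting it between `a` and `b` adds 4, since the merged token `ab` is split into `a` and `b`.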

