## Scripts Overview

- `split_toxic_vocab.py`: Splits toxic words into train/test sets by category.
- `generate_dataset_word.py`: Applies GlyphPerturber to toxic words and generates word-level images.
- `unsafe_text_generator_category.py`: Uses GPT-4o to generate category-specific unsafe and safe multiline texts from a fixed set of scenarios.
- `matching_multiline.py`: Matches raw words with their perturbed counterparts and substitutes the perturbed forms into the generated multiline texts.
- `generate_dataset_multiline.py`: Generates images from transformed multiline texts.
- `generate_words_safe.py`: Generates word-level images for safe (neutral) words.
- `generate_multiline_safe.py`: Generates images from safe multiline texts.
- `img_generator_easyocr.py`: Generates 32×32 center-aligned glyph images from a list of Unicode characters using mixed multilingual fonts. Used to prepare character-level images for embedding training.
- `vec_generator_easyocr.py`: Extracts 256-dimensional embedding vectors for glyph images using EasyOCR's encoder, and saves them in Word2Vec `.normalized` format. Used for training or applying glyph-level perturbation based on visual similarity.
- `glyphperturber.py`: Applies glyph-level perturbation by replacing characters with visually similar alternatives, chosen from a pre-trained embedding (e.g., EasyOCR). Reads a list of input words from stdin and writes `transformed_words.txt` and `linked_words.txt`.
- `mychars.txt`: Contains a list of approximately 30,000 Unicode characters (one per line). Used as the character pool for glyph image generation and embedding extraction.
- `examples/`: Contains configuration (`config.yaml`) and template (`template.py`) files used for image generation via SynthTiger. Includes setups for both word-level and multiline image generation, including safe and perturbed versions.
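
The glyph-level perturbation described for `glyphperturber.py` can be illustrated with a minimal sketch. This is not the script's actual code: the function names (`cosine`, `nearest_glyph`, `perturb_word`), the `rate` parameter, and the 3-dimensional toy vectors are all assumptions made for illustration, standing in for the 256-dimensional embeddings produced by `vec_generator_easyocr.py`. The idea shown is only the general one, namely replacing each character with its nearest neighbor in an embedding space by cosine similarity.

```python
import math
import random

# Toy 3-dimensional embeddings, invented for this example only.
# The real vectors are described above as 256-dimensional.
TOY_EMBEDDINGS = {
    "a": [1.0, 0.0, 0.1],
    "\u0430": [0.99, 0.02, 0.1],  # Cyrillic small a
    "o": [0.0, 1.0, 0.0],
    "\u043e": [0.01, 0.99, 0.0],  # Cyrillic small o
    "0": [0.1, 0.9, 0.2],
    "t": [0.0, 0.1, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def nearest_glyph(char, embeddings):
    """Return the most similar other character, or char itself if unknown."""
    if char not in embeddings:
        return char
    target = embeddings[char]
    candidates = [
        (cosine(target, vec), other)
        for other, vec in embeddings.items()
        if other != char
    ]
    return max(candidates)[1] if candidates else char


def perturb_word(word, embeddings, rate=1.0, rng=None):
    """Replace each character with its nearest glyph with probability `rate`."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    return "".join(
        nearest_glyph(c, embeddings) if rng.random() < rate else c
        for c in word
    )


if __name__ == "__main__":
    for word in ["cat", "tao"]:
        print(word, "->", perturb_word(word, TOY_EMBEDDINGS))
```

Characters missing from the embedding table are passed through unchanged, so a limited character pool degrades gracefully instead of raising an error. A `rate` below 1.0 leaves some characters untouched, which is one plausible way to control perturbation strength.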