\# BVV241 Tokenizer Benchmarking \& Frozen Embedding Sets



A research resource for cross-model Unicode-centric tokenization, \*frozen embedding\* LMs, and experimentation on the emergence of semantics and modular model fusion.



\## About This Repository



This repo provides:



\- Scripts, notebooks, and raw files for construction and benchmarking of \*Unicode-based tokenizers\* with n-gram/Wikipedia statistical enrichment,

\- Precomputed, \*\*L2-normalized, frozen embedding matrices\*\* (for direct plug-in as nn.Embedding),

\- Tools for building hybrid vocabularies (Unicode + bigram/trigram extensions + SOTA token string intersection),

\- Live benchmarking and visualization pipelines (SOTA vs custom models, t-SNE, BLEU/MMLU/ARC),

\- \*\*Hugging Face Hub\*\* integration: all resources are published on the Hub.



---



\## 📊 Benchmarks \& Research Notebooks

\*\*\_tokenizer-benchmarking-t-sne.ipynb\*\*



— Visualizes token/embedding distribution via t-SNE, comparing BVV tokenizers with SOTA baselines.



\*\*\_models\_benchmarking.py, \_models\_benchmarking.plot.ipynb, \_models\_benchmarking.code.ipynb\*\*



— Scripts and notebooks to benchmark models (BLEU, MMLU, ARC) using these tokenizers \& embeddings versus SOTA tokenizers.



\*\*\_n-gramms-from-wiki.ipynb\*\*



— Extraction of frequent n-grams from Wikipedia to fill Unicode private ranges, enriching token coverage.



\*\*\_tokenizer-builder-\*.ipynb\*\*



— Complete construction logic for each tokenizer/embedding variant.
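
The n-gram extraction step can be illustrated with a minimal sketch. It assumes the n-grams are character-level and ranked by raw frequency; the function name `top_char_ngrams` and the sample strings are hypothetical and are not taken from `_n-gramms-from-wiki.ipynb`.

```python
from collections import Counter


def top_char_ngrams(texts, n, k):
    """Return the k most frequent character n-grams across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Slide a window of width n over each string; strings shorter than n add nothing.
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(k)]


# Toy corpus standing in for Wikipedia text (assumption for illustration only).
sample = ["the cat sat on the mat", "the hat"]
print(top_char_ngrams(sample, n=2, k=3))
```

On a real dump the same counting would likely be done in streaming fashion per article, with the top-ranked n-grams written to a text file so the vocabulary can be rebuilt reproducibly.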



\## 🗂️ File Structure

| File/Notebook | Purpose |
| --- | --- |
| \_tokenizer-benchmarking-t-sne.ipynb | t-SNE visualizations of token space and embedding overlap |
| \_n-gramms-from-wiki.ipynb | Extracting n-grams for vocab extension |
| \_n-gramms-2-3-4-5.txt (etc.) | Precomputed n-gram lists (for reproducible vocab) |
| \_n-gramms-intersection.txt | Common token strings across SOTA tokenizers |
| \_tokenizer-builder-\* | Jupyter code for building each tokenizer/embedding set |
| \_models\_benchmarking.\* | Benchmark scripts, plots, example use in LM evaluation |
| normalized\_embeddings\_weights.pt | Main embedding matrix for each tokenizer version |





\## Tokenizer and Embedding Variants



\### 1. \[bvv241-2-3]

\- \*\*Unicode Basic Multilingual Plane\*\* (code points 0–65535): each single code point is one token (monogram).

\- \*\*Private/unused ranges within that plane\*\*: reassigned to frequent Wikipedia bigrams/trigrams.

\- \*\*Vocabulary\*\*: 65,536 tokens; \*\*Embedding\*\*: 1024-dim, L2-normalized, frozen (\*\*no semantics\*\*).

\- Suitable for: \*Baseline Unicode LM research, non-semantic embedding experiments.\*
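
The layout described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's tokenizer: it assumes a token id equals the code point, that n-grams occupy the BMP Private Use Area (U+E000 to U+F8FF), and that encoding is greedy longest-match. The names `build_ngram_slots` and `encode` are hypothetical.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP Private Use Area: 6,400 code points


def build_ngram_slots(ngrams):
    """Assign each n-gram one private-use code point, in list order (assumed scheme)."""
    capacity = PUA_END - PUA_START + 1
    if len(ngrams) > capacity:
        raise ValueError("more n-grams than private-use slots")
    return {gram: PUA_START + i for i, gram in enumerate(ngrams)}


def encode(text, slots, max_n=3):
    """Greedy longest-match encoding; unmatched characters map to their own code point."""
    ids, i = [], 0
    while i < len(text):
        for n in range(max_n, 1, -1):  # try trigrams first, then bigrams
            gram = text[i:i + n]
            if len(gram) == n and gram in slots:
                ids.append(slots[gram])
                i += n
                break
        else:
            cp = ord(text[i])
            if cp > 0xFFFF:
                raise ValueError("code point outside the 65,536-token range")
            ids.append(cp)  # monogram: id is the code point itself
            i += 1
    return ids


slots = build_ngram_slots(["the", "th", "he"])
print(encode("the hat", slots))
```

Here `"the"` is emitted as a single private-use id, while the remaining characters fall back to their own code points.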



\### 2. \[bvv241-max]

\- \*\*Unicode monograms\*\* + bigrams/trigrams + \*intersection of token strings\* across SOTA tokenizers and models (o200k\_base, cl100k\_base, Mistral-Nemo, DeepSeek-R1, etc.).

\- \*\*Vocabulary\*\*: 131,072 tokens; \*\*Embedding\*\*: 1024-dim, frozen.

\- Suitable for: \*Unified tokenizer/embedding research; plug-and-play fusion across SOTA models.\*
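
The "intersection of token strings" idea can be sketched with plain set operations. The toy vocabularies below are invented for illustration; real vocabularies would be loaded from each tokenizer, and the helper name `common_token_strings` is hypothetical.

```python
def common_token_strings(vocabularies):
    """Return token strings present in every vocabulary, sorted for reproducibility."""
    vocab_sets = [set(v) for v in vocabularies]
    if not vocab_sets:
        return []
    return sorted(set.intersection(*vocab_sets))


# Toy vocabularies (assumption): stand-ins for real tokenizer vocabularies.
toy_vocabs = {
    "tokenizer_a": ["the", " the", "ing", "tion"],
    "tokenizer_b": ["the", "ing", " of"],
    "tokenizer_c": ["ing", "the", "ed"],
}
print(common_token_strings(toy_vocabs.values()))
```

Sorting the result keeps the id assignment stable across runs, which matters if the intersection is appended to a fixed vocabulary.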



\### 3. \[bvv241-nemo]

\- Vocabulary of the Mistral-Nemo model, paired with frozen \*surface-level\* (non-semantic) embeddings.

\- \*\*Vocabulary\*\*: 131,072; \*\*Embedding\*\*: 1024-dim, frozen.

\- Suitable for: \*Direct Mistral-Nemo token/embedding comparison.\*



\### 4. \[bvv241-abs]

\- As `bvv241-max`, but \*\*embedding size 4096\*\*.

\- Suitable for: \*Experiments on scaling embedding space.\*
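
To give a feel for what scaling the embedding size means in practice, the arithmetic below estimates the size of each embedding matrix. It assumes 32-bit float storage (4 bytes per value) and that `bvv241-abs` inherits the 131,072-token vocabulary of `bvv241-max`; neither is confirmed here, and the actual files may use a different dtype.

```python
def embedding_bytes(vocab_size, dim, bytes_per_value=4):
    """Size of a dense vocab_size x dim matrix; 4 bytes per value assumes float32."""
    return vocab_size * dim * bytes_per_value


# (vocabulary size, embedding dimension) per variant, as listed in this section.
variants = {
    "bvv241-2-3": (65_536, 1024),
    "bvv241-max": (131_072, 1024),
    "bvv241-nemo": (131_072, 1024),
    "bvv241-abs": (131_072, 4096),
}
for name, (vocab_size, dim) in variants.items():
    size = embedding_bytes(vocab_size, dim)
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the 4096-dim matrix is four times the size of the 1024-dim one at the same vocabulary size.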



\*\*All embedding matrices:\*\* L2-normalized, \*fixed/frozen\*, contain \*\*no semantic information\*\*.
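
As a small illustration of what "L2-normalized" means for these matrices, the sketch below scales each row to unit Euclidean length. The random values are placeholders only; they do not reproduce how the repository's builder notebooks derive the embeddings, and `l2_normalize` is a hypothetical helper.

```python
import math
import random


def l2_normalize(rows, eps=1e-12):
    """Scale every row to unit L2 norm; eps guards against all-zero rows."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / max(norm, eps) for x in row])
    return normalized


rng = random.Random(0)  # fixed seed so the toy matrix is reproducible
raw = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]  # 4 tokens, 8 dims
unit = l2_normalize(raw)
norms = [math.sqrt(sum(x * x for x in row)) for row in unit]
print(norms)
```

In PyTorch, a precomputed matrix of this kind would typically be wrapped with `torch.nn.Embedding.from_pretrained(weights, freeze=True)` so that the optimizer never updates it.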



\## ⚗️ Research Scope \& Scientific Context

\### Purpose

These resources enable:

\- Investigation of semantic emergence when transformers are trained on fixed, non-semantic ("surface-level") embeddings.

\- Plug-and-play modular/MoE experiments: plug in new "experts" or fuse LMs trained with different tokenizations, since the embeddings are structurally identical and fixed.

\- Exploration of Unicode-standardized, reproducible vocabularies for multilingual and cross-model pipelines.

\### Scientific novelty

These embeddings are never trained and encode no semantic information, which makes them suitable for studying how meaning arises solely in the transformer layers above the embedding layer.





