{
  "title": "The EOS Token Problem in SFT Training",
  "turns": [
    {
      "turn": 1,
      "user": "\\# Cell 1: Installation and Setup\n\n\\# Step 1: Install PyTorch and ML libraries\n\n\\# !pip install -q torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124\n\n!pip install -q --upgrade transformers accelerate datasets trl bitsandbytes scipy deepspeed wandb\n\n\\# Step 2: Verification and imports\n\nimport os\n\nimport json\n\nimport torch\n\nimport transformers\n\nimport trl\n\nimport logging\n\nimport warnings\n\nimport wandb\n\nfrom dataclasses import dataclass, field, asdict\n\nfrom typing import Optional\n\nfrom datetime import datetime\n\nfrom datasets import load\\_dataset, DatasetDict\n\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\nfrom huggingface\\_hub import HfApi, create\\_repo, upload\\_folder, login\n\nfrom dotenv import load\\_dotenv\n\nos.environ\\[\"CUDA\\_VISIBLE\\_DEVICES\"\\] = \"0\"\n\nload\\_dotenv()\n\n\\# Configure logging\n\nwarnings.filterwarnings(\"ignore\", category=FutureWarning)\n\nlogging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n\n\\# Print versions and GPU info\n\nprint(f\"✅ PyTorch: {torch.\\_\\_version\\_\\_}\")\n\nprint(f\"✅ Transformers: {transformers.\\_\\_version\\_\\_}\")\n\nprint(f\"✅ TRL: {trl.\\_\\_version\\_\\_}\")\n\nprint(f\"✅ CUDA: {torch.cuda.is\\_available()}\")\n\nif torch.cuda.is\\_available():\n\ngpu\\_count = torch.cuda.device\\_count()\n\nprint(f\"✅ GPUs detected: {gpu\\_count}\")\n\nfor i in range(gpu\\_count):\n\nprops = torch.cuda.get\\_device\\_properties(i)\n\nprint(f\" GPU {i}: {props.name} ({props.total\\_memory / 1024\\*\\*3:.1f}GB)\")\n\n\\# Step 3: Login to services\n\nwandb.login(key=os.getenv(\"WANDB\\_API\\_KEY\"), relogin=True)\n\nwb\\_api = wandb.Api()\n\nwb\\_user = wb\\_api.viewer\n\nprint(wb\\_user.username)\n\nlogin(token=os.getenv(\"HF\\_WRITE\\_TOKEN\"))\n\nhf\\_info = HfApi().whoami(token=os.getenv(\"HF\\_WRITE\\_TOKEN\"))\n\nprint(json.dumps({k: hf\\_info\\[k\\] 
for k in (\"name\", \"auth\") if k in hf\\_info}, indent=2))\n\nprint(\"🔄 Please restart kernel and continue with configuration\")\n\n\\# Cell 2: Training Configuration\n\n\\# Canonical IDs for both models\n\nBASE\\_MODEL\\_NAME = \"Qwen3-0.6B\"\n\nBASE\\_MODEL\\_REMOTE\\_PATH = \"Qwen/\" + BASE\\_MODEL\\_NAME\n\nBASE\\_MODEL\\_LOCAL\\_PATH = BASE\\_MODEL\\_NAME + \"-local\"\n\n\\# SFTD\\_MODEL\\_NAME = \"s1-slth-qwen3-0.6b\" # \"s1K1p1-Distill-Qwen3-0.6B\"\n\n\\# SFTD\\_MODEL\\_REMOTE\\_PATH = \"jaeh8nkim/\" + SFTD\\_MODEL\\_NAME\n\n\\# SFTD\\_MODEL\\_LOCAL\\_PATH = SFTD\\_MODEL\\_NAME + \"-local\"\n\nSFTD\\_MODEL\\_NAME = \"s1K4Q3p6Bs1p17BtUPFTstep1-Distill-Qwen3-0.6B\"\n\n\\# SFTD\\_MODEL\\_NAME = \"s1K4Q3p6Bs1p17BtUPFTstep2-Distill-Qwen3-0.6B\"\n\nSFTD\\_MODEL\\_REMOTE\\_PATH = \"jaeh8nkim/\" + SFTD\\_MODEL\\_NAME\n\nSFTD\\_MODEL\\_LOCAL\\_PATH = SFTD\\_MODEL\\_NAME + \"-local\"\n\n\\# DATASET\\_REMOTE\\_PATH = \"simplescaling/s1K-1.1\\_tokenized\"\n\nDATASET\\_REMOTE\\_PATH = \"jaeh8nkim/s1K4Q3p6Bs1p17BtUPFTstep1.db\"\n\n\\# DATASET\\_REMOTE\\_PATH = \"jaeh8nkim/s1K4Q3p6Bs1p17BtUPFTstep2.db\"\n\n@dataclass\n\nclass TrainingConfig:\n\n\"\"\"S1-faithful training configuration\"\"\"\n\nbase\\_model\\_name: str = field(default=BASE\\_MODEL\\_REMOTE\\_PATH) # load from here\n\noutput\\_model\\_name: str = field(default=SFTD\\_MODEL\\_NAME) # save to here\n\nblock\\_size: int = 8192\n\nwandb\\_project: Optional\\[str\\] = SFTD\\_MODEL\\_NAME\n\nwandb\\_entity: Optional\\[str\\] = \"jaeh8nkim\"\n\ntrain\\_file\\_path: Optional\\[str\\] = DATASET\\_REMOTE\\_PATH\n\ndagger: bool = False\n\ndef \\_\\_post\\_init\\_\\_(self):\n\nif self.wandb\\_project:\n\nos.environ\\[\"WANDB\\_PROJECT\"\\] = self.wandb\\_project\n\nif self.wandb\\_entity:\n\nos.environ\\[\"WANDB\\_ENTITY\"\\] = self.wandb\\_entity\n\n\\# Initialise\n\nconfig = TrainingConfig()\n\nprint(\"Training Configuration:\")\n\nprint(f\" Base model (loading from): {config.base\\_model\\_name}\")\n\nprint(f\" 
Output model (saving to): {config.output\\_model\\_name}\")\n\nprint(f\" Block size: {config.block\\_size}\")\n\nprint(f\" Dataset: {config.train\\_file\\_path}\")\n\n\\# Cell 3: Model and Dataset Loading (FIXED)\n\ndef load\\_model\\_and\\_tokenizer():\n\n\"\"\"Load Qwen3-0.6B model and tokenizer\"\"\"\n\nlogging.info(f\"Loading model: {config.base\\_model\\_name}\") # Changed this line\n\nmodel = AutoModelForCausalLM.from\\_pretrained(\n\nconfig.base\\_model\\_name, # Changed this line\n\ntorch\\_dtype=torch.bfloat16,\n\ntrust\\_remote\\_code=True\n\n)\n\ntokenizer = AutoTokenizer.from\\_pretrained(\n\nconfig.base\\_model\\_name, # Changed this line\n\nuse\\_fast=True,\n\ntrust\\_remote\\_code=True\n\n)\n\nif tokenizer.pad\\_token is None:\n\ntokenizer.pad\\_token = tokenizer.eos\\_token\n\n\\# Template definitions\n\ninstruction\\_template = \"\\<|im\\_start|\\>user\"\n\nresponse\\_template = \"\\<|im\\_start|\\>assistant\\\\n\"\n\ntotal = sum(p.numel() for p in model.parameters())\n\ntrain = sum(p.numel() for p in model.parameters() if p.requires\\_grad)\n\nprint(f\"✅ Model loaded: {total:,} total, {train:,} trainable parameters\")\n\nreturn model, tokenizer, instruction\\_template, response\\_template\n\ndef load\\_s1\\_dataset(qwen3\\_style=True):\n\n\"\"\"Load and preprocess S1 dataset\"\"\"\n\nlogging.info(f\"Loading dataset: {config.train\\_file\\_path}\")\n\ndataset = load\\_dataset(config.train\\_file\\_path)\n\nprint(f\"✅ Dataset loaded: {len(dataset\\['train'\\])} train samples\")\n\n\\# Filter to text column only\n\ntrain\\_dataset = dataset\\['train'\\].select\\_columns(\\['text'\\])\n\ntest\\_dataset = dataset\\['test'\\].select\\_columns(\\['text'\\]) if 'test' in dataset else train\\_dataset\n\nfiltered\\_dataset = DatasetDict(train=train\\_dataset, test=test\\_dataset)\n\n\\# Apply Qwen3 token style conversion\n\nif qwen3\\_style:\n\ndef swap\\_tokens(example):\n\ntxt = example\\['text'\\]\n\ntxt = txt.replace('\\<|im\\_start|\\>think', 
'\\<think\\>')\n\ntxt = txt.replace('\\<|im\\_start|\\>answer', '\\</think\\>')\n\nexample\\['text'\\] = txt\n\nreturn example\n\nfiltered\\_dataset = filtered\\_dataset.map(swap\\_tokens)\n\nprint(\"✅ Applied Qwen3 token style conversion\")\n\nreturn filtered\\_dataset\n\n\\# Load model and data\n\nmodel, tokenizer, instruction\\_template, response\\_template = load\\_model\\_and\\_tokenizer()\n\ndataset = load\\_s1\\_dataset(qwen3\\_style=False)\n\n\\# Cell 4: Training Configuration and Execution\n\ndef create\\_sft\\_config():\n\n\"\"\"Create SFTConfig with S1-faithful hyperparameters\"\"\"\n\n\\# Load the dataset to get its size\n\ncurrent\\_dataset = load\\_dataset(config.train\\_file\\_path)\n\ndataset\\_size = len(current\\_dataset\\['train'\\])\n\n\\# S1 hyperparameters\n\nlr = 1e-5\n\nepochs = 5\n\nweight\\_decay = 1e-4\n\nmicro\\_batch\\_size = 1\n\ngradient\\_accumulation\\_steps = 16\n\n\\# Calculate training steps using actual dataset size\n\neffective\\_batch\\_size = micro\\_batch\\_size \\* gradient\\_accumulation\\_steps\n\nsteps\\_per\\_epoch = dataset\\_size // effective\\_batch\\_size\n\ntotal\\_steps = steps\\_per\\_epoch \\* epochs\n\nwarmup\\_steps = int(total\\_steps \\* 0.05)\n\nprint(f\"🎯 Detected dataset size: {dataset\\_size}\")\n\nprint(f\"Training setup: {effective\\_batch\\_size} batch size, {total\\_steps} steps, {warmup\\_steps} warmup\")\n\nreturn trl.SFTConfig(\n\n\\# Model and data\n\nmax\\_seq\\_length=config.block\\_size,\n\ndataset\\_text\\_field='text',\n\n\\# Training schedule\n\nnum\\_train\\_epochs=epochs,\n\nper\\_device\\_train\\_batch\\_size=micro\\_batch\\_size,\n\nper\\_device\\_eval\\_batch\\_size=micro\\_batch\\_size,\n\ngradient\\_accumulation\\_steps=gradient\\_accumulation\\_steps,\n\n\\# Optimizer\n\nlearning\\_rate=lr,\n\nweight\\_decay=weight\\_decay,\n\nadam\\_beta1=0.9,\n\nadam\\_beta2=0.95,\n\nwarmup\\_ratio=0.05,\n\nlr\\_scheduler\\_type=\"cosine\",\n\n\\# Precision and 
memory\n\nbf16=True,\n\ngradient\\_checkpointing=True,\n\ndataloader\\_pin\\_memory=True,\n\ndataloader\\_num\\_workers=0,\n\n\\# Evaluation and logging\n\neval\\_strategy=\"no\",\n\nsave\\_strategy=\"no\",\n\nlogging\\_steps=1,\n\n\\# Output\n\noutput\\_dir=f\"./{SFTD\\_MODEL\\_NAME}\",\n\nlogging\\_dir=f\"./{SFTD\\_MODEL\\_NAME}/logs\",\n\nsave\\_only\\_model=True,\n\n\\# Data handling\n\nremove\\_unused\\_columns=True,\n\nreport\\_to=\\[\"wandb\"\\] if config.wandb\\_project else \\[\\],\n\nrun\\_name=SFTD\\_MODEL\\_NAME,\n\ndataloader\\_drop\\_last=True,\n\npacking=False,\n\n)\n\ndef train\\_model():\n\n\"\"\"Execute S1-faithful training\"\"\"\n\nsft\\_args = create\\_sft\\_config()\n\n\\# Create data collator\n\ncollator = trl.DataCollatorForCompletionOnlyLM(\n\ninstruction\\_template=instruction\\_template,\n\nresponse\\_template=response\\_template,\n\ntokenizer=tokenizer,\n\nmlm=False\n\n)\n\n\\# Create trainer\n\ntrainer = trl.SFTTrainer(\n\nmodel=model,\n\ntrain\\_dataset=dataset\\['train'\\],\n\neval\\_dataset=dataset\\['test'\\] if 'test' in dataset else dataset\\['train'\\],\n\nargs=sft\\_args,\n\ndata\\_collator=collator\n\n)\n\n\\# Monitor memory\n\nif torch.cuda.is\\_available():\n\nallocated = torch.cuda.memory\\_allocated(0) / 1024\\*\\*3\n\ntotal = torch.cuda.get\\_device\\_properties(0).total\\_memory / 1024\\*\\*3\n\nprint(f\"GPU memory before training: {allocated:.1f}GB / {total:.1f}GB\")\n\n\\# Train\n\nprint(\"🚀 Starting training...\")\n\ntrainer.train()\n\n\\# Save model\n\ntrainer.save\\_model(output\\_dir=sft\\_args.output\\_dir)\n\ntokenizer.save\\_pretrained(sft\\_args.output\\_dir)\n\nprint(f\"✅ Training completed! 
Model saved to: {sft\\_args.output\\_dir}\")\n\nreturn trainer, sft\\_args\n\n\\# Execute training\n\ntrainer, sft\\_args = train\\_model()\n\n\\# Cell 5: Model Upload and Download\n\ndef setup\\_sft\\_repository():\n\n\"\"\"Setup HuggingFace repository for SFT-ed model\"\"\"\n\nrepo\\_id = SFTD\\_MODEL\\_REMOTE\\_PATH\n\ntry:\n\ncreate\\_repo(repo\\_id=repo\\_id, repo\\_type=\"model\", private=False, exist\\_ok=True)\n\nprint(f\"✅ Repository ready: https://huggingface.co/{repo\\_id}\")\n\nreturn repo\\_id\n\nexcept Exception as e:\n\nprint(f\"⚠️ Repository setup issue: {e}\")\n\nreturn None\n\ndef upload\\_sft\\_model(repo\\_id):\n\n\"\"\"Upload SFT-ed model\"\"\"\n\nif repo\\_id:\n\ntry:\n\nprint(f\"📤 Uploading SFT-ed model to {repo\\_id}...\")\n\nupload\\_folder(\n\nfolder\\_path=sft\\_args.output\\_dir,\n\nrepo\\_id=repo\\_id,\n\nrepo\\_type=\"model\",\n\ncommit\\_message=\"Upload S1-faithful fine-tuned Qwen3-0.6B\"\n\n)\n\nprint(f\"✅ SFT-ed model uploaded successfully!\")\n\nexcept Exception as e:\n\nprint(f\"❌ Upload failed: {e}\")\n\ndef download\\_model(repo\\_name, local\\_folder\\_name, description):\n\n\"\"\"Download a single model and save locally\"\"\"\n\nprint(f\"📥 Downloading {description} from {repo\\_name}...\")\n\ntry:\n\nmodel = AutoModelForCausalLM.from\\_pretrained(\n\nrepo\\_name,\n\ntorch\\_dtype=torch.bfloat16,\n\ntrust\\_remote\\_code=True\n\n)\n\ntokenizer = AutoTokenizer.from\\_pretrained(\n\nrepo\\_name,\n\ntrust\\_remote\\_code=True\n\n)\n\n\\# Save locally\n\nlocal\\_path = f\"./{local\\_folder\\_name}\"\n\nmodel.save\\_pretrained(local\\_path)\n\ntokenizer.save\\_pretrained(local\\_path)\n\nprint(f\"✅ {description} ready at {local\\_path}\")\n\nreturn {\n\n\"model\": model,\n\n\"tokenizer\": tokenizer,\n\n\"path\": local\\_path\n\n}\n\nexcept Exception as e:\n\nprint(f\"❌ Download failed for {description}: {e}\")\n\nreturn None\n\n\\# Execute model management\n\nprint(\"🚀 Step 1: Setting up repository and uploading SFT-ed 
model...\")\n\nsft\\_repo\\_id = setup\\_sft\\_repository()\n\nupload\\_sft\\_model(sft\\_repo\\_id)\n\nprint(\"\\\\n🚀 Step 2: Downloading both models for local inference...\")\n\n\\# Download base and SFT models\n\nbase\\_model = download\\_model(\n\nrepo\\_name=BASE\\_MODEL\\_REMOTE\\_PATH,\n\nlocal\\_folder\\_name=BASE\\_MODEL\\_LOCAL\\_PATH,\n\ndescription=\"Official base Qwen3-0.6B\",\n\n)\n\nsftd\\_model = download\\_model(\n\nrepo\\_name=SFTD\\_MODEL\\_REMOTE\\_PATH,\n\nlocal\\_folder\\_name=SFTD\\_MODEL\\_LOCAL\\_PATH,\n\ndescription=\"SFT-ed model from HuggingFace\",\n\n)\n\n\\---\n\nthis is the content of my sft.ipynb file. the dataset used here, jaeh8nkim/s1K4Q3p6Bs1p17BtUPFTstep1.db, contains sft data (reasoning traces) that end abruptly, without a proper \\<|im\\_end|\\> special token, following the upft (unsupervised prefix fine-tuning) paper. but the training loss shows sudden peaks and the trained model doesn't perform as well as expected, so i'm wondering whether the sft.ipynb code here automatically appends an eos token or some other special token, so that the model treats the abrupt end of a training example as the proper end of the sequence instead of the prefix of a longer sequence. verify this suspicion.",
      "assistant": "The handling of EOS (End-Of-Sequence) and padding tokens is a common source of problems when fine-tuning on chat-formatted data, so this is worth checking carefully.\n\nLet's break down your suspicion and analyze your code:\n\n**Your Suspicion: \"the sft.ipynb code here automatically appends an eos token or some other special token, so that the model treats the abrupt end of a training example as the proper end of the sequence instead of the prefix of a longer sequence.\"**\n\nNothing in the code you posted explicitly appends an EOS token to the `text` field. Whether one is added implicitly depends on the tokenization step inside `SFTTrainer`, which can differ between TRL versions, so it has to be checked on the version you have installed. What the script does control is how `trl.DataCollatorForCompletionOnlyLM` and the tokenizer's `pad_token` interact, and that interaction determines whether the model is ever trained to emit EOS.\n\nHere's why:\n\n1.  **`tokenizer.pad_token = tokenizer.eos_token`:** In `load_model_and_tokenizer()`, you have this critical line:\n    ```\n    if tokenizer.pad_token is None:\n        tokenizer.pad_token = tokenizer.eos_token\n    ```\n    This is a common practice when a tokenizer doesn't have a pre-defined `pad_token`. It tells the tokenizer to use the `eos_token` (End-Of-Sequence token) for padding. The assignment only runs if the tokenizer ships without a pad token, so print `tokenizer.pad_token` and `tokenizer.eos_token` to see which case applies to your model.\n2.  **`trl.DataCollatorForCompletionOnlyLM` and Padding:** The `DataCollatorForCompletionOnlyLM` is designed to compute loss only on the \"completion\" part of your input (i.e., everything that follows the `response_template`). However, it still processes the entire sequence, and to handle variable-length inputs in a batch, it performs padding.\n    When `tokenizer.pad_token` is set to `tokenizer.eos_token`, any padding tokens added by the collator _will be the EOS token_.\n3.  **Loss Masking and EOS Tokens:** This is the part that matters most for your question. `trl.DataCollatorForCompletionOnlyLM` inherits from `transformers.DataCollatorForLanguageModeling`, which with `mlm=False` sets the label of every position whose token ID equals `pad_token_id` to `-100`, so **loss is masked for padding tokens**. 
This means that the model does _not_ learn to predict the `pad_token` (which is also the `eos_token` when the fallback above applies) wherever it appears, whether as padding or as a genuine end-of-sequence marker in the text.\n    If your training data ends abruptly without an explicit `eos_token`, and the collator then pads these sequences with the `eos_token`, the padded positions contribute nothing to the loss, so the model is never trained to emit an `eos_token` at the end of a sequence. Note also that with `per_device_train_batch_size=1`, as in your config, each batch holds a single sequence, so there is nothing to pad in the first place.\n    The risk in this setup is therefore the \"model doesn't know when to stop generating\" or \"generates too much\" problem during inference, as it hasn't been explicitly trained to output an `eos_token` to signal the end of its response.\n\n**Consequences of this setup with abruptly ending data:**\n\n*   **Training Loss Peaks:** The sudden peaks in training loss could be related to what the model is asked to predict at the abrupt end of each sequence, but the code alone does not establish this. Inspecting the tokens and labels of the batches logged at the peaks is the way to confirm it.\n*   **Poor Performance:** If the model never sees a properly terminated sequence, its generation quality can suffer. It might ramble or keep generating tokens until the length limit because it hasn't been taught the \"stop\" signal. Since your reasoning traces are truncated prefixes by design, none of them demonstrates how a response should end.\n\n**How to verify/address this:**\n\n1.  
**Inspect Tokenized Data with Labels:** A robust way to verify is to grab a batch of data after it goes through the `DataCollatorForCompletionOnlyLM` and inspect the `input_ids`, `attention_mask`, and `labels`.\n    *   Look for sequences where your original `text` abruptly ends.\n    *   Check whether `tokenizer.eos_token_id` appears at the end of these sequences, whether from padding or from the tokenization step.\n    *   Crucially, check the `labels` array at the positions where the `eos_token_id` appears. If it's `-100`, then the loss for that token is ignored; if it carries the token ID, the model is being trained to stop there.\n2.  **Explicitly Add `eos_token` to your Dataset (Recommended for SFT):** For SFT, especially with instruction tuning or reasoning traces, it's generally best practice for _each training example_ to explicitly end with an `eos_token` if it represents a complete response. This teaches the model when to stop generating. Your UPFT prefixes are intentionally incomplete, so appending an `eos_token` to them would teach the model to stop at the truncation point; reserve this step for examples that contain a complete response.\n    For such data, your `load_s1_dataset` function can be modified to ensure an `eos_token` is appended to each `text` entry:\n    ```\n    def load_s1_dataset(qwen3_style=True):\n        \"\"\"Load and preprocess S1 dataset\"\"\"\n        logging.info(f\"Loading dataset: {config.train_file_path}\")\n        dataset = load_dataset(config.train_file_path)\n        print(f\"✅ Dataset loaded: {len(dataset['train'])} train samples\")"
    }
  ]
}