# Research Plan: Does Instruction Tuning Reduce Diversity? A Case Study Using Code Generation

## Problem

We aim to investigate how different instruction tuning techniques impact the diversity of content generated by Large Language Models (LLMs). Preliminary evidence suggests that preference-tuned language models may struggle to generate diverse content, which has important implications for model alignment strategies. However, research on this question has been limited by the difficulty of measuring diversity, particularly semantic diversity, which typically requires costly human evaluation.

Our core research question is whether instruction tuning methods such as RLHF, DPO, and supervised fine-tuning reduce the diversity of LLM outputs. We hypothesize that preference-tuned models exhibit "mode collapse" and generate less diverse content than their base models. We also expect that existing diversity metrics do not adequately capture semantic diversity, particularly in specialized domains like code generation.

Our motivation stems from the critical importance of diversity for many real-world applications, including reinforcement learning exploration, creative tasks, brainstorming, and handling ambiguous queries. Understanding how instruction tuning affects diversity will inform decisions about which alignment algorithms to use and help develop better methods for preserving diversity while improving model performance.

## Method

We propose to leverage code generation as a domain for studying semantic diversity, since programs have well-defined, executable semantics that can be automatically evaluated. This approach allows us to distinguish between three forms of diversity: lexical (token sequences), syntactic (Abstract Syntax Tree structure), and semantic (program execution outputs).
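
To make the three levels concrete, the sketch below compares four toy programs. It is an illustration only: the helper names `canonical_ast` and `behaviour`, the renaming scheme, and the toy inputs are assumptions for this example, not the canonicalization or test suite the study will use.

```python
import ast


def canonical_ast(src: str) -> str:
    """Dump the AST with variable and argument names replaced by v0, v1, ...

    Hypothetical canonicalization: names are numbered in traversal order,
    so programs that differ only in identifier choice get the same dump.
    Note that builtins referenced by name are renamed as well.
    """
    tree = ast.parse(src)
    names = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = names.setdefault(node.id, f"v{len(names)}")
        elif isinstance(node, ast.arg):
            node.arg = names.setdefault(node.arg, f"v{len(names)}")
    return ast.dump(tree)


def behaviour(src: str, inputs) -> tuple:
    """Run the function `f` defined in `src` on each input (toy, unsandboxed)."""
    namespace = {}
    exec(src, namespace)
    return tuple(namespace["f"](x) for x in inputs)


p1 = "def f(x):\n    return x * 2\n"
p2 = "def f(a):\n    return a * 2\n"   # renamed variable only
p3 = "def f(a):\n    return a + a\n"   # different syntax, same behaviour
p4 = "def f(a):\n    return a * a\n"   # different behaviour
inputs = [0, 1, 2, 3]

lexically_distinct = p1 != p2
syntactically_same = canonical_ast(p1) == canonical_ast(p2)
syntactically_distinct = canonical_ast(p2) != canonical_ast(p3)
semantically_same = behaviour(p1, inputs) == behaviour(p3, inputs)
semantically_distinct = behaviour(p3, inputs) != behaviour(p4, inputs)
```

Here `p1` and `p2` differ only lexically, `p2` and `p3` differ syntactically but not semantically, and only `p4` is semantically distinct from the rest.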

Our methodology involves creating a dataset of open-ended program synthesis tasks by adapting competitive programming problems into abstracted, canonicalized formats. We will remove direct references to specific programming tasks and standardize function signatures to ensure open-endedness while maintaining executability.

To measure diversity, we will implement several metrics:
- **Semantic diversity**: Two programs are considered semantically distinct if their outputs differ on at least one test case
- **Lexical diversity**: Using Expectation-Adjusted Distinct n-grams (EAD) on tokenized programs
- **Syntactic diversity**: Adapting Distinct-N metrics to canonicalized Abstract Syntax Trees
- **Neural diversity**: Using existing code evaluation models like CodeBERTScore and ICE-Score
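
As a rough sketch of the lexical metric, the code below computes plain Distinct-N and an expectation-adjusted variant. It assumes the EAD form N / (V · (1 − ((V − 1) / V)^C)), where N is the number of distinct n-grams, C the total number of n-grams, and V the vocabulary size. The regex tokenizer and the choice of passing V in as a parameter are assumptions for illustration, not the study's settings.

```python
import re


def simple_tokenize(src: str) -> list:
    """Hypothetical tokenizer: identifiers/numbers or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", src)


def ngrams(tokens, n: int) -> list:
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(samples, n: int) -> float:
    """Distinct n-grams divided by total n-grams, pooled over all samples."""
    pooled = [g for tokens in samples for g in ngrams(tokens, n)]
    return len(set(pooled)) / len(pooled) if pooled else 0.0


def ead(samples, n: int, vocab_size: int) -> float:
    """Distinct n-grams divided by the number expected under uniform sampling.

    The denominator V * (1 - ((V - 1) / V) ** C) is the expected number of
    distinct items after C uniform draws from V items, which removes the
    length penalty that plain Distinct-N applies to longer outputs.
    """
    pooled = [g for tokens in samples for g in ngrams(tokens, n)]
    total = len(pooled)
    if total == 0:
        return 0.0
    expected = vocab_size * (1 - ((vocab_size - 1) / vocab_size) ** total)
    return len(set(pooled)) / expected


samples = [simple_tokenize("return a + b"), simple_tokenize("return a * b")]
score = ead(samples, n=1, vocab_size=50)
```

The same functions accept any token sequence, so in principle they could also be applied to a linearized, canonicalized AST for the syntactic metric; how the tree is linearized is left open here.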

We will adopt a pairwise diversity metric that computes the average distinctness over all program pairs to ensure invariance to sample size, addressing potential confounds in diversity measurement.
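
A minimal sketch of the pairwise metric, applied to semantic diversity: each program is represented by the tuple of its outputs over the test cases, and the score is the fraction of program pairs whose tuples differ. The function name and the output-tuple representation are assumptions for illustration; how crashing or timing-out programs are scored is left open.

```python
from itertools import combinations


def pairwise_diversity(signatures) -> float:
    """Fraction of unordered pairs whose signatures differ.

    `signatures` holds one hashable-or-comparable item per program, e.g. a
    tuple of outputs across all test cases. The score is an average over
    pairs, so it estimates the probability that two samples are distinct
    rather than counting distinct items, which grows with sample size.
    """
    pairs = list(combinations(signatures, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


# Three programs evaluated on two test cases; the first two behave alike.
outputs = [(1, 2), (1, 2), (3, 4)]
score = pairwise_diversity(outputs)  # 2 of the 3 pairs differ
```

Swapping the inequality check for a lexical or syntactic distance would give the corresponding pairwise score for the other metrics.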

## Experiment Design

We will conduct experiments across multiple model families and instruction tuning approaches:

**Models**: We plan to evaluate LLAMA-3, LLAMA-3.1, CodeLLAMA, and Qwen-Coder models, which represent different instruction tuning strategies (PPO+DPO+Rejection Sampling, DPO+Rejection Sampling, SFT only). We will also create additional SFT checkpoints using MagiCoder's OSS-Instruct dataset. Commercial models (OpenAI GPT variants, Anthropic Claude) will provide additional comparison points.

**Dataset**: We will create 21 open-ended programming problems by abstracting competitive programming tasks from CodeNet, ensuring each has sufficient test cases for reliable execution-based evaluation.

**Sampling Strategy**: For each problem, we will generate K=100 programs per model and prompt template, yielding 2,100 programs (21 problems × 100 samples) per model×prompt configuration. We will sample at temperature=1.0 without nucleus sampling for consistency.
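
The sampling budget can be written down as a small configuration object. This is a sketch under assumptions: the class and field names are hypothetical, `top_p = 1.0` is used as the usual way to express "no nucleus truncation", and the per-model total assumes every model is run with all three prompt templates described in this plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPlan:
    """Hypothetical container for the sampling settings described above."""
    num_problems: int = 21
    samples_per_problem: int = 100  # K
    temperature: float = 1.0
    top_p: float = 1.0  # assumed convention: 1.0 means no nucleus truncation
    prompt_templates: tuple = ("zero-shot", "two-shot", "two-shot-cot")

    def programs_per_configuration(self) -> int:
        """Programs generated for one model with one prompt template."""
        return self.num_problems * self.samples_per_problem

    def programs_per_model(self) -> int:
        """Programs generated for one model across all prompt templates."""
        return self.programs_per_configuration() * len(self.prompt_templates)


plan = SamplingPlan()
per_configuration = plan.programs_per_configuration()
per_model = plan.programs_per_model()
```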

**Prompting Variations**: We will test three prompt templates (zero-shot, two-shot, two-shot with chain-of-thought) to account for different model capabilities and ensure fair comparison between base and instruction-tuned models.

**Experimental Comparisons**: We will conduct paired statistical tests using Wilcoxon Signed-Rank Tests and Cohen's d effect sizes to compare:
1. Base models vs. instruction-tuned counterparts
2. Small vs. large models within families
3. Zero-shot vs. few-shot prompting effects
4. Different instruction tuning methods (SFT vs. preference tuning)
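
The paired comparisons above could be computed along the following lines. This is a sketch, not the study's analysis code: it returns only the signed-rank sums (no p-value), and it uses the pooled-standard-deviation form of Cohen's d for equal-sized groups, which is one of several variants and an assumption here. The function names are hypothetical.

```python
import math
from statistics import mean, variance


def signed_rank_sums(x, y) -> tuple:
    """Wilcoxon signed-rank sums (W+, W-) for paired samples x and y.

    Zero differences are dropped and tied absolute differences receive
    the average of the ranks they span.
    """
    diffs = sorted((a - b for a, b in zip(x, y) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus


def cohens_d(x, y) -> float:
    """Cohen's d with a pooled standard deviation (equal group sizes assumed)."""
    pooled_sd = math.sqrt((variance(x) + variance(y)) / 2)
    return (mean(x) - mean(y)) / pooled_sd


# Toy per-problem diversity scores for a base model and its tuned counterpart.
base = [0.9, 0.8, 0.7, 0.6, 0.5]
tuned = [0.5, 0.5, 0.5, 0.5, 0.5]
w_plus, w_minus = signed_rank_sums(base, tuned)
```

In practice a library routine such as `scipy.stats.wilcoxon` would supply the test statistic together with a p-value; the hand-written version is shown only to make the ranking step explicit.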

**Evaluation Protocol**: We will extract the relevant functions and imports from each generation, execute the resulting programs in isolated Docker containers for safety, and compute diversity scores from their outputs across all test cases. Statistical analysis will use correlation coefficients to assess relationships between different diversity metrics and paired tests to evaluate the effects of various factors on diversity.
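
The extraction step might look like the sketch below, which keeps only top-level imports and function definitions and discards other top-level statements such as prints or driver code. It assumes the generation is already valid Python source and that Python 3.9+ is available for `ast.unparse`; the function name is hypothetical, and the Docker sandboxing itself is not shown.

```python
import ast


def extract_functions_and_imports(source: str) -> str:
    """Return source containing only top-level imports and function definitions."""
    tree = ast.parse(source)
    kept = [
        node for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom, ast.FunctionDef))
    ]
    return "\n".join(ast.unparse(node) for node in kept)


generated = (
    "import math\n"
    "print('debug output')\n"
    "def f(x):\n"
    "    return math.sqrt(x)\n"
    "f(4)\n"
)
cleaned = extract_functions_and_imports(generated)
```

Generations that fail to parse raise `SyntaxError` in this sketch; how such cases count toward diversity is a design decision left to the protocol.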

The experimental design will allow us to systematically investigate how instruction tuning, model size, and prompting strategies affect different types of diversity while providing a rigorous, scalable methodology for semantic diversity evaluation.