# SMILE: Deep Submodular Function-Based Instruction and In-context Learning Example Selection

Prompt optimization is a key way to steer large language models when fine-tuning is impractical. However, instruction optimization (IO) and in-context learning (ICL) demonstration selection are often tuned separately and combined post hoc, implicitly assuming that a "best" instruction and a "best" demo set compose well. In practice, the two interact strongly, which makes such decoupled pipelines brittle.

We propose SMILE, an efficient method that *jointly* selects instructions and demonstrations. Our key observation is that ICL utility exhibits consistent diminishing returns across diverse instructions. Leveraging this structure, SMILE learns an instruction-conditioned surrogate aligned with LLM feedback and instantiates it as an Extended Deep Submodular Function that captures sample–sample coverage, sample–query relevance, and sample–instruction compatibility. SMILE then performs greedy, query-adaptive selection of the instruction–demo pair.

Experiments on six datasets and multiple LLM backbones show that SMILE consistently outperforms IO-only, ICL-only, and existing joint baselines, supporting a context-engineering view of prompting: jointly optimizing interacting components rather than tuning them in isolation.

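
To make the greedy step concrete, here is a minimal sketch of greedy selection under a function with diminishing returns. It is not the learned Extended Deep Submodular Function from this repository: `facility_location` is a standard coverage objective used as a stand-in, and the names `facility_location`, `greedy_select`, and the toy similarity matrix are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Set


def facility_location(sim: Sequence[Sequence[float]], selected: Set[int]) -> float:
    """Stand-in coverage score: each row is credited with its best
    similarity to any selected candidate (a classic submodular function)."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(n: int, k: int, score: Callable[[Set[int]], float]) -> List[int]:
    """Pick up to k of n candidates, each time adding the one with the
    largest marginal gain in `score`."""
    chosen: List[int] = []
    current = score(set())
    for _ in range(min(k, n)):
        best_gain, best_idx = -math.inf, -1
        for i in range(n):
            if i in chosen:
                continue
            gain = score(set(chosen) | {i}) - current
            if gain > best_gain:  # ties keep the lowest index
                best_gain, best_idx = gain, i
        chosen.append(best_idx)
        current += best_gain
    return chosen


# Toy example: candidates 0 and 1 are near-duplicates, candidate 2 is distinct.
toy_sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
picked = greedy_select(3, 2, lambda s: facility_location(toy_sim, s))
print(picked)  # [0, 2]
```

In the toy example the second pick skips the near-duplicate candidate because its marginal gain is small, which is the diminishing-returns behavior the selection relies on.
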
## Directory Structure

The code expects the following directory structure:

```
SMILE_code/
├── data/
│   ├── gsm8k/
│   │   ├── train.json
│   │   ├── val.json
│   │   ├── test.json
│   │   ├── gsm8k_instrs.json
│   │   ├── centroids_k10.npz
│   │   ├── instructions_with_proto.jsonl
│   │   ├── test_rouge_qe.jsonl
│   │   └── train_ifgain.json
│   ├── gpqa/
│   ├── fp/
│   ├── xsum/
│   ├── date/
│   └── salient/
├── model/
│   ├── gsm8k/
│   ├── gpqa/
│   └── ...
├── original_data/
├── results/
└── smile_results/
```
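
Before training, it can help to confirm that a task directory matches this layout. The snippet below is a minimal sketch, not part of the repository: `missing_files` and `COMMON_FILES` are hypothetical names, the file list is copied from the `gsm8k` entry in the tree, and it assumes every task names its instruction file `<task>_instrs.json`, which the tree only confirms for `gsm8k`.

```python
from pathlib import Path
from typing import List

# File names copied from the gsm8k entry in the directory tree.
COMMON_FILES = [
    "train.json",
    "val.json",
    "test.json",
    "centroids_k10.npz",
    "instructions_with_proto.jsonl",
    "test_rouge_qe.jsonl",
    "train_ifgain.json",
]


def missing_files(root: Path, task: str) -> List[str]:
    """Return the expected files that are absent from <root>/data/<task>/."""
    task_dir = Path(root) / "data" / task
    # Assumption: the instruction file follows the "<task>_instrs.json" pattern.
    expected = COMMON_FILES + [f"{task}_instrs.json"]
    return [name for name in expected if not (task_dir / name).is_file()]


if __name__ == "__main__":
    for name in missing_files(Path("SMILE_code"), "gsm8k"):
        print(f"missing: data/gsm8k/{name}")
```
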

## Configuration

### Environment Variables

You can configure model paths using environment variables. Copy `config_example.env` to `.env` and update the paths:

```bash
cp config_example.env .env
```

Available environment variables:
- `MODEL_CACHE_DIR`: Directory for caching downloaded models (default: `./model_cache`)
- `QWEN3_4B_PATH`: Path to Qwen3-4B-Instruct model (default: HuggingFace model ID)
- `LLAMA3_1_8B_PATH`: Path to Llama-3.1-8B-Instruct model (default: HuggingFace model ID)
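
As an illustration of how such variables with defaults can be resolved, here is a minimal sketch. It assumes the variables are already present in the process environment (how the repository loads `.env` is not specified here); `resolve_setting` is a hypothetical helper, and the model ID string is a placeholder rather than the actual default.

```python
import os


def resolve_setting(env_var: str, default: str) -> str:
    """Return the environment value if it is set and non-empty, else the default."""
    value = os.environ.get(env_var, "").strip()
    return value if value else default


# Default taken from the list above.
cache_dir = resolve_setting("MODEL_CACHE_DIR", "./model_cache")

# Placeholder default: substitute the HuggingFace model ID the code actually uses.
qwen_path = resolve_setting("QWEN3_4B_PATH", "<qwen3-4b-huggingface-model-id>")

print(cache_dir, qwen_path)
```
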

### API Keys

If using OpenAI or Gemini APIs, update the following in `llm_client.py`:
- `OPENAI_API_KEY`: Your OpenAI API key
- `GEMINI_API_KEY`: Your Google Gemini API key

## Data Preparation

1. Place your original datasets in the `original_data/` directory
2. Run `get_data.py` to process and split the datasets
3. Run preprocessing scripts to generate required features:
   - `collect_traindata.py`: Generate training data
   - `ifgain.py`: Compute information gain
   - `precompute_caches.py`: Precompute centroids, prototypes, and ROUGE scores

## Model Training

Train the SMILE model:

```bash
python smile.py --tasks gsm8k gpqa fp xsum date salient --model qwen3-4b
```

## Inference

Run inference with trained models:

```bash
python smile_infer.py --tasks gsm8k --model qwen3-4b --K 10
```

To test with GPT 5.2 or Gemini 2.5 Flash, use `smile_blackbox.py` in place of `smile_infer.py`.

## Baselines

Run baseline methods:

```bash
python icl_baselines.py
```



