# TSAR

A research framework for efficient large language model (LLM) inference with mixed-precision decoding and reasoning capabilities.

## Overview

This project implements and evaluates techniques for efficient LLM inference, focusing on:

- **Progressive Mixed-Precision Decoding (PMPD)**: Adaptive precision allocation during inference
- **Chain-of-Thought (CoT) reasoning**: Enhanced reasoning capabilities with token-level precision control
- **Reward-based evaluation**: Multiple reward models for assessing reasoning quality
- **Quantized model support**: Efficient inference with quantized models

## Key Features

### 🚀 Progressive Mixed-Precision Decoding
- **Phase-aware precision allocation**: High precision for prefill, reduced precision for decoding
- **Progressive precision reduction**: Strategic precision reduction as tokens are generated
- **Flexible scheduling**: Static and learned schedulers for different use cases
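
The progressive reduction described above can be illustrated with a small static schedule. This is a minimal sketch, not the scheduler shipped in `tools/PMPD/`: the function name `static_bit_schedule` and the `steps_per_level` parameter are hypothetical, and it assumes that each precision level is held for a fixed number of decoding steps before dropping to the next one.

```python
from typing import List


def static_bit_schedule(num_steps: int, bits: List[int], steps_per_level: int) -> List[int]:
    """Return the bit-width to use at each decoding step.

    Assumption (not taken from the repository): every level in `bits` is held
    for `steps_per_level` steps, and the last level is kept until generation ends.
    """
    if not bits:
        raise ValueError("bits must contain at least one precision level")
    if steps_per_level <= 0:
        raise ValueError("steps_per_level must be positive")
    # Progressive reduction: precision may stay flat or drop, never rise.
    if any(a < b for a, b in zip(bits, bits[1:])):
        raise ValueError("bits must be non-increasing")

    schedule = []
    for step in range(num_steps):
        # Clamp to the last level once every earlier level has been used up.
        level = min(step // steps_per_level, len(bits) - 1)
        schedule.append(bits[level])
    return schedule
```

For example, `static_bit_schedule(7, [7, 6, 5], 2)` holds 7 bits for two steps, 6 bits for two steps, and 5 bits for the remaining three.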

### 🧠 Enhanced Reasoning
- **Chain-of-thought support**: Structured reasoning with thinking chains
- **Multiple datasets**: GSM8K, MATH, MATH-500, AIME, AMC-23
- **Reward evaluation**: PRM and Skywork reward models for quality assessment

### ⚡ Efficiency Optimizations
- **Quantized models**: Support for models quantized to 2 to 8 bits
- **Memory optimization**: Efficient GPU memory usage
- **Batch processing**: Optimized for large-scale evaluation

## Project Structure

```
efficient-reasoning/
├── env/                    # Core environment and models
│   ├── er_model.py        # Main evaluation model
│   ├── dataset/           # Dataset implementations
│   └── reward/            # Reward model implementations
├── src/code/              # Evaluation scripts and experiments
│   ├── baseline/          # Baseline implementations
│   ├── cot_split/         # Chain-of-thought splitting
│   ├── descent/           # Descent-based methods
│   ├── naive/             # Naive implementations
│   └── evaluate.py        # Main evaluation script
├── tools/                 # External tools and dependencies
│   ├── PMPD/             # Progressive Mixed-Precision Decoding
│   ├── any-precision-llm/ # Any-precision LLM implementation
│   └── xVerify/          # Verification tools
└── conda_activate.sh     # Environment setup script
```


## Supported Models

- **Qwen Models**: qwen7b, qwen38 (quantized versions)
- **Precision Levels**: 2- to 8-bit quantization
- **Reward Models**: PRM, Skywork

## Supported Datasets

- **GSM8K**: Grade school math problems
- **MATH**: Competition-level mathematics problems
- **MATH-500**: A 500-problem subset of the MATH test set
- **AIME**: American Invitational Mathematics Examination
- **AMC-23**: American Mathematics Competitions (2023 problems)

## Configuration Options

### Model Parameters
- `--model`: Model name (qwen7b, qwen38)
- `--prefill_bit`: Precision for prefill phase (default: 8)
- `--naive_bit`: Precision sequence for decoding (e.g., "7,6,5")
- `--high_bit_steps`: Number of high-precision steps (default: 512)
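
As an illustration of the `--naive_bit` format, the sketch below turns a string such as `"7,6,5"` into a list of bit-widths. The helper name `parse_bit_sequence` is hypothetical, and the range check is an assumption based on the 2 to 8 bit precision levels listed in this README; the evaluation script may parse or validate the value differently.

```python
from typing import List

# Precision range listed under "Supported Models" (assumed to apply here).
MIN_BIT, MAX_BIT = 2, 8


def parse_bit_sequence(spec: str) -> List[int]:
    """Parse a comma-separated precision sequence, e.g. "7,6,5" -> [7, 6, 5]."""
    # Skip empty fragments so that stray commas do not break parsing.
    bits = [int(part) for part in spec.split(",") if part.strip()]
    if not bits:
        raise ValueError("expected at least one bit-width")
    for bit in bits:
        if not MIN_BIT <= bit <= MAX_BIT:
            raise ValueError(f"bit-width {bit} is outside {MIN_BIT}-{MAX_BIT}")
    return bits
```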

### Evaluation Parameters
- `--dataset`: Dataset name (gsm8k, math, math500, aime, amc23)
- `--reward_model`: Reward model (prm, skywork)
- `--max_steps`: Maximum generation steps (default: 2048)
- `--temperature`: Sampling temperature (default: 0.6)

### Advanced Options
- `--scheduler`: Scheduler type (part_split, naive)
- `--xverify`: Enable xVerify evaluation
- `--device`: CUDA device (default: cuda:1)
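
The options above could be declared with `argparse` roughly as follows. This is a sketch assembled from the flags and defaults documented in this README, not a copy of the parser in `src/code/evaluate.py`: the function name `build_parser`, the argument types, the `choices` lists, and treating `--xverify` as a boolean switch are all assumptions. Options with no documented default are left as `None`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Mixed-precision reasoning evaluation")

    # Model parameters
    parser.add_argument("--model", choices=["qwen7b", "qwen38"])
    parser.add_argument("--prefill_bit", type=int, default=8)
    parser.add_argument("--naive_bit", type=str, help='e.g. "7,6,5"')
    parser.add_argument("--high_bit_steps", type=int, default=512)

    # Evaluation parameters
    parser.add_argument("--dataset", choices=["gsm8k", "math", "math500", "aime", "amc23"])
    parser.add_argument("--reward_model", choices=["prm", "skywork"])
    parser.add_argument("--max_steps", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.6)

    # Advanced options
    parser.add_argument("--scheduler", choices=["part_split", "naive"])
    parser.add_argument("--xverify", action="store_true")  # assumed to be a switch
    parser.add_argument("--device", type=str, default="cuda:1")
    return parser
```

Calling `build_parser().parse_args(["--model", "qwen7b", "--dataset", "gsm8k"])` yields a namespace in which the unspecified options take the defaults listed above.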

## Research Areas

### 1. Progressive Mixed-Precision Decoding
- Phase-aware precision allocation
- Progressive precision reduction
- Flexible scheduling strategies

### 2. Chain-of-Thought Reasoning
- Structured reasoning evaluation
- Token-level precision control
- Reward-based quality assessment

### 3. Efficiency Optimization
- Quantized model inference
- Memory-efficient processing
- Batch evaluation capabilities
