

## Installation
```
pip install -r requirements.txt
```

Set the required environment variable, then log in to Hugging Face:
```
export HUGGINGFACE_TOKEN="your-token-here"
huggingface-cli login
```

## Implementation Details

The system consists of several key modules:

### main.py
Contains the core training loop implementing GRPO (Group Relative Policy Optimization). Generation features include:
- Mixture-of-token generation with phase transitions
- Multiple sampling strategies (nucleus, Dirichlet, element-wise max)
- Comprehensive logging of generation steps and token selection
- Enhanced loss computation with embedding-based token selection
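
For orientation, the group-relative part of GRPO can be sketched as follows: several completions are sampled for the same prompt, and each completion's reward is normalized against the mean and standard deviation of its own group. This is a minimal sketch, not the code in `main.py`; the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` value are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper: the name, the population std, and eps are
    illustrative assumptions, not taken from main.py.
    """
    mean = statistics.fmean(rewards)
    # eps avoids division by zero when every completion in the group
    # received the same reward.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt, two of which were rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above their group's mean get a positive advantage and those below get a negative one, so no separate value model is needed to provide a baseline.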

### llms.py
Manages model loading and configuration:
- Support for LLaMA and Qwen models through Hugging Face's transformers library
- Conditional flash attention support based on model type
- Optimized model loading for different architectures
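
One way to make flash attention conditional on the model type is to build the loading arguments from the model name. The sketch below is hypothetical: the helper `model_load_kwargs` and the choice of which model families enable flash attention are assumptions, not taken from `llms.py`. Only the `attn_implementation` argument of `from_pretrained` is an actual `transformers` parameter.

```python
def model_load_kwargs(model_name, flash_families=("llama", "qwen")):
    """Build extra keyword arguments for loading a causal LM.

    Hypothetical helper: the default flash_families tuple is an
    assumption for illustration only.
    """
    kwargs = {}
    name = model_name.lower()
    # Request FlashAttention-2 only for model families assumed to support it.
    if any(family in name for family in flash_families):
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


# Intended use (requires transformers, and flash-attn for the flash path):
# AutoModelForCausalLM.from_pretrained(name, **model_load_kwargs(name))
```

Note that `flash_attention_2` needs the `flash-attn` package installed and a half-precision dtype, so a fallback to the default attention implementation is worth keeping.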

### rldatasets.py
Handles dataset loading and preprocessing:
- GSM8K, Math500, MBPP, and LeetCode datasets
- Full Reasoning Gym integration with task-specific configurations
- Flexible data loaders with custom preprocessing for different data formats
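
As an example of format-specific preprocessing, GSM8K reference solutions end with a line of the form `#### <answer>`. A minimal sketch of extracting that final answer is shown below; the function name `extract_gsm8k_answer` is hypothetical, and the actual preprocessing in `rldatasets.py` may differ.

```python
def extract_gsm8k_answer(solution: str):
    """Return the final answer from a GSM8K reference solution, or None.

    Hypothetical helper, shown for illustration only.
    """
    marker = "####"
    if marker not in solution:
        return None
    # Take the text after the last marker and drop thousands separators.
    answer = solution.rsplit(marker, 1)[1].strip()
    return answer.replace(",", "")
```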

### evaluator.py
Contains evaluation metrics and reward functions:
- LLM-based answer evaluation using the OpenAI API
- Safe subprocess execution for programming tasks with crash protection
- Reasoning Gym dataset scoring with custom reward functions
- Timeout handling and robust error management
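
The subprocess and timeout handling listed above can be sketched with the standard library alone. This is an illustration under assumptions, not the evaluator's actual interface: the function name `run_candidate`, its return shape, and the default timeout are hypothetical.

```python
import subprocess
import sys


def run_candidate(code: str, timeout_s: float = 5.0):
    """Run generated Python code in a child process.

    Returns (passed, detail). Hypothetical helper, not the API of evaluator.py.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if result.returncode != 0:
        # Keep only the last stderr line (usually the exception message).
        lines = result.stderr.strip().splitlines()
        detail = lines[-1] if lines else f"exit code {result.returncode}"
        return False, detail
    return True, result.stdout
```

Running the code in a separate process keeps a crash or infinite loop from taking down the training loop, but it is not a security sandbox; untrusted code can still touch the filesystem and network unless further isolated.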

### utils.py
Utility functions for token processing:
- Memory-efficient selective log softmax operations
- Multiple token embedding and log probability computation methods
- Enhanced generation logging with dataset-specific formatting
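
The idea behind a selective log softmax is that only the log-probability of the chosen token is needed at each position, so the full log-softmax over the vocabulary never has to be stored: `log_softmax(x)[t] = x[t] - logsumexp(x)`. The sketch below shows this in dependency-free Python on nested lists; it is an assumption-laden illustration, and a real implementation would use batched tensor operations instead.

```python
import math


def selective_log_softmax(logits, token_ids):
    """Log-probability of one chosen token per position.

    logits: list of per-position score lists (seq_len x vocab_size).
    token_ids: chosen token index per position (length seq_len).
    """
    out = []
    for row, tok in zip(logits, token_ids):
        # Stable log-sum-exp: subtract the row max before exponentiating.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        # log_softmax(row)[tok] == row[tok] - logsumexp(row)
        out.append(row[tok] - lse)
    return out
```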


## Soft Thinking Mode 🧠

This implementation includes an experimental "soft thinking" feature inspired by recent research on preserving information during token generation. Sampling a single token at each step discards the rest of the probability distribution. Instead, soft thinking:

1. **Samples top-k tokens** with their probabilities
2. **Creates weighted embeddings** by mixing the top-k token embeddings based on their probabilities
3. **Feeds mixed embeddings** back to the model as the next input, preserving the superposition of likely tokens
4. **Exits to normal generation** when `</reasoning>` becomes the most likely token
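
The four steps above can be sketched for a single generation step as follows. This is a minimal, dependency-free illustration on plain lists, not the repository's implementation: the function name `soft_thinking_step`, the renormalization of the top-k probabilities, and the `end_token_id` parameter (standing in for the `</reasoning>` token) are assumptions.

```python
def soft_thinking_step(probs, embeddings, k, end_token_id):
    """One soft-thinking step over a next-token distribution.

    probs: probability per vocabulary token.
    embeddings: one embedding vector (list of floats) per token.
    Returns (mixed_embedding, should_exit). Hypothetical helper.
    """
    # Step 1: keep the k most likely tokens (k >= 1 is assumed).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top_ids = order[:k]
    # Step 4: leave soft thinking once the end token is the most likely one.
    should_exit = top_ids[0] == end_token_id
    # Step 2: renormalize the top-k probabilities so the weights sum to 1
    # (an assumption), then take the weighted sum of their embeddings.
    total = sum(probs[i] for i in top_ids)
    weights = [probs[i] / total for i in top_ids]
    dim = len(embeddings[0])
    mixed = [
        sum(w * embeddings[i][d] for w, i in zip(weights, top_ids))
        for d in range(dim)
    ]
    # Step 3: the caller feeds `mixed` to the model as the next input embedding.
    return mixed, should_exit
```

With `k=1` the mixed embedding reduces to the embedding of the single most likely token, which matches ordinary greedy decoding.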
