<h1 style="text-align: center;">Sphere-Prover-V1: Advancing Language Models for Formal
Mathematical Reasoning via Exploration-based Reinforcement Learning</h1>

## Deploy Kimina Server for Verification
```bash
cd ~/kimina-lean-server/ # clone the kimina-lean-server repository to this path first
python3 -m server &
echo "Setting LEAN4_API_URL to http://$(hostname -i):12332"
export LEAN4_API_URL="http://$(hostname -i):12332"
```
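
Before starting training, it can help to confirm that the exported URL actually points at a listening server. The following is a minimal sketch using only the Python standard library; the helper names `parse_host_port` and `is_listening` are hypothetical and not part of Kimina or verl. It only checks that a TCP connection can be opened and does not call any Kimina endpoint.

```python
import os
import socket
from urllib.parse import urlparse


def parse_host_port(url: str) -> tuple[str, int]:
    """Split a URL such as http://10.0.0.1:12332 into (host, port)."""
    parsed = urlparse(url)
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected a URL like http://host:port, got {url!r}")
    return parsed.hostname, parsed.port


def is_listening(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = parse_host_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unresolvable host.
        return False


if __name__ == "__main__":
    url = os.environ.get("LEAN4_API_URL", "")
    if not url:
        print("LEAN4_API_URL is not set")
    else:
        status = "reachable" if is_listening(url) else "not reachable"
        print(f"{url} is {status}")
```

Run it in the same shell that exported `LEAN4_API_URL`, since the variable is only visible to that shell and its child processes.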

## Running RL Training
Our code uses verl as the training engine for RL.
```bash
bash numina-rl/examples/numina_grpo.sh
```

## PKPO Implementation

### Configuration
Add an `algorithm.k` parameter to `verl/trainer/config/ppo_trainer.yaml` to set the optimization target, i.e. the `k` in the pass@k objective that PKPO optimizes.
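
To make the role of `k` concrete, here is a minimal sketch of the standard unbiased pass@k estimator for binary rewards. It assumes that `algorithm.k` is the `k` in pass@k; the function name `pass_at_k` is hypothetical and this is not the code in `pkpo.py`, which may transform rewards differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled proofs, of which c were verified correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that k
    samples drawn without replacement from the n are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k = 1` this reduces to the fraction of correct samples, `c / n`. Larger `k` gives more credit to prompts where only a few of the `n` rollouts succeed, which is the behavior an exploration-oriented objective is after.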

### Code Changes

**File: `verl/trainer/ppo/ray_trainer.py`**
- Line 213: Modify `compute_advantage` function
- Line 400: Add PKPO logic
- Line 1298: Add parameter `k`

**File: `verl/trainer/ppo/core_algos.py`**
- Add import: `import verl.trainer.ppo.pkpo as pkpo`
- Line 117: Register PKPO algorithm
- Line 260: Integrate PKPO

**New File: `verl/trainer/ppo/pkpo.py`**
- Create this file with the PKPO implementation.

### Usage
Run the training script:
```bash
bash numina-rl/examples/pkpo.sh
```
