# op_eval

Standalone evaluation-only package extracted from the RL/codegen stack. It ships the backends, reference PyTorch implementations, and helpers needed to compile and validate generated operator kernels without the RL/generation code.

## Install (editable)
```
pip install -e op_eval_framework
```

## CLI
Evaluate every `*.txt` under a directory:
```
python -m op_eval.cli \
  --input-dir output/gpt-5-codex/ascendc/add_shot/0.0-1.0/run0 \
  --language ascendc \
  --build-workers 8 \
  --num-devices 8 \
  --run-name playground \
  --result-path results/ascendc_eval.json
```
Required args: `--input-dir`, `--language`. The CLI infers op names from filenames, skips any missing references, and writes `results.json` into the input dir by default. When `runs > 1`, the JSON groups results per op as a list tagged with `run` indices.

## Python API
```python
from pathlib import Path
from op_eval import build_requests_from_dir, evaluate_requests, evaluate_code

# Batch evaluate a folder of saved responses
requests = build_requests_from_dir(
    input_dir=Path("output/gpt-5-codex/ascendc/add_shot/0.0-1.0/run0"),
    language="ascendc",
    run_name="playground",
)
results = evaluate_requests(
    requests,
    runs=1,
    build_workers=8,
    device_ids=list(range(8)),
    result_path=Path("results/ascendc_eval.json"),
)

# Evaluate an in-memory code string (e.g., from an RL loop)
code = Path("output/.../relu.txt").read_text()
single_result = evaluate_code(
    op="relu",
    language="ascendc",
    code=code,
    run_name="inline",
    runs=1,
    device_id=0,
)
```

## MCP Server
The package includes a Model Context Protocol (MCP) server, allowing LLM agents to safely compile and evaluate operators as a tool.

### Usage
After installation, the server is available via the `op_eval_mcp` command.
```bash
pip install .
op_eval_mcp
```
(It hangs on stdio, waiting for a JSON-RPC connection. This is normal.)

Configure your MCP client (e.g. Codex/Claude Desktop) to run:
- Command: `op_eval_mcp`
IDE) to use it:

```json
{
  "mcpServers": {
    "op_eval": {
      "command": "op_eval_mcp",
      "args": []
    }
  }
}
```

The server exposes a tool `evaluate_ascend_operator(op_name, file_path, language="ascendc")`. You must provide the absolute `file_path` to the source file. It returns a structured summary suitable for agents.

## Contents
- `op_eval/backends`: backend implementations for `ascendc`, `tilelang_ascend`, `cuda`, `triton`, `pallas`, `sycl`.
- `op_eval/reference`: ground-truth PyTorch implementations used for correctness checks.
- `op_eval/utils`: correctness/performance harness and Ascend build pipeline helpers.

Each Ascend run is isolated under `ASCEND_OP_RUNS_ROOT/<run_name>/<language>/run<idx>` (default root: `./op_eval_output` relative to your working directory). A copy of the operator template lives at `op_eval/ascend_op_template`; override `ASCEND_OP_TEMPLATE_ROOT`/`ASCEND_OP_RUNS_ROOT` if you want to use external toolchain assets or a different workspace location.

## TODO (known issues)
- Batch runs can emit `resource_tracker: leaked semaphore` warnings when many tasks fail. Fix by guaranteeing manager/queue shutdown in all error paths or by replacing the Manager-backed device queue with a simpler binding strategy.

## CLI Usage
To see all available options:
```bash
pip install .
op_eval --help
```

Evaluation from a pre-generated folder (e.g. from Codex/ChatGPT):
```bash
op_eval --input-dir /path/to/responses --language ascendc --ops abs_custom add_custom --build-workers 32 --num-devices 8 --enable-npu-parallelism
```

### Filtering operators
You can filter the evaluation scope using categories or difficulty levels defined in the dataset:
```bash
op_eval \
  --input-dir output/... \
  --language ascendc \
  --categories activation normalization \
  --levels level1
```
This evaluates only operators that match the given categories AND levels.

### Pass@k / Multi-run Evaluation
If your input directory contains multiple run subdirectories (e.g., `run0`, `run1`, `run2`...) and no `*.txt` files at the root, `op_eval` automatically detects this structure.

It will:
1.  Recursively scan all `run*` subdirectories for `*.txt` candidates.
2.  Assign an isolated workspace for each run (e.g., `default_run0`, `default_run1`) to prevent build conflicts.
3.  Evaluate all candidates and return a list of results for each operator in the JSON output, facilitating Pass@k calculation.

**Result Aggregation (Merging with Code):**
You can automatically merge the results with the source code into a single `all_result@k.json` file using the `--merge-code` flag. This format is ideal for downstream analysis.

```bash
op_eval \
  --input-dir output/ModelName/ascendc/add_shot/ \
  --language ascendc \
  --merge-code all_results.json
```

**Resume & Crash Recovery:**
`op_eval` supports incremental saving and automatic resumption.
- **Incremental Save**: The intermediate `results.json` is updated after every run batch (e.g., after `run0` finishes, then `run1`...).
- **Auto-Resume**: If the tool crashes or is interrupted, simply re-run the **same command**. The tool will load the existing results, identify which (op, run) pairs are already completed, and skip them. It will only execute the remaining tasks and then perform the final merge.
- **Safety**: File writing is performed sequentially by the main process, ensuring no race conditions even with high parallelism.

### NPU Parallelism & Concurrency
By default, the framework strictly enforces **1 operator per NPU device** at a time to ensure hardware stability.
- **Build Parallelism**: Controlled by `--build-workers` (default: 4). Compilation happens on the CPU and is fully parallel.
- **Eval Parallelism**: Controlled by `--num-devices` (default: 8). 

**High-Throughput Mode**:
If you possess a high core-count CPU and want to saturate the NPUs, you can use `--enable-npu-parallelism`.
- Logic: If `--build-workers` > `--num-devices`, the system will allow multiple evaluations to run on the same NPU concurrently.
- Ratio: `concurrency = floor(build_workers / num_devices)`. e.g., 32 workers / 8 devices = 4 ops/device.
- **Warning**: Enabling this may cause "Vector Core Exception" or "Sync Failed" error logs from the NPU driver due to hardware contention. However, the process isolation usually ensures that successful kernels still produce correct results.

### NPU Runtime Failures (Server/CLI)
Occasional errors like `vector core exception`, `rtDeviceSynchronizeWithTimeout`, or `kernel pointer null` typically indicate a kernel build/tiling/runtime issue rather than a server failure. In these cases:
- The request may still return HTTP 200 with `correctness=false` (the runtime exception is caught during correctness checks).
- The affected NPU context can remain in a bad state, so follow-up ops may also fail until the process or device is reset.

## Docker Support

You can run the `op_eval` server inside a Docker container using the provided scripts. This is recommended for isolating the Ascend CANN environment.

### Optimized Docker Images

For testing on other CANN toolchains, you can find official Docker images here:
- **Docker Hub (Recommended)**: [ascendai/cann](https://hub.docker.com/r/ascendai/cann/tags)
- **Ascend Hub**: [Ascend Hub Detail](https://www.hiascend.com/developer/ascendhub/detail/17da20d1c2b6493cb38765adeba85884)

If you need to build your own images, reference the official CANN Dockerfiles:
- [Ascend/cann-container-image](https://github.com/Ascend/cann-container-image/tree/main/cann)


### Using a different CANN Version

To use a specific CANN version (e.g., 7.0, 8.0, or a specific Community Edition), modify the first line of `op_eval_framework/Dockerfile`:

```dockerfile
# Default
FROM swr.cn-south-1.myhuaweicloud.com/ascendhub/cann:8.0.rc3-910b-openeuler22.03-py3.11

# Example: Change to a different tag found on Docker Hub / SWR
FROM swr.cn-south-1.myhuaweicloud.com/ascendhub/cann:7.0.0-910-openeuler22.03-py3.9
```

Then rebuild the image:
```bash
./scripts/docker_build.sh
```



### 1. Build the Docker Image
```bash
./scripts/docker_build.sh
```

### 2. Run the Server
The run script automatically maps NPU devices and Ascend drivers from the host.
```bash
./scripts/docker_run.sh
```
The server will start on port 5000 (net=host).

### CUDA (NVIDIA) Docker Server
If you want to run the server against NVIDIA GPUs, use the CUDA run script and pass `--backend cuda`.

This repo does not ship a CUDA base image. Build your own CUDA-enabled image that installs `op_eval` and tag it as `op_eval_server_cuda:latest`, then:
```bash
./scripts/docker_run_nvidia.sh
```
Defaults assume 8xA100 (`CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`). Override with `CUDA_VISIBLE_DEVICES` and `DEVICES` as needed.

### 3. Usage
Once the server is running in Docker, you can use the `op_eval` CLI from your local machine (or another container) by pointing to the server URL:
```bash
op_eval ... --remote-url http://127.0.0.1:5000
```

### Server Concurrency Model
When running as a server (`op_eval_server`), the parallelism architecture is:
1.  **Request Handling (Threading)**: The Flask server is multi-threaded (`threaded=True`). It accepts incoming HTTP requests concurrently, preventing blocking on the network layer.
2.  **Process Isolation (Spawn)**: For *each* evaluation request, the server spawns a dedicated `multiprocessing.Process` (using `spawn` context). this ensures that:
    -   Host compilation (CPU-bound) runs in parallel across processes.
    -   Environment state (env vars, backend imports) is isolated per-operator.
    -   Process crashes (segfaults) do not bring down the main server.
3.  **Device Serialization**: NPU execution is strictly serialized per-device using a shared global `DeviceQueue`. Even if 100 requests arrive at once, only `N` kernels (where `N=num_devices`) will execute on the hardware simultaneously, preventing race conditions or resource exhaustion.

This means you can aggressive set `--build-workers` on the client (e.g., 32 or 64) to maximize CPU compilation throughput, while the server automatically throttles execution to match the available NPUs.


## License
This project is licensed under the Apache-2.0 License.

## On TXT translation
For an Ascend TXT response, ascend_compile_pipeline.ascend_compile() pulls specific fields from the generated context and writes them into the per-op workspace. Under the operator root (e.g. op_eval_output/default/ascendc/run0/operators/\<OpCapitalTimestamp\>/\<OpCapital\>/…), the mapping is:

- project_json_src → \<workspace_root\>/\<op\>.json (used by msopgen gen …)
- host_tiling_src → \<op_root\>/op_host/\<op\>_tiling.h
- host_operator_src → \<op_root\>/op_host/\<op\>.cpp
- kernel_src → \<op_root\>/op_kernel/\<op\>.cpp
- python_bind_src → \<workspace_root\>/CppExtension/csrc/op.cpp

Everything else in that directory tree (CMake files, build_out, template headers, etc.) comes from the template/msopgen, not from the TXT.
