## Project Overview

This repository provides three scripts: one for multimodal inference, one for polygon-based evaluation, and one for 3D bounding box evaluation.

### inference.py (Inference & Logging)
- Entry point for multimodal inference; supports single/multi-image and video frames with conversation templates.
- Runs single-/multi-round generation; in single-round mode, logs each batch (Q/A, optional bbox/segmentation/3D info, reasoning steps, image path, camera intrinsics) to JSON files for traceability.
- Includes likelihood evaluation of target continuations and a greedy-consistency check.
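
The likelihood evaluation and greedy-consistency check can be sketched as follows. This is a minimal illustration, not the code in `inference.py`: it assumes the model exposes per-step log-probabilities over the vocabulary, and it interprets "greedy-consistent" as "every target token is the argmax at its step". The function names `sequence_log_likelihood` and `is_greedy_consistent` are hypothetical.

```python
import math


def sequence_log_likelihood(step_log_probs, target_ids):
    """Sum the log-probability the model assigns to each target token.

    step_log_probs: one list of vocabulary log-probabilities per step.
    target_ids: the token id of the target continuation at each step.
    """
    if len(step_log_probs) != len(target_ids):
        raise ValueError("need exactly one distribution per target token")
    return sum(dist[tok] for dist, tok in zip(step_log_probs, target_ids))


def is_greedy_consistent(step_log_probs, target_ids):
    """Return True if greedy decoding would pick every target token.

    Assumption: 'consistent' means the target token is the argmax at each step.
    """
    for dist, tok in zip(step_log_probs, target_ids):
        greedy_tok = max(range(len(dist)), key=dist.__getitem__)
        if greedy_tok != tok:
            return False
    return True


# Toy two-token vocabulary, two decoding steps.
log_probs = [
    [math.log(0.7), math.log(0.3)],
    [math.log(0.4), math.log(0.6)],
]
print(sequence_log_likelihood(log_probs, [0, 1]))
print(is_greedy_consistent(log_probs, [0, 1]))
print(is_greedy_consistent(log_probs, [0, 0]))
```

A target can have a relatively high likelihood and still fail the greedy check if any single step prefers a different token, which is why the two measures are reported separately.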

### eval_poly_rec.py (Polygon Evaluation)
- Toolkit for polygon-based localization/segmentation evaluation.
- Preprocesses samples to include image size and fixed-length normalized polygons; can fall back to a bbox-derived polygon and resample/interpolate to a target number of points.
- Parses fixed-length polygon coordinates from model text output; computes polygon IoU, thresholded accuracy, and center-containment; provides metric aggregation.
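
The parsing, resampling, and center-containment steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `eval_poly_rec.py`: the output format (numbers in `x, y` order), the use of the vertex mean as the "center", and every function name here are hypothetical. Polygon IoU is left out because it needs polygon clipping or rasterization.

```python
import math
import re

_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_polygon(text, num_points):
    """Read the first 2 * num_points numbers from text as (x, y) pairs."""
    values = [float(v) for v in _NUMBER.findall(text)]
    if len(values) < 2 * num_points:
        return None  # output too short to be a valid fixed-length polygon
    values = values[: 2 * num_points]
    return [(values[i], values[i + 1]) for i in range(0, 2 * num_points, 2)]


def resample_polygon(points, num_points):
    """Place num_points vertices at equal arc-length steps on the closed outline."""
    closed = points + points[:1]
    seg_lengths = [math.dist(a, b) for a, b in zip(closed, closed[1:])]
    perimeter = sum(seg_lengths)
    if perimeter == 0:
        return [points[0]] * num_points
    step = perimeter / num_points
    out, seg, seg_start = [], 0, 0.0
    for k in range(num_points):
        target = k * step
        # Advance to the segment that contains the target arc length.
        while seg < len(seg_lengths) - 1 and seg_start + seg_lengths[seg] < target:
            seg_start += seg_lengths[seg]
            seg += 1
        length = seg_lengths[seg]
        t = (target - seg_start) / length if length > 0 else 0.0
        (x0, y0), (x1, y1) = closed[seg], closed[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def contains_point(polygon, x, y):
    """Ray-casting point-in-polygon test."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def center_contained(pred_polygon, gt_polygon):
    """Assumption: the 'center' is the mean of the predicted vertices."""
    cx = sum(p[0] for p in pred_polygon) / len(pred_polygon)
    cy = sum(p[1] for p in pred_polygon) / len(pred_polygon)
    return contains_point(gt_polygon, cx, cy)


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(parse_polygon("(0.1, 0.2), (0.3, 0.4)", 2))
print(resample_polygon(square, 8))
print(center_contained([(0.4, 0.4), (0.6, 0.4), (0.5, 0.6)], square))
```

Resampling by arc length keeps the number of coordinates fixed regardless of how many vertices the source annotation had, which is what lets the model output be parsed as a fixed-length sequence.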

### utils_bbox3d_rec.py (3D Box Evaluation)
- Utilities for parsing and evaluating 9-DoF 3D boxes (x, y, z, w, h, l, r1, r2, r3).
- Parses text outputs into parameter sequences; back-projects pixel coordinates with depth into camera coordinates using image size and intrinsics; converts the 9-DoF parameters to the box's eight 3D corners.
- Computes 3D IoU and multi-threshold accuracy; packages sample results and supports overall/per-dataset aggregation.
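
The geometry described above can be sketched as follows. This is an illustration under several assumptions that are not confirmed by `utils_bbox3d_rec.py`: pixel coordinates are normalized to [0, 1] and scaled by the image size, the camera follows the pinhole model with intrinsics `(fx, fy, cx, cy)`, the rotation is composed as `Rz(r3) @ Ry(r2) @ Rx(r1)`, `w, h, l` lie along the box's local x, y, z axes, and accuracy counts a sample when its IoU is at or above the threshold. All function names are hypothetical.

```python
import math


def backproject(u_norm, v_norm, depth, image_size, intrinsics):
    """Map a normalized pixel plus depth to camera coordinates (pinhole model)."""
    width, height = image_size
    fx, fy, cx, cy = intrinsics
    u, v = u_norm * width, v_norm * height  # assumed: inputs are in [0, 1]
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def euler_to_matrix(r1, r2, r3):
    """Rotation matrix for the assumed order R = Rz(r3) @ Ry(r2) @ Rx(r1)."""
    cx_, sx_ = math.cos(r1), math.sin(r1)
    cy_, sy_ = math.cos(r2), math.sin(r2)
    cz_, sz_ = math.cos(r3), math.sin(r3)
    return [
        [cz_ * cy_, cz_ * sy_ * sx_ - sz_ * cx_, cz_ * sy_ * cx_ + sz_ * sx_],
        [sz_ * cy_, sz_ * sy_ * sx_ + cz_ * cx_, sz_ * sy_ * cx_ - cz_ * sx_],
        [-sy_, cy_ * sx_, cy_ * cx_],
    ]


def box_corners(center, size, angles):
    """Return the 8 corners of a box given center, (w, h, l), and (r1, r2, r3)."""
    w, h, l = size
    rot = euler_to_matrix(*angles)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                local = (sx * w, sy * h, sz * l)
                corners.append(tuple(
                    center[i] + sum(rot[i][j] * local[j] for j in range(3))
                    for i in range(3)
                ))
    return corners


def accuracy_at_thresholds(ious, thresholds):
    """Fraction of samples whose IoU is at or above each threshold (assumed >=)."""
    return {t: sum(iou >= t for iou in ious) / len(ious) for t in thresholds}


center = backproject(0.5, 0.5, 2.0, (640, 480), (500.0, 500.0, 320.0, 240.0))
print(center)
print(box_corners(center, (2.0, 4.0, 6.0), (0.0, 0.0, 0.0)))
print(accuracy_at_thresholds([0.2, 0.5, 0.8], [0.25, 0.5]))
```

The Euler convention and the axis assignment of `w, h, l` differ between datasets, so both should be checked against the actual utilities before reusing this sketch.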

