Abstract: There is growing interest in AI systems that support human decision-making in high-stakes domains (e.g., medical diagnosis) to improve decision quality and reduce cognitive load. Mainstream approaches pair human experts with a machine-learning model, offloading low-risk decisions to the model so that experts can focus on cases that require their judgment.
This $\textbf{\textit{separation of responsibilities}}$ setup, however, is inadequate for high-stakes scenarios. The expert may end up over-relying on the machine's decisions due to $\textit{anchoring bias}$, thus losing the human oversight that is increasingly being required by regulatory agencies to ensure trustworthy AI. On the other hand, the expert is left entirely unassisted on the (typically hardest) decisions on which the model abstained.
As a remedy, we introduce $\textbf{\textit{learning to guide}}$ (LTG), an alternative framework in which -- rather than taking control from the human expert -- the machine provides $\textit{guidance}$ useful for decision making, and the human is entirely responsible for coming up with a decision.
In order to ensure guidance is $\textit{interpretable}$ and $\textit{task-specific}$, we develop Slog, an approach for turning $\textit{any}$ vision-language model into a capable generator of textual guidance by leveraging a modicum of human feedback.
Our empirical evaluation highlights the promise of Slog both on a synthetic dataset and on a challenging, real-world medical diagnosis task.
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=JAW1C8RNth
Changes Since Last Submission: We sincerely appreciate the constructive feedback from the reviewers and the AE and, based on it, we have made substantial amendments to the paper.
**Human Evaluation.** We were strongly encouraged to include a human evaluation (even on a small subset) alongside the machine-driven experiments. We collaborated with a professional pulmonologist, who evaluated a subset of the outputs generated by SLOG and by one of its competitors. Table 8 of the paper reports the quality of the guidance as assessed by the doctor.
| Pathology | Pr (Baseline) | Rc (Baseline) | F1 (Baseline) | Pr (SLOG) | Rc (SLOG) | F1 (SLOG) |
|------------------------|-----------|-----------|-----------|-------------|-------------|-------------|
| No Findings | 11.11 | 33.33 | 16.67 | 0.00 | 0.00 | 0.00 |
| Enlarged Cardiomediastinum | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Cardiomegaly | 11.11 | 33.33 | 16.67 | 25.00 | 50.00 | **33.33** |
| Lung Lesion | 0.00 | 0.00 | 0.00 | 16.67 | 33.33 | **22.22** |
| Lung Opacity | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Edema | 28.57 | 50.00 | **36.36** | 14.29 | 20.00 | 16.67 |
| Consolidation | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Pneumonia | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Atelectasis | 9.09 | 33.33 | 14.29 | 31.25 | 100.00 | **47.62** |
| Pneumothorax | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Pleural Effusion | 57.14 | 57.14 | **57.14** | 55.56 | 50.00 | 52.63 |
| Pleural Other | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Fracture | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Support Devices | 22.22 | 100.00 | **36.36** | 15.38 | 66.67 | 25.00 |
| **MACRO** | 9.95 | 21.94 | 12.68 | 11.30 | 22.86 | **14.11** |
| **MICRO** | 15.94 | 28.95 | 20.56 | 20.48 | 29.82 | **24.29** |
Results show that the advantage of SLOG carries over to the setting involving a real human expert, in terms of both macro- and micro-averaged F1 across the different pathologies.
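For clarity on how the MACRO and MICRO rows aggregate the per-pathology scores, the sketch below shows the two averaging schemes computed from per-label true-positive/false-positive/false-negative counts. This is an illustrative sketch only, not the evaluation code; the counts in the example are made up and do not come from the study.

```python
# Minimal sketch: macro vs. micro averaging of per-label precision/recall/F1.
# Macro averages the per-label scores; micro pools the raw counts first.

def prf(tp, fp, fn):
    """Precision, recall, F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_micro(counts):
    """counts: list of (tp, fp, fn) tuples, one per pathology/label."""
    per_label = [prf(*c) for c in counts]
    # Macro: unweighted mean of each metric across labels.
    macro = tuple(sum(m) / len(per_label) for m in zip(*per_label))
    # Micro: sum the counts over labels, then compute the metrics once.
    micro = prf(*(sum(c) for c in zip(*counts)))
    return macro, micro

# Hypothetical counts for three labels (not from the paper's evaluation):
macro, micro = macro_micro([(1, 2, 2), (0, 0, 3), (4, 1, 0)])
```

Note that micro averaging weights labels by their support, whereas macro averaging treats rare and common pathologies equally, which is why the two rows in the table can differ substantially.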
**Different dataset.** We were also advised to experiment with a different dataset. We addressed this concern by using CLEVR, a synthetic dataset tailored to vision-language tasks. The idea is that SLOG should provide information about a scene that allows a user to discover the rules characterizing positive and negative instances. The results, reported in Table 4 and Table 5, confirm that SLOG improves performance on the downstream decision task.
**Experiments with pre-trained VLMs.** We did try to fine-tune a pre-trained state-of-the-art VLM (namely LLaVA 7B) on our medical decision-making task. However, fine-tuning failed to give sensible results, with the fine-tuned LLaVA performing much worse than our R2Gen model specialized for medical diagnosis. We therefore omitted these results from the revised version of the manuscript.
Assigned Action Editor: ~Haoliang_Li2
Submission Number: 6711