Keywords: Chemistry, Healthcare, Evaluation, LLM, Medicine, Medical, Public Health, Emergency Response, Safety, Risk Assessment, Incident Management
Abstract: Emergency responders managing hazardous material (HAZMAT) incidents face critical, time-sensitive decisions, manually navigating extensive chemical guidelines. We investigate whether today's language models can assist responders by rapidly and reliably understanding critical information, identifying hazards, and providing recommendations. We introduce the Chemical Emergency Response Evaluation Framework (ChEmREF), a new benchmark comprising questions on 1,035 HAZMAT chemicals from the Emergency Response Guidebook and the PubChem Database. ChEmREF is organized into four tasks: (i) domain knowledge question answering from chemical safety and certification exams, (ii) translation of chemical identifiers between common names, synonyms, and formulas (e.g., ethanol to ``C$_2$H$_6$O''), (iii) fill-in-the-blank emergency response guide generation (e.g., recommending appropriate evacuation distances), and (iv) classification of risk features from real-world HAZMAT incident response reports. Across the four tasks, the strongest models achieve 97.0% exact match on incident response recommendations with retrieval support, but only 63.9% accuracy on HAZMAT examination questions, falling short of the reliability required for safety-critical use. These results indicate that while language models can assist with information retrieval and high-level reasoning, they require human oversight before deployment in real-world HAZMAT emergency response.
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: NLP for Social Good; Human-computer interaction; NLP tools for social analysis; Computational Social Science
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7823