Keywords: Chemistry, Healthcare, Evaluation, LLM, Medicine, Medical, Public Health, Emergency Response, Safety, Risk Assessment, Incident Management
Abstract: Emergency responders managing hazardous material (HAZMAT) incidents face critical, time-sensitive decisions, manually navigating extensive chemical guidelines. We investigate whether today's language models can assist responders by rapidly and reliably understanding critical information, identifying hazards, and providing recommendations. We introduce the Chemical Emergency Response Evaluation Framework (ChEmREF), a new benchmark comprising questions on 1,035 HAZMAT chemicals from the Emergency Response Guidebook and the PubChem Database. ChEmREF is organized into four tasks: (i) domain knowledge question answering from chemical safety and certification exams, (ii) translation of chemical identifiers between common names, synonyms, and formulas (e.g., ethanol to ``C$_2$H$_6$O''), (iii) fill-in-the-blank emergency response guide generation (e.g., recommending appropriate evacuation distances), and (iv) classification of risk features from real-world HAZMAT incident response reports. Across the four tasks, the strongest models achieve 97.0% exact match on incident response recommendations with retrieval support, but only 63.9% accuracy on HAZMAT examination questions, falling short of the reliability required for safety-critical use. These results indicate that while language models can assist with information retrieval and high-level reasoning, they require human oversight before deployment in real-world HAZMAT emergency response.
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: NLP for Social Good; Human-computer interaction; NLP tools for social analysis; Computational Social Science
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7823