Industrial Benchmarking of LLMs: Assessing Hallucination in Traffic Incident Scenarios with a Novel Spatio-Temporal Dataset
Large language models (LLMs) hold revolutionary potential to digitize and enhance the Health & Public Services (H&PS) industry. Despite their advanced linguistic abilities, concerns about accuracy, stability, and traceability still persist, especially in high-stakes areas such as transportation systems. Moreover, the predominance of English in LLM development raises questions about how they perform in non-English contexts.
This study introduces a novel cross-lingual benchmark dataset comprising 99,869 real traffic incident records from Vienna (2013-2023) to assess the robustness of more than nine state-of-the-art LLMs in the spatial and temporal domains of traffic incident classification. We then explore three hypotheses — sentence indexing, date-to-text conversion, and German-to-English translation — and incorporate Retrieval Augmented Generation (RAG) to further examine the models' ability to handle hallucinations in both spatial and temporal contexts.
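The three hypotheses can be read as preprocessing transforms applied to each incident record before prompting. A minimal sketch follows; all function and field names are hypothetical, and the translation step is a stub standing in for a real MT model or API:

```python
import datetime

def index_sentences(text: str) -> str:
    """Hypothesis 1 (sentence indexing): prefix each sentence with an
    index so the model can be asked to cite the sentence it used."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return " ".join(f"[{i}] {s}." for i, s in enumerate(sentences, 1))

def date_to_text(date: datetime.date) -> str:
    """Hypothesis 2 (date-to-text): spell out a numeric date in
    natural language to reduce temporal misreadings."""
    return date.strftime("%d %B %Y")  # e.g. "05 March 2019"

def translate_de_en(text: str) -> str:
    """Hypothesis 3 (German-to-English): stub translation via a toy
    glossary; a real pipeline would call a translation model here."""
    glossary = {"Unfall": "accident", "Stau": "traffic jam"}
    for de, en in glossary.items():
        text = text.replace(de, en)
    return text
```

Each transform can be applied independently, which is what allows the effect of each hypothesis on hallucination rates to be measured in isolation.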
Our experiments with GPT-4 and Llama models reveal significant performance disparities across these hypotheses in the spatio-temporal domain and also demonstrate which types of hallucinations RAG can mitigate. These findings underscore the need for enhanced cross-lingual capabilities and improved explainability in LLMs. We provide open access to our H&PS traffic incident dataset, with the project demo and code available at Website.