Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI

Alam, M. S., & Bazilinskyy, P.

Submitted (2025)
ABSTRACT This study examines the effectiveness of using large language model-based personas to evaluate external human-machine interfaces (eHMIs) on automated vehicles. A range of models, including miniCPM-V, LLaVA, LLaVA-LLaMA-3, Llama3.2 vision, Moondream, BakLLaVA, Granite3.2 vision, LLaVA-Phi3, Gemma 3, Deepseek-vl2, and ChatGPT-4o, was used to simulate the pedestrian perspective. The models assessed images of vehicles with eHMIs, assigning a score from 0 (completely unwilling) to 100 (fully confident) to indicate willingness to cross. Each model was evaluated over 15 trials with randomised image sequences, both with and without prior chat context, and its ratings were compared with crowdsourced human ratings. The findings indicate that Gemma 3 (27B) performed best without chat history (r = 0.85), while ChatGPT-4o was superior when chat history was included (r = 0.81). In contrast, models such as Deepseek-vl2 and BakLLaVA produced uniform confidence scores when memory context was retained, while Llama3.2 vision failed to produce any output at all.
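
The evaluation pipeline summarised in the abstract could, for instance, be approximated with the Ollama Python client and SciPy. The sketch below assumes the open models were served locally through Ollama and that ratings were parsed from free-text replies; the model tag, prompt wording, image file names, and human ratings shown are illustrative placeholders, not the study's actual materials.

import random
import re

import ollama                      # pip install ollama
from scipy.stats import pearsonr   # pip install scipy

PERSONA_PROMPT = (
    "You are a pedestrian standing at the kerb. The image shows an automated "
    "vehicle with an external human-machine interface (eHMI). On a scale from "
    "0 (completely unwilling) to 100 (fully confident), how willing are you "
    "to cross the road? Reply with a single number."
)

def score_images(model: str, image_paths: list[str], keep_history: bool) -> dict[str, list[int]]:
    """Run one trial: present the images in random order and collect 0-100 scores."""
    scores: dict[str, list[int]] = {p: [] for p in image_paths}
    history: list[dict] = []
    order = random.sample(image_paths, len(image_paths))  # randomised image sequence
    for path in order:
        messages = history + [{"role": "user", "content": PERSONA_PROMPT, "images": [path]}]
        reply = ollama.chat(model=model, messages=messages)["message"]["content"]
        match = re.search(r"\d+", reply)
        if match:
            scores[path].append(int(match.group()))
        if keep_history:  # retain chat context between images ("with memory")
            history = messages + [{"role": "assistant", "content": reply}]
    return scores

if __name__ == "__main__":
    images = ["ehmi_text_walk.jpg", "ehmi_light_band.jpg", "ehmi_off.jpg"]  # placeholder file names
    human_means = [85.0, 64.0, 41.0]                                        # placeholder crowdsourced ratings
    trials = [score_images("gemma3:27b", images, keep_history=False) for _ in range(15)]
    model_means = [
        sum(s for t in trials for s in t[img]) / max(1, sum(len(t[img]) for t in trials))
        for img in images
    ]
    r, p = pearsonr(model_means, human_means)  # agreement with human ratings
    print(f"Pearson r = {r:.2f} (p = {p:.3f})")

Repeating the loop with keep_history=True, and again for each model tag, would reproduce the with-context and without-context comparison described in the abstract.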