This study examines the effectiveness of using large language model-based personas to evaluate external Human-Machine Interfaces (eHMIs) in automated vehicles. Thirteen models (BakLLaVA, ChatGPT-4o, DeepSeek-VL2, Gemma 3 12B, Gemma 3 27B, Granite Vision 3.2, LLaMA 3.2 Vision, LLaVA-13B, LLaVA-34B, LLaVA-LLaMA-3, LLaVA-Phi3, MiniCPM-V, and Moondream) were used to simulate pedestrian perspectives. Each model assessed images of vehicles equipped with an eHMI and assigned a score from 0 (completely unwilling) to 100 (fully confident) reflecting its willingness to cross. Each model was run 15 times across the full set of images, both with and without prior conversational context, and the resulting confidence scores were compared with crowdsourced human ratings. The findings indicate that Gemma 3 27B performed best without chat history (r = 0.85), whereas ChatGPT-4o was superior when conversational history was included (r = 0.81). In contrast, DeepSeek-VL2 and BakLLaVA gave similar scores regardless of context, while LLaVA-LLaMA-3, LLaVA-Phi3, LLaVA-13B, and Moondream produced only limited-range outputs in both conditions.
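As a minimal illustration of the comparison step described above, the sketch below computes the Pearson correlation between per-image model confidence scores (averaged over repeated runs) and crowdsourced human ratings, once per prompting condition. The array names and values are hypothetical placeholders, not data from the study.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-image confidence scores (0-100) from one model,
# already averaged over the 15 repeated runs; values are illustrative only.
model_scores = {
    "without history": np.array([72, 18, 55, 90, 34, 61]),
    "with history":    np.array([70, 22, 50, 88, 30, 65]),
}

# Hypothetical crowdsourced human willingness-to-cross ratings for the same images.
human_ratings = np.array([68, 25, 48, 85, 33, 60])

# Correlate model scores with human ratings under each prompting condition.
for condition, scores in model_scores.items():
    r, p = pearsonr(scores, human_ratings)
    print(f"{condition}: r = {r:.2f} (p = {p:.3f})")
```

Under assumptions like these, the same routine would be repeated for every model to obtain the per-condition correlations reported above.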