We Need More LLM Emotion Benchmarks
Much of the focus during LLM releases is on the model’s factual understanding of the world (GPQA and other quiz questions), coding abilities (Aider Polyglot or SWE-bench), or agentic tool use (web browsing benchmarks and SWE-bench). This makes sense because, as I’ve said before, these are the easiest things to quantify. However, I don’t think they reflect the wants and needs of most regular consumers.
I would be willing to bet that there are far more people who would use AI for advice, general chatting, help with flirting, etc. than engineers who want to use AI to generate code. This probably makes a model’s style and emotional abilities more important than its code generation abilities in deciding “how big of a deal” it is, especially as a consumer product. In spite of this, there are extremely few emotional intelligence benchmarks. The only remotely comprehensive benchmarks I could find are EQ-Bench and EmoBench. EmoBench doesn’t seem particularly high quality or to have much traction, and EQ-Bench seems mostly targeted at role-playing. The only other benchmark I could find that was even close is LMArena’s Creative Writing section.
Overall, there seems to be a dearth of good, publicly released emotional intelligence benchmarks. I hope more benchmarks measuring emotional intelligence are released; otherwise I’ll have to put them together myself, and that costs a lot of money.