Can LLMs Recognize Emotion Expressions?

Modern Large Language Models (LLMs) excel at text-based tasks, but their performance on image-based tasks, particularly those involving fundamental aspects of human communication, remains uncertain. This investigation evaluates whether four state-of-the-art multimodal LLMs (OpenAI’s GPT-5, Google’s Gemini 2.5 Flash, Anthropic’s Claude Sonnet 4, and xAI’s Grok 2 Vision) can accurately recognize nonverbal emotion expressions across six validated databases presenting facial expressions, bodily expressions, or combined face+body displays, encompassing both posed and ecologically valid naturalistic stimuli.

Methods

Four leading multimodal Large Language Models (LLMs; OpenAI’s GPT-5, Google’s Gemini 2.5 Flash, Anthropic’s Claude Sonnet 4, and xAI’s Grok 2 Vision) were queried via REST API using RStudio. To investigate the out-of-the-box performance of LLMs under default settings, we queried the models without tuning any parameters (including temperature), ensuring our findings represent baseline capabilities rather than optimized performance. Each model received an image paired with a standardized prompt:

 

“What is the emotion label that best characterizes the expression being displayed in this image? Select from these options, and only respond with one word: Fear, Anger, Sadness, Surprise, Happiness, Disgust, Neutral, None of the above”
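
To illustrate the querying procedure, below is a minimal sketch of a single request in R using the httr and base64enc packages against OpenAI’s chat-completions endpoint; the endpoint, model identifier, and payload shape are assumptions for illustration, and each provider’s API has its own request format:

```r
# Minimal sketch of one image + prompt query (OpenAI-style chat completions).
# The endpoint, model identifier, and payload structure are illustrative.
library(httr)
library(base64enc)

prompt <- paste(
  "What is the emotion label that best characterizes the expression being",
  "displayed in this image? Select from these options, and only respond with",
  "one word: Fear, Anger, Sadness, Surprise, Happiness, Disgust, Neutral,",
  "None of the above"
)

query_emotion <- function(image_path, api_key) {
  img_b64 <- base64encode(image_path)  # encode the stimulus image

  res <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    encode = "json",
    body = list(
      model = "gpt-5",  # illustrative model identifier
      messages = list(list(
        role = "user",
        content = list(
          list(type = "text", text = prompt),
          list(type = "image_url",
               image_url = list(url = paste0("data:image/jpeg;base64,", img_b64)))
        )
      ))
    )
  )
  content(res)$choices[[1]]$message$content  # the one-word emotion label
}
```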

 

A total of 3,350 distinct emotion expressions were retrieved from six validated emotion expression databases: 2,444 facial expressions of emotion (52% female, 48% male), 823 bodily expressions of emotion (57% female, 43% male), and 83 naturalistic expressions with both the face and body visible (50% female, 50% male). The six databases were:

  • Facial expressions of emotion:

    • The Warsaw Set of Emotional Facial Expressive Pictures (WSEFEP; Olszanowski et al., 2014)

    • The FACES Database (Ebner, Riediger, & Lindenberger, 2010)

    • The Chicago Face Database (CFD; Ma, Kantner, & Wittenbrink, 2015)

  • Bodily expressions of emotion:

    • The Bodily Expressive Action Stimulus Test (BEAST; de Gelder & Van den Stock, 2011)

    • The Bochum Emotion Stimulus Set (BESST; Thoma, Bauser, & Suchan, 2012)

  • Face + Body in-the-wild expressions:

    • Abramson et al. (2017) naturalistic bodily expressions

 

Examples of fear stimuli from each database.

 

Finding 1: Accuracy for each emotion expression in each database

Grand-mean recognition accuracy with 95% confidence intervals for each database and emotion expression, averaged across all four LLMs (with individual LLM performance shown as points), is presented below. Two important findings emerge:

  • Across all databases with human benchmark data available, pooled LLM accuracy for fear and anger expressions, whether displayed by the face, the body, or the face and body together, was at least 12.7 percentage points lower than human performance, with exact binomial tests confirming significantly lower recognition rates in every case (ps < .001; a sketch of this test appears after this list).

  • In both bodily expression databases, LLMs demonstrated consistent deficits in recognizing every emotion: pooled accuracy across LLMs fell at least 28.6 percentage points below human benchmarks in all cases (all ps < .001).

    • A similar deficit was observed in the face+body naturalistic stimuli, where pooled LLM recognition rates for anger and fear were at least 12.7 percentage points below human benchmarks (ps < .001).
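
As a concrete illustration of the exact binomial tests referenced above, the following R sketch compares pooled LLM accuracy in one database-by-emotion cell against its human benchmark; the counts and benchmark value are hypothetical placeholders, not values from the study:

```r
# Sketch of one exact binomial test: is pooled LLM accuracy lower than the
# human benchmark? All numbers below are hypothetical placeholders.
n_trials  <- 400    # pooled LLM responses for one emotion x database cell
n_correct <- 212    # correct responses among them
human_acc <- 0.85   # human benchmark proportion from the validation data

binom.test(n_correct, n_trials, p = human_acc, alternative = "less")
```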

 

Grand-mean recognition accuracy (bars) and 95% CIs for each database, separated by nonverbal display (x axis), LLM (shape), and nonverbal channel (color). Red triangles represent human benchmarks extracted from validation data.

 

Finding 2: Accuracy for each nonverbal channel

Below we present recognition accuracy for each nonverbal channel (left) and for each nonverbal channel separated by emotion expression (right). Accuracy was highest for facial expressions (81%; 95% CI [80%, 81%]), followed by face+body expressions (49%; 95% CI [43%, 54%]) and bodily expressions (43%; 95% CI [41%, 44%]). A multilevel binomial logistic regression model revealed that recognition of facial expressions was significantly higher than recognition of face+body expressions (OR = 1.91, Z = 2.29, p = .022) and body expressions (OR = 5.55, Z = 8.15, p < .001). Body expressions were recognized less accurately than face+body expressions (OR = 0.35, Z = −3.56, p < .001).
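
A minimal sketch of this multilevel model, assuming the lme4 package and a trial-level data frame with hypothetical column names; the random-effects structure shown is also an assumption:

```r
# Sketch of the multilevel binomial logistic regression for Finding 2.
# Assumes a trial-level data frame `trials` with hypothetical columns:
# correct (0/1), channel (face / body / face+body), database, emotion.
library(lme4)

m_channel <- glmer(
  correct ~ channel + (1 | database) + (1 | emotion),
  data   = trials,
  family = binomial
)

summary(m_channel)     # fixed effects on the log-odds scale
exp(fixef(m_channel))  # odds ratios for channel contrasts
```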

 

Finding 3: Differences in accuracy by LLM

GPT-5 yielded the highest performance (78.3%), followed by Gemini (74.5%), Grok 2 (70.4%), and Claude (56.7%). To assess performance differences between LLMs, we fit a multilevel binomial logistic regression predicting recognition accuracy from LLM (dummy-coded with GPT-5 as the reference), including random intercepts for database, nonverbal channel, and emotion expression. All Tukey-corrected pairwise comparisons were statistically significant (ORs < .46 or > 1.31, |Z|s > 4.11, ps < .0002). Thus, OpenAI’s GPT-5 yielded the highest raw accuracy across all trials.
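
The following R sketch shows how such a model and its Tukey-corrected pairwise comparisons could be fit with lme4 and emmeans; the `trials` data frame and its `llm` column are hypothetical placeholders:

```r
# Sketch of the Finding 3 model: accuracy predicted from LLM, with random
# intercepts for database, nonverbal channel, and emotion expression.
# `trials` and its columns are hypothetical placeholders.
library(lme4)
library(emmeans)

trials$llm <- relevel(factor(trials$llm), ref = "GPT-5")  # GPT-5 as reference

m_llm <- glmer(
  correct ~ llm + (1 | database) + (1 | channel) + (1 | emotion),
  data   = trials,
  family = binomial
)

# Pairwise LLM contrasts on the odds-ratio scale, Tukey-adjusted
emmeans(m_llm, pairwise ~ llm, type = "response", adjust = "tukey")
```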

 

Recognition rates for each LLM.

 

Finding 4: Improving LLM accuracy with machine learning

Note: this section is currently under embargo while this research goes through peer review.

General discussion

The present study provides the most comprehensive evaluation to date of LLMs’ ability to recognize emotion expressions from visual stimuli, spanning multiple models, databases, nonverbal channels, and emotion categories. While current LLMs demonstrate moderate accuracy, and in some cases match human benchmarks for facial expressions of certain emotions, they show consistent and pronounced deficits across all relevant datasets and modalities, particularly for expressions of fear and anger and for bodily expressions. Notable findings include:

  1. LLMs performed consistently poorly at recognizing fear and anger expressions.

  2. Facial expressions yielded the strongest recognition rates, whereas bodily expressions resulted in unsatisfactory performance that fell significantly below human benchmarks in every database and for every emotion display.

  3. GPT-5 yielded the highest performance, although variability attributable to differences between LLMs explained only a very small fraction of the overall variance in LLM accuracy.

Regulatory and policy implications

These findings carry profound implications for AI governance and policy. The EU AI Act treats specialized emotion recognition systems as high-risk technology, and their use is likely to become prohibited in sensitive contexts (e.g., law enforcement, workplaces, public spaces, border control, education). However, general-purpose AI models, including LLMs, remain largely exempt from such oversight. This distinction creates a potential loophole. Here, we demonstrate that general-purpose LLMs can, in some cases, achieve meaningful emotion recognition accuracy. As such, LLMs could effectively function as emotion recognition platforms, yet because of their broader function and designation as “general-purpose” tools, they may avoid the regulatory constraints applied to purpose-built alternatives.