CII-Bench

Can MLLMs Understand the Deep Implication Behind Chinese Images?


Overview of CII-Bench: CII-Bench comprises 698 images, spanning six domains: Life, Art, Society, Politics, Environment, and Chinese Traditional Culture.

Introduction

As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need for higher-order capability evaluation of MLLMs is increasing. However, there is a lack of work evaluating MLLMs for higher-order perception and understanding of Chinese visual content. To fill the gap, we introduce the Chinese Image Implication understanding Benchmark, CII-Bench, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. CII-Bench stands out in several ways compared to existing benchmarks. Firstly, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous traditional Chinese paintings, which can deeply reflect a model's understanding of Chinese traditional culture. Through extensive experiments on CII-Bench across multiple MLLMs, we have made significant findings. First, there is a substantial gap between the performance of MLLMs and humans on CII-Bench: the highest MLLM accuracy is 64.4%, whereas human accuracy averages 78.2% and peaks at an impressive 81.0%. Second, MLLMs perform worse on Chinese traditional culture images, suggesting limitations in their ability to understand high-level semantics and a lack of deep knowledge of Chinese traditional culture. Finally, most models exhibit enhanced accuracy when image emotion hints are incorporated into the prompts. We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey towards expert artificial general intelligence (AGI).

Key Insights

  • We introduce CII-Bench, the first benchmark designed to assess the understanding of the implications embedded in Chinese images, a task that poses a significant challenge to current MLLMs.
  • We design a comprehensive, GPT-4o-based evaluation metric for Chinese traditional culture. This metric aligns more closely with human annotations and is better suited for evaluating Chinese traditional painting.
  • Our experimental findings are as follows:
    1. There is a notable performance gap between MLLMs and humans. The best model achieves an accuracy of 64.4%, while human accuracy averages 78.2% and peaks at 81.0%.
    2. Closed-source models generally outperform open-source models, but the best-performing open-source model surpasses the top closed-source model by more than 3%.
    3. Models perform significantly worse in Chinese traditional culture than in other domains, indicating that current models still lack sufficient understanding of Chinese culture. Further analysis shows that GPT-4o can only observe surface-level information and struggles to deeply interpret the complex cultural elements contained in Chinese traditional painting.
    4. Incorporating image emotion hints into prompts generally improves model scores, indicating that models otherwise struggle with emotional understanding and consequently misinterpret the implicit meanings of the images.

CII-Bench

Overview

We introduce the Chinese Image Implication Understanding Benchmark CII-Bench, a new benchmark measuring the higher-order perceptual, reasoning, and comprehension abilities of MLLMs when presented with complex Chinese implication images. These images, including abstract artworks, comics, and posters, possess visual implications that require an understanding of visual details and reasoning ability. CII-Bench reveals whether current MLLMs, leveraging their inherent comprehension abilities, can accurately decode the metaphors embedded within the complex and abstract information presented in these images.

CII-Bench contains a total of 698 diverse Chinese images. These images are manually collected and annotated by 30 undergraduate students from various disciplines and institutions, sourced from multiple renowned Chinese illustration websites. For each image, annotators manually design one to three multiple-choice questions, each with six options and only one correct answer. The questions cover the metaphors, symbolism, and detailed understanding of the images. The benchmark includes a total of 800 multiple-choice questions, with 765 questions used to construct the test set and 35 questions used to construct the development and validation set for few-shot tasks.
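
To make the task format concrete, the following is a minimal sketch of what a single CII-Bench record could look like. The field names and values are illustrative assumptions for exposition, not the released dataset schema.

```python
# Hypothetical CII-Bench record; field names are illustrative, not the
# released schema. Each image carries one to three such questions.
sample = {
    "image_path": "images/0042.jpg",
    "domain": "Chinese Traditional Culture",  # one of the six domains
    "emotion": "neutral",                     # positive / negative / neutral
    "rhetoric": "metaphor",                   # rhetorical device label
    "question": "这幅画最可能隐喻什么？",      # "What does this painting most likely imply?"
    "options": ["A ...", "B ...", "C ...", "D ...", "E ...", "F ..."],
    "answer": "C",                            # exactly one correct option
}
```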

Statistics

Experiment Results

Leaderboard

We conduct systematic experiments on both open-source and closed-source MLLMs using CII-Bench. For each model, we employ eight different configurations: None (zero-shot), 1-shot, 2-shot, 3-shot, CoT, Domain, Emotion, and Rhetoric. "None" represents the use of a standard prompt without any additional information. "Emotion" indicates the inclusion of information related to the emotional polarity of the image (e.g., positive, negative) in the prompt, "Domain" involves adding information about the image's domain (e.g., life, art), and "Rhetoric" refers to including details about the rhetorical devices used in the image (e.g., metaphor, contrast) in the prompt. Additionally, to verify the necessity of images in problem-solving, we select a portion of LLMs to complete tasks without image input.
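
As a rough illustration of how these configurations differ, the sketch below assembles a prompt for each setting. The prompt wording and helper function are our own assumptions for exposition, not the paper's exact prompts.

```python
def build_prompt(question: str, options: list[str],
                 config: str = "None", hint: str = "") -> str:
    """Assemble a CII-Bench-style prompt under one configuration.

    config: "None", "CoT", "Emotion", "Domain", or "Rhetoric".
    hint:   label text for the Emotion/Domain/Rhetoric settings,
            e.g. "negative", "art", "metaphor".
    """
    lines = ["请根据图片回答下面的单选题。", question]
    # Six options labeled A-F, exactly one of which is correct.
    lines += [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)]
    if config == "Emotion":
        lines.insert(0, f"提示：这幅图的情感倾向是{hint}。")
    elif config == "Domain":
        lines.insert(0, f"提示：这幅图属于{hint}领域。")
    elif config == "Rhetoric":
        lines.insert(0, f"提示：这幅图使用了{hint}的修辞手法。")
    if config == "CoT":
        lines.append("请一步步推理，最后给出你选择的选项字母。")
    else:
        lines.append("请直接给出你选择的选项字母。")
    return "\n".join(lines)
```

The 1-, 2-, and 3-shot settings instead prepend solved examples; a sketch of that message construction appears in the few-shot analysis below.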

| Model | Overall | Life | Art | Society | Politics | Environment | Chinese Traditional Culture | Positive | Negative | Neutral |
|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-Source** | | | | | | | | | | |
| GLM-4V | 60.9 | 55.0 | 59.9 | 66.5 | 66.7 | 79.3 | 55.5 | 58.5 | 64.5 | 59.4 |
| Gemini-1.5 Pro | 60.1 | 60.0 | 63.3 | 62.4 | 70.8 | 62.1 | 51.1 | 54.8 | 65.6 | 59.4 |
| Qwen-VL-MAX | 56.9 | 53.3 | 59.2 | 58.8 | 62.5 | 67.2 | 52.6 | 53.9 | 58.3 | 58.0 |
| Claude-3.5-Sonnet | 54.1 | 52.1 | 61.9 | 52.6 | 62.5 | 46.6 | 53.3 | 52.7 | 56.5 | 53.0 |
| GPT-4o | 54.1 | 54.1 | 55.8 | 52.1 | 50.0 | 63.8 | 51.8 | 51.9 | 56.2 | 54.1 |
| **Open-Source** | | | | | | | | | | |
| Qwen2-VL-72B | 64.4 | 61.7 | 61.2 | 68.0 | 79.2 | 75.9 | 59.9 | 62.7 | 63.8 | 66.4 |
| InternVL2-40B | 57.9 | 55.8 | 55.1 | 61.9 | 62.5 | 70.7 | 52.6 | 54.4 | 58.0 | 60.8 |
| InternVL2-8B | 53.1 | 49.2 | 53.1 | 55.7 | 62.5 | 63.8 | 50.4 | 50.6 | 53.3 | 55.1 |
| InternVL2-Llama3-76B | 52.9 | 50.8 | 53.7 | 51.0 | 58.3 | 67.2 | 51.1 | 54.8 | 51.8 | 52.3 |
| GLM-4V-9b | 50.3 | 46.7 | 48.3 | 53.6 | 54.2 | 62.1 | 48.2 | 51.9 | 52.9 | 46.3 |
| Qwen2-VL-7B | 49.6 | 42.5 | 51.7 | 54.1 | 62.5 | 65.5 | 44.5 | 50.2 | 47.5 | 51.2 |
| LLaVA-1.6-72B | 48.0 | 43.8 | 48.3 | 49.5 | 70.8 | 60.3 | 43.8 | 41.5 | 52.5 | 49.2 |
| LLaVA-1.6-34B | 46.0 | 40.8 | 55.1 | 42.8 | 45.8 | 62.1 | 43.1 | 44.4 | 48.2 | 45.2 |
| MiniCPM-v2.6 | 45.0 | 37.5 | 47.6 | 49.5 | 58.3 | 55.2 | 42.3 | 45.6 | 44.6 | 44.9 |
| CogVLM2-Llama3-Chinese-Chat | 43.4 | 37.1 | 48.3 | 42.3 | 54.2 | 63.8 | 40.2 | 40.3 | 45.7 | 43.8 |
| MiniCPM-Llama3-2.5 | 40.4 | 36.3 | 45.6 | 37.1 | 50.0 | 51.7 | 40.2 | 43.2 | 37.0 | 41.3 |
| idefics2-8b | 36.3 | 25.0 | 46.3 | 38.1 | 41.7 | 56.9 | 32.9 | 32.8 | 39.1 | 36.4 |
| Qwen-VL-Chat | 34.3 | 27.9 | 34.7 | 32.5 | 45.8 | 55.2 | 36.5 | 34.0 | 35.1 | 33.6 |
| **Text-Only** | | | | | | | | | | |
| Qwen2-7B-Instruct | 32.5 | 33.2 | 34.6 | 30.9 | 35.0 | 40.7 | 28.5 | 33.6 | 30.4 | 33.6 |
| DeepSeek-67B-Chat | 27.1 | 26.6 | 32.7 | 30.9 | 20.0 | 35.2 | 18.2 | 25.7 | 22.2 | 33.2 |
| Llama-3-8B-Instruct | 21.7 | 22.2 | 26.9 | 18.6 | 25.0 | 27.8 | 20.4 | 21.2 | 24.4 | 19.5 |
| **Human** | | | | | | | | | | |
| Human (avg) | 78.2 | 81.0 | 67.7 | 82.7 | 87.7 | 84.0 | 65.9 | 77.9 | 75.2 | 81.6 |
| Human (best) | 81.0 | 83.2 | 73.6 | 87.2 | 89.5 | 86.0 | 66.7 | 78.2 | 78.8 | 83.3 |

Overall results of different MLLMs, LLMs, and humans across domains and emotions. The best-performing model in each category is in bold, and the second best is underlined.

Different Prompt Skills

Analysis of Chain-of-Thought (CoT). The results indicate that CoT does not significantly improve the accuracy of the models. In some cases, particularly with smaller open-source models, accuracy even declines when CoT is used. For example, MiniCPM-v2.6 scores 45.0% without CoT, but this drops to 38.9% with CoT; similarly, LLaVA-1.6-72B's score decreases from 48.0% to 45.3%. Upon analyzing the models' responses, we find that models showing a decrease in accuracy with CoT often suffer from overinterpretation, where questions that were initially answered correctly are misinterpreted after CoT is applied. Additionally, for questions that were originally answered incorrectly, CoT does not lead to significant improvements and sometimes even causes confusion, such as selecting multiple options. However, for most models, the probability of failing to extract an answer option from the response decreases after using CoT, which explains why some models show improved accuracy with CoT.
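
A plausible mechanism for this last observation is that with CoT the final option letter tends to be stated explicitly, so a simple extractor fails less often. The sketch below is our own illustration of such an extractor, not the paper's evaluation code.

```python
import re

def extract_option(response: str) -> str | None:
    """Pull a single option letter (A-F) out of a model response.

    Returns None on failure: either no letter was found, or several
    different letters were picked (the "multiple options" confusion).
    Illustrative only; not the paper's extraction logic.
    """
    # Prefer an explicit final-answer pattern, e.g. "答案：C" or "Answer: C".
    m = re.search(r"(?:答案|Answer)\s*[:：]?\s*([A-F])\b", response)
    if m:
        return m.group(1)
    # Otherwise accept a response that mentions exactly one distinct letter.
    letters = re.findall(r"\b[A-F]\b", response)
    return letters[0] if len(set(letters)) == 1 else None
```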

Analysis of Different Types and Domains. To evaluate the impact of different label information on model accuracy, we conduct an ablation study by providing relevant label information (Emotion, Domain, Rhetoric) in the prompts. The results show that emotion labels significantly improve model accuracy, followed by domain and rhetoric labels, both of which exhibit similar effectiveness. This result aligns with human intuition. The answer options typically include negative, positive, and neutral choices. When the model receives emotional information, it can eliminate some irrelevant options, naturally leading to higher accuracy. In contrast, domain and rhetoric information generally do not effectively help the model eliminate options, resulting in more limited improvements. Additionally, from a model training perspective, models tend to have a more mature understanding of emotions, while specific nouns in rhetoric and domain labels are often custom-defined. During pre-training, the model may not have encountered a large number of descriptions for such specific nouns, making these labels less helpful in improving accuracy.

Overall results of different prompts on CII-Bench. Each label (Emotion, Domain, Rhetoric) means providing the corresponding information for the image in the prompt. The best-performing model in each category is in bold, and the second best is underlined.

Analysis of Few-shot Examples. The results indicate that few-shot examples do not improve the models' accuracy. Specifically, performance declines as the number of examples increases. This decline can be attributed to the models' inferior capabilities in handling multiple images compared to single images, leading to a decrease in accuracy with a higher number of shots. Furthermore, as the number of shots increases, the input length also extends, and the models' ability to process long texts is inadequate, resulting in suboptimal performance with long contexts.

Few-shot results of different models on the CII-Bench.
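
To see why the input grows quickly with the number of shots, consider this hedged sketch of how an n-shot multimodal request might be assembled in the common OpenAI-style chat format; the helper and field names are our own assumptions.

```python
def build_few_shot_messages(examples: list[dict], query: dict) -> list[dict]:
    """Assemble an n-shot multimodal chat request. Each demonstration
    adds one more image plus its question, options, and answer, so both
    the image count and the text length grow linearly with n."""
    messages = []
    for ex in examples:  # dev/validation items used as demonstrations
        messages.append({"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": ex["image_url"]}},
            {"type": "text", "text": ex["question_with_options"]},
        ]})
        messages.append({"role": "assistant", "content": ex["answer"]})
    messages.append({"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": query["image_url"]}},
        {"type": "text", "text": query["question_with_options"]},
    ]})
    return messages
```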

Chinese Traditional Culture Evaluation

Why Choose to Evaluate Chinese Traditional Culture? The Chinese traditional culture category is a distinctive feature of the CII-Bench dataset, and it is the category in which MLLMs consistently score the lowest. We therefore conduct a deeper evaluation of this field to analyze the extent to which MLLMs understand Chinese traditional culture. We choose to analyze MLLMs' understanding of Chinese traditional culture in depth by evaluating Chinese traditional paintings.

Why Choose Chinese Traditional Paintings? The imagery associated with Chinese traditional culture often embodies complex implications, encompassing customs, historical anecdotes, and legendary tales, making direct evaluation particularly challenging. Chinese traditional painting, intrinsically intertwined with Chinese traditional culture, offers a viable proxy for this assessment. The unique value of Chinese traditional painting lies in its embodiment of Chinese cultural connotations, aesthetic implications, and distinctive artistic expression. The core philosophical concepts of Confucianism, Taoism, and Buddhism, along with their humanistic essence, have consistently permeated the entire trajectory of Chinese painting history. Consequently, we choose to evaluate MLLMsā€™ comprehension of Chinese traditional culture through an in-depth analysis of their understanding of Chinese traditional paintings.

Evaluation Metric. Chinese traditional painting, a cornerstone of Chinese traditional culture, encompasses a rich tapestry of styles and techniques developed over millennia. These paintings are typically categorized by subject matter (e.g., landscape paintings, flower-and-bird paintings, figure paintings, and New Year paintings) or by style and technique (e.g., court paintings, meticulous brush paintings, freehand brush paintings, and color-and-ink paintings). Each category embodies unique characteristics that reflect China's artistic evolution and philosophical underpinnings. To comprehensively assess MLLMs' understanding of Chinese traditional paintings, we develop a multifaceted evaluation metric. This metric is designed to probe both the surface-level information readily apparent in the artwork and the deeper cultural and historical context that informs its creation and interpretation. Our evaluation metric encompasses five key perspectives: Surface-level Information, Aesthetic Characteristics, Brush and Ink Skills, Culture and History, and Deep Implications.

Evaluation metric and evaluation standard of Chinese traditional painting.
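
As a compact reference, the five perspectives can be written down as a rubric like the following. The guiding questions are our paraphrase of what each perspective covers, not the paper's exact standard.

```python
# The five evaluation perspectives; guiding questions are our paraphrase.
RUBRIC = {
    "Surface-level Information": "What objects, figures, and scenes are depicted?",
    "Aesthetic Characteristics": "How do composition, blank space, and artistic conception work?",
    "Brush and Ink Skills":      "What line work, ink shading, and brush techniques are used?",
    "Culture and History":       "Which customs, allusions, and historical contexts are invoked?",
    "Deep Implications":         "What metaphorical or philosophical message does the work convey?",
}
```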

LLM-Based Chinese Traditional Painting Automatic Evaluation. Our experiment utilizes the CTC domain data from CII-Bench, comprising 130 Chinese traditional paintings. We employ human-written descriptions and implication interpretations as ground truth. We use GPT-4o to generate descriptions for these images, which are subsequently scored with GPT-4o according to our evaluation standard. To validate the model's scoring efficacy, we enlist three PhD students well versed in Chinese metaphorical imagery to independently score the 130 paintings. The model-human scoring consistency reaches 98%, affirming the method's validity for assessing comprehension of Chinese traditional painting. Analysis of these results, in conjunction with our evaluation standard, reveals insights across three dimensions: overall performance, difficulty levels, and emotions. The overall score of 2.71 indicates that while MLLMs are able to observe the surface-level information of paintings, they lag far behind humans in deeply interpreting the complex cultural elements contained in Chinese traditional art. In terms of difficulty, the model's evaluation is consistent with human cognition, while in terms of emotion, the model scores higher on negative paintings, indicating that it can identify negative implications, such as using the past to satirize the present or lamenting unappreciated talent.

Overall result of Chinese traditional painting.
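
The pipeline described above could be implemented roughly as follows, using the OpenAI Python client. This is a minimal sketch under our assumptions about the prompts and scoring scale; it is not the authors' released evaluation code.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_painting(image_path: str, ground_truth: str) -> str:
    """Two-step sketch of the LLM-based evaluation: GPT-4o first
    describes the painting, then GPT-4o grades that description against
    a human-written ground truth. Prompts and the scoring scale are
    illustrative assumptions, not the paper's exact protocol."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    # Step 1: generate a description and implication interpretation.
    desc = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": "请描述这幅中国传统绘画，并解读其深层寓意。"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    ).choices[0].message.content
    # Step 2: grade the description against the ground truth along the
    # five perspectives of the evaluation standard.
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
            "请按照评测标准的五个维度（表层信息、审美特征、笔墨技法、"
            "文化历史、深层寓意）为下面的描述打分，并给出总分。\n"
            f"待评描述：{desc}\n参考答案：{ground_truth}"}],
    ).choices[0].message.content
```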

Error Analysis

To conduct a comprehensive error analysis of GPT-4o's performance on CII-Bench, we randomly select a total of 100 erroneous samples from various domains, distributed according to their proportions in the dataset. These samples are subjected to in-depth analysis by expert annotators. GPT-4o's errors can be categorized into the following types: Information Neglect, Misunderstanding of Visual Information, Over-Inference, Superficial Reasoning, and Lack of Cultural Background Knowledge.


GPT-4o error response distribution.

Error Examples

Interpretability Analysis of Chinese Image Implications

The essence of Chinese image implications lies in their deep cultural heritage and complex contextual associations, which enable them to convey profound messages through nuanced expressions. For example, in traditional Chinese art forms such as landscape and New Year paintings, the imagery transcends mere depiction of nature or daily occurrences. Instead, it embodies emotions, philosophical insights, and societal norms through metaphorical and highly symbolic expressions. Symbols like the pine tree, plum blossom, and crane are not superficial in meaning but are steeped in centuries of cultural tradition, representing resilience, purity, and longevity. However, deciphering these complex messages can be challenging, particularly for those unfamiliar with the cultural and historical narratives tied to these symbols. This contrasts with English image implications, which often convey messages through more straightforward and explicit symbolism. As a result, the interpretability of Chinese image implications depends to some extent on reconstructing and resonating with the cultural context, which is what makes them unique: their meaning is not only visual but also culturally resonant, bridging time and space. Moreover, the interpretability of Chinese image implications has taken on new dimensions in the modern era. Globalization and the surge of internet culture have intertwined foreign elements with traditional Chinese culture, birthing new symbols and implications. This intersection introduces additional layers of meaning, complicating the interpretation of traditional symbols.


Comparison of Chinese and English image implications. Chinese images often embody richer scenes and deeper implications rooted in Chinese traditional culture, compared with the straightforward and explicit symbolism of English images.

BibTeX

@misc{zhang2024mllmsunderstanddeepimplication,
      title={Can MLLMs Understand the Deep Implication Behind Chinese Images?}, 
      author={Chenhao Zhang and Xi Feng and Yuelin Bai and Xinrun Du and Jinchang Hou and Kaixin Deng and Guangzeng Han and Qinrui Li and Bingli Wang and Jiaheng Liu and Xingwei Qu and Yifei Zhang and Qixuan Zhao and Yiming Liang and Ziqiang Liu and Feiteng Fang and Min Yang and Wenhao Huang and Chenghua Lin and Ge Zhang and Shiwen Ni},
      year={2024},
      eprint={2410.13854},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.13854}, 
}