Don’t Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models

Sangmin Woo*, Donguk Kim*, Jaehyuk Jang*, Yubin Choi, Changick Kim
KAIST
*Indicates Equal Contribution

Abstract

This study addresses the issue observed in Large Vision Language Models (LVLMs), where excessive attention on a few image tokens, referred to as blind tokens, leads to hallucinatory responses in tasks requiring fine-grained understanding of visual objects. We found that tokens receiving lower attention weights often hold essential information for identifying nuanced object details — ranging from merely recognizing object existence to identifying their attributes (color, position, etc.) and understanding their relationships. To counteract the over-emphasis on blind tokens and to accurately respond to user queries, we introduce a technique called Attentional Vision Calibration (AVISC). During the decoding phase, AVISC identifies blind tokens by analyzing the image-related attention distribution. It then dynamically adjusts the logits for the next token prediction by contrasting the logits conditioned on the original visual tokens with those conditioned on the blind tokens. This effectively lowers the dependency on blind tokens and promotes a more balanced consideration of all tokens. We validate AVISC on benchmarks such as POPE, MME, and AMBER, where it consistently outperforms existing decoding techniques in mitigating object hallucinations in LVLMs.

Observation

Attention bias in LVLMs. Even when the image (V) does not contain information relevant to the query (Q), LVLMs exhibit a tendency for attention to be biased towards a few image tokens (i.e., blind tokens). This phenomenon is observed by averaging the attention weights across all layers when generating the first response token.
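
The layer-averaged attention profile described above can be computed as follows. This is a minimal sketch, assuming Hugging Face-style attention outputs (one tensor of shape (batch, heads, seq, seq) per layer, obtained with output_attentions=True) and that the positions of the image tokens in the sequence are known; the names image_attention_profile and image_token_range are illustrative, not from the paper's code.

import torch

def image_attention_profile(attentions, image_token_range):
    """Average the attention from the first generated response token to each image
    token, across all layers and heads.

    attentions:        tuple of per-layer tensors, each (batch, heads, seq, seq),
                       e.g., from model(..., output_attentions=True).
    image_token_range: (start, end) positions of the image tokens in the sequence.
    """
    start, end = image_token_range
    stacked = torch.stack(attentions, dim=0)        # (layers, batch, heads, seq, seq)
    # Attention of the last query position (the token about to be generated)
    # to the image-token keys.
    to_image = stacked[:, :, :, -1, start:end]      # (layers, batch, heads, n_img)
    return to_image.mean(dim=(0, 2))                # average over layers and heads -> (batch, n_img)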

Motivation

Impact of blind/non-blind tokens on prediction logits. (Left) Zeroing out image tokens whose attention weights exceed the mean plus one standard deviation (i.e., blind tokens) does not significantly affect the original prediction logits, suggesting that LVLMs may assign high attention weights to tokens that carry little object-discriminative information. Conversely, zeroing out non-blind tokens drastically disrupts the logits, often pushing them toward near 50:50 probabilities and indicating a loss of object-discriminative information. (Right) Likewise, zeroing out non-blind tokens strips previously well-classified instances of their discriminative power or yields entirely incorrect predictions, causing a significant drop in performance.
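
The probing experiment in this figure can be sketched as below. The threshold follows the mean-plus-one-standard-deviation rule described above; forward_logits is a hypothetical wrapper that maps visual token embeddings and a query to next-token logits (the actual LVLM interface differs by model), so this is an illustration of the setup rather than the authors' implementation.

import torch

def find_blind_tokens(profile):
    """Flag image tokens whose layer-averaged attention exceeds mean + one std.

    profile: (batch, n_img) attention profile, e.g., from image_attention_profile above.
    """
    mean = profile.mean(dim=-1, keepdim=True)
    std = profile.std(dim=-1, keepdim=True)
    return profile > (mean + std)                   # boolean blind-token mask, (batch, n_img)

def probe_blind_tokens(image_embeds, blind_mask, forward_logits, query):
    """Compare next-token logits when blind vs. non-blind image tokens are zeroed out.

    image_embeds:   (batch, n_img, dim) visual token embeddings fed to the LVLM.
    blind_mask:     (batch, n_img) boolean mask from find_blind_tokens.
    forward_logits: callable(image_embeds, query) -> (batch, vocab) next-token logits
                    (hypothetical helper wrapping the LVLM forward pass).
    """
    logits_full = forward_logits(image_embeds, query)

    # Zeroing out blind tokens barely moves the logits, suggesting they carry
    # little object-discriminative information.
    logits_wo_blind = forward_logits(image_embeds * (~blind_mask).unsqueeze(-1), query)

    # Zeroing out non-blind tokens collapses the logits (e.g., yes/no near 50:50).
    logits_wo_nonblind = forward_logits(image_embeds * blind_mask.unsqueeze(-1), query)

    return logits_full, logits_wo_blind, logits_wo_nonblind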

Method: AvisC

Overview

We propose a straightforward method, called AVISC, to enhance visual object understanding in LVLMs during the decoding phase. AVISC calibrates the over-emphasis on blind tokens on the fly at every token generation step, guided by the attention patterns of the image tokens in response to the given image and textual query. Importantly, AVISC operates without additional training, external models, or complex self-feedback mechanisms.
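
At a high level, the calibration can be viewed as a contrastive adjustment of the next-token logits, contrasting the logits conditioned on the original visual tokens with those conditioned only on the blind tokens. The sketch below illustrates this idea under assumed names; the simple linear form and the scaling factor alpha are simplifications, so refer to the paper for the exact formulation.

import torch

def avisc_calibrate(logits_original, logits_blind, alpha=1.0):
    """Down-weight the contribution of blind tokens by contrasting two forward passes.

    logits_original: (batch, vocab) next-token logits given the original image tokens.
    logits_blind:    (batch, vocab) next-token logits given only the blind tokens.
    alpha:           calibration strength (assumed hyperparameter).
    """
    calibrated = (1 + alpha) * logits_original - alpha * logits_blind
    return torch.softmax(calibrated, dim=-1)        # calibrated next-token distribution

Greedy or sampling-based decoding then proceeds from the calibrated distribution at each generation step.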

POPE Results

AVISC consistently outperforms base decoding as well as the contrastive decoding baselines VCD and M3ID, both of which we reimplemented under our evaluation setup.

MME-Fullset Results

AVISC achieves the top score in 7 of 14 categories with InstructBLIP and in 11 of the 14 with LLaVA-1.5. Beyond mitigating hallucinations, AVISC also improves the general capabilities of LVLMs.

MME-Hallucination Results

Our method effectively reduces hallucinations at both object and attribute levels, surpassing VCD and M3ID in Total Score.

AMBER Results

AVISC outperforms contrastive decoding baselines in both generative and discriminative tasks, achieving the highest AMBER score.

AMBER Discriminative

AVISC demonstrates superior performance overall, particularly excelling in the Existence and Action categories with both InstructBLIP and LLaVA-1.5.

Qualitative Results

BibTeX


@article{woo2024dont,
  title={Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models}, 
  author={Woo, Sangmin and Kim, Donguk and Jang, Jaehyuk and Choi, Yubin and Kim, Changick},
  journal={arXiv preprint arXiv:2405.17820},
  year={2024},
}