ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution

Attribution MLLM Uncertainty Benchmark Suite
Arizona State University
*Equal contribution.
ICLR 2026 (Under Review)

What it explores

How accurately do mLLMs identify table cells that support a given answer?


Does a model's confidence score reliably reflect the correctness of its attribution?

What it delivers

A unified benchmark for structured data attribution that evaluates question answering, fine-grained row and column localization, and confidence calibration across text, JSON, and image-based tables—providing a clear measure of grounding and traceability.

Abstract

Multimodal Large Language Models (mLLMs) are often used to answer questions over structured data such as tables presented in Markdown, JSON, and images. While these models can often give correct answers, users also need to know where those answers come from. In this work, we study structured data attribution (citation): a model's ability to point to the specific rows and columns that support an answer. We evaluate several mLLMs across different table formats and prompting strategies. Our results show a clear gap between question answering and evidence attribution. While question answering accuracy remains moderate, attribution accuracy is much lower across all models, falling to near random for JSON inputs. We also find that models are more reliable at citing rows than columns, and struggle more with textual formats than with images. Finally, we observe notable differences across model families. Overall, our findings show that current mLLMs are unreliable at providing fine-grained, trustworthy attribution for structured data, which limits their use in applications requiring transparency and traceability.

Introduction

Multimodal Large Language Models (mLLMs) can answer questions over structured data such as tables and JSON with reasonable accuracy, but they often fail to identify the exact rows and columns that support their answers. This gap between answer correctness and evidence localization limits their reliability in settings that require traceability. To address this, we introduce ViTaB-A, a benchmark designed to evaluate structured data attribution across text, JSON, and rendered table images. ViTaB-A systematically measures whether models can not only answer correctly, but also precisely ground their responses in the underlying table fields.

Major Contributions

Representation-Wise Evaluation

A comprehensive benchmark for evaluating multimodal LLMs on visual table attribution across text tables, JSON files, and rendered table images.

Unified Tasks + Metrics

The first large-scale evaluation of open-source mLLMs that jointly measures table question answering, fine-grained row/column attribution, and confidence alignment.

Spatial Grounding Analysis

Empirical analysis revealing systematic gaps between answer accuracy and evidence localization, highlighting weaknesses in spatial grounding and traceability.

ViTaB-A Pipeline

ViTaB-A provides a structured evaluation setup consisting of question–answer pairs, table representations (JSON, Markdown, or rendered images), and ground-truth row and column citations. These inputs are evaluated under multiple prompting paradigms—including zero-shot, few-shot, and chain-of-thought—to systematically assess attribution quality. Performance is measured not only through answer and citation accuracy, but also through model certainty and calibration metrics, enabling a comprehensive analysis of grounding reliability.
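
To make the setup concrete, a zero-shot prompt for the attribution task might look like the sketch below. The wording and the JSON output schema are illustrative assumptions, not the exact prompt used by ViTaB-A.

# Illustrative zero-shot prompt for joint answering and cell attribution.
# The placeholder names (fmt, table, question) and the output schema are
# assumptions for this sketch, not the benchmark's actual prompt.
PROMPT_TEMPLATE = """You are given a table and a question.
Answer the question and cite the single cell that supports your answer.

Table ({fmt}):
{table}

Question: {question}

Respond in JSON with the keys:
  "answer": the answer string,
  "row": the 0-based index of the supporting row,
  "column": the name of the supporting column,
  "confidence": your confidence in the citation, from 0 to 1."""

prompt = PROMPT_TEMPLATE.format(
    fmt="markdown",
    table="| Country | Capital |\n| France | Paris |\n| Japan | Tokyo |",
    question="What is the capital of Japan?",
)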

Framework overview of the ViTaB-A Dataset Generation and Evaluation Pipeline.

Benchmark Details

ViTaB-A evaluates structured data attribution using curated question–answer pairs grounded in tabular data with explicit row and column citations. Each instance includes a table and its corresponding ground-truth supporting cells.
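
For illustration, a single benchmark instance can be pictured roughly as the record below. The field names and values are hypothetical and only sketch the structure described above, not the exact ViTaB-A schema.

# Hypothetical benchmark instance: a table, a question-answer pair, and the
# ground-truth supporting cell given as a row index and column name.
instance = {
    "table": {
        "columns": ["Country", "Capital", "Population (M)"],
        "rows": [
            ["France", "Paris", 68.2],
            ["Japan", "Tokyo", 124.5],
        ],
    },
    "format": "markdown",  # one of: json, markdown, image
    "question": "What is the capital of Japan?",
    "answer": "Tokyo",
    "citation": {"row": 1, "column": "Capital"},  # 0-based row index
}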

Tables are presented across multiple modalities to assess the robustness of attribution under representational and visual variation. For the image-based representation, additional visual perturbations (such as varying header color and font style) are applied for a more generalized evaluation.

Table Formats

JSON · Markdown · Rendered Images


Image Perturbation

Header colors: Red · Green · Blue | Fonts: Arial · Times New Roman
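
A minimal sketch of how such perturbed table images could be rendered, assuming matplotlib; the helper function, colors, and fonts below are illustrative rather than the benchmark's actual rendering code.

# Render a small table as an image with a colored header row and a chosen
# font, mimicking the header-color and font-style perturbations listed above.
import matplotlib.pyplot as plt

def render_table_image(rows, header, header_color="#c62828",
                       font_family="Arial", path="table.png"):
    fig, ax = plt.subplots(figsize=(4, 1 + 0.4 * len(rows)))
    ax.axis("off")
    table = ax.table(cellText=rows, colLabels=header,
                     loc="center", cellLoc="center")
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    for (r, c), cell in table.get_celld().items():
        # Fonts must be installed on the system; matplotlib falls back otherwise.
        cell.get_text().set_fontfamily(font_family)
        if r == 0:  # with colLabels, row 0 is the header row
            cell.set_facecolor(header_color)
            cell.get_text().set_color("white")
    fig.savefig(path, bbox_inches="tight", dpi=200)
    plt.close(fig)

render_table_image(
    rows=[["France", "Paris", "68.2"], ["Japan", "Tokyo", "124.5"]],
    header=["Country", "Capital", "Population (M)"],
)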

Evaluation Tasks

ViTaB-A evaluates multimodal LLMs along two complementary dimensions: answer and attribution correctness, and the alignment of model certainty with that correctness.

Task 1: QA & Attribution Accuracy

Models are evaluated on their ability to generate the correct answer and to precisely identify the supporting table cells (row and column). We measure answer accuracy, row accuracy, column accuracy, and cell accuracy to quantify fine-grained grounding performance across JSON, Markdown, and rendered table inputs.
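
A minimal sketch of how these four metrics can be computed, assuming each prediction and gold record carries an answer string plus a row index and column name; the field names are hypothetical.

# Answer, row, column, and cell accuracy over paired predictions and gold labels.
def evaluate(predictions, gold):
    n = len(gold)
    ans = row = col = cell = 0
    for p, g in zip(predictions, gold):
        ans += p["answer"].strip().lower() == g["answer"].strip().lower()
        row_ok = p["row"] == g["row"]
        col_ok = p["column"] == g["column"]
        row += row_ok
        col += col_ok
        cell += row_ok and col_ok  # both row and column must match
    return {
        "answer_acc": ans / n,
        "row_acc": row / n,
        "column_acc": col / n,
        "cell_acc": cell / n,
    }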

Task 2: Internal & Verbalized Certainty

Beyond correctness, we assess whether a model's expressed confidence aligns with its internal certainty measure. We compare verbalized certainty (self-reported confidence scores) with internal token probability-based measures to evaluate calibration and uncertainty alignment in structured attribution settings.
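
As one concrete instantiation, a token-probability-based internal confidence can be obtained by aggregating the per-token probabilities of the generated answer or citation span. The mean aggregation below is an assumption for illustration, not necessarily the exact measure used in the benchmark.

# Average per-token probability of the generated span, given its token
# log-probabilities (e.g., from a Hugging Face generate() call with
# output_scores=True).
import math

def internal_confidence(token_logprobs):
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

print(internal_confidence([-0.1, -0.3, -0.05]))  # ~0.87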

Task 3: Confidence-Accuracy Alignment

Beyond correctness, we assess whether a model's confidence (both internal and verbalized) is a reliable indicator of its true accuracy. We measure alignment using the Brier Score between confidence and correctness. Higher alignment indicates that confidence is a faithful reflection of the model's actual performance.

Properties Measured

ViTaB-A measures both grounding correctness and uncertainty reliability. We evaluate fine-grained attribution performance at the cell level and assess whether a model’s expressed confidence aligns with its true accuracy.

Cell Accuracy

Percentage of predictions where both the predicted row and column exactly match the ground-truth supporting cell.

Row Accuracy

Percentage of instances where the predicted row matches the ground-truth row, isolating record-level localization ability.

Column Accuracy

Percentage of instances where the predicted column matches the ground-truth column, isolating field-level localization ability.

Internal Confidence

Probability-based confidence derived from the model’s token-level output distribution for its predicted answer or citation.

Verbalized Confidence

Self-reported confidence score that the model explicitly assigns to its own attribution response.

Alignment Score

Calibration between confidence and correctness, measured via the Brier Score: the mean squared difference between a model's stated confidence and its per-instance correctness.
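
Concretely, with per-instance confidence c_i in [0, 1] and correctness y_i in {0, 1} over N instances, the Brier Score can be written as

\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} \left( c_i - y_i \right)^2

Lower values indicate better calibration; for reference, a constant confidence of 0.5 yields a Brier Score of 0.25 regardless of correctness, so scores well below 0.25 indicate meaningful confidence-accuracy alignment.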

Experiments & Setup

The evaluation spans four model families; each model is evaluated on the ViTaB-A benchmark under three prompting paradigms: zero-shot, few-shot, and chain-of-thought.

Model Families

Qwen3-VL → 2B · 4B · 8B · 32B
Gemma3 → 4B · 12B · 27B
Molmo2 → 4B · 8B
InternVL-3.5 → 4B · 8B · 14B · 38B

Results & Discussion

Four consistent themes emerge from our evaluation. First, QA accuracy remains moderate (≈50–60%) across models and formats, but attribution accuracy drops sharply, falling to near random for JSON and only ~30% for images. Second, attribution is consistently stronger for image-based tables than for textual formats, revealing a modality-dependent grounding gap. Third, models are substantially better at identifying rows than columns, exposing weaknesses in fine-grained field localization. Finally, confidence, both internal and verbalized, shows no consistent alignment with attribution accuracy, indicating that certainty is an unreliable signal of grounding quality.

Implications

Current results reveal a structural gap between answer generation and attribution quality, exposing where multimodal LLMs produce correct responses without reliable grounding in table fields. ViTaB-A provides an effective pipeline for evaluating traceability and confidence reliability in structured data settings.

Key findings: QA–Attribution Gap · Localization Failures · Unreliable Confidence Measures
Table 1: Model Accuracy in QA vs Attribution (%) Across Prompting Strategies and Open-source Models (4B). Note: green indicates the overall best model and red indicates the worst.
Table 2: Row vs Column Accuracy (%) Across Modalities, Prompting Strategies, and Open-source Models (4B parameters). Note: green indicates the overall best model and red indicates the worst.
Table 3: Confidence-Accuracy Correlation for Internal and Verbalized Confidence Across Multiple Modalities. Results show no direct correlation.
Figure 2: Radar charts comparing model families across attribution accuracy, QA accuracy, and confidence gap (1 - |verbal - internal|) under different prompting strategies. InternVL-3.5 shows the best performance.

BibTeX

@misc{,
      title={ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution}, 
      author={Yahia Alqurnawi and Preetom Biswas and Anmol Rao and Tejas Anvekar and Chitta Baral and Vivek Gupta},
      year={2026},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/}, 
}