What it explores
How accurately do mLLMs identify table cells that support a given answer?
Does a model's confidence score reliably reflect the correctness of its attribution?
What it delivers
A unified benchmark for structured data attribution that evaluates question answering, fine-grained row and column localization, and confidence calibration across text, JSON, and image-based tables—providing a clear measure of grounding and traceability.
Multimodal Large Language Models (mLLMs) are often used to answer questions over structured data such as tables in Markdown, JSON, and images. While these models can often give correct answers, users also need to know where those answers come from. In this work, we study structured data attribution (citation): the ability of a model to point to the specific rows and columns that support an answer. We evaluate several mLLMs across different table formats and prompting strategies. Our results show a clear gap between question answering and evidence attribution: although question answering accuracy remains moderate, attribution accuracy is much lower across all models, and near random for JSON inputs. We also find that models are more reliable at citing rows than columns, and struggle more with textual formats than with images. Finally, we observe notable differences across model families. Overall, our findings show that current mLLMs are unreliable at providing fine-grained, trustworthy attribution for structured data, which limits their use in applications requiring transparency and traceability.
Multimodal Large Language Models (mLLMs) can answer questions over structured data such as tables and JSON with reasonable accuracy, but they often fail to identify the exact rows and columns that support their answers. This gap between answer correctness and evidence localization limits their reliability in settings that require traceability. To address this, we introduce ViTaB-A, a benchmark designed to evaluate structured data attribution across text, JSON, and rendered table images. ViTaB-A systematically measures whether models can not only answer correctly, but also precisely ground their responses in the underlying table fields.
A comprehensive benchmark for evaluating multimodal LLMs on visual table attribution across text tables, JSON files, and rendered table images.
The first large-scale evaluation of open-source mLLMs that jointly measures table question answering, fine-grained row/column attribution, and confidence alignment.
Empirical analysis revealing systematic gaps between answer accuracy and evidence localization, highlighting weaknesses in spatial grounding and traceability.
ViTaB-A provides a structured evaluation setup consisting of question–answer pairs, table representations (JSON, Markdown, or rendered images), and ground-truth row and column citations. These inputs are evaluated under multiple prompting paradigms—including zero-shot, few-shot, and chain-of-thought—to systematically assess attribution quality. Performance is measured not only through answer and citation accuracy, but also through model certainty and calibration metrics, enabling a comprehensive analysis of grounding reliability.
ViTaB-A evaluates structured data attribution using curated question–answer pairs grounded in tabular data with explicit row and column citations. Each instance includes a table and its corresponding ground-truth supporting cells.
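For illustration, a single instance might be represented roughly as in the Python sketch below; the field names and values are assumptions for exposition, not the benchmark's released schema.

# Hypothetical ViTaB-A-style instance; field names are illustrative only.
instance = {
    "table_id": "tab_0042",
    "format": "markdown",  # one of: "json", "markdown", "image"
    "table": "| Country | Capital | Population |\n|---|---|---|\n"
             "| France | Paris | 67M |\n| Spain | Madrid | 48M |",
    "question": "What is the capital of Spain?",
    "answer": "Madrid",
    "citation": {"row": 2, "column": "Capital"},  # ground-truth supporting cell
}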
Tables are presented across multiple modalities to assess the robustness of attribution under representational and visual variation. For the image-based representation, additional visual perturbations (such as varied header colors and font styles) are applied to test generalization, as illustrated in the sketch after the lists below.
Table Formats
JSON · Markdown · Rendered Images
Image Perturbations
Red · Green · Blue · Arial · Times New Roman
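As a rough illustration of how such image perturbations could be produced (a minimal sketch, not the actual ViTaB-A rendering pipeline), the snippet below renders a small table image with a red header and a Times New Roman font using matplotlib.

# Sketch: render a table image with header-color and font perturbations.
# Illustrative only; not the pipeline used to build the benchmark.
import matplotlib.pyplot as plt

headers = ["Country", "Capital", "Population"]
rows = [["France", "Paris", "67M"], ["Spain", "Madrid", "48M"]]

fig, ax = plt.subplots(figsize=(4, 1.5))
ax.axis("off")
table = ax.table(cellText=rows, colLabels=headers, loc="center")

for (r, c), cell in table.get_celld().items():
    cell.set_text_props(fontfamily="Times New Roman")  # font perturbation
    if r == 0:  # row 0 holds the column headers
        cell.set_facecolor("red")  # header-color perturbation

fig.savefig("table_red_times.png", dpi=200, bbox_inches="tight")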
ViTaB-A evaluates multimodal LLMs along two complementary dimensions: answer and attribution correctness, and the alignment between model certainty and actual performance.
Models are evaluated on their ability to generate the correct answer and to precisely identify the supporting table cells (row and column). We measure answer accuracy, row accuracy, column accuracy, and cell accuracy to quantify fine-grained grounding performance across JSON, Markdown, and rendered table inputs.
Beyond correctness, we assess whether a model's expressed confidence aligns with its internal certainty measure. We compare verbalized certainty (self-reported confidence scores) with internal token probability-based measures to evaluate calibration and uncertainty alignment in structured attribution settings.
We further assess whether a model's confidence (both internal and verbalized) is a reliable indicator of its true accuracy. We measure this alignment with the Brier score between confidence and correctness; stronger alignment (a lower Brier score) indicates that confidence is a faithful reflection of the model's performance.
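For illustration, comparing the two confidence signals might look roughly like the sketch below; the response format, regular expression, and function names are assumptions rather than the benchmark's actual implementation.

# Hedged sketch: extract verbalized confidence and compare it with a token-probability measure.
import math
import re

def parse_verbalized_confidence(response_text):
    """Pull a self-reported score such as 'Confidence: 0.9' out of the response (assumed format)."""
    match = re.search(r"confidence\s*[:=]\s*([01](?:\.\d+)?)", response_text, re.IGNORECASE)
    return float(match.group(1)) if match else None

def token_probability_confidence(token_logprobs):
    """Geometric mean of token probabilities over the generated citation."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

verbal = parse_verbalized_confidence("Cited cell: (2, 'Capital'). Confidence: 0.9")
internal = token_probability_confidence([-0.05, -0.20, -0.10])  # made-up log-probabilities
print(verbal, round(internal, 3))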
ViTaB-A measures both grounding correctness and uncertainty reliability. We evaluate fine-grained attribution performance at the cell level and assess whether a model’s expressed confidence aligns with its true accuracy.
Percentage of predictions where both the predicted row and column exactly match the ground-truth supporting cell.
Percentage of instances where the predicted row matches the ground-truth row, isolating record-level localization ability.
Percentage of instances where the predicted column matches the ground-truth column, isolating field-level localization ability.
Probability-based confidence derived from the model’s token-level output distribution for its predicted answer or citation.
Self-reported confidence score that the model explicitly assigns to its own attribution response.
Calibration between confidence and correctness, measured with the Brier score: the mean squared difference between the predicted confidence and the binary correctness of the prediction (lower is better).
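To make these definitions concrete, a minimal metric sketch is given below; the prediction and label formats (dicts with row, column, and confidence fields) are assumptions for illustration.

# Minimal sketch of row/column/cell accuracy and the Brier score; formats are assumed.
def attribution_metrics(preds, golds):
    """preds/golds: lists of dicts with 'row' and 'column'; preds also carry 'confidence' in [0, 1]."""
    n = len(golds)
    row_acc = sum(p["row"] == g["row"] for p, g in zip(preds, golds)) / n
    col_acc = sum(p["column"] == g["column"] for p, g in zip(preds, golds)) / n
    cell_acc = sum(p["row"] == g["row"] and p["column"] == g["column"]
                   for p, g in zip(preds, golds)) / n
    # Brier score: mean squared difference between confidence and (here) cell-level correctness.
    brier = sum((p["confidence"] - float(p["row"] == g["row"] and p["column"] == g["column"])) ** 2
                for p, g in zip(preds, golds)) / n
    return {"row_acc": row_acc, "col_acc": col_acc, "cell_acc": cell_acc, "brier": brier}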
The evaluation spans four model families; each model is evaluated on the ViTaB-A benchmark under three prompting paradigms: zero-shot, few-shot, and chain-of-thought.
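The exact prompts are not reproduced here; the templates below are only a hedged sketch of what the three paradigms could look like for this task.

# Illustrative prompt templates; not the exact prompts used in the evaluation.
ZERO_SHOT = (
    "Given the table below, answer the question and cite the supporting cell "
    "as (row index, column name).\n\nTable:\n{table}\n\nQuestion: {question}"
)

FEW_SHOT = "{examples}\n\n" + ZERO_SHOT  # a few worked (table, question, answer, citation) examples

CHAIN_OF_THOUGHT = ZERO_SHOT + (
    "\n\nThink step by step: first locate the relevant row, then the relevant column, "
    "then state the answer, the citation, and a confidence score between 0 and 1."
)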
Four consistent themes emerge from our evaluation: QA accuracy remains moderate (≈50–60%) across models and formats, but attribution accuracy drops sharply—near random for JSON and only ~30% for images. Attribution is consistently stronger for image-based tables than textual formats, revealing a modality-dependent grounding gap. Models are substantially better at identifying rows than columns, exposing weaknesses in fine-grained field localization. Finally, confidence—both internal and verbalized—shows no consistent alignment with attribution accuracy, indicating that certainty is an unreliable signal of grounding quality.
Current results reveal a structural gap between answer generation and attribution quality, exposing where multimodal LLMs produce correct responses without reliable grounding in table fields. ViTaB-A provides an effective pipeline for evaluating traceability and confidence reliability in structured data settings.
@misc{alqurnawi2026vitaba,
title={ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution},
author={Yahia Alqurnawi and Preetom Biswas and Anmol Rao and Tejas Anvekar and Chitta Baral and Vivek Gupta},
year={2026},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/},
}