MetaVQA

A Benchmark for Embodied Scene Understanding of Vision-Language Models


What is MetaVQA?

MetaVQA is a visual question-answering benchmark for improving and evaluating the embodied scene understanding of vision-language models (VLMs).

  • MetaVQA designs a scalable pipeline to generate visual question-answering (VQA) pairs for traffic scenarios imported from various sources, including the nuScenes dataset, the Waymo Open Motion Dataset, and a synthetic dataset of safety-critical scenes.
  • MetaVQA provides a large-scale VQA dataset containing 2.7M questions over 291K frames, covering spatial, visual, dynamic, and safety-critical counterfactual scene understanding; an illustrative record is sketched after this list.
  • MetaVQA establishes baseline performance of VLMs on the dataset and shows that VLMs acquire remarkable embodied scene understanding capabilities through instruction tuning, especially when handling safety-critical situations.
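For concreteness, a single entry in such a dataset could look like the following minimal sketch. The schema and field names are our own illustration, not the dataset's actual format; only the supertypes, frame counts, camera views, and data sources come from the descriptions above.

```python
# Hypothetical MetaVQA-style record. Field names are illustrative;
# only supertypes, frame counts, views, and sources follow the text above.
example_record = {
    "question_id": "nuscenes-000123-q07",          # made-up identifier
    "supertype": "Safety",                         # Static / Dynamic / Safety
    "question": "If the ego vehicle keeps its current speed, will it "
                "collide with the gray sedan ahead?",
    "answer": "Yes",
    "num_frames": 5,                               # 1 for Static, 5 otherwise
    "views": ["front", "front_left", "front_right",
              "back", "back_left", "back_right"],
    "source": "nuScenes",
}
```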
MetaVQA's QA Generation Pipeline

We first modify a question-type-dependent context-free grammar (CFG) according to the scene graph. We then randomly sample a CFG tree and instantiate the referral in the question template by traversing its leaf nodes according to the tree. Meanwhile, we compile a set of functional programs for all <o> tokens bottom-up. Each program enforces the constraints from its child nodes and filters the objects in the scene into an object set that satisfies these constraints. When a program finishes, its return value informs the next program's constraints; once all programs have terminated, the final set contains all objects grounded by the referral, and this set is post-processed to retrieve the final answer.
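The sketch below illustrates this bottom-up filtering idea on a toy scene graph. The data structures and function names are our own simplification for exposition, not MetaVQA's actual implementation.

```python
# Minimal sketch of bottom-up referral grounding over a scene graph.
# All names here are hypothetical, not MetaVQA's actual implementation.
from dataclasses import dataclass

@dataclass
class SceneObject:
    oid: str
    category: str          # e.g. "car", "pedestrian"
    color: str

# relations[(a, b)] holds the spatial relations from object a to object b.
relations = {
    ("car_1", "ego"): {"in_front_of"},
    ("car_2", "ego"): {"behind"},
}
objects = [
    SceneObject("car_1", "car", "gray"),
    SceneObject("car_2", "car", "red"),
]

def filter_by_attributes(candidates, category=None, color=None):
    """Program for a leaf <o> token: keep objects matching the attributes."""
    return [o for o in candidates
            if (category is None or o.category == category)
            and (color is None or o.color == color)]

def filter_by_relation(candidates, relation, anchors):
    """Program for an inner <o> token: keep objects standing in the given
    relation to at least one object returned by a child program."""
    return [o for o in candidates
            if any(relation in relations.get((o.oid, a.oid), set())
                   for a in anchors)]

# Grounding "the gray car in front of the ego vehicle": each program's
# return value feeds the next program's constraints.
ego = [SceneObject("ego", "ego_vehicle", "white")]
grounded = filter_by_relation(
    filter_by_attributes(objects, category="car", color="gray"),
    "in_front_of", ego)
print([o.oid for o in grounded])  # -> ['car_1']
```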

MetaVQA's Baseline

We conduct benchmark experiments on the VQA tasks. The model receives a question and the corresponding visual data, which consist of either a single frame (Static) or five frames (Dynamic and Safety), each captured from six different views. The model is trained to predict the correct answer as a token string and is evaluated on questions from all three supertypes. For detailed explanations, please refer to the paper. Note that we are continuing to expand this baseline to more models and more tasks.
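Under this setup, assembling the visual input for one evaluation sample could look like the following sketch. The file layout and the model.generate call are assumptions for illustration; only the frame counts and the six-view layout come from the description above, and the exact-match check stands in for the paper's actual metrics.

```python
# Sketch of assembling one evaluation sample; paths and the VLM API
# are illustrative assumptions, not MetaVQA's actual interface.
VIEWS = ["front", "front_left", "front_right",
         "back", "back_left", "back_right"]

def gather_images(frame_dir, supertype):
    """Static questions use 1 frame; Dynamic and Safety use 5 frames,
    each captured from all six views."""
    num_frames = 1 if supertype == "Static" else 5
    return [f"{frame_dir}/t{t}_{view}.jpg"
            for t in range(num_frames) for view in VIEWS]

def evaluate(model, sample):
    images = gather_images(sample["frame_dir"], sample["supertype"])
    # The model answers with a token string, compared to the ground truth.
    prediction = model.generate(images=images, prompt=sample["question"])
    return prediction.strip().lower() == sample["answer"].strip().lower()
```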
Benchmarks with different data composition; ✓ marks the question supertypes included in instruction tuning. All metrics are evaluated on a held-out test set covering all supertypes. S: spatial; T: trajectory; A: attributes; N: numerical; L: logical; Safe: safety; Overall: main metric.

| Model | Train Static | Train Dynamic | Train Safety | Overall | Avg. S↑ | Avg. T↓ | Avg. A↑ | Avg. N↑ | Avg. L↑ | Avg. Safe↑ | Static S↑ | Static A↑ | Static N↑ | Static L↑ | Dyn. S↑ | Dyn. T↓ | Dyn. A↑ | Dyn. N↑ | Dyn. L↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP-2 | ✓ |  |  | 0.093 | 0.160 | N/A | 0.431 | 0.359 | 0.109 | 0.164 | 0.249 | 0.499 | 0.565 | 0.103 | 0.070 | N/A | 0.363 | 0.152 | 0.116 |
| BLIP-2 | ✓ | ✓ |  | 0.385 | 0.280 | 1.423 | 0.454 | 0.501 | 0.129 | 0.000 | 0.221 | 0.463 | 0.545 | 0.173 | 0.338 | 1.423 | 0.446 | 0.457 | 0.085 |
| BLIP-2 | ✓ | ✓ | ✓ | 0.489 | 0.204 | 4.563 | 0.420 | 0.486 | 0.005 | 0.822 | 0.182 | 0.419 | 0.557 | 0.000 | 0.225 | 4.563 | 0.422 | 0.415 | 0.010 |
| ELM | ✓ |  |  | 0.385 | 0.184 | N/A | 0.483 | 0.400 | 0.395 | 0.108 | 0.190 | 0.499 | 0.564 | 0.598 | 0.178 | N/A | 0.466 | 0.235 | 0.193 |
| ELM | ✓ | ✓ |  | 0.710 | 0.208 | 1.710 | 0.485 | 0.515 | 0.412 | 0.150 | 0.183 | 0.502 | 0.576 | 0.514 | 0.232 | 1.710 | 0.469 | 0.454 | 0.310 |
| ELM | ✓ | ✓ | ✓ | 0.897 | 0.206 | 1.662 | 0.495 | 0.516 | 0.419 | 0.825 | 0.179 | 0.515 | 0.574 | 0.538 | 0.232 | 1.662 | 0.476 | 0.459 | 0.301 |