MetaVQA
A Benchmark for Embodied Scene Understanding of Vision-Language Models
What is MetaVQA?
MetaVQA is a visual question-answering (VQA) benchmark for improving and evaluating the embodied scene understanding of vision-language models (VLMs).
MetaVQA's QA Generation Pipeline
We first modify a question-type-dependent context-free grammar (CFG) according to the scene graph. We then randomly sample a CFG tree and instantiate the referral in the question template by traversing the tree's leaf nodes. Meanwhile, we compile a set of functional programs for each <o> token, bottom-up. Each program enforces the constraints from its child nodes and filters the objects in the scene into a set that satisfies those constraints. When a program finishes, its return value informs the next program's constraints. Once all programs have terminated, the final set contains all objects grounded by the referral; this set is post-processed to retrieve the final answer.
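The bottom-up filtering step can be illustrated with a minimal sketch. The `Obj`, `scene`, and `filter_by` names below are illustrative assumptions, not MetaVQA's actual API; the point is only how each program narrows the candidate set before handing it to the next.

```python
# Toy sketch of bottom-up program-based grounding (illustrative, not MetaVQA's API).
from dataclasses import dataclass

@dataclass
class Obj:
    color: str
    type: str
    pos: str  # spatial relation to ego, e.g. "front", "left"

# Toy scene: the objects a scene graph would expose.
scene = [
    Obj("red", "car", "front"),
    Obj("blue", "car", "left"),
    Obj("red", "truck", "front"),
]

def filter_by(objs, **constraints):
    """One 'functional program': keep objects satisfying all constraints."""
    return [o for o in objs
            if all(getattr(o, k) == v for k, v in constraints.items())]

# A referral like "the red object in front" compiles to a chain of programs;
# each program's output constrains the next one, bottom-up along the CFG tree.
candidates = filter_by(scene, color="red")       # leaf constraint: color
candidates = filter_by(candidates, pos="front")  # parent constraint: position
print([o.type for o in candidates])              # grounded object set: ['car', 'truck']
```

The final `candidates` set is what post-processing would turn into the answer string.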
MetaVQA's Baseline
We conduct benchmark experiments on VQA tasks. The model receives questions and corresponding visual data, consisting of either a single frame (Static) or five frames (Dynamic and Safety) captured from six camera views. The model is trained to predict the correct answer as token strings and is evaluated on questions from all three supertypes. For detailed explanations, please refer to the paper. Note that we are continuing to expand this baseline to more models and more tasks.
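As a rough sketch, the visual input described above could be assembled into a single frames-by-views tensor. The frame and view counts follow the setup above; the array layout, image size, and `make_input` helper are assumptions for illustration.

```python
# Sketch of assembling the visual input (layout and sizes are assumptions).
import numpy as np

VIEWS = 6            # six camera views per timestep
H, W, C = 224, 224, 3

def make_input(num_frames):
    """Stack frames x views into one tensor of shape (T, V, H, W, C)."""
    return np.zeros((num_frames, VIEWS, H, W, C), dtype=np.uint8)

static_input = make_input(1)   # Static questions: a single frame
dynamic_input = make_input(5)  # Dynamic/Safety questions: five frames
print(static_input.shape, dynamic_input.shape)
```

In practice each slot would hold the rendered camera image rather than zeros.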
The Static/Dynamic/Safety checkmark columns indicate the training data composition; the metric columns are grouped into Averaged, Static, and Dynamic.

| Model | Static | Dynamic | Safety | Overall | Avg. S↑ | Avg. T↓ | Avg. A↑ | Avg. N↑ | Avg. L↑ | Avg. Safe↑ | Static S↑ | Static A↑ | Static N↑ | Static L↑ | Dyn. S↑ | Dyn. T↓ | Dyn. A↑ | Dyn. N↑ | Dyn. L↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP-2 | ✓ | | | 0.093 | 0.160 | N/A | 0.431 | 0.359 | 0.109 | 0.164 | 0.249 | 0.499 | 0.565 | 0.103 | 0.070 | N/A | 0.363 | 0.152 | 0.116 |
| BLIP-2 | ✓ | ✓ | | 0.385 | 0.280 | 1.423 | 0.454 | 0.501 | 0.129 | 0.000 | 0.221 | 0.463 | 0.545 | 0.173 | 0.338 | 1.423 | 0.446 | 0.457 | 0.085 |
| BLIP-2 | ✓ | ✓ | ✓ | 0.489 | 0.204 | 4.563 | 0.420 | 0.486 | 0.005 | 0.822 | 0.182 | 0.419 | 0.557 | 0.000 | 0.225 | 4.563 | 0.422 | 0.415 | 0.010 |
| ELM | ✓ | | | 0.385 | 0.184 | N/A | 0.483 | 0.400 | 0.395 | 0.108 | 0.190 | 0.499 | 0.564 | 0.598 | 0.178 | N/A | 0.466 | 0.235 | 0.193 |
| ELM | ✓ | ✓ | | 0.710 | 0.208 | 1.710 | 0.485 | 0.515 | 0.412 | 0.150 | 0.183 | 0.502 | 0.576 | 0.514 | 0.232 | 1.710 | 0.469 | 0.454 | 0.310 |
| ELM | ✓ | ✓ | ✓ | 0.897 | 0.206 | 1.662 | 0.495 | 0.516 | 0.419 | 0.825 | 0.179 | 0.515 | 0.574 | 0.538 | 0.232 | 1.662 | 0.476 | 0.459 | 0.301 |