Scene Graph Question Answering (SGQA)

Reasoning over scene graphs to answer a given question

In SGQA, a model answers a question by reasoning over scene graphs. Formally, given a question and scene graphs \( G = (V, E) \), the model must predict an answer that corresponds to an element of \( V \). As shown in the figure, questions in our benchmark require logically or temporally connecting a sequence of actions or object state changes, and can be solved by hopping across multiple triplets.
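To make "hopping across multiple triplets" concrete, the following is a minimal sketch and not the benchmark's own method. It assumes each scene graph is a list of (subject, relation, object) triplets, ordered in time. The graphs, the relation names, and the helpers `objects_of` and `answer_tool_used_after` are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical sequence of scene graphs, one per action, in temporal order.
scene_graphs: List[List[Triplet]] = [
    [("person", "open", "drawer")],
    [("person", "pick-up", "knife"), ("knife", "from", "drawer")],
    [("person", "cut", "bread"), ("bread", "with", "knife")],
]


def objects_of(graph: List[Triplet], subject: str, relation: str) -> List[str]:
    """Return every object reachable from `subject` via `relation` in one graph."""
    return [o for s, r, o in graph if s == subject and r == relation]


def answer_tool_used_after(
    graphs: List[List[Triplet]],
    first_action: str,
    first_object: str,
    later_action: str,
) -> Optional[str]:
    """Answer 'After <first_action> <first_object>, what was used to <later_action>?'"""
    # Hop 1: locate the step where the first action happens.
    start = None
    for i, graph in enumerate(graphs):
        if first_object in objects_of(graph, "person", first_action):
            start = i
            break
    if start is None:
        return None

    # Hop 2: find a later step containing the later action.
    for graph in graphs[start + 1:]:
        for target in objects_of(graph, "person", later_action):
            # Hop 3: follow the 'with' edge from the acted-on object to the tool.
            tools = objects_of(graph, target, "with")
            if tools:
                return tools[0]  # the answer is a node in V
    return None


answer = answer_tool_used_after(scene_graphs, "open", "drawer", "cut")
```

Here the answer is reached only by chaining three triplets across two different graphs, which is the kind of temporal connection the questions are meant to test.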

Example

Scene Graph Description Selection (SGDS)

Selecting the correct description of a scene graph among distractors

The goal of this task is to accurately interpret a scene graph within a given context and identify the correct description among distractors. We formulate SGDS as a multiple-choice problem consisting of the graph-based context \( C^g_i = (G_1, \dots, G_{i-1}) \), a scene graph \( G_i \), and five candidate descriptions, exactly one of which is correct. The model should be able to track nodes and edges from \( C^g_i \) and ensure that all elements in \( G_i \) are accurately represented in the chosen description. For SGDS, we use scene graphs representing a single action.
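One way to picture an SGDS instance is sketched below. This is an illustrative data layout with a naive word-overlap baseline, not the benchmark's actual format or evaluation method. The class `SGDSInstance`, the helpers `coverage` and `pick`, and the example graphs and descriptions are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class SGDSInstance:
    context: List[List[Triplet]]  # C^g_i = (G_1, ..., G_{i-1})
    target: List[Triplet]         # G_i, a single action
    candidates: List[str]         # five candidate descriptions
    answer: int                   # index of the correct description


def coverage(graph: List[Triplet], description: str) -> float:
    """Fraction of the graph's distinct elements that appear as words in the description."""
    words = set(description.lower().replace(".", "").split())
    elements = {element for triplet in graph for element in triplet}
    if not elements:
        return 0.0
    return sum(element in words for element in elements) / len(elements)


def pick(instance: SGDSInstance) -> int:
    """Naive baseline: choose the candidate that mentions the most elements of G_i."""
    scores = [coverage(instance.target, c) for c in instance.candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical instance, invented for illustration only.
example = SGDSInstance(
    context=[[("person", "open", "drawer")]],
    target=[("person", "pick-up", "knife"), ("knife", "from", "drawer")],
    candidates=[
        "The person closes the drawer.",
        "The person picks up a spoon from the drawer.",
        "The person takes the knife from the drawer.",
        "The person puts the knife on the table.",
        "The person opens the cabinet.",
    ],
    answer=2,
)

prediction = pick(example)
```

A surface-overlap score like this ignores the context \( C^g_i \) and cannot distinguish paraphrased relations, which is why the task is framed as requiring a model to track nodes and edges rather than match words.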

Example 1

Example 2