This&That: Language-Gesture Controlled Video Generation for Robot Planning



This & That

First frame
Gesture
Our Video Generation
Put this inside that
Close this
Put this inside that
Put this near that

Comparison vs. Previous Language-Conditioned Method

Condition
AVDC (Language-Only)
Our Video Generation
Put carrot in pot or pan
Put this to there
Put the yellow cube on top of the blue cube
Put this to there
Close the drawer
Close this to there
Fold the cloth from the bottom to top
Fold this to there
Put the ball to the cup
Put this to there

Simulation Rollout Comparison

Ground Truth
Language-Only
Language-Gesture (Ours)
Stack right green cube on top of left green cube
Stack this to there
Move cyan cylinder to the right of left gray cube
Move this to there
Stack rightmost red cube on top of second leftmost red cube
Stack this to there
Move leftmost cyan cylinder behind second rightmost cyan cylinder
Move this to there

Limitation of Gesture-Only Conditioning

Condition
Gesture-Only
Language-Gesture (Ours)
Fold this to there

Limitation of Language-Only Conditioning

Condition
Language-Only
Language-Gesture (Ours)
Take the blue rectangular box and put in the top left of the table
Take this to there