This&That: Language-Gesture Controlled Video Generation for Robot Planning
This & That
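First frame | Gesture | Our Video Generation
(no caption) | (no caption) | Put this inside that
(no caption) | (no caption) | Close this
(no caption) | (no caption) | Put this inside that
(no caption) | (no caption) | Put this near that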
Comparison vs. Previous Language-Conditioned Method
Condition | AVDC (Language-Only) | Our Video Generation
(no caption) | Put carrot in pot or pan | Put this to there
(no caption) | Put the yellow cube on top of the blue cube | Put this to there
(no caption) | Close the drawer | Close this to there
(no caption) | Fold the cloth from the bottom to top | Fold this to there
(no caption) | Put the ball to the cup | Put this to there
Simulation Rollout Comparison
Ground Truth | Language-Only | Language-Gesture (Ours)
(no caption) | Stack right green cube on top of left green cube | Stack this to there
(no caption) | Move cyan cylinder to the right of left gray cube | Move this to there
(no caption) | Stack rightmost red cube on top of second leftmost red cube | Stack this to there
(no caption) | Move leftmost cyan cylinder behind second rightmost cyan cylinder | Move this to there
Limitation of Gesture-Only Conditioning
Condition | Gesture-Only | Language-Gesture (Ours)
(no caption) | (no caption) | Fold this to there
Limitation of Language-Only Conditioning
Condition | Language-Only | Language-Gesture (Ours)
(no caption) | Take the blue rectangular box and put in the top left of the table | Take this to there