This&That: Language-Gesture Controlled Video Generation for Robot Planning

This & That

First frame

Gesture

Our Video Generation

Put this inside that

Close this

Put this inside that

Put this near that

Comparison vs. Previous Language-Conditioned Method

Condition

AVDC (Language-Only)

Our Video Generation

Put carrot in pot or pan

Put this to there

Put the yellow cube on top of the blue cube

Put this to there

Close the drawer

Close this to there

Fold the cloth from the bottom to top

Fold this to there

Put the ball to the cup

Put this to there

Simulation Rollout Comparison

Ground Truth

Language-Only

Language-Gesture (Ours)

Stack right green cube on top of left green cube

Stack this to there

Move cyan cylinder to the right of left gray cube

Move this to there

Stack rightmost red cube on top of second leftmost red cube

Stack this to there

Move leftmost cyan cylinder behind second rightmost cyan cylinder

Move this to there

Limitation of Gesture-Only Conditioning

Condition

Gesture-Only

Language-Gesture (Ours)

Fold this to there

Limitation of Language-Only Conditioning

Condition

Language-Only

Language-Gesture (Ours)

Take the blue rectangular box and put in the top left of the table

Take this to there