Creative Machines Lab - Columbia University

Robot Lip Sync

A robot learns to sync its lips to speech and song

We humans attribute outsized importance to facial gestures in general, and to lip motion in particular. While we may forgive a funny walking gait or an awkward hand motion, we remain unforgiving of even the slightest facial misstep. This sensitivity underlies the so-called “Uncanny Valley.” Robots often look lifeless, even creepy, because their lips don't move. But that is about to change.
We created a robot that, for the first time, is able to learn facial lip motions for tasks such as speech and singing. We demonstrated the robot articulating words in a variety of languages, and even singing a song from its AI-generated debut album “hello world_.”

The robot acquired this ability through observational learning rather than through hand-coded rules. It first learned how to use its 26 facial motors by watching its own reflection in a mirror, then learned to imitate human lip motion by watching hours of YouTube videos.
The robot _EMO_ singing a song from its debut album “hello world_”

How did the robot learn to lip sync?

Achieving realistic robot lip motion is challenging for two reasons. First, it requires specialized hardware: a flexible facial skin actuated by numerous tiny motors that can work quickly and silently in concert. Second, the specific pattern of lip dynamics is a complex function of the sequence of vocal sounds and phonemes being produced.

Human faces are animated by dozens of muscles that lie just beneath a soft skin and sync naturally with the vocal cords as we speak. By contrast, humanoid faces are mostly rigid, operate with relatively few degrees of freedom, and their lip movement is choreographed according to rigid, predefined rules. The resulting motion is stilted, unnatural, and uncanny.
We overcame these hurdles by developing a richly actuated, flexible face and then allowing the robot to learn how to use it directly by observing humans. First, we placed the robotic face in front of a mirror so that the robot could learn how its own face moves in response to motor activity. Like a child making faces in a mirror for the first time, the robot made thousands of random facial expressions and lip gestures. Over time, it learned how to move its motors to achieve particular facial appearances, an approach akin to a “vision-language-action” (VLA) model.
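To make the mirror stage concrete, here is a minimal Python sketch of this kind of motor babbling followed by fitting an inverse model. The helpers send_motor_commands() and observe_landmarks() are hypothetical stand-ins for the lab's motor interface and vision pipeline; none of the names, sizes, or architectures below come from the paper.

```python
# A hedged sketch of mirror-based self-modeling: babble random motor
# commands, record the resulting face appearance, then fit an inverse
# model mapping appearance -> motor commands.
import numpy as np
import torch
import torch.nn as nn

N_MOTORS = 26          # facial motors, per the article
N_LANDMARKS = 2 * 20   # (x, y) for 20 hypothetical mouth landmarks

def send_motor_commands(cmd: np.ndarray) -> None:
    """Placeholder: would drive the physical face motors."""

def observe_landmarks() -> np.ndarray:
    """Placeholder: would detect mouth landmarks in the mirror image."""
    return np.random.rand(N_LANDMARKS)  # stand-in for a vision pipeline

# 1) Babble: issue random motor commands and record the resulting face.
commands, appearances = [], []
for _ in range(5000):
    cmd = np.random.uniform(0.0, 1.0, N_MOTORS)
    send_motor_commands(cmd)
    appearances.append(observe_landmarks())
    commands.append(cmd)

X = torch.tensor(np.array(appearances), dtype=torch.float32)  # what it saw
Y = torch.tensor(np.array(commands), dtype=torch.float32)     # what it did

# 2) Fit the inverse model: desired appearance -> motor commands.
inverse_model = nn.Sequential(
    nn.Linear(N_LANDMARKS, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, N_MOTORS), nn.Sigmoid(),  # commands normalized to [0, 1]
)
opt = torch.optim.Adam(inverse_model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(inverse_model(X), Y)
    loss.backward()
    opt.step()
```

Once fitted, the inverse model can be queried with any target mouth shape to recover motor commands that approximate it.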
Then we played the robot recorded videos of humans talking and singing, giving the AI that drives the robot an opportunity to learn exactly how human mouths move in concert with the sounds they emit.
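The second stage can be sketched the same way. Assuming mouth landmarks have already been extracted per video frame with an off-the-shelf face tracker and paired with the audio track, a sequence model learns to predict the landmarks from audio alone. The small GRU below is a deliberately simple stand-in for the Facial Action Transformer described in the technical abstract; all shapes and hyperparameters are illustrative.

```python
# A hedged sketch: learn audio -> lip-landmark trajectories from video.
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 16_000
N_MELS, N_LANDMARKS = 80, 40

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_mels=N_MELS, hop_length=SAMPLE_RATE // 25
)  # ~25 feature frames per second, matching a 25 fps video

waveform = torch.randn(1, SAMPLE_RATE * 4)        # stand-in 4 s audio clip
audio_feats = mel(waveform).squeeze(0).T          # (frames, N_MELS)
lip_targets = torch.randn(audio_feats.shape[0], N_LANDMARKS)  # stand-in labels

class AudioToLips(nn.Module):
    """Maps each audio frame to the lip-landmark pose at that instant."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS, 128, batch_first=True)
        self.head = nn.Linear(128, N_LANDMARKS)

    def forward(self, feats):                     # feats: (T, N_MELS)
        h, _ = self.rnn(feats.unsqueeze(0))
        return self.head(h).squeeze(0)            # (T, N_LANDMARKS)

model = AudioToLips()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(audio_feats), lip_targets)
    loss.backward()
    opt.step()
```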

With these two models in hand, the robot's AI could translate audio directly into lip motor actions. We tested this ability using a variety of sounds, languages, and contexts, as well as some songs. Without any knowledge of what the audio clips meant, the robot was able to move its lips in sync.
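At playback time, the two learned pieces chain together: audio in, motor commands out. A sketch, reusing the hypothetical mel, model, inverse_model, and send_motor_commands from the snippets above:

```python
# Inference chain: audio -> lip landmarks -> motor commands.
import torch

@torch.no_grad()
def lip_sync(waveform: torch.Tensor) -> None:
    """Drive the robot face from raw audio, one frame at a time."""
    feats = mel(waveform).squeeze(0).T    # audio -> per-frame features
    lips = model(feats)                   # features -> lip landmarks
    motor_cmds = inverse_model(lips)      # landmarks -> 26 motor commands
    for cmd in motor_cmds:                # stream at the video frame rate
        send_motor_commands(cmd.numpy())
```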

Technical Abstract

Lip motion carries outsized importance in human communication, capturing nearly half of our visual attention during conversation. Yet anthropomorphic robots often fail to achieve lip-audio synchronization, resulting in clumsy and lifeless lip behaviors. Two fundamental barriers underlie this challenge. First, robotic lips typically lack the mechanical complexity required to reproduce nuanced human mouth movements; second, existing synchronization methods depend on manually predefined movements and rules, restricting adaptability and realism. Here, we present a humanoid robot face designed to overcome these limitations, featuring soft silicone lips actuated by a ten degree-of-freedom (10-DoF) mechanism. To achieve lip synchronization without predefined movements, we use a self-supervised learning pipeline based on a Variational Autoencoder (VAE) combined with a Facial Action Transformer, enabling the robot to autonomously infer realistic lip trajectories directly from speech audio. Our experimental results suggest that this method outperforms simple heuristics, such as amplitude-based baselines, in achieving visually coherent lip-audio synchronization. Furthermore, the learned synchronization generalizes across multiple linguistic contexts, enabling robot speech articulation in ten languages unseen during training.
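For readers who want a concrete feel for the architecture the abstract names, below is a minimal PyTorch sketch of a lip-pose VAE paired with a transformer that maps audio features to latent lip trajectories. Every dimension, layer count, and design detail here is an illustrative guess, not the published configuration (which is available at the Zenodo link below).

```python
# A hedged sketch of a VAE + transformer lip-synchronization pipeline.
import torch
import torch.nn as nn

LIP_DIM, LATENT, AUDIO_DIM = 40, 16, 80   # illustrative sizes

class LipVAE(nn.Module):
    """Compresses a single lip pose into a compact latent code."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(LIP_DIM, 2 * LATENT)   # outputs (mu, log_var)
        self.dec = nn.Linear(LATENT, LIP_DIM)

    def encode(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterize
        return z, mu, log_var

class FacialActionTransformer(nn.Module):
    """Maps a sequence of audio frames to a latent lip trajectory."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(AUDIO_DIM, 128)
        layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(128, LATENT)

    def forward(self, audio):                        # audio: (B, T, AUDIO_DIM)
        return self.head(self.encoder(self.proj(audio)))  # (B, T, LATENT)

vae, fat = LipVAE(), FacialActionTransformer()
audio = torch.randn(1, 100, AUDIO_DIM)               # one 4 s clip at 25 fps
lip_traj = vae.dec(fat(audio))                       # (B, T, LIP_DIM) lip poses
```

One plausible training recipe consistent with the abstract: fit the VAE first with the standard reconstruction-plus-KL objective on observed lip poses, then train the transformer to regress the encoder's latents from paired audio, so that at run time audio alone yields a decodable lip trajectory.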

Learn More

All data supporting this study are available at the Dryad repository https://doi.org/10.5061/dryad.j6q573nrc. The codebase and trained model can be found at https://doi.org/10.5281/zenodo.17804235. No new materials were generated.

Press

This humanoid robot learned realistic lip movements by watching YouTube | TechSpot
A Robot Learns to Lip Sync | Columbia Engineering
This Lip-Syncing Robot Face Could Help Future Bots Talk Like Us - CNET
The Quest for the Perfect Lip-Synching Robot
Humanoid robot masters lip-sync, predicts face reaction with new system
The breakthrough that makes robot faces feel less creepy | ScienceDaily
A robot learned to lip sync after watching hours of YouTube videos | The Independent
A Real-Life Robot Learned to Lip-Sync Thanks to AI – Scientific Inquirer
Columbia’s EMO Robot Learns Realistic Lip Sync
Mirror Training: How a Humanoid Robot Learned to Lip Sync Using AI and a Reflection - ScienceBlog.com
A Robot Learns to Lip Sync - Technology Organization

Project Participants

Yuhang Hu, Jiong Lin, Judah Allen Goldfeder, Philippe M. Wyder, Yifeng Cao, Steven Tian, Yunzhe Wang, Jingran Wang, Mengmeng Wang, Jie Zeng, Cameron Mehlman, Yingke Wang, Delin Zeng, Boyuan Chen and Hod Lipson

Related Publications

Learning Realistic Lip Motions for Humanoid Face Robots