A robot composes a song and sings it
We created a robot that, for the first time, is able to learn facial lip motions for tasks such as speech and singing. We demonstrate how the robot uses this ability to articulate words in a variety of languages, and even to sing a song from its AI-generated debut album, “hello world_” (see the full album below).
Achieving realistic robot lip motion is challenging for two reasons. First, it requires specialized hardware: a flexible facial skin actuated by numerous tiny motors that can work quickly and silently in concert. Second, the specific pattern of lip dynamics is a complex function dictated by sequences of vocal sounds and phonemes. We overcame these hurdles by developing a richly actuated, flexible face and then letting the robot learn how to use that face directly by observing humans. Learn more about the process here.
Listen to the FULL album
The song above is one of our favorites, but it is just one of a dozen songs composed and performed by the AI. The songs below form the debut album of our robot _EMO_, who sings about its experiences as a new robot: learning new world models, learning to work with humans, and even learning to smile. Take a listen.
Technical Abstract
Lip motion carries outsized importance in human communication, capturing nearly half of our visual attention during conversation. Yet anthropomorphic robots often fail to achieve lip-audio synchronization, resulting in clumsy and lifeless lip behaviors. Two fundamental barriers underlie this challenge. First, robotic lips typically lack the mechanical complexity required to reproduce nuanced human mouth movements; second, existing synchronization methods depend on manually predefined movements and rules, restricting adaptability and realism. Here, we present a humanoid robot face designed to overcome these limitations, featuring soft silicone lips actuated by a ten degree-of-freedom (10-DoF) mechanism. To achieve lip synchronization without predefined movements, we use a self-supervised learning pipeline based on a Variational Autoencoder (VAE) combined with a Facial Action Transformer, enabling the robot to autonomously infer realistic lip trajectories directly from speech audio. Our experimental results suggest that this method outperforms simple heuristics, such as amplitude-based baselines, in achieving visually coherent lip-audio synchronization. Furthermore, the learned synchronization generalizes across multiple linguistic contexts, enabling robot speech articulation in ten languages unseen during training.
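For readers curious about how such a pipeline might fit together, below is a minimal sketch in Python (PyTorch). It assumes, purely for illustration, that facial states are short windows of 10-DoF actuator vectors and that audio is represented as mel-spectrogram frames; all module names, dimensions, and hyperparameters here are hypothetical and are not taken from the released codebase.

```python
# Illustrative sketch of a VAE + transformer lip-sync pipeline.
# Dimensions and module names are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class FaceVAE(nn.Module):
    """Variational autoencoder over short windows of 10-DoF facial actuation."""
    def __init__(self, dof=10, window=16, latent=32):
        super().__init__()
        d = dof * window
        self.enc = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, d))
        self.window, self.dof = window, dof

    def encode(self, x):                       # x: (batch, window, dof)
        mu, logvar = self.enc(x.flatten(1)).chunk(2, dim=-1)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def decode(self, z):                       # z: (batch, latent)
        return self.dec(z).view(-1, self.window, self.dof)

class FacialActionTransformer(nn.Module):
    """Maps a sequence of audio features to a sequence of facial-action latents."""
    def __init__(self, audio_dim=80, latent=32, d_model=256, heads=4, layers=4):
        super().__init__()
        self.proj_in = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.proj_out = nn.Linear(d_model, latent)

    def forward(self, audio):                  # audio: (batch, frames, audio_dim)
        return self.proj_out(self.encoder(self.proj_in(audio)))

# Inference sketch: audio frames -> latent facial actions -> 10-DoF lip commands.
vae = FaceVAE()
fat = FacialActionTransformer()
audio = torch.randn(1, 50, 80)                 # 50 mel-spectrogram frames (dummy input)
z_seq = fat(audio)                             # (1, 50, 32) latent action per frame
commands = vae.decode(z_seq.flatten(0, 1))     # (50, 16, 10) actuator trajectories
```

In this sketch the VAE would first be trained self-supervised on observed facial motion, and the transformer then trained to predict the resulting latent codes from the accompanying audio, so that at run time speech audio alone drives the 10-DoF lip mechanism.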
Learn more
All data supporting this study are available at the Dryad repository https://doi.org/10.5061/dryad.j6q573nrc. The codebase and trained model can be found at https://doi.org/10.5281/zenodo.17804235. No new materials were generated.
Project participants
Yuhang Hu, Jiong Lin, Judah Allen Goldfeder, Philippe M. Wyder, Yifeng Cao, Steven Tian, Yunzhe Wang, Jingran Wang, Mengmeng Wang, Jie Zeng, Cameron Mehlman, Yingke Wang, Delin Zeng, Boyuan Chen and Hod Lipson