Looking for Audio/Speech-to-Full Body Gesture Model for Stable Diffusion

Hey everyone,

I’m Darshan Hiranandani, and I’m looking for a model that can convert audio or speech input into full-body gestures. The idea is to generate a gesture mesh from the audio, then use it as a conditioning input to Stable Diffusion together with a reference image.
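To make the request concrete, here is a rough sketch of the pipeline I have in mind. Everything below is a hypothetical stub, not an existing library: the audio featuriser, the gesture model, and the rasteriser are placeholders for whatever real components people might suggest (the rasterised skeleton is the kind of image a ControlNet-style conditioning input could consume).

```python
import numpy as np

# Hypothetical pipeline sketch: audio frames -> per-frame pose keypoints
# -> a skeleton "conditioning" image for Stable Diffusion.
# All functions here are illustrative stubs, not a real library.

N_KEYPOINTS = 18  # OpenPose-style body keypoint count

def audio_to_frame_features(samples, frame_len=640):
    """Chop raw audio into frames and compute a simple energy feature per
    frame (a stand-in for real features such as MFCCs or wav2vec embeddings)."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt((frames ** 2).mean(axis=1, keepdims=True))  # (n_frames, 1)

def features_to_keypoints(features):
    """Stub gesture model: maps each frame feature to 2D keypoints.
    A trained audio-to-gesture network would replace this random projection."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((features.shape[1], N_KEYPOINTS * 2))
    return (features @ w).reshape(-1, N_KEYPOINTS, 2)

def keypoints_to_conditioning(kp_frame, size=64):
    """Rasterise one frame's keypoints into a binary skeleton image that
    could serve as a conditioning image alongside the reference photo."""
    img = np.zeros((size, size), dtype=np.uint8)
    lo, hi = kp_frame.min(), kp_frame.max()
    norm = (kp_frame - lo) / (hi - lo + 1e-8)   # normalise to [0, 1]
    xy = (norm * (size - 1)).astype(int)        # map to pixel coordinates
    img[xy[:, 1], xy[:, 0]] = 255
    return img

sr = 16000
audio = np.sin(np.linspace(0, 200 * np.pi, sr))  # 1 s of dummy audio
feats = audio_to_frame_features(audio)
poses = features_to_keypoints(feats)             # (25, 18, 2)
cond = keypoints_to_conditioning(poses[0])       # (64, 64) skeleton image
```

So the question is really about the middle stage: is there a pretrained model that does the `features_to_keypoints` step (speech to full-body pose/mesh) well enough to drive the conditioning image?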

Has anyone come across any existing models or solutions that can do this? Any suggestions, resources, or insights on how to approach this would be really helpful.

Thanks in advance for your help!
Regards,
Darshan Hiranandani