Speech Bubble Experiment
Why
Closed captions have been around for a while; it's time to try to innovate on them.
What
Speech bubbles are an experimental way to display what is commonly known as captions or subtitles. Using results from multiple AI services, we can programmatically add speech bubbles to videos. This is a super rough proof of concept (POC); the intent is just to show what is possible.
Video
TODO: The bubble jerkiness can be fixed; a smooth, elastic-like bubble movement would be ideal.
How
We take the results from the ContentAI extractors aws_transcribe, aws_closed_captions, and aws_faces and run them through the Box2D physics engine.
ContentAI Extractors
- aws_transcribe: Converts speech to text.
- aws_closed_captions (experiment): Takes the results from the aws_transcribe extractor and creates an .srt closed caption file.
- aws_faces: Detects the bounding box, attributes, emotions, landmarks, quality, and pose for each face.
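The exact JSON shape of these results isn't shown in this post, but assuming the extractors pass through the underlying AWS Rekognition and Transcribe payloads, the data the experiment needs boils down to two time-indexed streams: a face bounding box per sampled frame and a start/end time per spoken word. A minimal Python sketch, with field names assumed from the standard AWS response formats:

```python
import json


def load_face_positions(aws_faces_json_path):
    """Return (timestamp_seconds, bounding_box) tuples.

    Assumes a Rekognition-style payload: Faces[*].Timestamp in milliseconds
    and Faces[*].Face.BoundingBox with Left/Top/Width/Height as 0-1 ratios.
    """
    with open(aws_faces_json_path) as f:
        data = json.load(f)
    return [(face["Timestamp"] / 1000.0, face["Face"]["BoundingBox"])
            for face in data.get("Faces", [])]


def load_word_timings(aws_transcribe_json_path):
    """Return (start_seconds, end_seconds, word) tuples.

    Assumes a Transcribe-style payload: results.items[*] with start_time and
    end_time strings and alternatives[0].content for each pronounced word.
    """
    with open(aws_transcribe_json_path) as f:
        data = json.load(f)
    words = []
    for item in data["results"]["items"]:
        if item.get("type") != "pronunciation":
            continue  # punctuation items carry no timestamps
        words.append((float(item["start_time"]),
                      float(item["end_time"]),
                      item["alternatives"][0]["content"]))
    return words
```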
Box2D
We attach physics objects to the characters' faces.
- Face Body - attach a face body to the character's face. Five times per second, we take the position from aws_faces and update our physics face body.
- Speech Bubble Body - we render the caption text onto the bubble body.
- Fixed Revolute Joint - we connect the face and speech bubble bodies with a joint. There are various types of Box2D joints, but for this POC we try to keep the two bodies an equal distance apart using a revolute joint. As we update the position of the face body with the results from ContentAI, the physics engine updates the speech bubble body's position (see the sketch after this list).
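A minimal sketch of that setup, assuming the Python pybox2d binding and the load_face_positions helper from the earlier sketch; the video dimensions, scale factor, and the choice of a kinematic face body are illustrative, not necessarily what the POC does:

```python
from Box2D import b2World  # pybox2d

# No gravity: the bubble should float alongside the face, not fall.
world = b2World(gravity=(0, 0), doSleep=True)

# Face body: kinematic, because its position is driven by extractor data
# rather than by the simulation itself.
face_body = world.CreateKinematicBody(position=(0, 0))
face_body.CreateCircleFixture(radius=0.5, density=1.0)

# Speech bubble body: dynamic, so the engine moves it for us. The caption
# text gets rendered onto this box when each frame is drawn.
bubble_body = world.CreateDynamicBody(position=(2, 0))
bubble_body.CreatePolygonFixture(box=(1.0, 0.5), density=0.1, friction=0.3)

# Revolute joint anchored at the face: the bubble stays a fixed distance
# from the anchor point while remaining free to swing around it.
world.CreateRevoluteJoint(bodyA=face_body, bodyB=bubble_body,
                          anchor=face_body.worldCenter)


def bbox_to_world(box, video_w=1920, video_h=1080, scale=0.01):
    """Map a normalized bounding box (Left/Top/Width/Height in 0-1) to
    Box2D world coordinates; the scale and y-flip are illustrative."""
    cx = (box["Left"] + box["Width"] / 2.0) * video_w * scale
    cy = (1.0 - (box["Top"] + box["Height"] / 2.0)) * video_h * scale
    return cx, cy


TIME_STEP = 1.0 / 30.0  # physics step
SAMPLE_RATE = 5         # face position samples per second, as in the POC

for _t, box in load_face_positions("aws_faces.json"):
    target_x, target_y = bbox_to_world(box)
    # Drive the kinematic face body toward the new sample over the next
    # 1/SAMPLE_RATE seconds via its velocity instead of teleporting it.
    face_body.linearVelocity = ((target_x - face_body.position.x) * SAMPLE_RATE,
                                (target_y - face_body.position.y) * SAMPLE_RATE)
    for _ in range(int(1.0 / (SAMPLE_RATE * TIME_STEP))):
        world.Step(TIME_STEP, 6, 2)
    # bubble_body.position and bubble_body.angle now say where to draw the
    # speech bubble for this sample.
```

Driving the face body through its velocity rather than snapping its position is also one way to work toward the smoother, elastic-like bubble motion the TODO above asks for.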

Future Ideas
- Customize bubble - give brands and franchises the ability to customize their bubbles.
- Multiple language support - our extractors can translate into up to 32 different languages for global reach.
- Modify text size - based on the character's emotion, we can alter the text/font of individual words.
- Highlight spoken word - with millisecond-level word timings we can highlight words in the speech bubble as they are spoken.
- Typewriter effect - with the same timings we can append the next word to the end of the line as it's being spoken (a sketch of both ideas follows this list).
- Multiple speaker support - the experiment only supports one face.
- Enhance bubble movement - the current bubble's movement is limited and jerky; both shortcomings can be improved.
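As one illustration of the highlight and typewriter ideas, here is a small sketch on top of the word timings returned by the earlier load_word_timings helper; the function names are hypothetical:

```python
def words_spoken_so_far(word_timings, playback_seconds):
    """Typewriter effect: every word whose start time has already passed."""
    return [word for start, _end, word in word_timings
            if start <= playback_seconds]


def current_word(word_timings, playback_seconds):
    """Highlight effect: the word whose start/end interval contains now."""
    for start, end, word in word_timings:
        if start <= playback_seconds <= end:
            return word
    return None
```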
Use Cases
- Social media clips - viewers tend to scroll past content within 3-5 seconds and typically have the audio turned off; speech bubbles could help grab their attention.
- Top Player Support - we own the video player our customer-facing products use, so we could add speech bubbles to it.