[Tech Demo] The first real-time video-audio transformer model


Lemon Slice Live:
Video call with a transformer

Technical demo of a real-time, audio-video foundation model

We're excited to announce Lemon Slice Live - a real-time video chat experience powered by our research in diffusion transformer models (DiT). For the first time, any image of a character can be immediately transformed into an interactive video call supported in 10+ languages. To accomplish this, we trained a custom DiT model that streams at 25fps. It works across styles - from photorealistic to cartoons to paintings. And, unlike other tools for live avatar chat, it does not require training or setting up a character-specific model. We see this technical demo as an important step in our research towards video models that enable interactive, expressive characters. And, it's available today for anyone to try.

The technical innovations

Achieving this demo required several key technical innovations:

1. Making it fast. To make our video generation fast, we had to design a model that makes the right trade-offs between speed and quality and apply standard distillation approaches. We trained a custom video diffusion transformer (DiT) from scratch to achieve excellent lip and facial sync to audio (comparisons to other tools here) [1]. After training our base model, we implemented a student-teacher distillation paradigm to increase its speed to 25fps with limited quality loss. Purpose-built transformer ASICs will eventually allow us to stream our video model at 4K resolution.

2. Solving the infinite video problem. Most video DiT models (Sora, Runway, Luma) generate 5-second chunks and experience quality degradation after multiple autoregressive extensions due to accumulation of generation errors [7]. We developed a temporal consistency preservation technique that maintains visual coherence across long sequences. Our technique significantly reduces artifact accumulation and allows us to generate infinitely long videos.

3. Orchestrating a complex streaming architecture with minimal latency. Enabling an end-to-end avatar video call requires several building blocks, including voice transcription, LLM inference, and text-to-speech generation, in addition to video generation. Our system currently achieves an end-to-end latency of 3-6 seconds from user input to avatar response; our target is under 2 seconds.

We use Deepgram as our AI voice partner, Daily.co and Pipecat to build a parallel processing pipeline that orchestrates everything via continuously streaming chunks, and Modal as the end-to-end compute platform. We partnered with these teams because they are the best in the world at enabling real-time interactions, and we are grateful for their support in achieving this technical demo.
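To make the orchestration concrete, here is a minimal sketch of a chunk-streaming pipeline in plain asyncio. The stage functions are hypothetical stand-ins, not the actual Deepgram, Daily.co, Pipecat, or Modal APIs.

```python
import asyncio

# Hypothetical stage functions -- stand-ins for the real ASR, LLM, TTS, and
# video-generation services; these are NOT Deepgram, Pipecat, or Modal APIs.
async def transcribe_chunk(audio_chunk: bytes) -> str:
    return "placeholder transcript"

async def run_llm(user_text: str) -> str:
    return "placeholder reply"

async def synthesize_speech(reply_text: str) -> bytes:
    return b"placeholder speech audio"

async def generate_video_frames(speech_audio: bytes) -> list:
    return [b"placeholder frame"]

async def stage(in_q: asyncio.Queue, out_q: asyncio.Queue, fn) -> None:
    """Pull items from in_q, transform them with fn, and push results to out_q."""
    while True:
        item = await in_q.get()
        await out_q.put(await fn(item))

async def run_pipeline(mic_q: asyncio.Queue, frame_q: asyncio.Queue) -> None:
    """Chain the stages with queues so each chunk flows through
    ASR -> LLM -> TTS -> video while later chunks are still arriving."""
    text_q, reply_q, speech_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        stage(mic_q, text_q, transcribe_chunk),
        stage(text_q, reply_q, run_llm),
        stage(reply_q, speech_q, synthesize_speech),
        stage(speech_q, frame_q, generate_video_frames),
    )
```

The point of the chunked design is that each stage starts working on a chunk as soon as the previous stage emits it, so transcription, LLM inference, speech synthesis, and video generation overlap instead of running back to back.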

Why it's novel

Existing real-time video chat experiences like HeyGen and Tavus require training a custom model per character based on a real-world video of the person (limited to photorealistic styles). Products like Tolan, Roblox Chatterblox, and Replika enable cartoon-style video chat via 3D game engines, which requires rigging and defining a character's motions a priori. VTubing tools like LivePortrait, Apple Memoji, and TikTok filters require a real human to drive the motions.

Instead, our character motions are powered by a real-time video transformer model. It's zero-shot: users can upload a single image - from photorealistic to cartoons to paintings - and immediately start a video chat with that character. This has not been done before.

Furthermore, Lemon Slice Live is a consumer tool first and foremost. We have put the power of this technology in the hands of consumers rather than just making it a developer tool.

Use cases

This technology opens up an exciting future for embodied AI. Use-cases include:

  • Consumer entertainment - platforms like Character AI and Talkie show the enthusiasm for interactive chats, which are currently limited to text (primarily) or voice. Video conversations are the next interface.
  • B2B use-cases, like sales calls (including live selling) and customer support
  • Education. Every child can be interactively taught by their favorite cartoon character. Each mini lesson is followed by an interactive conversation.
  • Ads that pause and turn into a conversation
  • And much more.

The vision

Every new media technology follows a pattern: it first democratizes existing media formats before metamorphosing into something entirely new. For example, TV began as radio shows where you could see the host before developing unique formats like sitcoms, sports coverage, and reality TV. YouTube (internet video) initially hosted pirated TV content before creators invented vlogging, reaction videos, and unboxings.

Generative video is at this inflection point. While it's currently democratizing traditional video workflows, a new media type will eventually emerge. We believe this new medium will be centered around interactivity. Our TV shows, movies, ads, and online courses will stop and talk to us. Our entertainment will be a mixture of passive and active experiences depending on what we're in the mood for.

At Lemon Slice, we're building the future of interactive media. If you're excited about building this future as well, we're hiring!

Architecture

Diffusion models have quickly become the leading paradigm for generative video modeling due to their ability to produce high-fidelity, temporally coherent samples [1,2]. These models iteratively refine noise toward realistic outputs, and recent advances have scaled this process using transformer architectures for both image and video domains [3,4]. As such, we chose to adopt a video diffusion transformer (DiT) architecture in which a transformer network predicts the noise at each diffusion timestep. The diffusion operates in a compressed latent space using a 3D causal variational autoencoder (VAE), which reduces the computational burden of modeling raw video pixels while preserving temporal and spatial structure [4]. Furthermore, we add audio as a conditioning signal via cross-attention, which allows our model to generate the lip-synchronization and expressive character speech that is necessary for character video calls.
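To make the conditioning path concrete, here is a minimal sketch of one such transformer block operating on flattened VAE latent tokens with audio cross-attention. The dimensions, layer layout, and class name are illustrative assumptions rather than our production architecture.

```python
import torch
import torch.nn as nn

class AudioConditionedDiTBlock(nn.Module):
    """Illustrative DiT block: self-attention over video latent tokens,
    cross-attention to audio features, then an MLP. All sizes are
    placeholder values, not the production configuration."""
    def __init__(self, dim: int = 1024, audio_dim: int = 768, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, latents: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # latents: (batch, video_tokens, dim) -- flattened 3D VAE latent patches
        # audio:   (batch, audio_tokens, audio_dim) -- per-frame audio features
        x = latents
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        a = self.audio_proj(audio)                      # project audio into model dim
        h = self.norm2(x)
        x = x + self.cross_attn(h, a, a, need_weights=False)[0]  # audio conditioning
        return x + self.mlp(self.norm3(x))
```

Cross-attending the video latent tokens to audio features in this way is what ties mouth and facial motion to the speech signal during denoising.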


Performance

Achieving real-time performance in high-fidelity video diffusion models remains an exceptionally difficult challenge. Even state-of-the-art turbo models that are specifically designed to be fast, such as Luma's Ray 2 Flash [8] and Runway's Gen-4 Turbo, as well as OpenAI's Sora, often require 5–10 seconds to generate just a single second of video. To achieve real-time video streaming, we both (1) designed our base DiT architecture to balance speed and quality and (2) further accelerated it using a student-teacher distillation strategy [5].
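As a rough illustration of the student-teacher setup (our exact recipe differs), the sketch below trains a few-step student to match the output of a frozen many-step teacher on the same noisy latents. The sample interfaces, step counts, and the simplified noising are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, latents, audio, optimizer,
                      teacher_steps: int = 32, student_steps: int = 4) -> float:
    """One illustrative distillation update: the frozen teacher denoises a
    noisy latent with many steps, the student does the same with few steps,
    and the student is trained to match the teacher's result. The .sample()
    interface and the additive noising are hypothetical simplifications."""
    noise = torch.randn_like(latents)
    noisy = latents + noise                       # simplified forward (noising) process

    with torch.no_grad():                         # teacher is frozen
        target = teacher.sample(noisy, audio, num_steps=teacher_steps)

    pred = student.sample(noisy, audio, num_steps=student_steps)
    loss = F.mse_loss(pred, target)               # match the teacher's output

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Cutting the number of denoising steps this way is what moves the model from offline generation speeds toward the 25fps streaming rate quoted above.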

A key bottleneck in causal video generation is the accumulation of temporal artifacts over time, particularly in recursive or streaming contexts [7]. To address this, we introduce a novel temporal consistency preservation mechanism that stabilizes long-term generation by aligning intermediate representations across frames. On NVIDIA A100s today, our model achieves a per-frame denoising time of 16ms (roughly 60fps) and an overall streaming throughput, including VAE encoding and decoding, of 25fps at 256px. Read more about how we achieved this using Modal and Pipecat on Modal's blog [coming soon].
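The consistency mechanism itself is part of our research, but the loop below sketches the general streaming pattern it plugs into: each chunk of latent frames is generated conditioned on the tail of the previous chunk, which is exactly where uncorrected errors would otherwise accumulate. The generate_chunk interface and the context length are hypothetical.

```python
def stream_video(model, audio_chunks, context_frames: int = 8):
    """Autoregressive chunked generation: condition each new chunk of latent
    frames on the last `context_frames` of the previous chunk.
    `model.generate_chunk` is a hypothetical interface, not a real API."""
    context = None
    for audio in audio_chunks:
        # Generate the next chunk of latent frames, conditioned on the
        # previous chunk's tail so motion and identity stay coherent.
        chunk = model.generate_chunk(audio=audio, context=context)
        context = chunk[:, -context_frames:]   # keep the tail as conditioning
        yield chunk                            # decode with the VAE downstream
```

Without some form of consistency preservation, small errors in each chunk's tail get baked into the next chunk's conditioning and compound over time, which is the artifact accumulation described in [7].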

Video throughput comparison chart

References

Join the Lemon Slice Live API Waitlist

Want to integrate Lemon Slice's real-time video generation into your own applications? We're opening access to our API for select partners and developers.

Apply for API Access →