December 2025
Supporting our release of LemonSlice Agents is LemonSlice-2, a novel video diffusion transformer model and inference framework that enables real-time, interactive avatar experiences. LemonSlice-2 is a 20-billion-parameter, few-step causal model that achieves a generation throughput of 20 frames per second on a single GPU. Efficient attention and caching strategies enable ultra-fast response times in an interactive setting and infinite-length videos with zero error accumulation. LemonSlice-2 supports full-body avatar generation with expressive, semantically aware gestures. It is now available to the public for general use.
Any character · Any style · Expressive gestures & scene awareness
Videos generated in real time from a single image and audio sample on one GPU.
Real-time interactions
LemonSlice-2 enables real-time interactions with any character. Below we show screen recordings of the embeddable widget powered by the model, now available for general use.
Breaking the real-time barrier
LemonSlice-2 generates video frames faster than they can be watched. Strategies we used to break the real-time barrier include causal attention, a novel training paradigm inspired by distribution matching distillation, efficient caching, CUDA graph acceleration, and quantization.
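To make one of these concrete, below is a minimal PyTorch sketch of CUDA graph capture and replay for a single generation step. The tiny linear layer is a stand-in for one few-step DiT forward pass; everything here illustrates the general technique, not LemonSlice-2's actual inference code.

```python
import torch

# Stand-in for one denoising/generation step; model and shapes are placeholders.
model = torch.nn.Linear(1024, 1024).cuda().half()
static_input = torch.randn(1, 1024, device="cuda", dtype=torch.half)
static_output = torch.empty_like(static_input)

# Warm up on a side stream so allocator state settles before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Record the step once; every replay re-runs the same kernel sequence
# with near-zero Python and kernel-launch overhead.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

def run_step(new_input: torch.Tensor) -> torch.Tensor:
    static_input.copy_(new_input)  # write into the buffer the graph reads from
    graph.replay()                 # replay the recorded kernels
    return static_output
```

At tens of frames per second, each step has a budget of a few tens of milliseconds, so eliminating launch overhead this way can be a meaningful share of the speedup.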

Ultra-fast response times
Users of LemonSlice-2 experience an average response time of 2.8 seconds. Video generation accounts for only 26% of that time (0.26 × 2.8 s ≈ 730 milliseconds).

Infinite video
As an autoregressive model, LemonSlice-2 is not limited to generating videos of a fixed length. Critically, unlike other autoregressive models, it does not suffer from error accumulation, allowing for infinite-length video generation.
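To illustrate the streaming loop, here is a minimal sketch of chunked autoregressive generation with a fixed-size rolling context. The generate_chunk method and the window size are assumptions for illustration, not the model's real interface; the rolling cache is what keeps memory constant, while avoiding error accumulation is a property of the model's training that the sketch does not capture.

```python
from collections import deque

WINDOW = 16  # past chunks kept as context; illustrative, not the real value

def stream_video(model, audio_chunks, window=WINDOW):
    """Yield frames chunk by chunk for as long as audio keeps arriving."""
    cache = deque(maxlen=window)  # rolling context: the oldest entry falls off
    for audio in audio_chunks:
        # Hypothetical call: condition the next chunk on audio + recent context.
        frames, context_entry = model.generate_chunk(audio, list(cache))
        cache.append(context_entry)
        yield frames  # memory stays constant no matter how long the stream runs
```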

Dynamic text control
LemonSlice-2 enables real-time manipulation of video content via text prompting.
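One plausible shape for this, extending the sketch above: re-encode the text condition whenever a new instruction arrives and apply it from the next chunk onward. The encode_prompt helper and the extra generate_chunk argument are, again, hypothetical.

```python
from collections import deque

def stream_with_prompts(model, audio_chunks, prompt_queue, encode_prompt, window=16):
    """Like stream_video above, but swaps the text conditioning mid-stream."""
    cache = deque(maxlen=window)
    text_emb = encode_prompt("neutral, attentive")  # assumed default style
    for audio in audio_chunks:
        if prompt_queue:
            # A new user instruction arrived; it steers every chunk from here on.
            text_emb = encode_prompt(prompt_queue.pop(0))
        frames, entry = model.generate_chunk(audio, list(cache), text_emb)
        cache.append(entry)
        yield frames
```

In a design like this, a new prompt takes effect at the next chunk boundary rather than requiring a fresh generation.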