VidTune: Creating Video Soundtracks with
Generative Music and Contextual Thumbnails

1UC Berkeley   2Adobe Research
Accepted to CHI 2026

What If You Could Visually Skim Generative Music With Thumbnails?





VidTune provides prompt suggestions based on the input video and expands users' prompt to generate a diverse set of candidate music tracks. Each track is presented with contextual thumbnails for efficient music preview in-context.

Abstract

Music shapes the tone of videos, yet creators have difficulty finding soundtracks that match their video's mood and narrative. Recent text-to-music models let creators generate music from text prompts, but our formative study (N = 8) shows that creators struggle to construct diverse prompts, quickly review and compare tracks, and understand each track's impact on the video. We present VidTune, a system that supports soundtrack creation by generating diverse music options from a creator's prompt and producing contextual thumbnails for rapid review. VidTune extracts representative video subjects to ground thumbnails in context, maps each track's valence and energy onto visual cues like color and brightness, and depicts prominent genres and instruments. Creators can refine tracks with natural language edits, which VidTune expands into new generations. In a controlled user study (N = 12) and an exploratory case study (N = 6), participants found VidTune helpful for efficiently reviewing and comparing music options and described the process as playful and enriching.


VidTune Interface

Full VidTune interface showing the video player and timeline (A–C), prompt suggestion panel (D), four candidate music tracks with animated contextual thumbnails (E), hover keywords (F), fit check (G), natural-language edit controls (H), filter and search panel (I), and the music map (J)

The video player and timeline let users choose a scene to add music to and preview candidates in sync with the video (A–C). VidTune surfaces prompt suggestions based on the selected scene and user goal (D), then expands the prompt to generate four candidates with contextual thumbnails (E). On hover, users see reusable prompt keywords (F) and a fit check (G). Users can iterate with natural-language edits (H), organize generations via filter and search (I), or view a music map for similarity-based exploration (J).





Four animated contextual thumbnail frames sampled at 0.5 FPS, showing how the illustrated character switches instruments (A, B), movements change with tempo (C), and visual effects reflect energy level (D)

VidTune's animated thumbnail sampled at 0.5 FPS. As the thumbnail animates, the character switches to instruments that enter later in the audio (A, B), movements convey rough tempo (C), and visual effects convey energy (D).
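As a rough illustration of the caption above: sampling at 0.5 FPS means one frame every 2 seconds, and each frame can show whichever instrument entered the audio most recently. The sketch below is not VidTune's implementation. The function names and the instrument entry times are hypothetical, chosen only to make the idea concrete.

```python
def frame_times(duration_s, fps=0.5):
    """Return the timestamps (in seconds) of frames sampled at `fps`."""
    step = 1.0 / fps  # 0.5 FPS -> one frame every 2 seconds
    times = []
    t = 0.0
    while t < duration_s:
        times.append(t)
        t += step
    return times


def instrument_at(t, entries):
    """Return the instrument that most recently entered at time `t`.

    `entries` is a list of (start_time_s, instrument_name) pairs.
    This format is an assumption made for the example.
    """
    current = None
    for start, name in sorted(entries):
        if start <= t:
            current = name
    return current


# Hypothetical entry times for an 8-second track.
entries = [(0.0, "piano"), (3.0, "drums"), (5.5, "violin")]
schedule = [(t, instrument_at(t, entries)) for t in frame_times(8.0)]
print(schedule)
```

With these made-up entry times, the character would hold the piano in the first two frames, then switch to drums and finally to violin.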





2D music map scatterplot where each dot represents a generated track positioned by audio-embedding similarity; a dashed line traces the sequence of tracks added to the current video, with a multi-select region highlighted for the Blend feature

VidTune's Music Map arranges generated tracks in a 2D space by audio-embedding similarity, revealing families of related music at a glance. A dashed path shows the sequence of tracks added to the current video. Users can multi-select tracks and use Blend to generate similar variations.
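To make the layout idea concrete, here is a minimal sketch of placing tracks in 2D so that tracks with similar embeddings land near each other. The page does not say which projection VidTune uses. This example substitutes plain PCA, computed with power iteration in standard-library Python, and the four-dimensional embeddings are invented toy values.

```python
import math
import random


def center(rows):
    """Subtract the per-dimension mean from every embedding."""
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
    return [[r[d] - means[d] for d in range(dims)] for r in rows]


def top_component(rows, iters=200, seed=0):
    """Find the direction of greatest variance by power iteration."""
    rng = random.Random(seed)
    dims = len(rows[0])
    v = [rng.uniform(-1, 1) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        # Compute X^T (X v) without forming the covariance matrix.
        proj = [sum(r[d] * v[d] for d in range(dims)) for r in rows]
        w = [sum(p * r[d] for p, r in zip(proj, rows)) for d in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance left in the data
        v = [x / norm for x in w]
    return v


def project_2d(embeddings):
    """Project embeddings onto their first two principal components."""
    rows = center(embeddings)
    axes = []
    for _ in range(2):
        v = top_component(rows)
        scores = [sum(a * b for a, b in zip(r, v)) for r in rows]
        axes.append(scores)
        # Deflate: remove the component just found before finding the next.
        rows = [[a - s * b for a, b in zip(r, v)] for r, s in zip(rows, scores)]
    return list(zip(axes[0], axes[1]))


# Toy embeddings: tracks 0 and 1 are similar, as are tracks 2 and 3.
embeddings = [
    [1.0, 0.0, 0.0, 0.1],
    [0.9, 0.1, 0.0, 0.1],
    [0.0, 1.0, 0.9, 0.0],
    [0.0, 0.9, 1.0, 0.1],
]
points = project_2d(embeddings)
```

In this toy data the two similar pairs end up as two separate clusters on the map, which is the "families of related music" effect the caption describes. A nonlinear method such as t-SNE or UMAP could be swapped in for larger collections.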




Pipeline

Pipeline diagram: a user video feeds into subject extraction and description to form a base prompt; generated music is analyzed for valence, energy, genre, and instruments which are mapped to visual cues; base and style prompts are fused to produce static and animated contextual thumbnails

From a user video, VidTune extracts an anchor subject and description to form a base prompt grounded in the footage. It analyzes the generated music to infer musical attributes, maps them to visual cues, and fuses these into a style prompt. The fused prompt is used to generate static and animated thumbnails that reflect both the video context and the music.
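The mapping and fusion steps described above can be sketched as follows. This is an illustrative guess, not the authors' implementation. The 0-to-1 attribute scales, the 0.5 thresholds, the pairing of valence with color and energy with brightness, the wording of each visual cue, and all names are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MusicAttributes:
    valence: float  # assumed scale: 0 = negative mood, 1 = positive mood
    energy: float   # assumed scale: 0 = calm, 1 = intense
    genres: list = field(default_factory=list)
    instruments: list = field(default_factory=list)


def style_prompt(attrs):
    """Turn inferred musical attributes into visual cues (assumed mapping)."""
    palette = "warm, saturated colors" if attrs.valence >= 0.5 else "cool, muted colors"
    lighting = "bright lighting" if attrs.energy >= 0.5 else "dim, soft lighting"
    parts = [palette, lighting]
    if attrs.genres:
        parts.append(", ".join(attrs.genres) + " aesthetic")
    if attrs.instruments:
        parts.append("playing " + " and ".join(attrs.instruments))
    return ", ".join(parts)


def fuse(base_prompt, style):
    """Combine the video-grounded base prompt with the music-derived style prompt."""
    return f"{base_prompt}, {style}"


# Hypothetical base prompt from the subject extraction step.
base = "a corgi running on a beach"
attrs = MusicAttributes(valence=0.8, energy=0.2, genres=["jazz"], instruments=["piano"])
thumbnail_prompt = fuse(base, style_prompt(attrs))
print(thumbnail_prompt)
```

Keeping the base prompt fixed across tracks while only the style prompt varies is what would let thumbnails for the same video stay visually comparable.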


VidTune Thumbnails

Each row shows 4 alternative AI-generated music tracks for the same video, illustrated with VidTune’s thumbnails.
Click the speaker icon to unmute a track and compare how different music changes the feel of the video.

"Positive, optimistic music for busy morning scenes"

"Vibrant music for Paris at night"

"Children's music"

"Happy goodbye"

"Suspenseful chase music"

Preview Video



Demo Video

BibTeX

@article{huh2026vidtune,
  title={VidTune: Creating Video Soundtracks with Generative Music and Contextual Thumbnails},
  author={Huh, Mina and Fraser, Ailie C and Li, Dingzeyu and Dontcheva, Mira and Wang, Bryan},
  journal={arXiv preprint arXiv:2601.12180},
  year={2026}
}