The Veo 3.1 Prompt Formula: Five Parts to Cinematic AI Video

PromptVerse Editorial

May 2, 2026 · 7 min read

We've been generating Veo 3.1 clips through Higgsfield (veo3_1 and veo3_1_lite) almost every day for a month, and the pattern keeps holding: when our prompts go from random sentence soup to a clean five-part structure, the hit rate roughly doubles. So this post is the Veo 3.1 prompt formula as we actually use it — not a theoretical guide, but the working template we drop into our notes app and edit per shot.

If you're new to Veo 3.1, the headline is that it's Google's flagship text-to-video model with synchronized dialogue, ambient sound, and music inside one generation. The trade-off is that it's pickier about prompt structure than Kling 3.0 or Seedance 2.0, and rewards you heavily for using actual cinematography vocabulary. Treat it like a director, not an artist.

The five-part Veo 3.1 prompt formula

Almost every consistent Veo 3.1 prompt we ship follows the same five-part order. We borrowed the structure from Google's own prompting guide and Kling's "Director's Mindset" framework, then trimmed it down until it fit on one mental sticky note:

1. Cinematography: camera, lens, movement, framing
2. Subject: who or what is in the shot, described concretely
3. Action: what the subject is doing, in present-tense verbs
4. Context: environment, time of day, weather, props
5. Style and ambiance: color, mood, texture, audio cues

You don't need all five parts every time, but whatever you include should stay in that order. In our experience, Veo 3.1 reads prompts top-down and gives the earliest tokens disproportionate weight, so leading with cinematography gives you the most leverage over the final frame.

A working example, lightly cleaned up from a recent shoot:

Slow dolly-in on a 35mm anamorphic lens, eye-level framing, subtle handheld micro-shake. A weathered fisherman in a yellow rain slicker, his beard wet with sea spray. He pulls a heavy net over the railing of a wooden trawler, muscles tensing. Storm-grey ocean at dawn, horizon barely visible through fog, the boat rocking. Cool teal-and-amber color grade, sodium-lamp practicals, the muffled sound of waves and a distant foghorn.

That single paragraph cleanly maps to the five parts. We rarely write longer than this.
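To make the mapping explicit, here is a minimal Python sketch that assembles a prompt from the five parts in their fixed order. The `ShotPrompt` class and its field names are our own illustration, not part of Veo 3.1 or Higgsfield; it only builds a string.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    """Hypothetical helper: five optional parts, always emitted in formula order."""
    cinematography: str = ""
    subject: str = ""
    action: str = ""
    context: str = ""
    style: str = ""

    def render(self) -> str:
        ordered = [self.cinematography, self.subject, self.action,
                   self.context, self.style]
        sentences = []
        for part in ordered:
            text = part.strip()
            if not text:
                continue  # a part may be skipped, but the order never changes
            if text[-1] not in ".!?":
                text += "."  # make each part its own sentence
            sentences.append(text)
        return " ".join(sentences)


fisherman = ShotPrompt(
    cinematography="Slow dolly-in on a 35mm anamorphic lens, eye-level framing, "
                   "subtle handheld micro-shake",
    subject="A weathered fisherman in a yellow rain slicker, his beard wet with sea spray",
    action="He pulls a heavy net over the railing of a wooden trawler, muscles tensing",
    context="Storm-grey ocean at dawn, horizon barely visible through fog, the boat rocking",
    style="Cool teal-and-amber color grade, sodium-lamp practicals, "
          "the muffled sound of waves and a distant foghorn",
)
print(fisherman.render())
```

Leaving a field empty drops that sentence without disturbing the order of the rest.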

Why cinematography goes first

The single biggest lever in any Veo 3.1 prompt is the camera language. Same scene, different angle, different emotional output. Eye-level reads neutral and grounded. Low-angle makes the subject feel powerful. High-angle makes them feel small. Overhead reads clinical or omniscient. Dutch angle reads anxious.

We keep a short cheat list pinned in our prompt notes:

Static angles: eye-level, low-angle, high-angle, overhead, Dutch angle, over-the-shoulder.
Movement: dolly-in, dolly-out, tracking shot, crane up, crane down, slow pan left/right, whip pan, push-in, pull-out, orbit, arc.
Lens language: 35mm anamorphic, 85mm portrait, wide-angle, fisheye, macro, shallow depth of field, deep focus, rack focus.

If you want to feel the difference, run the same Veo 3.1 prompt twice: once with "eye-level static shot, 50mm lens" and once with "low-angle dolly-in, anamorphic lens". The model isn't inventing a different scene. It's translating your direction into a different mood.

Pro tip: Treat camera instructions as their own sentence. Veo 3.1 parses "The camera slowly dollies in." more reliably than "…with the camera slowly dollying in as it…". Standalone sentences win.
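One way to set up that side-by-side test is to hold the scene text constant and swap only a standalone camera sentence. The sketch below is plain string handling; `with_camera` and the variant labels are hypothetical names we made up for illustration.

```python
def with_camera(camera_sentence: str, scene: str) -> str:
    """Prefix a scene with a camera instruction kept as its own sentence."""
    camera = camera_sentence.strip().rstrip(".") + "."
    return f"{camera} {scene.strip()}"


# The scene stays identical, so any change in mood comes from the camera line.
SCENE = ("A weathered fisherman pulls a heavy net over the railing "
         "of a wooden trawler at dawn.")

variants = {
    "neutral": with_camera("Eye-level static shot, 50mm lens", SCENE),
    "powerful": with_camera("Low-angle dolly-in, anamorphic lens", SCENE),
}

for label, prompt in variants.items():
    print(f"[{label}] {prompt}")
```

Because only one variable changes between the two prompts, it is easier to attribute differences in the output to the camera direction.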

What goes in each Veo 3.1 prompt section

Here's the slightly-longer breakdown of each part of the Veo 3.1 prompt formula and the mistakes we keep watching newer creators make.

Cinematography

Always include at least: shot size (close-up, medium, wide), angle (eye-level, low, high), and movement (static, dolly, tracking). Optional but useful: lens length and depth of field. Avoid stacking three different camera moves into one shot — pick one.
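A rough way to check a camera sentence against this minimum is a keyword scan. The sketch below is an assumption-heavy illustration: the term lists are taken from the cheat list in this post, the function name is ours, and exact keyword matching will miss verb forms such as "dollies" or "pans".

```python
import re

# Terms lifted from the cheat list above; extend them for your own vocabulary.
SHOT_SIZES = ("close-up", "medium", "wide")
ANGLES = ("eye-level", "low-angle", "high-angle", "overhead",
          "dutch angle", "over-the-shoulder")
MOVES = ("static", "dolly", "tracking", "crane", "pan",
         "push-in", "pull-out", "orbit", "arc")


def _hits(text: str, terms: tuple) -> list:
    """Return the terms that appear in text as whole words (case-insensitive)."""
    lowered = text.lower()
    return [t for t in terms if re.search(rf"\b{re.escape(t)}\b", lowered)]


def check_cinematography(sentence: str) -> list:
    """List problems: a missing shot size, angle, or movement, or stacked moves."""
    problems = []
    if not _hits(sentence, SHOT_SIZES):
        problems.append("missing shot size")
    if not _hits(sentence, ANGLES):
        problems.append("missing angle")
    moves = _hits(sentence, MOVES)
    if not moves:
        problems.append("missing movement")
    elif len(moves) > 1:
        problems.append("stacked moves: " + ", ".join(moves))
    return problems


print(check_cinematography("Wide shot, eye-level, slow dolly-in."))
print(check_cinematography("Close-up, low-angle, dolly-in then crane up and orbit."))
```

Treat the result as a reminder rather than a verdict: a sentence can pass this scan and still read ambiguously to the model.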

Subject

Be specific about the subject and let the model fill in fewer blanks. "A woman in a red coat" is weaker than "A 60-year-old woman in a faded scarlet wool coat, silver hair tied back, her cheeks flushed from the cold." Veo 3.1 is good at people, animals, and vehicles. It's still iffy on hands holding small objects, very tight crowds, and reflective glass over text.

Action

Use simple present-tense verbs: "She pulls the door open," not "She is pulling the door." One primary action per shot. If you need two actions, make it a beat: "She pulls the door open, then steps inside." Don't ask for actions that defy physics; the model is trained to respect natural motion and will quietly normalize anything goofy.

Context

This is environment, time of day, weather, surrounding props, and any ambient detail that affects light. It's also where you set spatial relationships — "behind her, a neon sign flickers" will actually place the sign behind her in the depth map. Veo 3.1 cares about spatial words: behind, beside, in front of, above, below, distant, foreground.

Style and ambiance

Color grade, lighting style, era, film-stock references, and audio. Veo 3.1 generates audio as part of the scene, so describing it as part of the prompt actually works. Mention the sound of rain, a foghorn, distant traffic, or a synth pad and the model will try to render it. Don't write dialogue here; if you want spoken lines, use Veo 3.1's dialogue feature directly.

A copy-paste Veo 3.1 prompt template

Here's the bare scaffold we paste into a fresh note for every new shot:

[Shot size + angle + lens + movement].
[Subject described concretely].
[Subject performs primary action].
[Environment, time of day, weather, key props in spatial relation to subject].
[Color grade, lighting style, era reference, ambient sound].

Three to six sentences, total length 100–150 words. Anything longer and the model starts averaging your competing instructions; anything shorter and you're back to lottery tickets.
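If you want to check a draft against those limits before spending credits, a small counter is enough. This is a sketch under simple assumptions: sentences are split on terminal punctuation followed by whitespace, words are split on whitespace, and the function names are ours.

```python
import re


def prompt_stats(prompt: str) -> tuple:
    """Return (sentence_count, word_count) using naive splitting."""
    # Split after ., ! or ? when followed by whitespace, so "Veo 3.1" stays intact.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", prompt.strip()) if s]
    return len(sentences), len(prompt.split())


def length_warnings(prompt: str, min_sentences: int = 3, max_sentences: int = 6,
                    min_words: int = 100, max_words: int = 150) -> list:
    """Compare a prompt against the sentence and word ranges given above."""
    n_sentences, n_words = prompt_stats(prompt)
    warnings = []
    if not min_sentences <= n_sentences <= max_sentences:
        warnings.append(
            f"{n_sentences} sentences (target {min_sentences}-{max_sentences})")
    if not min_words <= n_words <= max_words:
        warnings.append(f"{n_words} words (target {min_words}-{max_words})")
    return warnings


draft = "Wide shot, eye-level, static. A red fox sits in snow."
print(prompt_stats(draft))
print(length_warnings(draft))
```

The thresholds are keyword arguments so you can loosen them if your own results point to a different sweet spot.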

Three Veo 3.1 prompt mistakes to stop making

We've been collecting these from our own bad outputs:

Stacking camera moves. "The camera dollies in and tilts down and rotates" usually produces a confused single move. Pick one and let the rest stay implicit.
Vague subjects with rich environments. If your subject is "a person" and your context is two paragraphs of detail, Veo 3.1 will make the environment beautiful and the person generic. Balance the two.
Forgetting params.generate_audio: true. This is a Higgsfield thing more than a Veo thing, but the default is silent. If you want the synchronized dialogue and ambient sound that's the whole point of Veo 3.1, you have to flip the flag. We've burned credits on this more than once.
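A cheap guard against the silent-clip mistake is to build the request in one place and warn when the prompt describes sound but the flag is off. The sketch below only constructs a dictionary; the overall payload shape, the `prompt` and `model` keys, and both function names are assumptions for illustration. Only `params.generate_audio` and the model identifiers come from this post, so check the real request format before relying on it.

```python
from typing import Optional

# Crude, illustrative list of words that suggest the prompt expects audio.
AUDIO_WORDS = ("sound", "foghorn", "music", "dialogue")


def build_request(prompt: str, model: str = "veo3_1_lite",
                  generate_audio: bool = True) -> dict:
    """Hypothetical payload shape; audio defaults to on so it cannot be forgotten."""
    return {
        "model": model,
        "prompt": prompt,
        "params": {"generate_audio": generate_audio},
    }


def audio_flag_warning(request: dict) -> Optional[str]:
    """Warn if the prompt mentions audio while generate_audio is off or missing."""
    wants_audio = any(w in request["prompt"].lower() for w in AUDIO_WORDS)
    flag_on = request.get("params", {}).get("generate_audio", False)
    if wants_audio and not flag_on:
        return "prompt describes audio but params.generate_audio is not true"
    return None


silent = build_request("Waves crash, the sound of a distant foghorn.",
                       generate_audio=False)
print(audio_flag_warning(silent))
```

Defaulting the flag to on in your own wrapper means a silent render has to be an explicit choice.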

When to reach for Veo 3.1 vs other Higgsfield models

Veo 3.1 isn't the right tool for every shot. Our quick mental routing:

Veo 3.1 — synced dialogue, photoreal humans, naturalistic physics, scenes where audio matters. Use veo3_1_lite for text-to-video and veo3_1 if you have a reference image.
Kling 3.0 — multi-shot sequences, longer continuous clips, native multilingual dialogue.
Seedance 2.0 — strong motion coherence, cinematic camera control, reliable character consistency across shots.
Wan 2.7 — fast iterations and stylized looks where photorealism isn't the goal.
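The routing list above can be written down as a small decision function. This is a sketch of our mental shortcut, not an official selector: the `Shot` fields and the priority order of the checks are assumptions, and the non-Veo models are returned as display names because this post does not give their API identifiers.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """Hypothetical description of what a shot needs."""
    has_reference_image: bool = False
    multi_shot: bool = False
    needs_character_consistency: bool = False
    stylized: bool = False


def pick_model(shot: Shot) -> str:
    """Route a shot to a model; the order of checks is our own assumption."""
    if shot.stylized:
        return "Wan 2.7"        # stylized looks, fast iterations
    if shot.multi_shot:
        return "Kling 3.0"      # multi-shot sequences, longer clips
    if shot.needs_character_consistency:
        return "Seedance 2.0"   # consistent characters across shots
    # Default to Veo 3.1: image-conditioned if a reference exists, lite otherwise.
    return "veo3_1" if shot.has_reference_image else "veo3_1_lite"


print(pick_model(Shot()))
print(pick_model(Shot(has_reference_image=True)))
print(pick_model(Shot(multi_shot=True)))
```

Writing the routing down this way also makes it obvious when two needs conflict and a shot should be split in two.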

The Veo 3.1 prompt formula above ports almost cleanly to Kling and Seedance with small tweaks. The structure is the durable skill; the model name is the disposable one.

Wrapping up

The whole point of having a Veo 3.1 prompt formula is to stop guessing. Five parts, in order: cinematography, subject, action, context, style. One primary camera move. Concrete subjects. Present-tense actions. Spatial words for context. Audio cues in the style line. Generate, watch, adjust one variable at a time.

Do that for a week and your hit rate will look completely different. We promise.
