Pikumo · probe image-gen pipeline · shot composition A/B/C

Where should shot composition live?

Story: morning of an important interview — 4 beats whose optimal framing is different in each (close-up → wide → tight → wide-pulling-back). Same character, same style, same identity photo. The only thing that changes is whether the orchestrator gets coarse shot direction.

The three variants

The matrix

Scale labels reflect what the model actually wrote in its revised_prompt: matches the brief's intended scale different scale than brief no scale vocabulary at all

Beat 1 — Mirror check
extreme close-up · intimate self-check at the bathroom mirror
Beat 2 — Walking to station
wide long shot · isolate her in the empty city street
Beat 3 — Coffee shop notes
tight medium close-up · hands & face, foreground carries nervousness
Beat 4 — At the doorway
wide low-angle from behind · pull back, scale her vs the building
A · no_shotno framing direction
no_shot beat 1no scale tag
no_shot beat 2no scale tag
no_shot beat 3no scale tag
no_shot beat 4close
B · coarse_externalcoarse direction per beat
coarse_external beat 1close
coarse_external beat 2wide
coarse_external beat 3close · medium
coarse_external beat 4close
C · coarse_plus_declare+ declare in tool prompt
coarse_plus_declare beat 1close
coarse_plus_declare beat 2wide
coarse_plus_declare beat 3close · medium
coarse_plus_declare beat 4close

What the matrix shows

Beat 2 is the clearest case. Brief asked for a wide long shot to isolate Mei in the empty street. Variant A rendered a medium portrait — street barely visible behind her. Variants B and C both pulled the camera back: full figure, the street recedes into the distance, telephone wires + a single sycamore give scale. The framing carried the intended loneliness only when coarse direction was provided.

Beat 4 is the second clearest. Brief asked for a wide low-angle from behind, the building looming. Variant A rendered her three-quarter front, waist-up, building reduced to a strip. Variant B+C delivered: she's small at the building's foot, shot from behind, the revolving door + facade fill the frame.

Beats 1 and 3 show smaller differences in the rendered images, but the orchestrator's language diverges sharply — variant A's revised_prompt contains no shot vocabulary at all on those beats. It punts framing entirely to the image tool's defaults.

B vs C — no behavioral difference. Forcing the orchestrator to explicitly declare the concrete shot in each tool prompt produced exactly the same scale buckets and (subjectively) very similar panels. The extra 186 output tokens bought nothing.

The numbers

metricA · no_shotB · coarse_externalC · coarse_plus_declare
prompt chars4,6465,8986,383
input tokens8,3428,7608,958
reasoning tokens575627662
output tokens1,5331,6891,719
wall seconds152.4155.6132.0
panels delivered4/44/44/4
scale diversity1/33/33/3

All runs: gpt-5.4-mini · /v1/responses · OpenAI direct · image_generation low quality · 1024×1024 · one identity photo attached.

Verdict

Ship variant B (coarse_external). Coarse direction per beat moves the orchestrator from "no framing decisions at all" to "all three scale registers used." Cost is +418 input tokens, +52 reasoning tokens — negligible.

Skip variant C. Forcing explicit declaration in the tool prompt added 6× the prompt overhead with zero behavior change. The model already does the right thing with just the directive; making it announce its choice didn't tighten the loop.

Don't fully delegate (variant A). Without direction the orchestrator drops shot vocabulary from its rewritten prompts entirely. The image tool then defaults to "pleasant person, centered, medium framing" — the same pattern that fought us in the story-source probe.

Production split: extract.ts owns coarse intent (scale + relationship + sequence rhythm). Orchestrator owns concrete realization (angle, what's in foreground, eyeline degree). Image tool fills the gaps. Director sets the shot list; DP frames the shot.