no scale tagPikumo · probe image-gen pipeline · shot composition A/B/C
Where should shot composition live?
Story: morning of an important interview — 4 beats whose optimal framing is different in each (close-up → wide → tight → wide-pulling-back). Same character, same style, same identity photo. The only thing that changes is whether the orchestrator gets coarse shot direction.
The three variants
- A — no_shot: beats have title / details / action / setting. The orchestrator decides all framing.
- B — coarse_external: each beat carries a
<shot scale="…" relationship="…"/>directive + intent note. Orchestrator chooses concrete framing. - C — coarse_plus_declare: variant B + the orchestrator is required to declare a concrete shot description in each tool-call prompt.
The matrix
Scale labels reflect what the model actually wrote in its revised_prompt:
matches the brief's intended scale
different scale than brief
no scale vocabulary at all
extreme close-up · intimate self-check at the bathroom mirror
wide long shot · isolate her in the empty city street
tight medium close-up · hands & face, foreground carries nervousness
wide low-angle from behind · pull back, scale her vs the building
no scale tag
no scale tag
no scale tag
close
close
wide
close · medium
close
close
wide
close · medium
closeWhat the matrix shows
Beat 2 is the clearest case. Brief asked for a wide long shot to isolate Mei in the empty street. Variant A rendered a medium portrait — street barely visible behind her. Variants B and C both pulled the camera back: full figure, the street recedes into the distance, telephone wires + a single sycamore give scale. The framing carried the intended loneliness only when coarse direction was provided.
Beat 4 is the second clearest. Brief asked for a wide low-angle from behind, the building looming. Variant A rendered her three-quarter front, waist-up, building reduced to a strip. Variant B+C delivered: she's small at the building's foot, shot from behind, the revolving door + facade fill the frame.
Beats 1 and 3 show smaller differences in the rendered images, but the orchestrator's language diverges sharply — variant A's revised_prompt contains no shot vocabulary at all on those beats. It punts framing entirely to the image tool's defaults.
B vs C — no behavioral difference. Forcing the orchestrator to explicitly declare the concrete shot in each tool prompt produced exactly the same scale buckets and (subjectively) very similar panels. The extra 186 output tokens bought nothing.
The numbers
| metric | A · no_shot | B · coarse_external | C · coarse_plus_declare |
|---|---|---|---|
| prompt chars | 4,646 | 5,898 | 6,383 |
| input tokens | 8,342 | 8,760 | 8,958 |
| reasoning tokens | 575 | 627 | 662 |
| output tokens | 1,533 | 1,689 | 1,719 |
| wall seconds | 152.4 | 155.6 | 132.0 |
| panels delivered | 4/4 | 4/4 | 4/4 |
| scale diversity | 1/3 | 3/3 | 3/3 |
All runs: gpt-5.4-mini · /v1/responses · OpenAI direct · image_generation low quality · 1024×1024 · one identity photo attached.
Verdict
Ship variant B (coarse_external). Coarse direction per beat moves the orchestrator from "no framing decisions at all" to "all three scale registers used." Cost is +418 input tokens, +52 reasoning tokens — negligible.
Skip variant C. Forcing explicit declaration in the tool prompt added 6× the prompt overhead with zero behavior change. The model already does the right thing with just the directive; making it announce its choice didn't tighten the loop.
Don't fully delegate (variant A). Without direction the orchestrator drops shot vocabulary from its rewritten prompts entirely. The image tool then defaults to "pleasant person, centered, medium framing" — the same pattern that fought us in the story-source probe.
Production split: extract.ts owns coarse intent (scale + relationship + sequence rhythm). Orchestrator owns concrete realization (angle, what's in foreground, eyeline degree). Image tool fills the gaps. Director sets the shot list; DP frames the shot.