Pikumo · probe image-gen pipeline · shot composition A/B/C

Where should shot composition live?

Story: morning of an important interview — 4 beats whose optimal framing is different in each (close-up → wide → tight → wide-pulling-back). Same character, same style, same identity photo. The only thing that changes is whether the orchestrator gets coarse shot direction.

The three variants

A — no_shot: beats have title / details / action / setting. The orchestrator decides all framing.
B — coarse_external: each beat carries a <shot scale="…" relationship="…"/> directive + intent note. Orchestrator chooses concrete framing.
C — coarse_plus_declare: variant B + the orchestrator is required to declare a concrete shot description in each tool-call prompt.

The matrix

Scale labels reflect what the model actually wrote in its revised_prompt: matches the brief's intended scale different scale than brief no scale vocabulary at all

Beat 1 — Mirror check
extreme close-up · intimate self-check at the bathroom mirror

Beat 2 — Walking to station
wide long shot · isolate her in the empty city street

Beat 3 — Coffee shop notes
tight medium close-up · hands & face, foreground carries nervousness

Beat 4 — At the doorway
wide low-angle from behind · pull back, scale her vs the building

A · no_shotno framing direction

no scale tag

no scale tag

no scale tag

B · coarse_externalcoarse direction per beat

wide

close · medium

C · coarse_plus_declare+ declare in tool prompt

wide

close · medium

What the matrix shows

Beat 2 is the clearest case. Brief asked for a wide long shot to isolate Mei in the empty street. Variant A rendered a medium portrait — street barely visible behind her. Variants B and C both pulled the camera back: full figure, the street recedes into the distance, telephone wires + a single sycamore give scale. The framing carried the intended loneliness only when coarse direction was provided.

Beat 4 is the second clearest. Brief asked for a wide low-angle from behind, the building looming. Variant A rendered her three-quarter front, waist-up, building reduced to a strip. Variant B+C delivered: she's small at the building's foot, shot from behind, the revolving door + facade fill the frame.

Beats 1 and 3 show smaller differences in the rendered images, but the orchestrator's language diverges sharply — variant A's revised_prompt contains no shot vocabulary at all on those beats. It punts framing entirely to the image tool's defaults.

B vs C — no behavioral difference. Forcing the orchestrator to explicitly declare the concrete shot in each tool prompt produced exactly the same scale buckets and (subjectively) very similar panels. The extra 186 output tokens bought nothing.

The numbers

metric	A · no_shot	B · coarse_external	C · coarse_plus_declare
prompt chars	4,646	5,898	6,383
input tokens	8,342	8,760	8,958
reasoning tokens	575	627	662
output tokens	1,533	1,689	1,719
wall seconds	152.4	155.6	132.0
panels delivered	4/4	4/4	4/4
scale diversity	1/3	3/3	3/3

All runs: gpt-5.4-mini · /v1/responses · OpenAI direct · image_generation low quality · 1024×1024 · one identity photo attached.

Verdict

Ship variant B (coarse_external). Coarse direction per beat moves the orchestrator from "no framing decisions at all" to "all three scale registers used." Cost is +418 input tokens, +52 reasoning tokens — negligible.

Skip variant C. Forcing explicit declaration in the tool prompt added 6× the prompt overhead with zero behavior change. The model already does the right thing with just the directive; making it announce its choice didn't tighten the loop.

Don't fully delegate (variant A). Without direction the orchestrator drops shot vocabulary from its rewritten prompts entirely. The image tool then defaults to "pleasant person, centered, medium framing" — the same pattern that fought us in the story-source probe.

Production split: extract.ts owns coarse intent (scale + relationship + sequence rhythm). Orchestrator owns concrete realization (angle, what's in foreground, eyeline degree). Image tool fills the gaps. Director sets the shot list; DP frames the shot.

Companion probe: pikumo-story-source-ab.pages.dev — does attaching the raw story help with tone?
Source: main/reference-anchor-experiment/probe_shot_composition.py. Raw artifacts in this directory: summary.json, prompts & revised_prompts per variant. Generated for the Pikumo image-gen pipeline reconciliation work.