| Field | Value |
|---|---|
| RFC | 0055 |
| Title | Promote RFC 0031's reserved vision-input / audio model-capability identifiers into a formal vocabulary, and add an optional meta.rendering hint + media.* URL-reference convention to the AI envelope, so an LLM node can emit images / audio / files / structured cards that any consumer renders portably |
| Status | Accepted |
| Author(s) | David Tufts (@davidscotttufts) |
| Created | 2026-05-25 |
| Updated | 2026-05-26 (Active → Accepted — non-steward host advertises the surface: MyndHyve workflow-runtime advertises aiProviders.maxInlineMediaBytes: 10485760 + aiProviders.modelCapabilities.advertised: ['vision-input','image-output'] live on https://workflow-runtime-gjw5bcse7a-uc.a.run.app/.well-known/openwop (openwop-side curl-verified 2026-05-26, revision workflow-runtime-00217-q7c); the reference host carries the behavioral assertions. Honest omission: MyndHyve does not advertise audio-input/audio-output — no audio pipeline today — which is the correct application of the reserved-identifier rule, not a gap.) 2026-05-25 (Draft → Active — §A vocabulary + §B meta.rendering + §C media kinds landed atomically with schemas, the media-asset-url-tenant-scoped SECURITY invariant, conformance, the reference-app renderer, and reference-host serving: the in-memory/sqlite host advertises aiProviders.maxInlineMediaBytes + media.{image,audio,file} in supportedEnvelopes/schemaVersions and serves tenant-scoped capability-token asset URLs (GET /v1/host/sample/assets/{token}) with the behavioral conformance assertions live. Active → Accepted awaits a non-steward host advertising the surface per RFCS/0001. The vision-input/audio-* model-capability identifiers are reserved + registered; a host advertises them only when its model supports them — the reference mock model does not, so it advertises none.) |
| Affects | spec/v1/ai-envelope.md (new §"Rendering hints" + §"Media reference payloads" + 3 universal kinds) · spec/v1/structured-output-subset.md (informative cross-ref) · schemas/ai-envelope.schema.json (optional rendering on the EnvelopeMeta $def) · 3 new schemas/envelopes/media.{image,audio,file}.schema.json + their supportedEnvelopes/schemaVersions advertisement · schemas/capabilities.schema.json (prose-registry extension to modelCapabilities.advertised — NOT an enum — + optional aiProviders.maxInlineMediaBytes) · RFC 0031 §C (registry table) · spec/v1/host-capabilities.md (§"Model-capability declarations" registry + media URL convention) · SECURITY/invariants.yaml (media-asset-url-tenant-scoped) · new conformance scenarios |
| Compatibility | additive |
| Supersedes | — |
| Superseded by | — |
Summary
RFC 0031 §C reserved vision-input, audio-output, and code-execution as future model-capability identifiers but never defined them, and the AI envelope (ai-envelope.md) has no portable way to say "this payload is an image / an audio clip / a renderable card" — so every host and consumer invents its own convention and the reference app's chat surface can only render text and reasoning. This RFC (1) promotes a small, fixed set of model-capability identifiers into a formal vocabulary so packs can gate on them and consumers can know what the active model can consume/emit, and (2) adds an optional meta.rendering hint plus a media.* URL-reference payload convention to the envelope so emitted rich content renders consistently across any consumer. It deliberately does not add a real-time audio/video/screen/cursor _transport_ — that is media plumbing OpenWOP composes with rather than owns (see Alternatives §4). Everything here is advertisement- or meta-level and ignorable by existing clients.
Motivation
Two concrete pains, both visible in the corpus today:
1. Capability vocabulary is half-defined. RFC 0031 §C's open question #1 named vision-input / audio-output / code-execution as "identifiers that will materialize" and punted on whether they go through a full RFC. They have now materialized: every major provider ships vision input and several ship audio I/O, and core.openwop.ai pack authors have no portable string to gate on (peerDependencies.aiProviders.modelCapabilities). Without a fixed vocabulary, a pack that needs vision has to hard-code provider names, which is exactly the anti-pattern RFC 0031 §"Alternatives" rejected.
2. The envelope can't describe how to render its payload. ai-envelope.md defines type / payload / meta / partial, and structured-output-subset.md covers JSON-schema portability — but nothing tells a consumer "render this payload.url as an inline image" vs. "render this object as a card" vs. "this is base64 audio." The reference app's MessageRenderer / EnvelopeInspector (apps/workflow-engine/frontend/react/src/chat/) therefore special-cases text + reasoning and drops everything else to a raw JSON dump. Any other consumer (a debugger, a mobile client, a third-party host's UI) faces the same wall. A rendering hint is an interop concern — two consumers reading the same envelope should render it the same way — which is why it belongs in the spec, not in each app.
The host _generation_ side already exists and already chose the right shape: host-capabilities.md §host.aiProviders defines ctx.callImageGenerator / ctx.callVideoGenerator returning a host-served URL ("videos are too large for inline base64", host-capabilities.md:119). This RFC carries that same URL-reference discipline into the _envelope_ so an LLM-emitted image rides the same rails as a host-generated one.
Proposal
§A — aiProviders.modelCapabilities identifier vocabulary (extends RFC 0031 §C)
Promote the following identifiers from "reserved/illustrative" to the formal reserved registry a host advertises and a pack may gate on.
Schema reality (corrected from this RFC's first draft via the /architect pass). capabilities.schema.json does not define modelCapabilities as a JSON enum. modelCapabilities.advertised is an open, pattern-validated string[] (`^([a-z][a-z0-9-]*|x-host-[a-z][a-z0-9-]*-[a-z][a-z0-9-]*)$`); the reserved identifiers live in prose (the field description + the RFC 0031 §C registry table). Therefore §A is a prose-registry extension, not a schema-enum change: the existing pattern already validates vision-input / audio-input / audio-output / image-output. Converting advertised.items to a closed enum is forbidden: it would reject every x-host- identifier and any other pattern-valid string a host already advertises, a value-space narrowing that is breaking per COMPATIBILITY.md §2.2.
The four new reserved identifiers are registered in three prose locations — (a) the advertised description in schemas/capabilities.schema.json, (b) the RFC 0031 §C registry table, (c) host-capabilities.md §"Model-capability declarations":
"advertised": {
"type": "array",
"items": { "type": "string", "pattern": "^([a-z][a-z0-9-]*|x-host-...)$" },
- "description": "... Spec-reserved identifiers per RFC 0031 §C: structured-output, discriminator-enum, long-context, reasoning, function-calling. ..."
+ "description": "... Spec-reserved identifiers per RFC 0031 §C + RFC 0055: structured-output, discriminator-enum, long-context, reasoning, function-calling, vision-input, audio-input, audio-output, image-output. ..."
}
Growth of the reserved registry still requires an RFC (single-steward bootstrap rule, per RFC 0031 §C answer to OQ#1) — but it is registry/prose growth, never an enum edit.
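The difference between the open pattern and a closed enum can be sketched as follows. This is a minimal illustration, not the schema itself: the regular expression restores the `*` quantifiers and spells out the x-host- branch as an assumption, and `classifyIdentifier` / `closedEnumAccepts` are hypothetical helper names.

```typescript
// Assumption: the `*` quantifiers and the x-host- branch are reconstructed from
// the RFC prose; consult capabilities.schema.json for the authoritative pattern.
const ADVERTISED_PATTERN =
  /^([a-z][a-z0-9-]*|x-host-[a-z][a-z0-9-]*-[a-z][a-z0-9-]*)$/;

// Reserved identifiers after RFC 0055 (a prose registry, not a schema enum).
const RESERVED_IDENTIFIERS: readonly string[] = [
  "structured-output",
  "discriminator-enum",
  "long-context",
  "reasoning",
  "function-calling",
  "vision-input",
  "audio-input",
  "audio-output",
  "image-output",
];

type IdentifierClass = "reserved" | "host-extension" | "unreserved" | "invalid";

// Hypothetical helper: classify one entry of modelCapabilities.advertised.
function classifyIdentifier(id: string): IdentifierClass {
  if (!ADVERTISED_PATTERN.test(id)) return "invalid";
  if (RESERVED_IDENTIFIERS.indexOf(id) !== -1) return "reserved";
  if (/^x-host-/.test(id)) return "host-extension";
  return "unreserved";
}

// What a closed enum would do instead: accept only the registry entries.
function closedEnumAccepts(id: string): boolean {
  return RESERVED_IDENTIFIERS.indexOf(id) !== -1;
}
```

Under these assumptions, a host-extension identifier such as x-host-acme-ocr (an invented example) passes the pattern but would be rejected by a closed enum, which is the value-space narrowing §A rules out.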
Semantics (normative, when advertised):
- vision-input: the active model accepts image content in the prompt. A pack declaring peerDependencies.aiProviders.modelCapabilities: ["vision-input"] MUST refuse to register on a host that does not advertise it, with the existing host_capability_missing discipline.
- audio-input / audio-output: the model accepts / emits audio content.
- image-output: the model emits images directly in its completion (distinct from the host-side aiProviders.imageGeneration generation surface, which is a separate tool call).
code-execution from RFC 0031 §C is intentionally left out of this registry — it is a sandbox/runtime concern that belongs with RFC 0035, not the model-capability vocabulary.
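The gating rule for vision-input can be sketched as a registration check. This is a minimal sketch under assumptions: the `HostCapabilities` and `PackManifest` shapes are trimmed to the fields named in this RFC, and `checkModelCapabilities` and the `missing` field are hypothetical, not part of any published OpenWOP API.

```typescript
// Trimmed, assumed shapes: only the fields this RFC names are modeled.
interface HostCapabilities {
  aiProviders?: { modelCapabilities?: { advertised?: string[] } };
}

interface PackManifest {
  name: string;
  peerDependencies?: { aiProviders?: { modelCapabilities?: string[] } };
}

type RegistrationResult =
  | { ok: true }
  | { ok: false; code: "host_capability_missing"; missing: string[] };

// Hypothetical helper: refuse registration when any required model-capability
// identifier is absent from the host's advertisement.
function checkModelCapabilities(
  pack: PackManifest,
  host: HostCapabilities,
): RegistrationResult {
  const required = pack.peerDependencies?.aiProviders?.modelCapabilities ?? [];
  const advertised = host.aiProviders?.modelCapabilities?.advertised ?? [];
  const missing = required.filter((id) => advertised.indexOf(id) === -1);
  if (missing.length === 0) return { ok: true };
  return { ok: false, code: "host_capability_missing", missing };
}
```

A host advertising ["vision-input", "image-output"] would accept a pack that requires vision-input and refuse one that requires audio-input, since exact string matching is what makes a fixed vocabulary gateable.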
§B — meta.rendering hint on the AI envelope (additive, optional)
Add an optional rendering object as a property on the EnvelopeMeta $def in ai-envelope.schema.json. Schema reality (corrected via /architect): the envelope's meta is $ref → #/$defs/EnvelopeMeta, and EnvelopeMeta is additionalProperties: false with required: ["source", "ts"] — so rendering must be added as an explicit, optional property (it cannot just "appear" under an open object; the original additionalProperties: true diff was wrong). It is a hint, not a contract: it never changes payload validation, and a consumer that doesn't recognize it MUST fall back to its default rendering (today: text / raw-JSON). Adding one optional property to a required-bounded additionalProperties:false $def is additive — existing envelopes (which omit it) still validate.
"$defs": {
"EnvelopeMeta": {
"type": "object",
"required": ["source", "ts"],
"properties": {
"source": { "type": "string", "enum": ["ai-generation", "user", "system"] },
"ts": { "type": "string", "format": "date-time" },
+ "rendering": {
+ "type": "object",
+ "description": "RFC 0055. Optional hint for how a consumer SHOULD render this envelope's payload. Non-normative w.r.t. payload validation; unknown values fall back to default rendering.",
+ "properties": {
+ "display": { "type": "string", "enum": ["markdown", "code", "card", "image", "audio", "file"], "description": "Renderer family the producer suggests." },
+ "mimeType": { "type": "string", "description": "IANA media type when display is image/audio/file." },
+ "lang": { "type": "string", "description": "Language tag when display: code." },
+ "alt": { "type": "string", "description": "Text alternative for a11y when display is image/audio/file. SHOULD be present." },
+ "title": { "type": "string", "description": "Optional caption / card header." }
+ },
+ "additionalProperties": false
+ }
},
"additionalProperties": false
}
}
Behavior:
- A producer (LLM node / host) MAY set meta.rendering. It MUST NOT be required for any payload to validate.
- A consumer SHOULD honor display when it recognizes the value and MUST degrade gracefully otherwise. alt SHOULD be rendered for assistive technologies whenever display is image/audio/file.
- meta.rendering carries no secret material and is subject to the same SR-1 redaction discipline as the rest of meta.
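The consumer-side behavior above can be sketched as a small dispatch function. This is an illustrative sketch only: `planRendering`, `RenderPlan`, and the "default" renderer label are hypothetical names, and the real MessageRenderer may be organized quite differently.

```typescript
// The display values from the §B schema.
type DisplayFamily = "markdown" | "code" | "card" | "image" | "audio" | "file";

const KNOWN_DISPLAYS: readonly string[] = [
  "markdown", "code", "card", "image", "audio", "file",
];

// Loosely typed on purpose: a consumer may receive values it does not know.
interface RenderingHint {
  display?: string;
  mimeType?: string;
  lang?: string;
  alt?: string;
  title?: string;
}

interface RenderPlan {
  renderer: DisplayFamily | "default"; // "default" = text / raw-JSON fallback
  altText?: string;
}

function isKnownDisplay(value: string): value is DisplayFamily {
  return KNOWN_DISPLAYS.indexOf(value) !== -1;
}

// Hypothetical helper: honor a recognized hint, otherwise degrade gracefully.
function planRendering(hint: RenderingHint | undefined): RenderPlan {
  const display = hint?.display;
  if (display === undefined || !isKnownDisplay(display)) {
    return { renderer: "default" };
  }
  const wantsAlt = display === "image" || display === "audio" || display === "file";
  if (wantsAlt && hint?.alt !== undefined) {
    return { renderer: display, altText: hint.alt };
  }
  return { renderer: display };
}
```

An unrecognized display value and a missing hint take the same fallback path, so a future display family degrades without error on an older consumer.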
§C — media.* URL-reference payload convention (normative when display is image/audio/file)
When meta.rendering.display is image, audio, or file, the payload SHOULD reference the asset by a host-served URL rather than inlining large binaries — mirroring the existing ctx.callVideoGenerator contract (host-capabilities.md):
{
"type": "media.image",
"schemaVersion": "1.0",
"envelopeId": "env_abc",
"correlationId": "run_123:node_render",
"payload": { "url": "https://host.example/v1/runs/run_123/assets/img_9.png", "bytes": 184320 },
"meta": { "source": "ai-generation", "ts": "2026-05-25T12:00:00Z", "rendering": { "display": "image", "mimeType": "image/png", "alt": "Q3 revenue bar chart" } }
}
Normative rules:
1. A host that serves asset URLs MUST scope them to the run's tenant and MUST NOT make them globally guessable (capability-token or signed-URL discipline; reuse the interrupt signed-token recipe).
2. Inline base64 is permitted only below a host-advertised cap (aiProviders.maxInlineMediaBytes, default 256 KiB); above it the host MUST use a URL reference. This bounds event-log bloat and keeps replay payloads portable.
3. Asset URLs are part of a run's debug-bundle manifest (debug-bundle.md) by reference, never by inlining the binary.
4. Rule 1 is enforced by a new protocol-tier SECURITY invariant media-asset-url-tenant-scoped (SECURITY/invariants.yaml), tested by media-url-inline-cap.test.ts.
5. Asset retention (resolves Unresolved Q2): a host MUST retain a media.* asset at least as long as the emitting run's event-log retention (replay.md), so a forked/replayed run can still resolve the URL. The event payload (a URL string) replays deterministically; this rule keeps the referenced asset available.
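Rule 2 (inline versus URL) can be sketched as a single decision on the producer side. This is a sketch under stated assumptions: `StoredAsset` and `toEmittedPayload` are hypothetical names, `bytes` is assumed to be the decoded size rather than the base64 length, and an asset exactly at the cap is assumed to go by URL because only sizes below the cap are permitted inline.

```typescript
// Default from rule 2 when the host advertises no aiProviders.maxInlineMediaBytes.
const DEFAULT_MAX_INLINE_MEDIA_BYTES = 256 * 1024;

// Hypothetical view of an asset the host has already stored.
interface StoredAsset {
  bytes: number;    // assumed: decoded size in bytes
  base64: string;   // encoded content, used only on the inline path
  assetUrl: string; // tenant-scoped URL the host has already minted (rule 1)
}

type EmittedPayload =
  | { base64: string; bytes: number }
  | { url: string; bytes: number };

// Hypothetical helper: inline only strictly below the cap, otherwise reference by URL.
function toEmittedPayload(
  asset: StoredAsset,
  maxInlineMediaBytes: number = DEFAULT_MAX_INLINE_MEDIA_BYTES,
): EmittedPayload {
  if (asset.bytes < maxInlineMediaBytes) {
    return { base64: asset.base64, bytes: asset.bytes };
  }
  return { url: asset.assetUrl, bytes: asset.bytes };
}
```

With a cap of 0 the comparison is never true, so every asset is emitted by URL; this is one way a host could exercise the URL-only path.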
Universal-kind machinery (corrected via /architect). media.image / media.audio / media.file are new universal envelope kinds — and per ai-envelope.schema.json the payload is "selected by the type discriminator and validated against a per-kind schema at schemas/envelopes/{type}.schema.json," advertised via Capabilities.supportedEnvelopes + per-kind Capabilities.schemaVersions. So §C is not prose-only: it requires three new payload schemas schemas/envelopes/media.{image,audio,file}.schema.json (the { url?, base64?, bytes, … } shape), their registration in supportedEnvelopes, and schemaVersions entries. Vendor-namespaced kinds remain available for anything richer.
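The { url?, base64?, bytes, … } payload shape can be sketched as a structural check. This is not the published per-kind schema: only url, base64, and bytes are named in this RFC, and the requirement that exactly one of url / base64 be present is an assumption made for illustration. `checkMediaPayload` and `isMediaPayload` are hypothetical names.

```typescript
// Assumed fields: only url, base64, and bytes are named by the RFC text.
interface MediaPayloadShape {
  url?: string;
  base64?: string;
  bytes: number;
}

// Hypothetical checker: returns human-readable problems (empty = acceptable).
function checkMediaPayload(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["payload must be an object"];
  }
  const fields = value as Record<string, unknown>;
  const url = fields["url"];
  const base64 = fields["base64"];
  const bytes = fields["bytes"];
  const problems: string[] = [];
  if (typeof bytes !== "number" || bytes < 0 || Math.floor(bytes) !== bytes) {
    problems.push("bytes must be a non-negative integer");
  }
  if (url !== undefined && typeof url !== "string") {
    problems.push("url must be a string");
  }
  if (base64 !== undefined && typeof base64 !== "string") {
    problems.push("base64 must be a string");
  }
  // Assumption: exactly one of url / base64 carries the asset.
  if ((url === undefined) === (base64 === undefined)) {
    problems.push("exactly one of url or base64 must be present");
  }
  return problems;
}

function isMediaPayload(value: unknown): value is MediaPayloadShape {
  return checkMediaPayload(value).length === 0;
}
```

A real host would validate against schemas/envelopes/media.{image,audio,file}.schema.json with a JSON Schema validator; the hand-written check only makes the intended shape concrete.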
Compatibility
Additive. Three independent additive moves, each backward-safe:
- §A extends the open, pattern-validated reserved registry by four prose identifiers; no schema enum exists or is introduced (introducing one would narrow the value space = breaking; see §A). The existing pattern already accepts them; existing hosts/packs are unaffected. Additive per COMPATIBILITY.md §2.1.
- §B adds one optional property (rendering) to the EnvelopeMeta $def (additionalProperties:false, required:[source,ts]). Optional ⇒ existing envelopes that omit it still validate; existing consumers ignore it. Additive.
- §C adds three new universal envelope kinds with their own per-kind payload schemas + supportedEnvelopes/schemaVersions advertisement (consumers that don't recognize the kind fall back to raw rendering), an optional aiProviders.maxInlineMediaBytes advertisement (default 256 KiB), and one new protocol-tier SECURITY invariant. All additive: a host that emits no media.* kind is unaffected.
No existing v1.x conformance pass is invalidated: a host that advertises none of the new identifiers, never sets meta.rendering, and never emits a media.* kind behaves exactly as before.
Conformance
- envelope-rendering-hint-shape.test.ts: meta.rendering validates against the schema; an envelope with no meta.rendering still validates (proves optionality). (Always runs.)
- envelope-rendering-hint-ignored.test.ts: a consumer fixture given a display value it doesn't recognize falls back to default rendering without error. (Always runs.)
- model-capability-vision-gate.test.ts: a pack declaring modelCapabilities: ["vision-input"] registers on a host advertising it and refuses with host_capability_missing on one that doesn't. (Gated on aiProviders.supported.)
- media-url-inline-cap.test.ts: an emitted asset above maxInlineMediaBytes is served by URL, not inlined; the URL is tenant-scoped and not present in a cross-tenant view. (Gated on a host that advertises media emission.)
- media-url-debug-bundle-reference.test.ts: a run that emitted a media.image lists the asset by reference (not inline binary) in its debug bundle. (Gated on debugBundle.supported.)
New fixture: a conformance.media.emit fixture node that emits one media.image URL envelope + one inline-under-cap media.audio envelope, so the rendering/cap assertions have a deterministic producer.
Alternatives considered
1. Define each rendering family as its own envelope schema rather than a meta hint. Rejected: payload shapes vary per provider and per pack; forcing a fixed schema per display family would either be too narrow (breaks the next provider) or balloon into a parallel type system. An optional hint on meta keeps the payload free and the rendering portable. The hint is advisory by design.
2. Leave rendering entirely to each consumer (do nothing). Rejected: that is the status quo, and it means the reference app, a debugger, and a mobile client each guess differently at the same envelope. Cross-consumer rendering consistency is an interop property; interop properties are why the spec exists.
3. Drop the reserved registry and treat modelCapabilities as unreserved free strings. Rejected: free strings defeat gating (a pack can't reliably match "vision" vs "vision-input" vs "image-understanding"). RFC 0031 already chose a reserved vocabulary (a prose registry over a pattern-validated string array, not a schema enum; see §A); this RFC extends it under the same rule.
4. Add a real-time audio/video/screen-capture/cursor streaming transport. Rejected as out of charter. OpenWOP streams _events_ (SSE / RunEvent), not media frames; live media transport is the kind of plumbing OpenWOP composes with (WebRTC, host media servers) rather than standardizes, consistent with the README "What OpenWOP is not" boundary and the A2A/MCP delegation pattern. This RFC covers _emitted multimodal artifacts_, which are envelope content, not a new transport.
Unresolved questions
1. Default maxInlineMediaBytes. 256 KiB is a starting point chosen to keep event logs replay-friendly. Should it be lower (replay payload size) or higher (fewer round-trips for small images)? Resolve before Active with one adopter's real asset-size distribution.
2. Asset retention / GC. ~~Open~~ Resolved (via /architect): §C rule 5 now requires a host to retain a media.* asset at least as long as the emitting run's event-log retention, so a forked/replayed run resolves the URL. Remaining sub-question: GC granularity (per-asset TTL vs. tied to run-log GC), which is a host-impl detail, not wire.
3. audio-input ingestion shape. This RFC defines the _advertisement_ for audio-input but not the inbound audio payload shape (how a workflow input carries audio). Defer the input shape to a follow-up unless an adopter needs audio _ingestion_ (vs. emission) immediately.
Implementation notes (non-normative)
- Schema diffs (§A, §B) + the new universal kinds land on Active promotion with the conformance scenarios.
- Reference-app payoff (drives the companion app work in plans/app-ux-enhancements.md): MessageRenderer switches on meta.rendering.display to render images / audio players / code blocks / cards inline; EnvelopeInspector shows the hint; the BYOK/model picker surfaces modelCapabilities so users see whether the chosen model supports vision before they attach an image.
- Reference-host target: examples/hosts/postgres advertises maxInlineMediaBytes + serves tenant-scoped asset URLs; the in-memory demo host can advertise inline-only (cap = 0 forces URL, or a small cap) to exercise both paths.
Acceptance criteria
- [ ] Spec text merged (this file + ai-envelope.md §"Rendering hints" + §"Media reference payloads").
- [ ] rendering optional property on the EnvelopeMeta $def in ai-envelope.schema.json; four new identifiers in the modelCapabilities.advertised prose registry (no enum); optional maxInlineMediaBytes in capabilities.schema.json.
- [ ] media.image / media.audio / media.file documented as universal kinds with schemas/envelopes/media.{image,audio,file}.schema.json + supportedEnvelopes/schemaVersions advertisement.
- [ ] media-asset-url-tenant-scoped protocol-tier invariant in SECURITY/invariants.yaml + its conformance test.
- [ ] Five conformance scenarios + conformance.media.emit fixture node.
- [ ] CHANGELOG entry under [Unreleased].
- [ ] A host advertises a modelCapabilities superset including vision-input and serves a tenant-scoped media URL passing media-url-inline-cap + media-url-debug-bundle-reference.
References
- RFCS/0031-envelope-variants-and-model-capabilities.md: the capability-identifier vocabulary this extends (§C, OQ#1).
- spec/v1/ai-envelope.md: the envelope this annotates.
- spec/v1/structured-output-subset.md: companion JSON-schema portability reference.
- spec/v1/host-capabilities.md §host.aiProviders: the existing ctx.callImageGenerator / ctx.callVideoGenerator URL-reference convention this mirrors.
- spec/v1/debug-bundle.md: asset-by-reference export.
- plans/app-ux-enhancements.md: the reference-app UX work this unblocks.