We are building Numonic toward an AI-native creative infrastructure for prosumers, studios, and agencies, and over the past year we have converged on a working view of the stack this kind of work needs. It is related to, but materially different from, the generic agentic-infrastructure view that most platform teams discuss. The differences matter: each one reflects a creative-media constraint that a pure compute-plus-memory stack fails to encode.
This post is a conjecture, not a finished account. It describes the stack we are building toward, layer by layer, and highlights three layers that almost never appear in horizontal taxonomies yet look indispensable for AI-first creative work. If you are making short-form AI film, narrative video, multi-shot character work, animated sequences, or mixed model-to-motion pipelines, this is the direction we think the platform needs to go, and we want to hear where it is wrong. We start with the foundation and walk upward.
Compute plus memory is not creative infrastructure. Creative infrastructure is lineage, perception, and cost, standing on top of them.
The Stack, at a Glance
Seven layers, from the bottom up: Tenant-Scoped Execution, Persistent Agent Identity, Domain Memory, Tool Integration, Perceptual Evaluation, Orchestration, and Render Economics. Everything sits on an append-only provenance foundation, so that every step of creative labour is recoverable after the fact, not reconstructed from memory.
Layers 3, 5, and 7 are the three that do not appear in generic horizontal stacks. The rest of this post walks up from the foundation and spends most of its time on those three.
Layer 1: Tenant-Scoped Execution
Everything begins with tenant isolation, not with compute. A sandboxed microVM or a GPU-backed container is useful only when the execution context carries an unambiguous tenant, user, and agent identity, and when every side effect it produces inherits those labels. Our aim is to treat tenant scoping as a primitive rather than a security overlay, enforced by the database and honoured in every object store path. Compute providers (serverless GPU, dedicated render workers, bring-your-own-GPU) should plug in as substitutable implementations behind the same scoped contract.
The load-bearing principle: the agent does not own the environment. The tenant owns the environment, and the agent borrows it for a single execution. For an AI film team switching between a home workstation, a rented render farm, and a cloud GPU pool across a single shot, that contract is what keeps the lineage intact.
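The borrowed-environment contract can be sketched in a few lines. This is an illustration, not Numonic's actual schema: the field names and the path convention are assumptions, chosen to show how every side effect inherits the tenant, user, and agent labels of the execution that produced it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionScope:
    """The identity labels every execution borrows from its tenant.
    Names are illustrative, not a real schema."""
    tenant_id: str
    user_id: str
    agent_id: str

def scoped_artifact_path(scope: ExecutionScope, artifact: str) -> str:
    """Every side effect inherits the scope's labels: the object-store path
    itself encodes ownership, so swapping compute providers behind the
    contract cannot detach an output from its tenant."""
    return f"{scope.tenant_id}/{scope.agent_id}/{artifact}"

scope = ExecutionScope(tenant_id="studio-a", user_id="u-42",
                       agent_id="continuity-agent")
print(scoped_artifact_path(scope, "shot-07/frame-0012.png"))
# → studio-a/continuity-agent/shot-07/frame-0012.png
```

Because the scope is frozen and supplied from outside, the agent cannot mint its own labels; it can only borrow the environment the tenant hands it.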
Layer 2: Persistent Agent Identity
A creative agent is not a disposable function call. It needs to be an economic actor with a name, a scope, a policy, and a history. A Character Continuity Agent that watches a lead actor's face across twelve shots of an AI-generated film needs a stable identity that survives model swaps, session boundaries, and provider migrations. It must be able to act on behalf of a specific tenant, with specific permissions, and produce audit records that are indistinguishable in provenance from a human contributor's.
This is different from external identity resolution, the kind used to match a public handle to a contact record. Agent identity is internal, verifiable, and budgeted. Without it, multi-agent coordination collapses into anonymous function calls, and the audit chain required for EU AI Act compliance breaks.
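A minimal sketch of an agent as an economic actor rather than a function call, under the same caveat: the fields and method below are hypothetical, meant only to show identity, permission scope, budget, and audit history travelling together.

```python
from dataclasses import dataclass, field

@dataclass
class AgentIdentity:
    """A persistent, budgeted actor. All field names are illustrative."""
    agent_id: str
    tenant_id: str
    permissions: frozenset
    budget_remaining: float
    audit_log: list = field(default_factory=list)

    def act(self, action: str, cost: float) -> None:
        if action not in self.permissions:
            raise PermissionError(f"{self.agent_id} is not scoped for {action!r}")
        self.budget_remaining -= cost
        # The audit record is keyed to the agent identity, not to any one
        # session, so it survives model swaps and provider migrations.
        self.audit_log.append(
            {"agent": self.agent_id, "action": action, "cost": cost})

agent = AgentIdentity("continuity-agent", "studio-a",
                      frozenset({"compare_faces"}), budget_remaining=10.0)
agent.act("compare_faces", cost=0.25)
```

The point of the sketch is the coupling: an action that is not in scope fails loudly, and an action that succeeds leaves both a budget trace and an audit record in the same move.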
Layer 3: Domain Memory, Not Conversational Memory
The memory layer is where generic agent stacks and creative agent stacks diverge most sharply. A generic stack treats memory as a transcript optimiser: compress the chat history, retrieve the right fragment, reduce tokens, improve recall. That is useful for assistant agents but wrong for creative ones.
What creative agents need is domain memory: the complete lineage of an asset, including the model version, LoRAs, seed, sampler, workflow graph, reference images, character sheets, style references, and every transformation that followed. Domain memory is a graph of causally linked facts about assets, workflows, and models, stored append-only and queryable by any agent, at any time, without that agent ever having participated in the original conversation.
Conversational memory compresses context. Domain memory expands it. A film team that owns the latter does not lose its hold on a character when it switches frontier models mid-project, because the character lives in the reference images, the style locks, and the accumulated judgement calls, not in a vendor's session store. This is the same principle behind the lineage work we have written about before: the version control problem for creative assets is not git, because creative assets have inherently branching, multi-parent ancestry that a linear history cannot represent.
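The multi-parent, append-only shape of domain memory can be sketched directly. This is a toy, with hypothetical field names; the point is that an asset can have several parents (a character sheet and a style lock both feeding one frame) and that ancestry is recovered by walking the graph, never by replaying a conversation.

```python
class LineageGraph:
    """Append-only asset lineage. Each asset records every input that
    caused it, so ancestry branches and merges, unlike a linear history.
    A sketch, not a real schema."""

    def __init__(self):
        self._facts = []  # append-only; nothing is ever rewritten

    def record(self, asset_id, parents, **provenance):
        self._facts.append(
            {"asset": asset_id, "parents": tuple(parents), **provenance})

    def ancestry(self, asset_id):
        """Walk every causal path back to the roots."""
        by_id = {f["asset"]: f for f in self._facts}
        seen, stack = set(), [asset_id]
        while stack:
            a = stack.pop()
            if a in seen or a not in by_id:
                continue
            seen.add(a)
            stack.extend(by_id[a]["parents"])
        return seen

g = LineageGraph()
g.record("char-sheet-v1", parents=[], model="hypothetical-model", seed=7)
g.record("style-lock-v1", parents=[])
g.record("shot-03-frame", parents=["char-sheet-v1", "style-lock-v1"],
         sampler="euler", seed=991)  # two parents: branching ancestry
print(sorted(g.ancestry("shot-03-frame")))
```

A git-style linear log cannot represent the third record: `shot-03-frame` genuinely has two parents, and both must survive in its provenance.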
Generic Stack vs Creative Stack
| Layer | Generic Agent Stack | Creative Stack |
|---|---|---|
| Memory | Transcript optimisation, token reduction | Asset lineage graph, append-only provenance |
| Evaluation | LLM-as-judge on text output | Perceptual models, visual similarity, aesthetic QA |
| Cost | Reporting feature on billing page | First-class layer with per-agent budgets and alerts |
| Identity | Session-scoped, stateless function calls | Persistent economic actor with audit history |
Layer 4: Tool Integration Through Open Standards
The tool layer is the best-understood part of the stack, and the area where the industry is converging fastest. We are standardising on Model Context Protocol as the contract between agents and tools, with our own MCP surface exposing search, publishing, annotation, and lineage operations. Every tool call should be auditable, bound to an agent identity, and recorded as a first-class interaction in the provenance graph.
The useful discipline here is not protocol choice. It is the refusal to build bespoke integrations for each external platform. If ComfyUI, Midjourney, an AI video tool, a render farm, or a client review portal cannot be reached through an MCP-style contract, we want to treat the adapter as a temporary bridge and its existence as technical debt to retire. The MCP-for-creative-tools pattern is why this discipline can hold: the protocol is stable enough to bet on, and it generalises across the generators AI film teams are already combining.
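The "every tool call is a first-class interaction" rule can be shown in miniature. The registry and tool name below are stand-ins, not a real MCP client; what the sketch shows is the ordering: the call is bound to an agent identity and appended to the provenance log before any result comes back.

```python
import time

def call_tool(agent_id: str, tool: str, args: dict, audit: list) -> dict:
    """Bind a tool call to an agent identity and record it in the
    provenance log before execution. Registry and tool are hypothetical."""
    registry = {"lineage.query": lambda a: {"parents": []}}
    record = {"agent": agent_id, "tool": tool, "args": args, "ts": time.time()}
    audit.append(record)          # recorded regardless of outcome
    result = registry[tool](args)
    record["ok"] = True           # same record object: outcome joins the log
    return result

audit: list = []
call_tool("continuity-agent", "lineage.query",
          {"asset": "shot-03-frame"}, audit)
```

Appending the record first means a crashed or misbehaving tool still leaves its attempt in the audit trail, which is the property the provenance graph needs.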
Layer 5: Perceptual Evaluation
This is the first of three layers that generic agent stacks simply omit.
Creative agents produce outputs that must be judged perceptually. Did the character stay on-model across eight shots? Did the motion hold between frames, or did the hand morph in the third second? Does the hero image match the brand palette closely enough? Is this generated frame too similar to a competitor's asset and therefore a legal risk? None of those questions can be answered by a language model reading text descriptions of the images.
Perceptual evaluation should be its own layer. It sits between the tool layer and the orchestration layer, and it uses multimodal embeddings, visual similarity indices, and aesthetic scoring to produce the signals that orchestration then acts on. Without it, an orchestration loop cannot close: there is no reliable way to decide whether a render is good enough to proceed to the next stage, or whether to regenerate, or whether to escalate to a human reviewer. For AI video and film teams, this is where most of the pain actually lives, and it is the layer we are most actively building against.
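The signal-to-decision step between this layer and orchestration can be sketched as a gate over embedding similarity. The thresholds and toy vectors below are assumptions for illustration; real inputs would be multimodal embeddings of the rendered frame and its reference.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def gate(frame_emb, reference_emb, accept=0.92, escalate=0.80):
    """Turn a perceptual signal into the decision orchestration acts on:
    proceed, escalate to a human, or regenerate. Thresholds illustrative."""
    s = cosine(frame_emb, reference_emb)
    if s >= accept:
        return "proceed"
    if s >= escalate:
        return "escalate_to_human"
    return "regenerate"

print(gate([0.9, 0.1, 0.4], [0.9, 0.1, 0.4]))  # identical → "proceed"
```

This is the closing of the loop the section describes: without a numeric perceptual signal, "good enough to proceed" has no operational meaning.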
Layer 6: Orchestration With Glass-Box Observability
Orchestration is where most platforms announce success too early. Launching three agents in parallel and stitching their outputs together is not orchestration. It is fan-out. Real orchestration requires durable state, resumability under failure, supervisory hierarchies that can intervene mid-run, and observability that lets a human operator understand what every agent did and why.
Our direction is orchestration on a durable workflow engine with content-addressed steps, so a resumed workflow re-enters where it left off without re-billing compute. It should be multi-provider by default, so a single workflow can pair one model for reasoning, another for visual evaluation, and a self-hosted model for cost-sensitive classification. And we want it instrumented for what we call the glass-box principle: every decision, every tool call, every model response retained as an immutable record that the tenant can inspect.
Glass-box observability is what turns an orchestration engine from a black-box automation into a compliance-ready record of creative labour. It is also the feature that lets a studio defend its process to an auditor, a client, or a regulator without re-running anything.
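The content-addressed resumability described above can be sketched with a hash over each step's name and inputs. This is a toy, not a real workflow engine: the cache would be persisted in practice, but the mechanism is the same, so a resumed run skips completed steps instead of re-billing them.

```python
import hashlib
import json

class DurableWorkflow:
    """Content-addressed steps: each step is keyed by a hash of its name
    and inputs, so a resumed run re-enters where it left off. A sketch."""

    def __init__(self):
        self.completed = {}   # step hash -> result (persisted in practice)
        self.billed_runs = 0

    def run_step(self, name, inputs, fn):
        key = hashlib.sha256(
            json.dumps([name, inputs], sort_keys=True).encode()).hexdigest()
        if key in self.completed:
            return self.completed[key]   # cache hit: no compute billed
        self.billed_runs += 1
        result = fn(inputs)
        self.completed[key] = result
        return result

wf = DurableWorkflow()
wf.run_step("render", {"shot": 3, "seed": 7}, lambda i: f"frame-{i['seed']}")
wf.run_step("render", {"shot": 3, "seed": 7}, lambda i: f"frame-{i['seed']}")
print(wf.billed_runs)  # → 1: the second call resumed, it did not re-render
```

Keying on inputs rather than step order is also what makes the record inspectable after the fact: the same hash names the same work, in any run.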
The Seventh Layer That Does Not Fit Cleanly: Render Economics
Every layer above treats cost implicitly. In creative-media infrastructure, cost is load-bearing. A regeneration of a single hero frame can cost more than a week's licensing spend on the reference model. A batch render that produces 240 candidate variations for a client pitch is a financial event that must be attributed per agent, per workflow, per tenant, and budgeted accordingly. For AI video and film teams, where a single shot can burn through a day of compute credits before anyone has approved it, this is not a secondary concern. It is the one that decides whether the work is viable.
We are treating Render Financial Observability as its own layer rather than as a reporting feature on top of orchestration. It should capture unit economics (cost per accepted frame, not cost per frame), regeneration ratios, and per-agent spending. It should issue non-blocking threshold alerts so autonomous agents can self-govern inside approved envelopes, and it should expose those signals to both the human dashboard and the MCP tool surface, so another agent can reason about them.
The reason this belongs as a layer, not a feature, is symmetry. Tenant scoping, identity, memory, tools, evaluation, and orchestration all expose signals that agents can observe and act on. Cost must be exposed the same way or it stops being governable. A cost field hidden on a billing page is not infrastructure. It is a report.
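The unit economics named above, cost per accepted frame rather than cost per frame, a regeneration ratio, and a non-blocking envelope alert, can be computed in a few lines. The field names and thresholds are illustrative assumptions, not the product's schema.

```python
def render_economics(frames, budget=100.0, alert_at=0.8):
    """Cost per *accepted* frame, regeneration ratio, and a non-blocking
    alert when spend crosses a fraction of the approved envelope.
    Field names are illustrative."""
    spend = sum(f["cost"] for f in frames)
    accepted = [f for f in frames if f["accepted"]]
    return {
        "cost_per_frame": spend / len(frames),
        "cost_per_accepted_frame":
            spend / len(accepted) if accepted else float("inf"),
        "regeneration_ratio": (len(frames) - len(accepted)) / len(frames),
        "alert": spend >= alert_at * budget,  # a signal, not a stop
    }

# 4 candidates rendered, 1 accepted: the accepted frame carries all spend.
m = render_economics([{"cost": 5.0, "accepted": False}] * 3 +
                     [{"cost": 5.0, "accepted": True}])
print(m["cost_per_accepted_frame"])  # → 20.0, versus 5.0 per raw frame
```

The gap between the two cost figures is the whole argument: per-frame cost looks healthy while per-accepted-frame cost, the one a studio actually pays for a usable shot, is four times higher.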
The Accountability Chain
The chain runs from tenant scoping through agent identity and lineage to budget, and it is one continuous argument. Breaking any link breaks the whole, because compliance, auditability, and cost governance all depend on being able to trace a rendered frame back to the agent that made it, the workflow that produced it, the model that ran, the tenant that authorised it, and the budget it consumed.
An Invitation
If you are making AI short films, narrative video, multi-character sequences, animated work, or any pipeline that chains generators into a shot that has to hold together, we want to build the rest of this stack with you rather than at you. Some of these layers are in the product today. Others are sketches on a whiteboard. The ones we feel most strongly about (Domain Memory, Perceptual Evaluation, and Render Economics) are also the ones where a working creative team can tell us what we are getting wrong faster than we can work it out alone.
Reach out through the platform, or follow along as we publish more of the work. The best version of this stack is the one shaped by the people actually making the films.
Key Takeaways
- The generic six-layer agentic stack (compute, identity, memory, tools, billing, orchestration) is a reasonable horizontal baseline but ignores creative-media constraints.
- Domain memory is different from conversational memory, and it is the harder, more defensible layer for AI-native studios and film teams.
- Perceptual evaluation is a distinct layer, not an afterthought, because orchestration cannot close loops without it.
- Render economics deserves a first-class layer. Compute cost is the primary unit of creative-agent value and must be attributed and budgeted per agent, per workflow.
- Tenant scoping, agent identity, and provenance form one continuous accountability chain. Breaking any link breaks the audit trail that creative work is increasingly expected to carry.
- This is a working conjecture, not a finished stack. If you are building through it, we would rather learn with you than ship past you.
Build It With Us
We are building Numonic toward a creative agentic stack that is tenant-scoped, identity-bound, lineage-complete, and cost-aware by design. If you are making AI film or video, we want your hands on what we build next.
Explore Numonic