Technical Architecture

Tool-Specific Metadata Extraction: Reading Every Format AI Tools Produce

Each AI generation tool writes metadata in its own format, in its own location, with its own quirks. ComfyUI uses PNG text chunks with JSON. A1111 uses PNG parameters strings. Midjourney uses Discord messages. Extraction must handle all of them — correctly, completely, and without data loss.

February 25, 2026 · 11 min read · Numonic Team

The PNG specification allows arbitrary text metadata through tEXt, iTXt, and zTXt chunks. ComfyUI uses this mechanism to embed two large JSON structures — the workflow graph and the prompt execution data — under specific keywords. Stable Diffusion Web UI (A1111) uses a different approach: a plain-text “parameters” string with a specific formatting convention. InvokeAI uses yet another format. And tools like Midjourney do not embed metadata in images at all.
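
Reading those chunks correctly means walking the PNG chunk stream directly, since the three text chunk types differ in layout: tEXt is plain Latin-1, zTXt is zlib-compressed, and iTXt is UTF-8 with optional compression and extra language fields. A minimal sketch of such a reader, using only the Python standard library (the function name and return shape are illustrative, not a specific tool's API):

```python
import struct
import zlib

def read_png_text_chunks(data: bytes) -> dict[str, str]:
    """Return {keyword: text} for every tEXt, zTXt, and iTXt chunk in a PNG."""
    assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG file"
    chunks = {}
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8 : pos + 8 + length]
        if ctype == b"tEXt":
            keyword, _, text = body.partition(b"\x00")
            chunks[keyword.decode("latin-1")] = text.decode("latin-1")
        elif ctype == b"zTXt":
            keyword, _, rest = body.partition(b"\x00")
            # rest[0] is the compression method byte (0 = zlib deflate)
            chunks[keyword.decode("latin-1")] = zlib.decompress(rest[1:]).decode("latin-1")
        elif ctype == b"iTXt":
            keyword, _, rest = body.partition(b"\x00")
            comp_flag = rest[0]
            # skip language tag and translated keyword (both null-terminated)
            _, _, rest = rest[2:].partition(b"\x00")
            _, _, text = rest.partition(b"\x00")
            if comp_flag == 1:
                text = zlib.decompress(text)
            chunks[keyword.decode("latin-1")] = text.decode("utf-8")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return chunks
```

Because the loop reads each chunk's declared length in full, there is no fixed buffer to silently truncate a 200-kilobyte workflow blob.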

An asset management system that claims to support AI-generated content must extract metadata from all of these formats. This is not a simple parsing problem — each format has edge cases, size limits, encoding variations, and failure modes that a production extractor must handle gracefully. Getting extraction right is the foundation of everything that follows: search, lineage, reproducibility, and compliance all depend on correct, complete metadata extraction.

The Forces at Work

  • Format diversity: Even within the PNG specification, tools use different chunk types (tEXt vs. iTXt), different keywords (workflow, prompt, parameters, Dream, invokeai_metadata), and different encoding strategies (plain text, JSON, base64). The extractor must probe for multiple formats and identify which tool produced the file before attempting to parse its metadata.
  • Large payload sizes: ComfyUI workflow JSON can exceed 200 kilobytes for complex workflows with forty or more nodes. Many PNG metadata libraries have fixed buffer sizes that silently truncate large text chunks, producing corrupted JSON that fails to parse. Both JSON blobs — the workflow graph and the prompt execution data — can be very large, and both must be extracted completely.
  • Custom node contamination: ComfyUI's extensibility means custom nodes can inject arbitrary data into the workflow and prompt structures. Some custom nodes add fields with non-standard types, circular references, or extremely large values. The extractor must handle these gracefully — extracting what it can and flagging what it cannot — without crashing on malformed input.
  • Non-image sources: Not all generation metadata comes from image files. Midjourney metadata comes from Discord messages. Some tools produce metadata in sidecar files (JSON or XML alongside the image). Some export metadata through APIs. The extraction layer must support file-based, message-based, and API-based sources.
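
The custom-node contamination problem in particular calls for parsing that degrades gracefully: collect warnings instead of raising, and sanitize what cannot be kept. A sketch of that approach, with an illustrative function name and a hypothetical size limit (the `widgets_values` key is part of ComfyUI's workflow JSON; the truncation policy is an assumption):

```python
import json

def safe_parse_workflow(raw: str, max_value_len: int = 100_000):
    """Parse a ComfyUI workflow blob, collecting warnings instead of raising.

    Returns (data, warnings); data is None when the JSON itself is invalid.
    Oversized widget values (e.g. injected by custom nodes) are truncated
    and flagged rather than kept inline.
    """
    warnings = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    for node in data.get("nodes", []):
        for i, value in enumerate(node.get("widgets_values", [])):
            if isinstance(value, str) and len(value) > max_value_len:
                node["widgets_values"][i] = value[:max_value_len]
                warnings.append(
                    f"node {node.get('id')}: widget value truncated "
                    f"({len(value)} chars)"
                )
    return data, warnings
```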

The Problem

Most metadata extraction tools are built for traditional photography metadata: EXIF data measured in bytes, with well-defined field types and standard value ranges. AI generation metadata breaks these assumptions in every dimension: payloads run to hundreds of kilobytes rather than bytes, values are nested JSON structures rather than typed scalar fields, and every tool invents its own keywords and encodings.

A metadata extraction library designed for EXIF will fail on ComfyUI output — not with an error, but with silent data loss. It will read the standard EXIF fields (resolution, color space) and ignore the ComfyUI-specific text chunks that contain the generation parameters. The result is an image with “extracted metadata” that is missing everything that matters for AI asset management.

The Solution: Probing Extractors with Tool Detection

A robust extraction system uses a probe-first architecture: before attempting to parse metadata in any specific format, it probes the file to determine which tool produced it and which metadata format to expect.

Tool Detection

The first extraction step is tool identification. The system checks for signature metadata markers: a “workflow” text chunk indicates ComfyUI. A “parameters” text chunk with the A1111 formatting pattern indicates Stable Diffusion Web UI. An “invokeai_metadata” chunk indicates InvokeAI. The absence of any AI-specific text chunks, combined with characteristic EXIF patterns, may indicate DALL-E or Midjourney. The detection step routes the file to the correct tool-specific parser.
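
The detection step reduces to a short, ordered series of signature checks. A sketch (the function name, the returned tool labels, and the argument shape are illustrative; the chunk keywords are the ones the tools actually use):

```python
def detect_tool(text_chunks: dict[str, str]) -> str:
    """Route a file to a parser based on signature metadata markers.

    text_chunks maps PNG text-chunk keywords to their values, as produced
    by a chunk reader. Order matters: the most specific markers are
    checked first, and anything unrecognized falls back to "generic".
    """
    if "workflow" in text_chunks or "prompt" in text_chunks:
        return "comfyui"
    if "invokeai_metadata" in text_chunks or "Dream" in text_chunks:
        return "invokeai"
    if "parameters" in text_chunks:
        return "a1111"
    # No AI-specific chunks: fall through to the generic EXIF/XMP parser,
    # which may still infer DALL-E or Midjourney from EXIF patterns.
    return "generic"
```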

Format-Specific Parsers

Each tool gets a dedicated parser that understands its specific metadata format:

  • ComfyUI parser: Reads “workflow” and “prompt” text chunks. Validates both as JSON. Handles large payloads (no fixed buffer). Extracts node types, connections, widget values from the workflow blob. Extracts resolved parameters, seeds, model paths from the prompt blob. Handles custom node fields gracefully.
  • A1111 parser: Reads “parameters” text chunk. Parses the structured text format: first line is positive prompt, “Negative prompt:” line is negative prompt, remaining lines are key-value parameters (Steps, Sampler, CFG scale, Seed, Model, etc.). Handles multi-line prompts and BREAK tokens.
  • Midjourney parser: Operates on Discord message data rather than image file metadata. Extracts prompt text, parameter flags, job ID, variation/upscale relationships from message content and formatting.
  • Generic parser: For images with no recognized AI-specific metadata, extracts standard EXIF, XMP, and IPTC data. Captures technical metadata (dimensions, color space, camera info if present) and flags the asset as having no AI generation metadata detected.
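
The A1111 format is the most approachable of these to sketch: positive prompt lines, an optional “Negative prompt:” section, and a final line of comma-separated key-value settings. A minimal parser under those assumptions (function name illustrative; real extension parameters can embed commas and need more careful splitting than shown here):

```python
import re

def parse_a1111_parameters(text: str) -> dict:
    """Split an A1111 'parameters' string into prompt, negative prompt,
    and key-value settings. Assumes the last line holds the settings."""
    lines = text.strip().split("\n")
    settings = {}
    if lines and ":" in lines[-1]:
        # split the settings line at commas that precede a "Key:" token
        for pair in re.split(r",\s*(?=[\w ]+:)", lines[-1]):
            key, _, value = pair.partition(":")
            settings[key.strip()] = value.strip()
        lines = lines[:-1]
    positive, negative = [], []
    target = positive
    for line in lines:
        if line.startswith("Negative prompt:"):
            target = negative
            line = line[len("Negative prompt:"):].lstrip()
        target.append(line)
    return {
        "prompt": "\n".join(positive).strip(),
        "negative_prompt": "\n".join(negative).strip(),
        "settings": settings,
    }
```

Because the prompt sections are accumulated line by line, multi-line prompts (including BREAK tokens on their own lines) pass through intact.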

Extraction Quality Reporting

Each extractor reports what it found and what it could not parse. The extraction result includes a quality assessment: full extraction (all expected fields parsed successfully), partial extraction (some fields parsed, some failed), or minimal extraction (only basic technical metadata). This quality signal flows into the normalization pipeline and eventually to the user, who can see which assets have rich metadata and which have gaps.
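The three-tier quality signal can be derived mechanically from what the parser recorded. A sketch of one possible result shape (all names are illustrative, not Numonic's actual schema):

```python
from dataclasses import dataclass, field
from enum import Enum

class Quality(Enum):
    FULL = "full"        # all expected fields parsed successfully
    PARTIAL = "partial"  # some fields parsed, some failed
    MINIMAL = "minimal"  # only basic technical metadata

@dataclass
class ExtractionResult:
    tool: str
    fields: dict = field(default_factory=dict)   # successfully parsed fields
    errors: list = field(default_factory=list)   # fields that failed to parse

    @property
    def quality(self) -> Quality:
        if self.fields and not self.errors:
            return Quality.FULL
        if self.fields:
            return Quality.PARTIAL
        return Quality.MINIMAL
```

Deriving quality from the parse record, rather than having each extractor self-report, keeps the signal consistent across parsers.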

Consequences

  • Extractor maintenance burden: Each tool-specific parser must be maintained as tools evolve. When ComfyUI adds new node types or changes its JSON structure, the parser must be updated. When Midjourney changes its Discord message formatting, the parser must adapt. This is a continuous maintenance cost, but it is isolated to the extraction layer — downstream systems work with the normalized schema and are unaffected by extractor changes.
  • Extensibility: New tools can be supported by adding a new detection signature and a new parser. The architecture is designed for this — the probe-first approach means the system gracefully handles unknown formats (falling back to the generic parser) while providing rich extraction for recognized tools.
  • Testing complexity: Each parser needs its own test suite with real-world examples from its tool. Edge cases differ per tool: ComfyUI edge cases involve custom nodes and large workflows; A1111 edge cases involve multi-line prompts and extension-specific parameters; Midjourney edge cases involve message threading and variation chains. A comprehensive test suite requires a library of real metadata samples from each tool.
  • Foundation for everything: Correct extraction is the prerequisite for every other capability. Search quality depends on extraction completeness. Reproducibility depends on extracting the right parameters. Lineage tracking depends on extracting relationships. Investing in extraction quality pays compound returns across the entire system.


Every Tool. Every Format. Fully Extracted.

Numonic's extractors read metadata from ComfyUI, A1111, Midjourney, DALL-E, and more — capturing every generation parameter so nothing is lost when you import your work.

Try Numonic Free