Generative Digital Twins:
Vision-Language Simulation Models for Executable Industrial Systems

Anonymous Authors
Anonymous Institution(s)
Overview of the Generative Digital Twins framework
Generative Digital Twins pipeline: from layout sketch and natural-language prompt to executable FlexScript and industrial simulation.

Abstract

We introduce Vision-Language Simulation Models (VLSM), which synthesize executable FlexScript for industrial simulation directly from layout sketches and natural-language prompts. Our framework is trained on GDT-120K, the first large-scale dataset of more than 120,000 prompt–sketch–code triplets that couple textual descriptions, spatial structures, and simulation logic. To evaluate generated scripts beyond surface text similarity, we propose three task-specific metrics: Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), which jointly assess structural integrity, parameter fidelity, and simulator executability. Systematic ablations over visual encoders, connector modules, and code-pretrained backbones show that the proposed VLSM family achieves near-perfect structural accuracy and robust execution, establishing a foundation for generative digital twins that unify visual reasoning and language-based program synthesis.
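The abstract names the three metrics but does not define them formally; the following minimal Python sketch shows one plausible corpus-level reading, where parse_flexscript, the gold parameter dictionary, and run_in_flexsim are hypothetical helpers standing in for a FlexScript parser and a headless FlexSim run.

```python
# Hypothetical sketch of the three metrics; the helper callables are
# assumptions, not part of the released framework.

def structural_validity_rate(scripts, parse_flexscript):
    """SVR: fraction of generated scripts that parse into a valid object graph."""
    return sum(parse_flexscript(s) is not None for s in scripts) / len(scripts)

def parameter_match_rate(pred_params, gold_params):
    """PMR: fraction of reference parameters (timings, capacities, counts)
    reproduced exactly in the generated script."""
    matched = sum(pred_params.get(k) == v for k, v in gold_params.items())
    return matched / len(gold_params)

def execution_success_rate(scripts, run_in_flexsim):
    """ESR: fraction of scripts that load and run in FlexSim without errors."""
    return sum(run_in_flexsim(s) for s in scripts) / len(scripts)
```

Under this reading, SVR and ESR are per-script pass rates, while PMR is averaged over the reference parameters of each ground-truth model.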

Dataset Construction and Timing Distributions

The GDT-120K dataset is built from real factory layouts and control logic, with each sample pairing a hand-drawn layout sketch, a natural-language description, and executable FlexScript. Realistic interarrival and service-time distributions are instantiated by statistical fitting over factory logs, combined with engineering heuristics, so that the generated code executes directly in FlexSim without manual repair.

GDT-120K dataset construction workflow
GDT-120K dataset construction: from curated production layouts and scheduling rules to prompt–sketch–code triplets that capture both spatial structure and simulation logic.
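To illustrate the timing-distribution step, here is a minimal sketch, assuming exponential interarrival and lognormal service times fitted with scipy; the distribution families, the synthetic stand-in log data, and the emitted FlexScript call syntax are all assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-ins for timestamps and cycle times parsed from factory logs.
arrivals = np.sort(rng.uniform(0, 3600, size=500))       # seconds
service = rng.lognormal(mean=3.0, sigma=0.4, size=500)   # seconds

# Exponential interarrival: the MLE of the mean gap is the sample mean.
gaps = np.diff(arrivals)
mean_gap = gaps.mean()

# Lognormal service time, location pinned at 0 for physically sensible times.
shape, _, scale = stats.lognorm.fit(service, floc=0)
mu, sigma = np.log(scale), shape  # parameters of the underlying normal

# Emit distribution expressions to splice into the generated FlexScript
# (FlexSim-style call syntax shown here as an assumption).
print(f"exponential(0, {mean_gap:.1f}, getstream(current))")
print(f"lognormal2(0, {mu:.3f}, {sigma:.3f}, getstream(current))")
```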

Vision-Language Simulation Models

The proposed VLSM family couples an OpenCLIP visual encoder with code-pretrained language models through a lightweight connector module. Layout sketches are encoded as a set of visual tokens that condition FlexScript generation together with the textual prompt, allowing the model to reason jointly about topology, routing, and timing parameters. Two backbones are explored: TinyLLaMA-1.1B for compact deployment and StarCoder2-7B for stronger code priors.

Vision-Language Simulation Model architecture
VLSM architecture: an OpenCLIP encoder extracts visual tokens from the layout sketch, which are projected by a connector and concatenated with text tokens before being decoded by a code-pretrained LLM into executable FlexScript.
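No architectural code accompanies the text; the following PyTorch sketch shows one way to realize the described design, assuming a single linear projection for the connector and embedding-level concatenation (both assumptions, since the text specifies only a "lightweight connector module").

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Projects OpenCLIP patch tokens into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, vision_dim)
        return self.proj(visual_tokens)

class VLSM(nn.Module):
    """Sketch of the pipeline: sketch tokens + prompt tokens -> FlexScript."""
    def __init__(self, vision_encoder, llm, vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., an OpenCLIP ViT
        self.connector = Connector(vision_dim, llm_dim)
        self.llm = llm                        # e.g., TinyLLaMA-1.1B or StarCoder2-7B

    def forward(self, sketch_pixels, prompt_embeds):
        visual_tokens = self.vision_encoder(sketch_pixels)         # (B, P, Dv)
        visual_embeds = self.connector(visual_tokens)              # (B, P, Dl)
        inputs = torch.cat([visual_embeds, prompt_embeds], dim=1)  # prepend sketch
        return self.llm(inputs_embeds=inputs)                      # next-token logits
```

A single linear projection keeps the connector cheap to train and leaves both the OpenCLIP encoder and the code LLM free to be frozen or fine-tuned independently.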

Prototype Interface

To make the model accessible to non-experts, we implement a prototype interface in which engineers upload a layout sketch, type a short natural-language description of the production logic, and receive executable FlexScript that loads directly into FlexSim. This interface exposes VLSM as a “layout-and-prompt in, digital twin out” tool for rapid what-if analysis.

User interface for Generative Digital Twins
Prototype interface for generative digital twins. Users provide a sketch and a short description, and VLSM returns FlexScript that instantiates the corresponding factory in FlexSim.
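As a concrete illustration of this workflow, here is a minimal sketch of such an interface using Gradio; generate_flexscript is a hypothetical stub standing in for VLSM inference, and the choice of UI library is itself an assumption.

```python
import gradio as gr

def generate_flexscript(sketch, prompt):
    # Placeholder for VLSM inference: encode the sketch, condition on the
    # prompt, and decode FlexScript. Replaced here by a stub.
    return f"// FlexScript generated for prompt: {prompt!r}\n// (model output here)"

demo = gr.Interface(
    fn=generate_flexscript,
    inputs=[
        gr.Image(type="pil", label="Layout sketch"),
        gr.Textbox(label="Production-logic description"),
    ],
    outputs=gr.Textbox(label="Executable FlexScript"),
    title="Layout-and-prompt in, digital twin out",
)

if __name__ == "__main__":
    demo.launch()
```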

Qualitative Results

Qualitative comparisons highlight that VLSM captures both high-level flow logic and fine-grained parameterization. The model recovers workstation topology, buffering strategy, and AGV routing from sketches that contain only coarse geometric cues. Generated FlexScript scores consistently high on the proposed SVR, PMR, and ESR metrics, and produces factory simulations that are visually and functionally aligned with the ground-truth models.

Main qualitative examples of layout-to-FlexScript generation
Qualitative examples on held-out layouts. For each case, a sketch and textual description are converted into FlexScript that reproduces correct routing, resource assignment, and timing parameters when executed in FlexSim.
Comparison across different LLM backbones on the same factory layout
Comparison of eight language backbones on a single production line. TinyLLaMA-1.1B and StarCoder2-7B closely match the ground-truth digital twin, while generic LLaMA variants omit machines or misplace buffers, confirming that code-specialized models are critical for executable digital twins.
Additional qualitative example with AGV-based layout
Additional example with AGV-based material handling and shared buffers. VLSM infers the number of vehicles, docking logic, and buffer capacities from multimodal inputs, yielding structurally valid and executable code.
Additional qualitative example with mixed manual and automated stations
Example with mixed manual and automated stations. The generated digital twin preserves re-entrant flows and complex precedence constraints, demonstrating that the model scales to dense production lines with heterogeneous workstation types.