Generative Digital Twins:
Vision-Language Simulation Models for Executable Industrial Systems
Abstract
We introduce Vision-Language Simulation Models (VLSM) that synthesize executable FlexScript from layout sketches and natural-language prompts for industrial simulation systems. Our framework is trained on GDT-120K, the first large-scale dataset of more than 120,000 prompt–sketch–code triplets that couple textual descriptions, spatial structures, and simulation logic. To evaluate generated scripts beyond surface text similarity, we propose three task-specific metrics: Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), which jointly assess structural integrity, parameter fidelity, and simulator executability. Systematic ablations over visual encoders, connector modules, and code-pretrained backbones show that the proposed VLSM family achieves near-perfect structural accuracy and robust execution, establishing a foundation for generative digital twins that unify visual reasoning and language-based program synthesis.
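As a rough illustration of how the three metrics could be aggregated over an evaluation set, the following is a minimal sketch and not the authors' implementation. It assumes each rate is a simple fraction over samples or parameters. The `EvalRecord` schema, the field names, and the numeric tolerance are all hypothetical, since the text above does not specify how the checks are performed.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """Outcome of checking one generated script (hypothetical schema)."""
    structurally_valid: bool   # assumed: structure of the script passed a validity check
    executed: bool             # assumed: the simulator ran the script without error
    predicted_params: dict     # parameter name -> value read from the generated script
    reference_params: dict     # parameter name -> value from the ground-truth script


def rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def parameter_match_rate(records, rel_tol=1e-6):
    """Share of reference parameters reproduced by the generated scripts."""
    matched = 0
    total = 0
    for r in records:
        for name, ref in r.reference_params.items():
            total += 1
            pred = r.predicted_params.get(name)
            if pred is None:
                continue  # a missing parameter counts as a mismatch
            if isinstance(ref, (int, float)) and isinstance(pred, (int, float)):
                # numeric values are compared with a relative tolerance (assumption)
                matched += abs(pred - ref) <= rel_tol * max(1.0, abs(ref))
            else:
                matched += pred == ref
    return matched / total if total else 0.0


def evaluate(records):
    """Aggregate SVR, PMR, and ESR over a list of EvalRecord objects."""
    return {
        "SVR": rate(r.structurally_valid for r in records),
        "PMR": parameter_match_rate(records),
        "ESR": rate(r.executed for r in records),
    }
```

Under these assumptions, SVR and ESR are per-script pass rates, while PMR is pooled over all reference parameters, so scripts with more parameters weigh more heavily in it.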
Dataset Construction and Timing Distributions
The GDT-120K dataset is built from real factory layouts and control logic, with each sample pairing a hand-drawn layout sketch, a natural-language description, and executable FlexScript. Realistic interarrival and service-time distributions are instantiated from statistical fits to factory logs together with engineering heuristics, so that the generated code can be executed directly in FlexSim without manual repair.
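To make the fitting step concrete, here is a minimal sketch of estimating an interarrival distribution from logged arrival times. It assumes an exponential model purely for illustration, because the text does not name the distribution families used. The function names are hypothetical.

```python
import random
import statistics


def fit_exponential_rate(timestamps):
    """Maximum-likelihood rate of an exponential interarrival model.

    `timestamps` are arrival times in seconds, sorted in increasing order.
    For the exponential distribution, the MLE of the rate is 1 / mean gap.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps or any(g <= 0 for g in gaps):
        raise ValueError("need at least two strictly increasing timestamps")
    return 1.0 / statistics.fmean(gaps)


def sample_interarrival(rate, rng):
    """Draw one interarrival time from the fitted exponential model."""
    return rng.expovariate(rate)


if __name__ == "__main__":
    rng = random.Random(0)
    log = [0.0, 2.0, 4.0, 6.0]          # toy arrival log, not real factory data
    rate = fit_exponential_rate(log)    # mean gap is 2.0 s, so rate is 0.5 per second
    print(rate, sample_interarrival(rate, rng))
```

The fitted rate (or its reciprocal, the mean interarrival time) is the kind of value that would then be written into the timing parameters of a generated script.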
Vision-Language Simulation Models
The proposed VLSM family couples an OpenCLIP visual encoder with code-pretrained language models through a lightweight connector module. Layout sketches are encoded as a set of visual tokens that condition FlexScript generation together with the textual prompt, allowing the model to reason jointly about topology, routing, and timing parameters. Two backbones are explored: TinyLLaMA-1.1B for compact deployment and StarCoder2-7B for stronger code priors.
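One common way to realize such a connector is a learned linear projection that maps visual tokens into the language model's embedding space and prepends them to the text embeddings. The sketch below illustrates that idea in plain Python with toy dimensions. Both the linear form and the prepend ordering are assumptions, since the text describes the connector only as lightweight, and all function names are hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (out_dim x in_dim) and zero bias, for illustration only."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(tokens, weight, bias):
    """Apply y = W x + b to every token vector in `tokens`."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def build_input_sequence(visual_tokens, text_embeddings, weight, bias):
    """Project visual tokens to the LM width and prepend them to the text embeddings."""
    return project(visual_tokens, weight, bias) + text_embeddings


if __name__ == "__main__":
    rng = random.Random(0)
    vis_dim, lm_dim = 4, 6                          # toy sizes, not the real model widths
    weight, bias = make_linear(vis_dim, lm_dim, rng)
    visual_tokens = [[1.0] * vis_dim for _ in range(3)]    # stand-in for encoder output
    text_embeddings = [[0.0] * lm_dim for _ in range(5)]   # stand-in for prompt embeddings
    sequence = build_input_sequence(visual_tokens, text_embeddings, weight, bias)
    print(len(sequence), len(sequence[0]))
```

In this arrangement the language model attends over sketch and prompt tokens in a single sequence, which is what lets generation depend on both inputs at once.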
Prototype Interface
To make the model usable by non-experts, we implement a simple prototype interface in which engineers upload a layout sketch, type a short natural-language description of the production logic, and obtain executable FlexScript that can be loaded into FlexSim. This interface exposes VLSM as a “layout-and-prompt in, digital twin out” tool for rapid what-if analysis.
Qualitative Results
Qualitative comparisons show that VLSM captures both high-level flow logic and fine-grained parameterization. The model recovers workstation topology, buffering strategy, and AGV routing from sketches that contain only coarse geometric cues. Generated FlexScript consistently scores well on the proposed SVR, PMR, and ESR metrics, and the resulting factory simulations are visually and functionally aligned with the ground-truth models.