Learning to build an actionable, composable, and controllable digital twin

Abstract: Simulation has been a driving force behind robot development. With recent advances in computer vision and graphics, simulating sensor observations has drawn particularly wide attention across the community, since it may enable end-to-end testing of full autonomy systems. Unfortunately, existing sensor simulators, while impressive, still fall short in realism and can neither effectively model the outcomes of actions nor hallucinate counterfactual scenarios. In this talk, I will summarize our recent efforts toward this goal.

First, I will discuss how we developed a high-fidelity closed-loop sensor simulator for self-driving vehicles. Our key insight is to build a digital twin directly from real-world data and leverage the compositional structure of the world to decompose the scene into foreground actors and background. This not only allows us to synthesize high-quality sensor observations that minimize the domain gap, but also facilitates better modeling of the interactions between the actors and the scene. Next, I will discuss how we further extend the simulator to generate physically plausible sensor observations under different lighting conditions and improve the robustness of autonomous systems. Finally, I will present our recent efforts on pushing the boundaries of digital twins with generative models. I will showcase how we distill knowledge from multimodal LLMs into existing 3D systems, making them interactive, actionable, and thus suitable for physical intelligence.
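The compositional idea above — modeling foreground actors separately from the background and assembling them into a single observation — can be sketched in miniature. The following is a hypothetical illustration, not the talk's actual system: all names (`Layer`, `composite`) are invented, and real simulators would render neural or mesh assets rather than dictionaries of pixels, but the depth-ordered compositing structure is the same.

```python
# Illustrative sketch of compositional rendering: the scene is split into
# a background model and independently posed foreground actors, and the
# final "sensor observation" is assembled by depth-ordered compositing.
# All class and function names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Layer:
    name: str
    depth: float  # distance from the sensor; smaller = closer
    pixels: dict = field(default_factory=dict)  # (x, y) -> content


def composite(background: Layer, actors: list) -> dict:
    """Paint the background first, then actors from farthest to nearest,
    so nearer actors correctly occlude farther ones."""
    image = dict(background.pixels)
    for actor in sorted(actors, key=lambda a: a.depth, reverse=True):
        image.update(actor.pixels)  # nearer actors overwrite farther ones
    return image


# A toy scene: a road background with two actors that can be swapped,
# moved, or removed independently -- the basis for counterfactual editing.
bg = Layer("road", depth=float("inf"),
           pixels={(0, 0): "asphalt", (1, 0): "asphalt"})
car = Layer("car", depth=10.0, pixels={(0, 0): "car"})
pedestrian = Layer("pedestrian", depth=5.0, pixels={(0, 0): "pedestrian"})

image = composite(bg, [car, pedestrian])
```

Because each actor is its own layer, editing the scene (deleting the car, re-posing the pedestrian) only touches that layer, which is what makes counterfactual scenario generation tractable.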