Volume 2025, Issue 1, 5124400
Review Article
Open Access

An Overview of Robot Embodied Intelligence Based on Multimodal Models: Tasks, Models, and System Schemes

Yao Cong and Hongwei Mo (Corresponding Author)
College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China

First published: 14 June 2025
Academic Editor: Said El Kafhali

Abstract

The exploration of embodied intelligence has garnered widespread consensus in the field of artificial intelligence (AI), aiming to achieve artificial general intelligence (AGI). Classical AI models, which rely on labeled data for learning, struggle to adapt to dynamic, unstructured environments due to their offline learning paradigms. Conversely, embodied intelligence emphasizes interactive learning, acquiring richer information through environmental interactions for training, thereby enabling autonomous learning and action. Early embodied tasks primarily centered on navigation. With the surge in popularity of large language models (LLMs), the focus shifted to integrating LLMs/multimodal large models (MLM) with robots, empowering them to tackle more intricate tasks through reasoning and planning, leveraging the prior knowledge imparted by LLM/MLM. This work reviews initial embodied tasks and corresponding research, categorizes various current embodied intelligence schemes deployed in robotics within the context of LLM/MLM, summarizes the perception–planning–action (PPA) paradigm, evaluates the performance of MLM across different schemes, and offers insights for future development directions in this domain.

1. Introduction

Embodied intelligence currently poses the greatest challenge in the field of artificial intelligence (AI). The concept has existed since 1950. In recent years, as AI algorithms, theory, and hardware have advanced, both research and industry have shifted their focus to embodied intelligence, aiming to uncover a path to artificial general intelligence (AGI) through its exploration [1–4]. Embodied cognition theory posits that thinking and cognition largely depend on and originate from the body: the structure of the body and nervous system and the activity of the sensory and motor systems determine how we understand the world and shape our way of thinking. Embodied cognition thus emphasizes the importance of physical entities. Inspired by this, the ultimate objective of AI is to devise human-like intelligent robots capable of autonomous planning, decision-making, and execution. Such agents must possess physical bodies that enable multifaceted interaction with the real world, perceive and comprehend their surroundings as humans do, and handle complex tasks through autonomous learning, capabilities far beyond those of current robots [5–7].

Achieving this goal requires understanding the origins of human intelligence. While certain contributing factors have been identified, the comprehensive mechanism of intelligence generation remains elusive. Scholars have therefore proposed various hypotheses, among them the prominent view that intelligence stems from environmental interaction, which underpins the concept of embodied intelligence [8–11]. Unlike AI models that typically lack a physical body, humans learn through interaction, an ability postulated to stem from our capacity to perceive our bodies in space: sensing the position of our limbs during movement, discerning other objects and media, and consciously manipulating our body parts to engage with real-world objects [12, 13].

The embodied intelligence theory underscores the unity of body and mind. Classical AI models rely on offline learning, with environmental information sourced primarily from human-annotated datasets because they lack a physical body; this limits their applicability to robots. Such robots are furnished only with a rigid set of operating rules and are constrained to learning within a narrow action space. Consequently, they lack autonomous learning capability in dynamic, unstructured environments, making interaction with such environments particularly challenging; moreover, in the absence of certain labels, they cannot acquire the related concepts [14–16]. From this perspective, embodied intelligence can be understood as the capacity to attain intelligent behavior through the interplay between a physical body and its surrounding environment. In contrast to the traditional third-person learning paradigm of AI, the objective of embodied intelligence is to perceive the environment from an egocentric perspective and to continually learn, through interactive environmental information, to address the challenges posed by dynamic, unstructured settings.
An embodied intelligent system is generally regarded as encompassing three pivotal components: (1) a physical entity endowed with environmental perception, motion, and operational execution capabilities; (2) an intelligent core tasked with essential functions such as perception, understanding, decision-making, and control, which includes interpreting semantic information within the environment, grasping specific tasks, and making decisions based on environmental changes and target states to direct the physical entity in task completion, as well as learning decision-making and control paradigms from intricate data and evolving continuously to accommodate more sophisticated tasks and environments; and (3) extensive training data.

Over the past few decades, AI technology has evolved continuously, with research institutions across various countries persistently developing AGI robots and endeavoring to extend existing AI models into embodied contexts. In 2016, the Go match between Google DeepMind's AlphaGo program and a human champion marked a milestone in AI, demonstrating the promising applications of deep reinforcement learning (DRL) in interactive settings. Reinforcement learning (RL) focuses on learning a strategy that maximizes the long-term rewards and penalties resulting from the agent's interaction with the environment. Its core principle involves real-time interaction with the environment, learning from a sequence of successes and failures, and maximizing the cumulative reward earned from the environment, ultimately enabling the agent to adopt the optimal policy. This method is commonly applied to tasks within a limited space of states and actions. Since 2013, researchers have consistently delved into the theory of RL, exerting a substantial influence across numerous domains. Notably, OpenAI developed Dactyl, a system that dexterously manipulates physical objects with a humanoid robot hand, while DeepMimic enables simulated characters to master intricate motor skills by imitating reference motion data. The fusion of robots and RL enables robots to tackle diverse tasks within a specified spatial range with robust adaptability and autonomy, and RL has established the cornerstone for environment interaction in embodied intelligence research. As natural language processing (NLP) capabilities advanced and tasks such as visual question answering and image captioning were introduced, there has been increasing emphasis on exploring more embodied tasks and on extending multimodal models into embodied environments. The advent of ChatGPT in 2022 brought large language models (LLMs), such as GPT, into the spotlight, underscoring the emergent capabilities of LLMs and their profound impact on preceding multimodal models, and subsequently making multimodal large models (MLM) a focal point of research. Various research institutions have incorporated LLM/MLM into embodied robot research, positioning robots as the executors of LLM/MLM directives, while the LLM/MLM serves as the controller and decision-maker for the robot, interacting with its environment to accomplish tasks.

This work reviews previous research on embodied intelligence. The rest of this paper is organized as follows. Section 2 summarizes various embodied tasks (embodied exploration, visual language navigation (VLN), and embodied question answering) in terms of their relevant model frameworks, algorithms, and research tools. Section 3 introduces the current research progress and achievements of embodied intelligence guided by LLM/MLM. Section 4 evaluates the performance of MLMs in those embodied schemes. Finally, Section 5 analyzes the current development status of embodied intelligence, summarizes the application limitations of each embodied solution, and discusses future challenges.

2. Review of the Development of Embodied Tasks

The embodied performance of robot systems primarily manifests in their interaction with the environment or scene, and different embodied tasks place distinct requirements on an intelligent agent's scene interaction. In this section, we concisely review various early embodied tasks, delineating their respective meanings, backgrounds, methodologies, and models. Early embodied tasks can broadly be categorized into embodied exploration, embodied navigation, and embodied question answering. These tasks are interconnected, exhibiting a dependency relationship among them, which we will elaborate on later.

2.1. Embodied Navigation

Since the introduction of the first mobile robots in the 1960s, navigation has garnered significant attention as a pivotal capability of robots. As a core technology, it establishes the groundwork for robots to accomplish diverse complex tasks and finds extensive application across scenarios in production and daily life. Mobile robot navigation encompasses multilayered architectures spanning perception, decision-making, and control. Robot navigation can be categorized by system framework into modular framework–based and end-to-end framework–based navigation, by the presence or absence of environmental priors into map-based and mapless navigation, and by signal input into visual navigation and VLN.

2.1.1. Modular Framework–Based Navigation and End-to-End Framework–Based Navigation

The classic approach to robot navigation uses a modular framework to decompose the navigation task into several subtasks such as positioning, map building, path planning, and motion. Each subtask corresponds to a particular solution, and the overall framework accommodates diverse algorithms such as classical control and machine learning [17]. Positioning algorithms enable robots to locate themselves and their targets, and navigation tasks in open environments can use the target point's orientation as the forward direction. Simultaneous localization and mapping (SLAM) and related algorithms construct map models that provide robots with environmental priors. Path planning aims to find feasible paths from the initial to the target position in configuration space while avoiding obstacles. Positioning, mapping, and planning enhance overall navigation efficiency through endpoint guidance, environmental priors, and global guidance, respectively [18]. Despite the maturity of modular system–based methods, they rely heavily on manual design and are susceptible to accumulated sensor noise, which reduces the robustness of the related algorithms and hinders deployment in unfamiliar complex environments. In contrast, end-to-end framework–based navigation leverages end-to-end training to acquire comprehensive navigation knowledge from data. Its advantage lies in eliminating manual work and even intermediate steps such as mapping and planning. For instance, Mousavian et al. [19] proposed using semantic segmentation and detection masks, obtained through state-of-the-art computer vision algorithms, as observations and learning the navigation policy with a deep network. This data-driven strategy directly learns a mapping from raw observations to actions in an end-to-end manner. Similarly, the cognitive mapping and planning (CMP) approach of Gupta et al. [20] learns navigation from data, albeit still building a map: the model comprises a mapper that processes first-person images from the robot and integrates observations into a latent spatial memory, together with a planner that uses this memory to select actions, allowing online use in novel environments without a preconstructed map. Evidently, in end-to-end framework navigation, the navigation problem revolves around learning a policy that takes the inputs at every time step (current image, egomotion, and target specification) and outputs actions that swiftly bring the robot to the target.
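To make the end-to-end formulation concrete, the following is a minimal, hedged sketch of a policy that maps the current RGB observation and a goal embedding directly to a discrete action, with no explicit mapping or planning module. Layer sizes, input resolution, and the action set are illustrative assumptions, not the architecture of any system cited above.

```python
# Minimal sketch of an end-to-end navigation policy: current RGB observation plus a
# goal embedding -> discrete action logits, with no explicit map or planner.
import torch
import torch.nn as nn

class EndToEndNavPolicy(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(                 # image -> flat feature vector
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(num_actions)        # fuses image + goal features

    def forward(self, rgb, goal):
        feat = self.encoder(rgb)                      # (B, F)
        fused = torch.cat([feat, goal], dim=1)        # (B, F + goal_dim)
        return self.head(fused)                       # action logits

policy = EndToEndNavPolicy()
logits = policy(torch.rand(1, 3, 96, 96), torch.rand(1, 32))
action = logits.argmax(dim=1)                         # greedy action selection
```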

2.1.2. Map-Based Navigation and Mapless Navigation

Common maps used in navigation primarily include grid maps, geometric maps, topological maps, and semantic maps. Grid maps depict the environment as an array of grids on a plane, using binary values to designate regions as traversable or obstacle-laden, thereby establishing a foundation for subsequent path planning; their construction typically relies on LiDAR sensors [21]. Grid maps have an intuitive format and are easy to create and maintain, but navigation accuracy is directly affected by grid size: a fine resolution may consume substantial memory and hamper search efficiency, whereas a coarse resolution may yield an imprecise or distorted representation of regional information. Geometric maps reconstruct the environment by representing obstacle information through geometric attributes such as point, line, and plane features. Localization entails measuring the environmental data perceived by cameras and contrasting it with the reconstructed environmental framework to pinpoint the robot's precise location through feature estimation techniques [22]. Topological maps are described by key nodes and the lines connecting them, representing the environment without precise physical coordinates or dimensions and focusing instead on nodes and their topological relationships [23, 24]. Nodes signify locations within the environment, while lines denote traversability between connected nodes. Topological maps offer internode distance and orientation information to aid robotic movement between nodes; they require minimal memory, exhibit short search times, and display excellent real-time performance in navigation and positioning systems, but they falter in depicting complex structured environments. As an advanced representation, semantic maps delineate environmental features and semantic information, offering extensive data on open spaces, observation points, uncharted territory, observed objects, and agent actions, which constitutes a highly valuable and compact representation for agent navigation [25–27]. Furthermore, with the progress of multimodal technology, emerging map types better encapsulate environmental priors. Huang et al. [28] proposed audio-visual language maps (AVLMaps), a unified 3D spatial map representation that stores cross-modal information derived from audio, visual, and linguistic cues, enhancing navigation target indexing accuracy. When robots navigate unknown environments, they must accomplish three tasks: safe exploration/navigation, surveying, and positioning. Completing exploration and surveying of unknown environments online and autonomously constitutes SLAM, in which a mobile robot incrementally constructs a comprehensive environmental map through continuous observation via cameras without prior knowledge of its location [29].
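As a concrete illustration of the grid-map representation and its resolution trade-off described above, the following is a minimal sketch of an occupancy grid; the cell size, world extent, and interface are illustrative assumptions rather than any particular mapping library.

```python
# Minimal occupancy-grid sketch: the environment is discretized into cells,
# each marked free (0) or occupied (1). A finer resolution costs more memory.
import numpy as np

class OccupancyGrid:
    def __init__(self, size_m=20.0, resolution_m=0.1):
        n = int(size_m / resolution_m)
        self.res = resolution_m
        self.grid = np.zeros((n, n), dtype=np.uint8)   # 0 = free, 1 = occupied

    def world_to_cell(self, x, y):
        return int(x / self.res), int(y / self.res)

    def mark_obstacle(self, x, y):
        i, j = self.world_to_cell(x, y)
        self.grid[i, j] = 1

    def is_free(self, x, y):
        i, j = self.world_to_cell(x, y)
        return self.grid[i, j] == 0

grid = OccupancyGrid()
grid.mark_obstacle(3.2, 4.7)        # e.g., a LiDAR return at (3.2 m, 4.7 m)
print(grid.is_free(3.2, 4.7))       # False: this cell now blocks path planning
```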

Mapless navigation dispenses with environmental map construction; robotic activity instead depends on real-time environmental observations captured by cameras, obviating the need to specify absolute obstacle coordinates. DRL is the predominant method for mapless navigation. Yet DRL demands extensive data, and its models are susceptible to environmental disturbances and generalize poorly to new targets, which hampers real-world application. Consequently, enhancing generalization to new targets and reducing training cost are crucial for DRL-based mapless navigation [30, 31]. Research in this domain has been active. Zhu et al. [32] proposed a target-driven visual navigation approach that seeks the shortest action sequence relocating an agent from its current position to a target designated by an RGB image. It learns a stochastic policy function with dual inputs, the current state and the target, generating a probability distribution over the action space and thereby obviating model retraining when new targets are introduced. Kahn et al. [33] introduced a generalized computation graph for learning collision avoidance strategies for mobile robots, formalizing the problem as an RL task in which the robot is rewarded for collision-free navigation.
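The target-driven idea in [32] can be sketched as follows: the current observation and the target image pass through a shared encoder, and the policy is conditioned on both, so changing the target only changes an input rather than requiring retraining. This is a hedged illustration with assumed layer sizes, not the authors' actual architecture.

```python
# Sketch of a target-driven policy in the spirit of [32]: observation and target
# image share one encoder; the policy outputs an action distribution conditioned
# on both, so new targets need no retraining. Sizes are illustrative.
import torch
import torch.nn as nn

class TargetDrivenPolicy(nn.Module):
    def __init__(self, num_actions=4, embed_dim=256):
        super().__init__()
        self.embed = nn.Sequential(                   # shared observation/target encoder
            nn.Conv2d(3, 32, 8, 4), nn.ReLU(), nn.Flatten(), nn.LazyLinear(embed_dim)
        )
        self.policy = nn.Sequential(
            nn.Linear(2 * embed_dim, 256), nn.ReLU(), nn.Linear(256, num_actions)
        )

    def forward(self, obs, target):
        joint = torch.cat([self.embed(obs), self.embed(target)], dim=1)
        return torch.distributions.Categorical(logits=self.policy(joint))

pi = TargetDrivenPolicy()
dist = pi(torch.rand(1, 3, 84, 84), torch.rand(1, 3, 84, 84))
action = dist.sample()                                 # stochastic action selection
```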

2.1.3. Visual Navigation and VLN

Within the classic navigation framework, robot navigation based on laser sensors and infrared/ultrasonic sensors has advanced rapidly and is widely used. However, their high cost and limited adaptability to adverse weather, such as rain and snow, have impeded practical deployment in navigation tasks. In contrast, visual sensors are cost-effective and retain environmental semantic information, indicating superior prospects for vision-based navigation. Vision-based indoor navigation captures ambient information via cameras, identifies surrounding obstacles and non-obstacles, and plots feasible paths, enabling autonomous mobile robot navigation.

A common approach for map-based visual navigation involves extracting and processing features of the operational environment prior to navigation. A global map is then established, stored in the robot's database, and retrieved for matching and positioning during navigation: image features or landmarks captured by cameras are compared with maps in the database, matching probabilities are calculated, the robot's pose is determined, and an appropriate path is planned by the planning module [34]. Users favor navigation systems with simple structures, robust adaptability to complex environments, and autonomous capabilities, and modular framework navigation methods increasingly fail to meet the demands of intelligent autonomous navigation. Target-driven visual navigation has emerged as a superior solution [35, 36]. From the perspective of navigation targets, the input to the navigation system can be coordinate points, object labels, images, or natural language [37–39]. When targets are defined by coordinate points, the agent's task typically relies on a metric map, guiding it from a specific point to the target location. With label-specified targets, visual navigation entails observing the surrounding environment to identify instances of the specified object categories. For image-based targets, visual observation directs the agent toward the object depicted in the image. Natural language descriptions require the agent to comprehend the description, extract target information, and match it with semantic scene data, encompassing the advanced task of VLN.

Table 1. Representative work for embodied tasks.
Task | Method | Input | Output | Characteristics | Metric | Dataset/simulator
Embodied navigation | [19] | RGB image, depth image, previous action, success indicator | Action | Mapless; end-to-end | SR | AVD [40], SunCG [41]
Embodied navigation | [20] | Image | Action | Map-based; end-to-end | ADG, SR |
Embodied navigation | [32] | RGB image | Action | Mapless, target-driven; end-to-end | ATL, SR | AI2-THOR [42]
Embodied navigation | [43] | Human instruction, route | Action | Mapless; end-to-end | NE, ADG, SR | R2R [44]
Embodied navigation | [45] | Conceptual caption, text, image sequence | Path | Mapless; pretrained | SR | BnB [46]
Embodied navigation | [47] | Text, RGB image | Action | Mapless; large model-based | PS, PE |
Embodied navigation | [48] | RGB-D image | Action | Map-based; large model-based | SR, SPL, ADG | HM3D [49], Gibson [50]
Embodied navigation | [51] | Text, observation, actions | Action | Mapless; end-to-end; auxiliary task | TL, NE, SR, SPL, CLS, nDTW, SDTW | R2R, RxR [44], NDH [52], REVERIE [53]
Embodied navigation | [54] | Text, RGB image | Action | End-to-end; auxiliary task | TL, NE, OSR, SR, SPL | R2R
Embodied navigation | [55] | Text, image sequence | Action | Mapless; pretrained; auxiliary task | TL, NE, SR, SPL | R2R, RxR, NDH, REVERIE
Embodied exploration | [56] | Image | Action | Curiosity mechanism | AG | VizDoom [57], Super Mario Bros
Embodied exploration | [58] | Image | Action | Novelty mechanism | AG | Atari games
Embodied exploration | [59] | Bump sensor, depth image, RGB image, egocentric map | Action | Coverage mechanism | Coverage, LSR, SPL | House3D [60]
Embodied exploration | [61] | Image sequence | Action | Reconstruction mechanism | RE | SUN360 [62], ModelNet [63]
Embodied question answering | [64] | Text, image sequence, action | Answer, action | Modular | QA, NA | EQA-v1 [64]
Embodied question answering | [65] | Text, image sequence, action | Answer | Modular | QA | IQUAD V1 [65]
Embodied question answering | [66] | Text, image sequence, action | Answer | Modular | QA, NA, SL | MT-EQA [66]
Embodied question answering | [67] | Text, image sequence | Answer | Modular | QA | IQUAD V1

VLN aims to develop embodied agents capable of communicating with humans via natural language, perceiving their environment, and navigating within a real 3D setting. Given natural language instructions, the agent navigates a 3D environment, progressively approaching the target from a random starting point under step-by-step guidance while capturing visual images at each step. This necessitates continuous environmental perception and the generation of action predictions, such as movement and turning, based on the language instructions. Compared with pure visual navigation, VLN requires effective understanding and coordination of multimodal information while enhancing environmental interaction.

Distinct VLN tasks vary in their requirements for agent–environment interaction. Room-to-Room [68] is a prevalent task in which an agent receives a natural language instruction, observes an initial RGB image based on its starting pose, and performs a series of actions (each yielding a new pose and image observation) until it executes a specific stop action. The REVERIE task [53] differs in that the agent must navigate to and identify a remote object (highlighted with a red bounding box) among multiple candidates in a photorealistic 3D environment, given a natural language instruction. Speaker-Follower [43] is a VLN model for Room-to-Room designed to compensate for instruction deficiencies: a speaker model generates textual route instructions, and a follower model navigates based on these instructions. During training, the speaker supplements the follower with additional path instructions, enriching the training data and the instructions themselves.
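The Room-to-Room interaction pattern just described can be summarized by the following hedged sketch of the episode loop: the instruction is encoded once, then the agent repeatedly observes and acts until it emits a stop action. The `env` and `agent` interfaces are hypothetical placeholders, not the API of any simulator cited here.

```python
# Hedged sketch of a Room-to-Room style episode loop: fixed instruction, fresh RGB
# observation at each step, actions emitted until the agent chooses STOP.
STOP = "stop"

def run_vln_episode(env, agent, instruction, max_steps=30):
    instr_feat = agent.encode_instruction(instruction)   # language encoding, done once
    obs = env.reset()                                     # initial observation / pose
    trajectory = []
    for _ in range(max_steps):
        action = agent.act(instr_feat, obs)               # cross-modal action prediction
        trajectory.append(action)
        if action == STOP:                                 # episode ends on explicit STOP
            break
        obs = env.step(action)                             # new pose -> new observation
    return trajectory
```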

VLN involves multiple modalities, vision, language, and action, and poses significant training challenges owing to the high cost and long duration of collecting training data. Current research focuses on adapting models trained in simulation to unseen environments, improving generalization, and enabling real-world application. Pretrained models offer a promising solution, providing transferable multimodal representations that achieve top-tier performance in various vision and language tasks [45, 69, 70]. With the rise of pretrained LLMs and MLMs, their application in VLN has demonstrated substantial advantages. LM-Nav [47] is a robot navigation system leveraging pretrained large language and vision-language models, enabling self-supervised robot navigation that executes natural language instructions without annotated navigation data. Given unstructured text instructions, LM-Nav uses a pretrained language model to decode them into a sequence of textual landmarks, grounds these landmarks in a topological map via a vision-language model, and employs a search algorithm over the graph to find a plan that maximizes a probabilistic objective, which is then executed by a visual navigation model. L3MVN [48] employs LLMs to convey commonsense knowledge for object search, constructs frontier-based environment maps, and selects long-term goals for efficient exploration and search.
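The three-stage decomposition attributed to LM-Nav above can be sketched as follows; all helper functions (`llm.complete`, `vlm.similarity`, `graph.shortest_path`) are hypothetical stand-ins for the roles played by the language model, vision-language model, and graph planner, not the released LM-Nav interface.

```python
# Sketch of an LM-Nav-style pipeline: (1) LLM parses the instruction into ordered
# landmark phrases, (2) a vision-language model grounds each landmark to a node of
# a topological graph, (3) graph search yields a node sequence for the controller.

def extract_landmarks(llm, instruction: str) -> list[str]:
    prompt = f"List, in order, the landmarks mentioned in: '{instruction}'"
    return llm.complete(prompt).splitlines()       # e.g. ["the stop sign", "a blue dumpster"]

def ground_landmarks(vlm, landmarks, graph_nodes):
    # score every (node image, landmark text) pair; keep the best node per landmark
    return [max(graph_nodes, key=lambda n: vlm.similarity(n.image, lm)) for lm in landmarks]

def plan_route(graph, start_node, grounded_nodes):
    route, current = [], start_node
    for node in grounded_nodes:                     # visit grounded landmarks in order
        route += graph.shortest_path(current, node)
        current = node
    return route
```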

In addition, VLN is a sophisticated inference task whose outcome depends on the accumulation of steps. Auxiliary tasks help agents gain a deeper understanding of their environment and current state without additional labeling. This can be achieved by interpreting past behavior, predicting future decision-related information [71, 72], engaging in ongoing objectives such as estimating task completion, and aligning visual input with instructions [51, 73, 74]. AuxRN [54] leverages four auxiliary tasks, Trajectory Retelling, Progress Estimation, Cross-Modal Matching, and Angle Prediction, to generate precise navigation routes. HOP [55] also treats trajectory order as pivotal, pretraining on five proxy tasks, including masked language modeling, Action Prediction with History, Trajectory Structure Matching, Trajectory Order Modeling, and Group Order Modeling, thereby refining the model's trajectory generation. Furthermore, VLN tasks can be extended into multiround dialog tasks, which better suit practical applications. The visual dialog navigation task [75] strives to develop an intelligent agent capable of continuous natural language dialog–based communication and navigation grounded in human responses, requiring memory and cross-modal reasoning abilities. The cross-modal memory network (CMN) utilizes separate language and visual memory modules to recall and comprehend dialogs and visual inputs pertinent to past navigation behavior, further analyzing the decision history to make current navigation decisions.
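The auxiliary-task training used by agents such as AuxRN amounts to adding weighted auxiliary objectives to the main navigation loss; the following is a minimal, hedged sketch in which the particular losses and weights are illustrative assumptions rather than the published settings.

```python
# Sketch of auxiliary-task training for VLN: the action-prediction loss is combined
# with weighted auxiliary objectives (e.g., progress estimation, instruction-trajectory
# matching). Losses and weights here are illustrative only.
import torch

def total_loss(nav_loss, aux_losses, weights):
    # aux_losses / weights: dicts keyed by auxiliary-task name
    loss = nav_loss
    for name, aux in aux_losses.items():
        loss = loss + weights.get(name, 0.1) * aux
    return loss

loss = total_loss(
    nav_loss=torch.tensor(1.2),
    aux_losses={"progress": torch.tensor(0.4), "matching": torch.tensor(0.7)},
    weights={"progress": 0.5, "matching": 1.0},
)
```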

2.2. Embodied Exploration

In embodied exploration tasks, agents acquire the ability to navigate unfamiliar environments to gather information essential for future task deployments in those environments [76]. When an agent finds itself in uncharted territory, active exploration and foundational knowledge acquisition about the surroundings are crucial for efficiently undertaking new tasks. Taking a home robot as an example, if it is expected to execute a command such as "bring some fruit from the kitchen to the owner," it must explore the rooms in advance to learn the layout, the paths it can take, and where the relevant items are located.

Navigation is central to embodied visual exploration. Unlike task-oriented navigation, embodied visual exploration lacks a definitive goal; instead it aims to maximize the information collected about the environment and to form map structure models that accurately capture semantic and geometric information for downstream tasks. Historically, SLAM has been used for map construction: agents use it for localization and path planning, though it remains a well-established yet passive exploration method that requires the environment to be traversed in advance [77]. Additionally, SLAM relies heavily on sensors and is therefore prone to noise interference. Currently, learning-based techniques dominate embodied visual exploration. Compared with traditional SLAM, learning-based approaches construct maps from visual information that is less susceptible to noise and explore the environment actively. They employ unsupervised methods, such as RL, to learn within new environments, minimizing manual effort and enhancing efficiency. However, learning-based methods often face the challenge of sparse rewards: in RL, rewards are granted only upon reaching specific states, and in complex new environments an agent's state cannot be predetermined, resulting in sparse or absent rewards [78]. Hence, exploration strategies are a focal research area in embodied visual exploration. Four distinct exploration mechanisms from the literature, curiosity, novelty, coverage, and reconstruction, can establish reward systems that aid agents in environmental exploration. Agents guided by curiosity are attracted to unexpected states, earning rewards when actual states diverge from predicted ones; the larger the discrepancy, the greater the reward, with the focus on intrinsic rather than external rewards [56, 79, 80]. The novelty mechanism seeks previously unencountered states, rewarding agents in proportion to how infrequently the current state has been visited [58, 81, 82]. Coverage mechanisms incentivize agents to observe their environments comprehensively, rewarding them according to the increase in observed objects of interest [59]. Finally, in the reconstruction mechanism, agents attempt to reconstruct scenes from their current view and sensor information; the closer the reconstructed views are to the originals, the higher the reward [61, 83].
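The curiosity mechanism described above (cf. [56]) can be illustrated with a hedged sketch: a learned forward model predicts the next state embedding, and its prediction error is paid out as intrinsic reward. The network sizes and embedding dimensions are assumptions for illustration.

```python
# Sketch of a curiosity-style intrinsic reward: a forward model predicts the next
# state embedding from the current embedding and action; prediction error becomes
# the reward, so surprising states attract the agent. Sizes are illustrative.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, state_dim=128, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, state_dim)
        )

    def forward(self, state_emb, action_onehot):
        return self.net(torch.cat([state_emb, action_onehot], dim=1))

def curiosity_reward(model, s_t, a_t, s_next):
    pred = model(s_t, a_t)
    return ((pred - s_next) ** 2).mean(dim=1)        # larger error -> larger reward

model = ForwardModel()
r = curiosity_reward(model, torch.rand(1, 128), torch.eye(4)[[0]], torch.rand(1, 128))
```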

2.3. Embodied Question Answering

Malinowski et al. [84] pioneered the visual question answering (VQA) task, in which the model is given an image and a natural language question and must answer by interpreting both the language and the visual information. They subsequently proposed the Neural Image Question Answering model [85], which treats VQA as a generation problem: image and question features are extracted, concatenated, and fed into an LSTM network to output the answer, a classic VQA model based on multimodal fusion. The key to such methods is how to integrate vision and text features, and many researchers have devoted work to it. The most common strategies for fusing visual and textual features include affine transformation [86], vector outer products [87], and matrix decomposition [88]. Research has shown that traditional VQA methods incorporate elements irrelevant to answering during feature fusion. Therefore, to raise answer accuracy, researchers have optimized VQA models with attention mechanisms that emphasize the relevant, beneficial parts of the visual and textual information [89–91]. Although attention mechanisms improve accuracy, the opaque, non-interpretable black-box process of answering questions remains a mystery, and scholars have begun to question whether VQA models exploit data bias rather than making true deductions. Establishing an effective and interpretable universal model has therefore always been a prominent research topic. Inspired by the way humans break a question into several steps when answering it, the reasoning process of VQA can be serialized. The neural module network (NMN) [92] constructs multiple submodules required for answering questions, such as "Find," "Transform," "Combine," "Describe," and "Measure"; a combination of submodules is then generated automatically according to the structure of the question for collaborative learning. In 2016, Stanford and Facebook jointly released the CLEVR dataset [93] and proposed a new modular visual reasoning method for it [94]. Because the scenes in this dataset are simple, attention can be focused on the reasoning itself, and research on this synthetic dataset has grown rapidly. Yi et al. [95] proposed the neural-symbolic visual question answering (NS-VQA) system, in which a scene parser renders an image into a structured scene representation. NS-VQA is a typical model composed of a scene parser (de-renderer), a question parser (program generator), and a program executor, laying the foundation for subsequent visual reasoning models. The central idea of this type of method is task decomposition, which applies not only to question answering but also to robot navigation, laying the foundation for subsequent embodied question answering.
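The fusion-based family of VQA models discussed above can be illustrated with a minimal sketch: precomputed image features and an encoded question are fused and classified over a fixed answer vocabulary. This shows the fusion idea only and is not the architecture of [85]; all dimensions are assumptions.

```python
# Minimal sketch of fusion-based VQA: image features + encoded question -> answer
# logits over a fixed vocabulary. Illustrative only; not a specific published model.
import torch
import torch.nn as nn

class FusionVQA(nn.Module):
    def __init__(self, img_dim=2048, q_vocab=10000, q_dim=512, n_answers=1000):
        super().__init__()
        self.q_embed = nn.EmbeddingBag(q_vocab, q_dim)      # bag-of-words question encoder
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + q_dim, 1024), nn.ReLU(), nn.Linear(1024, n_answers)
        )

    def forward(self, img_feat, question_tokens):
        q_feat = self.q_embed(question_tokens)               # (B, q_dim)
        return self.classifier(torch.cat([img_feat, q_feat], dim=1))

model = FusionVQA()
logits = model(torch.rand(2, 2048), torch.randint(0, 10000, (2, 8)))
answer = logits.argmax(dim=1)                                # index into answer vocabulary
```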

The long-term aspiration of embodied question answering is to create an agent that can perceive its environment, communicate in natural language, and take action to assist humans. Specifically, the task places the agent in an environment that it experiences from a first-person point of view; equipped with a set of atomic actions, the agent must explore the environment to gain the visual information needed to answer relevant questions. Answering questions in an embodied form requires a wide range of AI capabilities, including visual recognition, language understanding, object-oriented navigation, task planning, and commonsense reasoning. Consequently, embodied question answering is considered one of the more difficult and complex challenges in current embodied AI research.

The method employed initially used a hierarchical model comprising four modules: vision, language, navigation, and answering. Adaptive computation time (ACT), serving as the navigator of the navigation module, was integrated with encoded visual and linguistic information to choose the next action, predict its duration, and answer the question on the basis of images collected over multiple frames [64]. Reference [96] proposes a hierarchical embodied question answering method based on neural module control, which divides the problem into multiple subtasks, each completed by a specialized system through navigation and observation. Reference [65] proposed the interactive question answering (IQA) task, which demands a greater degree of interaction between the agent and its environment. In the EQA task, the agent mainly collects information through navigation and observation to answer questions, whereas in IQA the agent needs to act on the environment and change its state, requiring more complex planning and reasoning. In addition, there are multitarget and multiagent embodied question answering tasks. In multitarget tasks, the question involves not one target but multiple targets distributed at different locations in the environment, adding complexity to reasoning and navigation [66]. In multiagent embodied question answering, multiple agents work in unison to address the question, avoiding overburdening a single agent with complex multitarget tasks; deciding how to distribute tasks among agents so as to avoid duplicated work is vital. Reference [67] introduces a scalable optimization-based planner that assigns different viewpoints to the agents, each executing its task from its own viewpoint.

As we mentioned earlier, these embodied tasks are not independent of each other. As shown in Figure 1, embodied exploration can serve as preparatory work for downstream tasks, providing environmental prior knowledge, and both embodied navigation and exploration can serve the comprehensive task of embodied question answering.

Figure 1. Three embodied tasks and their relation. To answer the question, the agent must first interpret natural language instructions to determine the target's location, which could be provided by prior environmental information, and then make an inference based on the visual information that has been gathered.

We summarize representative work on the three embodied tasks in Table 1, which allows readers to see clearly the types of methods, frameworks, architectures, input data (sensors), benchmarks, and other related information; Table 2 provides a supplementary key to the abbreviations and specialized symbols used therein. Additionally, training intelligent agents capable of performing embodied tasks requires a suitable (embodied) environment, while training real robots in the real world is expensive, time-consuming, and limited in environmental diversity. Most works therefore train in a virtual environment first and then deploy to the real environment. We summarize commonly used simulation environments in Table 3, where users can directly control the state of objects in the environment through APIs and interact with objects through an agent or VR devices.

Table 2. Instruction of abbreviations and special symbols in Table 1.
Abbreviation Full name
SR Success rate
OSR Oracle success rate
TL Trajectory length
NE Navigation error
SPL Success rate weighted by path length
CLS Coverage weighted by length score
nDTW Normalized dynamic time warping
SDTW Success weighted by normalized dynamic time warping
ADG Average distance to goal
ATL Average trajectory length
AG Average score
LSR Localization success rate
RE Reconstruction error
QA Question answering accuracy
NA Navigation accuracy
SL Spawn location
PS Planning success
PE Planning efficiency
Table 3. Common simulation environments for embodied tasks.
Name Published time Scene type Interactive mode Basic task types
DeepMind lab [97] 2016 3D API, agent Composite activities
VirtualHome [98] 2018 3D indoor API, agent Household activities
ALFRED [99] 2019 3D indoor Agent Household activities
BabyAI [100] 2019 2D Agent Instruction following
VRKitchen [101] 2019 3D indoor Agent, VR Cooking activities
Habitat [102] 2019 3D indoor Agent Composite activities
CHALET [103] 2019 3D indoor Agent Household activities
SAPIEN [104] 2020 3D indoor API, agent Robotic interaction
ACTIONET [105] 2020 3D indoor Agent Household activities
ThreeDWorld [106] 2021 3D indoor, 3D outdoor API, agent, VR Composite activities
iGibson [107] 2021 3D indoor API, agent, VR Household activities
RFUniverse [108] 2022 3D indoor Agent, VR Household activities
AI2-Thor [42] 2022 3D indoor API, VR Household activities
RCareWorld [109] 2022 3D indoor Agent, VR Caregiving activities
MINEDOJO [110] 2022 3D indoor, 3D outdoor Agent Composite activities

3. LLM and MLM

At present, embodied intelligence exhibits an integrated form of intelligence that benefits from the improvement and fusion of single-modal models, especially the combination of vision and language models. This allows a model to handle multimodal inputs, process visual information according to text prompts, and operate in complicated environments, greatly promoting its embodied ability. The initial fusion of vision and language models was primarily aimed at specific tasks, such as classic visual question answering and image captioning. Models with universal capabilities, such as LLMs and MLMs, endow physical entities with strong generalization abilities, enabling autonomous robot systems to shift from program-execution orientation to task-goal orientation, so that they can adapt to more intricate embodied activities and take solid steps toward universal robots [47, 48]. In this section, we review current work on LLMs and MLMs.

3.1. LLM

One of the challenges of visual language tasks lies in comprehending natural language and overcoming the poor generalization caused by language diversity. The Transformer [111] is currently the dominant NLP architecture. What sets the Transformer apart from earlier RNN and LSTM models is its reliance on attention mechanisms instead of recurrence and convolution to capture semantic dependencies in sentences, achieving remarkable performance across NLP tasks. Its computational efficiency and scalability allow model parameters to be expanded beyond 100B. As the foundation of early LLMs, the Transformer has had a significant impact on later models; most current LLMs are based on it and are essentially Transformer variants, for instance BERT [112] and GPT-3 [113]. BERT is a pretrained language representation model based on the Transformer structure, trained on a large corpus with two unsupervised tasks, masked language modeling and next-sentence prediction, and can be fine-tuned for downstream tasks. GPT, in contrast, is an autoregressive model; GPT-1 was originally built on a generative, decoder-only Transformer architecture using a hybrid of unsupervised pretraining and supervised fine-tuning. Subsequent research on scaling revealed that enlarging the model markedly improves performance; when the parameter count of GPT-3 reached 175B, the model showed striking emergent abilities [114]. Aside from GPT, there are other large-parameter language models such as PaLM [115] and LLaMA [116]. Although they differ in training process, parameter size, and structural details (activation functions, embedding layers, etc.), most are still based on the Transformer structure. We summarize them in Table 4.
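For reference, the core operation that the Transformer substitutes for recurrence and convolution is scaled dot-product attention; the following is a standard, self-contained sketch of it (shapes and the toy input are illustrative).

```python
# Scaled dot-product attention, the core Transformer operation: queries attend over
# keys, and the resulting weights form a weighted sum of the values.
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # pairwise similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # block disallowed positions
    weights = torch.softmax(scores, dim=-1)                     # attention distribution
    return weights @ v                                          # weighted sum of values

q = k = v = torch.rand(1, 5, 64)
out = scaled_dot_product_attention(q, k, v)                     # shape (1, 5, 64)
```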

Table 4. Vision/language large-scale models.
Type Model Parameter size Affiliation
LLM GPT-3 [113] 175B OpenAI
PaLM [115] 540B Google
LLaMA [116] 13B Meta
LaMDA [117] 137B Google
GLM [118] 130B Tsinghua
BLOOM [119] 176B BigScience
  
LVM ViT [120] 632M Google
CLIP [121] 695M OpenAI
  
MLM KOSMOS-1 [122] 1.6B Microsoft
Flamingo [123] 10.2B DeepMind
BLIP-2 [124] 188M Salesforce research
ML-MFSL [125] ≈1.5B UvA
MiniGPT-4 [126] 13B KAUST
Video-LLaMA [127] ≈13B Alibaba
LLaMA-Adapter [128] 1.2M Shanghai AI lab
FROMAGe [129] 5.5M CMU

3.2. MLM

LLMs have in-context learning ability and, trained with prompt learning and RL from human feedback to learn human preferences, can cope with various natural language tasks; they also exhibit a degree of ability in code generation and mathematical problem solving. Although LLMs show surprising zero-/few-shot inference performance on most NLP tasks, they are restricted to discrete text. At the same time, large-scale visual models have developed rapidly in visual perception. ViT [120] is a large visual model (LVM) inspired by the Transformer that likewise eliminates convolution and computes attention between image regions: it divides the image into patches and feeds the sequence of their linear embeddings to a Transformer, achieving excellent results in image classification and extending to many other tasks. CLIP [121] is an LVM that incorporates natural language; during training it uses image–text pairs instead of image labels so the model learns the correlation between image and text features, leveraging contrastive learning to hone general feature representations by increasing the similarity between images and their corresponding text and decreasing the similarity between images and unrelated text. CLIP's supervision, acquired via natural language, demonstrates how cross-modal learning enhances generalization, and its capacity to relate images with text makes it especially suitable for downstream visual language tasks.

Visual and textual information are often complementary, so joining LLMs and LVMs can further enhance the capability of vision–language models in tackling challenging multimodal tasks. Two points are key in combining an LLM with an LVM: (1) the zero-shot/few-shot capability of the model and (2) the cross-modal alignment capability. Most current MLMs utilize pretrained single-modal models, such as pure visual models and pure language models, which provide strong zero-shot/few-shot capabilities for generalizing to new tasks. However, single-modal visual/language models are trained within their respective modalities, and how to align cross-modal information is the most critical problem. Presently, there are two main approaches: direct combination, as in KOSMOS-1 [122], where a Transformer is used directly to merge textual and visual features, or an additional cross-modal alignment network, as in Flamingo [123], which deploys the Perceiver Resampler and Gated XATTN-dense modules to promote cross-modal information interaction. The Perceiver Resampler generates a fixed number of visual outputs, and cross-modal alignment is performed through the cross-attention layer in the Gated XATTN-dense module. BLIP-2 [124] bridges the modality gap through a lightweight Querying Transformer (Q-Former) that directly connects an image encoder and an LLM. Through two stages of training, the Q-Former learns visual representation extraction and vision-to-language generation of the corresponding text. The Q-Former is composed of two modules that share self-attention layers and likewise utilizes cross-attention layers to achieve cross-modal interaction between vision and text. ML-MFSL [125] defines a new multimodal few-shot meta learning method to bridge the gap between vision and language modalities. It consists of three components: a vision encoder, a meta mapper, and a language model. The meta mapper, which consists of multiple attention modules, maps the visual encoding into the latent space of the language model. During training, a set of visual prefixes is prepended to the visual features extracted by the vision encoder, fed into the meta mapper, and the output is concatenated with text features before being sent to the LLM. Overall, the key to cross-modal information alignment lies in finding an appropriate mechanism to deliver visual concepts to language models by accumulating shared knowledge from multimodal tasks. From the perspective of AGI, MLMs are more consistent with the way humans comprehend the world and can handle a greater variety of tasks, exerting a stronger push toward embodied intelligence. We summarize some widely used MLMs in Table 4.
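The contrastive image–text alignment objective described for CLIP can be sketched as a symmetric cross-entropy over a similarity matrix; the temperature and embedding dimensions below are illustrative assumptions, not CLIP's actual training configuration.

```python
# Sketch of a CLIP-style contrastive alignment objective: matched image-text pairs
# are pushed toward high similarity, mismatched pairs toward low similarity, via a
# symmetric cross-entropy over the similarity matrix.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))             # i-th image matches i-th text
    loss_i = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return (loss_i + loss_t) / 2

loss = contrastive_alignment_loss(torch.rand(8, 512), torch.rand(8, 512))
```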

3.3. Chain of Thought (CoT)

Prompt engineering and fine-tuning have stimulated the emergent capabilities of LLMs, but these are still insufficient for outstanding performance on challenging tasks. Chain of thought (CoT) prompting [130] enhances language model reasoning and has been validated as effective on intricate reasoning tasks, improving model performance on a range of mathematical, commonsense, and symbolic reasoning problems. The main idea of CoT is to encourage LLMs not only to generate the final result but also to present the thinking process that leads to the answer, much like human mental processes. CoT can be constructed in two modes: (1) a filling-based mode and (2) a prediction-based mode. Specifically, the filling-based mode infers the steps between the context (previous and subsequent steps) to fill logical blanks, while the prediction-based mode extends the reasoning chain from the given conditions (instructions and previous reasoning history) [131]. Both modes pursue the same goal, ensuring that the produced steps are accurate and consistent. Moreover, several investigations have proposed broadening single-modal CoT to multimodal CoT (M-CoT), making use of visual data to enhance reasoning [132].
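A minimal illustration of the prediction-based mode is a few-shot prompt whose exemplar includes its intermediate reasoning, nudging the model to emit reasoning steps before the final answer. The `llm.complete` call is a hypothetical placeholder, not a specific vendor API.

```python
# Minimal illustration of prediction-based chain-of-thought prompting: the exemplar
# shows its reasoning, so the model continues in the same style for the new question.
COT_EXEMPLAR = (
    "Q: The robot holds 2 cups and picks up 3 more. How many cups does it hold?\n"
    "A: It starts with 2 cups. Picking up 3 more gives 2 + 3 = 5. The answer is 5.\n"
)

def cot_answer(llm, question: str) -> str:
    prompt = COT_EXEMPLAR + f"Q: {question}\nA:"
    return llm.complete(prompt, stop=["\nQ:"])   # model emits reasoning, then the answer
```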

4. LLM/MLM-Based Robot Embodied Intelligence System Schemes

Our ultimate vision for universal AI envisions humanoid robots capable of executing multiple tasks with both dexterity and efficiency. Previously, embodied tasks usually encompassed navigation, exploration, and question answering, but rarely involved comprehensive activities with intricate action sequences. LLMs have demonstrated their potential for comprehensive task execution, possessing extensive world knowledge and strong instruction-following abilities [133]. For instance, ChatGPT, as illustrated in reference [134], can tackle diverse robot-related tasks in a zero-shot manner, adapting to various formal elements and enabling closed-loop reasoning through dialog. SayCan [135] integrates robots' low-level skills with LLMs, where robots function as the "hands and eyes" of the LLM, which furnishes high-level semantic knowledge for the task. SayCan defines a set of skills articulated in natural language, uses prompts and the LLM to propose plausible planning steps, and rates the allowable actions with a learned value function. PROGPROMPT [136], another robot task planning framework, incorporates programming language structures, predefined operational functions, and environmental objects, formulating robot plans as Python programs so that the LLM directly yields executable plans as code. PROGPROMPT represents task planning as a <O, P, A, T, I, G, t> tuple, where t is the task description, O is the collection of all objects in the environment, P denotes the relevant attributes of the objects in O, A is the set of executable operations, T is the transition model used to update the current environmental state, and I and G are the initial and goal states, respectively; a state s assigns values to the attributes P of the objects in O for the current environment. Prompted with t, s is continuously updated from the previous state via A until it transitions from I to G. Both PROGPROMPT and ChatGPT are equipped with predefined function libraries and rely on function calls to accomplish tasks. In these schemes, LLMs primarily function as the control hubs for embodied robots, generating task plans. Furthermore, LLMs can provide prior knowledge to assist the robot in completing specific tasks, exemplified by the Housekeep task [137] and TidyBot [138]: an LLM plans home-tidying tasks by moving every misplaced object to its "appropriate position" and uses its summarization capability to derive general preferences for specific individuals, providing personalized robot service. Visual information is crucial for most tasks in real-world environments, yet LLMs primarily process textual data, necessitating their integration with visual information in embodied tasks.
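The SayCan-style combination described above, scoring each candidate skill by the product of the LLM's usefulness estimate and a learned affordance value, can be sketched as follows; `llm_score` and `value_fn` are hypothetical placeholders standing in for the language model likelihood and the learned value function, not the released SayCan interface.

```python
# Sketch of SayCan-style skill selection: every candidate skill gets an LLM
# "usefulness" score for the instruction and an affordance value (probability the
# skill can succeed in the current state); their product ranks the skills.

def select_skill(instruction, state, skills, llm_score, value_fn):
    best_skill, best_score = None, float("-inf")
    for skill in skills:                               # skills are natural-language phrases
        usefulness = llm_score(instruction, skill)     # LLM rating of the skill text
        affordance = value_fn(state, skill)            # learned value: can it succeed here?
        score = usefulness * affordance                # combined score
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```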

In the current embodied intelligence paradigm, LLMs can leverage visual information in two ways. The first invokes diverse visual foundation models (VFMs) through the LLM, utilizing the VFMs' expertise in visual processing for environmental comprehension and interaction, as exemplified by Visual ChatGPT [139], MM-REACT [140], and HuggingGPT [141]. This method requires clear visual-operation prompts: if the prompt describes the task too abstractly, it is difficult to determine which VFMs should be called, and invoking multiple models can introduce cumulative errors. The second directly merges a visual model with an LLM to form an MLM for embodied tasks. PaLM-E [142] is the first large-scale embodied multimodal model that seamlessly integrates continuous multimodal sensor inputs into an LLM, enhancing the reliability of real-world agent decision-making. The PaLM-E system primarily comprises two components, a pretrained language model (PaLM) and a pretrained visual model (ViT/OSRT), and accepts multimodal inputs encompassing text and images. PaLM-E encodes continuous multimodal sensor inputs into vector sequences with the same dimension as the language embeddings, thereby integrating them into the language model. This enables the model to output results for classical image-language tasks or to use them to control the robot's underlying behavioral plan sequence, addressing diverse embodied reasoning challenges.

Similar to this work, EmbodiedGPT [143] is another end-to-end MLM. As depicted in Figure 2, EmbodiedGPT comprises four integrated modules: (1) a ViT-G/14 visual model for encoding observed visual features; (2) an LLaMA language model for executing question answering, description, and specific planning tasks; (3) an Embodied-former, serving as a bridge between the visual and language domains, which feeds visual features into a language mapping layer to obtain matching modalities or extracts high-level task-related features to generate low-level control plans for the downstream network; and (4) a policy network that generates low-level actions based on the task-related features extracted by the Embodied-former, enabling agents to interact effectively with the environment. EmbodiedGPT utilizes CoT to produce embodied planning and trains on egocentric data; the task planning it generates has strong executability and granularity at the object-part level. Inner Monologue [144] underscores the significance of feedback between agents and the environment in embodied tasks, aiding real-time task planning updates. In Inner Monologue, feedback primarily encompasses Success Detection, Passive Scene Description, and Active Scene Description: Success Detection provides task completion information, Passive Scene Description offers structured semantic scene information at each planning step, and Active Scene Description provides unstructured semantic information only upon querying by the LLM planner. The detailed information of these models is outlined in Table 5, while Table 6 provides a supplementary key to the abbreviations and special symbols used therein.

Figure 2. Overall framework of EmbodiedGPT: the Embodied-former bridges the visual and language domains. It extracts visual features via attention-based interaction, then translates them through a language mapping layer to the LLaMA model for tasks such as visual captioning, visual QA, and embodied planning. The resulting plan is leveraged to retrieve highly pertinent features from the general visual tokens encoded by the visual model, facilitated by the Embodied-former, and these features are used to produce low-level control commands for task execution via the downstream policy network.
Table 5. Embodied robot systems based on LLM/MLM.
Scheme Visual model LLM Task Benchmark Metrics
SayCan [135] PaLM TP, MP Mock kitchen, CRWE PSR, ESR
PROGPROMPT [136] ViLD GPT-3 TP, TPE Virtual Home [98], CRWE SR, GCR, Exec
TidyBot [138] ViLD, CLIP GPT-3 PP CRWE OPA
PaLM-E [142] ViT, OSRT PaLM TP, VQA, VC TAMP, language-Table, CRWE SR
EmbodiedGPT [143] ViT-G/14 LLaMA TP, VQA, VC Meta-World [145], Virtual Home, Franka Kitchen [146], CRWE SR
Inner Monologue [144] MDETR InstructGPT TP, TPE Ravens [147], CRWE SR
Note: • denotes a publicly disclosed benchmark.
Table 6. Instruction of abbreviations and special symbols in Table 5.
Abbreviation Full name
TP Task planning
MP Motion planning
TPE Task plan executing
VQA Visual question answering
VC Visual captioning
PP Pick and place
OPA The object placement accuracy
CRWE Custom real-world environment
PSR The plan success rate
ESR The execution success rate
SR The success rate
GCR The goal conditions recall
Exec The executability
• Public disclosure benchmark

Embodied robot systems based on MLMs share several common features: (1) multimodal input: intelligent agents must perceive and understand the environment during real-world tasks, as relying on text alone cannot yield feasible task planning in reality; combining multimodal information such as images, videos, and feedback states enhances planning reliability. (2) Multilevel planning: deploying a complex task on a robot produces a long control sequence that is difficult to generate accurately with an LLM alone. Therefore, to execute complex tasks in an environment over an extended period, the agent first generates abstract high-level task plans and then sequentially produces fine-grained low-level plans composed of low-level skills, feeding the state observed during execution back to the LLM for real-time adjustment of the high-level plan, forming a control loop. (3) Pretrained models: both the visual and language components of the MLM utilize pretrained large models, adapting to various tasks through transfer learning and enabling few-shot/zero-shot training [148]. Notably, using a frozen pretrained model maintains the MLM's performance across multiple tasks without changing weights during fine-tuning, ensuring the underlying model's versatility remains intact [149].

From the above analysis, a comprehensive embodied intelligent system typically encompasses multimodal perception, task planning, and command compliance, summarized as the perception–planning–action (PPA) paradigm shown in Figure 3. The perception part acquires information about the agent's environment, which is crucial for task planning and execution. It gathers as much information as possible, and the emphasis varies with the task: in grasping tasks, detailed physical characteristics of objects are essential, which favors building a world model [150]; in exploratory tasks, the focus shifts to relationships among scene entities, where constructing a scene graph is advantageous [151]. The planning part uses the MLM/LLM to generate executable task plans in natural-language form from the overall task and environmental information. This planning stage often lacks fine-grained execution details and therefore requires further planning by the action part, which parses each plan entry into a sequence of actions the agent can execute to interact with the environment. The primary goal of the action stage is instruction compliance, invoking the robot's low-level skills, ranging from simple operations such as forward and backward movement, steering, and obstacle avoidance to the advanced operations mentioned earlier, such as environmental exploration and VLN. In short, low-level skills largely determine the robot's overall physical performance.

Much research therefore focuses on expanding low-level skills, including robot action conversion and video learning. Robot action conversion employs a vision-language model to convert natural-language instructions into robot action sequences based on prompts and scenes, with RT-1/2 being representative work. RT-1 [152], built on the Transformer architecture, takes images and task descriptions as inputs and directly outputs tokenized actions (a sketch of this action-tokenization idea follows below); PaLM-E uses RT-1 to provide its underlying policy. Similarly, VIMA [153] encodes input sequences of text and visual prompt tokens with a pretrained language model and autoregressively decodes robot control actions for each environment-interaction step; it differs from RT-1 in operating on pre-segmented image objects produced by a detector prior to embedding. LATTE [154] targets more detailed and flexible robot control, enabling robots to complete actions in various ways, ensuring safety, and modifying motion trajectories based on human intentions and the constraints of the current dynamic spatial state. Video learning acquires behaviors by watching unlabeled online videos and composes them into more complex tasks [155]; the main challenge is the lack of video labels for model training. VPT [156], for example, trains an inverse dynamics model on a small amount of labeled data to produce pseudo-labels when the video dataset lacks labels, then performs large-scale behavioral cloning or RL and fine-tunes the model to downstream tasks to acquire various skills. Moreover, the planning and action stages involve substantial embodied reasoning combined with environmental information, and adding a chain of thought at these stages is the current mainstream choice.
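For illustration, the sketch below captures the general idea behind tokenized actions in the spirit of RT-1: each continuous action dimension is discretized into uniform bins so that a transformer can emit actions as token IDs. The bin count and action bounds here are assumptions for the example, not the values used by RT-1.

```python
import numpy as np

# Hedged sketch of action tokenization: discretize each continuous action
# dimension into uniform bins. NUM_BINS and the action bounds are assumptions.

NUM_BINS = 256

def tokenize_action(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to one integer token per dimension."""
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)                 # scale to [0, 1]
    return np.round(normalized * (num_bins - 1)).astype(int)   # 0 .. num_bins-1

def detokenize_action(tokens, low, high, num_bins=NUM_BINS):
    """Recover an approximate continuous action from its tokens."""
    normalized = tokens / (num_bins - 1)
    return low + normalized * (high - low)

# Example: a 7-DoF command (6 end-effector deltas + gripper); bounds are assumed.
low = np.array([-0.05] * 6 + [0.0])
high = np.array([0.05] * 6 + [1.0])
action = np.array([0.01, -0.02, 0.0, 0.0, 0.0, 0.03, 1.0])
tokens = tokenize_action(action, low, high)
print(tokens, detokenize_action(tokens, low, high))
```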

Figure 3. Perception–planning–action paradigm.

5. Conclusion and Future Work

Although AGI remains unrealized, research in the field has progressed into the era of embodied intelligence. In this paper, we provide a comprehensive summary of the task content, model construction, and PPA paradigm associated with embodied intelligence systems. An embodied model aims to accurately perceive the environment, identify relevant objects, analyze their spatial relationships, formulate detailed task plans, and thereby simulate human perception of and interaction with the surroundings. Current embodied intelligence schemes employed in robots benefit from advances in earlier perceptual intelligence, integrating and extending those achievements onto embodied agents. Using an LLM or MLM as the backbone, these systems call upon packaged "skills." Consequently, the overall embodied performance of current systems is enhanced primarily by combining low-level skills, yet it also remains limited by them: the LLM or MLM mainly serves as a "remote control," while the low-level skills are the true drivers. Embodied schemes that integrate diverse VFMs, such as Visual ChatGPT, MM-REACT, and HuggingGPT, have a modular system structure that demands high execution efficiency from each component, and some generated results are unsatisfactory owing to VFM failures and prompt instability. Embodied schemes based on a single MLM have a more concise system structure but face challenges in data efficiency, reasoning ability, and other aspects. Below, we discuss the current status and future development of embodied intelligent systems, organized around the PPA paradigm, along the following directions.

5.1. Task Complexity

A direct indicator of an embodied intelligent system's ability is the complexity of the tasks it can perform. Based on their outcomes, we categorize current embodied tasks into two types. The first type, embodied action, requires the agent to execute a series of actions culminating in a completion signal. The second, embodied question answering, similarly requires a series of actions but also demands further analysis, ultimately returning an answer. Various embodied schemes often construct embodied settings from virtual environments and agents, exemplified by the task "Bring me the rice chips from the drawer." Here, the mobile agent needs to (1) go to the drawers, (2) open the top drawer, and (3) retrieve the rice chip bag and place it on the counter. Such tasks are relatively straightforward, focusing on navigation and retrieval without extensive environmental interaction or analysis. Taking SayCan as an example, PaLM-SayCan performed worst on the most challenging long-horizon tasks, where most failures resulted from early termination by the LLM. Such systems therefore exhibit limited tolerance for unexpected situations, such as items being hidden or environmental settings contradicting the task description. Hence, during training, it is essential to increase task complexity by incorporating these unexpected factors, enabling embodied systems to overcome obstacles and execute tasks smoothly. Completing long-horizon tasks requires long-horizon reasoning over the required ordering of steps, an abstract understanding of the instruction, and knowledge of both the environment and the robot's capabilities. Understanding complex environments is paramount, yet current schemes hinge on LLM/MLM-based task decomposition mechanisms that harness common sense for basic task planning but fail to grasp specific scenarios. An exemplary embodied intelligent system must function seamlessly across diverse, unforeseen environments, highlighting the necessity of interpreting and executing natural language instructions to improve knowledge transfer and generalization within intricate settings.

5.2. Model Reasoning Ability

LLMs contain substantial real-world knowledge driven by large amounts of data, but lack robust reasoning ability. CoT can enhance an LLM's reasoning by adding step-by-step reasoning text to prompts or training data, improving answer accuracy. However, crafting such prompts is not straightforward, which motivates zero-shot strategies (an illustrative zero-shot prompt is sketched below). Model reasoning methods fall into probability-based and symbol-based approaches. An LLM's reasoning stems from the probability-based methodology, where decisions are formulated from the inherent correlations within data. Compared with symbol-based methods, this approach offers strong generalization, but it does not endow the model with a genuine understanding of the causal links among knowledge, behavior, and environment, thereby yielding outcomes biased by the data. Consequently, such models struggle to operate interpretably, robustly, and reliably in real-world contexts. Reasoning types encompass common-sense, relational, logical, and counterfactual reasoning, each requiring distinct solutions. Neither probability-based nor symbol-based methods cover all reasoning types, suggesting a hybrid approach as a paradigm worth exploring.
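As an illustration of the zero-shot strategy mentioned above, a chain-of-thought style planning prompt might be assembled as follows; the prompt wording and skill names are hypothetical and not drawn from any cited system.

```python
# Illustrative zero-shot chain-of-thought prompt for embodied task planning.
# The skill names and wording are hypothetical, not taken from any cited system.

SKILLS = ["go_to(location)", "open(object)", "pick(object)", "place(object, location)"]

def build_cot_prompt(instruction, scene_description):
    return (
        f"You are a robot with the following skills: {', '.join(SKILLS)}.\n"
        f"Scene: {scene_description}\n"
        f"Task: {instruction}\n"
        "Let's think step by step, then output a numbered plan "
        "that uses only the listed skills."
    )

print(build_cot_prompt(
    "Bring me the rice chips from the drawer.",
    "A kitchen with a counter and a closed top drawer.",
))
```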

5.3. Migration From Simulation to Reality

In comparison with the extensive language and vision-language datasets available, robot data remain notably sparse, and acquiring such data is both time-consuming and resource-intensive. To establish a versatile embodied model that spans various scenarios and tasks in robotics, the construction of large-scale datasets and the use of high-fidelity simulated environment data as a supplement to real-world data are imperative. Nonetheless, the paramount issue with relying solely on simulated data is the simulation-to-reality gap. Simulation-to-reality adaptation entails transferring competencies or behaviors acquired in a simulated environment (cyberspace) to real-world scenarios (the physical world). This process involves validating and refining the algorithms, models, and control strategies developed in simulation to guarantee resilient and dependable performance in physical environments. For instance, reference [135] reports that in a simulated kitchen, PaLM-SayCan attained a planning success rate of 84% and an execution rate of 74%, whereas in a real kitchen the planning performance dropped by 3% and the execution performance by 14%. The question then arises: how can models trained in simulated environments be generalized to the real world without a significant drop in execution performance? Crafting world models that closely mirror real-world environments and procuring high-quality data are pivotal to enhancing generalization capabilities.

5.4. Unified Evaluation Criteria

Although numerous benchmarks exist for assessing low-level control strategies, they frequently exhibit substantial disparities in the skills they evaluate. Moreover, the objects and scenes incorporated into these benchmarks are often constrained by simulator limitations. To evaluate embodied models thoroughly, it is imperative to employ realistic simulators in benchmarks that cover a wide array of skills. Regarding high-level task planners, many benchmarks concentrate on assessing planning capability through question answering tasks. A more effective approach would be to evaluate the high-level task planner and the low-level control strategy jointly on long-horizon tasks and measure success rates, rather than relying on isolated evaluations of the planners. This holistic approach offers a more comprehensive assessment of the capabilities of embodied intelligence systems.

5.5. Underlying Skill Expansion

In the realm of multimodal LLMs, current embodied intelligence adopts the PPA paradigm, integrating MLMs with chains of thought for task planning and reasoning. Essentially, it constitutes an integrate-and-invoke methodology. The primary limitation of such systems lies in the scope and capabilities of the underlying skills; future efforts to broaden the skill repertoire and enhance its robustness would alleviate this constraint.

PPA, as the cornerstone paradigm of embodied intelligence, is not the definitive framework for general AI. Yet, it marks a promising beginning. Until we unravel the mysteries of the brain, exploring embodied intelligence from the ground up offers valuable insights for cognitive science.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was supported by the Heilongjiang Province Key Research and Development Program (Grant GA21A302).

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.
