We present AEG-bot, a mobile humanoid robot system for end-to-end indoor scene rearrangement. Unlike simulation-based approaches that rely on strong assumptions such as full observability or simplified manipulation, AEG-bot is designed to function autonomously in complex, previously unseen real-world environments.
Household rearrangement requires a robot to detect misplaced objects and reason about their appropriate placements, a process that combines common-sense knowledge with alignment to user preferences. To enable such reasoning, we propose LLM-enhanced scene graph learning, which transforms an initial scene graph into an Affordance Enhanced Graph (AEG) with enriched object nodes and newly discovered contextual relations. In this representation, receptacle nodes are augmented with context-induced affordances that capture which kinds of objects can appropriately be placed on or inside them, supporting effective task planning for rearrangement.
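To make the representation concrete, the following is a minimal sketch of how an Affordance Enhanced Graph could be encoded. It is an illustration only, not the system's actual data structures; every class, field, and method name here (ObjectNode, ReceptacleNode, affordances, candidate_receptacles, and so on) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ObjectNode:
    # An object observed in the scene, e.g. an "apple" on the coffee table.
    node_id: str
    category: str
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class ReceptacleNode(ObjectNode):
    # Context-induced affordances: object categories the LLM judges this
    # receptacle suitable to hold, given its surrounding scene context.
    affordances: List[str] = field(default_factory=list)

@dataclass
class Relation:
    # A contextual relation between two nodes, e.g. ("apple_1", "on", "table_2").
    subject: str
    predicate: str
    target: str

@dataclass
class AffordanceEnhancedGraph:
    nodes: Dict[str, ObjectNode]
    relations: List[Relation]

    def candidate_receptacles(self, category: str) -> List[ReceptacleNode]:
        """Receptacles whose affordances admit the given object category."""
        return [n for n in self.nodes.values()
                if isinstance(n, ReceptacleNode) and category in n.affordances]
```

A placement query for a misplaced object then reduces to a lookup such as `aeg.candidate_receptacles("apple")`, which the planner can rank or pass to the foundation model.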
Building on this capability, AEG-bot unifies active exploration, RGB-D scene reconstruction, scene graph extraction, and autonomous task planning & execution into a single pipeline. With this design, the robot can perceive novel environments, identify misplaced objects, infer suitable placements, and carry out the rearrangement end-to-end—moving a step closer to practical and intelligent household robotics.
Upon entering an unknown environment, the robot actively explores the scene, performs RGB-D reconstruction, and extracts a scene graph (A). This graph is then transformed into an Affordance Enhanced Graph (AEG) in which object affordances and contextual relations are enriched (B). Leveraging the AEG together with selected RGB keyframes, a multi-modal foundation model is prompted to plan and execute the rearrangement task. With this framework, AEG-bot detects misplaced objects, infers their proper placements (C), and autonomously completes the rearrangement, achieving robust and intelligent scene organization (D).
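The sketch below outlines how stages (A) through (D) could be wired together in code. It is a schematic under assumed interfaces: the `robot`, `perception`, and `foundation_model` objects and all of their methods are hypothetical placeholders standing in for the components described above.

```python
def rearrange_scene(robot, perception, foundation_model):
    """Schematic end-to-end rearrangement loop following stages (A)-(D)."""
    # (A) Active exploration with RGB-D reconstruction and scene graph extraction.
    rgbd_frames = robot.explore_and_capture()
    scene_graph, keyframes = perception.build_scene_graph(rgbd_frames)

    # (B) LLM-enhanced scene graph learning: enrich nodes and relations into an AEG.
    aeg = perception.enhance_with_affordances(scene_graph, foundation_model)

    # (C) Prompt the multi-modal model with the AEG and selected RGB keyframes
    #     to detect misplaced objects and infer their proper placements.
    plan = foundation_model.plan_rearrangement(aeg, keyframes)

    # (D) Ground each planned step into navigation, grasping, and placement.
    for step in plan:
        robot.navigate_to(step.pick_pose)
        robot.grasp(step.object_id)
        robot.navigate_to(step.place_pose)
        robot.place(step.object_id, step.receptacle_id)
```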
Beyond autonomous rearrangement, AEG-bot can also perform instructed task planning. By following natural-language instructions, the robot leverages its perception and reasoning capabilities to execute diverse household tasks. Specifically, it employs a retrieval-augmented generation (RAG) strategy to extract task-relevant nodes from the Affordance Enhanced Graph (AEG) and feeds them, together with scene context, into a multi-modal foundation model. The model then generates a sequence of high-level task steps, which are grounded into action primitives that the robot can execute to accomplish the instructed task.
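As a rough illustration of the final grounding step, the sketch below maps model-generated high-level steps onto a small set of action primitives. The step format, the primitive set, and the robot methods are assumptions made for the example, not the system's actual interface.

```python
# Hypothetical primitive vocabulary; the real system's action set may differ.
PRIMITIVES = {"navigate", "pick", "place", "open", "close"}

def ground_steps(steps, robot):
    """Ground high-level steps such as {'action': 'pick', 'object': 'apple_1'}
    onto robot primitives, skipping anything the robot cannot execute."""
    for step in steps:
        action = step.get("action")
        if action not in PRIMITIVES:
            continue  # ignore unsupported or malformed steps
        if action == "navigate":
            robot.navigate_to(step["target"])
        elif action == "pick":
            robot.grasp(step["object"])
        elif action == "place":
            robot.place(step["object"], step["receptacle"])
        elif action == "open":
            robot.open_receptacle(step["target"])
        elif action == "close":
            robot.close_receptacle(step["target"])
```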
Sort fruits into the black basket and bottles into the cartons
Help me get a milk and the delivery box on the table
Serve fruits to the plates on the dining table
Put everything into the basket
Play the piano
Serve me a bottle of water
We start from an initial scene graph constructed from an RGB-D sequence and perform context-induced affordance analysis in several steps, enriching receptacle nodes with the object categories they can appropriately hold.
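One such step might look like the sketch below, which prompts an LLM once per receptacle node, conditioned on its spatial context, and attaches the returned categories as affordances. The prompt template, the `query_llm` callable, and the scene-graph accessors are illustrative assumptions.

```python
import json

AFFORDANCE_PROMPT = """You are analyzing a household scene graph.
Receptacle: {receptacle} (located {context}).
Nearby objects: {neighbors}.
List the object categories that would appropriately be placed on or in this
receptacle, as a JSON array of strings."""

def analyze_affordances(scene_graph, query_llm):
    """Attach context-induced affordances to every receptacle node."""
    for node in scene_graph.receptacle_nodes():
        prompt = AFFORDANCE_PROMPT.format(
            receptacle=node.category,
            context=scene_graph.describe_location(node),
            neighbors=", ".join(scene_graph.neighbor_categories(node)),
        )
        try:
            node.affordances = json.loads(query_llm(prompt))
        except json.JSONDecodeError:
            node.affordances = []  # keep the node usable if parsing fails
    return scene_graph
```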
We adopt a multi-query retrieval-augmented approach for instructed task planning, enabling the robot to generate reliable placement decisions.
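A minimal sketch of such multi-query retrieval over the AEG is given below: the instruction is expanded into several sub-queries, each sub-query is embedded, and the top-scoring nodes per query are pooled. The `embed` and `query_llm` callables and the node serialization are assumptions for illustration.

```python
import numpy as np

def retrieve_task_relevant_nodes(instruction, aeg_nodes, embed, query_llm, k=5):
    """Collect the union of top-k AEG nodes per sub-query by cosine similarity."""
    if not aeg_nodes:
        return []

    # Expand the instruction into a handful of short retrieval queries.
    sub_queries = query_llm(
        "Rewrite this household instruction as 3 short search queries, "
        f"one per line: {instruction}"
    ).splitlines()

    # Embed a textual rendering of every node once.
    node_texts = [f"{n.category}: {n.attributes}" for n in aeg_nodes]
    node_vecs = np.stack([embed(t) for t in node_texts])
    node_vecs /= np.linalg.norm(node_vecs, axis=1, keepdims=True)

    selected = {}
    for q in [instruction] + sub_queries:
        q_vec = embed(q)
        q_vec /= np.linalg.norm(q_vec)
        scores = node_vecs @ q_vec
        for idx in np.argsort(-scores)[:k]:
            selected[int(idx)] = aeg_nodes[int(idx)]
    return list(selected.values())
```

The retrieved nodes, together with scene context, are what gets fed to the multi-modal foundation model for step generation.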
AEG-bot performs active exploration of unknown environments, leveraging RGB-D perception to reconstruct the scene. The navigation module plans efficient paths and ensures safe movement through cluttered household spaces, enabling the robot to autonomously reach target objects and regions of interest.
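The text above does not commit to a particular exploration strategy; as one common option, a frontier-based goal selector on a 2-D occupancy grid could look like the sketch below (0 = free, 1 = occupied, -1 = unknown). All names are illustrative.

```python
import numpy as np

def next_frontier_goal(occupancy, robot_cell):
    """Pick the nearest free cell bordering unknown space as the next goal."""
    free = occupancy == 0
    unknown = occupancy == -1
    # A free cell is a frontier if any 4-neighbour is unknown.
    frontier = np.zeros_like(free)
    frontier[:-1, :] |= free[:-1, :] & unknown[1:, :]
    frontier[1:, :]  |= free[1:, :]  & unknown[:-1, :]
    frontier[:, :-1] |= free[:, :-1] & unknown[:, 1:]
    frontier[:, 1:]  |= free[:, 1:]  & unknown[:, :-1]
    cells = np.argwhere(frontier)
    if len(cells) == 0:
        return None  # no frontiers left: exploration is complete
    dists = np.linalg.norm(cells - np.asarray(robot_cell), axis=1)
    return tuple(cells[int(np.argmin(dists))])
```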
The grasping and placement module enables AEG-bot to interact physically with household objects. By integrating affordance-aware grasp detection and context-guided placement strategies, the robot can reliably pick up diverse objects and place them into appropriate receptacles, completing rearrangement tasks with precision and robustness.
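The sketch below shows how affordance-aware grasp selection and context-guided placement could be combined, reusing the hypothetical AEG interface from the earlier sketch. The grasp detector, its candidate format, and the robot methods are assumptions, not the system's real APIs.

```python
def pick_and_place(robot, grasp_detector, aeg, object_node):
    """Grasp an object with an affordance-compatible grasp, then place it into
    a receptacle whose context-induced affordances admit its category."""
    # Affordance-aware grasping: keep only candidates compatible with the
    # object category, then take the highest-scoring one.
    candidates = grasp_detector.detect(object_node)
    feasible = [g for g in candidates
                if object_node.category in g.supported_categories]
    if not feasible:
        return False
    grasp = max(feasible, key=lambda g: g.score)

    # Context-guided placement: query the AEG for suitable receptacles.
    receptacles = aeg.candidate_receptacles(object_node.category)
    if not receptacles:
        return False
    target = receptacles[0]

    robot.grasp_with(grasp)
    robot.place(object_node.node_id, target.node_id)
    return True
```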