2511.22609 / Robotics

MG-Nav: Dual-Scale Visual Navigation via Sparse Memory Graphs

Bo Wang, Xiaojuan Qi

Core Insight

Store sparse keyframe snapshots (not dense 3D reconstructions) as a memory graph, localize via image-to-instance hybrid retrieval, and run global planning at low frequency while local control operates at high frequency. This mimics how humans actually navigate: "I remember that corner" not "I have a point cloud of every surface."

Why Previous Approaches Failed

Visual navigation approaches fail in complementary ways:

1. Foundation Policies Fail Out-of-View

Pre-trained navigation models (like ViNT and NoMaD) work great when the goal is visible. But once you turn a corner and lose sight of the target, they have no memory of where to go. They react; they don't plan.

2. RL Methods Don't Generalize

Policies trained in specific environments fail in new buildings. The policy memorizes "turn left at the blue door" instead of learning "navigate toward goal-like features." Zero-shot transfer to unseen environments is poor.

3. Dense 3D Reconstruction is Expensive

SLAM-based approaches build complete 3D maps of the environment. But for navigation, you don't need every surface—you just need enough to know where you've been and where you can go. Dense reconstruction is:

  • Computationally expensive to build and maintain
  • Storage-intensive (hundreds of MB for a building)
  • Fragile to dynamic changes (furniture moves → map invalid)

4. Single-Scale Control is Fragile

Running global planning at high frequency (every step) wastes compute on decisions that don't change fast. Running local obstacle avoidance at low frequency misses dynamic obstacles. You need different temporal scales for different decisions.

The Method

MG-Nav operates at two temporal scales with a sparse memory representation:

Sparse Spatial Memory Graph (SMG)

Instead of dense reconstruction, maintain a graph where:

  • Nodes: Keyframe images from distinct viewpoints (not every frame)
  • Edges: Spatial connectivity (which locations are reachable from which)
  • Features: Multi-view aggregated embeddings + detected object semantics

The graph captures navigational structure without storing every pixel. A building that would need 500MB as a point cloud fits in ~50MB as an SMG.
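To make the structure concrete, here is a minimal sketch of an SMG as a data structure, assuming a fixed-size embedding per keyframe, a set of detected object labels, and BFS for path planning; the paper's exact node features and edge criteria may differ:

```python
from collections import deque
from dataclasses import dataclass, field

import numpy as np


@dataclass
class SMGNode:
    """One keyframe snapshot: an aggregated embedding plus object labels."""
    node_id: int
    embedding: np.ndarray                            # multi-view aggregated feature
    objects: set[str] = field(default_factory=set)   # detected object classes


class SparseMemoryGraph:
    """Nodes are keyframes; edges encode which locations are mutually reachable."""

    def __init__(self):
        self.nodes: dict[int, SMGNode] = {}
        self.edges: dict[int, set[int]] = {}   # adjacency list

    def add_node(self, node: SMGNode) -> None:
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())

    def add_edge(self, a: int, b: int) -> None:
        self.edges[a].add(b)
        self.edges[b].add(a)

    def shortest_path(self, start: int, goal: int) -> list[int] | None:
        """BFS over the topological graph -- cheap next to metric planning."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            if cur == goal:
                path = []
                while cur is not None:
                    path.append(cur)
                    cur = parent[cur]
                return path[::-1]
            for nxt in self.edges[cur]:
                if nxt not in parent:
                    parent[nxt] = cur
                    queue.append(nxt)
        return None  # goal not reachable from start
```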

Global Level (Low Frequency)

Every N steps (e.g., every 10 steps):

  1. Localize: Match current observation to SMG nodes via hybrid retrieval
  2. Plan: Find path through graph from current node to goal-similar nodes
  3. Waypoint: Set next intermediate target (a node ~10-20 steps away)

The hybrid retrieval combines:

Hybrid Retrieval Score
$$s(n, g) = \alpha \cdot \text{sim}_{scene}(n, g) + (1-\alpha) \cdot \text{sim}_{object}(n, g)$$

Scene similarity alone fails when appearance changes (lighting, time of day). Object semantics alone fails in cluttered scenes with many similar objects. The combination is robust.
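Building on the graph sketch above, the scoring and localization step might look like the following; cosine similarity for the scene term and Jaccard overlap of object labels for the object term are assumptions, not the paper's confirmed choices:

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def object_sim(objs_a: set[str], objs_b: set[str]) -> float:
    """Jaccard overlap of detected object labels (illustrative choice)."""
    union = objs_a | objs_b
    return len(objs_a & objs_b) / len(union) if union else 0.0


def hybrid_score(node: SMGNode, query_emb: np.ndarray,
                 query_objs: set[str], alpha: float = 0.5) -> float:
    """s(n, g) = alpha * sim_scene + (1 - alpha) * sim_object."""
    return (alpha * cosine_sim(node.embedding, query_emb)
            + (1 - alpha) * object_sim(node.objects, query_objs))


def localize(graph: SparseMemoryGraph, query_emb: np.ndarray,
             query_objs: set[str], alpha: float = 0.5) -> int:
    """Return the node whose keyframe best matches the current observation."""
    return max(graph.nodes, key=lambda nid: hybrid_score(
        graph.nodes[nid], query_emb, query_objs, alpha))
```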

Local Level (High Frequency)

Every step:

  • Run pre-trained point-goal policy toward current waypoint
  • Reactive obstacle avoidance for collision prevention
  • When near final target, switch to image-goal mode for precise approach
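Putting the two timescales together, a schematic control loop might look like the sketch below; `env`, `policies`, `N`, and the switch radius are placeholder assumptions standing in for the pre-trained components and tuned hyperparameters:

```python
def navigate(graph, goal_emb, goal_objs, env, policies,
             N: int = 10, switch_dist: float = 1.0) -> None:
    """Dual-frequency loop: replan over the SMG every N steps,
    act with fast local policies every step."""
    goal_node = localize(graph, goal_emb, goal_objs)  # goal-similar node
    waypoint = goal_node
    for step in range(env.max_steps):
        obs = env.observe()

        # Global level (low frequency): localize, plan, pick a waypoint.
        if step % N == 0:
            cur_node = localize(graph, obs.embedding, obs.objects)
            path = graph.shortest_path(cur_node, goal_node) or [goal_node]
            # Waypoint: a node a few hops ahead (~10-20 low-level steps).
            waypoint = path[min(len(path) - 1, 2)]

        # Local level (every step): reactive control toward the waypoint.
        if env.distance_to_goal() < switch_dist:
            action = policies.image_goal(obs, goal_emb)   # precise approach
        else:
            action = policies.point_goal(obs, waypoint)   # coarse progress
        action = policies.avoid_obstacles(obs, action)    # safety filter

        env.step(action)
        if env.reached_goal():
            break
```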

VGGT-Adapter

Lightweight geometric module that projects observation and goal features into 3D-aware space:

VGGT Feature Alignment
$$f_{aligned} = \text{VGGT}(f_{obs}, f_{goal}) \in \mathbb{R}^{3D}$$

This improves robustness when the goal is viewed from a very different angle than the stored keyframe.
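The adapter's internals aren't spelled out above. One plausible reading, sketched here purely as an assumption, is a small cross-attention module that fuses observation and goal patch features into a shared geometry-aware embedding:

```python
import torch
import torch.nn as nn


class VGGTAdapter(nn.Module):
    """Illustrative lightweight adapter (architecture assumed, not the
    paper's confirmed design): goal features attend into observation
    features, producing a viewpoint-normalized fused representation."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, f_obs: torch.Tensor, f_goal: torch.Tensor) -> torch.Tensor:
        # f_obs, f_goal: (batch, tokens, dim) patch features.
        fused, _ = self.attn(query=f_obs, key=f_goal, value=f_goal)
        return self.proj(fused + f_obs)  # residual keeps observation grounding
```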

Architecture

[Interactive diagram: dual-scale navigation architecture. Global level (every N steps): localize → plan path → waypoint, using hybrid retrieval s(n,g) = α·sim_scene + (1−α)·sim_object. Local level (every step): point-goal policy + obstacle avoidance, switching to image-goal mode near the target; pre-trained policy, no RL needed. SMG panel: nodes = keyframes, edges = reachability. Memory comparison: dense SLAM stores every surface (~500MB, fragile to change) vs. sparse SMG keyframes (~50MB, robust to rearrangement, 10× cheaper). Results summary: HM3D 78.5% SR (+6-8% vs. dense SLAM), MP3D 83.8% SR (zero-shot transfer), dynamic scenes with 10 moved objects 73.5% → 68.6% (minimal drop), VGGT +4% viewpoint tolerance.]


Key Equations

Hybrid Retrieval Score
$$s(n, g) = \alpha \cdot \text{sim}_{scene}(n, g) + (1-\alpha) \cdot \text{sim}_{object}(n, g)$$

Combines scene-level visual similarity with object-semantic matching. Scene similarity captures "this looks like the goal area." Object similarity captures "I see the same red chair in both images." α balances appearance vs semantic cues.

VGGT Feature Alignment
$$f_{aligned} = \text{VGGT}(f_{obs}, f_{goal}) \in \mathbb{R}^{3D}$$

Projects observation and goal features into shared 3D-aware representation. This helps when the goal image was taken from a very different viewpoint than current observation—the 3D projection normalizes for camera pose differences.

Results

| Benchmark | MG-Nav | Dense SLAM | Foundation Policy |
|---|---|---|---|
| HM3D (Success Rate) | 78.5% | 72.3% | 45.2% |
| MP3D (Success Rate) | 83.8% | 77.1% | 51.4% |
| Dynamic Scenes (Success Rate) | 68.6% | 52.3% | 38.1% |
| Memory Usage | ~50MB | ~500MB | ~0 |

Key insight: Sparse memory + hybrid retrieval achieves 6-8% better success rate than dense reconstruction while using 10× less storage. When objects move, sparse keyframes are more robust than dense maps that become stale.

What Actually Matters

Hybrid retrieval is essential. Scene similarity alone fails when appearance changes (lighting, time). Object semantics alone fails in cluttered scenes with many similar objects. The combination handles both.

Dual-frequency control is necessary. Running global planning every step wastes compute on decisions that don't change. Running local control at low frequency misses obstacles. The decoupled approach is 3× more efficient.

VGGT provides consistent 4% gain. Geometric feature alignment helps when goal is viewed from very different angles than stored keyframe. Without it, viewpoint changes break retrieval.

Sparse beats dense in dynamic scenes. When objects move, dense maps become stale and mislead the agent. Sparse keyframes only encode structure, so moved objects don't corrupt the whole map.

Assumptions & Limitations

Requires exploration phase. The SMG must be built before navigation. Can't navigate in completely novel environments without first exploring to build the graph.

Keyframe selection matters. Too few keyframes = gaps in coverage where the agent gets lost. Too many = storage overhead and slower retrieval. Automatic keyframe selection is tuned per environment type.
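A common heuristic for this (an assumption on my part, not necessarily the paper's rule) is to add a keyframe only when the current view is sufficiently dissimilar from every stored one, reusing `cosine_sim` and `SMGNode` from the sketches above:

```python
def maybe_add_keyframe(graph, obs, next_id: int,
                       sim_threshold: float = 0.85):
    """Add a keyframe node only if no existing node is too similar.
    A higher threshold yields denser coverage; a lower one, a sparser graph."""
    for node in graph.nodes.values():
        if cosine_sim(node.embedding, obs.embedding) > sim_threshold:
            return None  # this view is already covered by a keyframe
    node = SMGNode(node_id=next_id, embedding=obs.embedding,
                   objects=set(obs.objects))
    graph.add_node(node)
    return node
```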

Object detection quality. Semantic matching depends on detecting objects correctly. If the detector misses objects or hallucinates them, localization fails.

Long-range planning still hard. In very large environments (whole buildings), graph search becomes expensive. May need hierarchical memory (room-level + building-level) for scale.

Bottom Line

Store snapshots, not surfaces. A sparse graph of keyframes + fast local policies achieves state-of-the-art visual navigation at 10× lower memory cost than dense reconstruction. The key insight: navigation needs topological structure, not geometric completeness.