MG-Nav: Dual-Scale Visual Navigation via Sparse Memory Graphs
Bo Wang, Xiaojuan Qi
Core Insight
Store sparse keyframe snapshots (not dense 3D reconstructions) as a memory graph, localize via image-to-instance hybrid retrieval, and run global planning at low frequency while local control operates at high frequency. This mimics how humans actually navigate: "I remember that corner" not "I have a point cloud of every surface."
Why Previous Approaches Failed
Visual navigation approaches fail in complementary ways:
1. Foundation Policies Fail Out-of-View
Pre-trained navigation models (like ViNT, NoMaD) work great when the goal is visible. But once you turn a corner and lose sight of the target, they have no memory of where to go. They're reactive, not planning.
2. RL Methods Don't Generalize
Policies trained in specific environments fail in new buildings. The policy memorizes "turn left at the blue door" instead of learning "navigate toward goal-like features." Zero-shot transfer to unseen environments is poor.
3. Dense 3D Reconstruction is Expensive
SLAM-based approaches build complete 3D maps of the environment. But for navigation, you don't need every surface—you just need enough to know where you've been and where you can go. Dense reconstruction is:
- Computationally expensive to build and maintain
- Storage-intensive (hundreds of MB for a building)
- Fragile to dynamic changes (furniture moves → map invalid)
4. Single-Scale Control is Fragile
Running global planning at high frequency (every step) wastes compute on decisions that don't change fast. Running local obstacle avoidance at low frequency misses dynamic obstacles. You need different temporal scales for different decisions.
The Method
MG-Nav operates at two temporal scales with a sparse memory representation:
Sparse Spatial Memory Graph (SMG)
Instead of dense reconstruction, maintain a graph where:
- Nodes: Keyframe images from distinct viewpoints (not every frame)
- Edges: Spatial connectivity (which locations are reachable from which)
- Features: Multi-view aggregated embeddings + detected object semantics
The graph captures navigational structure without storing every pixel. A building that would need 500MB as a point cloud fits in ~50MB as an SMG.
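To make the representation concrete, here is a minimal Python sketch of what an SMG might look like, assuming `networkx` for the graph and NumPy arrays for embeddings; the class and field names are illustrative, not from the paper.

```python
from dataclasses import dataclass, field

import networkx as nx
import numpy as np


@dataclass
class KeyframeNode:
    """One sparse memory entry: a keyframe snapshot, not a reconstruction."""
    node_id: int
    embedding: np.ndarray  # multi-view aggregated image embedding
    objects: dict[str, np.ndarray] = field(default_factory=dict)  # label -> object feature


class SparseMemoryGraph:
    """Nodes are keyframes; edges mark which locations are reachable from which."""

    def __init__(self) -> None:
        self.graph = nx.Graph()

    def add_keyframe(self, node: KeyframeNode) -> None:
        self.graph.add_node(node.node_id, data=node)

    def connect(self, a: int, b: int, cost: float = 1.0) -> None:
        self.graph.add_edge(a, b, weight=cost)

    def shortest_path(self, start: int, goal: int) -> list[int]:
        return nx.shortest_path(self.graph, start, goal, weight="weight")
```

Under this representation, planning reduces to shortest-path search over a few hundred keyframes instead of geometry processing over millions of points, which is where the ~10× storage gap comes from.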
Global Level (Low Frequency)
Every N steps (e.g., every 10 steps):
- Localize: Match current observation to SMG nodes via hybrid retrieval
- Plan: Find path through graph from current node to goal-similar nodes
- Waypoint: Set next intermediate target (a node ~10-20 steps away)
The hybrid retrieval combines scene-level appearance similarity with object-level semantic matching. Scene similarity alone fails when appearance changes (lighting, time of day); object semantics alone fails in cluttered scenes with many similar objects. The combination is robust to both.
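A sketch of how that combination could be scored, reusing the `KeyframeNode` sketch above; `alpha` plays the role of the α weight described under Key Equations, and the object-matching rule here (average feature similarity over shared labels) is an illustrative choice, not the paper's exact formulation.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def hybrid_score(obs_emb, obs_objects, node, alpha: float = 0.5) -> float:
    """alpha * scene similarity + (1 - alpha) * object-semantic similarity."""
    scene = cosine(obs_emb, node.embedding)
    shared = set(obs_objects) & set(node.objects)
    obj = (
        sum(cosine(obs_objects[k], node.objects[k]) for k in shared) / len(shared)
        if shared else 0.0
    )
    return alpha * scene + (1 - alpha) * obj


def localize(obs_emb, obs_objects, smg, alpha: float = 0.5):
    """Return the SMG node whose keyframe best matches the current observation."""
    nodes = (smg.graph.nodes[n]["data"] for n in smg.graph.nodes)
    return max(nodes, key=lambda n: hybrid_score(obs_emb, obs_objects, n, alpha))
```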
Local Level (High Frequency)
Every step:
- Run pre-trained point-goal policy toward current waypoint
- Reactive obstacle avoidance for collision prevention
- When near final target, switch to image-goal mode for precise approach
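Putting the two scales together, a hedged sketch of the dual-frequency loop; `agent.observe`, `point_goal_policy`, `avoid_obstacles`, `near_goal`, and `switch_to_image_goal` are hypothetical placeholders for the pre-trained policy and safety layer described above.

```python
REPLAN_EVERY = 10  # global planning period N from the text (low frequency)


def navigate(agent, smg, goal_node_id: int, max_steps: int = 500) -> None:
    """Dual-scale loop: replan over the SMG every N steps; act every step."""
    waypoint = None
    for t in range(max_steps):
        obs = agent.observe()
        if t % REPLAN_EVERY == 0:  # global level, low frequency
            here = localize(obs.embedding, obs.objects, smg)
            path = smg.shortest_path(here.node_id, goal_node_id)
            waypoint = path[1] if len(path) > 1 else path[0]  # next intermediate target
        if agent.near_goal(obs):
            agent.switch_to_image_goal(obs)  # precise final approach
            break
        action = agent.point_goal_policy(obs, waypoint)  # local level, every step
        action = agent.avoid_obstacles(obs, action)      # reactive collision prevention
        agent.step(action)
```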
VGGT-Adapter
A lightweight geometric module that projects observation and goal features into a shared 3D-aware space. This improves robustness when the goal is viewed from a very different angle than the stored keyframe.
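The paper's exact adapter architecture isn't reproduced here; as a rough illustration only, a small PyTorch module that projects both feature vectors into a shared normalized space where they can be compared, standing in for the 3D-aware projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometricAdapter(nn.Module):
    """Illustrative stand-in for the VGGT-Adapter: maps image features into a
    shared, viewpoint-normalized space where observation and goal compare."""

    def __init__(self, feat_dim: int = 512, out_dim: int = 256) -> None:
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(feats), dim=-1)


adapter = GeometricAdapter()
obs_feat, goal_feat = torch.randn(1, 512), torch.randn(1, 512)
# Cosine similarity in the shared space; large viewpoint changes between the
# goal image and current observation should hurt this less than raw features.
similarity = (adapter(obs_feat) * adapter(goal_feat)).sum(dim=-1)
```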
Key Equations
Hybrid retrieval score:

$$
S(o, n) = \alpha \, S_{\text{scene}}(o, n) + (1 - \alpha) \, S_{\text{obj}}(o, n)
$$

This combines scene-level visual similarity ($S_{\text{scene}}$) with object-semantic matching ($S_{\text{obj}}$). Scene similarity captures "this looks like the goal area"; object similarity captures "I see the same red chair in both images." α balances appearance vs. semantic cues.
3D-aware projection: schematically, the adapter $\Phi$ maps features into a common frame,

$$
\hat{f}_{o} = \Phi(f_{o}), \qquad \hat{f}_{g} = \Phi(f_{g})
$$

projecting observation and goal features into a shared 3D-aware representation. This helps when the goal image was taken from a very different viewpoint than the current observation: the projection normalizes for camera-pose differences.
Results
| Metric | MG-Nav | Dense SLAM | Foundation Policy |
|---|---|---|---|
| HM3D success rate | 78.5% | 72.3% | 45.2% |
| MP3D success rate | 83.8% | 77.1% | 51.4% |
| Dynamic-scene success rate | 68.6% | 52.3% | 38.1% |
| Memory usage | ~50 MB | ~500 MB | ~0 |
Key insight: sparse memory + hybrid retrieval achieves a 6-7 point higher success rate than dense reconstruction on HM3D and MP3D (and a 16-point margin in dynamic scenes) while using 10× less storage. When objects move, sparse keyframes stay valid where dense maps go stale.
What Actually Matters
Hybrid retrieval is essential. Scene similarity alone fails when appearance changes (lighting, time). Object semantics alone fails in cluttered scenes with many similar objects. The combination handles both.
Dual-frequency control is necessary. Running global planning every step wastes compute on decisions that don't change. Running local control at low frequency misses obstacles. The decoupled approach is 3× more efficient.
VGGT provides a consistent 4% gain. Geometric feature alignment helps when the goal is viewed from a very different angle than the stored keyframe. Without it, large viewpoint changes break retrieval.
Sparse beats dense in dynamic scenes. When objects move, dense maps become stale and mislead the agent. Sparse keyframes only encode structure, so moved objects don't corrupt the whole map.
Assumptions & Limitations
Requires exploration phase. The SMG must be built before navigation. Can't navigate in completely novel environments without first exploring to build the graph.
Keyframe selection matters. Too few keyframes = gaps in coverage where the agent gets lost. Too many = storage overhead and slower retrieval. Automatic keyframe selection is tuned per environment type.
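As an illustration of that tradeoff, a simple novelty-threshold heuristic, reusing `cosine` and `KeyframeNode` from the sketches above; the threshold value is invented for the example and stands in for the per-environment tuning mentioned here.

```python
SIM_THRESHOLD = 0.85  # invented value; stands in for per-environment tuning


def maybe_add_keyframe(smg, frame_embedding, connect_to: int | None = None):
    """Add a keyframe only if it is novel: when the best scene similarity
    to every stored node falls below the threshold."""
    nodes = [smg.graph.nodes[n]["data"] for n in smg.graph.nodes]
    if nodes and max(cosine(frame_embedding, n.embedding) for n in nodes) >= SIM_THRESHOLD:
        return None  # too similar to an existing keyframe: skip, keep the graph sparse
    node = KeyframeNode(node_id=len(nodes), embedding=frame_embedding)
    smg.add_keyframe(node)
    if connect_to is not None:
        smg.connect(connect_to, node.node_id)  # reachable from the previous keyframe
    return node
```

Raising the threshold adds more keyframes (better coverage, more storage); lowering it keeps the graph sparser at the risk of coverage gaps.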
Object detection quality. Semantic matching depends on detecting objects correctly. If the detector misses objects or hallucinates them, localization fails.
Long-range planning still hard. In very large environments (whole buildings), graph search becomes expensive. May need hierarchical memory (room-level + building-level) for scale.
Bottom Line
Store snapshots, not surfaces. A sparse graph of keyframes + fast local policies achieves state-of-the-art visual navigation at 10× lower memory cost than dense reconstruction. The key insight: navigation needs topological structure, not geometric completeness.