Cliff Detection from RGB Images using Depth Anything 3

Vincent Li · Ganesh Arivoli · Shruti Agarwal · Siddharth Barjatya

CS766 Computer Vision · University of Wisconsin–Madison · Spring 2026

Abstract

We present a cliff detection pipeline for autonomous robots using Depth Anything V3 (DA3), a monocular depth estimation model. Given a single RGB camera, the system estimates per-pixel depth, lifts it into a 3D point cloud, fits a ground plane via RANSAC, and flags regions that drop significantly below that plane as cliffs. We evaluate the pipeline on both real-world video and synthetic data generated using VisionSIM, which provides per-pixel ground-truth depth. On a warehouse loading-dock scenario, LS-aligned DA3-Large achieves a median AbsRel of 0.090 and δ<1.25 of 94.96%. The main bottleneck for real-world deployment is inference speed: DA3-Base runs at roughly 0.1 fps on an Apple M1, far below the 10+ fps needed for interactive robot control.

The Problem

Autonomous robots in warehouses rely on LiDAR and cameras for navigation. A persistent failure mode is cliff detection — the inability to identify negative obstacles like stair edges, platform drops, or open bay doors. Standard 2D LiDAR scans a fixed horizontal plane, so a downward drop registers no return ping and is completely invisible to the system.

Our goal is to detect cliff edges from a single RGB camera — hardware already present on most robots, with no additional cost or hardware changes.

Hypothesis: Monocular depth from DA3 produces sufficiently accurate depth discontinuities at cliff edges to enable reliable detection from RGB alone, without additional sensors.

Cliff detection must work independently of the robot's localization — robots are frequently de-localized in real deployments.
A robot falling off a cliff can severely injure people below and cause significant hardware damage.
The algorithm must have a very low false negative rate — missing a cliff is far more costly than a false alarm.
3D LiDAR can address this but adds hardware cost and mounting complexity. An RGB-only approach is retrofit-friendly.

Depth Anything 3

DA3 is a monocular and multi-view depth model from ByteDance Seed. It predicts spatially consistent 3D geometry from any number of RGB views, with or without known camera poses. Our pipeline uses its per-pixel depth output, lifted into a 3D point cloud for cliff detection.

Single plain transformer backbone (DINOv2) with no task-specific heads
Unified depth + ray prediction, removing the need for multi-task learning
Input-adaptive cross-view self-attention for consistent geometry across frames
Trained on public academic datasets via a teacher-student paradigm
Surpasses prior SOTA VGGT by +35.7% in camera pose accuracy and +23.6% in geometric accuracy

Inference time on Apple M1 per frame: Small ~3s / Base ~3s / Large ~6s / Giant ~17s.

Cliff Detection Pipeline

Frame extraction — Video sampled at a configurable rate, trading off accuracy and speed.
Depth estimation — Each frame passed through DA3 to produce a per-pixel depth map.
3D point cloud lift — Depth map back-projected into 3D using camera intrinsics (fx, fy).
Ground plane estimation — Bottom 50% of the point cloud used to fit a ground plane via RANSAC, preventing misidentification of background objects as the floor.
Cliff flagging — Pixels significantly below the estimated ground plane flagged and highlighted in red on the output video.

Tunable parameters: camera intrinsics, RANSAC residual threshold, cliff height threshold, ground region fraction (default 50%).

Simulation for Data Generation and Evaluation

We used VisionSIM to generate synthetic evaluation data with per-pixel ground-truth depth — something real RGB-D sensors cannot reliably provide at depth discontinuities.

Ground truth: Real sensors fail exactly at the depth discontinuities we care about most.
Safety and cost: Driving a real robot toward a cliff is dangerous and expensive to repeat.
Controlled variation: Geometry, textures, lighting, and cliff distance can all be varied systematically.
Reproducibility: A scene script and seed regenerates the exact same frames for identical evaluations.

Pipeline: Blender scene → VisionSIM render (RGB + ground-truth depth + camera transforms) → DA3 inference → scale+shift alignment → metrics. We also implemented weighted sliding-window alignment to handle DA3 scale drift along trajectories.

Top-down view of the scene

Generated scene

Generated camera image and ground truth depth data

Demo Videos

Real-World Cliff Detection

Simulation: Input / Ground Truth / DA3 Aligned

Simulation: Cliff Detection / DA3 Depth

Qualitative Results

Failure Case — False Positive

The algorithm can produce false positives when a fence or distant background creates a depth discontinuity that RANSAC mistakes for the ground level.

False positive: fence misidentified as cliff

False positive: fence in background confuses ground plane estimation

Quantitative Results — Warehouse Scene

Warehouse interior with a loading-dock cliff (3m × 1.5m dock, 0.75m drop). Camera: 640×480, 70° HFOV, −25° pitch, head-on approach at 0.2 m/s. Model: DA3-Large.

Metric	LS Median	LS Mean	RANSAC Median	RANSAC Mean
AbsRel ↓	0.090	0.119	0.099	0.254
RMSE (m) ↓	0.476	0.509	0.279	0.389
δ < 1.25 ↑	94.96	83.82	91.45	84.79
δ < 1.25² ↑	99.67	94.80	99.64	90.37
δ < 1.25³ ↑	99.78	97.26	99.75	92.40

Bold = better per metric. AbsRel = mean |pred−gt|/gt. RMSE in meters. δ < 1.25^k = % pixels where max(p/g, g/p) < 1.25^k.

Best frame (LS): #156 AbsRel 0.009. Worst frame: #120 AbsRel 0.467. Per-frame scale ranged 0.49–2.44, showing DA3 scale drifts along the trajectory. The mean/median gap indicates a few outlier frames dominate the average, motivating sliding-window alignment.

Discussion

Efficiency

DA3 is currently too slow for real-time robotics on consumer hardware. At 2 fps on an Apple M1, DA3-Base takes ~10 minutes for a 1-minute video; DA3-Large takes ~40 minutes. The cliff detection step itself takes only ~7 seconds, so inference is the bottleneck. Real-time deployment would require GPU acceleration, a lighter model variant, or quantization.

Near-Edge Failure Case

When the camera is very close to the cliff, the surface below occupies more than 50% of the bottom frame region. RANSAC incorrectly identifies it as the ground plane and misses the cliff. In practice this is less critical — the cliff should have been detected earlier in the approach, and odometry or localization can help the robot remember and avoid the location.

Future Work

Integrate a physics engine like Project Chrono to add a robot control policy into the evaluation loop, testing whether the robot stops in time.
Benchmark DA3 model size vs. inference speed to identify the minimum viable variant for interactive-rate detection.
Explore quantization and resolution reduction to bring inference closer to real-time on embedded hardware.