Cliff Detection from RGB Images using Depth Anything 3

Vincent Li  ·  Ganesh Arivoli  ·  Shruti Agarwal  ·  Siddharth Barjatya

CS766 Computer Vision  ·  University of Wisconsin–Madison  ·  Spring 2026

Abstract

We present a cliff detection pipeline for autonomous robots using Depth Anything V3 (DA3), a monocular depth estimation model. Given a single RGB camera, the system estimates per-pixel depth, lifts it into a 3D point cloud, fits a ground plane via RANSAC, and flags regions that drop significantly below that plane as cliffs. We evaluate the pipeline on both real-world video and synthetic data generated using VisionSIM, which provides per-pixel ground-truth depth. On a warehouse loading-dock scenario, LS-aligned DA3-Large achieves a median AbsRel of 0.090 and δ<1.25 of 94.96%. The main bottleneck for real-world deployment is inference speed: DA3-Base runs at roughly 0.1 fps on an Apple M1, far below the 10+ fps needed for interactive robot control.

The Problem

Autonomous robots in warehouses rely on LiDAR and cameras for navigation. A persistent failure mode is cliff detection — the inability to identify negative obstacles like stair edges, platform drops, or open bay doors. Standard 2D LiDAR scans a fixed horizontal plane, so a downward drop registers no return ping and is completely invisible to the system.

Our goal is to detect cliff edges from a single RGB camera — hardware already present on most robots, with no additional cost or hardware changes.

Hypothesis: Monocular depth from DA3 produces sufficiently accurate depth discontinuities at cliff edges to enable reliable detection from RGB alone, without additional sensors.

Depth Anything 3

DA3 is a monocular and multi-view depth model from ByteDance Seed. It predicts spatially consistent 3D geometry from any number of RGB views, with or without known camera poses. Our pipeline uses its per-pixel depth output, lifted into a 3D point cloud for cliff detection.

Inference time on Apple M1 per frame: Small ~3s  /  Base ~3s  /  Large ~6s  /  Giant ~17s.

Cliff Detection Pipeline

  1. Frame extraction — Video sampled at a configurable rate, trading off accuracy and speed.
  2. Depth estimation — Each frame passed through DA3 to produce a per-pixel depth map.
  3. 3D point cloud lift — Depth map back-projected into 3D using camera intrinsics (fx, fy).
  4. Ground plane estimation — Bottom 50% of the point cloud used to fit a ground plane via RANSAC, preventing misidentification of background objects as the floor.
  5. Cliff flagging — Pixels significantly below the estimated ground plane flagged and highlighted in red on the output video.

Tunable parameters: camera intrinsics, RANSAC residual threshold, cliff height threshold, ground region fraction (default 50%).

Simulation for Data Generation and Evaluation

We used VisionSIM to generate synthetic evaluation data with per-pixel ground-truth depth — something real RGB-D sensors cannot reliably provide at depth discontinuities.

Pipeline: Blender scene → VisionSIM render (RGB + ground-truth depth + camera transforms) → DA3 inference → scale+shift alignment → metrics. We also implemented weighted sliding-window alignment to handle DA3 scale drift along trajectories.

Top-down view of warehouse scene

Top-down view of the scene

Generated scene render

Generated scene

Generated camera image and ground truth depth

Generated camera image and ground truth depth data

Demo Videos

Real-World Cliff Detection

Simulation: Input / Ground Truth / DA3 Aligned

Simulation: Cliff Detection / DA3 Depth

Qualitative Results

Failure Case — False Positive

The algorithm can produce false positives when a fence or distant background creates a depth discontinuity that RANSAC mistakes for the ground level.

False positive: fence misidentified as cliff

False positive: fence in background confuses ground plane estimation

Quantitative Results — Warehouse Scene

Warehouse interior with a loading-dock cliff (3m × 1.5m dock, 0.75m drop). Camera: 640×480, 70° HFOV, −25° pitch, head-on approach at 0.2 m/s. Model: DA3-Large.

MetricLS MedianLS MeanRANSAC MedianRANSAC Mean
AbsRel ↓0.0900.1190.0990.254
RMSE (m) ↓0.4760.5090.2790.389
δ < 1.25 ↑94.9683.8291.4584.79
δ < 1.25² ↑99.6794.8099.6490.37
δ < 1.25³ ↑99.7897.2699.7592.40

Bold = better per metric. AbsRel = mean |pred−gt|/gt. RMSE in meters. δ < 1.25k = % pixels where max(p/g, g/p) < 1.25k.

Best frame (LS): #156 AbsRel 0.009. Worst frame: #120 AbsRel 0.467. Per-frame scale ranged 0.49–2.44, showing DA3 scale drifts along the trajectory. The mean/median gap indicates a few outlier frames dominate the average, motivating sliding-window alignment.

Discussion

Efficiency

DA3 is currently too slow for real-time robotics on consumer hardware. At 2 fps on an Apple M1, DA3-Base takes ~10 minutes for a 1-minute video; DA3-Large takes ~40 minutes. The cliff detection step itself takes only ~7 seconds, so inference is the bottleneck. Real-time deployment would require GPU acceleration, a lighter model variant, or quantization.

Near-Edge Failure Case

When the camera is very close to the cliff, the surface below occupies more than 50% of the bottom frame region. RANSAC incorrectly identifies it as the ground plane and misses the cliff. In practice this is less critical — the cliff should have been detected earlier in the approach, and odometry or localization can help the robot remember and avoid the location.

Future Work