Give this system an image and the name of an object. It returns the 3D contact point, approach vector, and gripper quaternion needed to pick it up. No depth camera. No labeled dataset. No CAD model.
The stack: Grounded-DINO locates the object by name, SAM 2 segments it to a pixel mask, Depth Anything V2 estimates metric depth, Open3D builds a point cloud and extracts the surface normal, and Shepperd's method converts the gripper frame to a quaternion.
This repository is also a book. Every decision from detection threshold to voxel size is documented in book/chapters/.
You write Python. You are curious about robotics or computer vision. You want to understand not just how to call the models, but why the pipeline is shaped the way it is.
No robotics background required.
$ python -m src.pipeline demo/images/apple.jpg "apple"
Stage 1/5: Detection...
Grounded-DINO loaded on cuda
apple score=0.972 box=[312, 198, 589, 467]
Stage 2/5: Segmentation...
SAM 2 loaded on cuda
mask score=0.996 coverage=14.3%
Stage 3/5: Depth estimation...
Depth Anything V2 loaded on cuda
metric depth=0.098m
Stage 4/5: Point cloud...
742 points
centroid=[0.0019, 0.0025, 0.0799]
normal=[0.2762, -0.0231, -0.9608]
Stage 5/5: Grasp computation...
[apple] Grasp position: [0.0019, 0.0025, 0.0799] m
[apple] Approach vector: [-0.2762, 0.0231, 0.9608]
[apple] Pre-grasp pos: [0.0295, 0.0002, -0.0162] m
[apple] Quaternion: [-0.1395, 0.0114, 0.9902, 0.0016]
Timing: {'detect': 4.76, 'segment': 2.01, 'depth': 1.72, 'pointcloud': 0.2, 'grasp': 0.0, 'total': 9.05}
| Stage | Module | What it does |
|---|---|---|
| 1. Detect | src/detect.py | Grounded-DINO: language-image cross-attention returns a bounding box for any text label |
| 2. Segment | src/segment.py | SAM 2: the box prompts a transformer decoder that outputs a pixel-precise binary mask |
| 3. Depth | src/depth.py | Depth Anything V2: dense relative depth map, scaled to metric using known object size |
| 4. Point cloud | src/pointcloud.py | Masked pixels back-projected to 3D, voxel-downsampled, surface normal estimated by PCA (sketched below) |
| 5. Grasp | src/grasp.py | Centroid + normal build a gripper frame; Shepperd's method converts it to a quaternion |
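If you want the stage 4 geometry before reading Chapter 4, here is a minimal sketch: back-project the masked pixels through a pinhole model, downsample with Open3D, and take the smallest principal component of the points as the surface normal. The function name, intrinsics, and 5 mm voxel size are illustrative, not the code in src/pointcloud.py.

```python
import numpy as np
import open3d as o3d

def mask_to_grasp_geometry(depth_m, mask, fx, fy, cx, cy, voxel=0.005):
    """Back-project masked pixels to 3D and estimate one surface normal.

    depth_m: (H, W) metric depth map; mask: (H, W) boolean object mask;
    fx, fy, cx, cy: pinhole intrinsics in pixels. Illustrative sketch only.
    """
    v, u = np.nonzero(mask)                        # pixel rows / cols inside the mask
    z = depth_m[v, u]
    x = (u - cx) * z / fx                          # pinhole back-projection
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=1).astype(np.float64)

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(pts)
    pcd = pcd.voxel_down_sample(voxel_size=voxel)  # thin out redundant points (5 mm cells)
    pts = np.asarray(pcd.points)

    centroid = pts.mean(axis=0)
    # PCA: the eigenvector with the smallest eigenvalue of the covariance matrix
    # is the direction of least spread across the patch, i.e. the surface normal.
    eigvals, eigvecs = np.linalg.eigh(np.cov((pts - centroid).T))
    normal = eigvecs[:, 0]                         # eigh sorts eigenvalues ascending
    if normal[2] > 0:                              # flip so it points back toward the camera (-Z)
        normal = -normal
    return centroid, normal
```

Open3D's estimate_normals could give per-point normals instead; a single PCA normal over the whole object is enough here because the grasp only needs one approach direction.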
Python 3.12
NVIDIA GPU, CUDA 12.1
8 GB VRAM (6 GB minimum)
Linux (tested on Ubuntu 22.04)
git clone https://github.com/JadELHAJJ-prog/GraspAnything
cd GraspAnything
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
# 1. PyTorch with CUDA 12.1 (must come first, needs the index URL)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# 2. Everything else
pip install -r requirements.txt
# 3. SAM 2 (not on PyPI, install from source)
pip install git+https://github.com/facebookresearch/sam2.git
# 4. GroundingDINO from source (apply two patches before installing, see Chapter 1)
git clone https://github.com/IDEA-Research/GroundingDINO.git vendor/GroundingDINO
pip install -e vendor/GroundingDINO --no-build-isolation --no-deps

Two patches are required because GroundingDINO was written against an older version of transformers. See Chapter 1 for the exact edits.
Download weights:
mkdir -p weights/sam2
# GroundingDINO SwinT OGC
wget -q https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py \
-O weights/GroundingDINO_SwinT_OGC.py
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth \
-O weights/groundingdino_swint_ogc.pth
# SAM 2.1 Large
wget -q https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt \
-O weights/sam2/sam2.1_hiera_large.pt
# Depth Anything V2 downloads automatically from HuggingFace on first run

Single image:
python -m src.pipeline demo/images/apple.jpg "apple"
python -m src.pipeline demo/images/can.jpg "can"
python -m src.pipeline demo/images/packet.jpg "packet"

Multiple objects, period-separated prompt:
python -m src.pipeline demo/images/apple.jpg "apple . can . packet"

Video with SAM 2 propagation:
python -m src.video_pipeline demo/images/apple_video.mp4 "apple"

Antipodal grasp refinement:
python -m src.grasp_refine demo/images/apple.jpg "apple"

Run any stage independently:
python -m src.detect demo/images/apple.jpg "apple"
python -m src.segment demo/images/apple.jpg "apple"
python -m src.depth demo/images/apple.jpg "apple"
python -m src.pointcloud demo/images/apple.jpg "apple"
python -m src.grasp demo/images/apple.jpg "apple"

Each run saves demo/outputs/pipeline_{label}_result.json:
{
"label": "apple",
"detection_score": 0.9721526503562927,
"mask_score": 0.99609375,
"depth_m": 0.09784047992516917,
"centroid": [
0.001852602543656112,
0.002460536884571966,
0.0799000653476575
],
"normal": [
0.2762273152007533,
-0.023052410458895164,
-0.9608158287148576
],
"grasp_position": [
0.001852602543656112,
0.002460536884571966,
0.0799000653476575
],
"approach_vector": [
-0.2762273152007533,
0.023052410458895164,
0.9608158287148576
],
"pregrasp_position": [
0.02947533406373144,
0.00015529583868244962,
-0.016181517523828265
],
"quaternion": [
-0.1395055584363536,
0.011414237071850443,
0.990154194104964,
0.0016081833782189924
],
"timing": {
"detect": 4.76,
"segment": 2.01,
"depth": 1.72,
"pointcloud": 0.2,
"grasp": 0.0,
"total": 9.05
}
}

All positions are in metres, camera optical frame. Quaternion is [qx, qy, qz, qw].
pregrasp_position is 10 cm back along the approach vector.
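If you want to sanity-check these conventions in code, here is a sketch of the stage 5 geometry: a gripper frame whose z-axis is the approach vector, Shepperd's method to turn the rotation matrix into [qx, qy, qz, qw], and a 10 cm standoff for the pre-grasp. How the in-plane x/y axes are seeded is an assumption here, so the quaternion will not necessarily match src/grasp.py; the pre-grasp arithmetic will.

```python
import numpy as np

def grasp_from_normal(centroid, normal, standoff=0.10):
    """Gripper frame, quaternion, and pre-grasp from a centroid + surface normal (sketch)."""
    approach = -np.asarray(normal, dtype=float)      # approach opposes the surface normal
    approach /= np.linalg.norm(approach)

    # Orthonormal frame: z = approach; x/y built from an arbitrary reference axis.
    # Which reference axis seeds x is an assumption, not the repository's convention.
    ref = np.array([1.0, 0.0, 0.0])
    if abs(approach @ ref) > 0.9:                    # avoid a near-parallel reference
        ref = np.array([0.0, 1.0, 0.0])
    x_axis = np.cross(ref, approach)
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(approach, x_axis)
    R = np.column_stack([x_axis, y_axis, approach])  # columns are the gripper axes

    # Shepperd's method: pick the best-conditioned pivot (trace or a diagonal
    # element) so the square root never operates on a small number.
    t = np.trace(R)
    if t > R[0, 0] and t > R[1, 1] and t > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + t)
        qw = 0.25 * s
        qx = (R[2, 1] - R[1, 2]) / s
        qy = (R[0, 2] - R[2, 0]) / s
        qz = (R[1, 0] - R[0, 1]) / s
    elif R[0, 0] > R[1, 1] and R[0, 0] > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + R[0, 0] - R[1, 1] - R[2, 2])
        qw = (R[2, 1] - R[1, 2]) / s
        qx = 0.25 * s
        qy = (R[0, 1] + R[1, 0]) / s
        qz = (R[0, 2] + R[2, 0]) / s
    elif R[1, 1] > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + R[1, 1] - R[0, 0] - R[2, 2])
        qw = (R[0, 2] - R[2, 0]) / s
        qx = (R[0, 1] + R[1, 0]) / s
        qy = 0.25 * s
        qz = (R[1, 2] + R[2, 1]) / s
    else:
        s = 2.0 * np.sqrt(1.0 + R[2, 2] - R[0, 0] - R[1, 1])
        qw = (R[1, 0] - R[0, 1]) / s
        qx = (R[0, 2] + R[2, 0]) / s
        qy = (R[1, 2] + R[2, 1]) / s
        qz = 0.25 * s

    pregrasp = np.asarray(centroid) - standoff * approach  # 10 cm back along the approach
    return np.array([qx, qy, qz, qw]), pregrasp
```

Feeding the JSON's centroid and normal through this reproduces the pregrasp_position above to within rounding.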
GraspAnything/
src/
detect.py Grounded-DINO zero-shot detection
segment.py SAM 2 segmentation from bounding box
depth.py Depth Anything V2 monocular depth
pointcloud.py mask + depth to Open3D point cloud
grasp.py centroid + normal to grasp quaternion
grasp_refine.py antipodal grasp candidates + stability scoring
pipeline.py single-call end-to-end runner
video_pipeline.py SAM 2 video propagation + per-frame grasp
visualise.py Open3D offscreen renders
book/
chapters/ one .md file per chapter
assets/ SVG diagrams linked inline
notebooks/
01_detection.ipynb
02_segmentation.ipynb
03_depth.ipynb
04_pointcloud.ipynb
05_full_pipeline.ipynb
demo/
images/ test images and video (committed)
outputs/ generated on first run, not committed
weights/ model checkpoints, downloaded during install, not committed
vendor/
GroundingDINO/ cloned and patched during install, not committed
| Chapter | Topic | Core question |
|---|---|---|
| Prologue | The grasping problem | Why do classical detectors fail? |
| 1 | Grounded-DINO | How does a text prompt find an object? |
| 2 | SAM 2 | How does a box become a pixel mask? |
| 3 | Depth Anything V2 | How do you get metric depth from one image? |
| 4 | Back-projection | How do pixels become 3D points? |
| 5 | Grasp geometry | How do you find the approach direction? |
| 6 | The full pipeline | How do all five stages connect? |
| 7 | Video tracking | How does SAM 2 propagate across frames? |
| 8 | Grasp refinement | What makes one grasp better than another? |
| Epilogue | What this built | Where does it go next? |
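One preview of Chapter 3, since "metric depth from one image" sounds like magic: Depth Anything V2 only gives relative depth, so the pipeline has to fix a single scale factor. A common way to do that, and the spirit of the "known object size" scaling mentioned in the stage table, is the pinhole relation Z = fx * W / w for an object of known real width W spanning w pixels. The sketch below assumes the relative map is proportional to true distance (invert it first if the model outputs inverse depth); the names are illustrative, not the code in src/depth.py.

```python
import numpy as np

def relative_to_metric(depth_rel, mask, bbox_px, fx, known_width_m):
    """Rescale a relative depth map to metres using one object of known size (sketch).

    depth_rel: (H, W) relative depth map; mask: (H, W) boolean object mask;
    bbox_px: (x1, y1, x2, y2) detection box; fx: focal length in pixels.
    """
    x1, _, x2, _ = bbox_px
    w_pixels = float(x2 - x1)                   # apparent object width in pixels
    z_metric = fx * known_width_m / w_pixels    # pinhole model: Z = fx * W / w
    z_relative = np.median(depth_rel[mask])     # the object's depth in relative units
    return depth_rel * (z_metric / z_relative)  # one scale factor for the whole map
```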
Python 3.12 · torch 2.5.1+cu121 · open3d 0.19.0
Grounded-DINO · SAM 2.1 Large · Depth Anything V2 Base
CUDA 12.1 · transformers 5.5.4 · numpy 2.4.3
computer-vision robotics grasping grounded-dino sam2 depth-anything open3d zero-shot foundation-models point-cloud quaternion grasp-pose python