Give this system an image and the name of an object. It returns the 3D contact point, approach vector, and gripper quaternion needed to pick it up. No depth camera. No labeled dataset. No CAD model.
The stack: Grounded-DINO locates the object by name, SAM 2 segments it to a pixel mask, Depth Anything V2 estimates metric depth, Open3D builds a point cloud and extracts the surface normal, and Shepperd's method converts the gripper frame to a quaternion.
This repository is also a book. Every decision from detection threshold to voxel size is documented in book/chapters/.
You write Python. You are curious about robotics or computer vision. You want to understand not just how to call the models, but why the pipeline is shaped the way it is.
No robotics background required.
$ python -m src.pipeline demo/images/apple.jpg "apple"
Stage 1/5: Detection...
Grounded-DINO loaded on cuda
apple score=0.972 box=[312, 198, 589, 467]
Stage 2/5: Segmentation...
SAM 2 loaded on cuda
mask score=0.996 coverage=14.3%
Stage 3/5: Depth estimation...
Depth Anything V2 loaded on cuda
metric depth=0.098m
Stage 4/5: Point cloud...
742 points
centroid=[0.0019, 0.0025, 0.0799]
normal=[0.2762, -0.0231, -0.9608]
Stage 5/5: Grasp computation...
[apple] Grasp position: [0.0019, 0.0025, 0.0799] m
[apple] Approach vector: [-0.2762, 0.0231, 0.9608]
[apple] Pre-grasp pos: [0.0295, 0.0002, -0.0162] m
[apple] Quaternion: [-0.1395, 0.0114, 0.9902, 0.0016]
Timing: {'detect': 4.76, 'segment': 2.01, 'depth': 1.72, 'pointcloud': 0.2, 'grasp': 0.0, 'total': 9.05}
| Stage | Module | What it does |
|---|---|---|
| 1. Detect | src/detect.py | Grounded-DINO: language-image cross-attention returns a bounding box for any text label |
| 2. Segment | src/segment.py | SAM 2: the box prompts a transformer decoder that outputs a pixel-precise binary mask |
| 3. Depth | src/depth.py | Depth Anything V2: dense relative depth map, scaled to metric using known object size |
| 4. Point cloud | src/pointcloud.py | Masked pixels back-projected to 3D, voxel-downsampled, surface normal estimated by PCA (sketched below) |
| 5. Grasp | src/grasp.py | Centroid + normal build a gripper frame; Shepperd's method converts it to a quaternion |
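If you want the stage 4 geometry before reading Chapter 4, here is a minimal sketch: back-project the masked pixels through a pinhole model, downsample with Open3D, and take the smallest principal component of the points as the surface normal. The function name, intrinsics, and 5 mm voxel size are illustrative, not the code in src/pointcloud.py.

```python
import numpy as np
import open3d as o3d

def mask_to_grasp_geometry(depth_m, mask, fx, fy, cx, cy, voxel=0.005):
    """Back-project masked pixels to 3D and estimate one surface normal.

    depth_m: (H, W) metric depth map; mask: (H, W) boolean object mask;
    fx, fy, cx, cy: pinhole intrinsics in pixels. Illustrative sketch only.
    """
    v, u = np.nonzero(mask)                        # pixel rows / cols inside the mask
    z = depth_m[v, u]
    x = (u - cx) * z / fx                          # pinhole back-projection
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=1).astype(np.float64)

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(pts)
    pcd = pcd.voxel_down_sample(voxel_size=voxel)  # thin out redundant points (5 mm cells)
    pts = np.asarray(pcd.points)

    centroid = pts.mean(axis=0)
    # PCA: the eigenvector with the smallest eigenvalue of the covariance matrix
    # is the direction of least spread across the patch, i.e. the surface normal.
    eigvals, eigvecs = np.linalg.eigh(np.cov((pts - centroid).T))
    normal = eigvecs[:, 0]                         # eigh sorts eigenvalues ascending
    if normal[2] > 0:                              # flip so it points back toward the camera (-Z)
        normal = -normal
    return centroid, normal
```

Open3D's estimate_normals could give per-point normals instead; a single PCA normal over the whole object is enough here because the grasp only needs one approach direction.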
Python 3.12
NVIDIA GPU, CUDA 12.1
8 GB VRAM (6 GB minimum)
Linux (tested on Ubuntu 22.04)
git clone https://github.com/JadELHAJJ-prog/GraspAnything
cd GraspAnything
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
# 1. PyTorch with CUDA 12.1 (must come first, needs the index URL)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# 2. Everything else
pip install -r requirements.txt
# 3. SAM 2 (not on PyPI, install from source)
pip install git+https://github.com/facebookresearch/sam2.git
# 4. GroundingDINO from source (apply two patches before installing, see Chapter 1)
git clone https://github.com/IDEA-Research/GroundingDINO.git vendor/GroundingDINO
pip install -e vendor/GroundingDINO --no-build-isolation --no-deps

Two patches are required because GroundingDINO was written against an older version of transformers. See Chapter 1 for the exact edits.
Download weights:
mkdir -p weights/sam2
# GroundingDINO SwinT OGC
wget -q https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py \
-O weights/GroundingDINO_SwinT_OGC.py
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth \
-O weights/groundingdino_swint_ogc.pth
# SAM 2.1 Large
wget -q https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt \
-O weights/sam2/sam2.1_hiera_large.pt
# Depth Anything V2 downloads automatically from HuggingFace on first run

Single image:
python -m src.pipeline demo/images/apple.jpg "apple"
python -m src.pipeline demo/images/can.jpg "can"
python -m src.pipeline demo/images/packet.jpg "packet"

Multiple objects, period-separated prompt:
python -m src.pipeline demo/images/apple.jpg "apple . can . packet"

Video with SAM 2 propagation:
python -m src.video_pipeline demo/images/apple_video.mp4 "apple"

Antipodal grasp refinement:
python -m src.grasp_refine demo/images/apple.jpg "apple"

Run any stage independently:
python -m src.detect demo/images/apple.jpg "apple"
python -m src.segment demo/images/apple.jpg "apple"
python -m src.depth demo/images/apple.jpg "apple"
python -m src.pointcloud demo/images/apple.jpg "apple"
python -m src.grasp demo/images/apple.jpg "apple"

Each run saves demo/outputs/pipeline_{label}_result.json:
{
"label": "apple",
"detection_score": 0.9721526503562927,
"mask_score": 0.99609375,
"depth_m": 0.09784047992516917,
"centroid": [
0.001852602543656112,
0.002460536884571966,
0.0799000653476575
],
"normal": [
0.2762273152007533,
-0.023052410458895164,
-0.9608158287148576
],
"grasp_position": [
0.001852602543656112,
0.002460536884571966,
0.0799000653476575
],
"approach_vector": [
-0.2762273152007533,
0.023052410458895164,
0.9608158287148576
],
"pregrasp_position": [
0.02947533406373144,
0.00015529583868244962,
-0.016181517523828265
],
"quaternion": [
-0.1395055584363536,
0.011414237071850443,
0.990154194104964,
0.0016081833782189924
],
"timing": {
"detect": 4.76,
"segment": 2.01,
"depth": 1.72,
"pointcloud": 0.2,
"grasp": 0.0,
"total": 9.05
}
}

All positions are in metres, camera optical frame. Quaternion is [qx, qy, qz, qw].
pregrasp_position is 10 cm back along the approach vector.
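If you want to sanity-check these conventions in code, here is a sketch of the stage 5 geometry: a gripper frame whose z-axis is the approach vector, Shepperd's method to turn the rotation matrix into [qx, qy, qz, qw], and a 10 cm standoff for the pre-grasp. How the in-plane x/y axes are seeded is an assumption here, so the quaternion will not necessarily match src/grasp.py; the pre-grasp arithmetic will.

```python
import numpy as np

def grasp_from_normal(centroid, normal, standoff=0.10):
    """Gripper frame, quaternion, and pre-grasp from a centroid + surface normal (sketch)."""
    approach = -np.asarray(normal, dtype=float)      # approach opposes the surface normal
    approach /= np.linalg.norm(approach)

    # Orthonormal frame: z = approach; x/y built from an arbitrary reference axis.
    # Which reference axis seeds x is an assumption, not the repository's convention.
    ref = np.array([1.0, 0.0, 0.0])
    if abs(approach @ ref) > 0.9:                    # avoid a near-parallel reference
        ref = np.array([0.0, 1.0, 0.0])
    x_axis = np.cross(ref, approach)
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(approach, x_axis)
    R = np.column_stack([x_axis, y_axis, approach])  # columns are the gripper axes

    # Shepperd's method: pick the best-conditioned pivot (trace or a diagonal
    # element) so the square root never operates on a small number.
    t = np.trace(R)
    if t > R[0, 0] and t > R[1, 1] and t > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + t)
        qw = 0.25 * s
        qx = (R[2, 1] - R[1, 2]) / s
        qy = (R[0, 2] - R[2, 0]) / s
        qz = (R[1, 0] - R[0, 1]) / s
    elif R[0, 0] > R[1, 1] and R[0, 0] > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + R[0, 0] - R[1, 1] - R[2, 2])
        qw = (R[2, 1] - R[1, 2]) / s
        qx = 0.25 * s
        qy = (R[0, 1] + R[1, 0]) / s
        qz = (R[0, 2] + R[2, 0]) / s
    elif R[1, 1] > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + R[1, 1] - R[0, 0] - R[2, 2])
        qw = (R[0, 2] - R[2, 0]) / s
        qx = (R[0, 1] + R[1, 0]) / s
        qy = 0.25 * s
        qz = (R[1, 2] + R[2, 1]) / s
    else:
        s = 2.0 * np.sqrt(1.0 + R[2, 2] - R[0, 0] - R[1, 1])
        qw = (R[1, 0] - R[0, 1]) / s
        qx = (R[0, 2] + R[2, 0]) / s
        qy = (R[1, 2] + R[2, 1]) / s
        qz = 0.25 * s

    pregrasp = np.asarray(centroid) - standoff * approach  # 10 cm back along the approach
    return np.array([qx, qy, qz, qw]), pregrasp
```

Feeding the JSON's centroid and normal through this reproduces the pregrasp_position above to within rounding.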
GraspAnything/
src/
detect.py Grounded-DINO zero-shot detection
segment.py SAM 2 segmentation from bounding box
depth.py Depth Anything V2 monocular depth
pointcloud.py mask + depth to Open3D point cloud
grasp.py centroid + normal to grasp quaternion
grasp_refine.py antipodal grasp candidates + stability scoring
pipeline.py single-call end-to-end runner
video_pipeline.py SAM 2 video propagation + per-frame grasp
visualise.py Open3D offscreen renders
book/
chapters/ one .md file per chapter
assets/ SVG diagrams linked inline
notebooks/
01_detection.ipynb
02_segmentation.ipynb
03_depth.ipynb
04_pointcloud.ipynb
05_full_pipeline.ipynb
demo/
images/ test images and video (committed)
outputs/ generated on first run, not committed
weights/ model checkpoints, downloaded during install, not committed
vendor/
GroundingDINO/ cloned and patched during install, not committed
| Chapter | Topic | Core question |
|---|---|---|
| Prologue | The grasping problem | Why do classical detectors fail? |
| 1 | Grounded-DINO | How does a text prompt find an object? |
| 2 | SAM 2 | How does a box become a pixel mask? |
| 3 | Depth Anything V2 | How do you get metric depth from one image? |
| 4 | Back-projection | How do pixels become 3D points? |
| 5 | Grasp geometry | How do you find the approach direction? |
| 6 | The full pipeline | How do all five stages connect? |
| 7 | Video tracking | How does SAM 2 propagate across frames? |
| 8 | Grasp refinement | What makes one grasp better than another? |
| Epilogue | What this built | Where does it go next? |
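One preview of Chapter 3, since "metric depth from one image" sounds like magic: Depth Anything V2 only gives relative depth, so the pipeline has to fix a single scale factor. A common way to do that, and the spirit of the "known object size" scaling mentioned in the stage table, is the pinhole relation Z = fx * W / w for an object of known real width W spanning w pixels. The sketch below assumes the relative map is proportional to true distance (invert it first if the model outputs inverse depth); the names are illustrative, not the code in src/depth.py.

```python
import numpy as np

def relative_to_metric(depth_rel, mask, bbox_px, fx, known_width_m):
    """Rescale a relative depth map to metres using one object of known size (sketch).

    depth_rel: (H, W) relative depth map; mask: (H, W) boolean object mask;
    bbox_px: (x1, y1, x2, y2) detection box; fx: focal length in pixels.
    """
    x1, _, x2, _ = bbox_px
    w_pixels = float(x2 - x1)                   # apparent object width in pixels
    z_metric = fx * known_width_m / w_pixels    # pinhole model: Z = fx * W / w
    z_relative = np.median(depth_rel[mask])     # the object's depth in relative units
    return depth_rel * (z_metric / z_relative)  # one scale factor for the whole map
```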
Python 3.12 · torch 2.5.1+cu121 · open3d 0.19.0
Grounded-DINO · SAM 2.1 Large · Depth Anything V2 Base
CUDA 12.1 · transformers 5.5.4 · numpy 2.4.3
computer-vision robotics grasping grounded-dino sam2 depth-anything open3d zero-shot foundation-models point-cloud quaternion grasp-pose python