
GraspAnything

Text prompt to robot grasp pose in one call, zero training data required


Give this system an image and the name of an object. It returns the 3D contact point, approach vector, and gripper quaternion needed to pick it up. No depth camera. No labeled dataset. No CAD model.

The stack: Grounded-DINO locates the object by name, SAM 2 segments it to a pixel mask, Depth Anything V2 estimates dense depth that is scaled to metric, Open3D builds a point cloud and extracts the surface normal, and Shepperd's method converts the gripper frame to a quaternion.

This repository is also a book. Every decision, from detection threshold to voxel size, is documented in book/chapters/.


Who this is for

You write Python. You are curious about robotics or computer vision. You want to understand not just how to call the models, but why the pipeline is shaped the way it is.

No robotics background required.


Terminal output

$ python -m src.pipeline demo/images/apple.jpg "apple"

Stage 1/5: Detection...
  Grounded-DINO loaded on cuda
  apple  score=0.972  box=[312, 198, 589, 467]
Stage 2/5: Segmentation...
  SAM 2 loaded on cuda
  mask score=0.996  coverage=14.3%
Stage 3/5: Depth estimation...
  Depth Anything V2 loaded on cuda
  metric depth=0.098m
Stage 4/5: Point cloud...
  742 points
  centroid=[0.0019, 0.0025, 0.0799]
  normal=[0.2762, -0.0231, -0.9608]
Stage 5/5: Grasp computation...

[apple] Grasp position:    [0.0019, 0.0025, 0.0799] m
[apple] Approach vector:   [-0.2762, 0.0231, 0.9608]
[apple] Pre-grasp pos:     [0.0295, 0.0002, -0.0162] m
[apple] Quaternion:        [-0.1395, 0.0114, 0.9902, 0.0016]

Timing: {'detect': 4.76, 'segment': 2.01, 'depth': 1.72, 'pointcloud': 0.2, 'grasp': 0.0, 'total': 9.05}

How it works

Pipeline flow

Stage           Module               What it does
1. Detect       src/detect.py        Grounded-DINO: language-image cross-attention returns a bounding box for any text label
2. Segment      src/segment.py       SAM 2: the box prompts a transformer decoder that outputs a pixel-precise binary mask
3. Depth        src/depth.py         Depth Anything V2: dense relative depth map, scaled to metric using known object size
4. Point cloud  src/pointcloud.py    Masked pixels back-projected to 3D, voxel-downsampled, surface normal estimated by PCA
5. Grasp        src/grasp.py         Centroid + normal build a gripper frame; Shepperd's method converts it to a quaternion
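
The geometric heart of stages 4 and 5 fits in plain numpy. The sketch below is illustrative, not the repository's code: the helper names, the reference axis used to complete the gripper frame, and the assumption that the depth map is already metric (stage 3 handles the scaling) are all mine. The real implementations live in src/pointcloud.py and src/grasp.py.

import numpy as np

def backproject(depth, mask, fx, fy, cx, cy):
    """Pinhole back-projection: masked pixels + metric depth -> 3D points, camera frame."""
    v, u = np.nonzero(mask)            # pixel coordinates inside the object mask
    z = depth[v, u]                    # metric depth per pixel (scaled in stage 3)
    x = (u - cx) * z / fx              # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def pca_normal(points):
    """Centroid plus surface normal: the normal is the eigenvector of the covariance
    matrix with the smallest eigenvalue (the direction of least variance)."""
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)
    _, eigvecs = np.linalg.eigh(cov)   # eigh sorts eigenvalues ascending
    return centroid, eigvecs[:, 0]     # sign is ambiguous; real code orients it first

def gripper_frame(normal):
    """Orthonormal frame whose z-axis approaches the surface against the normal."""
    approach = -normal / np.linalg.norm(normal)
    # Any reference axis not parallel to the approach completes the frame.
    ref = np.array([0.0, 1.0, 0.0]) if abs(approach[0]) > 0.9 else np.array([1.0, 0.0, 0.0])
    x = np.cross(ref, approach)
    x /= np.linalg.norm(x)
    y = np.cross(approach, x)
    return np.column_stack([x, y, approach])   # rotation matrix, columns = gripper axes

def shepperd(R):
    """Shepperd's method: solve for the largest of the four quaternion components
    first, so the divisions below never use a near-zero denominator."""
    t = np.trace(R)
    if t > R[0, 0] and t > R[1, 1] and t > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + t)
        qw, qx, qy, qz = 0.25 * s, (R[2, 1] - R[1, 2]) / s, (R[0, 2] - R[2, 0]) / s, (R[1, 0] - R[0, 1]) / s
    elif R[0, 0] > R[1, 1] and R[0, 0] > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + R[0, 0] - R[1, 1] - R[2, 2])
        qw, qx, qy, qz = (R[2, 1] - R[1, 2]) / s, 0.25 * s, (R[0, 1] + R[1, 0]) / s, (R[0, 2] + R[2, 0]) / s
    elif R[1, 1] > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + R[1, 1] - R[0, 0] - R[2, 2])
        qw, qx, qy, qz = (R[0, 2] - R[2, 0]) / s, (R[0, 1] + R[1, 0]) / s, 0.25 * s, (R[1, 2] + R[2, 1]) / s
    else:
        s = 2.0 * np.sqrt(1.0 + R[2, 2] - R[0, 0] - R[1, 1])
        qw, qx, qy, qz = (R[1, 0] - R[0, 1]) / s, (R[0, 2] + R[2, 0]) / s, (R[1, 2] + R[2, 1]) / s, 0.25 * s
    return np.array([qx, qy, qz, qw])  # scalar-last, matching the JSON output

Voxel downsampling (Open3D's voxel_down_sample) sits between back-projection and the PCA, which is why stage 4 reports 742 points rather than one per masked pixel.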

Requirements

Python 3.12
NVIDIA GPU, CUDA 12.1
8 GB VRAM (6 GB minimum)
Linux (tested on Ubuntu 22.04)

Installation

git clone https://github.com/JadELHAJJ-prog/GraspAnything
cd GraspAnything
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

# 1. PyTorch with CUDA 12.1 (must come first, needs the index URL)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 2. Everything else
pip install -r requirements.txt

# 3. SAM 2 (not on PyPI, install from source)
pip install git+https://github.com/facebookresearch/sam2.git

# 4. GroundingDINO from source (apply two patches before installing, see Chapter 1)
git clone https://github.com/IDEA-Research/GroundingDINO.git vendor/GroundingDINO
pip install -e vendor/GroundingDINO --no-build-isolation --no-deps

Two patches are required because GroundingDINO was written against an older version of the transformers library. See Chapter 1 for the exact edits.
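
A quick way to confirm that pip pulled the CUDA build of PyTorch rather than the CPU wheel:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# expected: a version tagged +cu121 and True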

Download weights:

mkdir -p weights/sam2

# GroundingDINO SwinT OGC
wget -q https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py \
     -O weights/GroundingDINO_SwinT_OGC.py
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth \
     -O weights/groundingdino_swint_ogc.pth

# SAM 2.1 Large
wget -q https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt \
     -O weights/sam2/sam2.1_hiera_large.pt

# Depth Anything V2 downloads automatically from HuggingFace on first run

Usage

Single image:

python -m src.pipeline demo/images/apple.jpg "apple"
python -m src.pipeline demo/images/can.jpg "can"
python -m src.pipeline demo/images/packet.jpg "packet"

Multiple objects, period-separated prompt:

python -m src.pipeline demo/images/apple.jpg "apple . can . packet"

Video with SAM 2 propagation:

python -m src.video_pipeline demo/images/apple_video.mp4 "apple"

Antipodal grasp refinement:

python -m src.grasp_refine demo/images/apple.jpg "apple"

Run any stage independently:

python -m src.detect     demo/images/apple.jpg "apple"
python -m src.segment    demo/images/apple.jpg "apple"
python -m src.depth      demo/images/apple.jpg "apple"
python -m src.pointcloud demo/images/apple.jpg "apple"
python -m src.grasp      demo/images/apple.jpg "apple"

Output format

Each run saves demo/outputs/pipeline_{label}_result.json:

{
  "label": "apple",
  "detection_score": 0.9721526503562927,
  "mask_score": 0.99609375,
  "depth_m": 0.09784047992516917,
  "centroid": [
    0.001852602543656112,
    0.002460536884571966,
    0.0799000653476575
  ],
  "normal": [
    0.2762273152007533,
    -0.023052410458895164,
    -0.9608158287148576
  ],
  "grasp_position": [
    0.001852602543656112,
    0.002460536884571966,
    0.0799000653476575
  ],
  "approach_vector": [
    -0.2762273152007533,
    0.023052410458895164,
    0.9608158287148576
  ],
  "pregrasp_position": [
    0.02947533406373144,
    0.00015529583868244962,
    -0.016181517523828265
  ],
  "quaternion": [
    -0.1395055584363536,
    0.011414237071850443,
    0.990154194104964,
    0.0016081833782189924
  ],
  "timing": {
    "detect": 4.76,
    "segment": 2.01,
    "depth": 1.72,
    "pointcloud": 0.2,
    "grasp": 0.0,
    "total": 9.05
  }
}

All positions are in metres, expressed in the camera optical frame. The quaternion is ordered [qx, qy, qz, qw]. pregrasp_position sits 10 cm back from the grasp position along the approach vector.
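
Those conventions are easy to verify against a result file. A minimal sketch, assuming scipy is installed (scipy's Rotation uses the same scalar-last order; which rotated axis corresponds to the approach depends on the frame convention in src/grasp.py):

import json
import numpy as np
from scipy.spatial.transform import Rotation

with open("demo/outputs/pipeline_apple_result.json") as f:
    result = json.load(f)

grasp = np.array(result["grasp_position"])
approach = np.array(result["approach_vector"])

# pregrasp_position = grasp_position minus 10 cm along the approach vector
assert np.allclose(grasp - 0.10 * approach, result["pregrasp_position"])

# scipy's from_quat expects scalar-last [qx, qy, qz, qw], same as the JSON
R = Rotation.from_quat(result["quaternion"]).as_matrix()
print(R)  # gripper orientation in the camera optical frame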


Project structure

GraspAnything/
  src/
    detect.py           Grounded-DINO zero-shot detection
    segment.py          SAM 2 segmentation from bounding box
    depth.py            Depth Anything V2 monocular depth
    pointcloud.py       mask + depth to Open3D point cloud
    grasp.py            centroid + normal to grasp quaternion
    grasp_refine.py     antipodal grasp candidates + stability scoring
    pipeline.py         single-call end-to-end runner
    video_pipeline.py   SAM 2 video propagation + per-frame grasp
    visualise.py        Open3D offscreen renders
  book/
    chapters/           one .md file per chapter
    assets/             SVG diagrams linked inline
  notebooks/
    01_detection.ipynb
    02_segmentation.ipynb
    03_depth.ipynb
    04_pointcloud.ipynb
    05_full_pipeline.ipynb
  demo/
    images/             test images and video (committed)
    outputs/            generated on first run, not committed
  weights/              model checkpoints, downloaded during install, not committed
  vendor/
    GroundingDINO/      cloned and patched during install, not committed

Book

Chapter    Topic                 Core question
Prologue   The grasping problem  Why do classical detectors fail?
1          Grounded-DINO         How does a text prompt find an object?
2          SAM 2                 How does a box become a pixel mask?
3          Depth Anything V2     How do you get metric depth from one image?
4          Back-projection       How do pixels become 3D points?
5          Grasp geometry        How do you find the approach direction?
6          The full pipeline     How do all five stages connect?
7          Video tracking        How does SAM 2 propagate across frames?
8          Grasp refinement      What makes one grasp better than another?
Epilogue   What this built       Where does it go next?

Stack

Python 3.12          torch 2.5.1+cu121     open3d 0.19.0
Grounded-DINO        SAM 2.1 Large         Depth Anything V2 Base
CUDA 12.1            transformers 5.5.4    numpy 2.4.3
