Break free from static formats. Our platform empowers you to transform fixed content into fully manipulatable assets. Powered by SAM 3 and multimodal large models, it enables high-fidelity reconstruction that preserves the original diagram details and logical relationships.
👆 Click above or https://editbanana.anxin6.cn/ to try Edit Banana online! Upload an image to get editable DrawIO (XML) in seconds. Please note: Our GitHub repository currently trails behind our web-based service. For the most up-to-date features and performance, we recommend using our web platform.
Welcome to join our WeChat group to discuss and exchange ideas! Scan the QR code below to join:
Scan to join the Edit Banana community
💡 If the QR code has expired, please submit an Issue to request an updated one.
To demonstrate the high-fidelity conversion, we provide one-to-one comparisons between the original static formats and the editable reconstruction results across three scenarios. All elements can be individually dragged, styled, and modified.
✨ Conversion Highlights:
- Preserves the layout logic, color matching, and element hierarchy of the original diagram
- 1:1 restoration of shape stroke/fill and arrow styles (dashed lines/thickness)
- Accurate text recognition, supporting direct subsequent editing and format adjustment
- All elements are independently selectable, supporting native DrawIO template replacement and layout optimization
- Advanced Segmentation: Using our fine-tuned SAM 3 (Segment Anything Model 3) for segmentation of diagram elements.
- Fixed Multi-Round VLM Scanning: An extraction process guided by Multimodal LLMs (Qwen-VL/GPT-4V).
- Text Recognition:
  - Local OCR (Tesseract) for text localization; easy to install (`pip install pytesseract` plus the system `tesseract-ocr` package) and runs offline.
  - Pix2Text for mathematical formula recognition and LaTeX conversion (e.g. `$\int f(x) dx$`).
  - Crop-Guided Strategy: extracts text/formula regions and sends high-resolution crops to the formula engine.
- User System:
- Registration: New users receive 10 free credits.
- Credit System: Pay-per-use model prevents resource abuse.
- Multi-User Concurrency: Built-in support for concurrent user sessions using a Global Lock mechanism for thread-safe GPU access and an LRU Cache (Least Recently Used) to persist image embeddings across requests, ensuring high performance and stability.
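The concurrency design above combines a global lock for thread-safe GPU access with an LRU cache for image embeddings. A minimal sketch of that pattern, assuming nothing about the project's internals (the class and method names here are illustrative, not the actual API):

```python
from collections import OrderedDict
from threading import Lock

class EmbeddingLRUCache:
    """Thread-safe LRU cache, e.g. for per-image SAM embeddings.

    Hypothetical sketch: names are illustrative, not the project's API.
    """

    def __init__(self, capacity: int = 8):
        self._capacity = capacity
        self._store: OrderedDict[str, object] = OrderedDict()
        self._lock = Lock()  # serializes access, mirroring a global GPU lock

    def get(self, key: str):
        with self._lock:
            if key not in self._store:
                return None
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]

    def put(self, key: str, value) -> None:
        with self._lock:
            if key in self._store:
                self._store.move_to_end(key)
            self._store[key] = value
            if len(self._store) > self._capacity:
                self._store.popitem(last=False)  # evict least recently used

cache = EmbeddingLRUCache(capacity=2)
cache.put("a.png", [0.1])
cache.put("b.png", [0.2])
cache.get("a.png")         # touch "a.png" so "b.png" becomes the oldest entry
cache.put("c.png", [0.3])  # exceeds capacity: evicts "b.png"
print(cache.get("b.png"))  # None
```

Persisting embeddings this way means repeated requests for the same image skip the expensive encoder pass while the lock keeps concurrent sessions from interleaving GPU work.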
- Input: Image (PNG/JPG/BMP/TIFF/WebP).
- Segmentation (SAM3): Using our fine-tuned SAM3 mask decoder.
- Text Extraction (Parallel):
- Local OCR (Tesseract) detects text bounding boxes.
- High-res crops of text/formula regions are sent to Pix2Text for LaTeX conversion.
- DrawIO XML Generation: Merging spatial data from SAM3 and text OCR results.
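The final merge step emits DrawIO's `mxGraphModel` XML. A minimal sketch of what producing that format from merged geometry + text looks like (the element names follow the DrawIO/mxGraph file format; the input dictionaries are hypothetical, not the project's internal data structures):

```python
import xml.etree.ElementTree as ET

def boxes_to_drawio(shapes):
    """Build a minimal DrawIO (mxGraphModel) XML string from
    {label, x, y, w, h} dicts, e.g. merged SAM masks + OCR text."""
    model = ET.Element("mxGraphModel")
    root = ET.SubElement(model, "root")
    ET.SubElement(root, "mxCell", id="0")              # DrawIO's required root cells
    ET.SubElement(root, "mxCell", id="1", parent="0")
    for i, s in enumerate(shapes, start=2):
        cell = ET.SubElement(
            root, "mxCell", id=str(i), parent="1", vertex="1",
            value=s["label"], style="rounded=0;whiteSpace=wrap;html=1;",
        )
        ET.SubElement(
            cell, "mxGeometry", x=str(s["x"]), y=str(s["y"]),
            width=str(s["w"]), height=str(s["h"]),
            **{"as": "geometry"},                      # DrawIO expects an "as" attribute
        )
    return ET.tostring(model, encoding="unicode")

xml_str = boxes_to_drawio([{"label": "Start", "x": 40, "y": 40, "w": 120, "h": 60}])
print(xml_str)
```

Pasting such a string into draw.io (Extras → Edit Diagram) renders each box as an individually editable vertex.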
Edit-Banana/
├── config/ # Configuration files (copy config.yaml.example → config.yaml)
├── flowchart_text/ # OCR & Text Extraction Module (standalone entry)
│ ├── src/
│ └── main.py # OCR-only entry point
├── input/ # [Manual] Input images directory
├── models/ # [Manual] Model weights (SAM3) and optional BPE vocab
├── output/ # [Manual] Results directory
├── sam3/ # SAM3 library (see Installation: install from facebookresearch/sam3)
├── sam3_service/ # SAM3 HTTP service (optional, for multi-process deployment)
├── scripts/
│ ├── setup_sam3.sh # Install SAM3 lib and copy BPE to models/
│ ├── setup_rmbg.py # Download RMBG model from ModelScope to models/rmbg/
│ └── merge_xml.py # XML merge utilities
├── main.py # CLI entry (modular pipeline)
├── server_pa.py # FastAPI backend server
└── requirements.txt # Python dependencies
Follow these steps to set up the project locally.
- Python 3.10+
- CUDA-capable GPU (Highly recommended)
```bash
git clone https://github.com/BIT-DataLab/Edit-Banana.git
cd Edit-Banana
```

After cloning, you must manually create the following resource directories (ignored by Git):
```bash
# Create input/output directories
mkdir -p input
mkdir -p output
mkdir -p sam3_output
```

The following large files are not included in this repository. Download them yourself and place them in the paths below. The repo uses `.gitignore` to exclude `models/`, `sam3_src/`, etc. Do not commit these files to Git.
| Asset | Description | Target path | How to get |
|---|---|---|---|
| SAM3 weights | Segmentation checkpoint (must be `.pt` format) | `models/sam3_ms/sam3.pt` or as in config | ModelScope (recommended) or Hugging Face |
| BPE vocab | SAM3 text encoder vocabulary | `models/bpe_simple_vocab_16e6.txt.gz` | Copied when you run `scripts/setup_sam3.sh` from the cloned `sam3_src`; or from facebookresearch/sam3 repo assets |
| RMBG model (optional) | Background removal for icons/arrows | `models/rmbg/model.onnx` | `pip install modelscope && python scripts/setup_rmbg.py`, or download from ModelScope RMBG-2.0 |
See sections 5. Install SAM3 library, 6. Download model weights, and Optional — RMBG below for step-by-step instructions.
Install PyTorch with CUDA support (recommended) or CPU-only. Example for CUDA 11.8:
```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
```

For other CUDA versions or CPU builds, see pytorch.org.
This project uses the SAM3 Python API; the code is not in this repo. Detailed steps: docs/SETUP_SAM3.md.
Quick path (from repo root, with venv activated):
```bash
bash scripts/setup_sam3.sh
```

This clones facebookresearch/sam3 into `sam3_src`, runs `pip install -e sam3_src`, and copies the BPE vocab to `models/bpe_simple_vocab_16e6.txt.gz`.
Verify: python -c "from sam3.model_builder import build_sam3_image_model; print('OK')"
Get the SAM 3 checkpoint and place it under models/:
- ModelScope (recommended, no access request): modelscope.cn/models/facebook/sam3
- Hugging Face: facebook/sam3 — request access first.
See docs/SETUP_SAM3.md for download commands and config.yaml setup.
Backend (required):
```bash
pip install -r requirements.txt
```

Tesseract (the default text OCR; install one of Tesseract or PaddleOCR): install the Tesseract engine on your system. Example on Ubuntu:
```bash
sudo apt install tesseract-ocr tesseract-ocr-chi-sim
```

If you use PaddleOCR (`ocr.engine: "paddleocr"`), Tesseract is optional but recommended as a fallback.
Optional — PaddleOCR (better for mixed Chinese/English text): use PaddlePaddle 3.2.x + PaddleOCR 3.x (3.2.2 recommended; 3.3.0+ has a CPU oneDNN bug and will auto-fallback to Tesseract):

```bash
pip uninstall paddleocr paddlepaddle paddlepaddle-gpu paddlex -y
pip install paddlepaddle==3.2.2 paddleocr  # CPU; avoids 3.3.0 oneDNN bug
# GPU: pip install paddlepaddle-gpu==3.2.2 paddleocr
```

Then in `config/config.yaml` set `ocr.engine: "paddleocr"`.
Optional — formula recognition (Pix2Text): for LaTeX formula recognition, install:

```bash
pip install pix2text
# GPU: pip install onnxruntime-gpu
```

Optional — RMBG (background removal for icons/arrows): for IconPictureProcessor:

- Install the runtime: `pip install onnxruntime` (or `onnxruntime-gpu`).
- Download the RMBG-2.0 model to `models/rmbg/model.onnx`:

  ```bash
  pip install modelscope
  python scripts/setup_rmbg.py
  ```

  Or manually: download `model.onnx` from ModelScope RMBG-2.0 into `models/rmbg/`.
- Config file (required before first run):

  ```bash
  cp config/config.yaml.example config/config.yaml
  ```

  Edit `config/config.yaml`: set `sam3.checkpoint_path` and `sam3.bpe_path` to your `models/` paths. Optionally set `ocr.engine: "paddleocr"` to use PaddleOCR for text.
- Environment variables (optional): create a `.env` file in the project root if you use API keys or custom endpoints.
Recommended versions
| Component | Version | Notes |
|---|---|---|
| Python | 3.10+ | Must be compatible with PyTorch and Paddle |
| PyTorch | 2.x + CUDA to match GPU | Newer GPUs (e.g. Blackwell sm_120) may need cu128; or set `sam3.device: "cpu"` |
| SAM3 weights | `sam3.pt` (not safetensors) | Set `config.sam3.checkpoint_path` to e.g. `models/sam3_ms/sam3.pt` |
| PaddleOCR | PaddlePaddle 3.2.2 + PaddleOCR 3.x | 3.3.0+ has a CPU oneDNN bug; the pipeline will auto-fallback to Tesseract |
| Tesseract | System install | Ubuntu: `sudo apt install tesseract-ocr tesseract-ocr-chi-sim` |
| RMBG | onnxruntime + `models/rmbg/model.onnx` | Optional; use `scripts/setup_rmbg.py` or ModelScope to download |
Before first run
- Copy `config/config.yaml.example` to `config/config.yaml` and set `sam3.checkpoint_path`, `sam3.bpe_path`
- Place SAM3 weights (e.g. `models/sam3_ms/sam3.pt`) and BPE vocab (`models/bpe_simple_vocab_16e6.txt.gz`) under `models/`
- Run `scripts/setup_sam3.sh` or follow docs/SETUP_SAM3.md to install the SAM3 library
- Install Tesseract system-wide, or install PaddleOCR and set `ocr.engine: "paddleocr"`
Common issues
- "no kernel image is available for execution on the device" — the GPU architecture does not match the PyTorch CUDA build. Set `sam3.device: "cpu"` in `config.yaml`, or upgrade PyTorch to a matching CUDA build (e.g. cu128).
- "Model file not found at .../models/rmbg/model.onnx" — RMBG is optional; safe to ignore if you do not need background removal. To enable it: `pip install modelscope && python scripts/setup_rmbg.py`, or download from ModelScope RMBG-2.0 into `models/rmbg/model.onnx`.
- "PaddleOCR inference failed…fallback to Tesseract" — Paddle/oneDNN incompatibility. Use `paddlepaddle==3.2.2` + `paddleocr`, or set `ocr.engine: "tesseract"`.
- "Please install PaddleOCR" / "pytesseract not installed" — install the corresponding OCR stack; for Tesseract only, install the system `tesseract-ocr` package and `pip install pytesseract`.
- "Checking connectivity to the model hosters" hangs — `main.py` sets `PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK=True` by default; if it still appears, run `export PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK=True` before starting.
Supports image files (PNG, JPG, BMP, TIFF, WebP). To process a single image:
```bash
python main.py -i input/test_diagram.png
```

The output XML will be saved in the `output/` directory. For batch processing, put images in `input/` and run `python main.py` without `-i`.
- One-time setup

  ```bash
  git clone https://github.com/BIT-DataLab/Edit-Banana.git && cd Edit-Banana
  python3 -m venv .venv && source .venv/bin/activate  # Linux/macOS; Windows: .venv\Scripts\activate
  pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118  # or CPU build
  pip install -r requirements.txt
  sudo apt install tesseract-ocr tesseract-ocr-chi-sim  # OCR (or equivalent on your OS)
  ```

  Install the SAM3 library (see Install SAM3 library) and download the model weights + BPE vocab. Then:

  ```bash
  mkdir -p input output
  cp config/config.yaml.example config/config.yaml
  # Edit config/config.yaml: set sam3.checkpoint_path and sam3.bpe_path to your models/ paths
  ```

- Test with the CLI

  ```bash
  # Put a diagram image in input/, e.g. input/test.png
  python main.py -i input/test.png
  # Output appears under output/<image_stem>/ (DrawIO XML and intermediates)
  ```

- Optional: test the web API

  ```bash
  python server_pa.py
  # In another terminal:
  curl -X POST http://localhost:8000/convert -F "file=@input/test.png"
  # Or open http://localhost:8000/docs and use the /convert endpoint with a file upload
  ```
Customize the pipeline behavior in config/config.yaml:
- sam3: Adjust score thresholds, NMS (Non-Maximum Suppression) thresholds, max iteration loops.
- paths: Set input/output directories.
- dominant_color: Fine-tune color extraction sensitivity.
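Options like `sam3.checkpoint_path` and `ocr.engine` in this README are dotted paths into the nested YAML structure of `config/config.yaml`. A small sketch of how such keys resolve against the loaded config (the dictionary values and the `get_option` helper are illustrative, not the project's actual loader):

```python
from functools import reduce

# Illustrative stand-in for the parsed config/config.yaml; key names follow
# this README, values are examples only.
config = {
    "sam3": {
        "checkpoint_path": "models/sam3_ms/sam3.pt",
        "bpe_path": "models/bpe_simple_vocab_16e6.txt.gz",
        "device": "cuda",
    },
    "paths": {"input_dir": "input", "output_dir": "output"},
    "ocr": {"engine": "tesseract"},
}

def get_option(cfg: dict, dotted_key: str, default=None):
    """Resolve a dotted key like 'sam3.checkpoint_path' in a nested dict."""
    try:
        return reduce(lambda d, k: d[k], dotted_key.split("."), cfg)
    except (KeyError, TypeError):
        return default

print(get_option(config, "sam3.checkpoint_path"))  # models/sam3_ms/sam3.pt
print(get_option(config, "ocr.engine"))            # tesseract
```

So setting `ocr.engine: "paddleocr"` in the YAML simply changes the nested value under `ocr:` → `engine:`.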
| Feature Module | Status | Description |
|---|---|---|
| Core Conversion Pipeline | ✅ Completed | Full pipeline of segmentation, reconstruction and OCR |
| Intelligent Arrow Connection | | Automatically associate arrows with target shapes |
| DrawIO Template Adaptation | 📍 Planned | Support custom template import |
| Batch Export Optimization | 📍 Planned | Batch export to DrawIO files (.drawio) |
| Local LLM Adaptation | 📍 Planned | Support local VLM deployment, independent of APIs |
Contributions of all kinds are welcome (code submissions, bug reports, feature suggestions):
- Fork this repository
- Create a feature branch (`git checkout -b feature/xxx`)
- Commit your changes (`git commit -m 'feat: add xxx'`)
- Push to the branch (`git push origin feature/xxx`)
- Open a Pull Request
- Bug Reports: Issues
- Feature Suggestions: Discussions
Thanks to all the developers who have contributed to the project and driven its evolution!
| Name/ID | Email |
|---|---|
| Chai Chengliang | ccl@bit.edu.cn |
| Zhang Chi | zc315@bit.edu.cn |
| Deng Qiyan | |
| Rao Sijing | |
| Yi Xiangjian | |
| Li Jianhui | |
| Shen Chaoyuan | |
| Zhang Junkai | |
| Han Junyi | |
| You Zirui | |
| Xu Haochen | |
| An Minghao | |
| Yu Mingjie | |
| Yu Xinjiang | |
| Chen Zhuofan | |
| Li Xiangkun |
This project is open-source under the Apache License 2.0, allowing commercial use and secondary development (with copyright notice retained).
🌟 If this project helps you, please star it to show your support!








