Code duplication analyzer and refactoring planner for LLMs.
- 🤖 LLM usage: $7.5000 (65 commits)
- 👤 Human dev: ~$1688 (16.9h @ $100/h, 30min dedup)
Generated on 2026-04-16 using openrouter/qwen/qwen3-coder-next
reDUP scans codebases for duplicated functions, blocks, and structural patterns — then builds a prioritized refactoring map that LLMs can consume to eliminate redundancy systematically.
- Exact duplicate detection via SHA-256 block hashing
- Structural clone detection — same AST shape, different variable names
- LSH near-duplicate detection for large code blocks (>50 lines)
- Multi-language support — 35+ languages via tree-sitter (Python, JavaScript, TypeScript, Go, Rust, Java, C/C++, C#, Ruby, PHP, Bash, SQL, HTML, CSS, Lua, Scala, Kotlin, Swift, Objective-C, JSON, YAML, TOML, XML, Markdown, GraphQL, Dockerfile, Makefile, Nginx, Vim, Svelte, Vue, and more)
- Parallel scanning for large projects (2x+ performance improvement)
- Fuzzy near-duplicate matching via SequenceMatcher / rapidfuzz
- Function-level analysis using Python AST and tree-sitter extraction
- Impact scoring — prioritizes duplicates by
saved_lines × similarity - Refactoring planner — generates concrete extract/inline suggestions
- Multiple output formats: JSON, YAML, TOON, Markdown
- Configuration system — TOML files and environment variables
- CLI commands:
scan,compare,diff,check,config,info - Cross-project comparison — detect shared code between projects with merge/extract recommendations
- CI integration with configurable quality gates
- Clean output — no syntax warnings from external libraries
Full MCP (Model Context Protocol) server for AI assistant integration:
# Start MCP server
redup-mcp
# Or HTTP mode
redup-mcp --transport http --port 8000Available Tools:
analyze_project— Full duplication analysisfind_duplicates— Quick duplicate detectioncheck_project— Quality gate checkcompare_projects— Cross-project comparisonsuggest_refactoring— AI-powered refactoring suggestionsproject_info— Project metadata
Cross-language duplicate detection across all 35+ supported languages:
# Detect similar code across languages
redup scan . --fuzzy --fuzzy-threshold 0.65Cross-Language Matching:
- JavaScript ↔ Python functions: ~65% similarity
- Docker ↔ YAML configs: ~40% similarity
- Auth patterns across languages: ~70% similarity
Supported Patterns:
- Functions, classes, API endpoints
- Database queries, web components
- Auth/validation, error handling, logging
- Configuration, infrastructure code
Refactored tree-sitter extraction with clean, modular architecture:
ts_extractor/
├── extractors/ # Modular per-language extractors
│ ├── c_family.py # C, C++, C#, Objective-C
│ ├── go.py # Go
│ ├── java.py # Java, Scala, Kotlin
│ ├── markup.py # HTML, XML, Svelte, Vue
│ ├── web.py # JavaScript, TypeScript
│ └── ...
├── dispatcher.py # Smart language routing
├── config.py # Language registry
└── main.py # Unified API
Benefits:
- Easier to add new languages
- Better testability
- Cleaner separation of concerns
- 35+ languages supported
Cross-language fuzzy matching for detecting similar code patterns across all 35+ supported languages:
# Detect similar patterns across different languages
redup scan . --fuzzy --ext .py,.js,.ts
# Cross-project comparison with fuzzy matching
redup compare ./project-a ./project-b --fuzzy --threshold 0.65Features:
- Detects similar functions, API endpoints, validation logic across languages (e.g., JS ↔ Python)
- Pattern recognition: authentication, error handling, database queries, web components
- Language-agnostic signature generation with identifier normalization
- Complexity scoring (0.0-1.0) for each detected pattern
Example patterns detected:
- Express.js route handler ↔ Flask endpoint (70% similarity)
- Docker Compose service ↔ Kubernetes deployment (40% similarity)
- Auth middleware patterns across frameworks
The tree-sitter multi-language extractor has been refactored from a 782-line god module into a clean package:
redup/core/ts_extractor/
├── extractors/
│ ├── web.py # JavaScript/TypeScript
│ ├── c_family.py # C/C++
│ ├── dotnet.py # C#
│ ├── ruby.py # Ruby
│ ├── php.py # PHP
│ └── ... # 10+ language-specific modules
Benefits:
- Better maintainability (avg 100 lines per module vs 782)
- Easier to add new language extractors
- Shared base utilities for common operations
- Full backward compatibility maintained
The TOON format now includes actionable sections for practical refactoring:
- HOTSPOTS — Top 7 files with most duplicated lines (where to focus effort)
- QUICK_WINS — Low-risk, high-savings suggestions (do first)
- DEPENDENCY_RISK — Duplicates spanning multiple packages (cross-module risk)
- EFFORT_ESTIMATE — Time estimates per task with difficulty (easy/medium/hard)
Generate AI-assisted refactoring TODO lists from cross-project comparisons:
redup compare ./project-a ./project-b --refactor-plan --env .env --output report.json- Uses
litellmfor flexible LLM provider support - Compact metadata-only prompts for efficiency
- Structured JSON output with prioritized tasks
- Token usage tracking
Cross-project comparison reports are now more compact and human-readable:
- Relative file paths instead of absolute
- Matches deduplicated by function pair
- Communities with compact member dicts
- Filtered trivial entries to reduce noise
- ~60% smaller JSON size
pip install redupWith optional dependencies:
pip install redup[all] # Everything
pip install redup[fuzzy] # rapidfuzz for better similarity matching
pip install redup[ast] # tree-sitter for multi-language AST
pip install redup[lsh] # datasketch for LSH near-duplicate detection
pip install redup[compare] # networkx for cross-project community detection
pip install redup[llm] # litellm for LLM-powered refactoring plans# Scan current directory, output TOON to stdout
redup scan .
# Scan with JSON output saved to file
redup scan ./src --format json --output ./reports/
# Parallel scanning for large projects
redup scan . --parallel --max-workers 4
# Multi-language scanning with 35+ supported languages
redup scan . --ext ".py,.js,.ts,.go,.rs,.java,.rb,.php,.html,.css,.sql,.lua,.scala,.kt,.swift,.m,.json,.yaml,.toml,.xml,.md,.graphql,.dockerfile,.svelte,.vue"
# CI gate with thresholds
redup check . --max-groups 10 --max-lines 100
# Compare two scans
redup diff before.json after.json
# Cross-project comparison (merge vs extract decision)
redup compare ./project-a ./project-b --threshold 0.75
# With LLM-powered refactoring plan (requires litellm + .env with API keys)
redup compare ./project-a ./project-b --refactor-plan --env .env --output comparison.json
# Specify custom LLM model
redup compare ./project-a ./project-b --refactor-plan --llm-model openrouter/anthropic/claude-3.5-sonnet
# Initialize configuration
redup config --init# Scan with all formats
redup scan . --format all --output ./redup_output/
# Only function-level duplicates (faster)
redup scan . --functions-only
# Custom thresholds
redup scan . --min-lines 5 --min-sim 0.9
# Show installed optional dependencies
redup info
# Export duplications as tasks to TODO.md (requires: pip install redup[tasks])
redup tasks ./my-project
# Export with GitHub sync
redup tasks ./my-project --backend github --milestone "Sprint 1"
# Export with GitLab sync and custom output
redup tasks ./my-project -b gitlab -o refactoring-tasks.md
# Preview tasks without creating files
redup tasks ./my-project --dry-runWhen you install redup[tasks], you can export duplication findings as
actionable tasks in TODO.md format with synchronization to GitHub, GitLab,
or Jira:
# Install with planfile support
pip install redup[tasks]
# Generate TODO.md from duplications
redup tasks ./my-project --output TODO.md
# The generated TODO.md includes:
# - Priority-based task organization (critical/major/minor)
# - Difficulty estimation (easy/medium/hard)
# - Line savings potential
# - Detailed refactoring suggestions
# - Planfile export configurationExample TODO.md output:
# TODO - Duplication Refactoring Tasks
## CRITICAL (3 tasks)
- [ ] **Refactor: process_file (4x duplication)** 🔴
Priority: critical | Savings: 124L
<details>
Extract function to shared utility module.
Files: src/core/scanner.py, src/core/planner.py, ...
</details>
## MAJOR (5 tasks)
- [ ] **Refactor: validate_input (3x duplication)** 🟡
Priority: major | Savings: 45L
...Create a redup.toml file:
[scan]
extensions = ".py,.js,.ts,.go,.rs,.java,.rb,.php,.html,.css,.sql,.lua,.scala,.kt,.swift,.m,.json,.yaml,.toml,.xml,.md,.graphql,.dockerfile,.svelte,.vue"
min_lines = 3
min_similarity = 0.85
include_tests = false
[lsh]
enabled = true
min_lines = 50
threshold = 0.8
[check]
max_groups = 10
max_lines = 100
[output]
format = "toon"
output = "redup_output"
[reporting]
include_snippets = true
generate_suggestions = trueOr use [tool.redup] in pyproject.toml. Environment variables with REDUP_ prefix override file settings.
from pathlib import Path
from redup import ScanConfig, analyze
from redup.reporters.toon_reporter import to_toon
from redup.reporters.json_reporter import to_json
config = ScanConfig(
root=Path("./my_project"),
extensions=[".py", ".js", ".ts", ".go", ".rs", ".java", ".rb", ".php", ".html", ".css"],
min_block_lines=3,
min_similarity=0.85,
)
result = analyze(config=config, function_level_only=True)
print(f"Found {result.total_groups} duplicate groups")
print(f"Lines recoverable: {result.total_saved_lines}")
# For LLM consumption
print(to_toon(result))
# For tooling / CI
Path("duplication.json").write_text(to_json(result))# redup/duplication | 15 groups | 86f 10453L | 2026-04-16
SUMMARY:
files_scanned: 86
total_lines: 10453
dup_groups: 15
dup_fragments: 36
saved_lines: 217
scan_ms: 3620
HOTSPOTS[7] (files with most duplication):
src/redup/core/ts_extractor.py dup=74L groups=4 frags=11 (0.7%)
src/redup/core/scanner_utils.py dup=70L groups=3 frags=3 (0.7%)
src/redup/core/scanner_loader.py dup=52L groups=1 frags=1 (0.5%)
DUPLICATES[15] (ranked by impact):
[E0001] ! EXAC _preload_files L=52 N=2 saved=52 sim=1.00
src/redup/core/scanner_loader.py:9-60 (_preload_files)
src/redup/core/scanner_utils.py:53-104 (_preload_files)
REFACTOR[15] (ranked by priority):
[1] ◐ extract_module → src/redup/core/utils/_preload_files.py
WHY: 2 occurrences of 52-line block across 2 files — saves 52 lines
FILES: src/redup/core/scanner_loader.py, src/redup/core/scanner_utils.py
QUICK_WINS[8] (low risk, high savings — do first):
[3] extract_function saved=26L → src/redup/core/utils/find_exact_duplicates_lazy.py
FILES: lazy_grouper.py
[4] extract_function saved=21L → src/redup/core/utils/_extract_functions_go.py
FILES: ts_extractor.py
DEPENDENCY_RISK[3] (duplicates spanning multiple packages):
validate_input packages=2 files=2
api/routes/users.py
services/auth/validate.py
EFFORT_ESTIMATE (total ≈ 8.7h):
hard _preload_files saved=52L ~156min
hard __init__ saved=36L ~108min
medium find_exact_duplicates_lazy saved=26L ~52min
easy _is_test_file saved=12L ~24min
METRICS-TARGET:
dup_groups: 15 → 0
saved_lines: 217 lines recoverable
{
"summary": {
"total_groups": 3,
"total_saved_lines": 84
},
"groups": [
{
"id": "E0001",
"type": "exact",
"normalized_name": "calculate_tax",
"fragments": [
{"file": "billing.py", "line_start": 1, "line_end": 8},
{"file": "shipping.py", "line_start": 1, "line_end": 8}
],
"saved_lines_potential": 16
}
],
"refactor_suggestions": [
{
"priority": 1,
"action": "extract_function",
"new_module": "utils/calculate_tax.py",
"risk_level": "low"
}
]
}The redup compare command analyzes two separate projects to detect shared code and recommends a refactoring strategy:
- Merge projects — if >60% code overlap
- Extract shared library — if 5-60% overlap with well-defined clusters
- Keep separate — if <5% overlap
# Basic comparison
redup compare ./project-a ./project-b --threshold 0.75
# With semantic similarity (slower, more accurate)
redup compare ./project-a ./project-b --semantic --threshold 0.70
# Multi-language projects
redup compare ./backend ./frontend --ext ".py,.js,.ts" --threshold 0.80
# Skip community detection (faster, no networkx required)
redup compare ./a ./b --no-community
# Generate LLM-powered refactoring plan (requires redup[llm])
redup compare ./a ./b --refactor-plan --env .env --output plan.jsonComparing project-a ↔ project-b (threshold=0.75)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Cross-Project Comparison ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Metric │ Value │
├─────────────────────────┼────────────────────────────┤
│ Project A files │ 42 │
│ Project B files │ 38 │
│ Project A lines │ 8500 │
│ Project B lines │ 7200 │
│ Cross matches │ 15 │
│ Shared LOC (potential) │ 1200 │
└─────────────────────────┴────────────────────────────┘
Recommendation: extract_shared_lib
15% overlap (1200 shared lines, 5 clusters). Extract to shared library.
Confidence: 80%
Top Communities (shared code candidates):
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━┳━━━━━━━━━━┓
┃ ID ┃ Name ┃ Similarity ┃ LOC ┃ Members ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━╇━━━━━━━━━━┩
│ 0 │ validate_input │ 0.89 │ 180 │ 5 │
│ 1 │ parse_config │ 0.82 │ 140 │ 4 │
│ 2 │ format_response │ 0.76 │ 100 │ 3 │
└────┴──────────────────────┴────────────┴─────┴──────────┘
{
"project_a": "./project-a",
"project_b": "./project-b",
"stats": {
"a": {"files": 42, "lines": 8500},
"b": {"files": 38, "lines": 7200}
},
"total_matches": 15,
"shared_loc_potential": 1200,
"recommendation": {
"decision": "extract_shared_lib",
"rationale": "15% overlap (1200 shared lines, 5 clusters). Extract to shared library.",
"overlap_pct": 0.1523,
"shared_loc": 1200,
"confidence": 0.8
},
"communities": [
{
"name": "validate_input",
"similarity": 0.89,
"loc": 180,
"members": [
{"project": "A", "file": "api/validators.py", "function": "validate_input"},
{"project": "B", "file": "utils/validation.py", "function": "validate_input"}
]
}
],
"matches": [...]
}The comparison uses a 3-tier similarity detection:
- Structural hash — exact AST matches (fast, O(n+m))
- LSH (Locality Sensitive Hashing) — near-duplicates via MinHash
- Semantic similarity — CodeBERT embeddings (optional, slowest)
Matches are deduplicated by (function_a, function_b, file_a, file_b) with the highest similarity score retained.
Requires networkx (pip install redup[compare]).
Uses greedy modularity communities on a similarity graph where:
- Nodes = functions from both projects
- Edges = similarity score (filtered by
--threshold) - Communities = clusters of mutually similar functions
Each community gets a generated name based on longest common prefix of its member functions (e.g., validate_* → validate_input).
src/redup/
├── __init__.py # Public API
├── __main__.py # python -m redup
├── mcp_server.py # MCP server entry point (re-exports from mcp package)
├── mcp/ # MCP server package
│ ├── __init__.py # Public MCP API
│ ├── handlers.py # Tool handlers
│ ├── schemas.py # JSON-RPC schemas
│ ├── server.py # JSON-RPC server core
│ └── utils.py # Shared utilities
├── core/
│ ├── models.py # Pydantic data models
│ ├── scanner.py # File discovery + block extraction
│ ├── scanner/ # Scanner package
│ │ ├── __init__.py # Public scanner API
│ │ ├── cache.py # Memory cache
│ │ ├── filters.py # File filtering
│ │ ├── loader.py # File preloading
│ │ └── types.py # Scanner types
│ ├── hasher.py # SHA-256 / structural fingerprinting
│ ├── matcher.py # Fuzzy similarity comparison
│ ├── planner.py # Refactoring suggestion generator
│ ├── pipeline.py # Legacy: re-exports from pipeline package
│ └── pipeline/ # Pipeline package (new)
│ ├── __init__.py # analyze(), analyze_optimized(), analyze_parallel()
│ ├── phases.py # scan_phase(), process_blocks()
│ ├── duplicate_finder.py # Duplicate finding phases
│ └── groups.py # Group creation, deduplication
│ └── ts_extractor/ # Tree-sitter extraction (35+ languages)
│ ├── __init__.py # Public API
│ ├── main.py # Core extraction API
│ ├── dispatcher.py # Language routing
│ ├── config.py # Language registry
│ └── extractors/ # Per-language extractors
├── reporters/
│ ├── json_reporter.py # JSON output
│ ├── yaml_reporter.py # YAML output
│ └── toon_reporter.py # TOON output (LLM-optimized)
└── cli_app/
└── main.py # Typer CLI
1. SCAN Walk project, read files, extract function-level + sliding-window blocks
2. HASH Generate exact (SHA-256) and structural (normalized AST) fingerprints
3. GROUP Bucket by hash, keep only groups with 2+ blocks from different locations
4. MATCH Verify candidates with fuzzy similarity (SequenceMatcher / rapidfuzz)
5. DEDUP Remove overlapping groups (keep highest-impact)
6. PLAN Generate prioritized refactoring suggestions with risk assessment
7. REPORT Export to JSON / YAML / TOON
Major internal restructuring for better maintainability and extensibility:
The MCP server has been split from a 675-line monolith into a clean package:
redup/mcp/
├── __init__.py # Public API
├── handlers.py # 8 tool handlers
├── schemas.py # JSON-RPC schemas
├── server.py # Server core
└── utils.py # Utilities
- 82% code reduction in main file
- Backward compatible:
mcp_server.pyre-exports all APIs - Better testability: Isolated handlers can be tested independently
The analysis pipeline (714 lines) now lives in a modular package:
redup/core/pipeline/
├── __init__.py # analyze(), analyze_optimized(), analyze_parallel()
├── phases.py # scan_phase(), process_blocks()
├── duplicate_finder.py # find_exact_groups(), find_structural_groups(), etc.
└── groups.py # deduplicate_groups(), blocks_to_group(), etc.
- 66% reduction in main orchestrator file
- Phases can be used independently for custom workflows
- Cleaner separation of concerns
The scanner has been refactored with extracted helpers:
_init_strategy()- Strategy initialization_process_single_file()- Per-file processing_extract_blocks_for_file()- Block extraction- Reduced CC and fan-out in main
scan_project()function
- Reduced cyclomatic complexity from CC̄=4.2 to CC̄=3.5
- Eliminated all critical functions (CC > 10): 2 → 0
- Achieved HEALTHY status with no structural issues
- Dispatch pattern implementation for AST node processing
- Modular TOON reporter split into 5 focused functions
- CLI refactoring with helper functions for better maintainability
_process_ast_node: CC=14 → CC=6 (dispatch dict pattern)to_toon: CC=12 → CC=8 (5 helper functions)- CLI
scan(): fan-out=18 → ≤10 (4 helper functions) - Code quality: 0 high-complexity functions
- Test coverage: 64/64 tests passing (100%)
- Health status: ✅ HEALTHY (no critical issues)
- Cyclomatic complexity: CC̄=3.5 (target ≤ 3.0 achieved)
- Maximum CC: 9 (target ≤ 10 achieved)
- Code maintainability: Significantly improved
- Duplication: Minimal (2 groups, 6 lines - acceptable patterns)
- Dispatch tables for extensible AST processing
- Single responsibility functions throughout codebase
- Clean separation of concerns in CLI pipeline
- Type safety improvements with proper annotations
- Error handling enhanced for edge cases
reDUP is part of the wronai developer toolchain:
- code2llm — static analysis engine (health diagnostics, complexity)
- reDUP — deep duplication analysis and refactoring planning
- code2docs — automatic documentation generation
- vallm — validation of LLM-generated code proposals
code2llmanalyzes the project →.toondiagnosticsredupfinds duplicates →duplication.toon.yaml- Feed both to an LLM for targeted refactoring
vallmvalidates the LLM's proposals before merging
- LLM-ready: TOON format optimized for LLM consumption
- Actionable: Generates concrete refactoring suggestions
- Prioritized: Ranks duplicates by impact and risk
- Integrated: Works seamlessly with wronai toolchain
- Fast: Scans 1000+ lines in < 1 second
- Clean: No syntax warnings, professional output
git clone https://github.com/semcod/redup.git
cd redup
pip install -e ".[dev]"
pytestLicensed under Apache-2.0.
Tom Sapletta