semcod/redup

reDUP

Code duplication analyzer and refactoring planner for LLMs.

AI Cost Tracking

  • 🤖 LLM usage: $7.5000 (65 commits)
  • 👤 Human dev: ~$1688 (16.9h @ $100/h, 30min dedup)

Generated on 2026-04-16 using openrouter/qwen/qwen3-coder-next


reDUP scans codebases for duplicated functions, blocks, and structural patterns — then builds a prioritized refactoring map that LLMs can consume to eliminate redundancy systematically.

Features

  • Exact duplicate detection via SHA-256 block hashing
  • Structural clone detection — same AST shape, different variable names
  • LSH near-duplicate detection for large code blocks (>50 lines)
  • Multi-language support — 35+ languages via tree-sitter (Python, JavaScript, TypeScript, Go, Rust, Java, C/C++, C#, Ruby, PHP, Bash, SQL, HTML, CSS, Lua, Scala, Kotlin, Swift, Objective-C, JSON, YAML, TOML, XML, Markdown, GraphQL, Dockerfile, Makefile, Nginx, Vim, Svelte, Vue, and more)
  • Parallel scanning for large projects (2x+ performance improvement)
  • Fuzzy near-duplicate matching via SequenceMatcher / rapidfuzz
  • Function-level analysis using Python AST and tree-sitter extraction
  • Impact scoring — prioritizes duplicates by saved_lines × similarity
  • Refactoring planner — generates concrete extract/inline suggestions
  • Multiple output formats: JSON, YAML, TOON, Markdown
  • Configuration system — TOML files and environment variables
  • CLI commands: scan, compare, diff, check, config, info, and tasks (optional, requires redup[tasks])
  • Cross-project comparison — detect shared code between projects with merge/extract recommendations
  • CI integration with configurable quality gates
  • Clean output — no syntax warnings from external libraries
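
To make "same AST shape, different variable names" concrete, here is a minimal sketch of a structural fingerprint built with Python's standard ast module. It is an illustration of the idea, not reDUP's actual hasher: the class and function names are hypothetical, and the real implementation may normalize more (or less) than identifiers.

```python
import ast
import hashlib


class _Normalizer(ast.NodeTransformer):
    """Rename identifiers to positional placeholders so only the shape remains."""

    def __init__(self) -> None:
        self._names: dict[str, str] = {}

    def _placeholder(self, name: str) -> str:
        # First distinct name becomes v0, the second v1, and so on.
        return self._names.setdefault(name, f"v{len(self._names)}")

    def visit_Name(self, node: ast.Name) -> ast.AST:
        node.id = self._placeholder(node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = self._placeholder(node.arg)
        return node

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        node.name = "f"  # the function's own name is not part of its shape
        self.generic_visit(node)
        return node


def structural_fingerprint(source: str) -> str:
    """Hash the normalized AST dump (hypothetical helper, not reDUP's API)."""
    tree = _Normalizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


a = "def total(items, rate):\n    return sum(items) * rate\n"
b = "def cost(prices, tax):\n    return sum(prices) * tax\n"
print(structural_fingerprint(a) == structural_fingerprint(b))
```

Two functions that differ only in naming produce the same digest, so they land in the same bucket without any pairwise comparison.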

New Features (v0.4.20)

🤖 MCP Server

Full MCP (Model Context Protocol) server for AI assistant integration:

# Start MCP server
redup-mcp

# Or HTTP mode
redup-mcp --transport http --port 8000

Available Tools:

  • analyze_project — Full duplication analysis
  • find_duplicates — Quick duplicate detection
  • check_project — Quality gate check
  • compare_projects — Cross-project comparison
  • suggest_refactoring — AI-powered refactoring suggestions
  • project_info — Project metadata
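
For orientation, an MCP client invokes a tool with a JSON-RPC 2.0 request whose method is tools/call. The sketch below only builds such a request. The tool name comes from the list above, but the argument key "path" is an assumption, since the tools' input schemas are not shown here.

```python
import json


def build_tool_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )


# "path" is a hypothetical argument name used for illustration only.
request = build_tool_call("find_duplicates", {"path": "./src"})
print(request)
```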

🌐 Universal Fuzzy Similarity Detection

Cross-language duplicate detection across all 35+ supported languages:

# Detect similar code across languages
redup scan . --fuzzy --fuzzy-threshold 0.65

Cross-Language Matching:

  • JavaScript ↔ Python functions: ~65% similarity
  • Docker ↔ YAML configs: ~40% similarity
  • Auth patterns across languages: ~70% similarity

Supported Patterns:

  • Functions, classes, API endpoints
  • Database queries, web components
  • Auth/validation, error handling, logging
  • Configuration, infrastructure code

🌳 Modular Tree-Sitter Extractor

Refactored tree-sitter extraction with clean, modular architecture:

ts_extractor/
├── extractors/          # Modular per-language extractors
│   ├── c_family.py      # C, C++, C#, Objective-C
│   ├── go.py            # Go
│   ├── java.py          # Java, Scala, Kotlin
│   ├── markup.py        # HTML, XML, Svelte, Vue
│   ├── web.py           # JavaScript, TypeScript
│   └── ...
├── dispatcher.py        # Smart language routing
├── config.py            # Language registry
└── main.py              # Unified API

Benefits:

  • Easier to add new languages
  • Better testability
  • Cleaner separation of concerns
  • 35+ languages supported

New Features (v0.5.0+)

🌐 Universal Fuzzy Similarity Detection

Cross-language fuzzy matching for detecting similar code patterns across all 35+ supported languages:

# Detect similar patterns across different languages
redup scan . --fuzzy --ext .py,.js,.ts

# Cross-project comparison with fuzzy matching
redup compare ./project-a ./project-b --fuzzy --threshold 0.65

Features:

  • Detects similar functions, API endpoints, validation logic across languages (e.g., JS ↔ Python)
  • Pattern recognition: authentication, error handling, database queries, web components
  • Language-agnostic signature generation with identifier normalization
  • Complexity scoring (0.0-1.0) for each detected pattern

Example patterns detected:

  • Express.js route handler ↔ Flask endpoint (70% similarity)
  • Docker Compose service ↔ Kubernetes deployment (40% similarity)
  • Auth middleware patterns across frameworks
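
The idea of a language-agnostic signature with identifier normalization can be sketched with the standard library alone. This is a simplified illustration, not reDUP's matcher: the keyword set, the token regex, and the function names are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher

# A tiny, hypothetical keyword set shared by the two example languages.
KEYWORDS = {"def", "function", "return", "if", "else", "for", "while", "const", "let", "var"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def signature(code: str) -> list[str]:
    """Reduce source text to a stream of coarse token classes."""
    tokens = []
    for tok in TOKEN.findall(code):
        if tok in KEYWORDS:
            tokens.append("KW")   # collapse keywords across languages
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")   # every identifier looks the same
        elif tok.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)    # keep punctuation and operators as-is
    return tokens


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] between the two normalized token streams."""
    return SequenceMatcher(None, signature(a), signature(b)).ratio()


python_fn = "def add(a, b):\n    return a + b"
js_fn = "function add(a, b) { return a + b; }"
print(round(similarity(python_fn, js_fn), 2))
```

Because identifiers and keywords are collapsed, the score reflects the shape of the code rather than naming, which is what makes a cross-language threshold such as 0.65 meaningful.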

🧩 Modular ts_extractor Architecture

The tree-sitter multi-language extractor has been refactored from a 782-line god module into a clean package:

redup/core/ts_extractor/
├── extractors/
│   ├── web.py        # JavaScript/TypeScript
│   ├── c_family.py   # C/C++
│   ├── dotnet.py     # C#
│   ├── ruby.py       # Ruby
│   ├── php.py        # PHP
│   └── ...           # 10+ language-specific modules

Benefits:

  • Better maintainability (avg 100 lines per module vs 782)
  • Easier to add new language extractors
  • Shared base utilities for common operations
  • Full backward compatibility maintained

🎯 Enriched TOON Reporter

The TOON format now includes actionable sections for practical refactoring:

  • HOTSPOTS — Top 7 files with most duplicated lines (where to focus effort)
  • QUICK_WINS — Low-risk, high-savings suggestions (do first)
  • DEPENDENCY_RISK — Duplicates spanning multiple packages (cross-module risk)
  • EFFORT_ESTIMATE — Time estimates per task with difficulty (easy/medium/hard)

🤖 LLM-Powered Refactoring Plans

Generate AI-assisted refactoring TODO lists from cross-project comparisons:

redup compare ./project-a ./project-b --refactor-plan --env .env --output report.json

  • Uses litellm for flexible LLM provider support
  • Compact metadata-only prompts for efficiency
  • Structured JSON output with prioritized tasks
  • Token usage tracking

📊 Simplified Compare Reports

Cross-project comparison reports are now more compact and human-readable:

  • Relative file paths instead of absolute
  • Matches deduplicated by function pair
  • Communities with compact member dicts
  • Filtered trivial entries to reduce noise
  • ~60% smaller JSON size

Installation

pip install redup

With optional dependencies:

pip install redup[all]       # Everything
pip install redup[fuzzy]     # rapidfuzz for better similarity matching
pip install redup[ast]       # tree-sitter for multi-language AST
pip install redup[lsh]       # datasketch for LSH near-duplicate detection
pip install redup[compare]   # networkx for cross-project community detection
pip install redup[llm]       # litellm for LLM-powered refactoring plans

Quick Start

CLI

# Scan current directory, output TOON to stdout
redup scan .

# Scan with JSON output saved to file
redup scan ./src --format json --output ./reports/

# Parallel scanning for large projects
redup scan . --parallel --max-workers 4

# Multi-language scanning with 35+ supported languages
redup scan . --ext ".py,.js,.ts,.go,.rs,.java,.rb,.php,.html,.css,.sql,.lua,.scala,.kt,.swift,.m,.json,.yaml,.toml,.xml,.md,.graphql,.dockerfile,.svelte,.vue"

# CI gate with thresholds
redup check . --max-groups 10 --max-lines 100

# Compare two scans
redup diff before.json after.json

# Cross-project comparison (merge vs extract decision)
redup compare ./project-a ./project-b --threshold 0.75

# With LLM-powered refactoring plan (requires litellm + .env with API keys)
redup compare ./project-a ./project-b --refactor-plan --env .env --output comparison.json

# Specify custom LLM model
redup compare ./project-a ./project-b --refactor-plan --llm-model openrouter/anthropic/claude-3.5-sonnet

# Initialize configuration
redup config --init

# Scan with all formats
redup scan . --format all --output ./redup_output/

# Only function-level duplicates (faster)
redup scan . --functions-only

# Custom thresholds
redup scan . --min-lines 5 --min-sim 0.9

# Show installed optional dependencies
redup info

# Export duplications as tasks to TODO.md (requires: pip install redup[tasks])
redup tasks ./my-project

# Export with GitHub sync
redup tasks ./my-project --backend github --milestone "Sprint 1"

# Export with GitLab sync and custom output
redup tasks ./my-project -b gitlab -o refactoring-tasks.md

# Preview tasks without creating files
redup tasks ./my-project --dry-run

Task Management with Planfile (Optional)

When you install redup[tasks], you can export duplication findings as actionable tasks in TODO.md format with synchronization to GitHub, GitLab, or Jira:

# Install with planfile support
pip install redup[tasks]

# Generate TODO.md from duplications
redup tasks ./my-project --output TODO.md

# The generated TODO.md includes:
# - Priority-based task organization (critical/major/minor)
# - Difficulty estimation (easy/medium/hard)
# - Line savings potential
# - Detailed refactoring suggestions
# - Planfile export configuration

Example TODO.md output:

# TODO - Duplication Refactoring Tasks

## CRITICAL (3 tasks)
- [ ] **Refactor: process_file (4x duplication)** 🔴
   Priority: critical | Savings: 124L
   <details>
   Extract function to shared utility module.
   Files: src/core/scanner.py, src/core/planner.py, ...
   </details>

## MAJOR (5 tasks)
- [ ] **Refactor: validate_input (3x duplication)** 🟡
   Priority: major | Savings: 45L
   ...

Configuration

Create a redup.toml file:

[scan]
extensions = ".py,.js,.ts,.go,.rs,.java,.rb,.php,.html,.css,.sql,.lua,.scala,.kt,.swift,.m,.json,.yaml,.toml,.xml,.md,.graphql,.dockerfile,.svelte,.vue"
min_lines = 3
min_similarity = 0.85
include_tests = false

[lsh]
enabled = true
min_lines = 50
threshold = 0.8

[check]
max_groups = 10
max_lines = 100

[output]
format = "toon"
output = "redup_output"

[reporting]
include_snippets = true
generate_suggestions = true

Or use [tool.redup] in pyproject.toml. Environment variables with REDUP_ prefix override file settings.

Python API

from pathlib import Path
from redup import ScanConfig, analyze
from redup.reporters.toon_reporter import to_toon
from redup.reporters.json_reporter import to_json

config = ScanConfig(
    root=Path("./my_project"),
    extensions=[".py", ".js", ".ts", ".go", ".rs", ".java", ".rb", ".php", ".html", ".css"],
    min_block_lines=3,
    min_similarity=0.85,
)

result = analyze(config=config, function_level_only=True)

print(f"Found {result.total_groups} duplicate groups")
print(f"Lines recoverable: {result.total_saved_lines}")

# For LLM consumption
print(to_toon(result))

# For tooling / CI
Path("duplication.json").write_text(to_json(result))

Output Formats

TOON (LLM-optimized)

# redup/duplication | 15 groups | 86f 10453L | 2026-04-16

SUMMARY:
  files_scanned: 86
  total_lines:   10453
  dup_groups:    15
  dup_fragments: 36
  saved_lines:   217
  scan_ms:       3620

HOTSPOTS[7] (files with most duplication):
  src/redup/core/ts_extractor.py  dup=74L  groups=4  frags=11  (0.7%)
  src/redup/core/scanner_utils.py  dup=70L  groups=3  frags=3  (0.7%)
  src/redup/core/scanner_loader.py  dup=52L  groups=1  frags=1  (0.5%)

DUPLICATES[15] (ranked by impact):
  [E0001] ! EXAC  _preload_files  L=52 N=2 saved=52 sim=1.00
      src/redup/core/scanner_loader.py:9-60  (_preload_files)
      src/redup/core/scanner_utils.py:53-104  (_preload_files)

REFACTOR[15] (ranked by priority):
  [1] ◐ extract_module     → src/redup/core/utils/_preload_files.py
      WHY: 2 occurrences of 52-line block across 2 files — saves 52 lines
      FILES: src/redup/core/scanner_loader.py, src/redup/core/scanner_utils.py

QUICK_WINS[8] (low risk, high savings — do first):
  [3] extract_function   saved=26L  → src/redup/core/utils/find_exact_duplicates_lazy.py
      FILES: lazy_grouper.py
  [4] extract_function   saved=21L  → src/redup/core/utils/_extract_functions_go.py
      FILES: ts_extractor.py

DEPENDENCY_RISK[3] (duplicates spanning multiple packages):
  validate_input  packages=2  files=2
      api/routes/users.py
      services/auth/validate.py

EFFORT_ESTIMATE (total ≈ 8.7h):
  hard   _preload_files                      saved=52L  ~156min
  hard   __init__                            saved=36L  ~108min
  medium find_exact_duplicates_lazy          saved=26L  ~52min
  easy   _is_test_file                       saved=12L  ~24min

METRICS-TARGET:
  dup_groups:  15 → 0
  saved_lines: 217 lines recoverable
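
The ranking in the DUPLICATES section follows the impact score from the feature list (saved_lines × similarity). A minimal sketch, assuming saved lines are computed as L × (N − 1), which matches the sample entry above (L=52, N=2, saved=52) but is an inference rather than a documented formula. The Group class and the second group are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    lines: int         # L: length of one fragment
    occurrences: int   # N: number of fragments
    similarity: float  # sim

    @property
    def saved_lines(self) -> int:
        # Keep one copy and remove the rest (assumed formula).
        return self.lines * (self.occurrences - 1)

    @property
    def impact(self) -> float:
        return self.saved_lines * self.similarity


groups = [
    Group("hypothetical_helper", lines=10, occurrences=3, similarity=0.9),
    Group("_preload_files", lines=52, occurrences=2, similarity=1.0),
]
ranked = sorted(groups, key=lambda g: g.impact, reverse=True)
for g in ranked:
    print(g.name, g.saved_lines, g.impact)
```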

JSON (machine-readable)

{
  "summary": {
    "total_groups": 3,
    "total_saved_lines": 84
  },
  "groups": [
    {
      "id": "E0001",
      "type": "exact",
      "normalized_name": "calculate_tax",
      "fragments": [
        {"file": "billing.py", "line_start": 1, "line_end": 8},
        {"file": "shipping.py", "line_start": 1, "line_end": 8}
      ],
      "saved_lines_potential": 8
    }
  ],
  "refactor_suggestions": [
    {
      "priority": 1,
      "action": "extract_function",
      "new_module": "utils/calculate_tax.py",
      "risk_level": "low"
    }
  ]
}
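
Because the JSON report is plain data, a custom CI gate takes only a few lines. The sketch below reads the summary keys shown above. The gate function is hypothetical, and treating the line limit as a cap on total_saved_lines is an assumption about how a threshold such as --max-lines might be interpreted.

```python
import json


def gate(report_json: str, max_groups: int, max_lines: int) -> list[str]:
    """Return a list of human-readable failures; an empty list means pass."""
    summary = json.loads(report_json)["summary"]
    failures = []
    if summary["total_groups"] > max_groups:
        failures.append(f"{summary['total_groups']} groups > {max_groups}")
    if summary["total_saved_lines"] > max_lines:
        failures.append(f"{summary['total_saved_lines']} lines > {max_lines}")
    return failures


report = '{"summary": {"total_groups": 3, "total_saved_lines": 84}}'
print(gate(report, max_groups=10, max_lines=100))
```

A CI script could exit non-zero whenever the returned list is not empty.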

Cross-Project Comparison

The redup compare command analyzes two separate projects to detect shared code and recommends a refactoring strategy:

  • Merge projects — if >60% code overlap
  • Extract shared library — if 5-60% overlap with well-defined clusters
  • Keep separate — if <5% overlap
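
The three thresholds above map to a small decision function. This is a sketch of the rule as stated, not the tool's code: only the label extract_shared_lib appears in the sample output, so the other two labels are hypothetical, as is the fallback for mid-range overlap without clear clusters.

```python
def recommend(overlap: float, has_clusters: bool = True) -> str:
    """Map an overlap fraction (0.0-1.0) to a refactoring strategy."""
    if overlap > 0.60:
        return "merge_projects"        # hypothetical label
    if overlap >= 0.05 and has_clusters:
        return "extract_shared_lib"    # label used in the sample output
    # Below 5%, or no well-defined clusters (the latter is an assumed fallback).
    return "keep_separate"             # hypothetical label


print(recommend(0.1523))
```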

CLI Usage

# Basic comparison
redup compare ./project-a ./project-b --threshold 0.75

# With semantic similarity (slower, more accurate)
redup compare ./project-a ./project-b --semantic --threshold 0.70

# Multi-language projects
redup compare ./backend ./frontend --ext ".py,.js,.ts" --threshold 0.80

# Skip community detection (faster, no networkx required)
redup compare ./a ./b --no-community

# Generate LLM-powered refactoring plan (requires redup[llm])
redup compare ./a ./b --refactor-plan --env .env --output plan.json

Sample Output

Comparing project-a ↔ project-b (threshold=0.75)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Cross-Project Comparison                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Metric                  │ Value                      │
├─────────────────────────┼────────────────────────────┤
│ Project A files         │ 42                         │
│ Project B files         │ 38                         │
│ Project A lines         │ 8500                       │
│ Project B lines         │ 7200                       │
│ Cross matches           │ 15                         │
│ Shared LOC (potential)  │ 1200                       │
└─────────────────────────┴────────────────────────────┘

Recommendation: extract_shared_lib
15% overlap (1200 shared lines, 5 clusters). Extract to shared library.
Confidence: 80%

Top Communities (shared code candidates):
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━┳━━━━━━━━━━┓
┃ ID ┃ Name                 ┃ Similarity ┃ LOC ┃ Members  ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━╇━━━━━━━━━━┩
│  0 │ validate_input       │ 0.89       │ 180 │ 5        │
│  1 │ parse_config         │ 0.82       │ 140 │ 4        │
│  2 │ format_response      │ 0.76       │ 100 │ 3        │
└────┴──────────────────────┴────────────┴─────┴──────────┘

Report JSON Structure

{
  "project_a": "./project-a",
  "project_b": "./project-b",
  "stats": {
    "a": {"files": 42, "lines": 8500},
    "b": {"files": 38, "lines": 7200}
  },
  "total_matches": 15,
  "shared_loc_potential": 1200,
  "recommendation": {
    "decision": "extract_shared_lib",
    "rationale": "15% overlap (1200 shared lines, 5 clusters). Extract to shared library.",
    "overlap_pct": 0.1523,
    "shared_loc": 1200,
    "confidence": 0.8
  },
  "communities": [
    {
      "name": "validate_input",
      "similarity": 0.89,
      "loc": 180,
      "members": [
        {"project": "A", "file": "api/validators.py", "function": "validate_input"},
        {"project": "B", "file": "utils/validation.py", "function": "validate_input"}
      ]
    }
  ],
  "matches": [...]
}

Algorithm Overview

The comparison uses three tiers of similarity detection, from cheapest to most expensive:

  1. Structural hash — exact AST matches (fast, O(n+m))
  2. LSH (Locality Sensitive Hashing) — near-duplicates via MinHash
  3. Semantic similarity — CodeBERT embeddings (optional, slowest)

Matches are deduplicated by (function_a, function_b, file_a, file_b) with the highest similarity score retained.
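
To illustrate the MinHash idea behind tier 2, here is a standard-library sketch that estimates Jaccard similarity between two token streams. It shows the signature step only, not the banding that makes LSH sub-quadratic, and it does not use datasketch. All function names and parameters are assumptions for the example.

```python
import hashlib


def shingles(tokens: list[str], k: int = 3) -> set[str]:
    """Overlapping k-token windows of a token stream."""
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per salted hash function; items must be non-empty."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.sha256(salt + item.encode("utf-8")).digest()[:8], "big"
                )
                for item in items
            )
        )
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = shingles("x = load ( path ) ; y = parse ( x ) ; return y".split())
b = shingles("x = load ( path ) ; y = parse ( x ) ; print ( y )".split())
print(estimate_jaccard(minhash(a), minhash(b)))
```

The estimate converges on the true Jaccard similarity as num_perm grows, and comparing fixed-length signatures is much cheaper than comparing full blocks.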

Community Detection

Requires networkx (pip install redup[compare]).

Communities are found with greedy modularity maximization on a similarity graph where:

  • Nodes = functions from both projects
  • Edges = similarity score (filtered by --threshold)
  • Communities = clusters of mutually similar functions

Each community gets a generated name based on the longest common prefix of its member functions (e.g., validate_* → validate_input).
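
Prefix-based naming can be sketched with os.path.commonprefix, which compares strings character by character. The exact rule reDUP applies is not spelled out here, so the wildcard suffix and the fallbacks below are assumptions, and community_name is a hypothetical helper.

```python
import os


def community_name(function_names: list[str]) -> str:
    """Name a cluster after the longest common prefix of its members."""
    if not function_names:
        return ""
    prefix = os.path.commonprefix(function_names)
    if not prefix:
        return function_names[0]   # assumed fallback: no shared prefix
    if prefix in function_names:
        return prefix              # the prefix is itself a member's name
    return prefix + "*"            # otherwise mark it as a pattern


print(community_name(["validate_input", "validate_email", "validate_user"]))
```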

Architecture

src/redup/
├── __init__.py            # Public API
├── __main__.py            # python -m redup
├── mcp_server.py          # MCP server entry point (re-exports from mcp package)
├── mcp/                   # MCP server package
│   ├── __init__.py        # Public MCP API
│   ├── handlers.py        # Tool handlers
│   ├── schemas.py         # JSON-RPC schemas
│   ├── server.py          # JSON-RPC server core
│   └── utils.py           # Shared utilities
├── core/
│   ├── models.py          # Pydantic data models
│   ├── scanner.py         # File discovery + block extraction
│   ├── scanner/           # Scanner package
│   │   ├── __init__.py    # Public scanner API
│   │   ├── cache.py       # Memory cache
│   │   ├── filters.py     # File filtering
│   │   ├── loader.py      # File preloading
│   │   └── types.py       # Scanner types
│   ├── hasher.py          # SHA-256 / structural fingerprinting
│   ├── matcher.py         # Fuzzy similarity comparison
│   ├── planner.py         # Refactoring suggestion generator
│   ├── pipeline.py        # Legacy: re-exports from pipeline package
│   ├── pipeline/          # Pipeline package (new)
│   │   ├── __init__.py    # analyze(), analyze_optimized(), analyze_parallel()
│   │   ├── phases.py      # scan_phase(), process_blocks()
│   │   ├── duplicate_finder.py  # Duplicate finding phases
│   │   └── groups.py      # Group creation, deduplication
│   └── ts_extractor/        # Tree-sitter extraction (35+ languages)
│       ├── __init__.py    # Public API
│       ├── main.py        # Core extraction API
│       ├── dispatcher.py  # Language routing
│       ├── config.py      # Language registry
│       └── extractors/    # Per-language extractors
├── reporters/
│   ├── json_reporter.py   # JSON output
│   ├── yaml_reporter.py   # YAML output
│   └── toon_reporter.py   # TOON output (LLM-optimized)
└── cli_app/
    └── main.py            # Typer CLI

Analysis Pipeline

1. SCAN      Walk project, read files, extract function-level + sliding-window blocks
2. HASH      Generate exact (SHA-256) and structural (normalized AST) fingerprints
3. GROUP     Bucket by hash, keep only groups with 2+ blocks from different locations
4. MATCH     Verify candidates with fuzzy similarity (SequenceMatcher / rapidfuzz)
5. DEDUP     Remove overlapping groups (keep highest-impact)
6. PLAN      Generate prioritized refactoring suggestions with risk assessment
7. REPORT    Export to JSON / YAML / TOON
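
Steps 1 to 3 can be sketched in a few lines: slide a fixed-size window over each file, hash the whitespace-stripped block with SHA-256, and keep only buckets with two or more locations. This is a toy illustration under simplifying assumptions (fixed window size, in-memory files, hypothetical function names). The real scanner also extracts function-level blocks.

```python
import hashlib
from collections import defaultdict


def window_hashes(path: str, text: str, size: int = 3):
    """Yield (digest, (path, first_line, last_line)) for each window."""
    lines = [ln.strip() for ln in text.splitlines()]
    for start in range(len(lines) - size + 1):
        block = "\n".join(lines[start:start + size])
        if not block.strip():
            continue  # skip windows that contain only blank lines
        digest = hashlib.sha256(block.encode("utf-8")).hexdigest()
        yield digest, (path, start + 1, start + size)


def group_exact(files: dict[str, str], size: int = 3) -> dict[str, list]:
    """Bucket windows by digest and keep buckets with 2+ locations."""
    buckets = defaultdict(list)
    for path, text in files.items():
        for digest, location in window_hashes(path, text, size):
            buckets[digest].append(location)
    return {d: locs for d, locs in buckets.items() if len(locs) >= 2}


files = {
    "a.py": "x = 1\ny = 2\nz = x + y\nprint(z)",
    "b.py": "import os\nx = 1\ny = 2\nz = x + y",
}
for locations in group_exact(files).values():
    print(locations)
```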

Recent Improvements (v0.5.0)

🏗️ Modular Architecture Refactoring

Major internal restructuring for better maintainability and extensibility:

MCP Server Package

The MCP server has been split from a 675-line monolith into a clean package:

redup/mcp/
├── __init__.py      # Public API
├── handlers.py      # 8 tool handlers
├── schemas.py       # JSON-RPC schemas
├── server.py        # Server core
└── utils.py         # Utilities

  • 82% code reduction in main file
  • Backward compatible: mcp_server.py re-exports all APIs
  • Better testability: Isolated handlers can be tested independently

Pipeline Package

The analysis pipeline (714 lines) now lives in a modular package:

redup/core/pipeline/
├── __init__.py          # analyze(), analyze_optimized(), analyze_parallel()
├── phases.py            # scan_phase(), process_blocks()
├── duplicate_finder.py  # find_exact_groups(), find_structural_groups(), etc.
└── groups.py            # deduplicate_groups(), blocks_to_group(), etc.

  • 66% reduction in main orchestrator file
  • Phases can be used independently for custom workflows
  • Cleaner separation of concerns

Scanner Improvements

The scanner has been refactored with extracted helpers:

  • _init_strategy() - Strategy initialization
  • _process_single_file() - Per-file processing
  • _extract_blocks_for_file() - Block extraction
  • Reduced cyclomatic complexity (CC) and fan-out in the main scan_project() function

🎯 Sprint 1 Refactoring Complete

  • Reduced cyclomatic complexity from CC̄=4.2 to CC̄=3.5
  • Eliminated all critical functions (CC > 10): 2 → 0
  • Achieved HEALTHY status with no structural issues
  • Dispatch pattern implementation for AST node processing
  • Modular TOON reporter split into 5 focused functions
  • CLI refactoring with helper functions for better maintainability

🚀 Technical Achievements

  • _process_ast_node: CC=14 → CC=6 (dispatch dict pattern)
  • to_toon: CC=12 → CC=8 (5 helper functions)
  • CLI scan(): fan-out=18 → ≤10 (4 helper functions)
  • Code quality: 0 high-complexity functions
  • Tests: 64/64 passing (100% pass rate)

📊 Quality Metrics

  • Health status: ✅ HEALTHY (no critical issues)
  • Cyclomatic complexity: CC̄=3.5 (down from 4.2; still above the ≤ 3.0 target)
  • Maximum CC: 9 (target ≤ 10 achieved)
  • Code maintainability: Significantly improved
  • Duplication: Minimal (2 groups, 6 lines - acceptable patterns)

🔧 Code Architecture

  • Dispatch tables for extensible AST processing
  • Single responsibility functions throughout codebase
  • Clean separation of concerns in CLI pipeline
  • Type safety improvements with proper annotations
  • Error handling enhanced for edge cases
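
The dispatch-table pattern mentioned above replaces a long if/elif chain with a dictionary lookup keyed on node type, which is how a function's cyclomatic complexity can drop without changing behavior. The sketch below illustrates the pattern with Python's ast module. The handler names and output strings are hypothetical and do not reproduce reDUP's _process_ast_node.

```python
import ast


def _on_function(node) -> str:
    return f"function {node.name}"


def _on_class(node) -> str:
    return f"class {node.name}"


def _on_import(node) -> str:
    return "import " + ", ".join(alias.name for alias in node.names)


# Adding support for a new node type means adding one entry here.
HANDLERS = {
    ast.FunctionDef: _on_function,
    ast.AsyncFunctionDef: _on_function,
    ast.ClassDef: _on_class,
    ast.Import: _on_import,
}


def process_ast_node(node):
    """Look up the handler by exact node type; unknown nodes yield None."""
    handler = HANDLERS.get(type(node))
    return handler(node) if handler else None


def describe(source: str) -> list[str]:
    results = []
    for node in ast.walk(ast.parse(source)):
        described = process_ast_node(node)
        if described is not None:
            results.append(described)
    return results


print(describe("import os\n\nclass A:\n    def run(self):\n        pass\n"))
```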

Integration with wronai Toolchain

reDUP is part of the wronai developer toolchain:

  • code2llm — static analysis engine (health diagnostics, complexity)
  • reDUP — deep duplication analysis and refactoring planning
  • code2docs — automatic documentation generation
  • vallm — validation of LLM-generated code proposals

📈 Typical workflow:

  1. code2llm analyzes the project → .toon diagnostics
  2. redup finds duplicates → duplication.toon.yaml
  3. Feed both to an LLM for targeted refactoring
  4. vallm validates the LLM's proposals before merging

🎯 Why reDUP?

  • LLM-ready: TOON format optimized for LLM consumption
  • Actionable: Generates concrete refactoring suggestions
  • Prioritized: Ranks duplicates by impact and risk
  • Integrated: Works seamlessly with wronai toolchain
  • Fast: Scans 1000+ lines in < 1 second
  • Clean: No syntax warnings, professional output

Development

git clone https://github.com/semcod/redup.git
cd redup
pip install -e ".[dev]"
pytest

License

Licensed under Apache-2.0.

Author

Tom Sapletta
