
MaxCompression

High-ratio lossless data compression library and CLI

MaxCompression (MCX) is a lossless compression library and CLI written in portable C99. It combines multiple compression strategies — LZ77 with adaptive entropy coding, Burrows-Wheeler Transform with multi-table rANS, LZRC (LZ + range coder), and stride-delta preprocessing — under a unified API that automatically selects the best pipeline for each data type.

MCX targets maximum compression ratio while maintaining practical speeds. It beats bzip2 on 100% of standard benchmark files and competes with xz/LZMA2 on most data types.

Highlights

| Metric | MCX | Best Alternative |
|--------|-----|------------------|
| kennedy.xls (structured binary) | 50.1× | xz: 21.0× — 2.4× better |
| nci (chemical text, 33 MB) | 25.7× | xz: 19.3× — 33% better |
| alice29.txt (English text, L20) | 3.52× | bzip2: 3.52× — matches bzip2 |
| alice29.txt (English text, L28 CM) | 4.28× | PAQ8l: 4.28× — beats PAQ8l |
| mozilla (50 MB binary archive) | 3.22× | xz: 3.55× — 91% of xz |
| enwik8 (100 MB Wikipedia) | 4.04× | xz: 3.89× — beats xz by 4% |
| Silesia corpus (202 MB total) | 4.35× | bzip2: 3.89× — +12% |

Features

Compression Engines

  • Smart Mode (L20) — automatically detects data type and selects the optimal pipeline
  • LZ77 (L1–L9) — fast compression with greedy/lazy matching and hash chain match finders
  • BWT + multi-table rANS (L10–L14) — Burrows-Wheeler Transform with K-means clustered frequency tables
  • LZRC v2.0 (L24–L26) — LZ + adaptive range coder with binary tree or hash chain match finder, LZMA-style matched literal coding, 4-state machine, rep-match distances
  • Context Mixing (L28) — PAQ8-class bit-level compressor: 58 context models, 8 logit-space neural mixers, 3-stage APM cascade, adaptive StateMap — beats bzip2 by 17–30% on text, beats PAQ8l on alice29
  • Stride-Delta — auto-detects fixed-width records (1–512 byte stride) for structured binary data
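
The stride-delta idea can be made concrete with a short sketch (an illustration under assumed names, not MCX's actual code): for fixed-width records, each byte is replaced by its difference from the byte one stride earlier, so slowly varying record fields collapse into long runs of zeros and small values that downstream RLE and entropy coding handle well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stride-delta transform (a sketch, not MCX's code).
 * Iterating backward lets the forward transform run in place, since
 * each delta is computed from a not-yet-modified earlier byte. */
static void stride_delta(uint8_t *buf, size_t n, size_t stride) {
    for (size_t i = n; i-- > stride; )
        buf[i] = (uint8_t)(buf[i] - buf[i - stride]);
}

/* Inverse: iterate forward so earlier bytes are already restored. */
static void stride_undelta(uint8_t *buf, size_t n, size_t stride) {
    for (size_t i = stride; i < n; i++)
        buf[i] = (uint8_t)(buf[i] + buf[i - stride]);
}
```

On 4-byte records whose first field increments by one, the transform turns that column into a run of 1s and the constant columns into zeros.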

Entropy Coding

  • Multi-table rANS — 4–6 frequency tables with K-means clustering, within 0.01 bits/symbol of entropy
  • Adaptive Arithmetic Coding — order-1 AC with Fenwick-tree accelerated decoding (O(log n) per symbol)
  • Adaptive Range Coder — bit-level context modeling with matched literal coding for LZRC
  • tANS/FSE — 4-stream interleaved table ANS for fast LZ decompression
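
To show the core mechanism behind the rANS coders, here is a minimal single-table sketch (illustrative only; MCX's multi-table variant additionally clusters block statistics into several tables, and all names here are assumptions). Symbols are encoded in reverse so the decoder can emit them forward; `freq[]` must be nonzero for every symbol that occurs and sum to exactly `1 << SCALE_BITS`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCALE_BITS 12
#define RANS_L (1u << 16)            /* lower bound of the state interval */

typedef struct { uint32_t freq[256], cum[257]; } rans_table;

static void rans_init(rans_table *t, const uint32_t *freq) {
    t->cum[0] = 0;
    for (int s = 0; s < 256; s++) {
        t->freq[s] = freq[s];
        t->cum[s + 1] = t->cum[s] + freq[s];
    }
}

/* Encode n symbols in reverse order; returns the number of bytes written. */
static size_t rans_encode(const rans_table *t, const uint8_t *in, size_t n,
                          uint8_t *out) {
    uint32_t x = RANS_L;
    size_t pos = 0;
    for (size_t i = n; i-- > 0; ) {
        uint32_t f = t->freq[in[i]];
        while (x >= ((RANS_L >> SCALE_BITS) << 8) * f) {   /* renormalize */
            out[pos++] = (uint8_t)x;
            x >>= 8;
        }
        x = ((x / f) << SCALE_BITS) + (x % f) + t->cum[in[i]];
    }
    for (int k = 0; k < 4; k++) { out[pos++] = (uint8_t)x; x >>= 8; }  /* flush */
    return pos;
}

static void rans_decode(const rans_table *t, const uint8_t *buf, size_t len,
                        uint8_t *out, size_t n) {
    size_t p = len - 4;
    uint32_t x = (uint32_t)buf[p] | (uint32_t)buf[p + 1] << 8
               | (uint32_t)buf[p + 2] << 16 | (uint32_t)buf[p + 3] << 24;
    for (size_t i = 0; i < n; i++) {
        uint32_t low = x & ((1u << SCALE_BITS) - 1);
        int s = 0;
        while (t->cum[s + 1] <= low) s++;  /* linear scan; real coders use a table */
        out[i] = (uint8_t)s;
        x = t->freq[s] * (x >> SCALE_BITS) + low - t->cum[s];
        while (x < RANS_L) x = (x << 8) | buf[--p];   /* refill from the stream */
    }
}
```

The multi-table variant described above applies the same state transform, but routes each symbol to one of several clustered frequency tables.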

Preprocessing

  • E8/E9 x86 filter — CALL/JMP address normalization (+16% on x86 binaries)
  • RLE2 — bijective base-2 zero-run encoding (log₂(N) symbols for N zeros)
  • Genetic optimizer — evolves pipeline configuration per block at L10–L14
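
The bijective base-2 claim behind RLE2 can be sketched concretely (symbol names RUNA/RUNB are borrowed from the bzip2 literature for illustration; MCX's actual alphabet may differ): a run of N zeros becomes the digits of N in bijective base 2, where each digit is 1 or 2, so k symbols cover every run length from 1 up to 2^(k+1) - 2, i.e. about log₂(N) symbols per run.

```c
#include <assert.h>

/* Illustrative bijective base-2 zero-run coding (a sketch, not MCX's code). */
enum { RUNA = 0, RUNB = 1 };   /* RUNA = digit 1, RUNB = digit 2 */

/* Encode a run of n zeros (n >= 1); returns the number of symbols emitted. */
static int rle2_encode(unsigned n, int *syms) {
    int k = 0;
    while (n > 0) {
        if (n & 1) { syms[k++] = RUNA; n = (n - 1) / 2; }
        else       { syms[k++] = RUNB; n = (n - 2) / 2; }
    }
    return k;
}

/* Decode: n = sum of digit_i * 2^i, little-endian digit order. */
static unsigned rle2_decode(const int *syms, int k) {
    unsigned n = 0, w = 1;
    for (int i = 0; i < k; i++) {
        n += (syms[i] == RUNA ? 1u : 2u) * w;
        w *= 2;
    }
    return n;
}
```

For example, a run of 255 zeros encodes as eight RUNA symbols, matching the log₂(N) bound.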

CLI

  • 30+ subcommands — compress, decompress, verify, diff, bench, stat, hash, checksum, upgrade, pipe, and more
  • Multi-file and recursive — mcx compress -r ./data/ with glob exclusion patterns
  • Rich benchmarking — JSON/CSV/Markdown output, --compare against gzip/bzip2/xz, --aggregate for directories
  • Decompress aliases — mcx x, mcx d, mcx extract
  • Shell completions — Bash, Zsh, Fish

Library

  • Simple C API — mcx_compress(), mcx_decompress(), mcx_get_frame_info()
  • Python bindings — ctypes-based, pip-installable
  • OpenMP parallel — block-level parallelism, configurable thread count
  • Pure C99 — no C++ dependency, compiles with GCC, Clang, MSVC
  • Cross-platform — Linux, macOS, Windows
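
Block-level OpenMP parallelism of the kind listed above typically looks like the following sketch (illustrative names and block size, not MCX's internal code): each fixed-size block is compressed independently on its own thread, and results land in per-block slots so output order stays deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define BLOCK_SIZE (1u << 16)   /* hypothetical block size for the sketch */

typedef size_t (*block_fn)(const unsigned char *src, size_t n,
                           unsigned char *dst);

/* Compress nblocks independent blocks in parallel; sizes[b] records the
 * compressed size of block b. The pragma is ignored without OpenMP. */
static void compress_blocks(block_fn fn, const unsigned char *src, size_t n,
                            unsigned char *dst, size_t *sizes, size_t nblocks) {
    #pragma omp parallel for schedule(dynamic)
    for (long b = 0; b < (long)nblocks; b++) {
        size_t off = (size_t)b * BLOCK_SIZE;
        size_t len = (n - off < BLOCK_SIZE) ? n - off : BLOCK_SIZE;
        sizes[b] = fn(src + off, len, dst + off);   /* per-block output slot */
    }
}

/* Toy "compressor" for demonstration only: identity copy. */
static size_t copy_block(const unsigned char *s, size_t n, unsigned char *d) {
    memcpy(d, s, n);
    return n;
}
```

Because blocks share no state, threads never contend; a real pipeline would also record per-block sizes in the frame header so the decoder can locate each block.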

Quick Start

# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Compress
./build/bin/mcx compress myfile.txt                # fast (L3)
./build/bin/mcx compress --best myfile.txt          # max compression (L20)

# Decompress
./build/bin/mcx decompress myfile.txt.mcx

# Benchmark
./build/bin/mcx bench myfile.txt
./build/bin/mcx bench --compare mydir/              # vs gzip/bzip2/xz

# Run tests
cd build && ctest --output-on-failure

Installation

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
sudo cmake --install build       # installs to /usr/local

# Shell completions
cp completions/mcx.bash ~/.local/share/bash-completion/completions/mcx
cp completions/mcx.zsh /usr/local/share/zsh/site-functions/_mcx
cp completions/mcx.fish ~/.config/fish/completions/mcx.fish

# Man page
sudo cp docs/mcx.1 /usr/local/share/man/man1/

Requirements: C99 compiler (GCC, Clang, MSVC), CMake ≥ 3.10. Optional: OpenMP for multi-threading.

Usage

CLI

# Compression
mcx compress input.bin                     # default level (L3)
mcx compress -l 20 input.bin               # max compression
mcx compress --fast input.bin              # L3
mcx compress --best input.bin              # L20 Smart Mode
mcx compress -l 26 binary.bin              # LZRC (best for binaries)
mcx compress -l 28 archival.txt            # Context Mixing (max ratio)

# Multi-file & recursive
mcx compress *.txt                         # compress all .txt files
mcx compress -r ./data/ --exclude "*.log"  # recursive with exclusion
mcx decompress *.mcx                       # decompress all

# Inspection
mcx info archive.mcx                       # detailed frame info
mcx info --blocks archive.mcx              # per-block details
mcx ls *.mcx                               # compact multi-file listing
mcx diff old.mcx new.mcx                   # compare two archives
mcx stat rawfile.bin                       # entropy and byte distribution

# Integrity
mcx verify archive.mcx                     # decompress and verify CRC
mcx verify archive.mcx original.bin        # verify against original
mcx checksum archive.mcx                   # verify header CRC32
mcx hash archive.mcx                       # CRC32/FNV of content

# Utilities
mcx cat archive.mcx                        # decompress to stdout
mcx cat archive.mcx | head -c 1024        # pipe first 1KB
mcx pipe -l 6 < input > output.mcx        # stdin/stdout mode
mcx upgrade -l 20 --in-place old.mcx      # recompress at higher level

# Benchmarking
mcx bench input.bin                        # all default levels
mcx bench --all-levels input.bin           # L1-L26
mcx bench --compare input.bin             # vs gzip/bzip2/xz
mcx bench -r ./corpus/ --aggregate        # directory totals
mcx bench --format json input.bin         # JSON output
mcx bench --format csv input.bin          # CSV output
mcx bench --format markdown input.bin     # Markdown table

# Advanced
mcx compress --decompress-check input.bin  # roundtrip verify in memory
mcx compress --atomic input.bin            # crash-safe write
mcx compress --preserve-mtime input.bin    # preserve timestamps
mcx compress --dry-run input.bin           # analyze without writing
mcx compress --estimate input.bin          # fast size estimate
mcx compress --adaptive-level input.bin    # entropy-based auto level

# Self-test
mcx test                                   # built-in roundtrip tests
mcx version --build                        # detailed build info

C API

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <maxcomp/maxcomp.h>

// Compress (level 20 = Smart Mode)
size_t bound = mcx_compress_bound(src_size);
uint8_t* dst = malloc(bound);
size_t comp_size = mcx_compress(dst, bound, src, src_size, 20);
if (mcx_is_error(comp_size)) {
    fprintf(stderr, "Error: %s\n", mcx_get_error_name(comp_size));
}

// Decompress
size_t orig_size = mcx_decompress(out, out_cap, dst, comp_size);

// Inspect
mcx_frame_info info;
mcx_get_frame_info(dst, comp_size, &info);
printf("Original: %zu, Level: %d\n", info.original_size, info.level);

// Version
printf("MCX %s\n", mcx_version_string());

free(dst);

Python

import maxcomp

data = open("input.bin", "rb").read()
compressed = maxcomp.compress(data, level=20)
restored = maxcomp.decompress(compressed)
assert restored == data

info = maxcomp.get_frame_info(compressed)
print(f"Original: {info['original_size']}, Level: {info['level']}")

Compression Levels

| Level | Strategy | Compress Speed | Decompress Speed | Use Case |
|-------|----------|----------------|------------------|----------|
| 1–3 | LZ77 greedy + tANS | ~5–10 MB/s | ~15–35 MB/s | Real-time, streaming |
| 6 | LZ77 lazy + rANS | ~3–5 MB/s | ~10–20 MB/s | General purpose |
| 7–9 | LZ77 lazy + adaptive AC | ~2–4 MB/s | ~3–14 MB/s | Best LZ ratio |
| 10–14 | BWT + MTF + RLE2 + multi-rANS | ~1–3 MB/s | ~5–10 MB/s | Text, structured data |
| 20 | Smart Mode (auto-detect) | ~0.3–1 MB/s | ~3–7 MB/s | Maximum compression |
| 24 | LZRC fast (hash chains) | ~1–2 MB/s | ~4–5 MB/s | Fast binary compression |
| 26 | LZRC best (binary tree) | ~0.3–0.5 MB/s | ~4–5 MB/s | Best for binary data |
| 28 | Context Mixing (CM) | ~10–15 KB/s | ~10–15 KB/s | Archival, maximum ratio |

Shortcuts: --fast (L3), --default (L6), --best (L20)

Smart Mode (Level 20)

Analyzes each block and automatically routes to the best pipeline:

  • Structured binary (spreadsheets, audio) → stride-delta + RLE2 + rANS → kennedy.xls 50×
  • Text (UTF-8, source code) → BWT + MTF + RLE2 + multi-rANS → alice29 3.53×
  • x86 executables → E8/E9 filter + BWT → ooffice 2.56×
  • Mixed/binary → multi-trial (tries BWT, LZ, LZRC, keeps smallest)
  • Incompressible → stored uncompressed (no expansion)
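
The multi-trial and stored-fallback logic can be sketched in a few lines (hypothetical helper names, not MCX's internal API): run each candidate pipeline, keep the smallest output, and fall back to storing the block verbatim so data can never expand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each candidate returns its compressed size, or (size_t)-1 on failure. */
typedef size_t (*trial_fn)(const unsigned char *src, size_t n,
                           unsigned char *dst, size_t cap);

/* Try every pipeline and keep the smallest result. *winner is the index
 * of the winning trial, or -1 when storing uncompressed wins. */
static size_t pick_smallest(trial_fn *trials, int ntrials,
                            const unsigned char *src, size_t n,
                            unsigned char *dst, size_t cap, int *winner) {
    size_t best = n;              /* "stored" baseline: never expand */
    *winner = -1;
    for (int i = 0; i < ntrials; i++) {
        size_t len = trials[i](src, n, dst, cap);
        if (len != (size_t)-1 && len < best) { best = len; *winner = i; }
    }
    if (*winner >= 0)
        best = trials[*winner](src, n, dst, cap);  /* re-run winner into dst */
    else if (n <= cap)
        memcpy(dst, src, n);                       /* store verbatim */
    return best;
}

/* Toy trials for demonstration only. */
static size_t trial_expand(const unsigned char *s, size_t n,
                           unsigned char *d, size_t cap) {
    (void)s; (void)d; (void)cap;
    return n + 10;                /* pretends to expand the data */
}
static size_t trial_half(const unsigned char *s, size_t n,
                         unsigned char *d, size_t cap) {
    (void)s;
    if (n / 2 > cap) return (size_t)-1;
    memset(d, 0, n / 2);          /* pretends to halve the data */
    return n / 2;
}
```

Re-running the winner keeps the sketch simple; a real implementation would more likely retain each trial's output buffer instead.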

Benchmarks

Single-threaded, in-memory, roundtrip-verified. System gzip, bzip2, and xz for baselines.

Canterbury Corpus

| File | Size | gzip -9 | bzip2 -9 | xz -6 | MCX L20 | Winner |
|------|------|---------|----------|-------|---------|--------|
| alice29.txt | 152 KB | 2.81× | 3.52× | 3.14× | 3.52× | MCX ≈ bzip2 |
| asyoulik.txt | 125 KB | 2.56× | 3.16× | 2.81× | 3.15× | bzip2 ≈ MCX |
| lcet10.txt | 427 KB | 2.95× | 3.96× | 3.57× | 3.98× | MCX |
| plrabn12.txt | 482 KB | 2.48× | 3.31× | 2.91× | 3.33× | MCX |
| kennedy.xls | 1.0 MB | 4.91× | 7.90× | 20.97× | 50.1× | MCX (2.4× better than xz) |
| ptt5 | 513 KB | 9.80× | 10.31× | 12.22× | 10.19× | xz |

Silesia Corpus (202 MB)

The standard benchmark for evaluating compression on real-world data.

| File | Size | gzip -9 | bzip2 -9 | xz -9 | MCX L20 | vs bzip2 | vs xz |
|------|------|---------|----------|-------|---------|----------|-------|
| dickens | 9.7 MB | 2.65× | 3.64× | 3.60× | 4.07× | +12% | +13% |
| mozilla | 48.8 MB | 2.70× | 2.86× | 3.83× | 3.22× | +13% | -16% |
| mr | 9.5 MB | 2.71× | 4.08× | 3.63× | 4.28× | +5% | +18% |
| nci | 32.0 MB | 11.23× | 18.51× | 19.30× | 25.65× | +39% | +33% |
| ooffice | 5.9 MB | 1.99× | 2.15× | 2.54× | 2.56× | +19% | +1% |
| osdb | 9.6 MB | 2.71× | 3.60× | 3.54× | 4.04× | +12% | +14% |
| reymont | 6.3 MB | 3.64× | 5.32× | 5.03× | 5.93× | +11% | +18% |
| samba | 20.6 MB | 4.00× | 4.75× | 5.74× | 5.05× | +6% | -12% |
| sao | 6.9 MB | 1.36× | 1.47× | 1.64× | 1.48× | +1% | -10% |
| webster | 39.5 MB | 3.44× | 4.80× | 4.94× | 5.81× | +21% | +18% |
| xml | 5.1 MB | 8.07× | 12.12× | 11.79× | 12.86× | +6% | +9% |
| x-ray | 8.1 MB | 1.40× | 2.09× | 1.89× | 2.15× | +3% | +14% |
| Total | 202 MB | 3.13× | 3.89× | 4.34× | 4.35× | +12% | ≈ |

Score: MCX beats gzip 12/12, bzip2 12/12, xz 9/12.

xz leads on 3 binary-heavy files (mozilla, samba, sao) where LZMA2's large-window optimal parsing has an advantage. MCX's LZRC engine (L26) narrows this gap: mozilla 3.22× vs xz 3.55×.

Context Mixing (Level 28) — Maximum Compression

Level 28 enables the context mixing engine — a PAQ8-class bit-level compressor for archival use. Extremely slow (~10 KB/s) but achieves the best compression ratios.

| File | Size | bzip2 -9 | MCX L20 | MCX L28 (CM) | vs bzip2 |
|------|------|----------|---------|--------------|----------|
| alice29.txt | 152 KB | 3.52× | 3.52× | 4.28× | +22% |
| lcet10.txt | 427 KB | 3.96× | 3.98× | 4.93× | +25% |
| plrabn12.txt | 482 KB | 3.31× | 3.33× | 3.89× | +17% |
| asyoulik.txt | 125 KB | 3.16× | 3.15× | 3.74× | +18% |
| xml | 5.1 MB | 12.12× | 12.86× | 15.12× | +25% |
| dickens | 9.7 MB | 3.64× | 4.07× | 4.60× | +26% |
| reymont | 6.3 MB | 5.32× | 5.93× | 6.89× | +30% |

The CM engine uses 58 context models (order-0 through order-14, word, sparse, indirect, cross-context, linguistic), 8 logit-space neural network mixers with cross-terms, and a 3-stage Adaptive Probability Map cascade. It beats bzip2 by 17–30% on all text data and surpasses PAQ8l on alice29.txt.

Large Files

| File | Size | xz -9 | MCX L20 | Notes |
|------|------|-------|---------|-------|
| enwik8 | 95.4 MB | 3.89× | 4.04× | Wikipedia — beats xz by 4% |
| enwik9 | 953 MB | 4.12× | 4.28× | 1 GB Wikipedia dump |

Reproducing Benchmarks

All benchmarks are reproducible. Download the standard corpora and run:

# Canterbury Corpus
mkdir -p /tmp/cantrbry && cd /tmp/cantrbry
wget -q https://corpus.canterbury.ac.nz/resources/cantrbry.tar.gz
tar xzf cantrbry.tar.gz

# Benchmark with comparison against system compressors
mcx bench --compare /tmp/cantrbry/

# Silesia Corpus (download from https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia)
mcx bench --compare --format markdown /path/to/silesia/

# Context Mixing (Level 28) — warning: very slow (~10 KB/s)
mcx compress -l 28 /tmp/cantrbry/alice29.txt -o /tmp/alice.mcx
mcx decompress /tmp/alice.mcx -o /tmp/alice.dec
diff /tmp/cantrbry/alice29.txt /tmp/alice.dec  # verify roundtrip

Baseline compressors for comparison: gzip -9, bzip2 -9, xz -9 (system packages).

Architecture

Input → [Block Analyzer] → Strategy Selection
                               │
     ┌─────────┬──────────┼──────────┬──────────┬──────────┐
     ▼         ▼          ▼          ▼          ▼          ▼
LZ Pipeline  BWT Pipe  Stride-Δ   LZRC-HC    LZRC-BT    CM Engine
(L1–L9)      (L10–14)  (L20 auto) (L24)      (L26)      (L28)
     │         │          │          │          │          │
LZ77 Match  divsufsort  Delta @   HC Match   BT Match   58 Context
Finding     +MTF+RLE2   stride    Finder     Finder     Models
     │         │          │          │          │          │
tANS/FSE/   Multi-tbl  RLE2+rANS  Adaptive  Adaptive   8 Neural
Adaptive AC  rANS                  Range RC   Range RC   Mixers+APM
     │         │          │          │          │          │
     └─────────┴──────────┼──────────┴──────────┴──────────┘
                               ▼
                    [Block Multiplexer]
                    OpenMP Parallelism
                               ▼
                         .mcx output

File Format

MCX uses a frame-based format with a 20-byte header and variable-size blocks (up to 64 MB). See docs/FORMAT.md for the full specification.

Quality & Safety

MaxCompression is built with production-grade engineering practices:

| Practice | Status |
|----------|--------|
| Continuous Integration | GitHub Actions — every push triggers build + test on Linux (GCC + Clang, Release + Debug), macOS, and Windows |
| Test Suites | 21 test suites — unit, roundtrip, fuzz, stress, regression, integration, streaming, edge cases, malformed input |
| Memory Safety | Valgrind memcheck runs in CI — leak-check, track-origins, error-exitcode on every level |
| Code Coverage | lcov + Codecov integration in CI pipeline |
| Roundtrip Verification | Canterbury corpus roundtrip at all compression levels (L1–L26) in CI |
| Cross-Platform | CI builds and tests on Linux, macOS, Windows with multiple compilers |
| WASM | Emscripten build + Node.js roundtrip test in CI |
| Python Bindings | Automated binding test in CI (build .so, compress/decompress, verify) |
| pkg-config | Integration test: install, discover via pkg-config, build + link external program |
| API Documentation | Doxygen generation + undocumented-symbol check in CI |
| Security | SECURITY.md — vulnerability reporting policy, supported versions |
| Releases | Semantic versioning, prebuilt binaries (Linux/macOS/Windows) on every tagged release |

Project Stats

  • ~17,400 lines of C code
  • 770+ commits across the project
  • 21 test suites — unit tests, roundtrip, fuzz, stress, regression, integration
  • CI — Linux (GCC + Clang), macOS, Windows, Valgrind, WASM, coverage, Python bindings

Project Structure

maxcomp/
├── include/maxcomp/    Public API (maxcomp.h)
├── lib/
│   ├── entropy/        tANS, FSE, rANS, multi-rANS, adaptive AC, range coder
│   ├── lz/             LZ77, LZRC v2.0, binary tree + hash chain match finders
│   ├── preprocess/     BWT (divsufsort), MTF, RLE2, delta, E8/E9 filter
│   ├── babel/          Stride-delta transform
│   ├── optimizer/      Genetic pipeline optimizer
│   ├── analyzer/       Block analysis (entropy, structure, stride detection)
│   ├── external/       Embedded libdivsufsort (MIT license)
│   └── compat.h        Cross-platform portability layer
├── cli/                Command-line tool (30+ subcommands)
├── bindings/python/    Python ctypes bindings
├── completions/        Bash, Zsh, Fish shell completions
├── tests/              21 test suites (unit, integration, fuzz, stress)
├── docs/               Format spec, API docs, benchmarks, man page
├── valgrind.supp       Valgrind suppressions
└── CMakeLists.txt

Documentation

  • FORMAT.md — MCX file format specification
  • API.md — C API reference
  • DESIGN.md — v2.0 architecture and design decisions
  • BENCHMARKS.md — Comprehensive benchmark tables
  • ROADMAP.md — Development roadmap and research log
  • CHANGELOG.md — Version history
  • CONTRIBUTING.md — Contribution guidelines
  • man mcx — Man page (installed with cmake --install)

Roadmap

Completed ✅

  • BWT + multi-table rANS — beats bzip2 on all standard benchmarks
  • Adaptive arithmetic coding on LZ output — best-in-class LZ ratios
  • Smart Mode (L20) with stride-delta, E8/E9, multi-trial strategy selection
  • LZRC v2.0 — LZ + range coder with BT/HC match finders, rep-matches, matched literals
  • OpenMP block parallelism
  • Embedded libdivsufsort (2× faster BWT)
  • Rich CLI with 30+ commands, multi-file, recursive, benchmarking
  • Python bindings with pip install support
  • Cross-platform CI (Linux, macOS, Windows)

Future

  • Context-mixed literal coding for LZRC
  • ARM/ARM64 BCJ filter
  • Streaming API for arbitrary-length input
  • WASM build for browser usage
  • v3.0 format: Huffman-coded LZ tokens (close gap with gzip at same speed)

Contributing

Contributions are welcome! Please ensure all changes pass the test suite:

cd build && ctest --output-on-failure

For compression ratio changes, include before/after benchmarks on Canterbury and Silesia corpora. See CONTRIBUTING.md.

License

GNU General Public License v3.0 — Free for everyone, forever.


MaxCompression is developed by Dreams-Makers Studio.

About

MaxCompression is a cross-platform, lossless compression library written in C99. It aims to push compression ratios beyond the current state of the art by fusing techniques from information theory, nature-inspired algorithms, fractal mathematics, and predictive coding.
