MAJEC

Momentum-Accelerated Junction-Enhanced Counting — unified gene, isoform, and locus-level transposable element quantification from RNA-seq.

MAJEC jointly quantifies genes, transcript isoforms, and individual TE loci from standard BAM alignments in a single pass. By running an Expectation-Maximization (EM) algorithm over a joint gene+TE feature space, so that every ambiguous read is resolved probabilistically between genes and TEs, MAJEC eliminates the systematic signal misattribution that plagues existing TE quantification tools.

The Gene-TE Overlap Problem

Approximately 45% of the human genome consists of TEs, many of which are embedded directly within gene bodies.

TE-only tools: Operate blind to genes. When a host gene is transcribed, its reads are falsely attributed to the overlapping TE, driving massive false-positive TE reactivation calls.

The MAJEC Solution: Genes and TEs compete for reads probabilistically. MAJEC uses empirical splice junction evidence to heavily penalize unsupported transcript isoforms, preventing them from stealing genuine TE reads, while correctly assigning spliced genic reads to their host genes.

Key Features

Joint gene+TE feature space: genes and TE loci compete for reads probabilistically, preventing systematic misattribution at overlapping loci
Junction-informed priors: splice junction evidence from the BAM drives isoform-level accuracy and helps distinguish genic from TE-derived signal
Locus-level TE resolution: individual TE insertions are quantified — not just subfamily aggregates
Confidence metrics: per-transcript distinguishability scores, assignment entropy, and discord scores for transparent quality assessment
Fast: momentum-accelerated EM typically converges in ~15 iterations; multiprocessing across samples
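
As an illustration of what an assignment-entropy style metric measures, here is a minimal sketch. It assumes the metric is the Shannon entropy of one read's (or equivalence class's) assignment probabilities across candidate features; the README does not give MAJEC's exact definition, and the function name is hypothetical.

```python
import math


def assignment_entropy(probs):
    """Shannon entropy (in bits) of assignment probabilities.

    Assumption: this is a generic entropy calculation, not MAJEC's
    documented formula. Inputs are normalized before use.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:  # zero-probability candidates contribute nothing
            entropy -= q * math.log2(q)
    return entropy


# A read assigned entirely to one feature is unambiguous.
unique = assignment_entropy([1.0])
# A read split evenly between a gene and a TE locus is maximally ambiguous.
ambiguous = assignment_entropy([0.5, 0.5])
```

Under this definition, low entropy means the assignment is confident and high entropy flags features whose counts depend heavily on how ambiguous reads were split.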

Installation

From source

mamba env create -f majec.yml
conda activate majec
pip install -e .

bioconda

conda install bioconda::majec

Dependencies

Installed automatically via the conda environment:

Subread (featureCounts) >= 2.0
samtools >= 1.20
bedtools >= 2.31

Quick Start

1. Precompute Annotations

Combine your gene and TE GTFs into a single precomputed annotation index that the pipeline loads at run time.

majec_precompute_annotations \
    --gene_gtf gencode.v44.annotation.gtf \
    --te_gtf hg38_rmsk_TE.gtf \
    --output my_annotations

This produces my_annotations_annotations.pkl.gz (and optionally _subset_coverage_features.bed if --generate_rescue_features is used).

2. Run the Pipeline

Quantify all your samples jointly in a single command.

majec_run_pipeline \
    --annotation my_annotations_annotations.pkl.gz \
    --bams sample1.bam sample2.bam \
    --prefix my_experiment \
    --paired_end \
    --strandedness 2 \
    --use_subset_penalty \
    --use_junction_completeness \
    --library_type dT \
    --terminal_relax \
    --output_confidence \
    --use_cache

Input BAMs should be coordinate-sorted and produced by a splice-aware aligner such as STAR. For TE quantification, retain multimapping reads during alignment, for example by passing --outFilterMultimapNmax 100 (or a similar value) to STAR.
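
A quick sanity check before running the pipeline is to confirm that each BAM header declares coordinate sorting. The sketch below parses header text such as the output of `samtools view -H sample1.bam` and looks for `SO:coordinate` on the `@HD` line. The function name is hypothetical and is not part of MAJEC; note that this only checks what the header declares, not the actual record order.

```python
def is_coordinate_sorted(header_text):
    """Return True if a SAM/BAM header's @HD line declares SO:coordinate."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs after the record type.
            fields = line.split("\t")[1:]
            tags = dict(f.split(":", 1) for f in fields if ":" in f)
            return tags.get("SO") == "coordinate"
    # No @HD line means the sort order is undeclared.
    return False


# Example header text, as `samtools view -H` would print it.
example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
sorted_ok = is_coordinate_sorted(example_header)
```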

Outputs

MAJEC automatically aggregates locus-level estimates and provides DESeq2-ready matrices at multiple resolutions:

_total_EM_aggregated_counts.tsv — Gene and subfamily-level counts
_total_EM_counts.tsv — Isoform and locus-level counts
_transcript_metrics_SPARSE.tsv.gz — Confidence scores and junction evidence
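
To make the relationship between the locus-level and aggregated tables concrete, here is a minimal sketch of summing per-locus counts into per-subfamily totals. MAJEC performs this aggregation itself; the locus identifiers, the locus-to-subfamily mapping, and the function name below are all hypothetical and do not reflect the actual file layout.

```python
from collections import defaultdict


def aggregate_counts(feature_counts, feature_to_group):
    """Sum per-feature counts into per-group totals for each sample."""
    grouped = defaultdict(lambda: defaultdict(float))
    for feature, per_sample in feature_counts.items():
        group = feature_to_group[feature]
        for sample, count in per_sample.items():
            grouped[group][sample] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {group: dict(samples) for group, samples in grouped.items()}


# Hypothetical locus-level EM counts (fractional counts are expected from EM).
locus_counts = {
    "L1HS_chr1_dup1": {"sample1": 10.5, "sample2": 0.0},
    "L1HS_chr2_dup7": {"sample1": 4.5, "sample2": 3.0},
    "AluY_chr3_dup2": {"sample1": 2.0, "sample2": 8.0},
}
# Hypothetical mapping from each locus to its subfamily.
locus_to_subfamily = {
    "L1HS_chr1_dup1": "L1HS",
    "L1HS_chr2_dup7": "L1HS",
    "AluY_chr3_dup2": "AluY",
}

subfamily_counts = aggregate_counts(locus_counts, locus_to_subfamily)
```

The same pattern applies to rolling transcript-level counts up to genes, given a transcript-to-gene mapping.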

How It Works

1. Read assignment: featureCounts assigns reads to features from the joint gene+TE annotation, forming equivalence classes of reads that map to the same set of transcripts.
2. Junction extraction: splice junctions reported by featureCounts (.jcounts) are matched to annotated transcript structures, providing isoform-discriminating evidence.
3. Prior construction: junction evidence, completeness scores, subset relationships, and (optionally) TSL annotations are combined into per-transcript priors that seed the EM.
4. EM with momentum: the Expectation-Maximization algorithm iteratively refines transcript abundance estimates. Momentum acceleration speeds convergence, typically reaching stable estimates within ~15 iterations.
5. Output: final gene-level, transcript-level, and (if applicable) TE locus-level count matrices, plus optional confidence metrics.
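
The EM step over equivalence classes can be sketched on a toy problem. The code below is a generic illustration, not MAJEC's implementation: it assumes a simple heavy-ball style momentum term (the EM update plus `beta` times the previous change), strictly positive priors used only as the starting point, and hypothetical function names. The real prior models and acceleration scheme are described in the Algorithm Details documentation.

```python
def em_update(theta, classes):
    """One plain EM update over equivalence classes."""
    assigned = [0.0] * len(theta)
    for members, count in classes:
        denom = sum(theta[t] for t in members)
        for t in members:
            # E-step: split the class count in proportion to current abundances.
            assigned[t] += count * theta[t] / denom
    total = sum(assigned)
    # M-step: abundances are the normalized expected counts.
    return [a / total for a in assigned]


def run_em(classes, priors, beta=0.5, tol=1e-8, max_iter=1000):
    """EM with an assumed momentum term; beta=0 gives plain EM."""
    total = sum(priors)
    theta = [p / total for p in priors]
    prev = list(theta)
    for iteration in range(1, max_iter + 1):
        updated = em_update(theta, classes)
        # Extrapolate along the previous change, keeping values positive.
        proposal = [max(u + beta * (t - p), 1e-12)
                    for u, t, p in zip(updated, theta, prev)]
        norm = sum(proposal)
        proposal = [x / norm for x in proposal]
        change = max(abs(a - b) for a, b in zip(proposal, theta))
        prev, theta = theta, proposal
        if change < tol:
            break
    return theta, iteration


# Toy data: feature 0 is a host gene transcript, feature 1 an intronic TE locus.
# Each class is (indices of compatible features, number of reads).
classes = [((0,), 8), ((0, 1), 90), ((1,), 2)]
priors = [1.0, 1.0]

theta_momentum, iters_momentum = run_em(classes, priors, beta=0.5)
theta_plain, iters_plain = run_em(classes, priors, beta=0.0)
```

In this toy case most reads are ambiguous, so plain EM converges slowly and the momentum variant reaches the same estimate (about 0.8 for the gene and 0.2 for the TE) in fewer iterations. The benefit depends on the data: when plain EM already converges quickly, a large `beta` can slow it down.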

Optional Workflow Tools

MAJEC includes additional tools for downstream analysis. See the full documentation for details.

majec_build_db — Consolidate results from one or more pipeline runs into a single, queryable SQLite database. Supports merging chunked runs, attaching sample metadata, and serves as the input for all downstream tools.
majec_prepare_deseq2 — Generate ready-to-run DESeq2 analysis packages directly from the database. Supports flexible sample group definitions, confidence-weighted variance modeling, gene/transcript/differential-splicing analysis levels, and batch correction.
majec_visualize — Generate interactive, multi-panel HTML reports for individual genes. Includes junction arc plots, differential splicing heatmaps, per-sample penalty diagnostics, and optional Excel export.

Key Parameters

Run majec_run_pipeline --help or majec_precompute_annotations --help for the full list of options. The most important flags are shown in the Quick Start above. A few others worth noting:

Flag	Description
`--use_subset_coverage_data`	Use read coverage to inform subset penalties (requires `--generate_rescue_features` during annotation precompute)
`--output_tpm`	Output TPM values in addition to counts
`--light`	Skip prior tracking and confidence metrics for faster, leaner runs
`--use_cache`	Cache featureCounts results to speed up reruns with different parameters
`--config`	Load settings from a JSON file (CLI flags override)
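
The precedence rule for `--config` (file settings first, CLI flags override) can be sketched as a simple merge. This is an illustration only: the JSON key names below are assumed to mirror the CLI flag names, which the README does not confirm, and `merge_settings` is a hypothetical helper rather than a MAJEC function.

```python
import json


def merge_settings(config_text, cli_args):
    """Start from JSON config values and let explicitly set CLI values win."""
    settings = json.loads(config_text)
    for key, value in cli_args.items():
        if value is not None:  # None marks a flag the user did not pass
            settings[key] = value
    return settings


# Hypothetical config contents; key names are assumptions.
config_text = '{"strandedness": 2, "paired_end": true, "prefix": "from_config"}'
# Only --prefix was given on the command line in this example.
cli_args = {"prefix": "my_experiment", "strandedness": None}

merged = merge_settings(config_text, cli_args)
```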

Documentation

BAM Preparation — STAR alignment parameters for TE-aware quantification
Algorithm Details — Two-phase EM, prior adjustment models, momentum acceleration, confidence metrics
Output Files — Column descriptions for all output tables and confidence reports
Post-Quantification Workflows — Database building, DESeq2 integration, and visualization
Usage Guide — Configuration templates, utility tools, resource requirements, and HPC deployment

Citation

If you use MAJEC in your research, please cite:

Lim, T.-Y. & Firestone, A.J. (2026) MAJEC: unified gene, isoform, and locus-level transposable element quantification from RNA-seq. bioRxiv.

License

Apache License 2.0. See LICENSE for details.
