RNA-seq Differential Gene Expression Workflow (Snakemake)

This repository contains a reproducible RNA-seq analysis workflow implemented using Snakemake. The pipeline performs quality control, read trimming, genome alignment, gene quantification, and differential expression analysis between two experimental conditions.

This workflow is a redesigned version of one from the Biomedical Data Analysis Master’s module in Life Science Informatics. Each step runs in its own isolated Conda environment, which keeps the pipeline reproducible and portable across computing platforms.

Workflow Overview

The pipeline performs the following steps:

1. Quality control
   - Tool: FastQC
   - Input: raw FASTQ files
   - Output: HTML quality reports
2. Adapter trimming and quality filtering
   - Tool: Cutadapt
   - Removes low-quality bases and short reads
3. Genome indexing
   - Tool: STAR
   - Generates a genome index from the reference FASTA and GTF files
4. Read alignment
   - Tool: STAR
   - Maps reads to the reference genome
5. Gene-level quantification
   - Tool: featureCounts (Subread)
   - Produces the gene count matrix
6. Differential expression analysis
   - Tool: DESeq2 (R)
   - Identifies significantly up-regulated and down-regulated genes

Workflow Diagram

graph LR
    A[FASTQ files] --> B[FastQC]
    B --> C[Cutadapt]
    C --> D[STAR alignment]
    D --> E[featureCounts]
    E --> F[DESeq2]
    F --> G[Results]
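
The diagram can also be read as a dependency graph, which is how a workflow manager decides what to run first. The sketch below is not taken from the Snakefile: the step names come from the diagram, the "STAR index" node is added from the Workflow Overview list, and the use of Python's `graphlib` is an assumption made for illustration only.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps it depends on.
# Edges follow the diagram above; "STAR index" is an added node
# taken from the Workflow Overview (genome indexing precedes alignment).
DEPENDENCIES = {
    "FastQC": {"FASTQ files"},
    "Cutadapt": {"FastQC"},
    "STAR index": set(),
    "STAR alignment": {"Cutadapt", "STAR index"},
    "featureCounts": {"STAR alignment"},
    "DESeq2": {"featureCounts"},
    "Results": {"DESeq2"},
}


def execution_order(dependencies):
    """Return one valid order in which the steps could run."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(" -> ".join(order))
```

Any order it prints places genome indexing and trimming before alignment, and the results last. Snakemake derives the same kind of ordering from the input and output files declared in each rule.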

Repository Structure

.
├── Snakefile              # Pipeline logic
├── deseq_rScript.r        # Differential expression analysis
├── README.md
├── .gitignore
├── envs/                  # Conda environment definitions
│   ├── fastqc.yaml
│   ├── cutadapt.yaml
│   ├── star.yaml
│   ├── subread.yaml
│   └── deseq2.yaml
├── featureCounts_output.txt
├── deseq2_results.txt
├── deseq2_up.txt
└── deseq2_down.txt

Requirements

Install the following:

- Conda (Miniconda or Anaconda)
- Snakemake

Recommended installation:

conda install -c bioconda -c conda-forge snakemake

Running the Workflow

Place your RNA-seq FASTQ files in the project directory using the naming format:

sample_0.fastq.gz
sample_1.fastq.gz
...
sample_5.fastq.gz

Also provide reference files:
genome.fa
annotation.gtf
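
Before launching the workflow, it may help to confirm that the expected inputs are in place. The following is a minimal sketch and not part of the pipeline: the helper names `expected_inputs` and `missing_inputs` are hypothetical, and the default of six samples assumes the `sample_0` to `sample_5` naming shown above.

```python
from pathlib import Path


def expected_inputs(n_samples=6):
    """List the input file names described above (assumes sample_0..sample_5)."""
    fastq = [f"sample_{i}.fastq.gz" for i in range(n_samples)]
    return fastq + ["genome.fa", "annotation.gtf"]


def missing_inputs(project_dir=".", n_samples=6):
    """Return the expected input files that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in expected_inputs(n_samples)
            if not (root / name).is_file()]


missing = missing_inputs(".")
print("All inputs found" if not missing else f"Missing inputs: {missing}")
```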

Run the workflow:

snakemake --use-conda --cores 4

Snakemake will automatically:

- create the Conda environments
- run each analysis step
- generate the output files
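
To preview what would run before committing compute time, Snakemake offers a dry-run mode. The sketch below assembles the command shown above from Python and optionally adds `--dry-run`; the helper name `snakemake_command` is hypothetical, and the default of 4 cores simply mirrors the example command.

```python
import shlex


def snakemake_command(cores=4, dry_run=False):
    """Build the snakemake invocation as an argument list."""
    cmd = ["snakemake", "--use-conda", "--cores", str(cores)]
    if dry_run:
        # Lists the jobs that would be executed without running them.
        cmd.append("--dry-run")
    return cmd


# Print the command for inspection; pass the list to subprocess.run() to execute it.
print(shlex.join(snakemake_command(dry_run=True)))
```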

Output Files

Main outputs include:

File	Description
featureCounts_output.txt	Gene count matrix
deseq2_results.txt	Full differential expression results
deseq2_up.txt	Up-regulated genes (log2FC ≥ 2)
deseq2_down.txt	Down-regulated genes (log2FC ≤ −2)
fastqc_reports/	Quality control reports
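
The up- and down-regulated lists are defined by the log2 fold-change thresholds in the table. As a rough illustration of that split, and not a copy of what `deseq_rScript.r` does, the sketch below assumes a tab-separated table with `gene` and `log2FoldChange` columns. The column names, the helper `split_by_log2fc`, and the toy values are all assumptions; the actual script may also apply an adjusted p-value cutoff.

```python
import csv
import io


def split_by_log2fc(rows, threshold=2.0, column="log2FoldChange"):
    """Split rows into (up, down) using |log2FC| >= threshold."""
    up, down = [], []
    for row in rows:
        try:
            lfc = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values such as "NA"
        if lfc >= threshold:
            up.append(row)
        elif lfc <= -threshold:
            down.append(row)
    return up, down


# Toy table for illustration only; these are not real results.
toy_table = (
    "gene\tlog2FoldChange\n"
    "geneA\t2.5\n"
    "geneB\t-3.1\n"
    "geneC\t0.4\n"
    "geneD\tNA\n"
)
rows = csv.DictReader(io.StringIO(toy_table), delimiter="\t")
up, down = split_by_log2fc(rows)
print([r["gene"] for r in up], [r["gene"] for r in down])
```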

Reproducibility

Each step runs in an isolated Conda environment defined in:

 envs/*.yaml

This ensures consistent and reproducible results across systems.

Tools Used

- Snakemake
- FastQC
- Cutadapt
- STAR aligner
- Subread (featureCounts)
- DESeq2 (R)

Author

This bioinformatics workflow was developed as part of RNA-seq data analysis training and as a portfolio project.
