ATAC-seq Cross-Species Analysis Pipeline Tutorial

1

Decompress IDR Peak Files

unzip_narrowPeak.sh bash

🧬 What this step does

Decompresses the gzipped idr.conservative_peak.narrowPeak.gz files for both species so downstream tools can read them directly.

ℹ️

Conservative IDR peaks are the high-confidence reproducible peaks obtained by the ATAC-seq processing pipeline. The .narrowPeak format is a 10-column BED extension used by and HOMER.

unzip_narrowPeak.sh

# Decompress mouse peaks
cd "${BASE}/MouseAtac/AdrenalGland/peak/idr_reproducibility"
zcat idr.conservative_peak.narrowPeak.gz > idr.conservative_peak.narrowPeak

# Decompress human peaks
cd "${BASE}/HumanAtac/peak/idr_reproducibility"
zcat idr.conservative_peak.narrowPeak.gz > idr.conservative_peak.narrowPeak

Type	Path pattern	Description
Input	`…/idr_reproducibility/idr.conservative_peak.narrowPeak.gz`	Gzipped IDR peaks (mouse & human)
Output	`…/idr_reproducibility/idr.conservative_peak.narrowPeak`	Plain-text narrowPeak (10-col BED)

2

Cross-Species Peak Mapping with HALPER

integrate_halper.sh · halper_map_peak_orthologs.sh bash

🔗 What this step does

Uses halLiftover and HALPER (HAL Peak Ortholog Regions) to lift mouse ATAC-seq peaks from mm10 into hg38 coordinates using a whole-genome multiple-alignment (the 10-way .hal file).

The -s Mouse -t Human flags specify the source and target species in the alignment tree.
HALPER applies filters to retain only peaks with good orthologous coverage (the "conservative" output).
A final awk join rebuilds mouse-specific peaks back in original mm10 coordinates using the HAL name field (chr:start-end:summit) as the join key.

integrate_halper.sh – Usage

bash "$SCRIPT_DIR/integrate_halper.sh" \
  --base       "$SCRIPT_DIR" \
  --hal-file   "$HAL_FILE" \
  --halper-map "$HALPER_MAP"

Core HALPER invocation (inside integrate_halper.sh)

export PATH="${HAL_BIN}:${PATH}"
export PYTHONPATH="${HALPER_PP}:${PYTHONPATH:-}"
mkdir -p "${HALPER_OUT}"

bash "${HALPER_MAP_SH}" \
  -b "${MOUSE_CONSERVATIVE_NARROWPEAK}" \  # input mouse peaks
  -o "${HALPER_OUT}/"                    \  # output directory
  -s Mouse                              \  # source species
  -t Human                              \  # target species
  -c "${HAL_FILE}"                       # whole-genome alignment

⚠️

Environment requirement: HALtools binaries must be on $PATH and halLiftover-postprocessing must be on $PYTHONPATH. These are exported directly by the script; the conda environment is activated beforehand by complete_analysis_pipeline.sh.

Type	File	Description
Input	`idr.conservative_peak.narrowPeak`	Mouse peaks (mm10)
Input	`10plusway-master.hal`	Multi-species Cactus alignment
Output	`idr.conservative_peak.MouseToHuman.HALPER.narrowPeak.gz`	Mouse peaks lifted to hg38

3

Peak Classification with BEDTools

bedtool_evaluation.sh bash

✂️ What this step does

Performs three BEDTools intersect operations to classify peaks into three mutually exclusive categories, all in human (hg38) coordinates:

🟢 Shared peaks

Mouse-lifted peaks that overlap human IDR peaks → conserved accessible chromatin

🔵 Human-specific

Human IDR peaks that do not overlap lifted mouse peaks

🟣 Mouse-specific

Lifted mouse peaks that do not overlap human IDR peaks (non-reciprocal)

bedtool_evaluation.sh – key commands

# Sort both files by chromosome then start position
sort -k1,1 -k2,2n mouse_to_human_conservative.narrowPeak \
     > mouse_to_human_conservative.sorted.narrowPeak

sort -k1,1 -k2,2n "${HUMAN_CONSERVATIVE_NARROWPEAK}" \
     > human_conservative.sorted.narrowPeak

# Shared: mouse peaks that overlap human peaks
bedtools intersect \
  -a mouse_to_human_conservative.sorted.narrowPeak \
  -b human_conservative.sorted.narrowPeak \
  -u > shared_peaks_conservative.narrowPeak

# Mouse-specific: lifted mouse peaks NOT in human
bedtools intersect \
  -a mouse_to_human_conservative.sorted.narrowPeak \
  -b human_conservative.sorted.narrowPeak \
  -v > mouse_specific_peaks_conservative.narrowPeak

# Human-specific: human peaks NOT in lifted mouse
bedtools intersect \
  -a human_conservative.sorted.narrowPeak \
  -b mouse_to_human_conservative.sorted.narrowPeak \
  -v > human_specific_peaks_conservative.narrowPeak

🧠

The -u flag reports each query feature once if it overlaps any subject feature (regardless of how many times). The -v flag is the complement — query features with no overlap.

Type	File	Description
Output	`shared_peaks_conservative.narrowPeak`	Shared peaks (human coords, hg38)
Output	`human_specific_peaks_conservative.narrowPeak`	Human-specific peaks (hg38)
Output	`mouse_specific_peaks_conservative.narrowPeak`	Mouse-specific peaks (human coords, hg38)
Output	`mouse_specific_peaks_conservative.mouse_coords.narrowPeak`	Mouse-specific peaks in mm10 (awk join)

4

HOMER Motif Finding & Peak Annotation

run_homer_on_dir.sh · run_full_annotatePeaks_findMotifs.sh bash SLURM

🧬 What this step does

For each of the three peak sets (shared, human_specific, mouse_specific):

Convert narrowPeak → HOMER-compatible 6-column BED (centering each peak at its summit).
Run findMotifsGenome.pl — de-novo and known-motif enrichment, genome-masked, 200 bp windows.
Run annotatePeaks.pl — annotate each peak with nearest gene, TSS distance, and motif instance hits from the nonRedundant motif set.
Genome: hg38 for human/shared peaks, mm10 for mouse-specific peaks (auto-detected from filename).

run_homer_on_dir.sh – summit-centered BED conversion

awk 'BEGIN{OFS="\t"} {
    chr=$1; start=$2; end=$3
    name=$4; score=$5; strand=$6; summit=$10
    if (summit != -1 && summit != ".") {
        center = start + summit
        print chr, center-1, center, name, score, \
              (strand=="." ? "+" : strand)
    } else {
        print chr, start, end, name, score, \
              (strand=="." ? "+" : strand)
    }
}' "$NARROWPEAK_FILE" > "$BED_FILE"

HOMER commands

# [1/2] Find enriched motifs
findMotifsGenome.pl \
    "$BED_FILE" "$GENOME" "$MOTIF_DIR" \
    -size 200 -mask

# [2/2] Annotate peaks with motif instances
annotatePeaks.pl \
    "$BED_FILE" "$GENOME" \
    -m  "${MOTIF_DIR}/nonRedundant.motifs" \
    -mbed "${SAMPLE_DIR}/motif_instances.bed" \
    -size 200 \
    > "$ANNO_FILE"

📂

Output directory structure:

homer_results/
  shared_peaks/
    peaks_homer.bed
    motifs/
      nonRedundant.motifs
      homerResults.html
    annotated_peaks.txt
    motif_instances.bed
  human_specific/  (same layout)
  mouse_specific/  (same layout)

⚠️

This step is submitted as a SLURM job (#SBATCH --cpus-per-task=4 --mem=8000M --time=8:00:00). Ensure HOMER is in $PATH by providing <bin_path> as the second argument.

5

Enhancer & Promoter Extraction

extract_promoters_enhancers.sh · plot_promoters_enhancers.R bash R

🧾 What this step does

Parses each annotated_peaks.txt file with awk to extract a minimal table containing only peaks annotated as enhancers or promoters:

Enhancer: intergenic or intronic regions >2 kb from any TSS, or anything already labeled "enhancer" by HOMER.
Promoter: any region containing "promoter" in the HOMER annotation.
Motif hit columns (Distance From Peak) are preserved for downstream analysis.
After filtering, calls plot_promoters_enhancers.R to generate bar charts and motif_table.R to summarize the top motifs.

extract_promoters_enhancers.sh – AWK filter

awk -F'\t' 'BEGIN {OFS="\t"}
NR==1 {
    # Detect column positions dynamically from header
    for (i=2; i<=NF; i++) {
        if ($i=="Annotation")      annot = i
        if ($i=="Distance to TSS") dist  = i
        if ($i ~ /Distance From Peak/) motif_cols[n_motifs++] = i
    }
    # Build output header: PeakID + Annotation + motif cols
    header = "PeakID\tAnnotation"
    for (m=0; m<n_motifs; m++) header = header "\t" $motif_cols[m]
    print header; next
}
{
    new_annot = ""
    # Enhancer: intergenic/intronic AND > 2000 bp from TSS, or labeled enhancer
    if (((tolower($annot) ~ /intergenic/ || tolower($annot) ~ /intron/)
         && $dist >= 2000) || (tolower($annot) ~ /enhancer/))
        new_annot = "enhancer"
    # Promoter: labeled "promoter"
    else if (tolower($annot) ~ /promoter/)
        new_annot = "promoter"
    # Only print classified peaks
    if (new_annot != "") { ... print ... }
}'

Type	File	Description
Input	`homer_results/*/annotated_peaks.txt`	Full HOMER annotation (one per peak set)
Output	`filtered_annotations/shared_peaks.txt`	Enhancer/promoter-filtered rows for the shared peak set
Output	`filtered_annotations/human_specific.txt`	Filtered annotations for the human-specific peak set
Output	`filtered_annotations/mouse_specific.txt`	Filtered annotations for the mouse-specific peak set
Plot	`filtered_annotations/*.png`	Enhancer vs promoter bar charts
Table	`filtered_annotations/*_motif_table.tsv`	Top-N motifs per sample with annotations

6

Motif Analysis by Biological Condition

enhancer_promoter_analysis.py Python

🔬 What this step does

Reads the filtered_annotations/ tables generated from HOMER and produces motif presence lists per biological assignment condition. A motif is included if it has ≥ 1 non-empty hit in a given peak set and annotation category:

ENHANCERS

Human enhancers = shared + human-specific
Mouse enhancers = shared + mouse-specific
Shared enhancers = shared only
Human-specific enhancers
Mouse-specific enhancers

PROMOTERS

Shared promoters = shared only
Human promoters = shared + human-specific
Mouse promoters = shared + mouse-specific

enhancer_promoter_analysis.py – Usage

python enhancer_promoter_analysis.py \
  --input-dir  ./filtered_annotations \
  --output-dir ./motif_annotation_split

📂

Output: motif_annotation_split/assignment_conditions/

human_enhancers_motifs.txt
mouse_enhancers_motifs.txt
human_promoters_motifs.txt
mouse_promoters_motifs.txt
shared_promoters_across_species_motifs.txt
shared_enhancers_across_species_motifs.txt
human_specific_enhancers_motifs.txt
mouse_specific_enhancers_motifs.txt

7

GO Enrichment with rGREAT

run_GO_pipeline.sh · rgreat_pipeline.R bash R

📈 What this step does

Runs rGREAT (Genomic Regions Enrichment of Annotations Tool) on the four peak sets resolved by run_GO_pipeline.sh to find enriched Gene Ontology terms:

Reads 3-column BED (chr, start, end) from each narrowPeak file; non-standard chromosomes are filtered out.
Runs GO enrichment for BP, MF, and CC ontologies.
Generates .tsv result tables and publication-style dot-plots (top 20 terms, fold enrichment vs −log₁₀ p-adj).

Peak set	Genome
mouse_specific	mm10
human_specific	hg38
conserved (mouse in human coords)	hg38
conserved (human in mouse coords)	mm10

run_GO_pipeline.sh – Usage

bash run_GO_pipeline.sh \
  ./filtered_annotations \   # directory scanned for the 4 expected peak files
  ./rGREAT_results           # output directory (default: rGREAT_results)

⚠️

run_GO_pipeline.sh currently locates input files by filename pattern, including *mouse_specific_peaks_conservative*.narrowPeak, *human_specific_peaks_conservative*.narrowPeak, *mouse_to_human_conservative*.narrowPeak, and *human_conservative*.narrowPeak.

rgreat_pipeline.R – core GREAT call

# Read narrowPeak as GRanges (3-col BED, standard chroms only)
gr <- read_bed3(file)

# GO:BP enrichment
job_bp <- great(gr = gr, tss_source = genome, gene_sets = "GO:BP")
tbl_bp <- getEnrichmentTable(job_bp)

# Also run GO:MF and GO:CC
job_mf <- great(gr = gr, tss_source = genome, gene_sets = "GO:MF")
job_cc <- great(gr = gr, tss_source = genome, gene_sets = "GO:CC")

# Publication dot-plot (top 20 terms)
p <- ggplot(plot_tbl, aes(x = fold_enrichment, y = description,
          size = observed_region_hits, fill = log10_p_adjust)) +
  geom_point(alpha = 0.9, shape = 21) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1) +
  theme_classic(base_size = 14)

📈

Output files in rGREAT_results/:

mouse_specific_GO_BP.tsv (and MF, CC variants)
human_specific_GO_BP.tsv
conserved_*_GO_BP.tsv
ALL_significant_terms_BP.tsv — merged table of all significant terms (p_adj < 0.05)
*_GO_BP_dotplot.png — publication-ready figures

⚙️

Quickstart — Run the Full Pipeline

complete_analysis_pipeline.sh bash SLURM

ℹ️

The master script complete_analysis_pipeline.sh orchestrates mapping, HOMER annotation, promoter/enhancer splitting, motif summarization, and GO analysis end-to-end. Submit it to SLURM with the two required arguments.

One-command submission

cd RUN_FULL_PIPELINE

sbatch complete_analysis_pipeline.sh \
  /path/to/10plusway-master.hal \            # ARG 1: .hal alignment file
  /path/to/halper_map_peak_orthologs.sh      # ARG 2: HALPER mapping script

Before you submit

Run from the RUN_FULL_PIPELINE/ directory, because the script uses ${SLURM_SUBMIT_DIR} as its base path.
Place HOMER executables under RUN_FULL_PIPELINE/bin/; the pipeline now picks up BIN_PATH="$SCRIPT_DIR/bin" automatically.
Keep the repository helper scripts in the same directory as complete_analysis_pipeline.sh.
Ensure the HAL alignment file and halper_map_peak_orthologs.sh path are absolute or resolvable from the submit location.

Pipeline order

integrate_halper.sh
run_full_annotatePeaks_findMotifs.sh
enhancer_promoter_analysis.py
run_GO_pipeline.sh

Intermediate directories are created relative to the submission directory.

⚙️ SLURM resource request

Parameter	Value	Note
CPUs	4	Matches `--cpus-per-task=4` in the script header
Memory	8 GB	Matches `--mem=8000M`
Wall time	18 h	Matches `--time=18:00:00`; HALPER and HOMER remain the slowest stages
Account	bio230007p	PSC Bridges-2 allocation

🗂️ Expected output layout after full run

$WORKDIR/
├── bin/
│   ├── annotatePeaks.pl
│   └── findMotifsGenome.pl
├── narrowPeak_for_homer/
│   ├── shared_peaks.narrowPeak
│   ├── human_specific.narrowPeak
│   └── mouse_specific.narrowPeak
├── homer_results/
│   ├── shared_peaks/
│   │   ├── annotated_peaks.txt
│   │   ├── motif_instances.bed
│   │   └── motifs/
│   │       ├── nonRedundant.motifs
│   │       └── homerResults.html
│   ├── human_specific/     (same layout)
│   └── mouse_specific/     (same layout)
├── filtered_annotations/
│   ├── shared_peaks.txt
│   ├── human_specific.txt
│   ├── mouse_specific.txt
│   ├── *.png               ← enhancer/promoter bar charts
│   └── *_motif_table.tsv   ← top motifs per sample
├── motif_annotation_split/
│   └── assignment_conditions/
│       ├── human_enhancers_motifs.txt
│       ├── mouse_enhancers_motifs.txt
│       └── …
└── rGREAT_results/
    ├── *_GO_BP.tsv
    ├── *_GO_MF.tsv
    ├── *_GO_CC.tsv
    ├── ALL_significant_terms_BP.tsv
    └── *_dotplot.png

Dependencies & Requirements

⚙️ Software

HALtools — whole-genome alignment tools (conda env)
HALPER — halLiftover-postprocessing suite
BEDTools ≥ 2.29 (module load bedtools)
HOMER — annotatePeaks.pl, findMotifsGenome.pl, placed under RUN_FULL_PIPELINE/bin/
Python ≥ 3.8 (stdlib only)
R ≥ 4.1 with rGREAT, ggplot2, tidyverse

🧬 R packages (auto-installed)

rGREAT — Bioconductor GO enrichment
GenomicRanges / IRanges
TxDb.Mmusculus.UCSC.mm10.knownGene
ggplot2, dplyr, stringr
RColorBrewer, tidyverse

ATAC-seq Cross-Species
Comparative Analysis

Species

Ortholog Mapping

Peak Classes

Functional Analysis

Full Pipeline Flow

Decompress IDR Peak Files

Cross-Species Peak Mapping with HALPER

Peak Classification with BEDTools

HOMER Motif Finding & Peak Annotation

Enhancer & Promoter Extraction

Motif Analysis by Biological Condition

GO Enrichment with rGREAT

Quickstart — Run the Full Pipeline

Dependencies & Requirements

ATAC-seq Cross-SpeciesComparative Analysis

Species

Ortholog Mapping

Peak Classes

Functional Analysis

Full Pipeline Flow

Decompress IDR Peak Files

Cross-Species Peak Mapping with HALPER

Peak Classification with BEDTools

HOMER Motif Finding & Peak Annotation

Enhancer & Promoter Extraction

Motif Analysis by Biological Condition

GO Enrichment with rGREAT

Quickstart — Run the Full Pipeline

Dependencies & Requirements

ATAC-seq Cross-Species
Comparative Analysis