Bioinformatics Pipeline

ATAC-seq Cross-Species
Comparative Analysis

A step-by-step tutorial for mapping, classifying, and functionally annotating open-chromatin peaks across human and mouse genomes using HALPER, BEDTools, HOMER, and rGREAT.

🧬

Species

Human (hg38) & Mouse (mm10) ATAC-seq IDR peaks

🔗

Ortholog Mapping

HALtools + HALPER lifts mouse peaks to human coordinates

✂️

Peak Classes

Shared, human-specific, and mouse-specific open-chromatin regions

📈

Functional Analysis

HOMER motifs, enhancer/promoter annotation, and rGREAT GO enrichment

Full Pipeline Flow

🗜️
unzip_narrowPeak.sh
Decompress IDR peaks
🔗
halper_map_peak_orthologs
Mouse → Human coords
✂️
bedtool_evaluation.sh
Intersect & classify
🧩
integrate_halper.sh
Mouse-coord narrowPeak rebuild
🧬
HOMER annotation
annotatePeaks + findMotifs
🧾
extract_promoters_enhancers.sh
Filter + classify regions
🔬
enhancer_promoter_analysis.py
Motif-per-condition lists
📈
rgreat_pipeline.R
GO:BP / MF / CC enrichment
1

Decompress IDR Peak Files

unzip_narrowPeak.sh bash
🧬 What this step does

Decompresses the gzipped idr.conservative_peak.narrowPeak.gz files for both species so downstream tools can read them directly.

ℹ️
Conservative IDR peaks are the high-confidence reproducible peaks obtained by the ATAC-seq processing pipeline. The .narrowPeak format is a 10-column BED extension used by and HOMER.
unzip_narrowPeak.sh
# Decompress mouse peaks
cd "${BASE}/MouseAtac/AdrenalGland/peak/idr_reproducibility"
zcat idr.conservative_peak.narrowPeak.gz > idr.conservative_peak.narrowPeak

# Decompress human peaks
cd "${BASE}/HumanAtac/peak/idr_reproducibility"
zcat idr.conservative_peak.narrowPeak.gz > idr.conservative_peak.narrowPeak
TypePath patternDescription
Input…/idr_reproducibility/idr.conservative_peak.narrowPeak.gzGzipped IDR peaks (mouse & human)
Output…/idr_reproducibility/idr.conservative_peak.narrowPeakPlain-text narrowPeak (10-col BED)
2

Cross-Species Peak Mapping with HALPER

integrate_halper.sh · halper_map_peak_orthologs.sh bash
🔗 What this step does

Uses halLiftover and HALPER (HAL Peak Ortholog Regions) to lift mouse ATAC-seq peaks from mm10 into hg38 coordinates using a whole-genome multiple-alignment (the 10-way .hal file).

  • The -s Mouse -t Human flags specify the source and target species in the alignment tree.
  • HALPER applies filters to retain only peaks with good orthologous coverage (the "conservative" output).
  • A final awk join rebuilds mouse-specific peaks back in original mm10 coordinates using the HAL name field (chr:start-end:summit) as the join key.
integrate_halper.sh – Usage
bash "$SCRIPT_DIR/integrate_halper.sh" \
  --base       "$SCRIPT_DIR" \
  --hal-file   "$HAL_FILE" \
  --halper-map "$HALPER_MAP"
Core HALPER invocation (inside integrate_halper.sh)
export PATH="${HAL_BIN}:${PATH}"
export PYTHONPATH="${HALPER_PP}:${PYTHONPATH:-}"
mkdir -p "${HALPER_OUT}"

bash "${HALPER_MAP_SH}" \
  -b "${MOUSE_CONSERVATIVE_NARROWPEAK}" \  # input mouse peaks
  -o "${HALPER_OUT}/"                    \  # output directory
  -s Mouse                              \  # source species
  -t Human                              \  # target species
  -c "${HAL_FILE}"                       # whole-genome alignment
⚠️
Environment requirement: HALtools binaries must be on $PATH and halLiftover-postprocessing must be on $PYTHONPATH. These are exported directly by the script; the conda environment is activated beforehand by complete_analysis_pipeline.sh.
TypeFileDescription
Inputidr.conservative_peak.narrowPeakMouse peaks (mm10)
Input10plusway-master.halMulti-species Cactus alignment
Outputidr.conservative_peak.MouseToHuman.HALPER.narrowPeak.gzMouse peaks lifted to hg38
3

Peak Classification with BEDTools

bedtool_evaluation.sh bash
✂️ What this step does

Performs three BEDTools intersect operations to classify peaks into three mutually exclusive categories, all in human (hg38) coordinates:

🟢 Shared peaks

Mouse-lifted peaks that overlap human IDR peaks → conserved accessible chromatin

🔵 Human-specific

Human IDR peaks that do not overlap lifted mouse peaks

🟣 Mouse-specific

Lifted mouse peaks that do not overlap human IDR peaks (non-reciprocal)

bedtool_evaluation.sh – key commands
# Sort both files by chromosome then start position
sort -k1,1 -k2,2n mouse_to_human_conservative.narrowPeak \
     > mouse_to_human_conservative.sorted.narrowPeak

sort -k1,1 -k2,2n "${HUMAN_CONSERVATIVE_NARROWPEAK}" \
     > human_conservative.sorted.narrowPeak

# Shared: mouse peaks that overlap human peaks
bedtools intersect \
  -a mouse_to_human_conservative.sorted.narrowPeak \
  -b human_conservative.sorted.narrowPeak \
  -u > shared_peaks_conservative.narrowPeak

# Mouse-specific: lifted mouse peaks NOT in human
bedtools intersect \
  -a mouse_to_human_conservative.sorted.narrowPeak \
  -b human_conservative.sorted.narrowPeak \
  -v > mouse_specific_peaks_conservative.narrowPeak

# Human-specific: human peaks NOT in lifted mouse
bedtools intersect \
  -a human_conservative.sorted.narrowPeak \
  -b mouse_to_human_conservative.sorted.narrowPeak \
  -v > human_specific_peaks_conservative.narrowPeak
🧠
The -u flag reports each query feature once if it overlaps any subject feature (regardless of how many times). The -v flag is the complement — query features with no overlap.
TypeFileDescription
Outputshared_peaks_conservative.narrowPeakShared peaks (human coords, hg38)
Outputhuman_specific_peaks_conservative.narrowPeakHuman-specific peaks (hg38)
Outputmouse_specific_peaks_conservative.narrowPeakMouse-specific peaks (human coords, hg38)
Outputmouse_specific_peaks_conservative.mouse_coords.narrowPeakMouse-specific peaks in mm10 (awk join)
4

HOMER Motif Finding & Peak Annotation

run_homer_on_dir.sh · run_full_annotatePeaks_findMotifs.sh bash SLURM
🧬 What this step does

For each of the three peak sets (shared, human_specific, mouse_specific):

  • Convert narrowPeak → HOMER-compatible 6-column BED (centering each peak at its summit).
  • Run findMotifsGenome.pl — de-novo and known-motif enrichment, genome-masked, 200 bp windows.
  • Run annotatePeaks.pl — annotate each peak with nearest gene, TSS distance, and motif instance hits from the nonRedundant motif set.
  • Genome: hg38 for human/shared peaks, mm10 for mouse-specific peaks (auto-detected from filename).
run_homer_on_dir.sh – summit-centered BED conversion
awk 'BEGIN{OFS="\t"} {
    chr=$1; start=$2; end=$3
    name=$4; score=$5; strand=$6; summit=$10
    if (summit != -1 && summit != ".") {
        center = start + summit
        print chr, center-1, center, name, score, \
              (strand=="." ? "+" : strand)
    } else {
        print chr, start, end, name, score, \
              (strand=="." ? "+" : strand)
    }
}' "$NARROWPEAK_FILE" > "$BED_FILE"
HOMER commands
# [1/2] Find enriched motifs
findMotifsGenome.pl \
    "$BED_FILE" "$GENOME" "$MOTIF_DIR" \
    -size 200 -mask

# [2/2] Annotate peaks with motif instances
annotatePeaks.pl \
    "$BED_FILE" "$GENOME" \
    -m  "${MOTIF_DIR}/nonRedundant.motifs" \
    -mbed "${SAMPLE_DIR}/motif_instances.bed" \
    -size 200 \
    > "$ANNO_FILE"
📂
Output directory structure:
homer_results/
  shared_peaks/
    peaks_homer.bed
    motifs/
      nonRedundant.motifs
      homerResults.html
    annotated_peaks.txt
    motif_instances.bed
  human_specific/  (same layout)
  mouse_specific/  (same layout)
⚠️
This step is submitted as a SLURM job (#SBATCH --cpus-per-task=4 --mem=8000M --time=8:00:00). Ensure HOMER is in $PATH by providing <bin_path> as the second argument.
5

Enhancer & Promoter Extraction

extract_promoters_enhancers.sh · plot_promoters_enhancers.R bash R
🧾 What this step does

Parses each annotated_peaks.txt file with awk to extract a minimal table containing only peaks annotated as enhancers or promoters:

  • Enhancer: intergenic or intronic regions >2 kb from any TSS, or anything already labeled "enhancer" by HOMER.
  • Promoter: any region containing "promoter" in the HOMER annotation.
  • Motif hit columns (Distance From Peak) are preserved for downstream analysis.
  • After filtering, calls plot_promoters_enhancers.R to generate bar charts and motif_table.R to summarize the top motifs.
extract_promoters_enhancers.sh – AWK filter
awk -F'\t' 'BEGIN {OFS="\t"}
NR==1 {
    # Detect column positions dynamically from header
    for (i=2; i<=NF; i++) {
        if ($i=="Annotation")      annot = i
        if ($i=="Distance to TSS") dist  = i
        if ($i ~ /Distance From Peak/) motif_cols[n_motifs++] = i
    }
    # Build output header: PeakID + Annotation + motif cols
    header = "PeakID\tAnnotation"
    for (m=0; m<n_motifs; m++) header = header "\t" $motif_cols[m]
    print header; next
}
{
    new_annot = ""
    # Enhancer: intergenic/intronic AND > 2000 bp from TSS, or labeled enhancer
    if (((tolower($annot) ~ /intergenic/ || tolower($annot) ~ /intron/)
         && $dist >= 2000) || (tolower($annot) ~ /enhancer/))
        new_annot = "enhancer"
    # Promoter: labeled "promoter"
    else if (tolower($annot) ~ /promoter/)
        new_annot = "promoter"
    # Only print classified peaks
    if (new_annot != "") { ... print ... }
}'
TypeFileDescription
Inputhomer_results/*/annotated_peaks.txtFull HOMER annotation (one per peak set)
Outputfiltered_annotations/shared_peaks.txtEnhancer/promoter-filtered rows for the shared peak set
Outputfiltered_annotations/human_specific.txtFiltered annotations for the human-specific peak set
Outputfiltered_annotations/mouse_specific.txtFiltered annotations for the mouse-specific peak set
Plotfiltered_annotations/*.pngEnhancer vs promoter bar charts
Tablefiltered_annotations/*_motif_table.tsvTop-N motifs per sample with annotations
6

Motif Analysis by Biological Condition

enhancer_promoter_analysis.py Python
🔬 What this step does

Reads the filtered_annotations/ tables generated from HOMER and produces motif presence lists per biological assignment condition. A motif is included if it has ≥ 1 non-empty hit in a given peak set and annotation category:

ENHANCERS
  • Human enhancers = shared + human-specific
  • Mouse enhancers = shared + mouse-specific
  • Shared enhancers = shared only
  • Human-specific enhancers
  • Mouse-specific enhancers
PROMOTERS
  • Shared promoters = shared only
  • Human promoters = shared + human-specific
  • Mouse promoters = shared + mouse-specific
enhancer_promoter_analysis.py – Usage
python enhancer_promoter_analysis.py \
  --input-dir  ./filtered_annotations \
  --output-dir ./motif_annotation_split
📂
Output: motif_annotation_split/assignment_conditions/
human_enhancers_motifs.txt
mouse_enhancers_motifs.txt
human_promoters_motifs.txt
mouse_promoters_motifs.txt
shared_promoters_across_species_motifs.txt
shared_enhancers_across_species_motifs.txt
human_specific_enhancers_motifs.txt
mouse_specific_enhancers_motifs.txt
7

GO Enrichment with rGREAT

run_GO_pipeline.sh · rgreat_pipeline.R bash R
📈 What this step does

Runs rGREAT (Genomic Regions Enrichment of Annotations Tool) on the four peak sets resolved by run_GO_pipeline.sh to find enriched Gene Ontology terms:

  • Reads 3-column BED (chr, start, end) from each narrowPeak file; non-standard chromosomes are filtered out.
  • Runs GO enrichment for BP, MF, and CC ontologies.
  • Generates .tsv result tables and publication-style dot-plots (top 20 terms, fold enrichment vs −log₁₀ p-adj).
Peak setGenome
mouse_specificmm10
human_specifichg38
conserved (mouse in human coords)hg38
conserved (human in mouse coords)mm10
run_GO_pipeline.sh – Usage
bash run_GO_pipeline.sh \
  ./filtered_annotations \   # directory scanned for the 4 expected peak files
  ./rGREAT_results           # output directory (default: rGREAT_results)
⚠️
run_GO_pipeline.sh currently locates input files by filename pattern, including *mouse_specific_peaks_conservative*.narrowPeak, *human_specific_peaks_conservative*.narrowPeak, *mouse_to_human_conservative*.narrowPeak, and *human_conservative*.narrowPeak.
rgreat_pipeline.R – core GREAT call
# Read narrowPeak as GRanges (3-col BED, standard chroms only)
gr <- read_bed3(file)

# GO:BP enrichment
job_bp <- great(gr = gr, tss_source = genome, gene_sets = "GO:BP")
tbl_bp <- getEnrichmentTable(job_bp)

# Also run GO:MF and GO:CC
job_mf <- great(gr = gr, tss_source = genome, gene_sets = "GO:MF")
job_cc <- great(gr = gr, tss_source = genome, gene_sets = "GO:CC")

# Publication dot-plot (top 20 terms)
p <- ggplot(plot_tbl, aes(x = fold_enrichment, y = description,
          size = observed_region_hits, fill = log10_p_adjust)) +
  geom_point(alpha = 0.9, shape = 21) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1) +
  theme_classic(base_size = 14)
📈
Output files in rGREAT_results/:
  • mouse_specific_GO_BP.tsv (and MF, CC variants)
  • human_specific_GO_BP.tsv
  • conserved_*_GO_BP.tsv
  • ALL_significant_terms_BP.tsv — merged table of all significant terms (p_adj < 0.05)
  • *_GO_BP_dotplot.png — publication-ready figures
⚙️

Quickstart — Run the Full Pipeline

complete_analysis_pipeline.sh bash SLURM
ℹ️
The master script complete_analysis_pipeline.sh orchestrates mapping, HOMER annotation, promoter/enhancer splitting, motif summarization, and GO analysis end-to-end. Submit it to SLURM with the two required arguments.
One-command submission
cd RUN_FULL_PIPELINE

sbatch complete_analysis_pipeline.sh \
  /path/to/10plusway-master.hal \            # ARG 1: .hal alignment file
  /path/to/halper_map_peak_orthologs.sh      # ARG 2: HALPER mapping script
Before you submit
  • Run from the RUN_FULL_PIPELINE/ directory, because the script uses ${SLURM_SUBMIT_DIR} as its base path.
  • Place HOMER executables under RUN_FULL_PIPELINE/bin/; the pipeline now picks up BIN_PATH="$SCRIPT_DIR/bin" automatically.
  • Keep the repository helper scripts in the same directory as complete_analysis_pipeline.sh.
  • Ensure the HAL alignment file and halper_map_peak_orthologs.sh path are absolute or resolvable from the submit location.
Pipeline order
  • integrate_halper.sh
  • run_full_annotatePeaks_findMotifs.sh
  • enhancer_promoter_analysis.py
  • run_GO_pipeline.sh
Intermediate directories are created relative to the submission directory.
⚙️ SLURM resource request
ParameterValueNote
CPUs4Matches --cpus-per-task=4 in the script header
Memory8 GBMatches --mem=8000M
Wall time18 hMatches --time=18:00:00; HALPER and HOMER remain the slowest stages
Accountbio230007pPSC Bridges-2 allocation
🗂️ Expected output layout after full run
$WORKDIR/
├── bin/
│   ├── annotatePeaks.pl
│   └── findMotifsGenome.pl
├── narrowPeak_for_homer/
│   ├── shared_peaks.narrowPeak
│   ├── human_specific.narrowPeak
│   └── mouse_specific.narrowPeak
├── homer_results/
│   ├── shared_peaks/
│   │   ├── annotated_peaks.txt
│   │   ├── motif_instances.bed
│   │   └── motifs/
│   │       ├── nonRedundant.motifs
│   │       └── homerResults.html
│   ├── human_specific/     (same layout)
│   └── mouse_specific/     (same layout)
├── filtered_annotations/
│   ├── shared_peaks.txt
│   ├── human_specific.txt
│   ├── mouse_specific.txt
│   ├── *.png               ← enhancer/promoter bar charts
│   └── *_motif_table.tsv   ← top motifs per sample
├── motif_annotation_split/
│   └── assignment_conditions/
│       ├── human_enhancers_motifs.txt
│       ├── mouse_enhancers_motifs.txt
│       └── …
└── rGREAT_results/
    ├── *_GO_BP.tsv
    ├── *_GO_MF.tsv
    ├── *_GO_CC.tsv
    ├── ALL_significant_terms_BP.tsv
    └── *_dotplot.png

Dependencies & Requirements

⚙️ Software
  • HALtools — whole-genome alignment tools (conda env)
  • HALPER — halLiftover-postprocessing suite
  • BEDTools ≥ 2.29 (module load bedtools)
  • HOMERannotatePeaks.pl, findMotifsGenome.pl, placed under RUN_FULL_PIPELINE/bin/
  • Python ≥ 3.8 (stdlib only)
  • R ≥ 4.1 with rGREAT, ggplot2, tidyverse
🧬 R packages (auto-installed)
  • rGREAT — Bioconductor GO enrichment
  • GenomicRanges / IRanges
  • TxDb.Mmusculus.UCSC.mm10.knownGene
  • ggplot2, dplyr, stringr
  • RColorBrewer, tidyverse