1 Introduction

scPipe is a package initially designed to process single-cell RNA-sequencing (scRNA-seq) data generated by different protocols. We have modified it to accommodate pre-processing capability of single-cell ATAC-Seq (Assay for Transposase-Accessible Chromatin using sequencing) data pre-processing. scPipe ATAC-Seq module is designed for protocols without UMIs, but can also adapt to any UMI protocols if the need arise.

scPipe ATAC-Seq module consists of two major components. The first is data pre-processing with barcodes as raw fastq as input and a feature count matrix as output. Second is the data pre-processing with barcode CSV file as input and a feature count matrix as output.

10X ATAC method currently is the most popular method to generate scATAC-Seq data with higher sensitivity and lower cost. The structure of the 10X ATAC library is shown below.

The output fastq files from a 10X ATAC experiment is paired-ended and data is contained within both reads.

2 Workflow

The pipeline is visually described via the workflow diagram depicted below. The functions will be explained in greater detail.

3 Getting started

It is not mandatory to specify an output folder even though it can be specified. If no output_folder is defined a folder named scPipe-atac-output will get created in the working directory.

We begin by loading the library.

library(scPipe)
data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE)

Specify the output folder.

output_folder <- paste0(tempdir(), "/scPipeATACVignette")

4 Fastq reformatting

To process the data, we need the fastq files (both compressed and uncompressed versions are accepted) and a cell barcode annotation. We have supplied very small sample FASTQ files of chr21. The barcode annotation could be either in a .fastq format or a .csv file with at least two columns, where the first column has the cell id and the second column contains the barcode sequence. All these files can be found in the data folder of the scPipe package:

# data fastq files
r1      <- file.path(data.folder, "small_chr21_R1.fastq.gz") 
r2      <- file.path(data.folder, "small_chr21_R3.fastq.gz") 

# barcodes in fastq format
barcode_fastq      <- file.path(data.folder, "small_chr21_R2.fastq.gz") 

# barcodes in .csv format
barcode_1000       <- file.path(data.folder, "chr21_modified_barcode_1000.csv")

The pipeline starts with fastq file reformatting. We move the barcode and UMI sequences (if available) to the read name and leave the transcript sequence as is. This outputs a read name that looks like @[barcode_sequence]_[UMI_sequence]#[readname]. Usually the scATAC-Seq data is paired-end and a 16bp long barcode is located on both reads. Here the barcode information is located on a separate fastq file and the length of the barcode fastq file matches the length of the reads files. Therefore, you need a minimal example like below to generate the output.

sc_atac_trim_barcode (r1            = r1, 
                      r2            = r2, 
                      bc_file       = barcode_fastq, 
                      rmN           = FALSE,
                      rmlow         = FALSE,
                      output_folder = output_folder)
## Output Directory Does Not Exist. Created Directory: /tmp/Rtmpwxfbe7/scPipeATACVignette
## Saving the output at location: 
## /tmp/Rtmpwxfbe7/scPipeATACVignette
## No valid_barcode_file provided; no barcode error correction will occur.
## Total reads: 35000
## Total reads removed due to N's in barcodes: 0
## Total reads removed due to low quality barcodes: 0
## Total barcodes provided in FASTQ file: 16567
## time elapsed: 326 milliseconds

This generates two output fastq files that are appended by the prefix demux_ to notify that the new files are the reformatted (a.k.a. demultiplexed) .fastq files. These will get saved in the scPipe-atac-output directory if the user has not specified an output_folder.

However, if the barcodes are in the form of a .csv file, some extra information on 0-indexed barcode start (id1_st, id2_st) and the barcode length (id1_len, id2_len) are also required to be entered by the user. The algorithm is flexible to do a “look-around” to identify whether the correct parameters are used for a subset of data (hence saving time) and report back to the console if it believes the barcode position is incorrect and/or should be shifted.

sc_atac_trim_barcode (r1            = r1, 
                      r2            = r2, 
                      bc_file       = barcode_1000, 
                      id1_st        = -1,  
                      id1_len       = -1,  
                      id2_st        = -1,  
                      id2_len       = -10,  
                      output_folder = output_folder, 
                      rmN           = TRUE)

To accommodate combinatorial indexing in some scATAC-seq protocols, the bc_file parameter will accept a list of barcode files as well (currently only implemented for the fastq approach).

Additionally, rmN, rmlow and min_qual parameters define the quality thresholds for the reads. If there are Ns in the barcode or UMI positions those will be discarded by rmN = TRUE. rmlow =TRUE will remove reads having lower quality than what is defined by min_qual (default: 20).

Completion of this function will output three different outputs depending on the findings; * complete matches: When the barcode is completely matched and identified in the correct position * partial matches: When the barcode is identified in the location specified but corrected with hamming distance approach * unmatched: no barcode match is found in the given position even after hamming distance corrections are applied

NOTE: we use a zero based index system, so the indexing of the sequence starts at zero.

5 Aligning reads to a reference genome

Next, we align reads to the genome. This example uses Rsubread but any aligner that support RNA-seq alignment and gives standard BAM output can be used here.

demux_r1        <- file.path(output_folder, "demux_completematch_small_chr21_R1.fastq.gz")
demux_r2        <- file.path(output_folder, "demux_completematch_small_chr21_R3.fastq.gz")
reference = file.path(data.folder, "small_chr21.fa")


aligned_bam <- sc_aligning(ref = reference, 
                R1 = demux_r1, 
                R2 = demux_r2, 
                nthreads  = 12,
                output_folder = output_folder)
## ATAC-Seq mode is selected...
## Genome index location not specified. Looking for the index in/tmp/Rtmpwxfbe7/scPipeATACVignette
## Genome index not found. Creating one in /tmp/Rtmpwxfbe7/scPipeATACVignette ...
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.16.0
## 
## //================================= setting ==================================\\
## ||                                                                            ||
## ||                Index name : genome_index                                   ||
## ||               Index space : base space                                     ||
## ||               Index split : no-split                                       ||
## ||          Repeat threshold : 100 repeats                                    ||
## ||              Gapped index : no                                             ||
## ||                                                                            ||
## ||       Free / total memory : 85.3GB / 125.4GB                               ||
## ||                                                                            ||
## ||               Input files : 1 file in total                                ||
## ||                             o small_chr21.fa                               ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Check the integrity of provided reference sequences ...                    ||
## || No format issues were found                                                ||
## || Scan uninformative subreads in reference sequences ...                     ||
## || 3 uninformative subreads were found.                                       ||
## || These subreads were excluded from index building.                          ||
## || Estimate the index size...                                                 ||
## ||    8%,   0 mins elapsed, rate=405.5k bps/s                                 ||
## ||   16%,   0 mins elapsed, rate=741.6k bps/s                                 ||
## ||   24%,   0 mins elapsed, rate=1070.0k bps/s                                ||
## ||   33%,   0 mins elapsed, rate=1341.0k bps/s                                ||
## ||   41%,   0 mins elapsed, rate=1616.2k bps/s                                ||
## ||   49%,   0 mins elapsed, rate=1832.3k bps/s                                ||
## ||   58%,   0 mins elapsed, rate=2028.5k bps/s                                ||
## ||   66%,   0 mins elapsed, rate=2249.8k bps/s                                ||
## ||   74%,   0 mins elapsed, rate=2410.0k bps/s                                ||
## ||   83%,   0 mins elapsed, rate=2603.9k bps/s                                ||
## ||   91%,   0 mins elapsed, rate=2824.6k bps/s                                ||
## || 3.0 GB of memory is needed for index building.                             ||
## || Build the index...                                                         ||
## ||    8%,   0 mins elapsed, rate=44.7k bps/s                                  ||
## ||   16%,   0 mins elapsed, rate=88.0k bps/s                                  ||
## ||   24%,   0 mins elapsed, rate=130.6k bps/s                                 ||
## ||   33%,   0 mins elapsed, rate=171.2k bps/s                                 ||
## ||   41%,   0 mins elapsed, rate=211.8k bps/s                                 ||
## ||   49%,   0 mins elapsed, rate=250.0k bps/s                                 ||
## ||   58%,   0 mins elapsed, rate=287.1k bps/s                                 ||
## ||   66%,   0 mins elapsed, rate=325.0k bps/s                                 ||
## ||   74%,   0 mins elapsed, rate=359.8k bps/s                                 ||
## ||   83%,   0 mins elapsed, rate=395.9k bps/s                                 ||
## ||   91%,   0 mins elapsed, rate=433.2k bps/s                                 ||
## || Save current index block...                                                ||
## ||  [ 0.0% finished ]                                                         ||
## ||  [ 10.0% finished ]                                                        ||
## ||  [ 20.0% finished ]                                                        ||
## ||  [ 30.0% finished ]                                                        ||
## ||  [ 40.0% finished ]                                                        ||
## ||  [ 50.0% finished ]                                                        ||
## ||  [ 60.0% finished ]                                                        ||
## ||  [ 70.0% finished ]                                                        ||
## ||  [ 80.0% finished ]                                                        ||
## ||  [ 90.0% finished ]                                                        ||
## ||  [ 100.0% finished ]                                                       ||
## ||                                                                            ||
## ||                      Total running time: 0.2 minutes.                      ||
## |Index /tmp/Rtmpwxfbe7/scPipeATACVignette/genome_index was successfully built. ||
## ||                                                                            ||
## \\============================================================================//
## No partial matches, checking for reads with non-matched barcodes.
## No reads found with non-matching barcodes.
## Outputted demultiplexing stats file to/tmp/Rtmpwxfbe7/scPipeATACVignette/scPipe_atac_stats//demultiplexing_stats.csv
## Output file name is not provided. Aligned reads are saved in /tmp/Rtmpwxfbe7/scPipeATACVignette/demux_completematch_small_chr21_R1_aligned.bam
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.16.0
## 
## //================================= setting ==================================\\
## ||                                                                            ||
## || Function      : Read alignment (DNA-Seq)                                   ||
## || Input file 1  : demux_completematch_small_chr21_R1.fastq.gz                ||
## || Input file 2  : demux_completematch_small_chr21_R3.fastq.gz                ||
## || Output file   : demux_completematch_small_chr21_R1_aligned.bam (BAM),  ... ||
## || Index name    : genome_index                                               ||
## ||                                                                            ||
## ||                    ------------------------------------                    ||
## ||                                                                            ||
## ||                               Threads : 12                                 ||
## ||                          Phred offset : 33                                 ||
## ||               # of extracted subreads : 10                                 ||
## ||                        Min read1 vote : 3                                  ||
## ||                        Min read2 vote : 1                                  ||
## ||                     Max fragment size : 600                                ||
## ||                     Min fragment size : 50                                 ||
## ||                        Max mismatches : 3                                  ||
## ||                      Max indel length : 5                                  ||
## ||            Report multi-mapping reads : yes                                ||
## || Max alignments per multi-mapping read : 1                                  ||
## ||                                                                            ||
## \\============================================================================//
## 
## //=============== Running (24-Oct-2023 19:00:16, pid=3734380) ================\\
## ||                                                                            ||
## || Check the input reads.                                                     ||
## || The input file contains base space reads.                                  ||
## || Initialise the memory objects.                                             ||
## || Estimate the mean read length.                                             ||
## || The range of Phred scores observed in the data is [2,37]                   ||
## || Create the output BAM file.                                                ||
## || Check the index.                                                           ||
## || Init the voting space.                                                     ||
## || Global environment is initialised.                                         ||
## || Load the 1-th index block...                                               ||
## || The index block has been loaded.                                           ||
## || Start read mapping in chunk.                                               ||
## ||   Estimated fragment length : 123 bp                                       ||
## ||                                                                            ||
## ||                           Completed successfully.                          ||
## ||                                                                            ||
## \\====================================    ====================================//
## 
## //================================   Summary =================================\\
## ||                                                                            ||
## ||             Total fragments : 35,000                                       ||
## ||                      Mapped : 18,851 (53.9%)                               ||
## ||             Uniquely mapped : 18,057                                       ||
## ||               Multi-mapping : 794                                          ||
## ||                                                                            ||
## ||                    Unmapped : 16,149                                       ||
## ||                                                                            ||
## ||             Properly paired : 14,524                                       ||
## ||         Not properly paired : 4,327                                        ||
## ||                   Singleton : 1,458                                        ||
## ||                    Chimeric : 0                                            ||
## ||       Unexpected strandness : 114                                          ||
## ||  Unexpected fragment length : 180                                          ||
## ||       Unexpected read order : 2,575                                        ||
## ||                                                                            ||
## ||                      Indels : 197                                          ||
## ||                                                                            ||
## ||                Running time : 0.5 minutes                                  ||
## ||                                                                            ||
## \\============================================================================//

6 Demultiplexing the BAM file

Next, the BAM file needs to be modified in a way one or two new columns are generated for the cellular barcode tag and the molecular barcode (i.e. UMI) tag denoted by CB:Z: and OX:Z:, respectively.

sorted_tagged_bam <- sc_atac_bam_tagging (inbam = aligned_bam, 
                       output_folder =  output_folder, 
                       bam_tags      = list(bc="CB", mb="OX"), 
                       nthreads      =  12)
## Detected bc_len: 16  Detected UMI len:  0
## updating progress every 3 minutes...
## number of read processed:0
## 0    0   0   0   
## 70000 reads processed, 70k reads/sec
## number of read processed: 70000
## time elapsed: 1022 milliseconds
## Demultiplexed BAM file sorted and indexed ...
## Using default value for barcode length (bc_length = 16)
## Generating mapping statistics per barcode
## Iterating through 5,000,000 reads at a time
## chunk1
## Completed generating mapping statistics per barcode, saved to /tmp/Rtmpwxfbe7/scPipeATACVignette/scPipe_atac_stats/mapping_stats_per_barcode.csv
## The output tagged and sorted BAM file was sent to /tmp/Rtmpwxfbe7/scPipeATACVignette

7 Remove duplicates

Next, PCR duplicate reads are removed from the BAM file. If samtools is installed, then sc_atac_remove_duplicates can be used, with the installation path of samtools being an optional argument if a particular version is preferred. However, if samtools isn’t installed, the PCR duplicate removal should be performed externally.

removed <- sc_atac_remove_duplicates(sorted_tagged_bam,
                          output_folder = output_folder)
if (!isFALSE(removed))
  sorted_tagged_bam <- removed

8 Gemerating a fragment file

Next, a fragment file in a .bed format is generated, where each line represents a unique fragment generated by the assay. This file is used to generate useful quality control statistics, as well as for the filter and cellranger cell calling methods.

sc_atac_create_fragments(inbam = sorted_tagged_bam,
                         output_folder = output_folder)
## Output folder name is: /tmp/Rtmpwxfbe7/scPipeATACVignette
## Output BED file: /tmp/Rtmpwxfbe7/scPipeATACVignette/fragments.bed

9 Peak calling

Similar to many other tools, the the ability to call peaks on a pseudo-bulk level on the demultiplexed reads has also been incorporated. MACS3 is used underneath to achieve this functionality. If the genome size is not provided then it will be roughly estimated.

# CHECK IF THE .narrowPeak file is small enough to include in the package,
# otherwise, we need to make this basilisk call work??
sc_atac_peak_calling(inbam = sorted_tagged_bam, 
                     ref = reference,
                     genome_size = NULL,
                     output_folder = output_folder)

10 Assigning reads to features and feature counting

After the read alignment and BAM file demultiplexing, a feature set can be used to find the overlap between the aligned reads and the features using the sc_atac_feature_counting function.

Accepted format of the feature_input should be either a BED format (i.e. format should have three main columns; chromosome, feature start and feature end, extension of the file is not considered) or a genome.fasta file. If using a BED format as the feature_input, the feature_type should be either “peak” or “tss”. If using a .fasta for the feature_input, the feature_type needs to be defined as genome_bin. This genome.fasta file will be used within the function to generate a genome_bin that defines the array of features. The size of the bins can be set using the bin_size parameter (default: 2000).

Note: avoid using input BAM files larger than 8GB for best performance

features          <- file.path(output_folder, "NA_peaks.narrowPeak")

sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"),
                          feature_input = features, 
                          bam_tags      = list(bc="CB", mb="OX"), 
                          feature_type  = "peak", 
                          organism      = "hg38",
                          yieldsize     = 1000000,
                          exclude_regions = TRUE,
                          output_folder = output_folder,
                          fix_chr       = "none"
                          )

If genome_bin approach is selected, the following can be used:

reference       <- file.path(data.folder, "small_chr21.fa")
sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"),
                          feature_input = reference,
                          bam_tags      = list(bc="CB", mb="OX"),
                          feature_type  = "genome_bin",
                          organism      = "hg38",
                          cell_calling  = FALSE,
                          yieldsize     = 1000000,
                          exclude_regions = TRUE,
                          output_folder = output_folder,
                          fix_chr       = "none"
                          )

Call calling is performed on the data prior to finding overlaps. The cell calling methods available are the emptyDrops method from DropletUtils, filter which filters the cells based on various QC cut-offs, and cellranger which models the signal and noise as a mixture of two negative binomial distributions, though the filter method is recommended as it is most commonly used and the best-suited for scATAC-seq data.

For the filter cell-calling method, there are a variety of QC metrics that can be used, including:

  • min_uniq_frags
  • max_uniq_frags
  • min_frac_peak
  • min_frac_tss
  • min_frac_enhancer
  • min_frac_promoter
  • max_frac_mito
features          <- file.path(output_folder, "NA_peaks.narrowPeak")
sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"),
                          feature_input = features, 
                          bam_tags      = list(bc="CB", mb="OX"), 
                          feature_type  = "peak",
                          organism      = "hg38",
                          cell_calling  = "filter",
                          min_uniq_frags = 0,
                          min_frac_peak = 0,
                          min_frac_promoter = 0,
                          yieldsize     = 1000000,
                          exclude_regions = TRUE,
                          output_folder = output_folder,
                          fix_chr       = "none"
                          )
## `peak`, `tss` or `gene` feature_type is selected for feature input
## Creating Galignment object for the feature input ...
## hg38is a recognized organism. Using annotation files in repository.
## Generating TSS plot data
## Generating feature-barcode fragment count matrix
## Average no. of fragments per barcode: 1.34885245901639
## Raw feature matrix generated: /tmp/Rtmpwxfbe7/scPipeATACVignette/unfiltered_feature_matrix.rds
## Calculating TSS enrichment scores
## Generating QC metrics for cells
## `filter` function is used for cell calling ...
## cell called and stored in /tmp/Rtmpwxfbe7/scPipeATACVignette
## Cells with less than 200 counts are filtered out.
## Number of cells to remove:32
## No cells were filtered out since otherwise there would be too few left.
## features with less than 10 counts are filtered out.
## Number of features to remove:8
## No features were filtered out since otherwise there would be too few left.
## making sparse
## Sparse matrix generated
## Sparse count matrix is saved in
## /tmp/Rtmpwxfbe7/scPipeATACVignette/sparse_matrix.mtx
## Binary matrix is saved in:
## /tmp/Rtmpwxfbe7/scPipeATACVignette/binary_matrix.rds
## Computing feature QC metrics
## writing to csv
## SCE object is saved in:
##  /tmp/Rtmpwxfbe7/scPipeATACVignette/scPipe_atac_SCEobject.rds
## sc_atac_feature_counting completed in 10.7910225391388 seconds

Mapping quality (MAPQ) value can be set to filter data further in step (default: 30). Additionally, regions that are considered anomalous, unstructured, or high signal in next-generation sequencing experiments are excluded using an inbuilt excluded_regions file (available are for organism hg19, hg38, and mm10) or a user provided excluded_regions_filename file in BED format.

The discrepancy of chr between the alignment file and the feature file/excluded regions file can also be fixed (using fix_char parameter) if needed by adding the string chr to the beginning of either the features, and/or excluded regions. So the options for fix_char are “none”, “excluded_regions”, “feature”, “both”.

NOTE genome_bin approach may be more reliable in detecting sensitive features than using a pseudo-bulk peak calling approach, hence it would make the function slower as well.

The sc_atac_feature_counting function generates a matrix format of the feature by cell matrix (features as rows, cell barcodes as columns) in several formats (raw, sparse, binary, jaccard) that can be used downstream to generate a singleCellExperiment, SCE object.

feature_matrix <- readRDS(file.path(output_folder, "unfiltered_feature_matrix.rds"))
dplyr::glimpse(feature_matrix)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:1559] 3 5 3 3 3 3 3 3 3 1 ...
##   ..@ p       : int [1:1526] 0 1 2 3 4 5 6 7 8 9 ...
##   ..@ Dim     : int [1:2] 11 1525
##   ..@ Dimnames:List of 2
##   .. ..$ feature: chr [1:11] "chr21:284477-284941" "chr21:287515-288874" "chr21:325151-325387" "chr21:414105-416460" ...
##   .. ..$ barcode: chr [1:1525] "AAACGAACAGTAAGCG" "AAACGAACATGGATGG" "AAACGAATCGGACGAA" "AAACGAATCTGTGTGA" ...
##   ..@ x       : num [1:1559] 1 1 1 1 1 1 2 1 1 1 ...
##   ..@ factors : list()
sparseM <- readRDS(file.path(output_folder, "sparse_matrix.rds"))
dplyr::glimpse(sparseM)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:50] 3 3 3 3 3 1 3 3 3 3 ...
##   ..@ p       : int [1:33] 0 1 2 3 4 5 7 8 9 11 ...
##   ..@ Dim     : int [1:2] 11 32
##   ..@ Dimnames:List of 2
##   .. ..$ feature: chr [1:11] "chr21:284477-284941" "chr21:287515-288874" "chr21:325151-325387" "chr21:414105-416460" ...
##   .. ..$ barcode: chr [1:32] "AAATGAGCAATCAGGG" "AACTTGGCATGGCCGT" "AATACGCAGTTGGAAT" "ACAGCGCAGATGCGCA" ...
##   ..@ x       : num [1:50] 7 12 16 3 6 1 21 6 6 9 ...
##   ..@ factors : list()