1 Introduction

This package provides basic functions for analyzing next-generation sequencing data of cell-free DNA (cfDNA). The package focuses on extracting the length of cfDNA sample and visualization of genome-wide enrichment of short-fragmented cfDNA. The aberration of fragmentation profile, denoted modified ctDNA estimation score (CES), allows quantification of circulating tumor DNA (ctDNA). We recommend using this package to analysis shallow whole-genome sequencing data (~0.3X or more). This package was complemented by Bioconductor packages e.g. QDNAseq, Rsamtools and GenomicRanges which could further expand the functionality of this package in the future.

1.1 Installation

1.1.1 Install via the Bioconductor repository

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("cfdnakit")

1.1.2 Install the latest version via github

To install this package is via this github repository, please follow the instruction below.

Install prerequisites packages

if(! "devtools" %in% rownames(installed.packages()))
    install.packages("devtools")
if(! "BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager")

Install cfdnakit package

library(devtools)  ### use devtools
install_github("Pitithat-pu/cfdnakit") ### install cfDNAKit 

The installation should work fine without non-zero exit status. Try load cfdnakit package into current R session

library(cfdnakit) ### Load cfdnakit package

2 Prepare input BAM

A coordination-sorted BAM file of cfDNA from any liquid biopsy source (e.g. blood-plasma, CSF, urine) should be applicable. You should examine if the sequencing coverage reach coverage threshold before the analysis. Please make sure that the sequencing reads are mapped onto the following version of reference genome. cfdnakit supports the human reference genome GRCh37(hg19) and GRCh38(hg38). Preliminary fragment-length analysis can be performed using mouse cfDNA when mapping onto the mouse reference genome GRCm38(mm10).

3 Read the BAM file with read_bamfile

Let’s read sequence alignment file (.bam) using function read_bamfile. A BAM index file (.bai) is necessary for efficiently reading the file. If it doesn’t exist in the same directory, this function will automatically create one. This function will split sequencing reads into equal-size non-overlapping windows. Possible size of bin are 100, 500, and 1000 KB. A path to the input file is given to the function read_bamfile. A SampleBam object will be created as the result.

For demonstration, we read an example sequence file ex.plasma.bam.

library(cfdnakit)
sample_bamfile <- system.file("extdata",
                             "ex.plasma.bam",
                package = "cfdnakit")
plasma_SampleBam <- read_bamfile(sample_bamfile,
                                apply_blacklist = FALSE)
#> The BAM index file (.bai) is missing. Creating an index file
#> Bam index file created.
#> Reading bamfile

By default, running read_bamfile will split reads into 1000 KB non-overlapping bins based-on the human reference genome GRCh37.

Reading the file should take a while depending on the size of your BAM. We recommend to save the result as RDS file using saveRDS function.

### Optional
saveRDS(plasma_SampleBam, file = "patientcfDNA_SampleBam.RDS")

4 Analyse the Fragment Length Distribution

Fragment-length distribution of the sample can be visualized with function plot_fragment_dist. In the top-right legend, the modal length (bp) is shown in the parenthesis behind each sample name. The x-axis is the length of cfDNA; y-axis is the proportion of cfDNA fragment having a specific length.

plot_fragment_dist(list("Plasma.Sample"=plasma_SampleBam))

This document will demonstate by using SampleBam object (example_patientcfDNA_SampleBam.RDS). Now we load this file into R session.

example_RDS <- "example_patientcfDNA_SampleBam.RDS"
example_RDS_file <-
    system.file("extdata",example_RDS,
                package = "cfdnakit")
sample_bins <- readRDS(example_RDS_file)

In general, plasma cfDNA should show non-random fragmentation pattern. The modal length of cfDNA is 167 bp and 10-bp periodic peaks. However, tumor-derived cfDNA are observed to be shorter (<150 bp) than cfDNA of non-tumor origin. Here, we compare the fragment length distribution of patient’s cfDNA with healthy individual. We derived a healthy cfDNA from Snyder et al. (2016) and create a RData file “BH01_chr15.RDS”. This file can be loaded in R environment with readRDS function.

control_rds<-"BH01_CHR15.SampleBam.rds"
control_RDS_file <-
    system.file("extdata",control_rds,
                package = "cfdnakit")
control_bins <-
    readRDS(control_RDS_file)

To provide visual comparison of cell-free DNA fragmentation, cfdnakit provide a function that allows plotting multiple distribution plots. We create a list of SampleBam files and plot their distribution together with function plot_fragment_dist. Each element in the list must be given a distinct sample name (e.g. Healthy.cfDNA).

comparing_list <- list("Healthy.cfDNA"=control_bins,
                      "Patient.1"=sample_bins)
plot_fragment_dist(comparing_list)

Optional Parameters

maximum_length = Maximum length of cfDNA (Default 550)

minimum_length = Minimum length of cfDNA (Default 20)

5 Quantification of Short Fragmented CfDNA

We can extract genome-wide fragment-length profile. First we define the short fragment as fragments having size (by default) between 100 - 150 base and 151 - 250 as the long fragment. Providing a SampleBam object to the function get_fragment_profile will return a SampleFragment object.

sample_profile <- 
  get_fragment_profile(sample_bins,
                       sample_id = "Patient.1")

This SampleFragment contains a dataframe named sample_profile. This table contains information about the BAM file and the ratio number of Short/Long fragments. The table below describes important variables.

sample_profile$sample_profile
#>   Sample.ID Total.Fragments Read.Pairs.in.range Read.Pairs.in.range..corrected.
#> 1 Patient.1           68094               65421                        62091.41
#>   N.Short N.Long Mode Median   Mean   Mad S.L.Ratio S.L.Ratio_corrected
#> 1   23503  41918  167    159 163.23 20.76      0.56                0.56
#>   Bin.Size.KB.
#> 1         1000
Variable Description
Total.Fragments Number of DNA fragments (not the number of reads)
Read.Pairs.in.range Number of fragments within the defined range (short and long)
Mode Fragment length of the majority (bp)
Median Median Fragment length (bp)
Mean Average Fragment length (bp)
Mad Fragment length median absolute deviation (MAD)
N.Short Number of short fragments (n)
N.Long Number of long fragments (n)
S.L.Ratio N.Short/N.Long; Ratio of short fragment over long fragment
S.L.Ratio_corrected Ratio of short fragment over long fragment after GC-bias correction
Bin.Size.KB. Size of genomic bin of the analysis (KB)

6 Plot Genome-wide Short-fragmented Ratio

We can plot genome-wide short-fragment ratio with the function plot_sl_ratio. Given short-fragment profile, short-fragment ratio per bin infer contribution of circulating tumor DNA (ctDNA) into cfDNA. The enrichment of short-fragment cfDNA in a large genomic region could infer the copy-number aberration status. The higher short-fragment ratio indicate amplification event whereas deletion would have relatively lower short-fragment ratio.

## For this demenstration, we load a real patient-derived cfDNA profile.
patient.SampleFragment.file <-
    system.file("extdata",
                "example_patientcfDNA_SampleFragment.RDS",
                package = "cfdnakit")
patient.SampleFragment <- readRDS(patient.SampleFragment.file)
plot_sl_ratio(patient.SampleFragment)