cluster_irr
get_graph
get_joint_graph
detect_communities
dco
get_beta_violins
get_beta_scatterplot
library(knitr)
library(ClustIRR)
library(igraph)
library(ggplot2)
theme_set(new = theme_bw(base_size = 10))
library(ggrepel)
library(patchwork)
Adaptive immunity employs diverse immune receptor repertoires (IRRs; B- and T-cell receptors) to combat evolving pathogens including viruses, bacteria, and cancers. Receptor diversity arises through V(D)J recombination - combinatorial assembly of germline genes generating unique sequences. Each naive lymphocyte consequently expresses a distinct receptor, enabling broad antigen recognition.
B cells engage antigens directly via BCR complementarity-determining regions (CDRs), while T cells recognize peptide-MHC complexes through TCR CDRs. Antigen recognition triggers clonal expansion, producing effector populations that mount targeted immune responses.
High-throughput sequencing (HT-seq) enables tracking of TCR/BCR community dynamics across biological conditions (e.g., pre-/post-treatment), offering insights into responses to immunotherapy and vaccination. However, two key challenges complicate this approach:
Extreme diversity and privacy: TCRs/BCRs are highly personalized, with incomplete sampling particularly problematic in clinical settings where sample volumes are limited. Even comprehensive sampling reveals minimal repertoire overlap between individuals.
Similar TCRs/BCRs recognize similar antigens
This vignette introduces ClustIRR, a computational method that addresses these challenges by: (1) Identifying specificity-associated receptor communities through sequence clustering, and (2) Applying Bayesian models to quantify differential community abundance across conditions.
ClustIRR is freely available as part of Bioconductor, filling the gap that currently exists in terms of software for quantitative analysis of IRRs.
To install ClustIRR please start R and enter:
if(!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("ClustIRR")
The main input of ClustIRR is a repertoire (s
), which
should be provided as data.frame. The rows in the data.frame correspond
to clones (clone = group of cells derived from a common parent cell
by clonal expansion). We use the following data from each clone:
In a typical scenario, the user will have more than one repertoire (see workflow). For instance, the user will analyze longitudinal repertoire data, i.e., two or three repertoires taken at different time points; or across different biological conditions.
Let’s look at dataset D1
that is provided within
ClustIRR. D1
contains three TCR\(\alpha\beta\)
repertoires: \(a\), \(b\), and \(c\) and their metadata: ma
, mb
and mc
.
data("D1", package = "ClustIRR")
str(D1)
List of 6
$ a :'data.frame': 500 obs. of 4 variables:
..$ CDR3a : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
..$ CDR3b : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
..$ sample : chr [1:500] "a" "a" "a" "a" ...
..$ clone_size: num [1:500] 3 2 2 5 3 5 5 5 3 4 ...
$ b :'data.frame': 500 obs. of 4 variables:
..$ CDR3a : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
..$ CDR3b : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
..$ sample : chr [1:500] "b" "b" "b" "b" ...
..$ clone_size: num [1:500] 3 2 2 7 4 5 5 7 3 4 ...
$ c :'data.frame': 500 obs. of 4 variables:
..$ CDR3a : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
..$ CDR3b : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
..$ sample : chr [1:500] "c" "c" "c" "c" ...
..$ clone_size: num [1:500] 3 2 2 9 5 5 5 9 3 4 ...
$ ma:'data.frame': 500 obs. of 8 variables:
..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
..$ cell : chr [1:500] "CD8" "CD4" "CD8" "CD8" ...
..$ HLA_A : chr [1:500] "HLA-A∗24" "HLA-A∗24" "HLA-A∗24" "HLA-A∗24" ...
..$ age : num [1:500] 24 24 24 24 24 24 24 24 24 24 ...
..$ TRAV : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
..$ TRAJ : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
..$ TRBV : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
..$ TRBJ : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...
$ mb:'data.frame': 500 obs. of 8 variables:
..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
..$ cell : chr [1:500] "CD8" "CD4" "CD4" "CD4" ...
..$ HLA_A : chr [1:500] "HLA-A∗02" "HLA-A∗02" "HLA-A∗02" "HLA-A∗02" ...
..$ age : num [1:500] 30 30 30 30 30 30 30 30 30 30 ...
..$ TRAV : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
..$ TRAJ : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
..$ TRBV : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
..$ TRBJ : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...
$ mc:'data.frame': 500 obs. of 8 variables:
..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
..$ cell : chr [1:500] "CD8" "CD8" "CD8" "CD8" ...
..$ HLA_A : chr [1:500] "HLA-A∗11" "HLA-A∗11" "HLA-A∗11" "HLA-A∗11" ...
..$ age : num [1:500] 40 40 40 40 40 40 40 40 40 40 ...
..$ TRAV : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
..$ TRAJ : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
..$ TRBV : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
..$ TRBJ : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...
Extract the data.frames for each TCR repertoire and their metadata:
# Extract TCRab repertoires: a, b and c
a <- D1$a
b <- D1$b
c <- D1$c
# Extract metadata for TCRab repertoires 'a', 'b' and 'c'.
# This data is optional. It may contain and may contain diverse
# supplementary (clone-specific and repertoire-specific) data for
# each TCRab repertoire.
meta_a <- D1$ma
meta_b <- D1$mb
meta_c <- D1$mc
ClustIRR performs the following steps.
Compute similarities between T-cell clones within each TCR repertoire
Construct a graph from each TCR repertoire
Construct a joint similarity graph (\(J\))
Detect communities in \(J\)
Analyze Differential Community Occupancy (DCO)
a. Between individual TCR repertoires with model \(M\)
b. Between groups of TCR repertoires from biological conditions with model \(M_h\)
Inspect results
cluster_irr
ClustIRR aims to quantify the similarity between pairs of TCR clones based on the similarities of their CDR3s sequences. For this it employs Basic Local-Alignment Search Tool (BLAST) via the R-package blaster. Briefly, a protein database is constructed from all CDR3 sequences, and each CDR3 sequence is used as a query. This enables fast sequence similarity searches. Furthermore, only CDR3 sequences matches with \(\geq\) 70% sequence identity to the query are retained. This step reduces the computational and memory requirements, without impacting downstream community analyses, as CDR3 sequences with lower typically yield low similarity scores.
For matched CDR3 pair, an alignment score (\(\omega\)) is computed using BLOSUM62 substitution scores with gap opening penalty of -10 and gap extension penalty of -4. \(\omega\) is the sum of substitution scores and gap penalties in the alignment. Identical or highly similar CDR3 sequence pairs receive large positive \(\omega\) scores, while dissimilar pairs receive low or negative \(\omega\). To normalize \(\omega\) for alignment length, ClustIRR computes \(\bar{\omega} = \omega/l\), where \(l\) is the alignment length yielding normalized alignment score \(\bar{\omega}\). This normalization, also used in iSMART (Zhang, 2020), ensures comparability across CDR3 pairs of varying lengths.
ClustIRR also computes alignment scores for the CDR3 core regions (\(\omega^c\) and \(\bar{\omega}^c\)). The CDR3 core, representing the central loop region with high antigen-contact probability (Glanville, 2017), is generated by trimming three residues from each end of the CDR3 sequence. Comparing \(\bar{\omega}^c\) and \(\bar{\omega}\) allows assessment of whether sequence similarity is concentrated in the core or flanking regions.
cluster_irr
Step 1. involves calling the function clust_irr
which returns an S4
object of class clust_irr
.
# perform clust_irr analysis of repertoire a, b and c
cl_a <- cluster_irr(s = a, meta = meta_a, control = list(gmi = 0.7))
cl_b <- cluster_irr(s = b, meta = meta_b, control = list(gmi = 0.7))
cl_c <- cluster_irr(s = c, meta = meta_c, control = list(gmi = 0.7))
Next, we show the chain-specific similarity scores between CDR3s sequences. Each row is a pair of CDR3 sequences from the repertoire. For each pair we have the following metrics:
max_len
: length of the longer CDR3 sequence in the pairmax_clen
: length of the longer CDR3 core sequence in the pairweight
: \(\omega\) = BLOSUM62 score of the complete CDR3
alignmentcweight
= \(\omega_c\): BLOSUM62 score of CDR3 coresnweight
= \(\bar{\omega}\): normalized weight
by max_len
ncweight
= \(\bar{\omega}_c\): normalized cweight
by max_clen
The results for CDR3\(\alpha\) sequence pairs from repertoire a
:
kable(head(cl_a@clust$CDR3a), digits = 2)
from_cdr3 | to_cdr3 | weight | cweight | nweight | ncweight | max_len | max_clen |
---|---|---|---|---|---|---|---|
CASRRGGYNEQFF | CSVRRGGYNEQFF | 117 | 68 | 9.00 | 9.71 | 13 | 7 |
CASRRGGYNEQFF | CASSPPGYNEQFF | 105 | 47 | 8.08 | 6.71 | 13 | 7 |
CASRRGGYNEQFF | CASRVDASYNEQFF | 93 | 35 | 6.64 | 4.38 | 14 | 8 |
CASRRGGYNEQFF | CASSRQGHNEQFF | 107 | 49 | 8.23 | 7.00 | 13 | 7 |
CASSLSSGNTIYF | CASSLSISGNTIYF | 105 | 47 | 7.50 | 5.88 | 14 | 8 |
CASSLSSGNTIYF | CASSRGHSSGNTIYF | 95 | 37 | 6.33 | 4.11 | 15 | 9 |
… the same table for CDR3\(\beta\) sequence pairs from repertoire a
:
kable(head(cl_a@clust$CDR3b), digits = 2)
from_cdr3 | to_cdr3 | weight | cweight | nweight | ncweight | max_len | max_clen |
---|---|---|---|---|---|---|---|
CASSPGGDEAFF | CASSQGRGTDEAFF | 81 | 24 | 5.79 | 3.00 | 14 | 8 |
CASSPGGDEAFF | CASSPRTTEAFF | 92 | 35 | 7.67 | 5.83 | 12 | 6 |
CASKVPSGESSYNEQFF | CASSHPSSSYNEQFF | 110 | 52 | 6.47 | 4.73 | 17 | 11 |
CASKVPSGESSYNEQFF | CASSVASGGTYNEQFF | 116 | 58 | 6.82 | 5.27 | 17 | 11 |
CASGLNVNTEAFF | CASSLNRVNTEAFF | 101 | 44 | 7.21 | 5.50 | 14 | 8 |
CASGLNVNTEAFF | CASGLVGNTEAFF | 105 | 48 | 8.08 | 6.86 | 13 | 7 |
Notice that very similar CDR3\(\alpha\) or CDR3\(\beta\) pairs have high
normalized alignment scores (\(\bar{\omega}\)). We have similar tables for
repertoire b
and c
.
The function clust_irr
performs automatic annotation of TCR clones
based on multiple databases (DBs), including: VDJdb, TCR3d, McPAS-TCR.
Each TCR clone recieves two types of annotations (per chain and per
database):
Ag_species: the name of the antigen species recognized by the clone’s CDR3
Ag_gene: the name of the antigen gene recognized by clone’s CDR3
Annotation process: annotation is performed based on an edit distance
threshold, controlled by the user through the parameter
control$db_edit
(default = 0). If the edit distance between an
input CDR3 sequence and a database (DB) CDR3 sequence is less than or
equal to control$db_edit
, the input CDR3 inherits the annotation
data of the matched DB CDR3.
This information will be used for various downstream analyses (see the next steps).
get_graph
Next, ClustIRR builds a graph. If we analyze one
TCR\(\alpha\beta\) repertoire, then we may employ the function
get_graph
, which converts the clust_irr
object into an igraph
object. Meanwhile, if we are analyzing two or more repertoires, then we
can use the function get_joint_graph
to generate a joint igraph
object. In this case, edges between TCR clones from different TCR
repertoires are computed using the same procedure outlined in step 1.
The graphs have nodes and weighted edges:
g <- get_graph(clust_irr = cl_a)
plot_graph
We can use the function plot_graph
for interactive visualization of
relatively small graphs.
For instance, we can highligh TCR clones that recognize certain antigenic species or genes (see dropdown menu), or use the hoovering function to look at the CDR3 sequences of nodes.
Do this now!
plot_graph(g, select_by = "Ag_species", as_visnet = TRUE)
get_graph
and get_joint_graph
generate igraph
objectsWe can use igraph functions to inspect various properties of the graph.
For instance, below, we extract the edge attributes and visualize the
distributions of the edge attributes ncweight
and nweight
for all
CDR3\(\alpha\) and CDR3\(\beta\) sequence pairs.
# data.frame of edges and their attributes
e <- igraph::as_data_frame(x = g$graph, what = "edges")
ggplot(data = e)+
geom_density(aes(ncweight, col = chain))+
geom_density(aes(nweight, col = chain), linetype = "dashed")+
xlab(label = "edge weight (solid = ncweight, dashed = nweight)")+
theme(legend.position = "top")
Can you guess why we observe bimodal distributions? (Hint: CDR3s have different lengths)
get_joint_graph
In dataset D1
we have three TCR repertoires. Hence, we can use the function
get_joint_graph
to create a joint graph. This function computes
similarities between CDR3 sequences of clones from different repertoires,
creating new edges between nodes from different graphs with \(\bar{\omega}\)
and \(\bar{\omega}^c\) as edge attributes. The joint graph \(J\) is structured
as an igraph object.
g <- get_joint_graph(clust_irrs = c(cl_a, cl_b, cl_c))
Once again, we will use the function plot_graph
for interactive
visualization. However, even in this small case study–where each
TCR repertoire contains only 500 TCR clones–we observe that the
joint graph is complex! This complexity makes qualitative analyses
impractical. To address this, ClustIRR focuses on
quantitative analyses (see the next step).
plot_graph(g = g, select_by = "Ag_gene", as_visnet = TRUE, node_opacity = 0.8)
detect_communities
ClustIRR employs graph-based community detection (GCD) algorithms, such as Louvain, Leiden or InfoMap, to identify communities of nodes that have high density of edges among each other, and low density of edges with nodes outside the community.
First, the similarity score between T-cell clones \(i\) and \(j\) is defined as the average CDR3\(\alpha\) and CDR3\(\beta\) alignment scores:
\[\begin{align} \omega(i,j) = \dfrac{{\bar{\omega}}_\alpha + {\bar{\omega}}_\beta}{2} \end{align}\]where \(\bar{\omega}_\alpha\) and \(\bar{\omega}_\beta\) are the alignment scores for the CDR3\(\alpha\) and CDR3\(\beta\), respectively. If a chain is missing, its alignment score is set to 0.
The user has the following options:
algorithm
: “leiden” (default) “louvain”, or “infomap”resolution
: GCD resolution = 1 (default)iterations
: number of clustering iterations (default = 100) to
ensure robust resultsweight
: “ncweight” (default) or “nweight”chains
: “CDR3a” or “CDR3b” or c(“CDR3a”, “CDR3b”)gcd <- detect_communities(graph = g$graph,
algorithm = "leiden",
resolution = 1,
iterations = 100,
weight = "ncweight",
chains = c("CDR3a", "CDR3b"))
detect_communities
The function detect_communities
generates a complex output. Lets
investigate its elements:
names(gcd)
[1] "community_occupancy_matrix" "community_summary"
[3] "node_summary" "graph"
[5] "graph_structure_quality" "input_config"
The main element is community_occupancy_matrix
, which contains the
number of T-cells in each community (row) and repertoire (column). Here
we have three repertoires (three columns) and about 260 communities.
This matrix is the main input of the function dco
(step 5.), to detect
differences in the community occupancy between repertoires.
dim(gcd$community_occupancy_matrix)
[1] 261 3
head(gcd$community_occupancy_matrix)
a b c
1 10 10 10
2 2 2 2
3 2 2 2
4 5 7 9
5 28 31 34
6 5 5 5
get_honeycombs
honeycomb <- get_honeycombs(com = gcd$community_occupancy_matrix)
In the honeycomb plots shown below, several communities (black dots) appear far from the diagonal. This indicates that these communities contain more cells in repertoires b and c (y-axes in panels A and B) compared to repertoire a (x-axes in panels A and B). Meanwhile, the same points are generally close to the diagonal in panel C but remain slightly more abundant in repertoire c (y-axis) compared to b (x-axis).
In step 5, ClustIRR will provide a quantitative answer to the question: Which communities are differentially abundant between pairs of repertoires?
Importantly, the color of the hexagons encodes the density of communities in the 2D scatterplots: dark hexagons indicate high a frequency of communities, while light hexagons represent sparsely populated regions.
wrap_plots(honeycomb, nrow = 1)+
plot_annotation(tag_levels = 'A')
Also see community_summary
. In the data.frame wide
we provide
community summaries in each row across all samples, including:
clones_a
, clone_b
, clone_c
, clones_n
: the frequency of
clones in the community coming from repertoire a
, b
, c
and in
total (n)cells_a
, cells_b
, cells_c
, cells_n
: the frequency of cell in
the community coming from repertoires a
, b
, c
and in total (n)w
: average inter-clone similarityncweight_CDR3a/b
, nweight_CDR3a/b
: raw and normalized similarity
for CDR3\(\alpha\) and CDR3\(\beta\) sequencesn_CDR3a
, n_CDR3b
: number of edges between CDR3\(\alpha\) and
CDR3\(\beta\) sequenceskable(head(gcd$community_summary$wide), digits = 1)
community | clones_a | clones_b | clones_c | clones_n | cells_a | cells_b | cells_c | cells_n | w | ncweight_CDR3a | ncweight_CDR3b | nweight_CDR3a | nweight_CDR3b | n_CDR3a | n_CDR3b |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 3 | 3 | 9 | 10 | 10 | 10 | 30 | 5.2 | 3.1 | 3.1 | 3.9 | 3.1 | 27 | 9 |
2 | 1 | 1 | 1 | 3 | 2 | 2 | 2 | 6 | 9.3 | NaN | 9.3 | NaN | 9.6 | 3 | 3 |
3 | 1 | 1 | 1 | 3 | 2 | 2 | 2 | 6 | 9.4 | NaN | 9.4 | NaN | 9.6 | 3 | 3 |
4 | 1 | 1 | 1 | 3 | 5 | 7 | 9 | 21 | 9.1 | NaN | 9.1 | NaN | 9.5 | 3 | 3 |
5 | 6 | 6 | 6 | 18 | 28 | 31 | 34 | 93 | 4.0 | 0.0 | 4.0 | 0.0 | 4.7 | 18 | 108 |
6 | 1 | 1 | 1 | 3 | 5 | 5 | 5 | 15 | 9.6 | NaN | 9.6 | NaN | 9.8 | 3 | 3 |
In the data.frame tall
we provide community and sample/repertoire
summaries in each row.
kable(head(gcd$community_summary$tall), digits = 2)
community | sample | clones | cells | w | ncweight_CDR3a | ncweight_CDR3b | nweight_CDR3a | nweight_CDR3b | n_CDR3a | n_CDR3b | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | a | 3 | 10 | 5.17 | 3.12 | 3.09 | 3.86 | 3.14 | 27 | 9 |
2 | 1 | b | 3 | 10 | 5.17 | 3.12 | 3.09 | 3.86 | 3.14 | 27 | 9 |
3 | 1 | c | 3 | 10 | 5.17 | 3.12 | 3.09 | 3.86 | 3.14 | 27 | 9 |
334 | 112 | a | 5 | 30 | 3.33 | 2.24 | 1.43 | 3.37 | 1.47 | 96 | 15 |
335 | 112 | b | 5 | 31 | 3.33 | 2.24 | 1.43 | 3.37 | 1.47 | 96 | 15 |
336 | 112 | c | 5 | 32 | 3.33 | 2.24 | 1.43 | 3.37 | 1.47 | 96 | 15 |
Node-specific (TCR clone-specific) summaries are provided in
node_summary
kable(head(gcd$node_summary), digits = 2)
name | clone_id | Ag_gene | Ag_species | info_tcr3d_CDR3a | db_tcr3d_CDR3a | info_tcr3d_CDR3b | db_tcr3d_CDR3b | info_mcpas_CDR3a | db_mcpas_CDR3a | info_mcpas_CDR3b | db_mcpas_CDR3b | info_vdjdb_CDR3a | db_vdjdb_CDR3a | info_vdjdb_CDR3b | db_vdjdb_CDR3b | clone_size | sample | CDR3b | CDR3a | cell | HLA_A | age | TRAV | TRAJ | TRBV | TRBJ | community | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a|1 | a|1 | 1 | <db:tcr3d|chain:CDR3a|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:tcr3d|chain:CDR3b|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:mcpas|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 3 | a | CASTVTSGSNEKLFF | CASSLRGAHNEQFF | CD8 | HLA-A∗24 | 24 | TRAV27 | TRAJ2-1 | TRBV6-8 | TRBJ1-4 | 1 | ||
a|2 | a|2 | 2 | <db:tcr3d|chain:CDR3a|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:tcr3d|chain:CDR3b|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:mcpas|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 2 | a | CASSLTGTGYTF | CASGLRQGYGYTF | CD4 | HLA-A∗24 | 24 | TRAV27 | TRAJ1-2 | TRBV7-3 | TRBJ1-2 | 2 | ||
a|3 | a|3 | 3 | <db:tcr3d|chain:CDR3a|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:tcr3d|chain:CDR3b|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:mcpas|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 2 | a | CSALTPGLIYNEQFF | CSAGGFRETQYF | CD8 | HLA-A∗24 | 24 | TRAV20-1 | TRAJ2-5 | TRBV20-1 | TRBJ2-1 | 3 | ||
a|4 | a|4 | 4 | <db:tcr3d|chain:CDR3a|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:tcr3d|chain:CDR3b|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:mcpas|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 5 | a | CSARASWGTNEKLFF | CGSSLSQGSEQYF | CD8 | HLA-A∗24 | 24 | TRAV7-7 | TRAJ2-7 | TRBV20-1 | TRBJ1-4 | 4 | ||
a|5 | a|5 | 5 | <db:tcr3d|chain:CDR3a|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:tcr3d|chain:CDR3b|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:mcpas|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 3 | a | CASSVREAENQPQHF | CSPRGPYGYTF | CD8 | HLA-A∗24 | 24 | TRAV20-1 | TRAJ1-2 | TRBV4-1 | TRBJ1-5 | 5 | ||
a|6 | a|6 | 6 | <db:tcr3d|chain:CDR3a|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:tcr3d|chain:CDR3b|Antigen_species:|Antigen_gene:|Reference:> | 0 | <db:mcpas|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:mcpas|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3a|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | <db:VDJdb|chain:CDR3b|Antigen_species:|Antigen_gene:|CDR3_species:|Reference:> | 0 | 5 | a | CASCTVGNQPQHF | CSLGPSKSQYF | CD8 | HLA-A∗24 | 24 | TRAV29-1 | TRAJ2-3 | TRBV19 | TRBJ1-5 | 6 |
decode_community
Community with ID=8 contains 42 TCR clones. These are connected based on their CDR3\(\beta\) sequences.
ns_com_8 <- gcd$node_summary[gcd$node_summary$community == 8, ]
table(ns_com_8$sample)
a b c
14 14 14
We can extract it and visualize it using igraph:
par(mai = c(0,0,0,0))
plot(subgraph(graph = gcd$graph, vids = which(V(gcd$graph)$community == 8)))
Furthermore, we can “decompose” this graph using decode_community
based
on the attributes of the edges (edge_at
) and nodes (node_at
).
# we can pick from these edge attributes
edge_attr_names(graph = gcd$graph)
[1] "nweight" "ncweight" "chain" "w"
# This would tell ClustIRR to only keep edges with nweight>=4 & ncweight>=4
edge_at <- list(list(name = "nweight", val = 3, op = ">="),
list(name = "ncweight", val = 3, op = ">="))
# we can pick from these node attributes
vertex_attr_names(graph = gcd$graph)
[1] "name" "clone_id" "Ag_gene" "Ag_species"
[5] "info_tcr3d_CDR3a" "db_tcr3d_CDR3a" "info_tcr3d_CDR3b" "db_tcr3d_CDR3b"
[9] "info_mcpas_CDR3a" "db_mcpas_CDR3a" "info_mcpas_CDR3b" "db_mcpas_CDR3b"
[13] "info_vdjdb_CDR3a" "db_vdjdb_CDR3a" "info_vdjdb_CDR3b" "db_vdjdb_CDR3b"
[17] "clone_size" "sample" "CDR3b" "CDR3a"
[21] "cell" "HLA_A" "age" "TRAV"
[25] "TRAJ" "TRBV" "TRBJ" "community"
# This would tell ClustIRR to only keep nodes with TRBV and TRBJ
node_at <- list(list(name = "TRBV"), list(name = "TRBJ"))
c8 <- decode_communities(community_id = 8,
graph = gcd$graph,
edge_at = edge_at,
node_at = node_at)
par(mar = c(0, 0, 0, 0))
plot(c8, vertex.size = 10)
as_data_frame(x = c8, what = "vertices")[,c("name", "component_id",
"CDR3b", "TRBV", "TRBJ",
"CDR3a", "TRAV", "TRAJ")]
name component_id CDR3b TRBV TRBJ CDR3a
a|8 a|8 12 CASSGRGGDNEQFF TRBV7-9 TRBJ2-1 CASGTLSYNEQFF
a|271 a|271 12 CASSQAGGRNEQFF TRBV7-9 TRBJ2-1 CTSSQVGVGMNTEAFF
b|8 b|8 12 CASSGRGGDNEQFF TRBV7-9 TRBJ2-1 CASGTLSYNEQFF
b|271 b|271 12 CASSQAGGRNEQFF TRBV7-9 TRBJ2-1 CTSSQVGVGMNTEAFF
c|8 c|8 12 CASSGRGGDNEQFF TRBV7-9 TRBJ2-1 CASGTLSYNEQFF
c|271 c|271 12 CASSQAGGRNEQFF TRBV7-9 TRBJ2-1 CTSSQVGVGMNTEAFF
a|22 a|22 13 CASSQAGYSYNEQFF TRBV9 TRBJ2-1 CASSPPLAEETQYF
b|22 b|22 13 CASSQAGYSYNEQFF TRBV9 TRBJ2-1 CASSPPLAEETQYF
c|22 c|22 13 CASSQAGYSYNEQFF TRBV9 TRBJ2-1 CASSPPLAEETQYF
a|23 a|23 8 CASSYSGHYNEQFF TRBV6-5 TRBJ2-1 CSVVGQLTGYEQYF
b|23 b|23 8 CASSYSGHYNEQFF TRBV6-5 TRBJ2-1 CSVVGQLTGYEQYF
c|23 c|23 8 CASSYSGHYNEQFF TRBV6-5 TRBJ2-1 CSVVGQLTGYEQYF
a|124 a|124 11 CASAPGTSYNEQFF TRBV7-6 TRBJ2-1 CASSLYGTERADEQYF
b|124 b|124 11 CASAPGTSYNEQFF TRBV7-6 TRBJ2-1 CASSLYGTERADEQYF
c|124 c|124 11 CASAPGTSYNEQFF TRBV7-6 TRBJ2-1 CASSLYGTERADEQYF
a|166 a|166 2 CASSLGTPYNEQFF TRBV13 TRBJ2-1 CASGQTDPNGYTF
b|166 b|166 2 CASSLGTPYNEQFF TRBV13 TRBJ2-1 CASGQTDPNGYTF
c|166 c|166 2 CASSLGTPYNEQFF TRBV13 TRBJ2-1 CASGQTDPNGYTF
a|340 a|340 1 CASSGRSSYNEQFF TRBV12-4 TRBJ2-1 CASSLPAERPDYGYTF
b|340 b|340 1 CASSGRSSYNEQFF TRBV12-4 TRBJ2-1 CASSLPAERPDYGYTF
c|340 c|340 1 CASSGRSSYNEQFF TRBV12-4 TRBJ2-1 CASSLPAERPDYGYTF
a|379 a|379 5 CASSQGRSYNEQFF TRBV4-1 TRBJ2-1 CSASGTEDNEQFF
b|379 b|379 5 CASSQGRSYNEQFF TRBV4-1 TRBJ2-1 CSASGTEDNEQFF
c|379 c|379 5 CASSQGRSYNEQFF TRBV4-1 TRBJ2-1 CSASGTEDNEQFF
a|408 a|408 4 CASSPLTGRNEQFF TRBV27 TRBJ2-1 CASSLGASEYEQYF
b|408 b|408 4 CASSPLTGRNEQFF TRBV27 TRBJ2-1 CASSLGASEYEQYF
c|408 c|408 4 CASSPLTGRNEQFF TRBV27 TRBJ2-1 CASSLGASEYEQYF
a|429 a|429 3 CASSGAGQFNEQFF TRBV19 TRBJ2-1 CSVRDLGETQYF
b|429 b|429 3 CASSGAGQFNEQFF TRBV19 TRBJ2-1 CSVRDLGETQYF
c|429 c|429 3 CASSGAGQFNEQFF TRBV19 TRBJ2-1 CSVRDLGETQYF
a|456 a|456 6 CASSELQSYNEQFF TRBV6-1 TRBJ2-1 CSGAGWIGFPDQYNEQFF
b|456 b|456 6 CASSELQSYNEQFF TRBV6-1 TRBJ2-1 CSGAGWIGFPDQYNEQFF
c|456 c|456 6 CASSELQSYNEQFF TRBV6-1 TRBJ2-1 CSGAGWIGFPDQYNEQFF
a|464 a|464 9 CASSLAGGYNEQFF TRBV6-8 TRBJ2-1 CASSIYMGKNTEAFF
b|464 b|464 9 CASSLAGGYNEQFF TRBV6-8 TRBJ2-1 CASSIYMGKNTEAFF
c|464 c|464 9 CASSLAGGYNEQFF TRBV6-8 TRBJ2-1 CASSIYMGKNTEAFF
a|482 a|482 7 CASSLLGSYNEQFF TRBV6-2 TRBJ2-1 CASMGRQGRDGNEKLFF
b|482 b|482 7 CASSLLGSYNEQFF TRBV6-2 TRBJ2-1 CASMGRQGRDGNEKLFF
c|482 c|482 7 CASSLLGSYNEQFF TRBV6-2 TRBJ2-1 CASMGRQGRDGNEKLFF
a|500 a|500 10 CASSSSPGDNEQFF TRBV7-3 TRBJ2-1 CASSYLAAQETQYF
b|500 b|500 10 CASSSSPGDNEQFF TRBV7-3 TRBJ2-1 CASSYLAAQETQYF
c|500 c|500 10 CASSSSPGDNEQFF TRBV7-3 TRBJ2-1 CASSYLAAQETQYF
TRAV TRAJ
a|8 TRAV19 TRAJ2-1
a|271 TRAV4-2 TRAJ1-1
b|8 TRAV19 TRAJ2-1
b|271 TRAV4-2 TRAJ1-1
c|8 TRAV19 TRAJ2-1
c|271 TRAV4-2 TRAJ1-1
a|22 TRAV7-9 TRAJ2-5
b|22 TRAV7-9 TRAJ2-5
c|22 TRAV7-9 TRAJ2-5
a|23 TRAV29-1 TRAJ2-7
b|23 TRAV29-1 TRAJ2-7
c|23 TRAV29-1 TRAJ2-7
a|124 TRAV12-4 TRAJ2-7
b|124 TRAV12-4 TRAJ2-7
c|124 TRAV12-4 TRAJ2-7
a|166 TRAV25-1 TRAJ1-2
b|166 TRAV25-1 TRAJ1-2
c|166 TRAV25-1 TRAJ1-2
a|340 TRAV7-2 TRAJ1-2
b|340 TRAV7-2 TRAJ1-2
c|340 TRAV7-2 TRAJ1-2
a|379 TRAV20-1 TRAJ2-1
b|379 TRAV20-1 TRAJ2-1
c|379 TRAV20-1 TRAJ2-1
a|408 TRAV12-4 TRAJ2-7
b|408 TRAV12-4 TRAJ2-7
c|408 TRAV12-4 TRAJ2-7
a|429 TRAV29-1 TRAJ2-5
b|429 TRAV29-1 TRAJ2-5
c|429 TRAV29-1 TRAJ2-5
a|456 TRAV29-1 TRAJ2-1
b|456 TRAV29-1 TRAJ2-1
c|456 TRAV29-1 TRAJ2-1
a|464 TRAV19 TRAJ1-1
b|464 TRAV19 TRAJ1-1
c|464 TRAV19 TRAJ1-1
a|482 TRAV19 TRAJ1-4
b|482 TRAV19 TRAJ1-4
c|482 TRAV19 TRAJ1-4
a|500 TRAV6-6 TRAJ2-5
b|500 TRAV6-6 TRAJ2-5
c|500 TRAV6-6 TRAJ2-5
dco
Do we see expanding or contracting communities in a given repertoire?
We employ a Bayesian model to quantify the relative abundance (occupancy) of individual communities in each repertoire.
The model output for repertoire \(a\) is the parameter vector \(\beta^a=\beta^a_1,\beta^a_2,\ldots,\beta^a_k\), which describes the effect of repertoire \(a\) on the relative occupancy in each community.
Given two repertoires, \(a\) and \(b\), we can quantify the differential community occupancy (DCO):
\[\begin{align} \delta^{a-b}_i = \beta^a_i - \beta^b_i \end{align}\]d <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix,
mcmc_control = list(mcmc_warmup = 300,
mcmc_iter = 600,
mcmc_chains = 2,
mcmc_cores = 1,
mcmc_algorithm = "NUTS",
adapt_delta = 0.9,
max_treedepth = 10))
SAMPLING FOR MODEL 'dm' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.000262 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.62 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 600 [ 0%] (Warmup)
Chain 1: Iteration: 60 / 600 [ 10%] (Warmup)
Chain 1: Iteration: 120 / 600 [ 20%] (Warmup)
Chain 1: Iteration: 180 / 600 [ 30%] (Warmup)
Chain 1: Iteration: 240 / 600 [ 40%] (Warmup)
Chain 1: Iteration: 300 / 600 [ 50%] (Warmup)
Chain 1: Iteration: 301 / 600 [ 50%] (Sampling)
Chain 1: Iteration: 360 / 600 [ 60%] (Sampling)
Chain 1: Iteration: 420 / 600 [ 70%] (Sampling)
Chain 1: Iteration: 480 / 600 [ 80%] (Sampling)
Chain 1: Iteration: 540 / 600 [ 90%] (Sampling)
Chain 1: Iteration: 600 / 600 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 10.6 seconds (Warm-up)
Chain 1: 9.251 seconds (Sampling)
Chain 1: 19.851 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'dm' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 0.00026 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 2.6 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 600 [ 0%] (Warmup)
Chain 2: Iteration: 60 / 600 [ 10%] (Warmup)
Chain 2: Iteration: 120 / 600 [ 20%] (Warmup)
Chain 2: Iteration: 180 / 600 [ 30%] (Warmup)
Chain 2: Iteration: 240 / 600 [ 40%] (Warmup)
Chain 2: Iteration: 300 / 600 [ 50%] (Warmup)
Chain 2: Iteration: 301 / 600 [ 50%] (Sampling)
Chain 2: Iteration: 360 / 600 [ 60%] (Sampling)
Chain 2: Iteration: 420 / 600 [ 70%] (Sampling)
Chain 2: Iteration: 480 / 600 [ 80%] (Sampling)
Chain 2: Iteration: 540 / 600 [ 90%] (Sampling)
Chain 2: Iteration: 600 / 600 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 10.765 seconds (Warm-up)
Chain 2: 9.243 seconds (Sampling)
Chain 2: 20.008 seconds (Total)
Chain 2:
get_beta_violins
beta_violins <- get_beta_violins(node_summary = gcd$node_summary,
beta = d$posterior_summary$beta,
ag_species = NULL,
db = "vdjdb",
db_dist = 0,
chain = "both")
beta_violins$violins
[[1]]
beta_violins <- get_beta_violins(node_summary = gcd$node_summary,
beta = d$posterior_summary$beta,
ag_species = c("CMV", "EBV", "Influenza"),
ag_genes = c("MLANA"),
db = "vdjdb",
db_dist = 0,
chain = "both")
patchwork::wrap_plots(beta_violins$violins, ncol = 2)
get_beta_scatterplot
beta_scatterplots <- get_beta_scatterplot(node_summary = gcd$node_summary,
beta = d$posterior_summary$beta,
ag_species = c("CMV"),
db = "vdjdb",
db_dist = 0,
chain = "both")
patchwork::wrap_plots(beta_scatterplots$scatterplots[[1]], ncol = 3)
Before we can start interpreting the model results, we have to make sure that the model is valid. One standard approach is to check whether our model can retrodict the observed data (community occupancy matrix) which was used to fit model parameters.
General idea of posterior predictive checks:
ClustIRR provides \(y\) and \(\hat{y}\) of each repertoire, which we can visualize with ggplot:
ggplot(data = d$posterior_summary$y_hat)+
facet_wrap(facets = ~sample, nrow = 1, scales = "free")+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "gray")+
geom_errorbar(aes(x = y_obs, y = mean, ymin = L95, ymax = H95),
col = "darkgray", width=0)+
geom_point(aes(x = y_obs, y = mean), size = 0.8)+
xlab(label = "observed y")+
ylab(label = "predicted y (and 95% HDI)")
Given two repertoires, \(a\) and \(b\), we can quantify the differential community occupancy (DCO):
\[\begin{align} \delta^{a-b}_i = \beta^a_i - \beta^b_i \end{align}\]Importantly, ClustIRR always computes both contrasts (\(a\) vs. \(b\) and \(b\) vs. \(a\)): \(\delta^{a-b}_i\) and \(\delta^{b-a}_i\).
ggplot(data = d$posterior_summary$delta)+
facet_wrap(facets = ~contrast, ncol = 2)+
geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95),
col = "lightgray", width = 0)+
geom_point(aes(x = community, y = mean), shape = 21, fill = "black",
stroke = 0.4, col = "white", size = 1.25)+
theme(legend.position = "top")+
ylab(label = expression(delta))+
scale_x_continuous(expand = c(0,0))
Given two repertoires, \(a\) and \(b\), ClustIRR also quantifies absolute differences in community probabilities:
\[\begin{align} \epsilon^{a-b} = \mathrm{softmax}(\alpha + \beta^a) - \mathrm{softmax}(\alpha + \beta^b) \end{align}\]Again, both contrasts are computed: \(\epsilon^{a-b}_i\) and \(\epsilon^{b-a}_i\).
ggplot(data = d$posterior_summary$epsilon)+
facet_wrap(facets = ~contrast, ncol = 2)+
geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95),
col = "lightgray", width = 0)+
geom_point(aes(x = community, y = mean), shape = 21, fill = "black",
stroke = 0.4, col = "white", size = 1.25)+
theme(legend.position = "top")+
ylab(label = expression(epsilon))+
scale_x_continuous(expand = c(0,0))
The function dco
takes as its main input a community occupancy matrix.
This enables users who are accustomed to using complementary algorithm
for detecting specificity groups, such as, GLIPH, TCRdist3, GIANA, and
iSMART, to skip steps 1-3 of the ClustIRR workflow, and
to proceed with analysis for DCO.