1 Introduction

The ambitions of collaborative single cell biology will only be achieved through the coordinated efforts of many groups working to clarify cell types and dynamics in an array of functional and environmental contexts. The use of formal ontology in this pursuit is well motivated, and research progress has already been substantial.

Bakken et al. (2017) discuss “strategies for standardized cell type representations based on the data outputs from [high-content flow cytometry and single cell RNA sequencing], including ‘context annotations’ in the form of standardized experiment metadata about the specimen source analyzed and marker genes that serve as the most useful features in machine learning-based cell type classification models.” Aevermann et al. (2018) describe how the FAIR principles can be implemented using statistical identification of necessary and sufficient conditions for determining cell class membership. They propose that Cell Ontology can be transformed into a broadly usable knowledgebase through the incorporation of accurate marker gene signatures for cell classes.

In this vignette, we review key concepts and tasks required to make progress in the adoption and application of ontological discipline in Bioconductor-oriented data analysis.

We’ll start by setting up some package attachments and ontology objects.

2 Scope of package

The following table describes the resources available with get* commands defined in ontoProc.

| X | func | purpose | nclass | nprop | nroots | datav | fmtv |
|---|---|---|---|---|---|---|---|
| 1 | getCellLineOnto | Cell line catalog | 41780 | 6 | 18 | NA | NA |
| 2 | getCellOnto | Cell biology concepts | 6708 | 59 | 38 | releases/2018-07-07 | 1.2 |
| 3 | getCellosaurusOnto | Cell line concepts | 87311 | 6 | 87311 | 23 | 1.2 |
| 4 | getChebiLite | Chemicals of biological interest | 108496 | 6 | 12 | 155 | 1.2 |
| 5 | getChebiOnto | | 108496 | 33 | 12 | 155 | 1.2 |
| 6 | getDiseaseOnto | Human disease | 11283 | 24 | 13 | releases/2018-06-29 | 1.2 |
| 7 | getEFOOnto | Experimental factors | 20115 | 6 | 36 | 2.87 | 1.2 |
| 8 | getGeneOnto | Gene ontology | 47123 | 43 | 10 | releases/2018-03-27 | 1.2 |
| 9 | getHCAOnto | Human cell atlas | 11047 | 6 | 76 | NA | NA |
| 10 | getOncotreeOnto | Tumor relations | 1298 | 15 | 3 | ncit/releases/2017-12-15/ncit-oncotree.ttl | 1.2 |
| 11 | getPATOnto | Phenotypes and traits | 2670 | 43 | 21 | releases/2018-11-12 | 1.2 |
| 12 | getPROnto | Protein ontology | 315957 | 6 | 53 | 57 | 1.2 |
| 13 | getUBERON_NE | Anatomy | 14937 | 6 | 135 | releases/2017-09-09 | 1.2 |

3 Methods

3.1 Conceptual overview of ontology with cell types

Definitions, semantics. For concreteness, we provide some definitions and examples. We use ontology to denote the systematic organization of terminology used in a conceptual domain. The Cell Ontology is a graphical data structure with carefully annotated terms as nodes and conventionally defined semantic relationships among terms serving as edges. As an example, lung ciliated cell has a URI that includes the fixed-length identifier CL_1000271, which has an unambiguous interpretation wherever it is encountered. There is a chain of relationships from lung ciliated cell up through ciliated cell, then native cell, then cell, each possessing its own URI and related interpretive metadata. The relationship connecting the more precise to the less precise term in this chain is denoted SubclassOf. Ciliated cell is equivalent to a native cell that has plasma membrane part cilium. Semantic characteristics of terms and relationships are used to infer relationships among terms that may not have relations directly specified in available ontologies.
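The SubclassOf chain described above can be sketched as a small parent map. This is a language-neutral illustration in Python, not the ontoProc or ontologyIndex API: `PARENT`, `ancestors`, and `is_subclass_of` are hypothetical names, only the four terms named in the text are included, and each term is given a single parent for simplicity, whereas real ontology terms may have several.

```python
# Toy SubclassOf graph holding only the terms named in the text.
# PARENT, ancestors() and is_subclass_of() are illustrative names,
# not functions from ontoProc or ontologyIndex.
PARENT = {
    "lung ciliated cell": "ciliated cell",
    "ciliated cell": "native cell",
    "native cell": "cell",
    "cell": None,  # root of this toy chain
}


def ancestors(term):
    """Return the chain of progressively less precise terms above `term`."""
    chain = []
    parent = PARENT[term]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def is_subclass_of(term, candidate):
    """True if `candidate` is reachable from `term` via SubclassOf edges."""
    return candidate in ancestors(term)


print(ancestors("lung ciliated cell"))
print(is_subclass_of("lung ciliated cell", "cell"))
```

The second call illustrates inference in a very limited sense: no edge links lung ciliated cell directly to cell, yet the relationship follows from the chain.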

Barriers to broad adoption. Given the wealth of material available in biological ontologies, it is somewhat surprising that formal annotation is so seldom used in practice. Barriers to more common use of ontology in data annotation include: (i) The lack of an exact match between the intended term and the terms available in ontologies of interest. (ii) The practical problem of decoding ontology identifiers. A GO tag or CL tag is excellent for programming, but it is clumsy to co-locate the associated natural language term or phrase with the tag. (iii) The likelihood of disagreement about the suitability of terms for conditions observed at the boundaries of knowledge. To help cope with the first of these problems, Bioconductor’s ontoProc package includes a function liberalMap, which searches an ontology for terms lexically close to some target term or phrase. The second problem can be addressed with more elaborate data structures for variable annotation and programming in R, and the third problem will diminish in importance as the value of ontology adoption becomes manifest in more applications.
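To make the first two barriers concrete, here is a minimal sketch of lexical matching and of keeping tags and labels together. It is written in Python with the standard `difflib` module and is not the implementation of liberalMap; `CL_TERMS`, `TAG_TO_LABEL`, and `lexical_candidates` are hypothetical names, the similarity cutoff is an arbitrary assumption, and the labels and tags are the ones listed in the PBMC table of section 3.2.1.

```python
from difflib import get_close_matches

# Label -> tag pairs copied from the PBMC table in section 3.2.1.
CL_TERMS = {
    "CD14-positive monocyte": "CL:0001054",
    "B cell": "CL:0000236",
    "monocyte": "CL:0000576",
    "natural killer cell": "CL:0000623",
    "dendritic cell": "CL:0000451",
    "megakaryocyte": "CL:0000556",
}

# Reverse lookup so a tag can always be shown with its label (barrier ii).
TAG_TO_LABEL = {tag: label for label, tag in CL_TERMS.items()}


def lexical_candidates(phrase, terms=CL_TERMS, n=3, cutoff=0.6):
    """Return (label, tag) pairs whose labels are lexically close to `phrase`.

    Matching is case-insensitive; `cutoff` is an assumed similarity floor.
    """
    by_lower = {label.lower(): label for label in terms}
    hits = get_close_matches(phrase.lower(), list(by_lower), n=n, cutoff=cutoff)
    return [(by_lower[h], terms[by_lower[h]]) for h in hits]


print(lexical_candidates("Dendritic Cells"))
print(TAG_TO_LABEL["CL:0000236"])
```

A free-text label that differs from the formal term only in case and plural form still finds its candidate, and the reverse lookup keeps the human-readable label within reach of the tag.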

Class vs. instance. It is important to distinguish the practice of designing and maintaining ontologies from the use of ontological class terms to annotate instances of the concepts. The combination of an ontology and a set of annotated instances is called a knowledge base. To illustrate some of the salient distinctions here, consider the cell line called A549, which was established from a human lung adenocarcinoma sample. There is no mention of A549 in the Cell Ontology. However, A549 is present in the EBI Experimental Factor Ontology as a subclass of the “Homo sapiens cell line” class. Presumably this is because A549 is a class of cells that are widely used experimentally, and this cell line constitutes a concept deserving of mapping in the universe of experimental factors. In the universe of concepts related to cell structure and function per se, A549 is an individual that can be characterized through possession of or lack of properties enumerated in Cell Ontology, but it is not deserving of inclusion in that ontology.

3.2 Illustration in a single-cell RNA-seq dataset

The 10X Genomics corporation has distributed a dataset on results of sequencing 10000 PBMC from a healthy donor. Subsets of the data are used in tutorials for the Seurat analytical suite (Butler et al. (2018)).

3.2.1 Labeling PBMC in the Seurat tutorial

One result of the tutorial analysis of the 3000 cell subset is a table of cell types and expression-based markers of cell identity. The first three columns of the table below are from concluding material in the Seurat tutorial; the remaining columns are created by “manual” matching between the Seurat terms and terms found in Cell Ontology.

| grp | markers | seurTutType | formal | tag |
|---|---|---|---|---|
| 0 | IL7R | CD4 T cells | CD4-positive helper T cell | CL:0000492 |
| 1 | CD14, LYZ | CD14+ Monocytes | CD14-positive monocyte | CL:0001054 |
| 2 | MS4A1 | B cells | B cell | CL:0000236 |
| 3 | CD8A | CD8 T cells | CD8-positive, alpha-beta T cell | CL:0000625 |
| 4 | FCGR3A, MS4A7 | FCGR3A+ Monocytes | monocyte | CL:0000576 |
| 5 | GNLY, NKG7 | NK cells | natural killer cell | CL:0000623 |
| 6 | FCER1A, CST3 | Dendritic Cells | dendritic cell | CL:0000451 |
| 7 | PPBP | Megakaryocytes | megakaryocyte | CL:0000556 |
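Once such a table exists, attaching formal terms to individual cells is a simple lookup on the cluster id. The sketch below, in Python rather than the package's R interface, copies the rows of the table above; `PBMC_ANNOTATION` and `annotate` are hypothetical names, and the per-cell cluster ids are invented for illustration.

```python
# Rows of the table above: grp -> (Seurat tutorial label, formal label, CL tag).
PBMC_ANNOTATION = {
    0: ("CD4 T cells", "CD4-positive helper T cell", "CL:0000492"),
    1: ("CD14+ Monocytes", "CD14-positive monocyte", "CL:0001054"),
    2: ("B cells", "B cell", "CL:0000236"),
    3: ("CD8 T cells", "CD8-positive, alpha-beta T cell", "CL:0000625"),
    4: ("FCGR3A+ Monocytes", "monocyte", "CL:0000576"),
    5: ("NK cells", "natural killer cell", "CL:0000623"),
    6: ("Dendritic Cells", "dendritic cell", "CL:0000451"),
    7: ("Megakaryocytes", "megakaryocyte", "CL:0000556"),
}


def annotate(clusters):
    """Attach the formal label and CL tag to each per-cell cluster id."""
    rows = []
    for grp in clusters:
        _, formal, tag = PBMC_ANNOTATION[grp]
        rows.append({"grp": grp, "formal": formal, "tag": tag})
    return rows


cells = [2, 0, 5, 4]  # hypothetical cluster ids for four cells
for row in annotate(cells):
    print(row)
```

Note that group 4 (FCGR3A+ Monocytes) is matched to the general term monocyte, so the formal annotation is less specific than the tutorial label.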

3.2.2 Relationships asserted in the Cell Ontology

Given the informally selected tags in the table above, we can sketch the Cell Ontology graph connecting the associated cell types. The ontoProc package adds functionality to ontologyPlot with make_graphNEL_from_ontology_plot. This allows use of all Rgraphviz and igraph visualization facilities for graphs derived from ontology structures.

Here we display the PBMC cell sets reported in the Seurat tutorial.

3.2.3 Molecular features asserted in the Cell Ontology

The CLfeats function traces relationships and properties from a given Cell Ontology class. Briefly, each class can assert that it is the intersection_of other classes, and has_part, lacks_part, has_plasma_membrane_part, lacks_plasma_membrane_part can be asserted as relationships holding between cell type instances and cell components. The components are often cross-referenced to Protein Ontology or Gene Ontology. When the Protein Ontology component has a synonym for which an HGNC symbol is provided, that symbol is retrieved by CLfeats. Here we obtain the listing for a mature CD1a-positive dermal dendritic cell.

## no recognized predicate references for CL:0002531

The ctmarks function starts a shiny app that generates tables of this sort for selected cell types.

ctmarks snapshot
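The kind of tracing that CLfeats is described as performing at the start of this subsection can be sketched as a walk up the class hierarchy that collects asserted relationships. This Python sketch is not the CLfeats implementation: the `CLASSES` layout and `trace_features` are hypothetical, and the only assertion encoded is the one stated earlier in this vignette, that a ciliated cell has plasma membrane part cilium.

```python
# Toy class definitions. Only the 'ciliated cell' assertion comes from the text;
# the dictionary layout and trace_features() are hypothetical, not ontoProc's API.
CLASSES = {
    "cell": {"parents": [], "relations": []},
    "native cell": {"parents": ["cell"], "relations": []},
    "ciliated cell": {
        "parents": ["native cell"],
        "relations": [("has_plasma_membrane_part", "cilium")],
    },
    "lung ciliated cell": {"parents": ["ciliated cell"], "relations": []},
}


def trace_features(term):
    """Collect (predicate, component, asserting class) for `term` and its ancestors."""
    found, seen, stack = [], set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting shared ancestors
        seen.add(current)
        entry = CLASSES[current]
        for predicate, component in entry["relations"]:
            found.append((predicate, component, current))
        stack.extend(entry["parents"])
    return found


print(trace_features("lung ciliated cell"))
print(trace_features("native cell"))
```

The first call reports the cilium relationship inherited from ciliated cell; the second returns an empty list because nothing is asserted along that chain in this toy data.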


3.2.4 Mapping from gene ‘presence/role’ to cell type

The sym2CellOnto function helps find mention of given gene symbols in properties or parts of cell types.

## < table of extent 0 >

3.3 Adding terms to ontology_index structures to ‘extend’ Cell Ontology

The task of extending an ontology is partly bureaucratic in nature and depends on a collection of endorsements and updates to centralized information structures. To permit experimentation with interfaces and new content that may be quite speculative, we include an approach for adding new ontology ‘terms’, structured like those endorsed in Cell Ontology, to ontologyIndex-based ontology_index instances.

3.3.1 Use case: a set of cell types defined by “diagonal expression”

For a demonstration, we consider the discussion in Bakken et al. (2017) of a ‘diagonal’ expression pattern defining a group of novel cell types. A set of genes is identified, and cells are distinguished by expressing exactly one gene from the set.
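The rule "expresses exactly one gene from the set" can be sketched directly. This is an illustration in Python under stated assumptions, not the procedure of Bakken et al. (2017) or any ontoProc function: `diagonal_types` is a hypothetical helper, a gene is treated as expressed when its value exceeds an assumed threshold, and the gene names and values are invented.

```python
def diagonal_types(expression, genes, threshold=0.0):
    """Map each cell to the single gene it expresses from `genes`, else None.

    `expression` maps cell id -> {gene: value}. A gene counts as expressed
    when its value exceeds `threshold` (an assumed, illustrative rule).
    """
    assignment = {}
    for cell, profile in expression.items():
        expressed = [g for g in genes if profile.get(g, 0.0) > threshold]
        # Only cells expressing exactly one gene of the set receive a type.
        assignment[cell] = expressed[0] if len(expressed) == 1 else None
    return assignment


# Hypothetical gene names and values, for illustration only.
genes = ["geneA", "geneB", "geneC"]
expression = {
    "cell1": {"geneA": 5.2, "geneB": 0.0, "geneC": 0.0},
    "cell2": {"geneA": 0.0, "geneB": 3.1, "geneC": 0.0},
    "cell3": {"geneA": 2.0, "geneB": 1.5, "geneC": 0.0},  # two genes expressed
    "cell4": {"geneA": 0.0, "geneB": 0.0, "geneC": 0.0},  # none expressed
}

print(diagonal_types(expression, genes))
```

Ordering the typed cells by their assigned gene places the nonzero values along the diagonal of the gene-by-type matrix, which is one way to read the ‘diagonal’ label.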