We read in input.scone.csv, which is our file modified (and renamed) from the get.marker.names() function. The K-nearest neighbor generation is derived from the Fast Nearest Neighbors (FNN) R package, within our function Fnn(), which takes as input the “input markers” to be used, along with the concatenated data previously generated, and the desired k. We advise the default selection to the total number of cells in the dataset divided by 100, as has been optimized on existing mass cytometry datasets. The output of this function is a matrix of each cell and the identity of its k-nearest neighbors, in terms of its row number in the dataset used here as input.
library(Sconify)
# Markers from the user-generated excel file
marker.file <- system.file('extdata', 'markers.csv', package = "Sconify")
markers <- ParseMarkers(marker.file)
# How to convert your excel sheet into vector of static and functional markers
markers
## $input
## [1] "CD3(Cd110)Di" "CD3(Cd111)Di" "CD3(Cd112)Di"
## [4] "CD235-61-7-15(In113)Di" "CD3(Cd114)Di" "CD45(In115)Di"
## [7] "CD19(Nd142)Di" "CD22(Nd143)Di" "IgD(Nd145)Di"
## [10] "CD79b(Nd146)Di" "CD20(Sm147)Di" "CD34(Nd148)Di"
## [13] "CD179a(Sm149)Di" "CD72(Eu151)Di" "IgM(Eu153)Di"
## [16] "Kappa(Sm154)Di" "CD10(Gd156)Di" "Lambda(Gd157)Di"
## [19] "CD24(Dy161)Di" "TdT(Dy163)Di" "Rag1(Dy164)Di"
## [22] "PreBCR(Ho165)Di" "CD43(Er167)Di" "CD38(Er168)Di"
## [25] "CD40(Er170)Di" "CD33(Yb173)Di" "HLA-DR(Yb174)Di"
##
## $functional
## [1] "pCrkL(Lu175)Di" "pCREB(Yb176)Di" "pBTK(Yb171)Di" "pS6(Yb172)Di"
## [5] "cPARP(La139)Di" "pPLCg2(Pr141)Di" "pSrc(Nd144)Di" "Ki67(Sm152)Di"
## [9] "pErk12(Gd155)Di" "pSTAT3(Gd158)Di" "pAKT(Tb159)Di" "pBLNK(Gd160)Di"
## [13] "pP38(Tm169)Di" "pSTAT5(Nd150)Di" "pSyk(Dy162)Di" "tIkBa(Er166)Di"
# Get the particular markers to be used as knn and knn statistics input
input.markers <- markers[[1]]
funct.markers <- markers[[2]]
# Selection of the k. See "Finding Ideal K" vignette
k <- 30
# The built-in scone functions
wand.nn <- Fnn(cell.df = wand.combined, input.markers = input.markers, k = k)
# Cell identity is in rows, k-nearest neighbors are columns
# List of 2 includes the cell identity of each nn,
# and the euclidean distance between
# itself and the cell of interest
# Indices
str(wand.nn[[1]])
## int [1:1000, 1:30] 898 731 292 298 289 135 650 287 505 933 ...
wand.nn[[1]][1:20, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 898 206 529 453 840 899 173 718 403 349
## [2,] 731 154 141 191 980 36 820 747 640 63
## [3,] 292 755 7 262 650 424 745 672 118 553
## [4,] 298 350 312 890 561 480 68 803 157 885
## [5,] 289 881 324 606 866 326 787 773 438 513
## [6,] 135 304 630 129 662 206 303 620 759 53
## [7,] 650 454 672 785 515 270 745 553 922 697
## [8,] 287 612 348 38 454 861 78 118 7 499
## [9,] 505 463 725 692 494 930 176 279 724 665
## [10,] 933 332 650 923 738 557 525 285 553 591
## [11,] 754 35 717 344 929 848 432 831 334 315
## [12,] 857 462 904 339 880 538 837 126 271 318
## [13,] 552 688 318 904 274 586 958 823 501 32
## [14,] 285 62 835 557 763 292 976 723 246 935
## [15,] 22 61 678 993 484 910 339 303 274 554
## [16,] 177 771 137 273 126 782 12 166 909 349
## [17,] 26 939 214 28 389 808 381 957 974 663
## [18,] 466 874 616 276 600 638 761 675 476 812
## [19,] 144 198 90 532 262 213 722 557 641 723
## [20,] 144 641 573 270 292 193 697 913 182 275
# Distance
str(wand.nn[[2]])
## num [1:1000, 1:30] 3.42 3.34 3.98 3.09 4.14 ...
wand.nn[[2]][1:20, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 3.417182 3.823747 3.992522 4.097273 4.266755 4.362201 4.485037 4.499253
## [2,] 3.339616 4.293531 4.398688 4.509582 4.560838 4.635994 4.639125 4.650246
## [3,] 3.976396 4.026170 4.247808 4.313530 4.328556 4.387221 4.389244 4.430290
## [4,] 3.088149 3.261006 3.830697 3.834289 3.861594 4.118591 4.120388 4.127187
## [5,] 4.142060 4.170939 4.191703 4.291209 4.310931 4.331572 4.372369 4.452250
## [6,] 3.431967 3.569947 3.583181 3.637451 3.663413 3.721846 3.736576 3.748994
## [7,] 2.847317 3.118686 3.189816 3.487595 3.533406 3.549104 3.708658 3.773533
## [8,] 4.478560 4.660287 4.966901 4.999660 5.011496 5.039738 5.046384 5.056376
## [9,] 2.684652 2.728091 2.854364 2.946794 2.976749 2.986093 3.040271 3.053628
## [10,] 3.103825 3.159509 3.301469 3.500499 3.563065 3.696267 3.716144 3.739260
## [11,] 3.217416 3.420631 3.756897 3.768833 3.773349 3.800538 3.806209 3.820494
## [12,] 2.822860 2.897202 3.203155 3.211140 3.320592 3.467987 3.555936 3.577952
## [13,] 3.100244 3.210254 3.324507 3.392146 3.414274 3.501026 3.503116 3.596555
## [14,] 3.516165 3.604853 3.639298 3.829133 3.846842 3.860399 3.875227 3.912860
## [15,] 3.157874 3.291503 3.309315 3.314453 3.373050 3.385717 3.584945 3.615325
## [16,] 4.989852 5.196302 5.204365 5.471801 5.494584 5.524886 5.551937 5.607712
## [17,] 3.413191 3.930539 3.932257 3.983984 4.231706 4.281360 4.339549 4.376458
## [18,] 2.961004 2.966159 3.089898 3.099248 3.127436 3.127559 3.128041 3.149268
## [19,] 2.853666 2.940074 3.139441 3.192579 3.283071 3.447538 3.463379 3.491702
## [20,] 2.852405 2.856276 2.958130 3.065151 3.132133 3.136652 3.215971 3.239431
## [,9] [,10]
## [1,] 4.535784 4.553176
## [2,] 4.664854 4.669488
## [3,] 4.433452 4.495263
## [4,] 4.215418 4.251469
## [5,] 4.541525 4.610779
## [6,] 3.927853 3.939806
## [7,] 3.774934 3.787945
## [8,] 5.080678 5.085931
## [9,] 3.055422 3.067935
## [10,] 3.746585 3.987337
## [11,] 3.840619 3.859671
## [12,] 3.648404 3.686812
## [13,] 3.614842 3.638911
## [14,] 3.917201 3.917543
## [15,] 3.645284 3.666585
## [16,] 5.629288 5.819080
## [17,] 4.490640 4.585162
## [18,] 3.182050 3.186249
## [19,] 3.515789 3.528559
## [20,] 3.307469 3.312932
This function iterates through each KNN, and performs a series of calculations. The first is fold change values for each maker per KNN, where the user chooses whether this will be based on medians or means. The second is a statistical test, where the user chooses t test or Mann-Whitney U test. I prefer the latter, because it does not assume any properties of the distributions. Of note, the p values are adjusted for false discovery rate, and therefore are called q values in the output of this function. The user also inputs a threshold parameter (default 0.05), where the fold change values will only be shown if the corresponding statistical test returns a q value below said threshold. Finally, the “multiple.donor.compare” option, if set to TRUE will perform a t test based on the mean per-marker values of each donor. This is to allow the user to make comparisons across replicates or multiple donors if that is relevant to the user’s biological questions. This function returns a matrix of cells by computed values (change and statistical test results, labeled either marker.change or marker.qvalue). This matrix is intermediate, as it gets concatenated with the original input matrix in the post-processing step (see the relevant vignette). We show the code and the output below. See the post-processing vignette, where we show how this gets combined with the input data, and additional analysis is performed.
wand.scone <- SconeValues(nn.matrix = wand.nn,
cell.data = wand.combined,
scone.markers = funct.markers,
unstim = "basal")
wand.scone
## # A tibble: 1,000 × 34
## `pCrkL(Lu175)Di.IL7.qvalue` pCREB(Yb176)Di.IL7.qvalu…¹ pBTK(Yb171)Di.IL7.qv…²
## <dbl> <dbl> <dbl>
## 1 0.903 1 0.858
## 2 0.920 0.988 0.659
## 3 1 0.988 0.954
## 4 0.903 0.988 1
## 5 0.962 0.988 0.939
## 6 0.962 1 0.582
## 7 0.892 0.988 0.996
## 8 0.989 0.988 0.996
## 9 0.903 0.988 0.926
## 10 0.962 0.988 0.879
## # ℹ 990 more rows
## # ℹ abbreviated names: ¹`pCREB(Yb176)Di.IL7.qvalue`,
## # ²`pBTK(Yb171)Di.IL7.qvalue`
## # ℹ 31 more variables: `pS6(Yb172)Di.IL7.qvalue` <dbl>,
## # `cPARP(La139)Di.IL7.qvalue` <dbl>, `pPLCg2(Pr141)Di.IL7.qvalue` <dbl>,
## # `pSrc(Nd144)Di.IL7.qvalue` <dbl>, `Ki67(Sm152)Di.IL7.qvalue` <dbl>,
## # `pErk12(Gd155)Di.IL7.qvalue` <dbl>, `pSTAT3(Gd158)Di.IL7.qvalue` <dbl>, …
If one wants to export KNN data to perform other statistics not available in this package, then I provide a function that produces a list of each cell identity in the original input data matrix, and a matrix of all cells x features of its KNN.
I also provide a function to find the KNN density estimation independently of the rest of the “scone.values” analysis, to save time if density is all the user wants. With this density estimation, one can perform interesting analysis, ranging from understanding phenotypic density changes along a developmental progression (see post-processing vignette for an example), to trying out density-based binning methods (eg. X-shift). Of note, this density is specifically one divided by the aveage distance to k-nearest neighbors. This specific measure is related to the Shannon Entropy estimate of that point on the manifold (https://hal.archives-ouvertes.fr/hal-01068081/document).
I use this metric to avoid the unusual properties of the volume of a sphere as it increases in dimensions (https://en.wikipedia.org/wiki/Volume_of_an_n-ball). This being said, one can modify this vector to be such a density estimation (example http://www.cs.haifa.ac.il/~rita/ml_course/lectures_old/KNN.pdf), by treating the distance to knn as the radius of a n-dimensional sphere and incoroprating said volume accordingly.
An individual with basic programming skills can iterate through these elements to perform the statistics of one’s choosing. Examples would include per-KNN regression and classification, or feature imputation. The additional functionality is shown below, with the example knn.list in the package being the first ten instances:
# Constructs KNN list, computes KNN density estimation
wand.knn.list <- MakeKnnList(cell.data = wand.combined, nn.matrix = wand.nn)
wand.knn.list[[8]]
## # A tibble: 30 × 51
## `CD3(Cd110)Di` `CD3(Cd111)Di` `CD3(Cd112)Di` `CD235-61-7-15(In113)Di`
## <dbl> <dbl> <dbl> <dbl>
## 1 -0.257 -0.215 0.205 -0.996
## 2 -0.0527 -0.0208 0.460 0.588
## 3 -1.05 -0.824 -0.453 0.767
## 4 -0.154 1.04 0.596 -1.28
## 5 -0.923 -0.105 0.824 0.756
## 6 -0.259 0.765 -0.193 -0.342
## 7 -0.460 1.67 0.910 -0.576
## 8 -0.123 -0.0398 -0.0825 -0.193
## 9 -0.641 -0.621 -0.456 0.695
## 10 -0.306 -0.664 -0.507 -1.23
## # ℹ 20 more rows
## # ℹ 47 more variables: `CD3(Cd114)Di` <dbl>, `CD45(In115)Di` <dbl>,
## # `CD19(Nd142)Di` <dbl>, `CD22(Nd143)Di` <dbl>, `IgD(Nd145)Di` <dbl>,
## # `CD79b(Nd146)Di` <dbl>, `CD20(Sm147)Di` <dbl>, `CD34(Nd148)Di` <dbl>,
## # `CD179a(Sm149)Di` <dbl>, `CD72(Eu151)Di` <dbl>, `IgM(Eu153)Di` <dbl>,
## # `Kappa(Sm154)Di` <dbl>, `CD10(Gd156)Di` <dbl>, `Lambda(Gd157)Di` <dbl>,
## # `CD24(Dy161)Di` <dbl>, `TdT(Dy163)Di` <dbl>, `Rag1(Dy164)Di` <dbl>, …
# Finds the KNN density estimation for each cell, ordered by column, in the
# original data matrix
wand.knn.density <- GetKnnDe(nn.matrix = wand.nn)
str(wand.knn.density)
## num [1:1000] 0.213 0.208 0.215 0.23 0.215 ...