Contents

1 Introduction

Feature rankings are important in analyzing high-throughput data, particularly for bioinformatics studies. These rankings are typically obtained by calculating the marginal correlations with the outcomes. The ordered feature list reflects the importance of features, which plays a vital role in guiding subsequent research. For instance, researchers usually focus on a small subset of important features that are associated with research objectives. However, the feature ranking can be distorted by a single case. The case exerts abnormal influence on the feature ranking is termed as influential points (IPs). The presence of IPs renders the feature ranking unreliable, consequently affecting the subsequent analysis based on feature rankings.

The findIPs R package is specifically designed to detect IPs for feature rankings. The method utilized in this package is based on the leave-one-out strategy, which involves comparing the rank difference between the original ranking and a new ranking that is obtained by removing a single observation (1). The new rankings are leave-one-out rankings. The rank difference obtained through this comparison helps to determine the influence of the deleted observation.

The whole process can be divided into three steps,

Step 1, generate the original and leave-one-out rankings using a feature ranking method, such as t-test. A dataset with n cases will result in one original ranking and n leave-one-out feature rankings.

Step 2, calculate rank changes. It is advisable to use top-prioritized weights when comparing ranks.

Step 3, calculate the cumulative rank changes for each observation. A diagnostic check is also required to identify any potential influential points.

2 Installation

The findIPs package can be installed from Bioconductor using the following commands:

if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
BiocManager::install("findIPs")

3 Dataset

The findIPs package includes the miller05 microarray data (2). The data contains 236 samples and 1,000 genes. It has two types of responses: binary and survival. The binary response is classified based on the p53 mutation: 58 cases with a p53 mutant and 193 cases with the wild-type p53 mutation. The survival response is associated with a total of 55 recorded events.

library(findIPs)
data(miller05)
X <- miller05$X
y <- miller05$y
surv <- miller05$surv

4 Detect IPs using getdrop1ranks() and sumRanks()

We use a simple example where features are ranked based on t.test to demonstrate the use of findIPs package for IPs detection. We use getdrop1ranks() to derive the original ranking and leave-one-out rankings. Features are simply ranked according to the p.values of t.test. Of note, the rank criteria is the p.value if fun = “t.test”. P.values are ranked in ascending order by specifying decreasing = FALSE. We select the top 100 important features in the original ranking. The function returns an object containing a vector of original ranking (origRank) and a matrix of leave-one-out rankings (drop1Rank).

obj <- getdrop1ranks(X, y,
                     fun = "t.test",
                     decreasing = FALSE,
                     topN = 100)
str(obj)
## List of 2
##  $ origRank : chr [1:100] "202580_x_at" "205240_at" "205394_at" "212494_at" ...
##  $ drop1Rank: chr [1:100, 1:236] "202580_x_at" "205240_at" "205394_at" "212494_at" ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...

After obtaining the original ranking and leave-one-out rankings using the getdrop1ranks() function, we use the sumRanks() function to compute the distance between them. This function provides three methods for comparing ranks: unweighted, weighted Spearman, and method with adaptive weights. The unweighted method directly compares the ranks and assumes that all ranks have equal importance. However, this is not always the case as the top-ranked methods are usually more important. The weighted Spearman and adaptive weights methods address this issue by emphasizing the importance of the top-ranked methods (3). The adaptive weights method can further adjust the weights based on the distribution of rank changes in the data.

results <- sumRanks(origRank = obj$origRank,
                    drop1Rank = obj$drop1Rank,
                    topN = 100,
                    method = "adaptive")

str(results)
## List of 6
##  $ kappa            : num 0.0226
##  $ score            : Named num [1:236] 0.2 2.476 2.01 1.648 0.254 ...
##   ..- attr(*, "names")= chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...
##  $ origRank         : chr [1:100] "202580_x_at" "205240_at" "205394_at" "212494_at" ...
##  $ drop1Rank        : int [1:100, 1:236] 1 2 3 4 6 5 7 8 9 10 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...
##  $ origRankWeighted : num [1:100] 0.025 0.0244 0.0239 0.0233 0.0228 ...
##  $ drop1RankWeighted: num [1:100, 1:236] 0.025 0.0244 0.0239 0.0233 0.0223 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...

The outputs of sumRanks() are diverse across the selected methods. For method = “adaptive”, sumRanks() returns a list with the following elements:

1, kappa, the shape parameter of the adaptive weights method;
2, score, the accumulated weighted rank changes, reflecting the influence of each sample;
3, origRank, the original ranking;
4, drop1Rank, the leave-one-out rankings;
5, origRankWeighted, weighted original ranking;
6, drop1RankWeighted, weighted leave-one-out rankings.

However, if the method is “weightedSpearman” or “unweighted”, the function will only return three elements: “score”, “origRank”, and “drop1RankWeighted”. The elements “kappa”, “origRankWeighted”, and “drop1RankWeighted” will not be returned.

5 Use findIPs() to detect IPs in one-step

findIPs() combines getdrop1ranks() and sumRanks() into one step. The output is identical to that using the two-step process.

results.ipsr <- findIPs(X, y, 
                        fun = "t.test",
                        decreasing = FALSE,
                        method = "adaptive")

identical(results, results.ipsr)
## [1] TRUE

6 Results visualization

findIPs package offers three visualization functions: plotIPs(), plotRankScatters(), and plotAdaptiveWeights(). plotIPs() can directly utilize the output of findIPs() or sumRanks() to create a lollipop plot that displays the influence of each case. In Figure 1, we can see that the observation 68 (obs68) seems to be more influential on the final results. However, the difference between obs68 and the other observations is not that distinct, indicating a lower possibility of the presence of an influential observation.

par(mar = c(4, 4, 2, 2))
plotIPs(results.ipsr, topn = 5, ylim = c(0, 8))
Figure 1, the influence scores for each observation

Figure 1: Figure 1, the influence scores for each observation

In addition to the lollipop, findIPs also provides a simple visualization function plotRankScatters() that exhibits the distribution of rank changes using a scatter plot (Figure 2). Alike plotIPs(), plotRankScatters() simply uses the output of findIPs() or sumRanks(). According to Figure 2, we can observe more rank changes in the tail side, but less changes in the head. The black points denote the rank changes caused by the most influential case.

par(mar = c(4, 4, 2, 2))
plotRankScatters(results.ipsr)
Figure 2, the distribution of rank changes

Figure 2: Figure 2, the distribution of rank changes

The plotAdaptiveWeights() function aims to visualize weight function if adaptive weights are used for rank comparison, that is method = “adaptive” for findIPs() or sumRanks(). The argument kappa refers to the shape parameter of the weight function. Here, the optimized kappa is 0.023 (Figure 3). n is the length of the feature list. We select the top 100 features, hence, n = 100. We can observe that more weights are allocated to the top-ranked features.

par(mar = c(4, 4, 2, 2))
plotAdaptiveWeights(results.ipsr$kappa, n = 100, type = "line")
Figure 3, the weight function of the adaptive weights

Figure 3: Figure 3, the weight function of the adaptive weights

7 Use findIPs in survival data

For survival analysis, we offer the option to rank features using univariate Cox model by setting fun = “cox”. The features are ranked in ascending order based on their P-values.

par(mar = c(4, 4, 2, 2))
results.cox <- findIPs(X, surv, 
                       fun = "cox",
                       decreasing = FALSE,
                       method = "adaptive")

plotIPs(results.cox)
Figure 4, IPs detection for survival data

Figure 4: Figure 4, IPs detection for survival data

8 Customize the rank criteria

In addition to the provided ranking criteria in findIPs() or sumRanks(), which includes “t.test”, “cox”, “log2fc”, and “kruskal.test”. We can also rank features based on a specified rank criterion. To this end, we can pass a function to the fun argument. The function should take x and y as inputs and output the rank criterion, such as p-values.

As an example, we can rank features based on the p-values obtained from the kruskal.test. We can either specify fun = “kruskal.test”, as this test has been implemented in the package, or define our own function passing to getdrop1ranks. Both methods produce the same results.

fun <- function(x, y){
  kruskal.test(x, y)$p.value
}

kruskal.test1 <- getdrop1ranks(X, y, 
                               fun = fun, 
                               decreasing = FALSE)

kruskal.test2 <- getdrop1ranks(X, y, 
                               fun = "kruskal.test", 
                               decreasing = FALSE)

identical(kruskal.test1, kruskal.test2)
## [1] TRUE

9 The choice of rank comparison methods

findIPs provides three rank comparison methods: unweighted, weighted Spearman, and adaptive weights. We recommend using the adaptive weights. Here, we compare the three methods. Weighted Spearman and the adaptive weights method demonstrates similar results. But both are different from the unweighted method.

re.unweighted <- findIPs(X, y, 
                         fun = "t.test",
                         decreasing = FALSE,
                         method = "unweighted")
re.weighted <- findIPs(X, y, 
                       fun = "t.test",
                       decreasing = FALSE,
                       method = "weightedSpearman")
re.adaptive <- findIPs(X, y, 
                       fun = "t.test",
                       decreasing = FALSE,
                       method = "adaptive")

par(mfrow = c(1, 3), mar = c(4, 4, 2, 2))
plotIPs(re.unweighted)
mtext("A, unweighted", side = 3, line = 0, adj = 0)
plotIPs(re.weighted)
mtext("B, weighted Spearman", side = 3, line = 0, adj = 0)
plotIPs(re.adaptive)
mtext("C, adaptive weights", side = 3, line = 0, adj = 0)
Figure 5, IPs detection using three rank comparison methods

Figure 5: Figure 5, IPs detection using three rank comparison methods

References

Appendix

1, Wang, Shuo, and Junyan Lu. “Detect influential points of feature rankings.” arXiv preprint arXiv:2303.10516 (2023).
2, Miller, Lance D., et al. “An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival.” Proceedings of the National Academy of Sciences 102.38 (2005): 13550-13555.doi:10.1073pnas.0506230102
3, Da Pinto Costa, Joaquim; Soares, Carlos (2005): A weighted rank measure of correlation. In Australian & New Zealand Journal of Statistics 47 (4), pp.  515–529.

Session info

## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] findIPs_1.1.0    BiocStyle_2.33.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9                  SparseArray_1.5.0          
##  [3] lattice_0.22-6              digest_0.6.35              
##  [5] magrittr_2.0.3              evaluate_0.23              
##  [7] grid_4.4.0                  bookdown_0.39              
##  [9] fastmap_1.1.1               jsonlite_1.8.8             
## [11] Matrix_1.7-0                GenomeInfoDb_1.41.0        
## [13] survival_3.6-4              tinytex_0.50               
## [15] BiocManager_1.30.22         httr_1.4.7                 
## [17] UCSC.utils_1.1.0            codetools_0.2-20           
## [19] jquerylib_0.1.4             abind_1.4-5                
## [21] cli_3.6.2                   rlang_1.1.3                
## [23] crayon_1.5.2                XVector_0.45.0             
## [25] Biobase_2.65.0              splines_4.4.0              
## [27] cachem_1.0.8                DelayedArray_0.31.0        
## [29] yaml_2.3.8                  S4Arrays_1.5.0             
## [31] tools_4.4.0                 parallel_4.4.0             
## [33] BiocParallel_1.39.0         GenomeInfoDbData_1.2.12    
## [35] SummarizedExperiment_1.35.0 BiocGenerics_0.51.0        
## [37] R6_2.5.1                    matrixStats_1.3.0          
## [39] stats4_4.4.0                lifecycle_1.0.4            
## [41] magick_2.8.3                zlibbioc_1.51.0            
## [43] S4Vectors_0.43.0            IRanges_2.39.0             
## [45] bslib_0.7.0                 Rcpp_1.0.12                
## [47] xfun_0.43                   GenomicRanges_1.57.0       
## [49] highr_0.10                  MatrixGenerics_1.17.0      
## [51] knitr_1.46                  htmltools_0.5.8.1          
## [53] rmarkdown_2.26              compiler_4.4.0