Feature rankings are important in analyzing high-throughput data, particularly for bioinformatics studies. These rankings are typically obtained by calculating the marginal correlations with the outcomes. The ordered feature list reflects the importance of features, which plays a vital role in guiding subsequent research. For instance, researchers usually focus on a small subset of important features that are associated with research objectives. However, the feature ranking can be distorted by a single case. The case exerts abnormal influence on the feature ranking is termed as influential points (IPs). The presence of IPs renders the feature ranking unreliable, consequently affecting the subsequent analysis based on feature rankings.

The *findIPs* R package is specifically designed to detect IPs for feature
rankings. The method utilized in this package is based on the leave-one-out
strategy, which involves comparing the rank difference between the original
ranking and a new ranking that is obtained by removing a single observation (1).
The new rankings are leave-one-out rankings. The rank difference obtained
through this comparison helps to determine the influence of the deleted
observation.

The whole process can be divided into three steps,

Step 1, generate the original and leave-one-out rankings using a feature ranking method, such as t-test. A dataset with n cases will result in one original ranking and n leave-one-out feature rankings.

Step 2, calculate rank changes. It is advisable to use top-prioritized weights when comparing ranks.

Step 3, calculate the cumulative rank changes for each observation. A diagnostic check is also required to identify any potential influential points.

The *findIPs* package can be installed from Bioconductor using the
following commands:

```
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("findIPs")
```

The *findIPs* package includes the miller05 microarray data (2). The data
contains 236 samples and 1,000 genes. It has two types of responses: binary
and survival. The binary response is classified based on the p53 mutation: 58
cases with a p53 mutant and 193 cases with the wild-type p53 mutation. The
survival response is associated with a total of 55 recorded events.

```
library(findIPs)
data(miller05)
X <- miller05$X
y <- miller05$y
surv <- miller05$surv
```

We use a simple example where features are ranked based on t.test to demonstrate
the use of *findIPs* package for IPs detection. We use *getdrop1ranks()* to
derive the original ranking and leave-one-out rankings. Features are simply
ranked according to the p.values of t.test. Of note, the rank criteria is the
p.value if *fun = “t.test”*. P.values are ranked in ascending order by
specifying *decreasing = FALSE*. We select the top 100 important features in
the original ranking. The function returns an object containing a vector of
original ranking (origRank) and a matrix of leave-one-out rankings (drop1Rank).

```
obj <- getdrop1ranks(X, y,
fun = "t.test",
decreasing = FALSE,
topN = 100)
str(obj)
```

```
## List of 2
## $ origRank : chr [1:100] "202580_x_at" "205240_at" "205394_at" "212494_at" ...
## $ drop1Rank: chr [1:100, 1:236] "202580_x_at" "205240_at" "205394_at" "212494_at" ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...
```

After obtaining the original ranking and leave-one-out rankings using the
*getdrop1ranks()* function, we use the *sumRanks()* function to compute the
distance between them. This function provides three methods for comparing ranks:
unweighted, weighted Spearman, and method with adaptive weights. The unweighted
method directly compares the ranks and assumes that all ranks have equal
importance. However, this is not always the case as the top-ranked methods are
usually more important. The weighted Spearman and adaptive weights methods
address this issue by emphasizing the importance of the top-ranked methods (3).
The adaptive weights method can further adjust the weights based on the
distribution of rank changes in the data.

```
results <- sumRanks(origRank = obj$origRank,
drop1Rank = obj$drop1Rank,
topN = 100,
method = "adaptive")
str(results)
```

```
## List of 6
## $ kappa : num 0.0226
## $ score : Named num [1:236] 0.2 2.476 2.01 1.648 0.254 ...
## ..- attr(*, "names")= chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...
## $ origRank : chr [1:100] "202580_x_at" "205240_at" "205394_at" "212494_at" ...
## $ drop1Rank : int [1:100, 1:236] 1 2 3 4 6 5 7 8 9 10 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...
## $ origRankWeighted : num [1:100] 0.025 0.0244 0.0239 0.0233 0.0228 ...
## $ drop1RankWeighted: num [1:100, 1:236] 0.025 0.0244 0.0239 0.0233 0.0223 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:236] "GSM79114" "GSM79115" "GSM79116" "GSM79118" ...
```

The outputs of *sumRanks()* are diverse across the selected methods. For
*method = “adaptive”*, *sumRanks()* returns a list with the following elements:

1, kappa, the shape parameter of the adaptive weights method;

2, score, the accumulated weighted rank changes, reflecting the influence of
each sample;

3, origRank, the original ranking;

4, drop1Rank, the leave-one-out rankings;

5, origRankWeighted, weighted original ranking;

6, drop1RankWeighted, weighted leave-one-out rankings.

However, if the method is “weightedSpearman” or “unweighted”, the function will only return three elements: “score”, “origRank”, and “drop1RankWeighted”. The elements “kappa”, “origRankWeighted”, and “drop1RankWeighted” will not be returned.

*findIPs()* combines *getdrop1ranks()* and *sumRanks()* into one step. The
output is identical to that using the two-step process.

```
results.ipsr <- findIPs(X, y,
fun = "t.test",
decreasing = FALSE,
method = "adaptive")
identical(results, results.ipsr)
```

`## [1] TRUE`

*findIPs* package offers three visualization functions: *plotIPs()*,
*plotRankScatters()*, and *plotAdaptiveWeights()*. *plotIPs()* can directly
utilize the output of *findIPs()* or *sumRanks()* to create a lollipop plot that
displays the influence of each case. In Figure 1, we can see that the
observation 68 (obs68) seems to be more influential on the final results.
However, the difference between obs68 and the other observations is not that
distinct, indicating a lower possibility of the presence of an influential
observation.

```
par(mar = c(4, 4, 2, 2))
plotIPs(results.ipsr, topn = 5, ylim = c(0, 8))
```

In addition to the lollipop, *findIPs* also provides a simple visualization
function *plotRankScatters()* that exhibits the distribution of rank changes
using a scatter plot (Figure 2). Alike *plotIPs()*, *plotRankScatters()* simply
uses the output of *findIPs()* or *sumRanks()*. According to Figure 2, we can
observe more rank changes in the tail side, but less changes in the head. The
black points denote the rank changes caused by the most influential case.

```
par(mar = c(4, 4, 2, 2))
plotRankScatters(results.ipsr)
```

The *plotAdaptiveWeights()* function aims to visualize weight function if
adaptive weights are used for rank comparison, that is *method = “adaptive”*
for *findIPs()* or *sumRanks()*. The argument *kappa* refers to the shape
parameter of the weight function. Here, the optimized kappa is 0.023 (Figure 3).
n is the length of the feature list. We select the top 100 features, hence,
*n = 100*. We can observe that more weights are allocated to the top-ranked
features.

```
par(mar = c(4, 4, 2, 2))
plotAdaptiveWeights(results.ipsr$kappa, n = 100, type = "line")
```

For survival analysis, we offer the option to rank features using univariate
Cox model by setting *fun = “cox”*. The features are ranked in ascending order
based on their P-values.

```
par(mar = c(4, 4, 2, 2))
results.cox <- findIPs(X, surv,
fun = "cox",
decreasing = FALSE,
method = "adaptive")
plotIPs(results.cox)
```

In addition to the provided ranking criteria in *findIPs()* or *sumRanks()*,
which includes “t.test”, “cox”, “log2fc”, and “kruskal.test”. We can also rank
features based on a specified rank criterion. To this end, we can pass a function
to the *fun* argument. The function should take *x* and *y* as inputs and
output the rank criterion, such as p-values.

As an example, we can rank features based on the p-values obtained from the
kruskal.test. We can either specify *fun = “kruskal.test”*, as this test has
been implemented in the package, or define our own function passing to
*getdrop1ranks*. Both methods produce the same results.

```
fun <- function(x, y){
kruskal.test(x, y)$p.value
}
kruskal.test1 <- getdrop1ranks(X, y,
fun = fun,
decreasing = FALSE)
kruskal.test2 <- getdrop1ranks(X, y,
fun = "kruskal.test",
decreasing = FALSE)
identical(kruskal.test1, kruskal.test2)
```

`## [1] TRUE`

*findIPs* provides three rank comparison methods: unweighted, weighted Spearman,
and adaptive weights. We recommend using the adaptive weights. Here, we compare
the three methods. Weighted Spearman and the adaptive weights method
demonstrates similar results. But both are different from the unweighted method.

```
re.unweighted <- findIPs(X, y,
fun = "t.test",
decreasing = FALSE,
method = "unweighted")
re.weighted <- findIPs(X, y,
fun = "t.test",
decreasing = FALSE,
method = "weightedSpearman")
re.adaptive <- findIPs(X, y,
fun = "t.test",
decreasing = FALSE,
method = "adaptive")
par(mfrow = c(1, 3), mar = c(4, 4, 2, 2))
plotIPs(re.unweighted)
mtext("A, unweighted", side = 3, line = 0, adj = 0)
plotIPs(re.weighted)
mtext("B, weighted Spearman", side = 3, line = 0, adj = 0)
plotIPs(re.adaptive)
mtext("C, adaptive weights", side = 3, line = 0, adj = 0)
```

1, Wang, Shuo, and Junyan Lu. “Detect influential points of feature rankings.”
arXiv preprint arXiv:2303.10516 (2023).

2, Miller, Lance D., et al. “An expression signature for p53 status in human
breast cancer predicts mutation status, transcriptional effects, and patient
survival.” Proceedings of the National Academy of Sciences 102.38 (2005):
13550-13555.doi:10.1073pnas.0506230102

3, Da Pinto Costa, Joaquim; Soares, Carlos (2005): A weighted rank measure of
correlation. In Australian & New Zealand Journal of Statistics 47 (4), pp.
515–529.

```
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] findIPs_1.0.0 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 SparseArray_1.4.0
## [3] lattice_0.22-6 digest_0.6.35
## [5] magrittr_2.0.3 evaluate_0.23
## [7] grid_4.4.0 bookdown_0.39
## [9] fastmap_1.1.1 jsonlite_1.8.8
## [11] Matrix_1.7-0 GenomeInfoDb_1.40.0
## [13] survival_3.6-4 tinytex_0.50
## [15] BiocManager_1.30.22 httr_1.4.7
## [17] UCSC.utils_1.0.0 codetools_0.2-20
## [19] jquerylib_0.1.4 abind_1.4-5
## [21] cli_3.6.2 rlang_1.1.3
## [23] crayon_1.5.2 XVector_0.44.0
## [25] Biobase_2.64.0 splines_4.4.0
## [27] cachem_1.0.8 DelayedArray_0.30.0
## [29] yaml_2.3.8 S4Arrays_1.4.0
## [31] tools_4.4.0 parallel_4.4.0
## [33] BiocParallel_1.38.0 GenomeInfoDbData_1.2.12
## [35] SummarizedExperiment_1.34.0 BiocGenerics_0.50.0
## [37] R6_2.5.1 matrixStats_1.3.0
## [39] stats4_4.4.0 lifecycle_1.0.4
## [41] magick_2.8.3 zlibbioc_1.50.0
## [43] S4Vectors_0.42.0 IRanges_2.38.0
## [45] bslib_0.7.0 Rcpp_1.0.12
## [47] xfun_0.43 GenomicRanges_1.56.0
## [49] highr_0.10 MatrixGenerics_1.16.0
## [51] knitr_1.46 htmltools_0.5.8.1
## [53] rmarkdown_2.26 compiler_4.4.0
```