A function to sort pairs of integers or factors and identify the pairs between two classifications

sort_pairs(c1, c2, spMat = FALSE)

Arguments

c1

A vector of length $n$ with values between 0 and $N_1 < n$ representing the first classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list.

c2

A vector of length $n$ with values between 0 and $N_2 < n$ representing the second classification. Supported types: integer, numeric, or factor. Avoid character vectors for better performance. Must not be a list.

spMat

Logical. If TRUE, the contingency table is also returned as a sparse matrix. Note: building the sparse matrix may cost more than the pair sorting itself. Default is FALSE.

Value

A list containing the following elements:

  • spMat: A sparsely encoded contingency matrix (only if spMat = TRUE).

  • levels: A list containing the retained levels for each classification.

  • nij: A vector of the strictly positive pair counts, i.e. the nonzero entries of the contingency table.

  • ni., n.j: Vectors of class counts for c1 and c2, respectively.

  • pair_c1, pair_c2: Integer vectors specifying the classes in c1 and c2 corresponding to the counts in nij. These provide the row and column indices for the contingency matrix.
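
To make the relationship between these elements concrete, the following base-R sketch derives analogous quantities from a dense contingency table. It only illustrates what the elements represent; it is not how sort_pairs computes them. The toy data and the name `illustration` are hypothetical, and the sketch uses R's 1-based positions, which may differ from the index convention that sort_pairs uses for pair_c1 and pair_c2.

```r
# Two toy classifications of the same 6 observations (hypothetical data)
c1 <- c(1L, 1L, 2L, 2L, 3L, 3L)
c2 <- factor(c("a", "a", "a", "b", "b", "b"))

# Dense contingency table built with base R
tab <- table(c1, c2)

# Linear positions of the nonzero cells (column-major order)
nz <- which(tab > 0)

illustration <- list(
  nij     = as.vector(tab[nz]),       # nonzero pair counts
  ni.     = as.vector(rowSums(tab)),  # class counts for c1
  n.j     = as.vector(colSums(tab)),  # class counts for c2
  pair_c1 = row(tab)[nz],             # row position of each nonzero cell
  pair_c2 = col(tab)[nz]              # column position of each nonzero cell
)

# Every observation belongs to exactly one (c1, c2) pair
stopifnot(sum(illustration$nij) == length(c1))
```

Only the nonzero cells are stored, so the three vectors nij, pair_c1 and pair_c2 have the same length and together describe the contingency table in coordinate form.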

Details

Pair sorting, which is at the heart of computing all clustering comparison measures, has been carefully optimized. As a result, even basic R operations (checking for NAs, converting types, or constructing a sparse contingency matrix as output) have a non-negligible cost compared to the pair sorting itself. For optimal performance, provide both vectors as integers or factors without any NAs.
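
One way to follow this advice is to clean and convert the labels once, before any comparison is made. The sketch below is a minimal example under stated assumptions: the vectors `raw1` and `raw2` are hypothetical, and dropping observations with a missing label is just one possible policy for handling NAs.

```r
# Hypothetical raw labels: one character vector with an NA, one numeric vector
raw1 <- c("x", "y", NA, "x", "z")
raw2 <- c(1, 2, 2, 1, 3)

# Keep only observations labelled in both classifications
keep <- !(is.na(raw1) | is.na(raw2))

# Convert once: character -> factor, numeric -> integer
c1 <- factor(raw1[keep])
c2 <- as.integer(raw2[keep])

# Sanity checks before handing the vectors over
stopifnot(!anyNA(c1), !anyNA(c2), length(c1) == length(c2))

# c1 and c2 are now in a form suitable for sort_pairs(c1, c2)
```

Doing the conversion up front means its cost is paid once, rather than each time the same classifications are compared.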

Examples

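data(iris)
# Cut a hierarchical clustering of the four numeric columns into 4 groups
cl <- cutree(hclust(dist(iris[, -5])), 4)
# Sort the pairs formed by the clustering and the known species labels
out <- sort_pairs(cl, iris$Species)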