uccdf provides typed consensus clustering for structured mixed-type data frames.
The package is designed for the following workflow:
- validate a tabular input object
- infer a simple column schema
- build more than one clustering representation of the same table
- aggregate clustering runs across resamples and learners
- run a global null test for non-trivial cluster structure
- select the best supported K
- return row-level labels, confidence, ambiguity, and exploratory assignments
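The aggregation step in the workflow above can be illustrated with a small base R sketch. It is not uccdf's implementation: the numeric toy matrix, the use of kmeans as the only learner, and the 80% subsampling rate are all assumptions made for illustration. It shows how repeated clustering runs on resampled rows can be combined into a consensus matrix of co-clustering frequencies.

```r
set.seed(42)

# Two well-separated numeric groups; a stand-in for a real mixed-type table.
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
n <- nrow(x)
k <- 2
n_resamples <- 20

together <- matrix(0, n, n)  # times a pair landed in the same cluster
sampled  <- matrix(0, n, n)  # times a pair was drawn in the same resample

for (b in seq_len(n_resamples)) {
  # Assumed resampling scheme: 80% of rows without replacement.
  idx <- sample(n, size = floor(0.8 * n))
  labels <- kmeans(x[idx, , drop = FALSE], centers = k, nstart = 5)$cluster
  same <- outer(labels, labels, "==")
  together[idx, idx] <- together[idx, idx] + same
  sampled[idx, idx]  <- sampled[idx, idx] + 1
}

# Consensus = co-clustering frequency among resamples that contained both rows.
consensus <- ifelse(sampled > 0, together / sampled, NA_real_)
```

Entries near 1 mark pairs of rows that are almost always clustered together, and entries near 0 mark pairs that are almost always separated. Intermediate values indicate unstable assignments.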
Documentation website:
Current scope:
- continuous
- binary
- nominal
- ordinal
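As a rough illustration of what a simple column schema over these four types might look like, the following base R sketch classifies each column of a data frame. The helper infer_column_type is hypothetical and is not a uccdf function; the rules uccdf actually applies may differ.

```r
# Hypothetical helper, not part of uccdf: classify one column by its R type.
infer_column_type <- function(x) {
  if (is.ordered(x)) return("ordinal")
  n_levels <- length(unique(x[!is.na(x)]))
  if (is.logical(x) || n_levels == 2) return("binary")
  if (is.factor(x) || is.character(x)) return("nominal")
  if (is.numeric(x)) return("continuous")
  "unsupported"
}

# Made-up example table with one column of each type.
toy <- data.frame(
  age   = c(34.5, 51.2, 47.0, 29.8),
  sex   = factor(c("f", "m", "m", "f")),
  site  = factor(c("a", "b", "c", "a")),
  grade = factor(c("low", "high", "mid", "low"),
                 levels = c("low", "mid", "high"), ordered = TRUE)
)

schema <- vapply(toy, infer_column_type, character(1))
print(schema)
```

In this sketch, ordered factors take precedence, so a two-level ordered factor is reported as ordinal rather than binary.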
Installation
Install from GitHub with pak:
install.packages("pak")
pak::pak("dai540/uccdf")
Or with remotes:
install.packages("remotes")
remotes::install_github("dai540/uccdf")
Or from a source tarball:
install.packages("path/to/uccdf_0.1.0.tar.gz", repos = NULL, type = "source")
Minimal example
library(uccdf)
fit <- fit_uccdf(
toy_mixed_data,
id_column = "sample_id",
candidate_k = 1:4,
n_resamples = 20,
n_null = 99,
seed = 42
)
fit$selection
select_k(fit)
head(augment(fit))
plot(fit, type = "selection")
plot_embedding(fit, color_by = "selected")
plot_consensus_heatmap(fit)
The practical readout is:
- fit$selection for the global decision
- select_k(fit) for the per-K support table
- augment(fit) for row-level assignments
- plot_embedding(fit) for latent separation
- plot_consensus_heatmap(fit) for hierarchical agreement structure
Selection design
uccdf separates two decisions:
- whether the table shows evidence of non-trivial cluster structure
- which K is the best supported solution conditional on that detection
When the global null is not rejected:
- selected_k is 1
- cluster remains 1
- confidence and ambiguity are NA
- exploratory_cluster stores the strongest unsupported split
This avoids overstating unsupported multi-cluster solutions while still exposing structure that may be worth inspecting.
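The two-stage logic described above can be sketched in base R. This is an illustration under stated assumptions, not uccdf's code: the helpers null_p_value and gate_selection, the structure score, the simulated null scores, and the 0.05 threshold are all hypothetical. The p-value uses the standard Monte Carlo form with a +1 correction, so with n_null = 99 the smallest attainable value is 0.01.

```r
# Hypothetical helper: Monte Carlo p-value with the usual +1 correction.
null_p_value <- function(observed, null_scores) {
  (1 + sum(null_scores >= observed)) / (1 + length(null_scores))
}

# Hypothetical helper: report best_k only when the global null is rejected.
gate_selection <- function(observed, null_scores, best_k, alpha = 0.05) {
  p <- null_p_value(observed, null_scores)
  rejected <- p <= alpha
  list(
    p_value = p,
    selected_k = if (rejected) best_k else 1L,  # fall back to one cluster
    exploratory_k = best_k                      # kept for inspection either way
  )
}

set.seed(42)
null_scores <- runif(99, min = 0, max = 0.5)  # made-up null scores, n_null = 99

strong <- gate_selection(observed = 0.9, null_scores, best_k = 3L)
weak   <- gate_selection(observed = 0.2, null_scores, best_k = 3L)
```

Here strong keeps K = 3 because no null score reaches the observed value, while weak falls back to a single cluster but still records K = 3 as the exploratory split.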
Returned object
The main return value is a uccdf_fit object containing:
- inferred schema
- mixed-distance and mixed-latent views
- run-level metadata
- consensus matrices by candidate K
- null-score summaries
- selected and exploratory assignments
confidence is a consensus-derived assignment stability score. It is not a Bayesian posterior probability.
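To make the distinction concrete, here is one way a stability score could be derived from a consensus matrix. This is an assumed definition for illustration (mean consensus between a row and the other members of its assigned cluster); the exact formula used by uccdf is not stated here, and the matrix values are made up.

```r
# Hand-written consensus matrix for five rows (illustrative values only).
consensus <- matrix(
  c(1.0, 0.9, 0.8, 0.1, 0.4,
    0.9, 1.0, 0.9, 0.0, 0.5,
    0.8, 0.9, 1.0, 0.1, 0.5,
    0.1, 0.0, 0.1, 1.0, 0.6,
    0.4, 0.5, 0.5, 0.6, 1.0),
  nrow = 5, byrow = TRUE
)
cluster <- c(1, 1, 1, 2, 2)

# Hypothetical score: mean consensus between a row and its cluster-mates.
stability <- vapply(seq_along(cluster), function(i) {
  mates <- setdiff(which(cluster == cluster[i]), i)
  if (length(mates) == 0) return(NA_real_)  # singleton clusters have no mates
  mean(consensus[i, mates])
}, numeric(1))

print(round(stability, 2))
# 0.85 0.90 0.85 0.60 0.60
```

A score built this way summarizes how consistently a row co-clusters with its assigned group across runs. Nothing forces such scores to sum to one across clusters, which is one reason they should not be read as probabilities of membership.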
Built-in example datasets
Core example data:
- toy_mixed_data
Bundled real-data panels:
- all_gene_panel, derived from the Bioconductor ALL leukemia dataset
- airway_gene_panel, derived from the Bioconductor airway RNA-seq dataset
- bladder_gene_panel, derived from the Bioconductor bladderbatch dataset
- golub_gene_panel, derived from multtest::golub
- pima_biomarker_panel, derived from mlbench::PimaIndiansDiabetes2
Website structure
The pkgdown site is organized into:
- Get Started
- Reference
- Articles
The article collection includes:
- design and method notes
- comparison with existing consensus clustering toolkits
- real-data analyses covering clinical, biomarker, and omics examples
Real-data articles currently include:
- airquality
- CO2
- Indometh
- InsectSprays
- survey
- Cars93
- iris
- mtcars
- ChickWeight
- attitude
- USJudgeRatings
- golub
- ALL
- bladderEset
- PimaIndiansDiabetes2