Background
uccdf is designed for a very specific failure mode in
practical clustering: analysts often have a mixed table, can produce
several plausible clusterings, but lack a compact way to decide whether
a discovered structure is both stable and stronger than a simple null
baseline.
The package therefore does not implement a single clustering algorithm. It implements a small typed consensus workflow with explicit null calibration.
Objective
The goal of the design is to answer two separate questions:
- is there evidence for non-trivial cluster structure?
- conditional on that, which K is the most stable supported solution?
That separation is deliberate. Earlier versions could collapse every
example to K = 1 because the Monte Carlo null sample was
too small to make per-K p-values meaningful. The current
design treats global detection and K selection as distinct
stages.
Architecture
uccdf is organized into four layers:
- schema inference
- representation building
- ensemble consensus
- null-calibrated selection
Each layer is exposed through corresponding public functions.
Mathematical sketch
Let the input table be $X$, with row index $i = 1, \dots, n$ and column index $j = 1, \dots, p$. Each active column $j$ has a coarse type $t_j$ inferred by the schema layer.
Schema layer
The inferred schema stores:
- column type
- role
- initial weight
- missingness and uniqueness summaries
This keeps the workflow explicit about which columns participate in clustering.
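The kind of per-column record the schema stores can be sketched as follows. This is an illustrative Python sketch, not the package's own implementation; the type heuristic, the `role` rule, and all names are assumptions for demonstration.

```python
def infer_schema(table: dict) -> dict:
    """Coarse per-column typing plus missingness/uniqueness summaries.

    `table` maps column name -> list of values (None = missing).
    The numeric-vs-categorical test and the id-column rule here are
    simplified placeholders for whatever heuristics the package uses.
    """
    schema = {}
    for name, values in table.items():
        non_missing = [v for v in values if v is not None]
        numeric = all(isinstance(v, (int, float)) for v in non_missing)
        n_unique = len(set(non_missing))
        # Fully unique non-numeric columns look like identifiers and
        # should not participate in clustering.
        role = "id" if (not numeric and n_unique == len(non_missing)) else "active"
        schema[name] = {
            "type": "numeric" if numeric else "categorical",
            "role": role,
            "weight": 1.0,  # initial weight, before any reweighting
            "missing_rate": 1 - len(non_missing) / len(values),
            "n_unique": n_unique,
        }
    return schema
```

A record like this makes the "which columns participate" decision auditable: anything with `role != "active"` is excluded up front rather than silently dropped downstream.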
Representation layer
Two views are built from the active columns: a pairwise distance view over the mixed-type columns, and a latent numeric embedding view.
Ensemble layer
For each resample $b$, view $v$, learner $\ell$, and candidate cluster count $K$, the package computes a partition $\pi_{b,v,\ell,K}$ of the rows.
The current default learners are:
- PAM on the distance view
- hierarchical clustering on the distance view
- k-means on the latent view
Each run contributes a co-membership matrix $C^{(b,v,\ell,K)}$ with entries

$$C^{(b,v,\ell,K)}_{ij} = \mathbf{1}\{\pi_{b,v,\ell,K}(i) = \pi_{b,v,\ell,K}(j)\}.$$

These are averaged over runs into a consensus matrix $\bar{C}^{(K)}$, whose entry $\bar{C}^{(K)}_{ij}$ estimates how often rows $i$ and $j$ cluster together at that $K$.
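The co-membership-to-consensus step can be sketched in a few lines. This is a language-agnostic illustration in Python (the partitions here are made up; in the package each one would come from a resample/view/learner/K combination):

```python
import numpy as np

def co_membership(labels: np.ndarray) -> np.ndarray:
    """1 where two rows share a cluster label, else 0."""
    return (labels[:, None] == labels[None, :]).astype(float)

# Three hypothetical partitions of 4 rows from different runs.
runs = [
    np.array([0, 0, 1, 1]),
    np.array([0, 0, 1, 1]),
    np.array([0, 1, 1, 1]),
]

# Entry (i, j) of the consensus matrix is the fraction of runs
# in which rows i and j landed in the same cluster.
consensus = np.mean([co_membership(p) for p in runs], axis=0)
```

Entries near 0 or 1 indicate pairs the ensemble agrees on; entries near 0.5 are the ambiguous pairs the stability score penalizes.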
Stability score
Pairwise consensus uncertainty is summarized with binary entropy:

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p).$$

The package reports the stability score

$$S(K) = 1 - \frac{2}{n(n-1)} \sum_{i<j} H\big(\bar{C}^{(K)}_{ij}\big).$$

High $S(K)$ means low pairwise ambiguity and therefore a more stable consensus structure.
Null calibration
Null tables are generated by column-wise permutation of observed values within each active column, preserving the marginal distribution while breaking cross-column structure.
For each null replicate $r = 1, \dots, R$, the same pipeline produces a null stability score $S^{(r)}_{\mathrm{null}}(K)$. From those replicates we compute the null distribution of $S(K)$ for each candidate $K$.

We then define the stability gap

$$\Delta(K) = S(K) - \frac{1}{R} \sum_{r=1}^{R} S^{(r)}_{\mathrm{null}}(K).$$

The Monte Carlo p-value is

$$p(K) = \frac{1 + \#\{r : S^{(r)}_{\mathrm{null}}(K) \ge S(K)\}}{R + 1}.$$
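The permutation null and the add-one Monte Carlo p-value can be sketched as follows; this Python illustration only shows the mechanics, not the package's implementation, and the add-one convention is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def permute_columns(X: np.ndarray, rng) -> np.ndarray:
    """Shuffle each column independently: each column's marginal
    distribution is preserved, cross-column structure is destroyed."""
    return np.column_stack(
        [rng.permutation(X[:, j]) for j in range((X.shape[1]))]
    )

def mc_pvalue(observed: float, null_stats: np.ndarray) -> float:
    """Add-one Monte Carlo p-value: the observed table counts as
    one extra replicate, so the p-value is never exactly zero."""
    return (1 + np.sum(null_stats >= observed)) / (len(null_stats) + 1)
```

The add-one form is what bounds the smallest attainable p-value at 1/(R+1), which is why R must be large enough for per-K p-values to be meaningful.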
Selection logic
The final decision is two-stage.
Stage 1: global detection
Let

$$S_{\max} = \max_{K} S(K)$$

over the candidate set. For each null replicate we compute the analogous maximum over candidate K, which yields a global null distribution. This gives a global p-value

$$p_{\mathrm{global}} = \frac{1 + \#\{r : S^{(r)}_{\max} \ge S_{\max}\}}{R + 1}$$

for the statement "the table has no stronger clustering signal than the null baseline".
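The max-over-K comparison can be sketched directly; this is an illustrative Python snippet (the dict-based interface is an assumption, not the package's API):

```python
import numpy as np

def global_pvalue(stab_by_k: dict, null_stab_by_k: list) -> float:
    """stab_by_k: {K: S(K)} on the observed table.
    null_stab_by_k: one {K: S_null(K)} dict per null replicate.

    Taking the max over K on each null replicate, rather than
    comparing per-K, accounts for the selection over candidate K.
    """
    observed_max = max(stab_by_k.values())
    null_max = np.array([max(d.values()) for d in null_stab_by_k])
    return (1 + np.sum(null_max >= observed_max)) / (len(null_max) + 1)
```

Because each null replicate also gets to pick its best K, the observed maximum is compared against an honestly optimistic baseline.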
If that null is not rejected, the package reports:
- selected_k = 1
- cluster = 1
- confidence = NA
- exploratory_cluster as the strongest unsupported split
Output design
The most important practical design choice is that
augment() now returns both selected and exploratory
assignments.
This avoids a common reporting failure:
- claiming a supported multi-cluster solution when there is none
- or hiding a plausible unsupported split that the analyst may still want to inspect
Interpretation guidance
In practice, a fit should be read in this order:
1. fit$selection$detected_structure
2. select_k(fit)
3. augment(fit)
4. plot_embedding(fit)
5. plot_consensus_heatmap(fit)
If those all point in the same direction, the solution is usually
easy to defend. If they disagree, the exploratory columns and the
heatmap are often the fastest way to see whether the issue is weak
signal, boundary uncertainty, or an overly large candidate
K.