Background

uccdf targets a specific failure mode in practical clustering: analysts often have a mixed-type table and can produce several plausible clusterings, but lack a compact way to decide whether a discovered structure is both stable and stronger than a simple null baseline.

The package therefore does not implement a single clustering algorithm. It implements a small typed consensus workflow with explicit null calibration.

Objective

The goal of the design is to answer two separate questions:

  1. Is there evidence for non-trivial cluster structure?
  2. Conditional on that, which K is the most stable supported solution?

That separation is deliberate. Earlier versions could collapse every example to K = 1 because the Monte Carlo null sample was too small to make per-K p-values meaningful. The current design treats global detection and K selection as distinct stages.

Architecture

uccdf is organized into four layers:

  1. schema inference
  2. representation building
  3. ensemble consensus
  4. null-calibrated selection

Each layer is exposed through a corresponding set of public functions in the package API.

Mathematical sketch

Let the input table be

$$X = (x_{ij})_{1 \le i \le n,\ 1 \le j \le p}$$

with row index $i$ and column index $j$. Each active column has a coarse type

$$\tau_j \in \{\text{continuous}, \text{binary}, \text{nominal}, \text{ordinal}\}.$$

Schema layer

The inferred schema stores:

  • column type $\tau_j$
  • role $q_j \in \{\text{id}, \text{active}, \text{exclude}\}$
  • initial weight $\alpha_j$
  • missingness and uniqueness summaries

This keeps the workflow explicit about which columns participate in clustering.
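As a rough illustration of what type inference can look like, here is a minimal rule-of-thumb sketch for a single column. The rules and thresholds are assumptions for exposition, not uccdf's actual heuristics.

```r
# Hypothetical sketch of coarse type inference for one column; the
# ordering of rules and the cardinality threshold are assumptions.
infer_type <- function(x) {
  u <- length(unique(x[!is.na(x)]))  # number of distinct observed values
  if (is.ordered(x)) {
    "ordinal"
  } else if (u <= 2) {
    "binary"
  } else if (is.numeric(x)) {
    "continuous"
  } else {
    "nominal"
  }
}

infer_type(c(0, 1, 1, 0))             # "binary"
infer_type(c(1.5, 2.3, 9.0))          # "continuous"
infer_type(factor(c("a", "b", "c")))  # "nominal"
```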

Representation layer

Two views are built.

Mixed-distance view

For each active column we define a local distance $d_j$. For example:

$$d_j(x, x') = \frac{|x - x'|}{s_j + \varepsilon}$$

for continuous columns (with $s_j$ a per-column scale and $\varepsilon$ a small stabilizing constant), and

$$d_j(x, x') = \mathbf{1}(x \ne x')$$

for nominal or binary columns.

These are aggregated into a Gower-like distance:

$$D_{ii'} = \frac{\sum_j \alpha_j \delta_{ii'j}\, d_j(x_{ij}, x_{i'j})}{\sum_j \alpha_j \delta_{ii'j} + \varepsilon},$$

where $\delta_{ii'j} = 1$ when both $x_{ij}$ and $x_{i'j}$ are observed and $0$ otherwise.
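For a single row pair, the aggregation reduces to a weighted mean over available columns. A minimal sketch, assuming precomputed per-column distances `d`, weights `alpha`, and availability indicators `delta` (0 when either value is missing):

```r
# Sketch of the Gower-like aggregation for one row pair; names are
# illustrative, not uccdf internals.
gower_pair <- function(d, alpha, delta, eps = 1e-8) {
  sum(alpha * delta * d) / (sum(alpha * delta) + eps)
}

# continuous distance 0.5, nominal mismatch 1, third column missing:
gower_pair(d = c(0.5, 1, 0), alpha = c(1, 1, 1), delta = c(1, 1, 0))
```

The missing third column drops out of both numerator and denominator, so the pair is compared only on what is observed.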

Mixed-latent view

Continuous columns are scaled, categorical columns are expanded, ordinal columns are rank-coded, and the resulting matrix is projected into a low-dimensional embedding with PCA-like machinery.

This gives a second representation:

$$Z \in \mathbb{R}^{n \times r}.$$
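The pipeline can be sketched with base R under the stated assumptions: scale the numeric columns, one-hot expand the factors, then project with `prcomp()` (the package's actual embedding machinery may differ).

```r
# Sketch of the mixed-latent view: scale, expand, project.
df <- data.frame(x = c(1, 2, 3, 10), g = factor(c("a", "a", "b", "b")))
Xnum <- scale(df$x)                          # standardized continuous part
Xcat <- model.matrix(~ g - 1, df)            # one-hot expansion of g
Z <- prcomp(cbind(Xnum, Xcat), rank. = 2)$x  # n x r embedding, r = 2
dim(Z)  # 4 2
```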

Ensemble layer

For each resample $b$, view $v$, learner $a$, and candidate cluster count $K$, the package computes a partition

$$\pi_{bvaK}.$$

The current default learners are:

  • PAM on the distance view
  • hierarchical clustering on the distance view
  • k-means on the latent view

Each run contributes a co-membership matrix

$$M_{bvaK}(i, j) = \mathbf{1}\{\pi_{bvaK}(i) = \pi_{bvaK}(j)\}.$$

These are averaged into a consensus matrix $C_K$.
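The co-membership and averaging steps are simple to state in code. A toy sketch with three hand-written partitions standing in for ensemble runs:

```r
# Co-membership of one partition, consensus as the mean over runs.
co_membership <- function(part) outer(part, part, `==`) * 1
runs <- list(c(1, 1, 2, 2), c(1, 1, 2, 2), c(1, 2, 2, 2))
C <- Reduce(`+`, lapply(runs, co_membership)) / length(runs)
C[1, 2]  # 2/3: rows 1 and 2 co-cluster in two of three runs
```

Entries near 0 or 1 indicate pairs the ensemble agrees on; intermediate values mark boundary uncertainty.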

Stability score

Pairwise consensus uncertainty is summarized with binary entropy:

$$h(p) = -p\log p - (1-p)\log(1-p).$$

The package reports

$$S_K = 1 - \frac{1}{\log 2}\,\operatorname{mean}_{i<j} h(C_K(i,j)).$$

A high $S_K$ means low pairwise ambiguity and therefore a more stable consensus structure.
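The score can be sketched directly from a consensus matrix. This is a minimal implementation of the formula above, not uccdf's internal code:

```r
# Entropy-based stability score on the upper triangle of C, normalized
# so a fully decided consensus matrix scores 1.
binary_entropy <- function(p) {
  out <- -p * log(p) - (1 - p) * log(1 - p)
  out[p %in% c(0, 1)] <- 0  # h(0) = h(1) = 0 by convention
  out
}
stability <- function(C) 1 - mean(binary_entropy(C[upper.tri(C)])) / log(2)

C_crisp <- matrix(c(1, 1, 0,  1, 1, 0,  0, 0, 1), nrow = 3)
stability(C_crisp)  # 1: every pair is unambiguous
```

A maximally ambiguous matrix (all off-diagonal entries 0.5) scores 0, since $h(0.5) = \log 2$.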

Null calibration

Null tables are generated by column-wise permutation of observed values within each active column, preserving the marginal distribution while breaking cross-column structure.
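The permutation scheme is one line of base R per column. A sketch, with a seed for reproducibility:

```r
# Column-wise permutation null: shuffle each column independently,
# preserving marginals while breaking cross-column structure.
permute_columns <- function(df) as.data.frame(lapply(df, sample))

set.seed(42)
df0 <- data.frame(x = 1:6, g = factor(rep(c("a", "b"), 3)))
df_null <- permute_columns(df0)
sort(df_null$x)  # 1 2 3 4 5 6 -- the marginal of x is unchanged
```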

For each null replicate $r$, the same pipeline produces $S_K^{(0,r)}$. From those replicates we compute:

$$\mu_{0,K} = \operatorname{mean}_r S_K^{(0,r)}, \qquad \sigma_{0,K} = \operatorname{sd}_r S_K^{(0,r)}.$$

We then define:

$$\Delta_K = S_K - \mu_{0,K}, \qquad Z_K = \frac{\Delta_K}{\sigma_{0,K} + \varepsilon}.$$

The Monte Carlo p-value is

$$p_K = \frac{1 + \sum_r \mathbf{1}\{S_K^{(0,r)} \ge S_K\}}{R_0 + 1},$$

where $R_0$ is the number of null replicates.
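All of these calibration quantities follow mechanically from the observed score and the null replicates. A sketch for a single K, with illustrative values:

```r
# Calibration for one K from observed S_K and null replicates.
s_obs  <- 0.82
s_null <- c(0.55, 0.61, 0.58, 0.60, 0.57)
mu0    <- mean(s_null)                 # null mean
delta  <- s_obs - mu0                  # Delta_K
z      <- delta / (sd(s_null) + 1e-8)  # Z_K with eps guard
p_k    <- (1 + sum(s_null >= s_obs)) / (length(s_null) + 1)
p_k  # (1 + 0) / 6
```

The add-one correction keeps the p-value strictly positive even when no null replicate reaches the observed score.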

Selection logic

The final decision is two-stage.

Stage 1: global detection

Let

$$T_{\text{obs}} = \max_K \Delta_K.$$

For each null replicate we compute the analogous maximum over candidate K, which yields a global null distribution. This gives a global p-value for the statement “the table has no stronger clustering signal than the null baseline”.
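Concretely, the same max-over-K statistic is computed on each null replicate and compared against $T_{\text{obs}}$. A sketch with illustrative numbers:

```r
# Global detection: calibrate T_obs against per-replicate maxima.
delta_obs  <- c(0.20, 0.14, 0.10)  # observed Delta_K for K = 2, 3, 4
delta_null <- rbind(               # one row of Delta_K per null replicate
  c(-0.05,  0.02, -0.05),
  c( 0.10, -0.03, -0.03),
  c( 0.02,  0.01,  0.06)
)
T_obs    <- max(delta_obs)
T_null   <- apply(delta_null, 1, max)  # max over K within each replicate
p_global <- (1 + sum(T_null >= T_obs)) / (length(T_null) + 1)
p_global  # 0.25
```

Taking the maximum inside each replicate, rather than testing each K separately, accounts for selecting the best K over the candidate grid.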

If that null is not rejected, the package reports:

  • selected_k = 1
  • cluster = 1
  • confidence = NA
  • exploratory_cluster as the strongest unsupported split

Stage 2: supported K selection

If the global null is rejected, the package selects among supported K using

$$J_K = Z_K - \lambda \log K - \gamma \sum_{g=1}^{K} \mathbf{1}\{|G_g| < n_{\min}\},$$

where $G_g$ is cluster $g$, $\lambda$ penalizes larger $K$, and $\gamma$ penalizes any cluster with fewer than $n_{\min}$ rows.

This keeps the chosen solution stable, null-improving, and resistant to very small accidental clusters.
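A sketch of the score as a function; the values of `lambda`, `gamma`, and `n_min` here are illustrative tuning choices, not uccdf defaults:

```r
# Stage-2 selection score J_K for one candidate solution.
jk_score <- function(z, K, sizes, lambda = 0.5, gamma = 1, n_min = 5) {
  z - lambda * log(K) - gamma * sum(sizes < n_min)
}

jk_score(z = 3.0, K = 2, sizes = c(40, 38))     # no penalty applies
jk_score(z = 3.2, K = 3, sizes = c(40, 35, 3))  # tiny cluster penalized
```

Even with a slightly higher $Z_K$, the three-cluster solution loses here because one cluster falls below $n_{\min}$, which is exactly the intended behavior.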

Output design

The most important practical design choice is that augment() now returns both selected and exploratory assignments.

This avoids a common reporting failure:

  • claiming a supported multi-cluster solution when there is none
  • or hiding a plausible unsupported split that the analyst may still want to inspect

Interpretation guidance

In practice, a fit should be read in this order:

  1. fit$selection$detected_structure
  2. select_k(fit)
  3. augment(fit)
  4. plot_embedding(fit)
  5. plot_consensus_heatmap(fit)

If those all point in the same direction, the solution is usually easy to defend. If they disagree, the exploratory columns and the heatmap are often the fastest way to see whether the issue is weak signal, boundary uncertainty, or an overly large candidate K.