Background

uccdf targets a specific failure mode in practical clustering: analysts often have a mixed-type table and can produce several plausible clusterings, but lack a compact way to decide whether a discovered structure is both stable and stronger than a simple null baseline.

The package therefore does not implement a single clustering algorithm. It implements a small typed consensus workflow with explicit null calibration.

Objective

The goal of the design is to answer two separate questions:

  1. Is there evidence for non-trivial cluster structure?
  2. Conditional on that, which K is the most stable supported solution?

That separation is deliberate. Earlier versions could collapse every example to K = 1 because the Monte Carlo null sample was too small to make per-K p-values meaningful. The current design treats global detection and K selection as distinct stages.

Architecture

uccdf is organized into four layers:

  1. schema inference
  2. representation building
  3. ensemble consensus
  4. null-calibrated selection

Each layer is exposed through a corresponding set of public functions in the package API.

Mathematical sketch

Let the input table be

$$X = (x_{ij})_{1 \le i \le n,\ 1 \le j \le p}$$

with row index $i$ and column index $j$. Each active column has a coarse type

$$\tau_j \in \{\text{continuous}, \text{binary}, \text{nominal}, \text{ordinal}\}.$$

Schema layer

The inferred schema stores:

  • column type $\tau_j$
  • role $q_j \in \{\text{id}, \text{active}, \text{exclude}\}$
  • initial weight $\alpha_j$
  • missingness and uniqueness summaries

This keeps the workflow explicit about which columns participate in clustering.
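As a rough illustration of what type inference can look like, here is a minimal rule-of-thumb sketch for a single column. The rules and thresholds are assumptions for exposition, not uccdf's actual heuristics.

```r
# Hypothetical sketch of coarse type inference for one column; the
# ordering of rules and the cardinality threshold are assumptions.
infer_type <- function(x) {
  u <- length(unique(x[!is.na(x)]))  # number of distinct observed values
  if (is.ordered(x)) {
    "ordinal"
  } else if (u <= 2) {
    "binary"
  } else if (is.numeric(x)) {
    "continuous"
  } else {
    "nominal"
  }
}

infer_type(c(0, 1, 1, 0))             # "binary"
infer_type(c(1.5, 2.3, 9.0))          # "continuous"
infer_type(factor(c("a", "b", "c")))  # "nominal"
```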

Representation layer

Two views are built.

Mixed-distance view

For each active column we define a local distance $d_j$. For example:

$$d_j(x, x') = \frac{|x - x'|}{s_j + \varepsilon}$$

for continuous columns (with $s_j$ a per-column scale and $\varepsilon$ a small stabilizing constant), and

$$d_j(x, x') = \mathbf{1}(x \ne x')$$

for nominal or binary columns.

These are aggregated into a Gower-like distance:

$$D_{ii'} = \frac{\sum_j \alpha_j \delta_{ii'j}\, d_j(x_{ij}, x_{i'j})}{\sum_j \alpha_j \delta_{ii'j} + \varepsilon},$$

where $\delta_{ii'j} = 1$ when both $x_{ij}$ and $x_{i'j}$ are observed and $0$ otherwise.
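For a single row pair, the aggregation reduces to a weighted mean over available columns. A minimal sketch, assuming precomputed per-column distances `d`, weights `alpha`, and availability indicators `delta` (0 when either value is missing):

```r
# Sketch of the Gower-like aggregation for one row pair; names are
# illustrative, not uccdf internals.
gower_pair <- function(d, alpha, delta, eps = 1e-8) {
  sum(alpha * delta * d) / (sum(alpha * delta) + eps)
}

# continuous distance 0.5, nominal mismatch 1, third column missing:
gower_pair(d = c(0.5, 1, 0), alpha = c(1, 1, 1), delta = c(1, 1, 0))
```

The missing third column drops out of both numerator and denominator, so the pair is compared only on what is observed.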

Mixed-latent view

Continuous columns are scaled, categorical columns are expanded, ordinal columns are rank-coded, and the resulting matrix is projected into a low-dimensional embedding with PCA-like machinery.

This gives a second representation:

$$Z \in \mathbb{R}^{n \times r}.$$
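The pipeline can be sketched with base R under the stated assumptions: scale the numeric columns, one-hot expand the factors, then project with `prcomp()` (the package's actual embedding machinery may differ).

```r
# Sketch of the mixed-latent view: scale, expand, project.
df <- data.frame(x = c(1, 2, 3, 10), g = factor(c("a", "a", "b", "b")))
Xnum <- scale(df$x)                          # standardized continuous part
Xcat <- model.matrix(~ g - 1, df)            # one-hot expansion of g
Z <- prcomp(cbind(Xnum, Xcat), rank. = 2)$x  # n x r embedding, r = 2
dim(Z)  # 4 2
```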

Ensemble layer

For each resample $b$, view $v$, learner $a$, and candidate cluster count $K$, the package computes a partition

$$\pi_{bvaK}.$$

The current default learners are:

  • PAM on the distance view
  • hierarchical clustering on the distance view
  • k-means on the latent view

Each run contributes a co-membership matrix

$$M_{bvaK}(i, j) = \mathbf{1}\{\pi_{bvaK}(i) = \pi_{bvaK}(j)\}.$$

These are averaged into a consensus matrix $C_K$.
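The co-membership and averaging steps are simple to state in code. A toy sketch with three hand-written partitions standing in for ensemble runs:

```r
# Co-membership of one partition, consensus as the mean over runs.
co_membership <- function(part) outer(part, part, `==`) * 1
runs <- list(c(1, 1, 2, 2), c(1, 1, 2, 2), c(1, 2, 2, 2))
C <- Reduce(`+`, lapply(runs, co_membership)) / length(runs)
C[1, 2]  # 2/3: rows 1 and 2 co-cluster in two of three runs
```

Entries near 0 or 1 indicate pairs the ensemble agrees on; intermediate values mark boundary uncertainty.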

Stability score

Pairwise consensus uncertainty is summarized with binary entropy:

$$h(p) = -p\log p - (1-p)\log(1-p).$$

The package reports

$$S_K = 1 - \frac{1}{\log 2}\,\operatorname{mean}_{i<j} h(C_K(i,j)).$$

A high $S_K$ means low pairwise ambiguity and therefore a more stable consensus structure.
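The score can be sketched directly from a consensus matrix. This is a minimal implementation of the formula above, not uccdf's internal code:

```r
# Entropy-based stability score on the upper triangle of C, normalized
# so a fully decided consensus matrix scores 1.
binary_entropy <- function(p) {
  out <- -p * log(p) - (1 - p) * log(1 - p)
  out[p %in% c(0, 1)] <- 0  # h(0) = h(1) = 0 by convention
  out
}
stability <- function(C) 1 - mean(binary_entropy(C[upper.tri(C)])) / log(2)

C_crisp <- matrix(c(1, 1, 0,  1, 1, 0,  0, 0, 1), nrow = 3)
stability(C_crisp)  # 1: every pair is unambiguous
```

A maximally ambiguous matrix (all off-diagonal entries 0.5) scores 0, since $h(0.5) = \log 2$.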

Null calibration

Null tables are generated by column-wise permutation of observed values within each active column, preserving the marginal distribution while breaking cross-column structure.
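The permutation scheme is one line of base R per column. A sketch, with a seed for reproducibility:

```r
# Column-wise permutation null: shuffle each column independently,
# preserving marginals while breaking cross-column structure.
permute_columns <- function(df) as.data.frame(lapply(df, sample))

set.seed(42)
df0 <- data.frame(x = 1:6, g = factor(rep(c("a", "b"), 3)))
df_null <- permute_columns(df0)
sort(df_null$x)  # 1 2 3 4 5 6 -- the marginal of x is unchanged
```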

For each null replicate $r$, the same pipeline produces $S_K^{(0,r)}$. From those replicates we compute:

$$\mu_{0,K} = \operatorname{mean}_r S_K^{(0,r)}, \qquad \sigma_{0,K} = \operatorname{sd}_r S_K^{(0,r)}.$$

We then define:

$$\Delta_K = S_K - \mu_{0,K}, \qquad Z_K = \frac{\Delta_K}{\sigma_{0,K} + \varepsilon}.$$

The Monte Carlo p-value is

$$p_K = \frac{1 + \sum_r \mathbf{1}\{S_K^{(0,r)} \ge S_K\}}{R_0 + 1},$$

where $R_0$ is the number of null replicates.
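All of these calibration quantities follow mechanically from the observed score and the null replicates. A sketch for a single K, with illustrative values:

```r
# Calibration for one K from observed S_K and null replicates.
s_obs  <- 0.82
s_null <- c(0.55, 0.61, 0.58, 0.60, 0.57)
mu0    <- mean(s_null)                 # null mean
delta  <- s_obs - mu0                  # Delta_K
z      <- delta / (sd(s_null) + 1e-8)  # Z_K with eps guard
p_k    <- (1 + sum(s_null >= s_obs)) / (length(s_null) + 1)
p_k  # (1 + 0) / 6
```

The add-one correction keeps the p-value strictly positive even when no null replicate reaches the observed score.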

Selection logic

The final decision is two-stage.

Stage 1: global detection

Let

$$T_{\text{obs}} = \max_K \Delta_K.$$

For each null replicate we compute the analogous maximum over candidate K, which yields a global null distribution. This gives a global p-value for the statement “the table has no stronger clustering signal than the null baseline”.
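Concretely, the same max-over-K statistic is computed on each null replicate and compared against $T_{\text{obs}}$. A sketch with illustrative numbers:

```r
# Global detection: calibrate T_obs against per-replicate maxima.
delta_obs  <- c(0.20, 0.14, 0.10)  # observed Delta_K for K = 2, 3, 4
delta_null <- rbind(               # one row of Delta_K per null replicate
  c(-0.05,  0.02, -0.05),
  c( 0.10, -0.03, -0.03),
  c( 0.02,  0.01,  0.06)
)
T_obs    <- max(delta_obs)
T_null   <- apply(delta_null, 1, max)  # max over K within each replicate
p_global <- (1 + sum(T_null >= T_obs)) / (length(T_null) + 1)
p_global  # 0.25
```

Taking the maximum inside each replicate, rather than testing each K separately, accounts for selecting the best K over the candidate grid.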

If that null is not rejected, the package reports:

  • selected_k = 1
  • cluster = 1
  • confidence = NA
  • exploratory_cluster as the strongest unsupported split

Stage 2: supported K selection

If the global null is rejected, the package selects among supported K using

$$J_K = Z_K - \lambda \log K - \gamma \sum_{g=1}^{K} \mathbf{1}\{|G_g| < n_{\min}\},$$

where $G_g$ is cluster $g$, $\lambda$ penalizes larger $K$, and $\gamma$ penalizes any cluster with fewer than $n_{\min}$ rows.

This keeps the chosen solution stable, null-improving, and resistant to very small accidental clusters.
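A sketch of the score as a function; the values of `lambda`, `gamma`, and `n_min` here are illustrative tuning choices, not uccdf defaults:

```r
# Stage-2 selection score J_K for one candidate solution.
jk_score <- function(z, K, sizes, lambda = 0.5, gamma = 1, n_min = 5) {
  z - lambda * log(K) - gamma * sum(sizes < n_min)
}

jk_score(z = 3.0, K = 2, sizes = c(40, 38))     # no penalty applies
jk_score(z = 3.2, K = 3, sizes = c(40, 35, 3))  # tiny cluster penalized
```

Even with a slightly higher $Z_K$, the three-cluster solution loses here because one cluster falls below $n_{\min}$, which is exactly the intended behavior.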

Output design

The most important practical design choice is that augment() now returns both selected and exploratory assignments.

This avoids a common reporting failure:

  • claiming a supported multi-cluster solution when there is none
  • or hiding a plausible unsupported split that the analyst may still want to inspect

Interpretation guidance

In practice, a fit should be read in this order:

  1. fit$selection$detected_structure
  2. select_k(fit)
  3. augment(fit)
  4. plot_embedding(fit)
  5. plot_consensus_heatmap(fit)

If those all point in the same direction, the solution is usually easy to defend. If they disagree, the exploratory columns and the heatmap are often the fastest way to see whether the issue is weak signal, boundary uncertainty, or an overly large candidate K.